April 24, 2014

Cool Google Spreadsheet XML/XPath Mojo

Google Spreadsheet sure isn’t as responsive for a power user like me, but I love the ability to share information with others and cooperatively edit a workbook. It’s become our main tool for planning the Gestalt IT Tech Field Day events. I was thrilled to discover that Google’s spreadsheet supports the importXML tag, which allows it to automatically gather information from other web sites. Let’s take a look at how it works!

HTML == XML?

Most web users have heard of HTML, but XML is a much geekier thing. Like HTML, XML is a way of “marking up” a text document to provide hidden clues about the content and how to display it. In fact, modern HTML is (sort of) a subset or category of XML, and HTML documents (the kind you access with your web browser) can be accessed by many XML tools.

Both HTML and XML enclose text in tags. For example, a paragraph would be surrounded by <p> and </p> tags, while a table starts and ends with <table> and </table>. These can be nested within each other, and are commonly deeply nested indeed. Web and XML documents typically have many layers of <div> and <span> tags, for example, and include lists within lists as well.

This somewhat-organized mess creates a “path” through the document leading to pieces of information. This concept is called XPath. For example, a twitter profile page embeds your name deep in the <html><body> path under <div id=”side”>, <div id=”profile” class=”section profile-side”>, <address>, <ul class=”about vcard entry-author”>, <li>, and <span class=”fn”>.

Raw HTML is often organized like this, and this can work to our advantage. If a computer program wanted to pull the full name of a Twitter user out of their profile page, it could look for a <span> element with the class property set to “fn”. It could also look for the <address> tag and pull out some information about the user in the vcard microformat.

importXML

All this becomes very interesting indeed when considering the importXML function in Google Spreadsheet. It will parse a URL as an XML file automatically, and you can tell it to look in an XPath for data. This is very powerful indeed!

Twitter Followers

I think an example will make it clearer. Let’s look up the Twitter follower count of a user. We create a spreadsheet and enter a twitter username in cell A1. Then we put the following formula in cells B1 and C1 and the count magically appears!

B1:

=if(C1<>"",right(C1,len(C1)-10),"")

C1:

=if(A1<>"",importXML("http://mobile.twitter.com/"&A1,"//a[@ href='http://mobile.twitter.com/"&lower(A1)&"/followers']"),0)

This formula looks for a <a href> with the URL including the user and twitter followers, which is a unique HTML element in every mobile Twitter profile. This returns a string like “Followers:1234″, so we use another formula to strip that part out.

Updated 1/22/12 after Twitter screwed up the main page. Good thing the mobile site still works!

LinkedIn Connections

Let’s try something else. How would we pull the number of connections a person has in LinkedIn using importXML? Here’s a function!

=value(substitute(importXML(A3,"//dd[@class='overview-connections']/p/strong"),"500+","500"))

This is a little more complicated. We’re doing the same thing, taking a url from cell A3 that corresponds to the person’s public LinkedIn page, and outputting the content of the <dd class=”overview-connections”> tag. But we’re also using the SUBSTITUTE() function to take the plus sign off a “500+” response and converting it into a value for calculation.

Updated 1/22/12 for new LinkedIn Format

Alexa and Klout

Here are a few more examples. I bet you can follow along now.

Alexa traffic rank:

=value(importXML("http://www.alexa.com/search?q="&E3,"//div[@class='row']/span/a[@href][1]"))

Klout score:

=value(substitute(importXML("http://klout.com/"&C3,"//span[@class='value']"),"klout score",""))

Limitations in Google Spreadsheets

Before you go thinking you can run off and create awesome web applications like this, know that there are some serious limitations. First, Google limits the use of importXML to 50 per workbook. This means you can’t import from hundreds of sources in the same spreadsheet, or even in multiple sheets in the same workbook. Next, importXML is pretty opaque in everyday use. You have to do a lot of trial and error to get the XPath right, and it fails often with #N/A, breaking calculations.

But it’s still pretty useful when creating a spreadsheet that needs to pull in information from outside sources. You can grab all sorts of data this way, from current stock quotes to weather or sports metrics. You have the whole Internet at your disposal – let’s get creative!

  • Kevin K

    Brilliant! I’ve been messing with xpath queries from google spreadsheets for pricing books for sale on Amazon for hours, and looking at your examples in this post made it happen for me in minutes.

  • http://twitter.com/epalmetto ePalmetto

    brilliant…! i wonder if there is a way to importXML of a Ning group like this one: http://landsurveyorsunited.com/group/sokkiasupportgroup and if so, importing all the groups for 50 US states would be very useful on my network. any ideas?

  • El

    Great ! Helped me a lot for getting the basics with no problems.

  • Pingback: Web Scraping, Hacking und Monitoring mit Google Docs – Teil 1

  • http://www.freakyrivet.com/ iyas

    OK – so I’m late to the party on this post. But really useful – thank you so much.

  • http://LawFirmSearchEngine.com/ LawFirmSearchEngine

    Thank you for posting this! I was using your formula for importing Alexa data into Google Sheets, but it’s not working now. It worked when I checked it a couple months ago. Not sure why . . .