Cool Google Spreadsheet XML/XPath Mojo

July 2, 2010 By Stephen 9 Comments

Google Spreadsheet sure isn’t as responsive for a power user like me, but I love the ability to share information with others and cooperatively edit a workbook. It’s become our main tool for planning the Gestalt IT Tech Field Day events. I was thrilled to discover that Google’s spreadsheet supports the importXML tag, which allows it to automatically gather information from other web sites. Let’s take a look at how it works!

HTML == XML?

Most web users have heard of HTML, but XML is a much geekier thing. Like HTML, XML is a way of “marking up” a text document to provide hidden clues about the content and how to display it. In fact, modern HTML is (sort of) a subset or category of XML, and HTML documents (the kind you access with your web browser) can be accessed by many XML tools.

Both HTML and XML enclose text in tags. For example, a paragraph would be surrounded by <p> and </p> tags, while a table starts and ends with <table> and </table>. These can be nested within each other, and are commonly deeply nested indeed. Web and XML documents typically have many layers of <div> and <span> tags, for example, and include lists within lists as well.

This somewhat-organized mess creates a “path” through the document leading to pieces of information. This concept is called XPath. For example, a twitter profile page embeds your name deep in the <html><body> path under <div id=”side”>, <div id=”profile” class=”section profile-side”>, <address>, <ul class=”about vcard entry-author”>, <li>, and <span class=”fn”>.

Raw HTML is often organized like this, and this can work to our advantage. If a computer program wanted to pull the full name of a Twitter user out of their profile page, it could look for a <span> element with the class property set to “fn”. It could also look for the <address> tag and pull out some information about the user in the vcard microformat.

importXML

All this becomes very interesting indeed when considering the importXML function in Google Spreadsheet. It will parse a URL as an XML file automatically, and you can tell it to look in an XPath for data. This is very powerful indeed!

Twitter Followers

I think an example will make it clearer. Let’s look up the Twitter follower count of a user. We create a spreadsheet and enter a twitter username in cell A1. Then we put the following formula in cells B1 and C1 and the count magically appears!

B1:
=if(C1<>"",right(C1,len(C1)-10),"")

C1:
=if(A1<>"",importXML("http://mobile.twitter.com/"&A1,"//a[@ href='http://mobile.twitter.com/"&lower(A1)&"/followers']"),0)

This formula looks for a <a href> with the URL including the user and twitter followers, which is a unique HTML element in every mobile Twitter profile. This returns a string like “Followers:1234”, so we use another formula to strip that part out.

Updated 1/22/12 after Twitter screwed up the main page. Good thing the mobile site still works!

LinkedIn Connections

Let’s try something else. How would we pull the number of connections a person has in LinkedIn using importXML? Here’s a function!

=value(substitute(importXML(A3,"//dd[@class='overview-connections']/p/strong"),"500+","500"))

This is a little more complicated. We’re doing the same thing, taking a url from cell A3 that corresponds to the person’s public LinkedIn page, and outputting the content of the <dd class=”overview-connections”> tag. But we’re also using the SUBSTITUTE() function to take the plus sign off a “500+” response and converting it into a value for calculation.

Updated 1/22/12 for new LinkedIn Format

Alexa and Klout

Here are a few more examples. I bet you can follow along now.

Alexa traffic rank:

=value(importXML("http://www.alexa.com/search?q="&E3,"//div[@class='row']/span/a[@href][1]"))

Klout score:

=value(substitute(importXML("http://klout.com/"&C3,"//span[@class='value']"),"klout score",""))

Limitations in Google Spreadsheets

Before you go thinking you can run off and create awesome web applications like this, know that there are some serious limitations. First, Google limits the use of importXML to 50 per workbook. This means you can’t import from hundreds of sources in the same spreadsheet, or even in multiple sheets in the same workbook. Next, importXML is pretty opaque in everyday use. You have to do a lot of trial and error to get the XPath right, and it fails often with #N/A, breaking calculations.

But it’s still pretty useful when creating a spreadsheet that needs to pull in information from outside sources. You can grab all sorts of data this way, from current stock quotes to weather or sports metrics. You have the whole Internet at your disposal – let’s get creative!

You might also want to read these other posts...

Comments

Kevin K says

November 3, 2010 at 2:03 am

Brilliant! I’ve been messing with xpath queries from google spreadsheets for pricing books for sale on Amazon for hours, and looking at your examples in this post made it happen for me in minutes.
ePalmetto says

December 20, 2010 at 12:13 am

brilliant…! i wonder if there is a way to importXML of a Ning group like this one: http://landsurveyorsunited.com/group/sokkiasupportgroup and if so, importing all the groups for 50 US states would be very useful on my network. any ideas?
El says

August 19, 2011 at 4:28 am

Great ! Helped me a lot for getting the basics with no problems.
Triode says

January 23, 2012 at 2:38 pm

My compilation of ImportXML recipes: http://www.seerinteractive.com/blog/importxml-cookbook

Thought it would be helpful to anyone who found this blog post.
iyas says

September 3, 2013 at 11:16 am

OK – so I’m late to the party on this post. But really useful – thank you so much.
LawFirmSearchEngine says

January 3, 2014 at 10:53 pm

Thank you for posting this! I was using your formula for importing Alexa data into Google Sheets, but it’s not working now. It worked when I checked it a couple months ago. Not sure why . . .
Ben Collins says

January 12, 2016 at 3:51 pm

I share your enthusiasm for Google’s IMPORT functions! They’re powerful and versatile but yes, they are somewhat unstable!

I’ve written a recent (Jan 2016) article for importing social media data into your Google spreadsheet with ImportXML, which might be helpful to folks reading here: http://www.benlcollins.com/spreadsheets/import-social-media-data/

GPS Time Rollover Failures Keep Happening (But They’re Almost Done)

This is week “1111111111” in the GPS system. Tomorrow morning it will roll over to week “0000000000”. How well will various systems handle this change? Not well, judging by what we’ve seen so far!

The 2018 iPad Pro is a Beast!

The third-generation iPad Pro is a great machine but also a bellwether of change at Apple. It will be very hard for the rest of the mobile and client computing industry to keep up with this kind of progress!

Storage Changes in VMware vSphere 5

July 16, 2011

Once again, VMware added a ton of new storage enhancements to vSphere. With storage rapidly becoming the limiting factor in scalability and performance of virtual machine environments, this is no surprise. Also not surprising is the fact that major features like Policy-Driven Storage and Storage DRS (along with SIOC) are exclusive to “Enterprise Plus” licenses.

Fasting to Mitigate Jet Lag: Surprise! It Works!

February 11, 2013

I’m a frequent traveler, and thus a frequently suffer from moderate jet lag. It’s just so hard to adjust to a new time zone! But I recently stumbled on a simple method many claim helps your internal clock re-calibrate to travel. After trying it out on my trip to Australia last week, I’m convinced it can help!

Generation 3 drobo: Fall In Love All Over Again

April 9, 2015

I remain a huge fan of drobo generally, and the third-generation drobo remains the best choice for home storage. It’s the perfect storage device for the long haul, and the performance improvements make it a no-brainer. Get one.

The Terrifying True Story Of Virtual Machine Mobility

December 22, 2011

Virtualization of server, network, and storage services illuminates the link between physical resources and functional applications. A running virtual machine can instantly move from one server, network adapter, HBA, or LUN to another. And when it happens, traditional components have no idea how to react.

ZFS Is the Best Filesystem (For Now…)

July 10, 2017

ZFS should have been great, but I kind of hate it: ZFS seems to be trapped in the past, before it was sidelined it as the cool storage project of choice; it’s inflexible; it lacks modern flash integration; and it’s not directly supported by most operating systems. But I put all my valuable data on ZFS because it simply offers the best level of data protection in a small office/home office (SOHO) environment. Here’s why.

What You See and What You Get When You Follow Me

May 28, 2019

Social media ought to be social, not just a broadcast platform. That’s my feeling at least. It’s been a while since I’ve ranted about “write-only” social media accounts, so I thought I might as well do it again. And at the same time, I thought I would update you on my promise to the people who read, follow, and interact with me online.

Why You Should Never Again Utter The Word, “CIFS”

February 16, 2012

CIFS is not the network storage protocol used by Microsoft Windows, and many other clients. The protocol used to share files over a LAN by the majority of personal computers is called SMB. I wish everyone in the industry would get that through their heads.

Scaling Storage Is Hard To Do

June 4, 2013

Data storage isn’t as easy as it sounds, especially at enterprise or cloud scale. It’s simple enough to read and write a bit of data, but much harder to build a system that scales to store petabytes. That’s why I’m keenly focused on a new wave of storage systems built from the ground up for scaling!