• Skip to main content
  • Skip to primary sidebar
  • Home
  • About
    • Stephen Foskett
      • My Publications
        • Urban Forms in Suburbia: The Rise of the Edge City
      • Storage Magazine Columns
      • Whitepapers
      • Multimedia
      • Speaking Engagements
    • Services
    • Disclosures
  • Categories
    • Apple
    • Ask a Pack Rat
    • Computer History
    • Deals
    • Enterprise storage
    • Events
    • Personal
    • Photography
    • Terabyte home
    • Virtual Storage
  • Guides
    • The iPhone Exchange ActiveSync Guide
      • The iPhone Exchange ActiveSync Troubleshooting Guide
    • The iPad Exchange ActiveSync Guide
      • iPad Exchange ActiveSync Troubleshooting Guide
    • Toolbox
      • Power Over Ethernet Calculator
      • EMC Symmetrix WWN Calculator
      • EMC Symmetrix TimeFinder DOS Batch File
    • Linux Logical Volume Manager Walkthrough
  • Calendar

Stephen Foskett, Pack Rat

Understanding the accumulation of data

You are here: Home / Everything / Cool Google Spreadsheet XML/XPath Mojo

Cool Google Spreadsheet XML/XPath Mojo

July 2, 2010 By Stephen 8 Comments

Google Spreadsheet sure isn’t as responsive for a power user like me, but I love the ability to share information with others and cooperatively edit a workbook. It’s become our main tool for planning the Gestalt IT Tech Field Day events. I was thrilled to discover that Google’s spreadsheet supports the importXML tag, which allows it to automatically gather information from other web sites. Let’s take a look at how it works!

HTML == XML?

Most web users have heard of HTML, but XML is a much geekier thing. Like HTML, XML is a way of “marking up” a text document to provide hidden clues about the content and how to display it. In fact, modern HTML is (sort of) a subset or category of XML, and HTML documents (the kind you access with your web browser) can be accessed by many XML tools.

Both HTML and XML enclose text in tags. For example, a paragraph would be surrounded by <p> and </p> tags, while a table starts and ends with <table> and </table>. These can be nested within each other, and are commonly deeply nested indeed. Web and XML documents typically have many layers of <div> and <span> tags, for example, and include lists within lists as well.

This somewhat-organized mess creates a “path” through the document leading to pieces of information. This concept is called XPath. For example, a twitter profile page embeds your name deep in the <html><body> path under <div id=”side”>, <div id=”profile” class=”section profile-side”>, <address>, <ul class=”about vcard entry-author”>, <li>, and <span class=”fn”>.

Raw HTML is often organized like this, and this can work to our advantage. If a computer program wanted to pull the full name of a Twitter user out of their profile page, it could look for a <span> element with the class property set to “fn”. It could also look for the <address> tag and pull out some information about the user in the vcard microformat.

importXML

All this becomes very interesting indeed when considering the importXML function in Google Spreadsheet. It will parse a URL as an XML file automatically, and you can tell it to look in an XPath for data. This is very powerful indeed!

Twitter Followers

I think an example will make it clearer. Let’s look up the Twitter follower count of a user. We create a spreadsheet and enter a twitter username in cell A1. Then we put the following formula in cells B1 and C1 and the count magically appears!

B1:

=if(C1<>"",right(C1,len(C1)-10),"")

C1:

=if(A1<>"",importXML("http://mobile.twitter.com/"&A1,"//a[@ href='http://mobile.twitter.com/"&lower(A1)&"/followers']"),0)

This formula looks for a <a href> with the URL including the user and twitter followers, which is a unique HTML element in every mobile Twitter profile. This returns a string like “Followers:1234”, so we use another formula to strip that part out.

Updated 1/22/12 after Twitter screwed up the main page. Good thing the mobile site still works!

LinkedIn Connections

Let’s try something else. How would we pull the number of connections a person has in LinkedIn using importXML? Here’s a function!

=value(substitute(importXML(A3,"//dd[@class='overview-connections']/p/strong"),"500+","500"))

This is a little more complicated. We’re doing the same thing, taking a url from cell A3 that corresponds to the person’s public LinkedIn page, and outputting the content of the <dd class=”overview-connections”> tag. But we’re also using the SUBSTITUTE() function to take the plus sign off a “500+” response and converting it into a value for calculation.

Updated 1/22/12 for new LinkedIn Format

Alexa and Klout

Here are a few more examples. I bet you can follow along now.

Alexa traffic rank:

=value(importXML("http://www.alexa.com/search?q="&E3,"//div[@class='row']/span/a[@href][1]"))

Klout score:

=value(substitute(importXML("http://klout.com/"&C3,"//span[@class='value']"),"klout score",""))

Limitations in Google Spreadsheets

Before you go thinking you can run off and create awesome web applications like this, know that there are some serious limitations. First, Google limits the use of importXML to 50 per workbook. This means you can’t import from hundreds of sources in the same spreadsheet, or even in multiple sheets in the same workbook. Next, importXML is pretty opaque in everyday use. You have to do a lot of trial and error to get the XPath right, and it fails often with #N/A, breaking calculations.

But it’s still pretty useful when creating a spreadsheet that needs to pull in information from outside sources. You can grab all sorts of data this way, from current stock quotes to weather or sports metrics. You have the whole Internet at your disposal – let’s get creative!

You might also want to read these other posts...

  • What You See and What You Get When You Follow Me
  • Ranting and Raving About the 2018 iPad Pro
  • Electric Car Over the Internet: My Experience Buying…
  • Liberate Wi-Fi Smart Bulbs and Switches with Tasmota!
  • Powering Rabbits: The Mean Well LRS-350-12 Power Supply

Filed Under: Everything, Personal Tagged With: Alexa, Google, Google Apps, Google Spreadsheet, HTML, importXML, Klout, LinkedIn, social media, Twitter, XML, XPath

Primary Sidebar

This is our mission: To be the Daleks of God

Shriekback

Subscribe via Email

Subscribe via email and you will receive my latest blog posts in your inbox. No ads or spam, just the same great content you find on my site!
 New posts (daily)
 Where's Stephen? (weekly)

Download My Book


Download my free e-book:
Essential Enterprise Storage Concepts!

Recent Posts

Electric Car Over the Internet: My Experience Buying From Vroom

November 28, 2020

Powering Rabbits: The Mean Well LRS-350-12 Power Supply

October 18, 2020

Tortoise or Hare? Nvidia Jetson TK1

September 22, 2020

Running Rabbits: More About My Cloud NUCs

September 21, 2020

Introducing Rabbit: I Bought a Cloud!

September 10, 2020

Remove ROM To Use LSI SAS Cards in HPE Servers

August 23, 2020

Test Your Wi-Fi with iPerf for iOS

July 9, 2020

Liberate Wi-Fi Smart Bulbs and Switches with Tasmota!

May 29, 2020

What You See and What You Get When You Follow Me

May 28, 2019

GPS Time Rollover Failures Keep Happening (But They’re Almost Done)

April 6, 2019

Symbolic Links

    Featured Posts

    MacBook Users: Encrypt Your Drive with OS X FileVault! It’s Easy and Free!

    December 20, 2012

    Datacenter History: Through the Ages in Lego

    October 22, 2013

    What You See and What You Get When You Follow Me

    May 28, 2019

    The Fat Middle: Today’s Enterprise Storage Array

    August 31, 2014

    Why Big Disk Drives Require Data Integrity Checking

    December 19, 2014

    My Core i7 Macintosh SE

    May 25, 2017

    Generation 3 drobo: Fall In Love All Over Again

    April 9, 2015

    Scaling Storage At The Client

    November 25, 2013

    From LAN Manager and SMB to CIFS: The Evolution of Prehistoric PC Network Protocols

    March 22, 2012

    Introducing Rabbit: I Bought a Cloud!

    September 10, 2020

    Copyright © 2021 · Log in