April 18, 2014

Google Spreadsheet Regular Expression Functions Are Insanely Useful

I’m a UNIX command line guy, so it shouldn’t come as a surprise that I love regular expressions. This simple language allows one to slice and dice text strings with abandon, greatly simplifying data extraction and replacement tasks. Although it’s been there since 2010, I was thrilled recently to discover that Google Spreadsheets now supports in-cell regular expressions! It’s too bad Microsoft never added this.

What Are Regular Expressions?

Regular expressions (often abbreviated as “regex”) are a way to search for text within a string. Regex implementations are found throughout text-based computing, from common UNIX editors and tools (ed and grep) to Internet-friendly software (Perl, Python, JavaScript, PHP) and even some modern apps (Skype!)

A regular expression is usually implemented inside a simple text command, with the forward slash (“/”) marking the start and end of the expression. Systems then execute a given command wherever the expression is matched.

The most basic expressions are simply some text to match exactly, but things really get powerful when metacharacters and classes are employed. For example, you could search for “abc” by typing “/abc”. But what if you wanted to find any three letters? In Perl, and many similar implementations, /[\w]{3}/ matches any three Unicode letters in a row. Whoa!

Trivia time: Did you know that the popular UNIX tool, grep, gets its name from a regular expression? g/re/p is the command to search by regular expression, “re“, in the editor, ed!

Some Basic Regular Expressions

There are many, many metacharacters, modifiers, and operators in most regular expression implementations. I’ve been using the common Perl set for over 20 years and still don’t have everything memorized! But it’s easy to learn the basics!

  • Any plain text string can be used for an exact match. Want to find “Stephen” in a file? Just use “Stephen” as your regular expression!
  • You can group characters in square brackets. For example, “[az]” matches “a” or “z”. The system will even assume common sequences, so “[a-z]” matches any letter in the alphabet from “a” through “z”.
  • A single period (“.”) matches any character, so “Gr…” matches “Grant”, “Grace”, or “Greta”.
  • The caret and dollar-sign are anchors for the beginning and end of line, respectively. So “^Gr[aecnt]{3}$” matches a line containing “Grant”, “Grace”, or “Greta”!
  • You can make a match optional with “?” or “*”, which match “none or one” or “any”, respectively. I use “.*” (matching any number of characters) and “.?” (matching none or one character) a lot!

There is much, much more you can do, but this ought to get you started!

Here’s a great Regex reference!

Regex Functions in Google Spreadsheets

With that out of the way, let’s move on to Google Spreadsheets!

As of mid-2010, Google added three Regular Expression functions to Google Docs (now part of Google Drive). These are not compatible with Microsoft Excel (which only supports regex in VBA) or Apple Numbers. But they’re awesome anyway!

  • =REGEXEXTRACT(“text”;”regex”) – This is my favorite. It allows you to extract a bit of a string from text using parenthesis. It’s replaced a whole horde of LEFT(), MID() and FIND() in my spreadsheets! =REGEXEXTRACT(A2,” \(K([A-Z]{3})\)$”) will fill a cell with three capital letters after “K” in parenthesis after a space in the last word of a string. Let’s see you do that with MID()! Note: Those forward slashes say “literally a parenthesis character”, as opposed to the parenthesis that marks what I want to extract.
  • =REGEXMATCH(“text”;”regex”) – Just like MATCH(), this uses a regular expression to locate the position of text in a cell.
  • =REGEXREPLACE(“text”,”regex”,”replacement”) – This is the inverse of REGEXEXTRACT(). It spits out the entire content but with the regular expression matched content replaced. =REGEXREPLACE(A2,”apple”,”orange”) will fill a cell with the same text as A2, but with the apple becoming an orange!
Note that Google’s documentation is terrible. You’ll need to spend a fair amount of time experimenting to learn these functions!

Stephen’s Stance

If you do any sort of spreadsheet work, you’ll find these functions incredibly useful. I know regular expressions can be intimidating, and I hate that they’re not compatible outside Google Spreadsheets, but it’s worth learning these regex functions!

  • http://twitter.com/quanghoc H.a.w.k P.h.i.l

    Dude, you are out of touch for years. Microsoft Excel has RegEx long time ago. Please do some Google search before writing this.

  • http://blog.fosketts.net sfoskett

    No Microsoft Excel function supports regular expressions, and as noted, these functions are not Excel compatible. Also as noted, you can add regular expressions using VBA or other add-ins, but many people are unwilling to do that for fear of breaking general compatibility. Thank you for your comment, however.

  • Jenny Robertson

    I really want to grep my Google Drive documents from the search box showing my list of files. I want it to search my files for a string (or regex) and would return a list of search results with the few lines of context for each file (or the row in a spreadsheet). Do you know how to do this?

  • David Benson

    Words cannot describe how helpful this post was. You are not lying when you say that Google’s documentation for spreadsheets is awful. Didn’t even know you had this ability until I found your post.

    An additional question for you. Do you have any insight about how to use =RegexMatch to count the number of occurrences of a match, both in the same cells and in multiple cells? I’ve been currently using ArrayForumla to print out a list of Regex Matches for multiple cells, then using a Countif function, but I’m sure there is a better way. I’d love to get your thoughts on this challenge, since you’re the only person I’ve found with any insight on this subject.

    Thanks a heap!

    -db

  • Morgrim X

    In your Google example of REGEXEXTRACT, you explain the meaning of the forward slashes, but your example has backward slashes, not forward ones.

    I have a string for XY coordinates separated by a colon, e.g. 429:827. I want to break those up into an X value in one cell and the Y value in a 2nd cell.

    I have done the first string using the regular expression “[0-9]+”. I am having troubles with the Y coordinate. I tried “:[0-9]+”, but that results in the first character being the colon, e.g. :827. I suppose there’s more than one way to skin that cat, but to better understand regular expressions, I’d like to create one that does not result in the colon being there.

  • Morgrim X

    NVM, lol. I used REGEXREPLACE to remove the first number and the colon with nothing, leaving the Y coords there. But if there is another way, I’d like to learn all of them