Format and clean your data with Google Refine

When we first learn how to deal with data in school, it’s nicely formatted and fits perfectly into a rectangular spreadsheet. Then when we start to deal with real data, we find missing values, inconsistencies, and for some reason it doesn’t plug straight into our software. What the heck?

The caveman way to fix this problem is to open Excel and manually edit everything. Some ad hoc code can often fix your problems, but still that takes time and can be a pain. Google Refine, the Googley evolution of Freebase Gridworks, can help you.

Google Refine is a power tool for working with messy data sets, including cleaning up inconsistencies, transforming them from one format into another, and extending them with new data from external web services or other databases. Version 2.0 introduces a new extensions architecture, a reconciliation framework for linking records to other databases (like Freebase), and a ton of new transformation commands and expressions.

The project is open-source, and you don’t have to upload your data to some other server. Refine runs on your desktop.

Watch the demo in the video below. Then download the application for yourself. I’m looking forward to taking it for a spin.

Have you tried it out (or the previous Freebase version)? Leave your thoughts in the comments.

[Google via Forums]

12 Comments

  • Oh man, Google bought Freebase Gridworks as well? Things don’t last very long these days before they get assimilated into the Googlesphere.

  • I loved it so much I started committing code to the project!

    Look forward to hearing about your experience with it.

    P.S. When you say “Freebase Frameworks” do you mean “Freebase Gridworks”, which is the old name for Google Refine?

  • Hi have tested it with CSV files and works pretty well, but it’s not valid for large quantities of data.

    If you trie to load in your computer a CSV it will work with 200MB of data, but if you try 500MB for instance, you’ll see the out of memory errors when creating the project. Also, some operations might take very long time to process on large datasets.

    Despite all, is a perfect tool that I use to normalize data, specially grouping categories of rows that usually have ending spaces, bad encodings and so on

  • glenn mcdonald November 16, 2010 at 6:02 am

    I second/third/whatever the Refine endorsement. In some ways you can think of it as a reimagning of Excel from a data-cleaner’s perspective instead of an accountant’s!

    And if you need to clean data that doesn’t quite fit the Refine spreadsheet model, or want to currate and analyze and publish it at once, or need to scrape it off the web first, there’s also http://www.needlebase.com (my project), which is slated to be another Google property pending DOJ review of the ITA acquisition…

  • wow. thanks a million for this. I hadn’t heard of this project/tool – and after watching the videos, and downloading, I’m already impressed with what it can do. There’s about an 80% chance that this gets used almost daily (finally and Export that is flexible!!) I’ll be curious to see how much data I’ll be able to run at once – I tend to hit the limits on that measure – but I’ll be happy to chunk my data (especially since you can re-run your data transformations – that’s a brilliant little addition…)

  • This looks very useful. It bridges the gap between pure scripting (ala R), databases, spreadsheets, Tableau/JMP. Often, I find myself integrating these types of tools separately in managing, cleaning and manipulating data, at least when reproductivity is not an issue (otherwise I stick to R, Python or SAS). Now, if only they would explicitly incorporate Google Charts (I suppose it could be scripted, but…), they might have a killer app for general data cleaning and reporting. I doubt I’d use it for serious stats or to replace a db though.

  • Incredibly useful. I’m left wondering whether there is any way for one’s data to be compromised. I understand that the program is installed and run locally, and the data are stored locally as well (http://code.google.com/p/google-refine/wiki/FaqWhereIsDataStored); however, is there any point at which the data are processed anywhere except the local machine? I happen to work at a place where Refine could take care of the great majority of our data processing problems, but confidentiality is often a priority.

    • Answered my own question. In the last seconds of introduction video, we’re told the whole thing runs locally, which keeps confidential data safe.

    • @Luiz As the intro video says, everything runs locally. And the source is open if you wish to verify this.

      There are a couple of features which do allow you to call external APIs (and are clearly marked as such) – the intro videos show this in action with Google Translate and Freebase.

  • Discovered Freebase Gridworks couple of months ago and do really like the tool. I do save a lot of time and by using it and it helps me to get immediate insight in data quality.
    Hope the take over by Google will boost performance and functionality of the tool in the right direction!