Practical tips for scraping data

Posted to Coding  |  Tags: ,  |  Nathan Yau

It’s an unpleasant feeling when you have an idea for a project and the data you need is sitting right in front of you on a bunch of random-looking webpages instead of a nice, delimited file. You could either forget about your idea (which is what most people do), you can record manually, or you can take an automated route with a bit of scraping know-how.

I often find myself taking the tedious, manual route out, but sometimes scraping is clearly the best option. David Eads from the NPR Visuals Team describes how they use a model-control approach to scraping data.

Step 1: Find the data and figure out the HTML and/or JavaScript format and pattern. Step 2: Setup a way to parse and spit out the formatted data. Step 3: Optimize.

Oh, and before all that, make sure it’s legal.

Favorites

A Day in the Life of Americans

I wanted to see how daily patterns emerge at the individual level and how a person’s entire day plays out. So I simulated 1,000 of them.

How to Spot Visualization Lies

Many charts don’t tell the truth. This is a simple guide to spotting them.

Interactive: When Do Americans Leave For Work?

We don’t all start our work days at the same time, despite what morning rush hour might have you think.

Reviving the Statistical Atlas of the United States with New Data

Due to budget cuts, there is no plan for an updated atlas. So I recreated the original 1870 Atlas using today’s publicly available data.