Practical tips for scraping data

Posted to Coding  |  Nathan Yau

It’s an unpleasant feeling when you have an idea for a project and the data you need is sitting right in front of you, but on a bunch of random-looking webpages instead of in a nice, delimited file. You can forget about your idea (which is what most people do), record the data manually, or take an automated route with a bit of scraping know-how.

I often find myself taking the tedious, manual route, but sometimes scraping is clearly the best option. David Eads from the NPR Visuals Team describes how they use a model-control approach to scraping data.

Step 1: Find the data and figure out the HTML and/or JavaScript format and pattern. Step 2: Set up a way to parse and spit out the formatted data. Step 3: Optimize.

Oh, and before all that, make sure it’s legal.
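
To make the first two steps a little more concrete, here is a minimal sketch in Python using only the standard library. It is not NPR's code. The sample HTML, the `Record` model, and the `scrape` and `to_csv` helpers are all made up for illustration, and the sketch assumes the data lives in a plain HTML table, which you would confirm by inspecting the real page first.

```python
import csv
import io
from dataclasses import dataclass, astuple
from html.parser import HTMLParser

# Hypothetical page fragment. In practice you would fetch the page and
# inspect its markup to find the pattern (Step 1).
SAMPLE_HTML = """
<table id="results">
  <tr><th>State</th><th>Count</th></tr>
  <tr><td>Ohio</td><td>12</td></tr>
  <tr><td>Utah</td><td> 7 </td></tr>
</table>
"""


@dataclass
class Record:
    """Model: one cleaned row of scraped data (field names are assumptions)."""
    state: str
    count: int


class CellParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th>, so they are skipped
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data


def scrape(html):
    """Step 2a: parse raw HTML into a list of Record models."""
    parser = CellParser()
    parser.feed(html)
    return [Record(state=s.strip(), count=int(c)) for s, c in parser.rows]


def to_csv(records):
    """Step 2b: spit the records out as delimited text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["state", "count"])
    writer.writerows(astuple(r) for r in records)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(scrape(SAMPLE_HTML)), end="")
```

Keeping the model separate from the parsing means that when the page layout changes, only the parser needs to change, while the cleaned output format stays the same. Step 3, optimizing, would come after this works on real pages.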
