Grab Data with templatemaker

Posted to Software  |  Nathan Yau

Adrian Holovaty released templatemaker yesterday. Adrian is probably best known as the guy, featured on YouTube, who played the MacGyver theme song. So clearly, he a man a many talents.

Anyways, templatemaker is a Python script to extract data from text, um, HTML. For example, you could pass a review page from a site like Yelp, or several pages, and the script will “learn” the template. Once a template is established, you can extract the stuff that changes (e.g. ratings, restaurant name). Here, in Adrian’s words:

You can give templatemaker an arbitrary number of HTML files, and it will create the “template” that was used to create those files. (“Template,” in this case, means a string with a number of “holes” in it, where the holes represent the parts of the page that change.) Once you’ve got the template, you can then give it any HTML file that uses that same template, and it will give you the raw data: “The value for hole 1 is ‘July 6, 2007’, the value for hole 2 is ‘blue’,” etc.

It’s under the BSD license, so all the more reason to use it. I haven’t used it yet, but looking forward to it.

Favorites

A Day in the Life of Americans

I wanted to see how daily patterns emerge at the individual level and how a person’s entire day plays out. So I simulated 1,000 of them.

Where Bars Outnumber Grocery Stores

A closer look at the age old question of where there are more bars than grocery stores, and vice versa.

Graphical perception – learn the fundamentals first

Before you dive into the advanced stuff – like just about everything in your life – you have to learn the fundamentals before you know when you can break the rules.

Jobs Charted by State and Salary

Jobs and pay can vary a lot depending on where you live, based on 2013 data from the Bureau of Labor Statistics. Here’s an interactive to look.