Grab Data with templatemaker

Posted to Software  |  Nathan Yau

Adrian Holovaty released templatemaker yesterday. Adrian is probably best known as the guy, featured on YouTube, who played the MacGyver theme song. So clearly, he's a man of many talents.

Anyways, templatemaker is a Python script to extract data from text, um, HTML. For example, you could pass it a review page from a site like Yelp, or several pages, and the script will “learn” the template. Once a template is established, you can extract the stuff that changes (e.g. ratings, restaurant name). Here it is in Adrian’s words:

You can give templatemaker an arbitrary number of HTML files, and it will create the “template” that was used to create those files. (“Template,” in this case, means a string with a number of “holes” in it, where the holes represent the parts of the page that change.) Once you’ve got the template, you can then give it any HTML file that uses that same template, and it will give you the raw data: “The value for hole 1 is ‘July 6, 2007’, the value for hole 2 is ‘blue’,” etc.
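
To make the “holes” idea concrete, here is a rough sketch in plain Python. To be clear, this is not templatemaker's code or its API. It's a simplified stand-in that uses the standard library's `difflib` to find the text two pages share, treats everything else as a hole, and then pulls the hole values out of a third page with a regular expression. The function names (`learn_template`, `extract`), the `min_len` cutoff, and the sample pages are all made up for illustration.

```python
import re
from difflib import SequenceMatcher

HOLE = None  # marker for "the part of the page that changes"


def learn_template(a, b, min_len=3):
    """Build a template (list of literal strings and HOLE markers) from two pages.

    Illustrative only: shared runs of at least min_len characters become
    literals, and whatever sits between them becomes a hole.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    template = []
    pos_a = pos_b = 0
    for block in matcher.get_matching_blocks():
        if block.size < min_len:
            continue  # ignore tiny accidental matches (and the final empty block)
        if block.a > pos_a or block.b > pos_b:
            template.append(HOLE)  # something differed before this shared run
        template.append(a[block.a:block.a + block.size])
        pos_a, pos_b = block.a + block.size, block.b + block.size
    if pos_a < len(a) or pos_b < len(b):
        template.append(HOLE)  # trailing text that differs
    return template


def extract(template, page):
    """Return the hole values for a page, or None if it doesn't fit the template."""
    pattern = "".join("(.*?)" if part is HOLE else re.escape(part) for part in template)
    match = re.fullmatch(pattern, page, flags=re.DOTALL)
    return list(match.groups()) if match else None


# Two made-up review pages that share the same layout
page1 = "<h1>Blue Cafe</h1><p>Rating: 4</p>"
page2 = "<h1>Red Diner</h1><p>Rating: 5</p>"
template = learn_template(page1, page2)

# A third page built from the same layout
print(extract(template, "<h1>Green Grill</h1><p>Rating: 3</p>"))
# ['Green Grill', '3']
```

The sketch only compares two pages, and short values that happen to share a few characters could get split into extra holes, so treat it as a toy version of the concept rather than a substitute for the real thing.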

It’s under the BSD license, so all the more reason to use it. I haven’t used it yet, but I’m looking forward to it.
