Grab Data with templatemaker

Posted to Software  |  Nathan Yau

Adrian Holovaty released templatemaker yesterday. Adrian is probably best known as the guy, featured on YouTube, who played the MacGyver theme song. So clearly, he’s a man of many talents.

Anyway, templatemaker is a Python script that extracts data from text, um, HTML. For example, you could pass it a review page from a site like Yelp, or several pages, and the script will “learn” the template. Once a template is established, you can extract the stuff that changes (e.g. ratings, restaurant names). Here it is, in Adrian’s words:

You can give templatemaker an arbitrary number of HTML files, and it will create the “template” that was used to create those files. (“Template,” in this case, means a string with a number of “holes” in it, where the holes represent the parts of the page that change.) Once you’ve got the template, you can then give it any HTML file that uses that same template, and it will give you the raw data: “The value for hole 1 is ‘July 6, 2007’, the value for hole 2 is ‘blue’,” etc.
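
To make the “holes” idea concrete, here is a rough sketch of the same concept using only Python’s standard library. To be clear, this is not templatemaker’s code or its API. The function names learn_template and extract, and the sample strings, are made up for illustration, and the sketch assumes the sample pages start and end with shared text and that the changing parts have no characters in common.

```python
import re
from difflib import SequenceMatcher


def learn_template(a, b):
    """Return the literal chunks that two sample pages share, in order.

    Hypothetical helper: the gaps between chunks are the "holes".
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    chunks = []
    for block in matcher.get_matching_blocks():
        if block.size:  # skip the zero-length sentinel block at the end
            chunks.append(a[block.a:block.a + block.size])
    return chunks


def extract(chunks, text):
    """Pull the hole values out of a page that follows the same template."""
    # Escape the literal chunks and put a capture group in each gap.
    pattern = "(.*?)".join(re.escape(chunk) for chunk in chunks)
    match = re.fullmatch(pattern, text, re.DOTALL)
    return match.groups() if match else None


# Made-up sample "pages" that differ only in two spots.
chunks = learn_template("<b>this and that</b>", "<b>alex and sue</b>")
print("!".join(chunks))                           # <b>! and !</b>
print(extract(chunks, "<b>larry and curly</b>"))  # ('larry', 'curly')
```

With real pages the changing parts will often share characters by coincidence, which would split one hole into several here, so treat this as a toy version of the idea rather than something to point at Yelp.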

It’s under the BSD license, so all the more reason to use it. I haven’t used it yet, but I’m looking forward to it.

