• In her TED talk, Emily Oster challenges our conception of AIDS and suggests other covariates that we need to look at (e.g. export volumes of coffee). Until we get out of the mindset that poverty and health care are the only causes/predictors of AIDS, we won’t be able to find the best way to fight the disease. Another great use of data.

    I do have one small itch to scratch though. Emily had a line plot that shows export volumes and another line, on the same grid, of HIV infections, both over time. It reminds me of the plots that Al Gore uses with carbon dioxide levels and temperature. Anyways, using the plot, Emily suggests a very tight relationship between export volumes and HIV infections. Isn’t export volume pretty tightly knit to poverty? I don’t know. She’s the economist, so she would know (A LOT) better than me. I guess I just wish she talked a little bit about the new and different data she has that compels us to change our conceptions.

  • Gas Prices over Time

    While on the subject of gas prices, Foreign Policy has a graph of the prices per gallon of gasoline from 2000 to 2006. With the US at the lower tier, I feel like a bit of a whiner (“Waa waa waa, it costs 30 dollars to fill my tank”). At the lower end, Venezuela seems the place to be, with some major government subsidizing going on.


  • A very simple graph from The Economist (spiced up a bit with a picture of a delicious gasoline droplet) that quickly gets its point across. The United States uses a lot of petrol compared to other countries, while at the same time, it costs less to fill up a Honda Civic in the US than most other places.

    However, the left graph is based on 2003 data. I wonder what the graph looks like now? Similar, I’m sure, but still something to look at.

    Anyways, something really interesting here — even though Venezuela has crazy low gas prices, the average petrol consumption per day over there is still quite low. Whether this is a cultural thing or just some weird supply and demand thing (that I have no clue about) might be worth some investigating.

    In any case, just because we have lower gas prices (that we still complain about) than a lot of the world, we’re still consuming a lot. What’s our excuse?

  • As Jon Udell has mentioned, there’s a ton of data online, but we often can’t find it because it’s hidden in the deep, dark basement of some website. He has proposed that people bookmark public datasets on del.icio.us under the tag “publicdata”. I think this is a great idea. In turn, you can subscribe to the feed with the url http://del.icio.us/tag/publicdata.

    I’ve been doing this already for a while, but I had been just tagging with “data”. So I’m going to join in on the party and start tagging with publicdata, and I hope others will too. Until sites like Many Eyes and Swivel get more wind beneath their wings, I think it’s necessary.

  • Ratatouille Visualization

    By now, I’m sure everyone has heard of Pixar’s most recent movie, Ratatouille. If you haven’t seen it, I HIGHLY recommend it. Not only is it beautiful animation and a nice story, but it’s about food. I love Pixar. There are a few scenes in the movie when the main character, Remy, and his brother, Emile, are eating and experiencing the taste of some exquisite cheese.

    There was some pretty taste visualization going on, done by Michel Gagne.

    Around 1400 drawings were created for the animation. Each one was scanned, painted and composited using two softwares: Animo and Photoshop.

    That’s a lot of hand drawings, but quite nice results. Good job, Michel.

  • Pedometer

    It’s really easy to be lazy when you work from home. I can tell you this first-hand.

    Twenty-six steps from my bedroom to the kitchen; 6 steps from bedroom to study room; 29 steps from study room to kitchen; 24 steps from kitchen to bathroom. Do some back and forth, go through the rotation a few times, and that’s my day. I can easily go a whole day walking (or dragging my feet) only 300 steps. That’s sad.

    Just how sad is it? The Walking Site (um, yes, there really is a walking site :) recommends 10,000 steps per day. Wow, only 9,700 steps away! I’m pretty sure I’m slowly getting fatter due to my sloth-like behavior.

    In efforts to avoid the gut, I’ll be wearing my trusty pedometer to shoot for 10,000 steps per day. Of course I’ll be logging this data online, and we can all see how un-lazy I can become. Who knows?

    I can tell you this though. I used to wear this nifty step counter a few months back, and it certainly made me more aware of my laziness. I started walking more and took the long route, around campus, from my office to the car. Sometimes, we just need to see proof to change. As if a pot belly and excessive sweating weren’t enough.

  • I just added the Browser Statistics add-on to my Firefox browser. On the bottom left corner, it shows the number of kilobytes downloaded for the current page, total number of kilobytes downloaded since the last start of the browser, and number of pages loaded. I’m going to try to log these numbers each day and try to make use of the data (uh, if laziness doesn’t get the best of me). If only there were some automated data logging.
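
    Short of true automation, a tiny script can at least keep the daily log tidy. This is a minimal sketch, assuming the three numbers are read off the status bar and typed in by hand; the function name log_browser_stats, the column names, and the file name are all made up for illustration, and nothing here talks to the add-on itself.

```python
import csv
import os
from datetime import date

# Hypothetical column names; the add-on does not define an export format.
FIELDS = ["date", "page_kb", "session_kb", "pages_loaded"]


def log_browser_stats(path, page_kb, session_kb, pages_loaded, day=None):
    """Append one day's numbers to a CSV file, writing a header if the file is new."""
    day = day or date.today()
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "date": day.isoformat(),
            "page_kb": page_kb,
            "session_kb": session_kb,
            "pages_loaded": pages_loaded,
        })


if __name__ == "__main__":
    # Example values only; replace with the numbers shown in the browser.
    log_browser_stats("browser_stats.csv", page_kb=120, session_kb=5400, pages_loaded=42)
```

    Appending one row per day keeps the file easy to plot later, which is the whole point of logging in the first place.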

  • After parsing Weather Underground pages to grab temperature data, it’s time to look at the data. Can’t download all that data and not do anything with it!

    First off, in my initial pass of my parsing script, I accidentally cut the month range short, so I didn’t get any data for December from 1980 to 2005. It should be noted that these plots don’t show this missing data. Um, there’s no axes or labels either. Sorry, I got a little lazy, but that’s not the point now anyways.
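
    A month range that stops one short is an easy slip in Python, because range() leaves out its end value. The sketch below is only a guess at the kind of bug involved, not the actual parsing script, and the function names are made up for illustration.

```python
# Hypothetical sketch: the original parsing script is not shown in the post.

def months_cut_short():
    """Buggy version: range() stops before 12, so December is never requested."""
    return list(range(1, 12))


def months_full_year():
    """Fixed version: the end value must be 13 to include December."""
    return list(range(1, 13))


def year_month_pairs(first_year, last_year):
    """Every (year, month) pair to fetch, with both end years included."""
    return [
        (year, month)
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]


if __name__ == "__main__":
    print(months_cut_short()[-1])   # 11, December is missing
    print(months_full_year()[-1])   # 12
    print(len(year_month_pairs(1980, 2005)))
```

    Counting the pairs before downloading anything is a cheap sanity check: 26 years should give 312 months, and a shorter list means something got dropped.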

  • Weather Underground is a useful site and a fun place for weather enthusiasts.…

  • It’s easy to see how Statistics got this bad rap because it’s so easy to lie with data, charts, and graphs. Sometimes it’s on purpose: someone might try to present “good” results that actually suck. Sometimes it’s accidental: someone might have misread, or not read, the documentation that came with the data. In the case of Swivel’s most recently featured graph, it was the latter. A case of mistaken identity so to speak.

    The data about doping tests in sports came from here. Now the graph on Swivel would have you believe that the data represent the number of doping cases found in each sport; however, according to the USADA report, the data is actually the number of tests the association conducted inside and outside competition during the first quarter of this year. The report contains no data on the USADA’s findings.

    What We Learn

    What can we learn from this? It’s great to visualize data, but you have to be careful. Read the documentation. Find out what the data is about, because without context, the visualization or any findings are practically useless. Statistics isn’t for lying. In fact, it’s the exact opposite. Statistics came about and exists today to reveal the truth.

  • Xtimeline allows you to explore all sorts of user-created timelines from the US war in Iraq to the life of Angelina Jolie to the history of pornography. I think the site is still pretty new since the most viewed timelines for the month, past 3 months, and year are still all the same, but nevertheless, from the looks of things, a nice community seems to be developing over there.

    The timelines are (I think) in javascript and what you see is a timeline of user-entered events. As you click and drag through time, events are displayed on the right. You can click on an event for more details, and an event can be anything from text to a picture to a Flash-embedded video.

    One suggestion: it looks like timelines can only be edited by a single user. It would be cool if multiple users could contribute to a single timeline, because I think it’s hard to remember all the dates (especially the months) for certain events. We can’t all be like Victor, who seems to know an awful lot about Britney Spears.

    *UPDATE* I just read the xtimeline blog. Yup, xtimeline did, in fact, just open up to the public July 1.

  • Adrian Holovaty released templatemaker yesterday. Adrian is probably best known as the guy, featured on YouTube, who played the MacGyver theme song. So clearly, he is a man of many talents.

    Anyways, templatemaker is a Python script to extract data from text, um, HTML. For example, you could pass a review page from a site like Yelp, or several pages, and the script will “learn” the template. Once a template is established, you can extract the stuff that changes (e.g. ratings, restaurant name). Here, in Adrian’s words:

    You can give templatemaker an arbitrary number of HTML files, and it will create the “template” that was used to create those files. (“Template,” in this case, means a string with a number of “holes” in it, where the holes represent the parts of the page that change.) Once you’ve got the template, you can then give it any HTML file that uses that same template, and it will give you the raw data: “The value for hole 1 is ‘July 6, 2007’, the value for hole 2 is ‘blue’,” etc.

    It’s under the BSD license, so all the more reason to use it. I haven’t used it yet, but I’m looking forward to it.
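
    To make the “holes” idea concrete, here is a rough sketch using only Python’s standard library. It is not templatemaker’s code or its API: the function names learn_template and extract are hypothetical, the sample pages are made up, and the approach (diff two pages token by token, keep the shared text as the template) is just one simple way to get the same effect.

```python
import difflib
import re

# Split a page into tags, runs of whitespace, and words, so the diff works on
# whole tokens instead of single characters. The last alternative catches a stray "<".
TOKEN = re.compile(r"<[^>]+>|\s+|[^<\s]+|<")


def learn_template(page_a, page_b):
    """Return the chunks of text shared by both pages, in order.

    The gaps between consecutive chunks are the "holes".
    """
    tokens_a = TOKEN.findall(page_a)
    tokens_b = TOKEN.findall(page_b)
    matcher = difflib.SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    return [
        "".join(tokens_a[block.a:block.a + block.size])
        for block in matcher.get_matching_blocks()
        if block.size > 0
    ]


def extract(chunks, page):
    """Return the hole values for a page, or None if it does not fit the template.

    Assumes pages begin and end with shared text, as in the example below.
    """
    pattern = "(.*?)".join(re.escape(chunk) for chunk in chunks)
    match = re.fullmatch(pattern, page, flags=re.DOTALL)
    return None if match is None else list(match.groups())


if __name__ == "__main__":
    first = "<li>Date: 2007-07-06</li><li>Color: blue</li>"
    second = "<li>Date: 2007-07-09</li><li>Color: green</li>"
    template = learn_template(first, second)
    print(template)
    print(extract(template, "<li>Date: July 6, 2007</li><li>Color: blue</li>"))
```

    One catch with this sketch: anything the two sample pages happen to share, even inside a value, gets treated as part of the template, so more (and more varied) sample pages would give better holes.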

  • Maps of War

    As a representation of history over time and space, Maps of War does a pretty good job of displaying the information in the form of Flash animations. It’s quite simple really. The animation starts centuries back (e.g. 2000BC) and moves to geographic regions. In the above map, I watched who has controlled the Middle East, from 3000BC up through 2006.

  • Okay, so this video has been posted probably on thousands of blogs already, but you know what, I don’t care. Hans Rosling gives an amazing talk on poverty and life around the world, and he uses his interactive exploratory tool, Trendalyzer (acquired by Google), to show the different levels of health, education, and money around the world. Trendalyzer: useful, yes, but not the main point of the talk. Watch Rosling’s talk all the way through. You won’t be disappointed.

  • There was a Sharp Rise Seen in Applications for Citizenship, as reported in The Times today, and of course there was a graphic to complement that article that showed the rise in applications over the years as well as a by-country breakdown for 2006.

    Surge Seen in Applications for Citizenship

    Graphics in The Times always cite the source, which was the Department of Homeland Security in this case. I thought, “Do they have some kind of source who they actually call to get this data?” Thinking such a thing, I feel pretty dumb now. In fact, I always see that source on all of the graphics, and have just assumed that there was some connection between The Times and the source.

    Wrong.

    So lazy me finally decided to look into things, and you know what, the Department of Homeland Security has a whole section on their website for Immigration Statistics. There are freely available spreadsheets, reports, publications, and even a little something on data standards and definitions, prepared by none other than the Office of Immigration Statistics. Very pleased.

    It’s kind of sad that this is just now news to me, but better now than never, eh?

  • Twittervision 3D

    Twittervision is a Google Maps mashup using the Twitter RSS feed. As people post to Twitter, you see the map move from location to location all around the world. It’s really simple, but there’s something entertaining about it that I can’t quite put my finger on. Maybe we just like to peek into other people’s lives. Anyways, I don’t know how recent this is, but Twittervision now has a third dimension which is just as entertaining as the original.

  • Swarm Theory, by Peter Miller, talks about how some animals, as individuals, aren’t smart, but as a group or a swarm, they can do amazing things. The above is a flock of starlings that can change shapes even though no single bird is the leader.

    Can we apply swarm theory to social data analysis? As individuals, we might not be able to hold onto or understand a dataset, but as a group, we can come at a dataset from different perspectives, look at very small parts, and then, as an end result, extract real, worthwhile meaning.

    That’s how swarm intelligence works: simple creatures following simple rules, each one acting on local information. No ant sees the big picture. No ant tells any other ant what to do. Some ant species may go about this with more sophistication than others. (Temnothorax albipennis, for example, can rate the quality of a potential nest site using multiple criteria.) But the bottom line, says Iain Couzin, a biologist at Oxford and Princeton Universities, is that no leadership is required. “Even complex behavior may be coordinated by relatively simple interactions,” he says.

    It reminds me of that common saying, or maybe it’s a quote, about how if you put a bunch of monkeys in a room with typewriters, you’ll eventually get the works of Shakespeare via the magic of probability. While the whole monkey thing is a bit far-fetched, swarm theory is certainly worth my attention.

  • We need to interact with others. We crave connections with friends and strangers. Something inside makes us need to converse with others so that we don’t go crazy. As I work from home, I’ve begun to understand this a bit more, and I’ve found myself checking Facebook and Twittering perhaps just a little too much. I think these connections are what have made social networks so popular.

    How can we visualize these ever so important connections? An obvious option is with, well, lines.

    Pretty, yes. Useful? Umm, hmm, not really. Once the number of nodes grows past 20, it becomes this cloud/blob-type thing. What meaning can we take away from a visualization like this other than, there’s a lot of nodes and links, and they’re all interconnected (other than a few outsiders)?

    Okay, so here’s another option: instead of using lines to show connections between nodes, we can use clustering. Nodes that are similar appear closer together.

    Clustering Social Networks

    We can see some patterns now with the clustering and coloring, but when the network grows to thousands, it’s easy to see how things can get kinda gross. I think the natural next step here is to sample, provide an overview, and if the user wants to go deeper, sample some more.

    The big question: how do we know what to sample? What weight can we give each sample? How can we get a sample that properly represents the entire network (or a small, specific part of it)?
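
    The simplest possible starting point is a uniform random sample of nodes, keeping only the links between the nodes that got picked. Here’s a minimal sketch, assuming the network comes as a list of (node, node) pairs; the function name sample_network is made up, and the weight is just the usual “each sampled node stands in for N/n nodes” scaling for a simple random sample. Nothing about uniform sampling promises that clusters or other structure survive, which is exactly why the questions above are the hard part.

```python
import random


def sample_network(edges, sample_size, seed=None):
    """Uniformly sample nodes and return (kept_nodes, kept_edges, weight).

    edges: list of (node, node) pairs; nodes with no links are not represented.
    weight: how many nodes of the full network each sampled node stands in for.
    """
    nodes = sorted({node for edge in edges for node in edge})
    size = max(0, min(sample_size, len(nodes)))
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    kept = set(rng.sample(nodes, size))
    # Keep only links whose two endpoints were both sampled.
    kept_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    weight = len(nodes) / size if size else 0.0
    return kept, kept_edges, weight


if __name__ == "__main__":
    # Made-up friendship links, for illustration only.
    friendships = [
        ("ann", "bob"), ("bob", "cat"), ("cat", "ann"),
        ("dan", "eve"), ("eve", "fay"), ("fay", "ann"),
    ]
    kept, kept_edges, weight = sample_network(friendships, sample_size=3, seed=1)
    print(sorted(kept), kept_edges, weight)
```

    Going deeper would then mean sampling again inside whatever region the user picks, with the weight recomputed for that smaller population.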

  • Akamai: Network Performance Comparison

    Akamai is a technology company that deals with routing and online business. They optimize routing over the Internet using the data they collect from servers set up in 71 countries. Or I guess, in their words:

    Akamai’s technology – at its core, applied mathematics and algorithms – has transformed the chaos of the Internet into a predictable, scalable, and secure platform for business and entertainment. The Akamai EdgePlatform comprises 20,000 servers deployed in 71 countries that continually monitor the Internet – traffic, trouble spots and overall conditions. We use that information to intelligently optimize routes and replicate content for faster, more reliable delivery. As Akamai handles 20% of total Internet traffic today, our view of the Internet is the most comprehensive and dynamic collected anywhere.

    Wait, that’s not the good part. They use Flash-based visualization to display how good they really are. I did a network performance comparison for a route from New York to Hong Kong, and in turn, the viz showed the public internet path and a much-improved Akamai path. Less packet loss and lower latency for Akamai. It’d be interesting to know how those routes are depicted, because I imagine the routes aren’t really always straight line vs parabola, Akamai vs public internet. Very pretty though.

  • Weight loss is a difficult task for many, further complicated by so many diets (Atkins, Jenny Craig, etc.) and a lack of motivation. Fatsecret aims to make weight loss easier by providing the tools to track your weight loss, write about it, see what others are doing, and share your progress.

    There are a couple of graphs (built in Flash) on the homepage. The first, a pie chart, shows the proportions of fatsecret users on certain types of diets. You can see the proportions for this week, this month, or all time.

    Then towards the bottom is a bar chart showing the average weight loss of fatsecret users for specified diets. Again, you can see it for this week, this month, and all time.

    fatsecret: avg weight loss

    Every user has her own homepage which shows a line graph of her progress as well as the average weight loss of fatsecret members on the user’s same diet.

    Fatsecret seems like quite an active site with plenty of posting, tips, and member interactions, which makes me pretty happy. Next step: interactive tools.