• Using data from Beer Advocate, in the form of 1.5 million reviews, yhat shows how to build a recommendation system in R.

    The goal for our system will be for a user to provide us with a beer that they know and love, and for us to recommend a new beer which they might like. To accomplish this, we’re going to use collaborative filtering. We’re going to compare 2 beers by ratings submitted by their common reviewers. Then, when one user writes similar reviews for two beers, we’ll then consider those two beers to be more similar to one another.

    The simple recommender is at the end of the article. Select a beer you like, a type of beer you want to try, and you get a handful of beers you might like.

    Obviously, the method isn’t exclusive to beer reviews, and this is just a start to a more advanced system that you can tailor to your own data. The good news is that the code to scrape data and recommend things is at your disposal. [via @drewconway]
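
    To make the idea concrete, here is a minimal sketch in Python (not yhat's R code) of comparing two beers by the ratings of their common reviewers. The toy ratings, the function names, and the choice of Pearson correlation as the similarity measure are all my assumptions for illustration.

```python
from math import sqrt

# Hypothetical toy data: reviews[user][beer] = rating on a 1-5 scale.
reviews = {
    "ann":  {"Pale Ale": 4.5, "IPA": 4.0, "Stout": 2.0},
    "bob":  {"Pale Ale": 4.0, "IPA": 4.5, "Stout": 1.5},
    "cara": {"Pale Ale": 2.0, "IPA": 2.5, "Stout": 5.0},
    "dev":  {"IPA": 3.0, "Stout": 4.5},
}


def common_ratings(beer_a, beer_b):
    """Paired ratings from users who reviewed both beers."""
    return [
        (r[beer_a], r[beer_b])
        for r in reviews.values()
        if beer_a in r and beer_b in r
    ]


def similarity(beer_a, beer_b):
    """Pearson correlation of two beers' ratings over their common reviewers."""
    pairs = common_ratings(beer_a, beer_b)
    n = len(pairs)
    if n < 2:
        return 0.0  # not enough shared reviewers to compare
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant ratings carry no information
    return cov / sqrt(var_a * var_b)


def recommend(liked, top_n=2):
    """Rank every other beer by similarity to the one the user likes."""
    beers = {b for r in reviews.values() for b in r} - {liked}
    return sorted(beers, key=lambda b: similarity(liked, b), reverse=True)[:top_n]


print(recommend("Pale Ale", top_n=1))
```

    With real review data you would probably also require a minimum number of common reviewers before trusting a similarity score.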

  • MapBox, along with Gnip and Eric Fischer, mapped 3 billion tweets and a handful of variables.

    This is a look at 3 billion tweets — every geotagged tweet since September 2011, mapped, showing facets of Twitter’s ecosystem and userbase in incredible new detail, revealing demographic, cultural, and social patterns down to city level detail, across the entire world. We were brought in by the data team at Gnip, who have awesome APIs and raw access to the Twitter firehose, and together Tom and data artist Eric Fischer used our open source tools to visualize the data and build interfaces that let you explore the stories of space, language, and access to technology.

    You’ll probably recognize some of the maps, as they build on Fischer’s previous projects, such as languages of Twitter and locals versus tourists. The originals were static images though. The interaction provides an exploratory view that lets you poke around the areas you’re interested in, and maybe best of all, it was built with open source software.

  • NOAA visualized global vegetation over a year, and the result is beautiful:

    We’ve seen forestry maps before, some quite detailed, but this is the first time I’ve seen vegetation mapped at this granularity over a period of time.

    Although 75% of the planet is a relatively unchanging ocean of blue, the remaining 25% of Earth’s surface is a dynamic green. Data from the VIIRS sensor aboard the NASA/NOAA Suomi NPP satellite is able to detect these subtle differences in greenness. The resources on this page highlight our ever-changing planet, using highly detailed vegetation index data from the satellite, developed by scientists at NOAA. The darkest green areas are the lushest in vegetation, while the pale colors are sparse in vegetation cover either due to snow, drought, rock, or urban areas. Satellite data from April 2012 to April 2013 was used to generate these animations and images.

    The changes are especially obvious as the season moves to summer, going from snow-covered to deep green.

  • Stuff happens, and people tweet about it. Something major happens, and a lot of people tweet about it. Masters student Stanislav Nikolov and his adviser Devavrat Shah are working on ways to algorithmically detect the latter.

    People acting in social networks are reasonably predictable. If many of your friends talk about something, it’s likely that you will as well. If many of your friends are friends with person X, it is likely that you are friends with them too. Because the underlying system has, in this sense, low complexity, we should expect that the measurements from that system are also of low complexity. As a result, there should only be a few types of patterns that precede a topic becoming trending. One type of pattern could be “gradual rise”; another could be “small jump, then a big jump”; yet another could be “a jump, then a gradual rise”, and so on. But you’ll never get a sawtooth pattern, a pattern with downward jumps, or any other crazy pattern.

    And with that, the algorithm compares current patterns to the ones above. If they look like a trending pattern, the algorithm marks something as a trend with some probability. In testing with past trending topics, the algorithm was able to pick correctly over 90 percent of the time.

    The best part is that this method can be applied to other time series data. “We can try this on traffic data to predict the duration of a bus ride, on movie ticket sales, on stock prices, or any other time-varying measurements.”
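
    A rough way to picture the comparison step is the Python sketch below. It is a simplified stand-in rather than Nikolov and Shah's actual algorithm: the example series, the squared-distance measure, the exponential weighting, and the `gamma` parameter are all assumptions I made for illustration.

```python
from math import exp

# Hypothetical activity counts in the window before a topic did (or did not) trend.
trending_examples = [
    [1, 2, 4, 8, 16],   # gradual rise
    [1, 1, 3, 3, 12],   # small jump, then a big jump
    [1, 6, 7, 9, 11],   # a jump, then a gradual rise
]
quiet_examples = [
    [2, 2, 3, 2, 2],
    [5, 4, 4, 3, 3],
    [1, 2, 1, 2, 1],
]


def distance(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vote(window, examples, gamma):
    """Closer examples get exponentially larger votes."""
    return sum(exp(-gamma * distance(window, e)) for e in examples)


def trending_probability(window, gamma=0.05):
    """Share of the total vote that comes from the trending examples."""
    up = vote(window, trending_examples, gamma)
    down = vote(window, quiet_examples, gamma)
    total = up + down
    return 0.5 if total == 0 else up / total


print(trending_probability([1, 2, 5, 9, 15]))  # resembles the rising examples
print(trending_probability([2, 3, 2, 2, 3]))   # resembles the flat examples
```

    The same scoring would work on any time series for which you have labeled past examples, which is the appeal mentioned in the quote above.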

  • When you go to a conference, there are typically several talks going on at the same time, and you can always tell there’s a popular paper coming up when you see people leave a bunch of rooms at once and head straight into one. There’s also the unfortunate case when someone speaks, and there’s only a handful of people in the room, all in the back staring at their laptops. Open Data City visualized this activity during the German internet conference re:publica.

    Open Data City used MAC addresses and access point connections to keep track of where devices went. A person in one room is connected to the nearest access point, disconnects when leaving, and then reconnects to a different access point when entering another room, which provides the flow.

    It’s fun to watch the conference play out even if you didn’t attend. Each dot represents an attendee, and as the animation plays the dots migrate from room to room. Click and drag over the dots to select specific people. [Thanks, Michael]
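
    If you wanted to derive that kind of flow from connection logs yourself, a sketch might look like the Python below. The log format, device identifiers, and room names are hypothetical, and I have no details on how Open Data City actually processed its data.

```python
from collections import Counter

# Hypothetical connection log: (minute, device id, room of the access point).
events = [
    (0, "aa:01", "Stage 1"),
    (0, "aa:02", "Stage 1"),
    (0, "aa:03", "Stage 2"),
    (30, "aa:01", "Stage 2"),
    (30, "aa:02", "Stage 2"),
    (45, "aa:03", "Stage 2"),  # reconnect in the same room, so not a move
    (60, "aa:01", "Stage 1"),
]


def room_flows(events):
    """Count moves between rooms by following each device through time."""
    last_room = {}
    flows = Counter()
    for _, device, room in sorted(events):
        previous = last_room.get(device)
        if previous is not None and previous != room:
            flows[(previous, room)] += 1
        last_room[device] = room
    return flows


for (src, dst), count in room_flows(events).items():
    print(f"{src} -> {dst}: {count}")
```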

  • As data grows cheaper and more easily accessible, the people who analyze it aren’t always statisticians. Many likely have no statistical training at all. Biostatistics professor Jeff Leek says we need to adapt to this broader audience.

    What does this mean for statistics as a discipline? Well it is great news in that we have a lot more people to train. It also really drives home the importance of statistical literacy. But it also means we need to adapt our thinking about what it means to teach and perform statistics. We need to focus increasingly on interpretation and critique and away from formulas and memorization (think English composition versus grammar). We also need to realize that the most impactful statistical methods will not be used by statisticians, which means we need more fool proofing, more time automating, and more time creating software. The potential payout is huge for realizing that the tide has turned and most people who analyze data aren’t statisticians.

    Yep.

    Those who disagree tend to worry about what kind of data-based decisions non-statisticians will make, and that should definitely be a priority as we move forward. Non-statisticians often make incorrect assumptions about the data, forget about uncertainty, and don’t know much about collection methodologies.

    However, as a statistician (or someone who knows statistics), you can shoo everyone else away from the data and gripe when they come back, or you can help them get things right.

  • Curious about how people use “geek” and “nerd” to describe themselves, and whether there is any difference between the two terms, Burr Settles analyzed the words used in tweets that contained them. Settles used pointwise mutual information (PMI), which essentially provides a measure of the geekness or nerdiness of a term. The plot above shows the results.

    In broad strokes, it seems to me that geeky words are more about stuff (e.g., “#stuff”), while nerdy words are more about ideas (e.g., “hypothesis”). Geeks are fans, and fans collect stuff; nerds are practitioners, and practitioners play with ideas. Of course, geeks can collect ideas and nerds play with stuff, too. Plus, they aren’t two distinct personalities as much as different aspects of personality. Generally, the data seem to affirm my thinking.

    Or maybe pop culture (geek) versus education (nerd).
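
    For reference, PMI compares how often a word and a label occur together against what you would expect if they were independent: log2 of P(word, label) / (P(word) × P(label)). Below is a minimal Python sketch on a made-up handful of tweets. The corpus and the per-tweet counting are my assumptions, not Settles's data or exact procedure.

```python
from math import log2

# Hypothetical toy corpus: (which term the tweet contained, other words in it).
tweets = [
    ("geek", ["comics", "gadget", "collect"]),
    ("geek", ["gadget", "fan", "coffee"]),
    ("nerd", ["hypothesis", "study", "coffee"]),
    ("nerd", ["study", "math", "hypothesis"]),
]


def pmi(word, label):
    """Pointwise mutual information between a word and a label, in bits."""
    n = len(tweets)
    n_label = sum(1 for l, _ in tweets if l == label)
    n_word = sum(1 for _, ws in tweets if word in ws)
    n_both = sum(1 for l, ws in tweets if l == label and word in ws)
    if n_both == 0:
        return float("-inf")  # never seen together
    return log2((n_both / n) / ((n_word / n) * (n_label / n)))


for word in ["gadget", "coffee", "hypothesis"]:
    print(word, pmi(word, "geek"), pmi(word, "nerd"))
```

    A positive score means the word shows up with that label more than chance would suggest, zero means no association, and a negative score means they tend to avoid each other.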

  • It’s just metadata. What can you do with that? Kieran Healy, a sociology professor at Duke University, shows what you can do, with just some basic social network analysis. Using metadata from Paul Revere’s Ride on the groups that people belonged to, Healy sniffs out Paul Revere as a main target. Bonus points for writing the summary from the point of view of an 18th-century analyst.

    What a nice picture! The analytical engine has arranged everyone neatly, picking out clusters of individuals and also showing both peripheral individuals and—more intriguingly—people who seem to bridge various groups in ways that might perhaps be relevant to national security. Look at that person right in the middle there. Zoom in if you wish. He seems to bridge several groups in an unusual (though perhaps not unique) way. His name is Paul Revere.

    You can grab the R code and dataset on github, too, if you want to follow along.
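
    The core trick is turning "who belongs to which group" into "who is tied to whom." Here is a small Python sketch of that step with made-up people and groups (not the Revere dataset). It uses a simple count of ties as the centrality measure, which is my simplification; Healy's own R code is the place to look for the real analysis.

```python
# Hypothetical toy membership data: person -> groups they belong to.
membership = {
    "P1": {"G1", "G2", "G3"},
    "P2": {"G2", "G3"},
    "P3": {"G1", "G3"},
    "P4": {"G2"},
    "P5": {"G1"},
}


def shared_groups(membership):
    """Person-by-person counts of groups in common.

    This is the off-diagonal part of the membership matrix
    multiplied by its own transpose.
    """
    people = sorted(membership)
    return {
        a: {b: len(membership[a] & membership[b]) for b in people if b != a}
        for a in people
    }


def degree(ties):
    """Number of other people each person shares at least one group with."""
    return {
        person: sum(1 for weight in others.values() if weight > 0)
        for person, others in ties.items()
    }


ties = shared_groups(membership)
scores = degree(ties)
print(max(scores, key=scores.get), scores)
```

    The person who sits in several groups at once ends up with the most ties, which is the kind of bridging role the quote describes.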

  • A few years ago I downloaded speed dating data from experiments conducted by…

  • Damien Hirst is an artist known for a number of works, one of those being his large production of spot paintings. There are over a thousand of them painted by him and his assistants, varying in size, number of dots, density, and color. Amanda Cox of The New York Times plotted paintings sold from 1999 to present, topping out at $3.4 million. That’s a whole lot of dottage.

  • The Onion tackles data privacy:

    “As a law-abiding resident of this nation, I have the right to do whatever I want without a shadowy organization recording my every move, unless of course it’s part of an electronic campaign designed to figure out, based on all of my emails and phone conversations, what types of clothes, shoes, and houseware products I like. Then it’s fine.” Sources later confirmed that Landler had posted a Facebook rant on the issue, which had generated a pop-up ad from a company that restores lost PC data.

  • It seems like the technical side of map-making, the part that requires code or complicated software installations, fades a little more every day. People get to focus more on actual map-making than on server setup. Map Stack by Stamen is the most recent tool to help you do this.

    We provide access to different parts of the map stack, like backgrounds, roads, labels, and satellite imagery. These can be modified using straightforward controls to change things like color, opacity, and brightness. So within a few minutes you can have a map of anywhere in the world with dark green parks and blue buildings. You can get very precise with image overlays and layer effects, using layers as cut-out masks for other layers. Or just make a regular-looking map in the colors you want.

    The idea is to make it radically simpler for people to design their own maps, without having to know any code, install any software, or even do any typing.

    It’s completely web-based, and you edit your maps via a click interface. Pick what you want (or use Stamen’s own stylish themes) and save an image. For the time being, the service is open only from 11am to 5pm PST, so just come back later if it happens to be closed.

    See here for a taste of what others have done so far.

  • OpenStreetMap, the free wiki world map that offers up high quality geographic data, has grown a lot in the past eight years. The OpenStreetMap Data Report shows all these changes. Says the report: “The database now contains over 21 million miles of road data and 78 million buildings.”

  • June 10, 2013

    With all the stuff going on with surveillance and data privacy — especially the past week — it’s worthwhile to revisit this essay by Daniel J. Solove, a professor of law at George Washington University, on why privacy matters even if you “have nothing to hide.”

    “My life’s an open book,” people might say. “I’ve got nothing to hide.” But now the government has large dossiers of everyone’s activities, interests, reading habits, finances, and health. What if the government leaks the information to the public? What if the government mistakenly determines that based on your pattern of activities, you’re likely to engage in a criminal act? What if it denies you the right to fly? What if the government thinks your financial transactions look odd—even if you’ve done nothing wrong—and freezes your accounts? What if the government doesn’t protect your information with adequate security, and an identity thief obtains it and uses it to defraud you? Even if you have nothing to hide, the government can cause you a lot of harm.

    “But the government doesn’t want to hurt me,” some might argue. In many cases, that’s true, but the government can also harm people inadvertently, due to errors or carelessness.

    You might not have anything to hide right now, but maybe a random string of choices that was completely harmless looks a lot like something else a few years from now, to someone sniffing around the archives. The patterns when there are no patterns sort of thing. Personal data without the person. [via @hmason]

  • The Brewers Association just released data for 2012 on craft beer production and growth. The New Yorker mapped the data in a straightforward interactive.

    As of March, the United States was home to nearly two thousand four hundred craft breweries, the small producers best known for India pale ales and other decidedly non-Budweiser-esque beers. What’s more, they are rapidly colonizing what one might call the craft-beer frontier: the South, the Southwest, and, really, almost any part of the country that isn’t the West or the Northeast.

    Most articles and lists on craft beer tend to focus on total production and breweries, so California, a big state with a lot of people, always ends up on top. And as a Californian, I’m more than happy with my access to all the fine brews around here, but clearly, there are many more states to visit. RV trip anyone? [via @kennethfield]

  • Because every day is a good day to listen to Hans Rosling talk numbers. In this short video, Rosling uses Lego bricks to explain population growth and the gaps in wealth and carbon footprint.

  • When you talk to different people across the United States, you notice small differences in how people pronounce words and phrases. Sometimes different terms are used to describe the same thing. Bert Vaux’s dialect survey tried to capture these differences, and NC State statistics graduate student Joshua Katz mapped the data.