• FlowingData turned six years old last week. I didn’t realize it until afterward though, as I flipped through my calendar. I missed its birthday last year too, and to my surprise, the last time I remembered was its third year. I suddenly feel like a parent who’s forgotten his child’s birthday. I don’t feel that bad though, because well, it’s a blog, not a human being. If anything it’s an extension of me, and I lost track of my age a couple of years ago.

    Still though, six years is a long time on the internet.

    FlowingData started as a personal site to document projects related to my early-stage research and then grew into something more. Somewhere along the way, more than 3,000 posts and a couple of books later, it became my full-time job, which is pretty cool. The internet is awesome.

    Thanks to all the sponsors over the years who helped me pay for the ever-increasing hosting bills, and a big thank you to everyone who reads and shares. And of course, thank you to those who bought books and became members. Your support is huge.

    This year on FlowingData will be different from all other years of its existence, because it will be the first year that I don’t have to work on my dissertation. With these newfound hours in the day, in addition to more tutorials for members, I hope to spend more time analyzing interesting datasets and improving my visualization skills, especially of the interactive variety. Hopefully that translates to more interesting stuff here on FlowingData.

    Six good years. The best years are ahead.

  • Shan Carter and Kevin Quealy for The New York Times updated their housing prices graphic from a couple of years ago.

  • Illustrator Ron Miller imagined what Earth’s skies would look like if we had Saturn’s rings.

    Now, Miller brings his visualizations back to Earth for a series exploring what our skies would look like with Saturn’s majestic rings. Miller strived to make the images scientifically accurate, adding nice touches like orange-pink shadows resulting from sunlight passing through the Earth’s atmosphere. He also shows the rings from a variety of latitudes and landscapes, from the U.S. Capitol building to Mayan ruins in Guatemala.

    Miller has a large portfolio of space-related illustrations also worth a look. [via @golan]

  • There’s a fun CrossValidated thread on statistics jokes. Here’s the one with the top votes:

    A statistician’s wife had twins. He was delighted. He rang the minister who was also delighted. “Bring them to church on Sunday and we’ll baptize them,” said the minister. “No,” replied the statistician. “Baptize one. We’ll keep the other as a control.”

    This line by George Burns is my favorite though:

    If you live to be one hundred, you’ve got it made. Very few people die past that age.

    Any other good ones?

  • I’ve been poking around grocery store locations, courtesy of AggData, the past few…

  • The Boy Who Loved Math: The Improbable Life of Paul Erdős, written by Deborah Heiligman and illustrated by LeUyen Pham, is a kids’ book on the life of the prolific mathematician and a boy’s love of numbers.

    Most people think of mathematicians as solitary, working away in isolation. And, it’s true, many of them do. But Paul Erdos never followed the usual path. At the age of four, he could ask you when you were born and then calculate the number of seconds you had been alive in his head. But he didn’t learn to butter his own bread until he turned twenty. Instead, he traveled around the world, from one mathematician to the next, collaborating on an astonishing number of publications. With a simple, lyrical text and richly layered illustrations, this is a beautiful introduction to the world of math and a fascinating look at the unique character traits that made “Uncle Paul” a great man.

    Heck yeah. [via Boing Boing]

  • We go places. They have names. What do these names mean though? The Atlas of True Names by cartographers Stephan Hormes and Silke Peust can help you with that, replacing place names with the meaning of place names. California becomes the Land of the Successors, Texas is the Land of Friends, but forget all that. Who’s up for a visit to Illinois, the Land of Those Who Speak Normally?

    See more detail for the United States here. There are also versions for the British Isles, Europe, and the world, all available for purchase to adorn your walls. [via Slate]

  • June 24, 2013

    Alexey Papulovskiy collected flight data from Plane Finder for a month, which essentially gives you a bunch of points in space over time. Then he mapped the data in Contrailz.

    Turns out, besides Flight Levels (FL) (which are indicated on my map by dots’ color: red ones stand for lower altitudes and blue — for higher) planes have pretty specific “roads” and “highways” as well as “intersections” and “junctions”. You can see this for yourself by taking a look at the Russian part of the map: it’s less “crowded”, so the picture is as clear as it gets. The sky above Moscow area looks particularly interesting: civil flights are allowed there only since March 2013 and only with an altitude of 27.000 ft or higher.

    Aaron Koblin’s Flight Patterns always comes to mind immediately when I see flight data, and Contrailz of course looks similar, but the latter brings in European flight patterns, too, which makes it worth a gander.

    By the way, you should also check out Plane Finder if you haven’t seen that yet. It shows planes currently in flight, and there are a lot of them. [Thanks, Alexey]

  • In a different take on the income inequality issue, the Economic Policy Institute, in collaboration with Periscopic, created Inequality Is.

    The Inequality.is website brings clarity to the national dialogue on wage and income inequality, using interactive tools and videos to tell the story of how we arrived at the state of inequality we find today and what can be done to reverse course and ensure workers get their fair share.

    Inequality is: real, personal, expensive, created, and fixable. These are the categories the interactive takes you through to explain the subject. The first part reminds you of the video we saw on wealth distribution, which showed what people thought was an ideal distribution of wealth, what they thought it was in real life, and then what it actually was. However, in this interactive, you’re the one answering, which sort of sets the stage for the rest of the interactive. The goal is to make the data more relatable.

    Be sure to go through the whole piece. It rounds off nicely with a video explanation with public policy professor Robert Reich and ways to shift the inequality in the other direction.

  • Using data from Beer Advocate, in the form of 1.5 million reviews, yhat shows how to build a recommendation system in R.

    The goal for our system will be for a user to provide us with a beer that they know and love, and for us to recommend a new beer which they might like. To accomplish this, we’re going to use collaborative filtering. We’re going to compare 2 beers by ratings submitted by their common reviewers. Then, when one user writes similar reviews for two beers, we’ll then consider those two beers to be more similar to one another.

    The simple recommender is at the end of the article. Select a beer you like, a type of beer you want to try, and you get a handful of beers you might like.

    Obviously, the method isn’t exclusive to beer reviews, and this is just a start to a more advanced system that you can tailor to your own data. The good news is that the code to scrape data and recommend things is there at your disposal. [via @drewconway]
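
    For a rough feel of the idea, here is a minimal Python sketch of item-based collaborative filtering. It is not yhat’s R code. The reviewer names, beers, and scores are made-up toy data, and using Pearson correlation as the similarity measure is my assumption about one reasonable choice.

```python
from math import sqrt

# Hypothetical toy data: reviewer -> {beer: overall score}. Not from Beer Advocate.
reviews = {
    "ann": {"Pale Ale": 4.0, "Stout": 2.0, "IPA": 4.5},
    "bob": {"Pale Ale": 3.0, "Stout": 4.5, "IPA": 3.5},
    "cat": {"Pale Ale": 5.0, "Stout": 1.5, "IPA": 5.0},
    "dan": {"Stout": 4.0, "Porter": 4.5},
    "eve": {"Stout": 2.0, "Porter": 2.5, "IPA": 4.0},
}


def common_reviewers(beer_a, beer_b):
    """Reviewers who rated both beers."""
    return [r for r, scores in reviews.items() if beer_a in scores and beer_b in scores]


def similarity(beer_a, beer_b):
    """Pearson correlation of the two beers' scores over their common reviewers."""
    shared = common_reviewers(beer_a, beer_b)
    if len(shared) < 2:
        return 0.0  # not enough overlap to say anything
    xs = [reviews[r][beer_a] for r in shared]
    ys = [reviews[r][beer_b] for r in shared]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant scores carry no signal
    return cov / sqrt(var_x * var_y)


def recommend(liked, top_n=3):
    """Rank every other beer by its similarity to the one the user likes."""
    beers = {b for scores in reviews.values() for b in scores}
    ranked = sorted(((similarity(liked, b), b) for b in beers if b != liked), reverse=True)
    return [(beer, round(score, 2)) for score, beer in ranked[:top_n]]


if __name__ == "__main__":
    print(recommend("Pale Ale"))
```

    In this toy set, the people who rate the pale ale highly also rate the IPA highly and the stout poorly, so the IPA comes out on top for a pale ale fan.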

  • MapBox, along with Gnip and Eric Fischer, mapped 3 billion tweets and a handful of variables.

    This is a look at 3 billion tweets — every geotagged tweet since September 2011, mapped, showing facets of Twitter’s ecosystem and userbase in incredible new detail, revealing demographic, cultural, and social patterns down to city level detail, across the entire world. We were brought in by the data team at Gnip, who have awesome APIs and raw access to the Twitter firehose, and together Tom and data artist Eric Fischer used our open source tools to visualize the data and build interfaces that let you explore the stories of space, language, and access to technology.

    You’ll probably recognize some of the maps, as they build on Fischer’s previous projects, such as languages of Twitter and locals versus tourists. The originals were static images though. The interaction provides an exploratory view that lets you poke around the areas you’re interested in, and maybe best of all, it was built with open source software.

  • NOAA visualized global vegetation over a year, and the result is beautiful:

    We’ve seen forestry maps before, some quite detailed, but this is the first time I’ve seen vegetation at this granularity over a period of time.

    Although 75% of the planet is a relatively unchanging ocean of blue, the remaining 25% of Earth’s surface is a dynamic green. Data from the VIIRS sensor aboard the NASA/NOAA Suomi NPP satellite is able to detect these subtle differences in greenness. The resources on this page highlight our ever-changing planet, using highly detailed vegetation index data from the satellite, developed by scientists at NOAA. The darkest green areas are the lushest in vegetation, while the pale colors are sparse in vegetation cover either due to snow, drought, rock, or urban areas. Satellite data from April 2012 to April 2013 was used to generate these animations and images.

    The changes are especially obvious as the season moves to summer, going from snow-covered to deep green.

  • Stuff happens, and people tweet about it. Something major happens, and a lot of people tweet about it. Master’s student Stanislav Nikolov and his adviser Devavrat Shah are working on ways to algorithmically detect the latter.

    People acting in social networks are reasonably predictable. If many of your friends talk about something, it’s likely that you will as well. If many of your friends are friends with person X, it is likely that you are friends with them too. Because the underlying system has, in this sense, low complexity, we should expect that the measurements from that system are also of low complexity. As a result, there should only be a few types of patterns that precede a topic becoming trending. One type of pattern could be “gradual rise”; another could be “small jump, then a big jump”; yet another could be “a jump, then a gradual rise”, and so on. But you’ll never get a sawtooth pattern, a pattern with downward jumps, or any other crazy pattern.

    And with that, the algorithm compares current patterns to the ones above. If they look like a trending pattern, the algorithm marks something as a trend with some probability. In testing with past trending topics, the algorithm was able to pick correctly over 90 percent of the time.

    The best part is that this method can be applied to other time series data. “We can try this on traffic data to predict the duration of a bus ride, on movie ticket sales, on stock prices, or any other time-varying measurements.”
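
    Here is one way the comparison step could look in Python. This is a sketch of the general idea, not the researchers’ implementation. The reference series, the squared-distance measure, and the `gamma` weighting parameter are all assumptions made for illustration.

```python
from math import exp

# Hypothetical reference patterns, each scaled to the range 0..1.
TRENDING = [
    [0.0, 0.1, 0.2, 0.5, 1.0],  # small jump, then a big jump
    [0.0, 0.2, 0.3, 0.4, 1.0],  # gradual rise, then a jump
    [0.1, 0.3, 0.5, 0.7, 1.0],  # gradual rise
]
QUIET = [
    [0.5, 0.4, 0.5, 0.6, 0.5],
    [0.3, 0.3, 0.2, 0.3, 0.3],
]


def distance(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def trend_probability(observed, trending=TRENDING, quiet=QUIET, gamma=5.0):
    """Each reference pattern votes, with closer patterns voting more strongly."""
    vote_trend = sum(exp(-gamma * distance(observed, ref)) for ref in trending)
    vote_quiet = sum(exp(-gamma * distance(observed, ref)) for ref in quiet)
    total = vote_trend + vote_quiet
    if total == 0:
        return 0.5  # nothing resembles the observation, so stay undecided
    return vote_trend / total


if __name__ == "__main__":
    print(trend_probability([0.0, 0.1, 0.3, 0.6, 0.9]))  # rising activity
    print(trend_probability([0.4, 0.4, 0.4, 0.4, 0.4]))  # flat activity
```

    The rising series sits close to the trending examples and gets a probability near 1, while the flat series gets a probability near 0. Swap in bus ride durations or ticket sales and the same comparison applies.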

  • When you go to a conference, there are typically several talks going on at the same time, and you can always tell there’s a popular paper coming up when you see people leave a bunch of rooms at once and head straight into one. There’s also the unfortunate case when someone speaks, and there’s only a handful of people in the room, all in the back staring at their laptops. Open Data City visualized this activity during the German internet conference re:publica.

    Open Data City used MAC addresses and access point connections to keep track of where devices went. A person might be connected to the nearest access point in one room, disconnect on leaving, and then reconnect on entering another room, which provides the flow.

    It’s fun to watch the conference play out even if you didn’t attend. Each dot represents an attendee, and as the animation plays the dots migrate from room to room. Click and drag over the dots to select specific people. [Thanks, Michael]
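
    The bookkeeping behind that kind of flow is simple enough to sketch in Python. The log format, device IDs, and room names below are hypothetical, and I’m assuming each access point maps to exactly one room. This is not Open Data City’s code.

```python
from collections import Counter

# Hypothetical connection log: (seconds since start, device id, room of the access point).
LOG = [
    (900, "dev-1", "Stage 1"),
    (900, "dev-2", "Stage 1"),
    (905, "dev-3", "Stage 2"),
    (960, "dev-1", "Stage 2"),
    (961, "dev-2", "Stage 2"),
    (1020, "dev-3", "Stage 1"),
    (1025, "dev-1", "Stage 2"),  # reconnect in the same room, so no movement
]


def room_transitions(events):
    """Count how many devices moved from one room to another."""
    last_room = {}
    flows = Counter()
    for _, device, room in sorted(events, key=lambda e: e[0]):
        previous = last_room.get(device)
        if previous is not None and previous != room:
            flows[(previous, room)] += 1
        last_room[device] = room
    return flows


if __name__ == "__main__":
    for (source, target), count in room_transitions(LOG).items():
        print(f"{source} -> {target}: {count}")
```

    Each counted transition is one dot migrating between rooms in the animation.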

  • As data grows cheaper and more easily accessible, the people who analyze it aren’t always statisticians. They’re likely to not even have had any statistical training. Biostatistics professor Jeff Leek says we need to adapt to this broader audience.

    What does this mean for statistics as a discipline? Well it is great news in that we have a lot more people to train. It also really drives home the importance of statistical literacy. But it also means we need to adapt our thinking about what it means to teach and perform statistics. We need to focus increasingly on interpretation and critique and away from formulas and memorization (think English composition versus grammar). We also need to realize that the most impactful statistical methods will not be used by statisticians, which means we need more fool proofing, more time automating, and more time creating software. The potential payout is huge for realizing that the tide has turned and most people who analyze data aren’t statisticians.

    Yep.

    Those who disagree tend to worry about what might happen, and what kind of data-based decisions will be made, when non-statisticians do the analysis. Addressing that worry should definitely be a priority as we move forward. Non-statisticians often make incorrect assumptions about the data, forget about uncertainty, and don’t know much about collection methodologies.

    However, as a statistician (or someone who knows statistics), you can shoo everyone else away from the data and gripe when they come back, or you can help them get things right.

  • Curious about how people use “geek” and “nerd” to describe themselves and whether there is any difference between the two terms, Burr Settles analyzed words used in tweets that contained the two. Settles used pointwise mutual information (PMI), which essentially provides a measure of the geekness or nerdiness of a term. The plot above shows the results.

    In broad strokes, it seems to me that geeky words are more about stuff (e.g., “#stuff”), while nerdy words are more about ideas (e.g., “hypothesis”). Geeks are fans, and fans collect stuff; nerds are practitioners, and practitioners play with ideas. Of course, geeks can collect ideas and nerds play with stuff, too. Plus, they aren’t two distinct personalities as much as different aspects of personality. Generally, the data seem to affirm my thinking.

    Or maybe pop culture (geek) versus education (nerd).
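
    If PMI is new to you, it compares how often a word and a label show up together against how often you’d expect them to if they were independent: PMI(w, c) = log2( P(w, c) / (P(w) P(c)) ). A small Python sketch with made-up counts (not Settles’s data) shows the mechanics.

```python
from math import log2

# Hypothetical word counts in tweets containing each label.
COUNTS = {
    "geek": {"#stuff": 30, "hypothesis": 10, "coffee": 40},
    "nerd": {"#stuff": 10, "hypothesis": 30, "coffee": 40},
}


def pmi(word, label, counts=COUNTS):
    """Pointwise mutual information between a word and a label, in bits."""
    total = sum(sum(words.values()) for words in counts.values())
    p_joint = counts[label].get(word, 0) / total
    if p_joint == 0:
        return float("-inf")  # the pair never co-occurs
    p_word = sum(words.get(word, 0) for words in counts.values()) / total
    p_label = sum(counts[label].values()) / total
    return log2(p_joint / (p_word * p_label))


if __name__ == "__main__":
    for word in ("#stuff", "hypothesis", "coffee"):
        print(word, round(pmi(word, "geek"), 3), round(pmi(word, "nerd"), 3))
```

    A positive score means the word leans toward that label, a negative score means it leans away, and a word used equally by both, like “coffee” here, scores zero.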

  • It’s just metadata. What can you do with that? Kieran Healy, a sociology professor at Duke University, shows what you can do, with just some basic social network analysis. Using metadata from Paul Revere’s Ride on the groups that people belonged to, Healy sniffs out Paul Revere as a main target. Bonus points for writing the summary from the point of view of an 18th-century analyst.

    What a nice picture! The analytical engine has arranged everyone neatly, picking out clusters of individuals and also showing both peripheral individuals and—more intriguingly—people who seem to bridge various groups in ways that might perhaps be relevant to national security. Look at that person right in the middle there. Zoom in if you wish. He seems to bridge several groups in an unusual (though perhaps not unique) way. His name is Paul Revere.

    You can grab the R code and dataset on GitHub, too, if you want to follow along.
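
    The core trick is turning a table of who belongs to which group into a network of who shares a group with whom. Here is a minimal Python sketch of that step with a simple degree count on top. The people and clubs are hypothetical placeholders rather than the historical dataset, and degree is only one of several centrality measures you could use.

```python
# Hypothetical membership table: person -> set of groups. Not the historical data.
MEMBERSHIP = {
    "person_a": {"club_1"},
    "person_b": {"club_1", "club_2"},
    "bridge": {"club_1", "club_2", "club_3"},
    "person_c": {"club_3"},
    "person_d": {"club_2"},
}


def co_membership(membership):
    """Number of groups each ordered pair of distinct people has in common."""
    people = sorted(membership)
    return {
        (p, q): len(membership[p] & membership[q])
        for p in people
        for q in people
        if p != q
    }


def degree(membership):
    """How many other people each person shares at least one group with."""
    links = co_membership(membership)
    return {
        person: sum(1 for (p, _), shared in links.items() if p == person and shared > 0)
        for person in membership
    }


if __name__ == "__main__":
    scores = degree(MEMBERSHIP)
    for person in sorted(scores, key=scores.get, reverse=True):
        print(person, scores[person])
```

    Nobody’s conversations were read here. Group rosters alone are enough to make the person who bridges the clubs stand out.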