  • Your income versus what it feels like

    May 22, 2014  |  Statistical Visualization

    Income and cost of living

    Incomes and the cost of living vary across the country. Some areas might have high median income, but the cost of living is also high. Similarly, areas might have low median income, but the cost of living is relatively low. So what happens when you take the income from the former and then move to the latter? The Bureau of Economic Analysis released estimates that help make that comparison.

    Quoctrung Bui for NPR made that data more accessible with a slope graph. On the left is median income, and on the right is what it feels like. Enter your metro area to focus on your point of interest.
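
    As a rough sketch of the arithmetic behind that kind of comparison, assuming the estimates come as a price index where 100 is the national average (the function name and all numbers below are made up for illustration, not taken from the BEA release or the NPR graphic):

```python
def feels_like_income(median_income, price_parity):
    """Adjust an income by a regional price index where 100 = national average.

    Assumption: price_parity is an index in the style of a regional
    price parity, so 120 means prices run 20 percent above average.
    """
    if price_parity <= 0:
        raise ValueError("price_parity must be positive")
    return median_income * 100.0 / price_parity


# Hypothetical metro areas: (median income, price index). Not real data.
metros = {
    "Expensive Metro": (60000, 120.0),
    "Cheap Metro": (45000, 90.0),
}

for name, (income, parity) in metros.items():
    adjusted = feels_like_income(income, parity)
    print(f"{name}: earns {income}, feels like {adjusted:.0f}")
```

    In this made-up case, the two areas start $15,000 apart on paper and end up feeling the same once prices are factored in.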

  • Machine learning a cappella on overfitting

    May 21, 2014  |  Statistics

    From the machine learning course on Udacity, an a cappella group sings a Thriller parody on overfitting. At first you're like, "Is this real? Am I dreaming?" Then you're like, "Oh my god, he has a gold glove on." And then you're like, "Yes! This is real! Oh internets, I adore you so."

  • Military infographic fascination

    May 21, 2014  |  Infographics

    Military concepts

    Paul Ford describes his fascination with military infographics. Here's what he has to say about the graphic above:

    Take some time with that graphic. After a while you realize that this image could be used anywhere in any paper or presentation and make perfect sense. This is a graphic that defines a way of describing anything that has ever existed and everything that has ever happened, in any situation. The United States Military is operating at a conceptual level beyond every other school of thought except perhaps academic philosophy, because it has a much larger budget.

    Never mind the aesthetics and readability. It's the content and the scale at which these graphics are presented that make them fascinating. Okay, and maybe the aesthetics and readability add to the entertainment value, too.

  • Beaker allows data exploration in various languages

    May 20, 2014  |  Software

    Beaker Notebook

    Currently in beta, Beaker lets you work and experiment with data with different languages, but in one environment.

    Beaker is a code notebook that allows you to analyze, visualize, and document data using multiple programming languages including Python, R, Groovy, Julia, and Node. Beaker's plugin-based polyglot architecture enables you to seamlessly switch between languages and add support for new languages.

    Sounds like a good place to tuck away your snippets or development in the early stages of larger projects.

  • The United States of Metrics isn’t such a bad thing

    May 19, 2014  |  Self-surveillance

    Bruce Feiler for The New York Times describes his concern and distaste for data collection and analysis.

    In the last few years, there has been a revolution so profound that it's sometimes hard to miss its significance. We are awash in numbers. Data is everywhere. Old-fashioned things like words are in retreat; numbers are on the rise. Unquantifiable arenas like history, literature, religion and the arts are receding from public life, replaced by technology, statistics, science and math. Even the most elemental form of communication, the story, is being pushed aside by the list.

    The results are in: The nerds have won. Time to replace those arrows in the talons of the American eagle with pencils and slide rules. We've become the United States of Metrics.

    That's how the full article reads. Grouchy.

    Feiler jumps into a handful of examples that could've easily been used as positives, had they been in an article about the boom of data. For instance, he scoffs at a project from New York University and Hudson Yards that aims for a "smart community" that tracks pedestrian traffic, air quality, and energy consumption. Is it better to not know these things? Should we rely entirely on word of mouth for every problem in a city that can easily be fixed? That's a tough sell.

    He does suggest that we need balance between data-informed and data-only decisions, and yes to this absolutely, but he also suggests that we've already reached a maximum for the amount of data we want in our lives.

    The underlying premise is that if we observe, journal, and experiment with our lives in data, we take away from the joy of living. Sports are less fun to watch and food doesn't taste as good. That's another tough sell.

    Here's how I see it: I strongly believe in going with your gut instincts. It's led me in the right direction more often than not. But sometimes I move in the wrong direction, or I don't know enough about a subject and all I have is uncertainty. If there's data to help, then all the better.

  • Drought map shows extreme shortages

    May 19, 2014  |  Mapping

    Drought map

    From the U.S. National Drought Monitor.

    The entire state of California is in some level of drought, much of it extreme to exceptional. Snowpack is 50 percent of normal in many locations in the West, and Svoboda noted that a lot of snow has completely melted before it normally would.

    Drought has had a serious impact on fruit and vegetable agriculture in California, and news reports sounded the alarm for grains and livestock in the Plains and South Central West. At least 54 percent of the nation’s wheat crop is affected by some level of drought, as is 30 percent of corn, and 48 percent of cattle.

    Hey, Californians, if you could dial your sprinklers down a couple notches so that we can bathe this summer, that'd be great. Thanks.

  • A majority of your email in Gmail, even if you don’t use it

    May 16, 2014  |  Statistics

    Gmail over time

    For reasons of autonomy, control, and privacy, Benjamin Mako Hill runs his own email server. After a closer look though, he realized that much of the email he sends ends up in Gmail anyway.

    Despite the fact that I spend hundreds of dollars a year and hours of work to host my own email server, Google has about half of my personal email! Last year, Google delivered 57% of the emails in my inbox that I replied to. They have delivered more than a third of all the email I've replied to every year since 2006 and more than half since 2010. On the upside, there is some indication that the proportion is going down. So far this year, only 51% of the emails I've replied to arrived from Google.

    Factor in other services such as Yahoo and Hotmail, and I imagine that majority percentage goes up quite a bit. If you want to look at your own inbox Gmail count, Hill posted the scripts for your perusal.

    This tutorial on downloading email metadata might be helpful too, if you're looking for a more general script.

  • Newborn false positives

    May 15, 2014  |  Mistaken Data

    Shutterfly sent promotional emails that congratulate new parents and encourage them to send thank you cards. The problem: a lot of people on that list weren't new parents.

    Several tipsters forwarded us the email that Shutterfly sent out in the wee small hours of this morning. One characterized the email as "data science gone wrong." Another says that she had actually been pregnant and would have been due this month, but miscarried six months ago. Is it possible that Shutterfly analyzed her search data and just happened to conclude, based on that, that she would be welcoming a child around this time? Or is it, as she hoped via email, "just a horrible coincidence?"

    Only Shutterfly knows what actually happened (they insist it was a random mistake), but it sounds like a naive use of data somewhere in the pipeline. Maybe someone remembered the Target story, got excited, and forgot about the repercussions of false positives. Or maybe someone made an incorrect assumption about data points with certain purchases and didn't test thoroughly enough.

    In any case, this slide suddenly takes on new meaning.

  • Alcohol consumption per drinker

    May 15, 2014  |  Statistical Visualization

    We've seen rankings for alcohol consumption per capita around the world. These tend to highlight where people drink and abstain, but what about consumption among only those who drink? The Economist looked at this sub-population. Towards the top, you see countries where much of the population abstains but those who do drink appear to drink at higher volumes.

    Drinking among drinkers

    Of course, it's better to take this with a grain of salt until you see the standard errors on these estimates.

  • Share your traces with a stranger

    May 14, 2014  |  Self-surveillance

    The MIT Media Lab Playful Systems group is working on an experiment in data sharing, on a personal level. It's called 20 Day Stranger. You install an app on your phone that tracks your location and what you're doing, and that information is anonymously shared with a stranger. You also see what that stranger is doing.

    I can't decide if this is creepy or touching, or somewhere in between. I put myself on the waiting list to find out, but I imagine the experience has a little bit to do with the app and much more to do with the stranger on the other side.

  • Job Board, May 2014

    May 14, 2014  |  Job Board

    Looking for a job in data science, visualization, or statistics? There are openings on the board.

    Senior UX Designer, Data Visualization for Integral Ad Science in New York, New York.

    Data Visualization Front-End Developer for the Mintz Group in New York, New York.

    Data Visualizer for Datalabs Agency in Melbourne, Australia.

  • Responsive data tables

    May 13, 2014  |  Coding

    responsive table

    Alyson Hurt for NPR Visuals describes how they make responsive data tables for their articles. That is, a table might look fine on a desktop but then it might be illegible on a mobile device. This is a start in making tables that work in more places.

  • NBA basketball fans by ZIP code

    May 13, 2014  |  Mapping

    NBA fan map from NYT

    After the popularity of The Upshot's baseball fandom map, it's no surprise the same group followed up with an NBA map of the same ilk. Same Facebook Like data, but for basketball. And as before, although the national map is fun, the regional breakdowns are the best part.

  • Random things that correlate

    May 12, 2014  |  Statistics

    Divorce rate in Maine vs margarine

    This is fun. Tyler Vigen wrote a program that attempts to automatically find things that correlate. When I started writing this, it had found 4,000 correlations (and over 100 more by the time I finished). Some of the gems include: the divorce rate in Maine versus per capita consumption of margarine, marriage rate in Alabama versus whole milk consumption per capita, and honey produced in bee colonies versus labor political action committees. Many things correlate with cheese consumption.
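
    The general idea can be sketched in a few lines. This is not Vigen's actual program, just a minimal brute-force version that assumes every series covers the same years. The series names and values are made up, and the 0.9 threshold is an arbitrary choice:

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series; report 0
    return cov / sqrt(var_x * var_y)


def find_correlated_pairs(series, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    hits = []
    for (name_a, a), (name_b, b) in combinations(series.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            hits.append((name_a, name_b, r))
    # Strongest correlations first
    return sorted(hits, key=lambda hit: -abs(hit[2]))


# Made-up yearly series, for illustration only
series = {
    "series_a": [5.0, 4.7, 4.6, 4.4, 4.3, 4.1],
    "series_b": [8.2, 7.0, 6.5, 5.3, 5.2, 4.0],
    "series_c": [1.0, 3.0, 2.0, 5.0, 1.5, 2.5],
}

for name_a, name_b, r in find_correlated_pairs(series):
    print(f"{name_a} vs {name_b}: r = {r:.2f}")
```

    With n series there are n(n-1)/2 pairs to check, so the more data you throw in, the more high correlations turn up by chance alone.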

  • Type I and II errors simplified

    May 9, 2014  |  Statistics

    Type I and II errors

    "Type I" and "Type II" errors, names first given by Jerzy Neyman and Egon Pearson to describe rejecting a null hypothesis when it's true and accepting one when it's not, are too vague for stat newcomers (and in general). This is better. [via]
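
    Another way to keep the two straight is to spell out the four possible outcomes of a test. A minimal sketch (the function name and labels are illustrative shorthand, not from the linked graphic):

```python
def classify_test_outcome(null_is_true, rejected_null):
    """Label the outcome of a hypothesis test given the (usually unknown) truth."""
    if null_is_true and rejected_null:
        return "Type I error (false positive)"
    if not null_is_true and not rejected_null:
        return "Type II error (false negative)"
    return "correct decision"


# Walk through all four combinations of truth and decision
for null_is_true in (True, False):
    for rejected_null in (True, False):
        outcome = classify_test_outcome(null_is_true, rejected_null)
        print(f"null true: {null_is_true}, rejected: {rejected_null} -> {outcome}")
```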

  • Name popularity by state, animated by year

    May 9, 2014  |  Mapping

    Using baby name data from the Social Security Administration, Brian Rowe made this straightforward interactive that lets you search a name to see how its regional popularity changed over time.

    Name by state

  • Optimizing your R code

    May 9, 2014  |  Coding

    Hadley Wickham offers a detailed, practical guide to finding and removing the major bottlenecks in your R code.

    It's easy to get caught up in trying to remove all bottlenecks. Don't! Your time is valuable and is better spent analysing your data, not eliminating possible inefficiencies in your code. Be pragmatic: don't spend hours of your time to save seconds of computer time. To enforce this advice, you should set a goal time for your code and only optimise only up to that goal. This means you will not eliminate all bottlenecks. Some you will not get to because you've met your goal. Others you may need to pass over and accept either because there is no quick and easy solution or because the code is already well-optimized and no significant improvement is possible. Accept these possibilities and move on to the next candidate.

    This is how I approach it. Some people spend a lot of time optimizing, but I'm usually better off writing code without speed in mind initially. Then I deal with it if it's actually a problem. I can't remember the last time that happened though. Obviously, this approach won't work in all settings. So just use common sense. If it takes you longer to optimize than it does to run your "slow" code, you've got your answer.

  • Naked Statistics

    May 8, 2014  |  Statistics

    Naked Statistics by Charles Wheelan promises a fun, non-boring introduction to statistics that doesn't leave you drifting off into space, thinking about anything that is not statistics. From the book description:

    For those who slept through Stats 101, this book is a lifesaver. Wheelan strips away the arcane and technical details and focuses on the underlying intuition that drives statistical analysis. He clarifies key concepts such as inference, correlation, and regression analysis, reveals how biased or careless parties can manipulate or misrepresent data, and shows us how brilliant and creative researchers are exploiting the valuable data from natural experiments to tackle thorny questions.

    The first statistics course I took (not counting the dreadful high school stat class taught by the water polo coach) actually drew me in from the start. Plus, I needed to finish my dissertation, so I didn't pick it up when it came out last year.

    I saw it in the library the other day though, so I checked it out. If anything, I could use a few more anecdotes to better describe statistics to people before they tell me how much they hated it.

    Naked Statistics is pretty much what the description says. It's like your stat introduction course with much less math, which is good for those who are interested in poking at data but, well, slept through Stats 101 and have an irrational fear of numbers. You get important concepts and plenty of reasons why they're worth knowing. Most importantly, it gives you a statistical way to think about data, flaws and all. Wheelan also has a fun writing style that makes this an entertaining read.

    For those who are familiar with inference, correlation, and regression, the book will be too basic, and the anecdotes alone aren't enough to make it worth reading. However, if you have less than a bachelor's degree (or equivalent) in statistics and want to know more about analyzing data, this book should be right up your alley.

    Keep in mind though that this only gets you part of the way to understanding your data. Naked Statistics covers beginning concepts. Putting statistics into practice is the next step.

    Personally, I skimmed through a good portion of the book, as I'm familiar with the material. I did however read a chapter out loud while taking care of my son. He might not be able to crawl yet, but I'm hoping to ooze some knowledge in through osmosis.

  • Downloading Your Email Metadata

    May 7, 2014  |  Tutorials

    Downloading Email Metadata

    We pay a lot of attention to how we interact with social networks, because so many people use Twitter, Facebook, etc. every day. It's fun for developers to play with this stuff. However, if you want to look at a history of your own interactions, there isn't a much better place to look (digitally) than your own email inbox.

    Before you can explore though, you have to download the data. That's what you'll learn here, or more specifically, how to download your email metadata as a ready-to-use, tab-delimited file.
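
    To give a sense of the target format, here is a minimal sketch that turns raw message text into tab-delimited rows using Python's standard library. It is not the tutorial's script. It assumes you already have the raw messages in hand (fetching them from a mail server is not shown), and the choice of fields and the sample message are made up for illustration:

```python
import csv
import io
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Assumed set of columns; pick whichever headers you care about
FIELDS = ["date", "from", "to", "subject"]


def message_to_row(raw_message):
    """Pull a few header fields out of one raw message string."""
    msg = message_from_string(raw_message)
    date_header = msg.get("Date")
    date = parsedate_to_datetime(date_header).isoformat() if date_header else ""
    return {
        "date": date,
        "from": parseaddr(msg.get("From", ""))[1],  # address only, no display name
        "to": parseaddr(msg.get("To", ""))[1],
        "subject": msg.get("Subject", ""),
    }


def write_metadata_tsv(raw_messages, out):
    """Write one tab-delimited row of metadata per message to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for raw in raw_messages:
        writer.writerow(message_to_row(raw))


# Made-up sample message
sample = (
    "Date: Wed, 07 May 2014 09:30:00 -0700\n"
    "From: Alice Example <alice@example.com>\n"
    "To: bob@example.com\n"
    "Subject: Lunch?\n"
    "\n"
    "Body text is ignored.\n"
)

buffer = io.StringIO()
write_metadata_tsv([sample], buffer)
print(buffer.getvalue())
```

    Only headers are read here, so message bodies never end up in the output file.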

  • Crystal clusters of world data

    May 7, 2014  |  Data Art

    Artist Scott Kildall generates what he calls World Data Crystals by mapping data on a globe with cubes and clustering them algorithmically. He then produces the result in physical form for something like the piece below, which represents world population.

    World population crystal

Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.