• GDP rises in the UK after spending on illegal activities counted

    June 5, 2014  |  Statistics

    The gross domestic product for the United Kingdom rose by 5% seemingly overnight, after spending on cocaine and prostitution was (roughly) accounted for. Naturally there's been a bit of fuss over the new estimate. Tim Harford explains why the new count isn't such a travesty.

    We need to understand three things about gross domestic product statistics. First, GDP itself is ineffable — an attempt to synthesise, for practical purposes, something that defies description. Second, the national accounts are not designed to give a round of applause to the good stuff and a loud raspberry to the bad stuff. They are supposed to measure economic transactions. And, third, anyone who thinks politicians try to maximise GDP has not been paying much attention to politicians.

    After you read that, it's also worth listening to the Planet Money podcast on GDP from a few months back. Fuzzy estimate.

  • What pregnant women want

    May 26, 2014  |  Statistics

    What pregnant women search for

    In another take on the game of what Google suggests while searching, Seth Stephens-Davidowitz for The New York Times looked at queries related to pregnant women. Some searches were similar across countries, whereas others varied culturally.

    Start with questions about what pregnant women can do safely. The top questions in the United States: Can pregnant women "eat shrimp," "drink wine," "drink coffee" or "take Tylenol"?

    But other countries don't look much like the United States or one another. Whether pregnant women can "drink wine" is not among the top 10 questions in Canada, Australia or Britain. Australia's concerns are mostly related to eating dairy products while pregnant, particularly cream cheese. In Nigeria, where 30 percent of the population uses the Internet, the top question is whether pregnant women can drink cold water.

    Stephens-Davidowitz's analysis is mostly anecdotal but a fun read.

    I want to see something like this direct from Google, with more rigor. Now that would be interesting.

  • Strava Metro aims to help cities improve biking routes

    May 23, 2014  |  Data Sharing

    Strava Metro Melbourne

    Last month, Strava, which allows users to track their bike rides and runs, launched an interactive map that shows where people move worldwide. That seems to be a lead-in to their larger project Strava Metro. Here's the pitch:

    Strava Metro is a data service providing "ground truth" on where people ride and run. Millions of GPS-tracked activities are uploaded to Strava every week from around the globe. In denser metro areas, nearly one-half of these are commutes. These activities create billions of data points that, when aggregated, enable deep analysis and understanding of real-world cycling and pedestrian route preferences.

    Strava had a handful of clients before the official launch, such as the Oregon Department of Transportation. From Bike Portland:

    Last fall, the agency paid $20,000 for one-year license of a dataset that includes the activities of about 17,700 riders and 400,000 individual bicycle trips totaling 5 million BMT (bicycle miles traveled) logged on Strava in 2013. The Strava bike "traces" are mapped to OpenStreetMap.

    This is what I was getting at with those running maps, so it's great to see that Strava was already on it.

    It'll be interesting to see where this goes, not just business-wise, but with data sharing, privacy, and how users react to their (anonymized) data being sold.

  • Machine learning a cappella on overfitting

    May 21, 2014  |  Statistics

    From the machine learning course on Udacity, an a cappella group sings a Thriller parody on overfitting. At first you're like, "Is this real? Am I dreaming?" Then you're like, "Oh my god, he has a gold glove on." And then you're like, "Yes! This is real! Oh internets, I adore you so."

  • A majority of your email in Gmail, even if you don’t use it

    May 16, 2014  |  Statistics

    Gmail over time

    For reasons of autonomy, control, and privacy, Benjamin Mako Hill runs his own email server. After a closer look though, he realized that much of the email he sends ends up in Gmail anyway.

    Despite the fact that I spend hundreds of dollars a year and hours of work to host my own email server, Google has about half of my personal email! Last year, Google delivered 57% of the emails in my inbox that I replied to. They have delivered more than a third of all the email I've replied to every year since 2006 and more than half since 2010. On the upside, there is some indication that the proportion is going down. So far this year, only 51% of the emails I've replied to arrived from Google.

    Factor in other services such as Yahoo, Hotmail, and the rest, and I imagine that majority percentage goes up quite a bit. If you want to count Gmail's share of your own inbox, Hill posted his scripts for your perusal.

    This tutorial on downloading email metadata might be helpful too, if you're looking for a more general script.
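    As a rough sketch of the kind of count Hill's scripts perform, the snippet below tallies what share of messages in a local mbox file came from a Gmail address. It only looks at the sender's domain (the function name and domain list are my own; Hill's actual analysis is more thorough), so Google Apps mail on custom domains is missed.

```python
# Count what share of messages in a local mbox file were sent from Gmail.
# A rough sketch, not Hill's actual script: it checks only the sender's
# domain, while his analysis dug into delivery headers as well.
import mailbox
import re

def gmail_share(mbox_path):
    """Return (gmail_count, total) for messages in an mbox file."""
    gmail, total = 0, 0
    for message in mailbox.mbox(mbox_path):
        sender = message.get("From", "")
        match = re.search(r"@([\w.-]+)", sender)
        if not match:
            continue  # skip messages with no parseable sender address
        total += 1
        domain = match.group(1).lower()
        # gmail.com covers personal accounts; googlemail.com is the older
        # alias. Google-hosted custom domains would need header inspection,
        # which this sketch skips.
        if domain in ("gmail.com", "googlemail.com"):
            gmail += 1
    return gmail, total
```

    Point it at an exported mailbox (e.g. `gmail_share("inbox.mbox")`) to get the raw ratio.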

  • Newborn false positives

    May 15, 2014  |  Mistaken Data

    Shutterfly sent promotional emails congratulating new parents and encouraging them to send thank-you cards. The problem: a lot of people on that list weren't new parents.

    Several tipsters forwarded us the email that Shutterfly sent out in the wee small hours of this morning. One characterized the email as "data science gone wrong." Another says that she had actually been pregnant and would have been due this month, but miscarried six months ago. Is it possible that Shutterfly analyzed her search data and just happened to conclude, based on that, that she would be welcoming a child around this time? Or is it, as she hoped via email, "just a horrible coincidence?"

    Only Shutterfly knows what actually happened (they insist it was a random mistake), but it sounds like a naive use of data somewhere in the pipeline. Maybe someone remembered the Target story, got excited, and forgot about the repercussions of false positives. Or maybe someone made an incorrect assumption about data points with certain purchases and didn't test thoroughly enough.

    In any case, this slide suddenly takes on new meaning.

  • Random things that correlate

    May 12, 2014  |  Statistics

    Divorce rate in Maine vs margarine

    This is fun. Tyler Vigen wrote a program that attempts to automatically find things that correlate. As of this writing, it had found 4,000 correlations (and over 100 more by the time I finished this post). Some of the gems: the divorce rate in Maine versus per capita consumption of margarine, the marriage rate in Alabama versus whole milk consumption per capita, and honey produced in bee colonies versus labor political action committees. Many things correlate with cheese consumption.
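    The underlying mechanism is easy to sketch: generate a pile of unrelated trending series and brute-force every pair for a high correlation. This isn't Vigen's code, just a minimal illustration of why drifting time series produce spurious matches.

```python
# A minimal sketch of the spurious-correlation hunt: many unrelated random
# walks (which drift like real-world annual statistics) are compared
# pairwise, and high correlations pop out purely by chance.
import itertools
import random

def random_walk(n, rng):
    """An n-step random walk; trending series correlate spuriously often."""
    walk, level = [], 0.0
    for _ in range(n):
        level += rng.gauss(0, 1)
        walk.append(level)
    return walk

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spurious_pairs(num_series=50, years=12, threshold=0.9, seed=7):
    """Index pairs of unrelated series that correlate above threshold."""
    rng = random.Random(seed)
    series = [random_walk(years, rng) for _ in range(num_series)]
    return [(i, j)
            for i, j in itertools.combinations(range(num_series), 2)
            if abs(pearson(series[i], series[j])) > threshold]
```

    With 50 series there are 1,225 pairs to test, which is exactly how flukes get found.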

  • Type I and II errors simplified

    May 9, 2014  |  Statistics

    Type I and II errors

    "Type I" and "Type II" errors, names first given by Jerzy Neyman and Egon Pearson to describe rejecting a null hypothesis when it's true and accepting one when it's not, are too vague for stat newcomers (and in general). This is better. [via]
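    The distinction is easy to see in simulation. In this sketch the null hypothesis (a fair coin) is true by construction, so every rejection is a Type I error, and they happen about 5% of the time at the usual cutoff. The function name and the 100-flip setup are mine, not from the linked graphic.

```python
# Simulate Type I errors: the null hypothesis (a fair coin) is true,
# yet a 5%-level test still rejects it about 5% of the time by chance.
import random

def type_one_rate(trials=2000, seed=42):
    """Fraction of fair-coin experiments that a 5%-level test rejects.

    Each experiment flips a fair coin 100 times (sd of heads = 5) and
    rejects "the coin is fair" when heads strays from 50 by more than
    1.96 standard deviations. Since the null is true, every rejection
    here is a Type I error (a false positive)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(100))
        if abs(heads - 50) > 1.96 * 5:  # |z| > 1.96, i.e. p < 0.05
            rejections += 1
    return rejections / trials
```

    A Type II error is the mirror image: make the coin actually biased and count how often the test fails to reject.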

  • Naked Statistics

    May 8, 2014  |  Statistics

    Naked Statistics by Charles Wheelan promises a fun, non-boring introduction to statistics that doesn't leave you drifting off into space, thinking about anything that is not statistics. From the book description:

    For those who slept through Stats 101, this book is a lifesaver. Wheelan strips away the arcane and technical details and focuses on the underlying intuition that drives statistical analysis. He clarifies key concepts such as inference, correlation, and regression analysis, reveals how biased or careless parties can manipulate or misrepresent data, and shows us how brilliant and creative researchers are exploiting the valuable data from natural experiments to tackle thorny questions.

    Naked Statistics

    The first statistics course I took—not counting the dreadful high school stat class taught by the water polo coach—actually drew me in from the start, so I never felt like the book's target audience. Plus, I needed to finish my dissertation, so I didn't pick it up when it came out last year.

    I saw it in the library the other day though, so I checked it out. If anything, I could use a few more anecdotes to better describe statistics to people before they tell me how much they hated it.

    Naked Statistics is pretty much what the description says. It's like your introductory stat course with much less math, which is good for those interested in poking at data but who, well, slept through Stats 101 and have an irrational fear of numbers. You get important concepts and plenty of reasons why they're worth knowing. Most importantly, it gives you a statistical way to think about data, flaws and all. Wheelan also has a fun writing style that makes this an entertaining read.

    For those who are already familiar with inference, correlation, and regression, the book will be too basic: the anecdotes alone aren't enough to carry it. However, if you have less than a bachelor's degree (or equivalent) in statistics and want to know more about analyzing data, this book should be right up your alley.

    Keep in mind though that this only gets you part way to understanding your data. Naked Statistics is beginning concepts. Putting statistics into practice is the next step.

    Personally, I skimmed through a good portion of the book, as I'm familiar with the material. I did however read a chapter out loud while taking care of my son. He might not be able to crawl yet, but I'm hoping to ooze some knowledge in through osmosis.

  • Most underrated films

    May 6, 2014  |  Data Sources

    Rotten Tomatoes film ratings

    Ben Moore was curious about overrated and underrated films.

    "Overrated" and "underrated" are slippery terms to try to quantify. An interesting way of looking at this, I thought, would be to compare the reviews of film critics with those of Joe Public, reasoning that a film which is roundly-lauded by the Hollywood press but proved disappointing for the real audience would be "overrated" and vice versa.

    Through the Rotten Tomatoes API, he found data to make such a comparison. Then he plotted one against the other, along with a quick calculation of the differences between the percentage of official critics who liked and that of the Rotten Tomatoes audience. The most underrated: Facing the Giants, Diary of a Mad Black Woman, and Grandma's Boy. The most overrated: Spy Kids, 3 Backyards, and Stuart Little 2.

    The plot would be better without the rainbow color scheme and with a simple reference line along the even-rating diagonal. But it gets bonus points for sharing the code snippet to access the Rotten Tomatoes API in R, which you can generalize.
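    Moore's over/underrated measure amounts to a single subtraction per film. Here's a minimal sketch, with made-up scores standing in for the Rotten Tomatoes data (his snippet is in R; this is a Python translation of the idea, not his code):

```python
# Rank films by the gap between critic and audience scores.
# The scores below are illustrative placeholders, not real
# Rotten Tomatoes data.
def rating_gaps(films):
    """films: {title: (critics_pct, audience_pct)} -> sorted (gap, title).

    Positive gap = critics liked it more ("overrated" by this definition);
    negative gap = audiences liked it more ("underrated")."""
    gaps = [(critics - audience, title)
            for title, (critics, audience) in films.items()]
    return sorted(gaps, reverse=True)

sample = {  # hypothetical scores, for illustration only
    "Film A": (90, 45),
    "Film B": (30, 85),
    "Film C": (70, 72),
}
```

    `rating_gaps(sample)` puts Film A at the top (most overrated) and Film B at the bottom (most underrated).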

  • Hip hop vocabulary compared between artists

    May 5, 2014  |  Statistics

    hip hop vocab

    Matt Daniels compared rappers' vocabularies to find out who knows the most words.

    Literary elites love to rep Shakespeare's vocabulary: across his entire corpus, he uses 28,829 words, suggesting he knew over 100,000 words and arguably had the largest vocabulary, ever.

    I decided to compare this data point against the most famous artists in hip hop. I used each artist's first 35,000 lyrics. That way, prolific artists, such as Jay-Z, could be compared to newer artists, such as Drake.

    As two points of reference, Daniels also counted the number of unique words in the first 5,000 used words from seven of Shakespeare's works and the number of uniques from the first 35,000 words of Herman Melville's Moby-Dick.

    I'm not sure how much stock I would put into these literary comparisons though, because this is purely a keyword count. So "pimps", "pimp", "pimping", and "pimpin" count as four words in a vocabulary, and I have a hunch that variants of a single word are more common in rap lyrics than in Shakespeare and Melville. Again, I'm guessing here.

    That said, although there could be similar issues within the rapper comparisons, I bet the counts are more comparable.
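    The metric itself is simple to reproduce: take the first N tokens of an artist's lyrics and count the distinct ones. A sketch (the tokenizer is my own, not Daniels'), which also shows the stemming caveat above:

```python
# Count unique tokens among the first N words of a lyrics string.
# This mirrors Daniels' raw-token approach: no stemming is applied,
# so "pimp" and "pimps" count as two different "words".
import re

def vocabulary_size(lyrics, first_n=35000):
    """Number of unique lowercase tokens in the first first_n words."""
    tokens = re.findall(r"[a-z']+", lyrics.lower())[:first_n]
    return len(set(tokens))
```

    Capping at the first 35,000 words is what lets a prolific artist like Jay-Z be compared fairly against a newer one like Drake.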

  • Hiding a pregnancy from advertisers

    May 1, 2014  |  Statistics

    You probably remember how Target used purchase histories to predict pregnancies among their customer base (although, don't forget the false positives). Janet Vertesi, an assistant professor of sociology at Princeton University, made sure that sort of data didn't exist during her nine months.

    First, Vertesi made sure there were absolutely no mentions of her pregnancy on social media, which is one of the biggest ways marketers collect information. She called and emailed family directly to tell them the good news, while also asking them not to put anything on Facebook. She even unfriended her uncle after he sent a congratulatory Facebook message.

    She also made sure to only use cash when buying anything related to her pregnancy, so no information could be shared through her credit cards or store-loyalty cards. For items she did want to buy online, Vertesi created an Amazon account linked to an email address on a personal server, had all packages delivered to a local locker and made sure only to use Amazon gift cards she bought with cash.

    The best part was that her modified activity—like purchasing $500 worth of Amazon gift cards in cash from the local Rite Aid—set off other triggers in real life.

  • A principal component analysis step-by-step

    April 17, 2014  |  Statistics

    Sebastian Raschka offers a step-by-step tutorial for a principal component analysis in Python.

    The main purposes of a principal component analysis are the analysis of data to identify patterns and finding patterns to reduce the dimensions of the dataset with minimal loss of information.

    Here, our desired outcome of the principal component analysis is to project a feature space (our dataset consisting of n x d-dimensional samples) onto a smaller subspace that represents our data "well". A possible application would be a pattern classification task, where we want to reduce the computational costs and the error of parameter estimation by reducing the number of dimensions of our feature space by extracting a subspace that describes our data "best".

    That is, imagine you have a dataset with a lot of variables, some of them important and some not so much. A PCA helps you identify which is which, so the dataset feels less unwieldy and downstream analysis carries less overhead.
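    Raschka's post builds every step by hand; as a compact sketch of the same pipeline (center the data, eigendecompose the covariance matrix, project onto the top components), leaning on numpy rather than reproducing his exact code:

```python
# Compact PCA: center, take the covariance matrix's top eigenvectors,
# and project the data onto them. A sketch of the pipeline in Raschka's
# tutorial, not his code.
import numpy as np

def pca_project(X, k):
    """Project an n x d data matrix X onto its top k principal components."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)       # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k directions
    return X_centered @ top
```

    With `k` smaller than `d` you get the dimensionality reduction; with `k = d` the projection is just a rotation, so no information is lost.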

  • Analysis of Bob Ross paintings

    April 17, 2014  |  Statistics

    Bob Ross keywords

    As a lesson on conditional probability for himself, Walt Hickey watched 403 episodes of "The Joy of Painting" with Bob Ross, tagged them with keywords on what Ross painted, and examined Ross's tendencies.

    I analyzed the data to find out exactly what Ross, who died in 1995, painted for more than a decade on TV. The top-line results are to be expected — wouldn't you know, he did paint a bunch of mountains, trees and lakes! — but then I put some numbers to Ross's classic figures of speech. He didn't paint oaks or spruces, he painted "happy trees." He favored "almighty mountains" to peaks. Once he'd painted one tree, he didn't paint another — he painted a "friend."

    Other findings include cumulus and cirrus cloud breakdowns, hill frequency, and Steve Ross (son of Bob Ross) patterns.
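    The conditional-probability bookkeeping behind findings like these is straightforward once the episodes are tagged. A sketch with hypothetical tag sets (Hickey's real dataset covers 403 episodes):

```python
# Estimate P(target tag | given tag) from a list of tagged paintings.
# The episode tags below are made up for illustration; Hickey's dataset
# covers 403 real episodes of "The Joy of Painting".
def conditional_probability(paintings, given, target):
    """Fraction of paintings with the given tag that also have the target."""
    with_given = [tags for tags in paintings if given in tags]
    if not with_given:
        return 0.0
    return sum(target in tags for tags in with_given) / len(with_given)

episodes = [  # hypothetical tag sets
    {"tree", "mountain", "lake"},
    {"tree", "cabin"},
    {"tree", "mountain"},
    {"clouds", "mountain"},
]
```

    For example, `conditional_probability(episodes, "tree", "mountain")` asks: given that Ross painted a tree, how often did a mountain show up too?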

  • Porn views for red versus blue states

    April 14, 2014  |  Statistics

    Top ten viewing states

    Pornhub continues their analysis of porn viewing demographics in their latest comparison of pageviews per capita between red and blue states (SFW for most, I think). The main question: Who watches more?

    Assuming the porn consumption per capita is normally distributed for each state and that different states have independent distribution of porn consumption per capita, we can say with 99% confidence the hypothesis that the per capita porn consumption of democratic states is higher than the republican states.

    Okay, the result statement sounds a little weird, but when you look at the rates, the conclusion seems clear. The states with the highest viewing per capita are shown above, and for some reason Kansas is significantly higher than everyone else. Way to go.

    For a clearer view, Christopher Ingraham charted the same data but incorporated the percent of Obama voters for each state. Interpret as you wish:

    Obama voting and porn

    Again, note Kansas high on the vertical axis.

    Update: Be sure to read this critique for a better picture of what you see here.

  • Using Census survey data properly

    April 11, 2014  |  Statistics

    The American Community Survey, an ongoing survey that the Census Bureau administers to millions of people per year, provides detailed information about how Americans live now and how they lived decades ago. There are tons of data tables on topics such as housing, education, and commutes. The natural thing to do is to download the data, take it at face value, and carry on with your analysis or visualization.

    However, as is usually the case with data, there's more to it than that. Paul Overberg, a database editor at USA Today, explains in a practical guide on how to get the most out of the survey data (which can be generalized to other survey results).

    Journalists who use ACS a lot have a helpful slogan: "Don't make a big deal out of small differences." Journalists have all kinds of old-fashioned tools to deal with this kind of challenge, starting with adverbs: "about," "nearly," "almost," etc. It's also a good idea to round ACS numbers as a signal to users and to improve readability.

    In tables and visualizations, the job is tougher. These introduce ranking and cutpoints, which create potential pitfalls. For tables, it's often better to avoid rankings and instead create groups—high, middle, low. In visualizations, one workaround is to adapt high-low-close stock charts to show a number and its error margins. Interactive data can provide important details on hover or click.

    If you do any kind of data reporting, whatever field it's in, you should be familiar with most of what Overberg describes. If not, better get your learn on.
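    The "small differences" slogan can be made concrete with the Census Bureau's standard rule for comparing two estimates: the gap matters only if it exceeds the combined margin of error. A sketch, assuming both margins are published at the same confidence level:

```python
# Decide whether two survey estimates differ meaningfully, given their
# published margins of error. This follows the standard Census Bureau
# rule for comparing two ACS estimates at the same confidence level.
import math

def significantly_different(est_a, moe_a, est_b, moe_b):
    """True if the estimates differ by more than the combined margin."""
    combined_moe = math.sqrt(moe_a**2 + moe_b**2)
    return abs(est_a - est_b) > combined_moe
```

    So a 50.2% vs. 49.8% split with ±1.5-point margins is exactly the kind of "small difference" not to make a big deal out of.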

  • Bracket picks of the masses versus sports pundits

    April 11, 2014  |  Statistics

    NCAA bracket picking

    Stephen Pettigrew and Reuben Fischer-Baum, for Regressing, compared 11 million brackets on ESPN.com against those of pundits.

    To evaluate how much better (or worse) the experts were at predicting this year's tournament, I considered three criteria: the number of games correctly predicted, the number of points earned for correct picks, and the number of Final Four teams correctly identified. Generally the experts' brackets were slightly better than the non-expert ones, although the evidence isn't especially overwhelming. The analysis suggests that next year you'll have just as good a chance of winning your office pool if you make your own picks as if you follow the experts.

    Due to availability, the expert sample size is a small 53, but the expert brackets do appear to be somewhere in the same range as the masses'. Still too noisy to know for sure though.

    If anything, this speaks more to the randomness of the tournament than it does about people knowing what teams to pick. It's the same reason why my mom, who knows nothing about basketball or any sports for that matter, often comes out ahead in the work pool. The expert picks are just a point of reference.

  • Fox News bar chart gets it wrong

    April 4, 2014  |  Mistaken Data

    Fox News bar chart

    Because Fox News. See also this, this, and this. [Thanks, Meron]

  • Big data, same statistical challenges

    April 4, 2014  |  Statistics

    Tim Harford for Financial Times on big data and how the same problems for small data still apply:

    The multiple-comparisons problem arises when a researcher looks at many possible patterns. Consider a randomised trial in which vitamins are given to some primary schoolchildren and placebos are given to others. Do the vitamins work? That all depends on what we mean by “work”. The researchers could look at the children’s height, weight, prevalence of tooth decay, classroom behaviour, test scores, even (after waiting) prison record or earnings at the age of 25. Then there are combinations to check: do the vitamins have an effect on the poorer kids, the richer kids, the boys, the girls? Test enough different correlations and fluke results will drown out the real discoveries.

    You're usually in for a fluffy article about drowning in data and social media when 'big data' is in the title. This one is worth the full read.
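    Harford's vitamin example is easy to simulate: give the "treatment" no effect at all, test dozens of outcomes at the 5% level, and watch flukes appear anyway. The setup below (group sizes, outcome count) is my own illustration, not from the article:

```python
# Simulate the multiple-comparisons problem: a "treatment" with zero
# effect is tested on many unrelated outcomes, so every significant
# result is a fluke.
import random

def simulate_outcomes(n_outcomes=40, n_kids=50, seed=1):
    """Count 'significant' results in a placebo-vs-placebo comparison.

    Both groups are drawn from the same distribution, so the true effect
    is zero for every outcome. Uses a crude z-test on the difference in
    group means, with known unit variance."""
    rng = random.Random(seed)
    flukes = 0
    for _ in range(n_outcomes):
        treated = [rng.gauss(0, 1) for _ in range(n_kids)]
        placebo = [rng.gauss(0, 1) for _ in range(n_kids)]
        diff = sum(treated) / n_kids - sum(placebo) / n_kids
        se = (2 / n_kids) ** 0.5        # standard error of the difference
        if abs(diff / se) > 1.96:       # "significant" at the 5% level
            flukes += 1
    return flukes
```

    With 40 outcomes tested at the 5% level, you expect about two flukes even though nothing is going on, which is exactly how the real discoveries get drowned out.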

  • Bike share data in New York, animated

    April 1, 2014  |  Data Sources

    Citi Bike, also known as NYC Bike Share, is releasing monthly data dumps of station check-outs and check-ins, which give you a sense of where and when people move about the city. Jeff Ferzoco, Sarah Kaufman, and Juan Francisco Saldarriaga mapped 24 hours of activity in the video below.

    [Thanks, Jeff]

Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.