• How well we don’t understand probability

    July 29, 2014  |  Statistics

    All Things Considered on NPR ran a fine series on how we interpret probability and uncertainty. It came in five bits (plus one follow-up), each five to ten minutes long. The segments explore how risk is explained in areas such as national security, health, and the daily weather, and how people interpret the numbers and words.

    A recurring theme was experts who use alternative descriptions for the seemingly concrete numbers.

    Doctors, including Leigh Simmons, typically prefer words. Simmons is an internist and part of a group practice that provides primary care at Mass General. "As doctors we tend to often use words like, 'very small risk,' 'very unlikely,' 'very rare,' 'very likely,' 'high risk,' " she says.

    Not that words always make understanding numeric probability easier. From the social scientist for the National Weather Service:

    And it's not just a numbers game — words used to describe weather can be just as confusing. Take "watch" and "warning," for example.

    "'Watch' means that conditions are ripe for something to happen. 'Warning' means that it is happening — it is imminent," Brown says. "It's easy to get them confused."

    Both the doctor and the social scientist agree that a combination of numbers, words, and a visual explanation could be the best route.

    Some people think we should forgo trying to explain uncertainty to a general public that doesn't understand, but the rejectors themselves don't recognize the importance. Just because you don't understand something doesn't mean you should ignore it.

    Listen to the full series. [via Dart-Throwing Chimp]

  • A more visual world data portal

    July 24, 2014  |  Data Sources

    OECD data portal

    One of the most annoying parts of downloading data from large portals is that you never quite know what you're gonna get. It's a box of chocolates. It's government data sites. It's lists of datasets with vague or unhelpful titles with links to download. Of course, I'd rather have a hodgepodge than nothing at all, but as with most things, there's room for improvement.

    The OECD, which maintains and provides data on the country level, takes steps towards a more helpful portal that makes data grabs less of a headache. With the help of Raureif, 9elements, and Moritz Stefaner, the new portal is still in beta, but there's plenty to like.

  • Large-ish data packages in R

    July 24, 2014  |  Data Sources

    If you've played around with R enough, there comes a time when you just need some data to mess around with. Maybe it's to learn a new method or to make one of your own. R offers some small-ish, clean datasets to poke at, but sometimes you need bigger, messier data. Hadley Wickham from RStudio released four popular large-ish datasets in package form to help you with that.

    I've released four new data packages to CRAN: babynames, fueleconomy, nasaweather and nycflights13. The goal of these packages is to provide some interesting, and relatively large, datasets to demonstrate various data analysis challenges in R. The package source code (on github, linked above) is fully reproducible so that you can see some data tidying in action, or make your own modifications to the data.

    Good.

  • Polling for stress

    July 9, 2014  |  Statistics

    Stress test

    NPR, the Robert Wood Johnson Foundation and the Harvard School of Public Health conducted a survey about people's stress levels and the factors contributing to that stress. It ran for about a month. NPR started with a summary of the findings, in what will be a two-week segment on the air and online.

    The above shows the percent of respondents in the age brackets who said the factors (the rows in this case) contributed to their current stress. It looks like I might be in a less stressful stage of my life, between the age of 30 and 39.

    It's just an early summary of poll responses right now, so I'm hoping they go into more detail about statistically significant differences between demographics and how the 2,500-person sample relates to the US population.
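
    For a rough sense of what a sample of that size can support, here is a minimal sketch of the usual margin-of-error arithmetic. It assumes a simple random sample and a 95% confidence level, neither of which the poll summary confirms, and the function name is made up for illustration. Real polls are weighted, which typically widens the margin.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    n: sample size
    p: assumed proportion (0.5 is the worst case)
    z: critical value (1.96 corresponds to roughly 95% confidence)
    """
    return z * math.sqrt(p * (1 - p) / n)

# The sample size comes from the post; everything else is an assumption.
moe = margin_of_error(2500)
print(f"+/- {moe * 100:.1f} percentage points")
```

    The margin grows as the group shrinks, so comparisons between age brackets rest on much less than the full sample. Under the same assumptions, a hypothetical bracket of 400 respondents would carry a margin of about 4.9 points.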

  • Data science, big data, and statistics – all together now

    July 2, 2014  |  Statistics

    Terry Speed, an emeritus professor in statistics at the University of California at Berkeley, gave an excellent talk on how statisticians can play nice with big data and data science. Usually these talks go in the direction of saying data science is statistics. This one is more on the useful, non-snarky side.

  • Subpar Captain America

    July 1, 2014  |  Statistics

    Animation Domination High-Def has a Captain America video of things that America is not so good at, relative to other countries. And they even cited their data source, the CIA World Factbook. How about that.

  • Test your statistical wits about stuff in the world

    June 24, 2014  |  Statistics

    How wrong you are

    Many of us aren't aware of how one country compares to others or public policy that has been around for decades. How Wrong You Are is a simple quiz game by Moiz Syed and Juliusz Gonera that tests such knowledge.

    How Wrong You Are is a collection of important questions that people are sometimes misinformed about. We poll you to measure how right — or how wrong — the public is about these important questions.

    Every week, we will add a new question. These are all questions that we hope you already know. But if you don't, don't worry! You learned something. Share your results, successful or not. Chances are, if you didn't know this question, other people might not, either.

    Play the game here. At the very least, you'll learn something new.

  • Lessons from improperly anonymized taxi logs

    June 23, 2014  |  Statistics

    Through a Freedom of Information request, Chris Whong received and eventually released NYC taxi logs starting in 2013 (about 173 million trips). Vijay Pandurangan looked at the data a little closer and deanonymized the logs, linking hashed license numbers to driver names. It didn't take much to do it. Pandurangan described the process and the lessons organizations can learn when they release data.

    Someone on Reddit pointed out that one specific driver seemed to be doing an incredible amount of business. When faced with anomalous data like that, it's good practice to weed out data error before jumping to conclusions about cheating taxi drivers. Also, I couldn't shake the feeling that there was something about that encoded id number: "CFCD208495D565EF66E7DFF9F98764DA." After a little bit of poking around, I realised that that code is actually the MD5 hash of the character '0'. This proved my suspicion that this was actually a data collection error, but also made me immediately realise that the entire anonymization process was flawed and could easily be reversed.

    He also provided the code snippet he used to do it.
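
    The weakness Pandurangan describes can be sketched in a few lines. This is not his snippet. It is a minimal illustration that assumes the IDs were hashed with unsalted MD5 and that the ID space is small enough to enumerate. The ID pattern used here (one digit, one letter, two digits) and the function names are assumptions for illustration only.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits

def md5_hex(text):
    """Return the uppercase hex MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest().upper()

# The hash quoted above is the MD5 of the single character '0'.
print(md5_hex("0") == "CFCD208495D565EF66E7DFF9F98764DA")

def build_lookup():
    """Hash every ID in an ASSUMED pattern: digit, letter, digit, digit."""
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        candidate = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(candidate)] = candidate
    return table

lookup = build_lookup()      # only 26,000 candidates in this assumed pattern
hashed = md5_hex("5X55")     # stand-in for a hashed value from a released log
print(lookup[hashed])        # the lookup table recovers the original ID
```

    When the set of possible inputs is this small, hashing hides nothing, because anyone can hash every candidate and match the results. Assigning random identifiers, or using a keyed hash with a secret key, would avoid this kind of enumeration.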

  • Data grab bag

    June 12, 2014  |  Statistics

    — When you deal with data, you can think like a statistician, even if you don't know the math (although it will certainly help a lot). Jonathan Stray brings up fine points to draw conclusions from data, as does Jacob Harris in a detailed case study on distrusting your data.

    — Learning data science still seems like a fuzzy, abstract idea. Trey Causey offers advice on getting started with the bubbling field.

    — Is college still worth it? Yes.

    — Coding isn't easy. If it were, everyone would do it.

    — R gotchas.

  • What a few thousand fake followers gets you

    June 6, 2014  |  Statistics

    Real and fake followers

    There are a lot of fake, spammy accounts on Twitter that come in a variety of forms. Some tweet links to junk, some serve as retweet and faving bots, and others exist purely to boost follower counts. Gilad Lotan, a data scientist at betaworks, was curious about that last type, so he bought 4,000 followers for five bucks and looked closer at his newfound friends.

  • GDP rises in the UK after spending on illegal activities counted

    June 5, 2014  |  Statistics

    The gross domestic product for the United Kingdom rose by 5% seemingly overnight, after spending on cocaine and prostitution was (roughly) accounted for. Naturally there's been a bit of fuss over the new estimate. Tim Harford explains why the new count isn't such a travesty.

    We need to understand three things about gross domestic product statistics. First, GDP itself is ineffable — an attempt to synthesise, for practical purposes, something that defies description. Second, the national accounts are not designed to give a round of applause to the good stuff and a loud raspberry to the bad stuff. They are supposed to measure economic transactions. And, third, anyone who thinks politicians try to maximise GDP has not been paying much attention to politicians.

    After you read that, it's also worth listening to the Planet Money podcast on GDP from a few months back. Fuzzy estimate.

  • What pregnant women want

    May 26, 2014  |  Statistics

    What pregnant women search for

    In another take on the game of what Google suggests while searching, Seth Stephens-Davidowitz for The New York Times looked at queries related to pregnant women. Some searches were similar across countries, whereas others varied culturally.

    Start with questions about what pregnant women can do safely. The top questions in the United States: Can pregnant women "eat shrimp," "drink wine," "drink coffee" or "take Tylenol"?

    But other countries don't look much like the United States or one another. Whether pregnant women can "drink wine" is not among the top 10 questions in Canada, Australia or Britain. Australia's concerns are mostly related to eating dairy products while pregnant, particularly cream cheese. In Nigeria, where 30 percent of the population uses the Internet, the top question is whether pregnant women can drink cold water.

    Stephens-Davidowitz's analysis is mostly anecdotal but a fun read.

    I want to see something like this direct from Google, with more rigor. Now that would be interesting.

  • Strava Metro aims to help cities improve biking routes

    May 23, 2014  |  Data Sharing

    Strava Metro Melbourne

    Last month, Strava, which allows users to track their bike rides and runs, launched an interactive map that shows where people move worldwide. That seems to be a lead-in to their larger project Strava Metro. Here's the pitch:

    Strava Metro is a data service providing "ground truth" on where people ride and run. Millions of GPS-tracked activities are uploaded to Strava every week from around the globe. In denser metro areas, nearly one-half of these are commutes. These activities create billions of data points that, when aggregated, enable deep analysis and understanding of real-world cycling and pedestrian route preferences.

    Strava had a handful of clients before the official launch, such as the Oregon Department of Transportation. From Bike Portland:

    Last fall, the agency paid $20,000 for one-year license of a dataset that includes the activities of about 17,700 riders and 400,000 individual bicycle trips totaling 5 million BMT (bicycle miles traveled) logged on Strava in 2013. The Strava bike "traces" are mapped to OpenStreetMap.

    This is what I was getting at with those running maps, so it's great to see that Strava was already on it.

    It'll be interesting to see where this goes, not just business-wise, but with data sharing, privacy, and how users react to their (anonymized) data being sold.

  • Machine learning a cappella on overfitting

    May 21, 2014  |  Statistics

    From the machine learning course on Udacity, an a cappella group sings a Thriller parody on overfitting. At first you're like, "Is this real? Am I dreaming?" Then you're like, "Oh my god, he has a gold glove on." And then you're like, "Yes! This is real! Oh internets, I adore you so."

  • A majority of your email in Gmail, even if you don’t use it

    May 16, 2014  |  Statistics

    Gmail over time

    For reasons of autonomy, control, and privacy, Benjamin Mako Hill runs his own email server. After a closer look though, he realized that much of the email he sends ends up in Gmail anyway.

    Despite the fact that I spend hundreds of dollars a year and hours of work to host my own email server, Google has about half of my personal email! Last year, Google delivered 57% of the emails in my inbox that I replied to. They have delivered more than a third of all the email I've replied to every year since 2006 and more than half since 2010. On the upside, there is some indication that the proportion is going down. So far this year, only 51% of the emails I've replied to arrived from Google.

    Factor in other services such as Yahoo and Hotmail, and I imagine that majority percentage goes up quite a bit. If you want to look at your own inbox Gmail count, Hill posted the scripts for your perusal.

    This tutorial on downloading email metadata might be helpful too, if you're looking for a more general script.

  • Newborn false positives

    May 15, 2014  |  Mistaken Data

    Shutterfly sent promotional emails that congratulate new parents and encourage them to send thank you cards. The problem: a lot of people on that list weren't new parents.

    Several tipsters forwarded us the email that Shutterfly sent out in the wee small hours of this morning. One characterized the email as "data science gone wrong." Another says that she had actually been pregnant and would have been due this month, but miscarried six months ago. Is it possible that Shutterfly analyzed her search data and just happened to conclude, based on that, that she would be welcoming a child around this time? Or is it, as she hoped via email, "just a horrible coincidence?"

    Only Shutterfly knows what actually happened (they insist it was a random mistake), but it sounds like a naive use of data somewhere in the pipeline. Maybe someone remembered the Target story, got excited, and forgot about the repercussions of false positives. Or maybe someone made an incorrect assumption about data points with certain purchases and didn't test thoroughly enough.

    In any case, this slide suddenly takes on new meaning.

  • Random things that correlate

    May 12, 2014  |  Statistics

    Divorce rate in Maine vs margarine

    This is fun. Tyler Vigen wrote a program that attempts to automatically find things that correlate. As of writing this, 4,000 correlations were found so far (and actually over 100 more when I finished). Some of the gems include: the divorce rate in Maine versus per capita consumption of margarine, marriage rate in Alabama versus whole milk consumption per capita, and honey produced in bee colonies versus labor political action committees. Many things correlate with cheese consumption.
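
    Why such a program finds so much is easy to see with a sketch. The following is not Vigen's code. It assumes a brute-force approach that computes the Pearson correlation for every pair of series and keeps the strong ones. The series here are seeded random noise rather than real data, and the function names and threshold are made up for illustration.

```python
import math
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / math.sqrt(var_x * var_y)

def strong_pairs(series, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    hits = []
    for (name_a, a), (name_b, b) in combinations(series.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            hits.append((name_a, name_b, r))
    return hits

random.seed(0)
# 200 unrelated series of 10 "yearly" values each: pure noise, no real data.
noise = {f"series_{i}": [random.random() for _ in range(10)] for i in range(200)}
hits = strong_pairs(noise, threshold=0.8)
print(f"{len(hits)} of {200 * 199 // 2} pairs have |r| >= 0.8")
```

    With short series and thousands of pairs to compare, some pairs clear the threshold by chance alone, which is the same reason margarine and divorce rates can line up.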

  • Type I and II errors simplified

    May 9, 2014  |  Statistics

    Type I and II errors

    "Type I" and "Type II" errors, names first given by Jerzy Neyman and Egon Pearson to describe rejecting a null hypothesis when it's true and accepting one when it's not, are too vague for stat newcomers (and in general). This is better. [via]

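    The two definitions can also be written as a small lookup. This is a minimal sketch with hypothetical function and label names. It treats "accepting" a null hypothesis as failing to reject it.

```python
def classify_decision(null_is_true, null_rejected):
    """Name the outcome of a hypothesis test given the truth and the decision."""
    if null_is_true and null_rejected:
        return "Type I error (false positive)"
    if not null_is_true and not null_rejected:
        return "Type II error (false negative)"
    return "correct decision"

# Walk through all four combinations of truth and decision.
for null_is_true in (True, False):
    for null_rejected in (True, False):
        outcome = classify_decision(null_is_true, null_rejected)
        print(f"null true: {null_is_true!s:5}  rejected: {null_rejected!s:5}  {outcome}")
```
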
  • Naked Statistics

    May 8, 2014  |  Statistics

    Naked Statistics by Charles Wheelan promises a fun, non-boring introduction to statistics that doesn't leave you drifting off into space, thinking about anything that is not statistics. From the book description:

    For those who slept through Stats 101, this book is a lifesaver. Wheelan strips away the arcane and technical details and focuses on the underlying intuition that drives statistical analysis. He clarifies key concepts such as inference, correlation, and regression analysis, reveals how biased or careless parties can manipulate or misrepresent data, and shows us how brilliant and creative researchers are exploiting the valuable data from natural experiments to tackle thorny questions.

    Naked Statistics

    The first statistics course I took (not counting the dreadful high school stat class taught by the water polo coach) actually drew me in from the start. Plus, I needed to finish my dissertation, so I didn't pick the book up when it came out last year.

    I saw it in the library the other day though, so I checked it out. If anything, I could use a few more anecdotes to better describe statistics to people before they tell me how much they hated it.

    Naked Statistics is pretty much what the description says. It's like your stat introduction course with much less math, which is good for those who are interested in poking at data but, well, slept through Stat 101 and have an irrational fear of numbers. You get important concepts and plenty of reasons why they're worth knowing. Most importantly, it gives you a statistical way to think about data, flaws and all. Wheelan also has a fun writing style that makes this an entertaining read.

    For those who are familiar with inference, correlation, and regression, the book will be too basic, and the anecdotes alone aren't reason enough to read it. However, for anyone with less than a bachelor's degree (or equivalent) in statistics who wants to know more about analyzing data, this book should be right up your alley.

    Keep in mind though that this only gets you part way to understanding your data. Naked Statistics is beginning concepts. Putting statistics into practice is the next step.

    Personally, I skimmed through a good portion of the book, as I'm familiar with the material. I did however read a chapter out loud while taking care of my son. He might not be able to crawl yet, but I'm hoping to ooze some knowledge in through osmosis.

Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.