• A quick lesson on making predictions

    October 31, 2012  |  Statistics

    Political analyst and statistician Nate Silver has gotten some flak lately for consistently projecting a 70-plus percent chance of a Barack Obama win this election. But as Jeff Leek explains, the criticism doesn't stem from Silver being wrong. Rather, it comes from the critics' misunderstanding of statistics. Leek provides a quick lesson on how Silver makes his predictions and how the methods apply to other things, like the weather.

    Now, this might seem like a goofy way to come up with a "percent chance" with simulated elections and all. But it turns out it is actually a pretty important thing to know and relevant to those of us on the East Coast right now. It turns out weather forecasts (and projected hurricane paths) are based on the same sort of thing — simulated versions of the weather are run and the "percent chance of rain" is the fraction of times it rains in a particular place.

    So Romney may still win and Obama may lose — and Silver may still get a lot of it right. But regardless, the approach taken by Silver is not based on politics, it is based on statistics.

    Don't fear the black box.
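
    To make the "simulated elections" idea concrete, here is a minimal sketch, not Silver's actual model. The state list, vote counts, and probabilities are made up for illustration, and the sketch assumes each state's outcome is independent of the others, which a real forecast would not.

```python
import random

# Hypothetical inputs: (electoral votes, chance candidate A carries the state).
# These numbers are invented for illustration; they are not real forecasts.
STATES = [(10, 0.9), (8, 0.6), (6, 0.5), (9, 0.4), (7, 0.2)]


def win_fraction(states, n_sims=10_000, seed=0):
    """Fraction of simulated elections in which candidate A wins a majority."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total_votes = sum(votes for votes, _ in states)
    wins = 0
    for _ in range(n_sims):
        # One simulated election: decide each state with a weighted coin flip.
        a_votes = sum(votes for votes, p in states if rng.random() < p)
        if a_votes > total_votes / 2:
            wins += 1
    return wins / n_sims


print(f"Candidate A wins in {win_fraction(STATES):.0%} of simulated elections")
```

    The "percent chance" is just the share of simulated runs that end in a win, the same way a "percent chance of rain" is the share of simulated weather runs in which it rains.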

  • Data on decades of Boy Scout expulsions released

    October 22, 2012  |  Data Sources

    Allegations in the Boy Scouts

    The Los Angeles Times released nearly 5,000 records of allegations from the Boy Scouts of America as a browseable map and searchable list. You can also download the data.

    This database contains information on about 5,000 men and a handful of women who were expelled from the Boy Scouts of America between 1947 and January 2005 on suspicion of sexual abuse. The dots on the map indicate the location of troops connected in some way to the accused. The timeline below shows the volume of cases opened by year; however, an unknown number of files were purged by the Scouts prior to the early 1990s.

    The interactive map helps you narrow down by city, but it's kind of hard to see cases on a country-wide perspective. Here's a quick look.

    The worst part is that a lot of the cases went unreported.

  • The birthday problem explained

    October 5, 2012  |  Statistics

    How many people does it take for there to be a 50% chance that a pair in the group has the same birthday? Only 23 people. What about a 99% chance? Maybe even more shocking: 57 people. This is the birthday problem, which every undergrad who's taken a stat course has seen. Steven Strogatz explains the logic and calculations.

    Intuitively, how can 23 people be enough? It’s because of all the combinations they create, all the opportunities for luck to strike. With 23 people, there are 253 possible pairs of people (see the notes for why), and that turns out to be enough to push the odds of a match above 50 percent.

    Incidentally, if you go up to 43 people — the number of individuals who have served as United States president so far — the odds of a match increase to 92 percent. And indeed two of the presidents do have the same birthday: James Polk and Warren Harding were both born on Nov. 2.

    The Johnny Carson clip referenced in the article is worth watching. Carson tries to test the results with the audience, but goes about it the wrong way.
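
    The numbers quoted above can be checked with a few lines of code. This is a sketch under the usual textbook assumptions: 365 equally likely birthdays, no leap days, and birthdays independent across people. The function names are my own.

```python
def shared_birthday_probability(n, days=365):
    """Chance that at least two of n people share a birthday."""
    p_all_distinct = 1.0
    for k in range(n):
        # Person k+1 must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


def pair_count(n):
    """Number of distinct pairs that can be formed from n people."""
    return n * (n - 1) // 2


for n in (23, 43, 57):
    p = shared_birthday_probability(n)
    print(f"{n} people, {pair_count(n)} pairs, match probability {p:.3f}")
```

    The trick is to compute the chance that everyone's birthday is different and subtract it from one, which is much easier than counting matches directly.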

  • Data for good, not bad

    September 21, 2012  |  Statistics

    I'm so glad there are people like Jake Porway in the world. The founder and executive director of DataKind gives his quick pitch on "using data in the service of humanity."

  • Hiring a data scientist

    September 19, 2012  |  Statistics

    Thomas H. Davenport and D.J. Patil give the rundown on what a data scientist is, what to look for and how to hire them. It's an article in Harvard Business Review, so it's geared towards managers, and I felt like I was reading a horoscope at times, but there are some interesting tidbits in there.

    Data scientists don’t do well on a short leash. They should have the freedom to experiment and explore possibilities. That said, they need close relationships with the rest of the business. The most important ties for them to forge are with executives in charge of products and services rather than with people overseeing business functions. As the story of Jonathan Goldman illustrates, their greatest opportunity to add value is not in creating reports or presentations for senior executives but in innovating with customer-facing products and processes.

    I still call myself a statistician. The main difference between data scientist and statistician seems to be programming skills, but if you're doing statistics without code, I'm not sure what you're doing (other than theory).

    Update: This recent panel from DataGotham also discusses the data scientist hiring process. [Thanks, Drew]

  • Humans predicting the weather

    September 10, 2012  |  Statistics

    Nate Silver says the weatherman is not a moron.

    Still, most people take their forecasts for granted. Like a baseball umpire, a weather forecaster rarely gets credit for getting the call right. Last summer, meteorologists at the National Hurricane Center were tipped off to something serious when nearly all their computer models indicated that a fierce storm was going to be climbing the Northeast Corridor. The eerily similar results between models helped the center amplify its warning for Hurricane Irene well before it touched down on the Atlantic shore, prompting thousands to evacuate their homes. To many, particularly in New York, Irene was viewed as a media-manufactured nonevent, but that was largely because the Hurricane Center nailed its forecast. Six years earlier, the National Weather Service also made a nearly perfect forecast of Hurricane Katrina, anticipating its exact landfall almost 60 hours in advance. If public officials hadn’t bungled the evacuation of New Orleans, the death toll might have been remarkably low.

    I like the bit later in the article that describes the number crunching machine and how humans are involved in the analysis. The National Weather Service has heavy-duty computing power to process data coming from weather stations across the country, but the computer is still bad at doing a lot of things.

    To most people, statistics means plugging numbers into an advanced calculator that spits out values, without much thought involved. Those people don't work with data.

  • Analyzing text messages to save lives

    September 5, 2012  |  Statistics

    Nancy Lublin, CEO of Do Something, gives a five-minute TED talk on the potential in analyzing text messages. During a texting campaign, Do Something started to receive texts from troubled teenagers about problems that ranged from bullying to rape, which led to the organization's work in setting up a texting hotline. Lublin hopes that, once the system is built, the data gathered from these messages can be used as a census of problems, and can perhaps be used in the same way that Target uses data to figure out if women are pregnant, but to save lives instead of figuring out what coupons to send.

    [Thanks, Tommy]

  • Poker is a game of skill, not luck

    August 28, 2012  |  Statistics

    King nine offsuit

    Randal Heeb convinced a New York City judge that poker isn't a game of luck, using a 120-page report full of analysis and charts.

    Judge Weinstein, relying on the research of Randal Heeb - an economist, statistician and poker player - found that while luck determines what cards a player gets, skill plays the bigger role in a player's ultimate success. With such charts as "Win-Rate Comparison: King Nine Offsuit," the 91-year-old judge delved into the complexities of the argument more thoroughly than any past court has. John Pappas was pleased with the result, and impressed with the methodology.

    The king nine offsuit chart he refers to is above. Simulated earnings for better players are shown on the left in blue, and earnings for not-so-good players are shown on the right. Although both groups are likely to lose with the hand, skill appears to decrease expected losses.

    Check out the full report here [pdf].

  • Twitter vs. Facebook: What people share

    August 23, 2012  |  Statistics

    flowingdata-metaviz

    Edwin Chen, a data scientist at Twitter, took an in-depth look at what people are more inclined to tweet on Twitter and like on Facebook. He used FlowingData as his main data source, but also analyzed Quora, xkcd, and New Scientist. The main finding:

    Twitter is still for the techies: articles where the number of tweets greatly outnumber FB likes tend to revolve around software companies and programming. Facebook, on the other hand, appeals to everyone else: yeah, to the masses, and to non-software technical folks in general as well.

    I saw the analysis when it was posted over a year ago but never got around to sharing it. It crossed my desktop again recently. The results still seem to apply.

    From a practical standpoint, I don't think about whether or not people are going to share something more on Twitter or Facebook before I post it. I just link to what I think is interesting. However, when I post something with a poop or fart joke in it (so half the time, basically), I make sure I share it on Facebook, which I have to do manually. Because you know, bowel movements have universal appeal.

  • Network analysis on high school hierarchy of friends

    August 13, 2012  |  Statistics

    Brian Ball and M. E. J. Newman analyzed friendship data from a high school and junior high, and found a hierarchy similar to the one in Mean Girls.

    Here we analyze a large collection of such networks representing friendships among students at US high and junior-high schools and show that the pattern of unreciprocated friendships is far from random. In every network, without exception, we find that there exists a ranking of participants, from low to high, such that almost all unreciprocated friendships consist of a lower-ranked individual claiming friendship with a higher-ranked one.

    So someone higher up on the totem pole had more people saying they were friends with him or her, but the popular one didn't necessarily feel the same.

    I told my wife this, and her reaction was basically, "Uh, yeah. And?"

  • Fox News continues charting excellence

    August 6, 2012  |  Mistaken Data

    Bush cuts

    Fox News tried to show the change in the top tax rate if the Bush tax cuts expire, so they showed the rate now and what it'd be in 2013. Wow, it'll be around five times higher. Wait. No.

    The value axis starts at 34 percent instead of zero, which you don't do with bar charts, because length is the visual cue. That is to say, when you look at this chart, you compare how high each bar is. Fox News might as well have started the vertical axis at 34.9 percent. That would've been more dramatic.

    Here's what the bar chart is supposed to look like:

    With a difference of 4.6 percentage points, the change doesn't look so crazy.
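
    A quick sketch of why the truncated axis exaggerates the change. The two rates below, 35 and 39.6 percent, are my assumption for the values on the chart, chosen to match the 4.6 point difference; the function name is also my own.

```python
def apparent_ratio(smaller, larger, axis_start=0.0):
    """Ratio of drawn bar lengths when the value axis begins at axis_start."""
    if axis_start >= smaller:
        raise ValueError("axis must start below the smaller value")
    return (larger - axis_start) / (smaller - axis_start)


# Assumed top tax rates in percent (not stated in the post itself).
now, after_expiry = 35.0, 39.6

print(f"axis from 34: second bar looks {apparent_ratio(now, after_expiry, 34.0):.1f}x as tall")
print(f"axis from 0:  second bar looks {apparent_ratio(now, after_expiry, 0.0):.2f}x as tall")
```

    Bar length is measured from wherever the axis starts, so moving the start close to the smaller value inflates the ratio without changing the data.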

    [via Effective Graphs]

  • From statistics to data science, and vice versa

    July 26, 2012  |  Statistics

    Carnegie Mellon statistics professor Cosma Shalizi considers the differences and similarities between statistics and data science.

    If people want to call those who do such jobs "data scientists" rather than "statisticians" because it sounds more dignified, or gets them more money, or makes them easier to hire, then more power to them. If they want to avoid the suggestion that you need a statistics degree to do this work, they have a point but it seems a clumsy way to make it. If, however, the name "statistician" is avoided because that connotes not a powerful discipline which transforms profound ideas about learning from experience into practical tools, but rather, a meaningless conglomeration of rituals better conducted with twenty-sided dice, then we as a profession have failed ourselves and, more importantly, the public, and the blame lies with us. Since what we have to offer is really quite wonderful, we should not let that happen.

    Some time during the past couple of years, statistics became data science's older, more boring sibling that always plays by the rules. There are a lot of statisticians who now call themselves data scientists. I still call myself a statistician.

    But I think we're getting closer to that part in the movie when the older, more stuffy character learns from the young whippersnapper that loosening up could be a good thing, and when the young one realizes that some elbow grease and tradition can go a long way.

  • Computing for data analysis

    July 24, 2012  |  Statistics

    If you want to learn visualization, you should learn data. To learn data, you should learn statistics. Where to begin? The free analysis courses offered on Coursera by Johns Hopkins professors are probably a good place to start. Currently available: Computing for Data Analysis with biostatistics professor Roger D. Peng and Data Analysis with Jeff Leek, also a biostatistics professor.

    There's also a handful of data-related courses from other university professors that might be worth a look.

  • How consumers suck at math

    July 18, 2012  |  Statistics

    Derek Thompson for The Atlantic on how retail uses our numeric biases to their advantage:

    Now that I've just told you that consumers try to avoid additional payments, I should add that there are two additional payments we love: rebates and warranties. The first buys the illusion of wealth ("I'm being paid money to spend money!"). The second buys peace of mind ("Now I can own this thing forever without worrying about it!"). Both are basically tricks. "Instead of buying something and getting a rebate," Poundstone writes, "why not just pay a lower price in the first place?"

    "[Warranties] make no rational sense," Harvard economist David Cutler told the Washington Post. "The implied probability that [a product] will break has to be substantially greater than the risk that you can't afford to fix it or replace it. If you're buying a $400 item, for the overwhelming number of consumers that level of spending is not a risk you need to insure under any circumstances."

    Other tidbits: our obsession with prices ending with a nine and how we justify purchases of things that are more expensive but aren't necessarily better than the cheaper item.

  • Data plural versus data singular

    July 12, 2012  |  Statistics

    Kevin Drum on data is or data are:

    Now, I know that lots of people continue to foolishly disagree with me about this, but I'm curious how far they're willing to push things. If you had, say, five bits of information, would you say I only have five data? If you really, truly believe that data is a plural noun, you'd have no problem with this. But does anyone actually do it?

    This was in response to the Wall Street Journal's style guy saying that they can go either way, as the word has evolved to also mean a singular collection of numbers.

    Here's what the New York Times style guide has to say about it:

    [D]ata is acceptable as a singular term for information: The data was persuasive. In its traditional sense, meaning a collection of facts and figures, the noun can still be plural: They tabulate the data, which arrive from bookstores nationwide. (In this sense, the singular is datum, a word both stilted and deservedly obscure.)

    I say data is. The plural version sounds weird to me.

  • Soda versus pop on Twitter

    July 9, 2012  |  Statistics

    Soda vs pop on Twitter

    Edwin Chen, a data scientist at Twitter, explored the geographic differences in language usage of soda, pop, and coke. We've seen this before, so it shouldn't be surprising to see that in the United States soda is dominant on the coasts, pop in the midwest, and coke in the southeast. The global view is new, with coke basically penetrating almost all of Europe.

    What I think is most interesting though is the idea of tweets and status updates as data that represents cultures. There are applications that keep track of tweet volume, number of replies, and when the best time to share a link is, but in ten years none of that will matter. These miniature data time capsules on the other hand will be worth another look.

  • Mitt Romney pseudo-Venn diagram, used incorrectly

    July 8, 2012  |  Mistaken Data

    Promise gap venn diagram

    The Mitt Romney campaign put this Venn diagram up a few days ago, aiming to show the "promise gap." On the left is an Obama promise, and on the right is the result. In the middle, the combination of the promise and the result, is the gap. Wait, that's not right.

  • How long it takes to get pregnant

    July 3, 2012  |  Statistics

    Probability of conception by month and age

    The odds of getting pregnant after a certain time trying are surprisingly hard to come by. There are statistics here and there, but none provide a good overview of the probabilities. Mathematician Richie Cotton crunched some numbers using monthly fecundity rate (the monthly chance of getting pregnant) to estimate about how long it would take for him and his girlfriend to conceive.

    [A]lmost half of the (healthy) 25 year olds get pregnant in the first month, and after two years (the point when doctors start considering you to have fertility problems) more than 90% of 35 year olds should conceive. By contrast, just over 20% of 45 year old women will. In fact, even this statistic is over-optimistic: at this age, fertility is rapidly decreasing, and a 1% MFR at age 45 will mean a much lower MFR at age 47 and the negative binomial model breaks down.

    Obviously, there are other factors to consider like male fertility and how often a couple has sex, but there you go.
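
    The quoted figure for 45 year olds can be roughly reproduced with a simple waiting-time calculation. This is a sketch, not Cotton's actual code: it assumes the monthly fecundity rate (MFR) stays constant from month to month, which is the geometric, one-success case of the negative binomial model mentioned in the quote. The function name is my own.

```python
def chance_within(months, mfr):
    """Probability of conceiving at least once in `months` tries at a constant MFR."""
    # Chance of no conception in any month is (1 - mfr) ** months.
    return 1.0 - (1.0 - mfr) ** months


# A 1% MFR over two years, the case quoted for 45 year olds.
print(f"{chance_within(24, 0.01):.1%}")
```

    As the quote points out, holding the rate constant is optimistic at older ages, because the rate itself falls over the two years.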

    [via Revolutions]

  • Synchronized Swimming in Data and the Water Metaphor

    June 21, 2012  |  Statistics

    The flood. The avalanche. The tsunami. Drowning in data. For the past few years, a couple of times a week, there's an article about all the data we have access to and how we're struggling to stay afloat in the growing sea of data. Big data is getting too big, they say.

    The water metaphor is fine, but the fear of the data flow is irrational, so let's swim with the former.

  • Analysis of chords used in popular songs

    June 20, 2012  |  Statistics

    Chord usage

    Hooktheory, a system for learning to write music, analyzed 1,300 popular songs for how chords were used. The above shows chords that followed an E minor chord.

    This result is striking. If you write a song in C with an E minor in it, you should probably think very hard if you want to put a chord that is anything other than an A minor chord or an F major chord. For the songs in the database, 93% of the time one of these two chords came next.

    The most common chords used overall were G, F, and C.
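
    The "what chord comes next" analysis boils down to counting transitions. Here is a minimal sketch of that counting. The four progressions are made up for illustration and are not Hooktheory's data, and the function name is my own.

```python
from collections import Counter

# Made-up chord progressions, for illustration only.
SONGS = [
    ["C", "Em", "Am", "F"],
    ["C", "Em", "F", "G"],
    ["Am", "F", "C", "G"],
    ["C", "G", "Em", "Am"],
]


def next_chord_shares(songs, chord):
    """Share of times each chord directly follows `chord` across all songs."""
    counts = Counter(
        nxt
        for song in songs
        for cur, nxt in zip(song, song[1:])  # consecutive chord pairs
        if cur == chord
    )
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


print(next_chord_shares(SONGS, "Em"))
```

    Run over a real database of songs, the same tally would give the kind of breakdown shown in the chart.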

    [via Waxy]

    Update: See also this great musical sketch by Axis of Awesome in which they sing some 40 songs that use the same four chords. [Thanks, Jan]

Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.