  • Fox News continues charting excellence

    August 6, 2012  |  Mistaken Data

    Bush cuts

    Fox News tried to show the change in the top tax rate if the Bush tax cuts expire, so they showed the rate now and what it'd be in 2013. Wow, it'll be around five times higher. Wait. No.

    The value axis starts at 34 percent instead of zero, which you don't do with bar charts, because length is the visual cue. That is to say, when you look at this chart, you compare how high each bar is. Fox News might as well have started the vertical axis at 34.9 percent. That would've been more dramatic.

    Here's what the bar chart is supposed to look like:

    With a difference of 4.6 percentage points, the change doesn't look so crazy.
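
    To put a number on the distortion, here's a minimal sketch that compares the bar heights as drawn. The 34 percent baseline and the 4.6 point difference come from the chart above; the two rates themselves (35 and 39.6 percent) are my assumption, and the function name is just for illustration.

```python
# Assumed rates: 35% now and 39.6% if the cuts expire. The post only states
# the 4.6 percentage point difference, so treat these two values as assumptions.
RATE_NOW = 35.0
RATE_2013 = 39.6


def drawn_height_ratio(low, high, baseline):
    """Ratio of the two bar heights as drawn when the value axis starts at `baseline`."""
    return (high - baseline) / (low - baseline)


# Axis starting at 34 percent, as in the Fox News chart
truncated = drawn_height_ratio(RATE_NOW, RATE_2013, baseline=34.0)

# Axis starting at zero, as a bar chart should
honest = drawn_height_ratio(RATE_NOW, RATE_2013, baseline=0.0)

print(f"Axis from 34: the 2013 bar is drawn {truncated:.1f}x as tall")
print(f"Axis from 0:  the 2013 bar is drawn {honest:.2f}x as tall")
```

    Under those assumptions the truncated axis draws the second bar more than five times as tall, while the zero-based axis shows an increase of roughly 13 percent.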

    [via Effective Graphs]

  • From statistics to data science, and vice versa

    July 26, 2012  |  Statistics

    Carnegie Mellon statistics professor Cosma Shalizi considers the differences and similarities between statistics and data science.

    If people want to call those who do such jobs "data scientists" rather than "statisticians" because it sounds more dignified, or gets them more money, or makes them easier to hire, then more power to them. If they want to avoid the suggestion that you need a statistics degree to do this work, they have a point but it seems a clumsy way to make it. If, however, the name "statistician" is avoided because that connotes not a powerful discipline which transforms profound ideas about learning from experience into practical tools, but rather, a meaningless conglomeration of rituals better conducted with twenty-sided dice, then we as a profession have failed ourselves and, more importantly, the public, and the blame lies with us. Since what we have to offer is really quite wonderful, we should not let that happen.

    Some time during the past couple of years, statistics became data science's older, more boring sibling that always plays by the rules. There are a lot of statisticians who now call themselves data scientists. I still call myself a statistician.

    But I think we're getting closer to that part in the movie when the older, more stuffy character learns from the young whippersnapper that loosening up could be a good thing, and when the young one realizes that some elbow grease and tradition can go a long way.

  • Computing for data analysis

    July 24, 2012  |  Statistics

    If you want to learn visualization, you should learn data. To learn data, you should learn statistics. Where to begin? The free analysis courses offered on Coursera by Johns Hopkins professors are probably a good place to start. Currently available: Computing for Data Analysis with biostatistics professor Roger D. Peng and Data Analysis with Jeff Leek, also a biostatistics professor.

    There's also a handful of data-related courses from other university professors that might be worth a look.

  • How consumers suck at math

    July 18, 2012  |  Statistics

    Derek Thompson for The Atlantic on how retail uses our numeric biases to their advantage:

    Now that I've just told you that consumers try to avoid additional payments, I should add that there are two additional payments we love: rebates and warranties. The first buys the illusion of wealth ("I'm being paid money to spend money!"). The second buys peace of mind ("Now I can own this thing forever without worrying about it!"). Both are basically tricks. "Instead of buying something and getting a rebate," Poundstone writes, "why not just pay a lower price in the first place?"

    "[Warranties] make no rational sense," Harvard economist David Cutler told the Washington Post. "The implied probability that [a product] will break has to be substantially greater than the risk that you can't afford to fix it or replace it. If you're buying a $400 item, for the overwhelming number of consumers that level of spending is not a risk you need to insure under any circumstances."

    Other tidbits: our obsession with prices ending with a nine and how we justify purchases of things that are more expensive but aren't necessarily better than the cheaper item.

  • Data plural versus data singular

    July 12, 2012  |  Statistics

    Kevin Drum on data is or data are:

    Now, I know that lots of people continue to foolishly disagree with me about this, but I'm curious how far they're willing to push things. If you had, say, five bits of information, would you say I only have five data? If you really, truly believe that data is a plural noun, you'd have no problem with this. But does anyone actually do it?

    This was in response to the Wall Street Journal's style guy saying that they can go either way, as the word has evolved to also mean a singular collection of numbers.

    Here's what the New York Times style guide has to say about it:

    [D]ata is acceptable as a singular term for information: The data was persuasive. In its traditional sense, meaning a collection of facts and figures, the noun can still be plural: They tabulate the data, which arrive from bookstores nationwide. (In this sense, the singular is datum, a word both stilted and deservedly obscure.)

    I say data is. The plural version sounds weird to me.

  • Soda versus pop on Twitter

    July 9, 2012  |  Statistics

    Soda vs pop on Twitter

    Edwin Chen, a data scientist at Twitter, explored the geographic differences in language usage of soda, pop, and coke. We've seen this before, so it shouldn't be surprising to see that in the United States soda is dominant on the coasts, pop in the midwest, and coke in the southeast. The global view is new, with coke basically penetrating almost all of Europe.

    What I think is most interesting though is the idea of tweets and status updates as data that represents cultures. There are applications that keep track of tweet volume, number of replies, and when the best time to share a link is, but in ten years none of that will matter. These miniature data time capsules on the other hand will be worth another look.

  • Mitt Romney pseudo-venn diagram, used incorrectly

    July 8, 2012  |  Mistaken Data

    Promise gap venn diagram

    The Mitt Romney campaign put this venn diagram up a few days ago, aiming to show the "promise gap." On the left is an Obama promise, and on the right is the result. In the middle, the combination of the promise and the result, is the gap. Wait, that's not right.

  • How long it takes to get pregnant

    July 3, 2012  |  Statistics

    Probability of conception by month and age

    The odds of getting pregnant after a certain time trying are surprisingly hard to come by. There are statistics here and there, but none provide a good overview of the probabilities. Mathematician Richie Cotton crunched some numbers using monthly fecundity rate (the monthly chance of getting pregnant) to estimate about how long it would take for him and his girlfriend to conceive.

    [A]lmost half of the (healthy) 25 year olds get pregnant in the first month, and after two years (the point when doctors start considering you to have fertility problems) more than 90% of 35 year olds should conceive. By contrast, just over 20% of 45 year old women will. In fact, even this statistic is over-optimistic: at this age, fertility is rapidly decreasing, and a 1% MFR at age 45 will mean a much lower MFR at age 47 and the negative binomial model breaks down.

    Obviously, there are other factors to consider like male fertility and how often a couple has sex, but there you go.
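
    The 45-year-old figure in the quote is easy to check with a rough sketch. I'm assuming a constant monthly fecundity rate and independent months, which is the simplest version of this kind of model and not necessarily Cotton's exact calculation. The 1% rate and the two-year window come from the quote; the function name is made up for illustration.

```python
def chance_within(months, mfr):
    """Chance of conceiving within `months`, assuming a constant monthly
    fecundity rate `mfr` and independent months (a simplifying assumption)."""
    # Probability of at least one success = 1 - probability of no success at all
    return 1 - (1 - mfr) ** months


# 1% MFR at age 45 over two years, both taken from the quote above
p_45 = chance_within(24, 0.01)
print(f"Age 45, 1% MFR, 24 months: {p_45:.1%}")
```

    That lands at about 21 percent, which matches the "just over 20%" in the quote. As Cotton notes, the constant-rate assumption is too generous at that age.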

    [via Revolutions]

  • Synchronized Swimming in Data and the Water Metaphor

    June 21, 2012  |  Statistics

    The flood. The avalanche. The tsunami. Drowning in data. For the past few years, a couple of times a week, there's an article about all the data we have access to and how we're struggling to stay afloat in the growing sea of data. Big data is getting too big, they say.

    The water metaphor is fine, but the fear of the data flow is irrational, so let's swim with the former.

  • Analysis of chords used in popular songs

    June 20, 2012  |  Statistics

    Chord usage

    Hooktheory, a system for learning to write music, analyzed 1,300 popular songs for how chords were used. The above shows chords that followed an E minor chord.

    This result is striking. If you write a song in C with an E minor in it, you should probably think very hard if you want to put a chord that is anything other than an A minor chord or an F major chord. For the songs in the database, 93% of the time one of these two chords came next.

    The most common chords used overall were G, F, and C.

    [via Waxy]

    Update: See also this great musical sketch by Axis of Awesome in which they sing some 40 songs that use the same four chords. [Thanks, Jan]

  • Why robbing banks for a living is a bad idea

    June 18, 2012  |  Statistics

    In an article for Significance Magazine, economists Barry Reilly, Neil Rickman and Robert Witt explain why robbing banks stinks as a profession.

    The return on an average bank robbery is, frankly, rubbish. It is not unimaginable wealth. It is a very modest £12 706.60 per person per raid. Indeed, it is so low that it is not worth the banks’ while to spend as little as £4500 per cashier position at every branch on rising screens to deter them.

    A single bank raid, even a successful one, is not going to keep our would-be robber in a life of luxury. It is not going to keep him long in a life of any kind. Given that the average UK wage for those in full-time employment is around £26 000, it will give him a modest lifestyle for no more than 6 months. If he decides to make a career of it, and robs two banks a year to make a sub-average income, his chances of eventually getting caught will increase: at 0.8 probability per raid, after three raids or a year and a half his odds of remaining at large are 0.8×0.8×0.8=0.512; after four raids he is more likely than not to be inside. As a profitable occupation, bank robbery leaves a lot to be desired.
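
    The arithmetic in that passage is simple enough to sketch. I'm reading the quoted 0.8 as the per-raid chance of getting away, since that is how the multiplication uses it, and I'm assuming raids are independent. The haul and wage figures are the ones quoted; the variable and function names are mine.

```python
HAUL_PER_PERSON = 12706.60   # average take per person per raid, in pounds (quoted)
AVERAGE_UK_WAGE = 26000.00   # approximate full-time annual wage, in pounds (quoted)
P_FREE_PER_RAID = 0.8        # chance of not being caught on a single raid (my reading of the quote)


def odds_at_large(raids, p_free=P_FREE_PER_RAID):
    """Chance of still being free after `raids` independent raids."""
    return p_free ** raids


# How many months of average-wage living one raid pays for
months_covered = HAUL_PER_PERSON / AVERAGE_UK_WAGE * 12
print(f"One raid covers about {months_covered:.1f} months at the average wage")

for raids in range(1, 5):
    print(f"After {raids} raid(s): {odds_at_large(raids):.4f} chance of remaining at large")
```

    One raid covers just under six months at the average wage, and the odds of staying free drop below one half at the fourth raid, in line with the authors' numbers.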

    Be sure to read the full article for more details on the varying gains and losses when the team is bigger and whether or not a gun is used. Spoiler: an additional member to the robbing team raises the expected haul by about £9,000, and the use of a firearm raises the expected output by about £10,000. Just don't get arrested.

    [via Ars Technica]

  • This American Life on Blackjack

    June 15, 2012  |  Statistics

    The newest episode of This American Life is on the game of Blackjack. Years ago one summer, I was a recent college graduate with a degree in engineering and a minor in statistics, making seven bucks and some change an hour and waiting for grad school to start. My idle mind grew obsessed with card counting. It didn't work out so well, but needless to say I found this episode fascinating.

  • Profile of the Facebook Data Science Team

    June 14, 2012  |  Statistics

    MIT Technology Review profiles the Facebook Data Science Team, described as a gathering of grad students at a top school and headed by Cameron Marlow, the "young professor."

    Back at Facebook, Marlow isn't the one who makes decisions about what the company charges for, even if his work will shape them. Whatever happens, he says, the primary goal of his team is to support the well-being of the people who provide Facebook with their data, using it to make the service smarter. Along the way, he says, he and his colleagues will advance humanity's understanding of itself. That echoes Zuckerberg's often doubted but seemingly genuine belief that Facebook's job is to improve how the world communicates. Just don't ask yet exactly what that will entail. "It's hard to predict where we'll go, because we're at the very early stages of this science," says Marlow. "The number of potential things that we could ask of Facebook's data is enormous."

  • Losing American Community Survey would be ‘disastrous’

    June 11, 2012  |  Data Sources

    Many want to get rid of the American Community Survey, a Census program which releases region-specific data annually. University of Michigan professor William Frey explains why cutting the survey would be a mistake.

  • Reverse engineering targeted emails from 2012 Campaign

    May 31, 2012  |  Statistics

    After noticing the Obama campaign was sending variations of an email to voters, ProPublica identified six distinct types with certain demographics and showed the differences. It was called the Message Machine. Now ProPublica is taking it a step further, hoping to dissect every email from all 2012 campaigns.

    Today, we are relaunching the Message Machine, and expanding it from handling just one mailing to handling every email from all of the campaigns in the 2012 election. It will seek a broad understanding in real time of the new and sophisticated ways modern campaigns are targeting voters.

    It's a big puzzle, and to solve it, we need a big sample of political emails, and an understanding of who received them. That’s where you come in.

    If you get campaign emails on any subject — donation, get-out-the-vote, volunteering, events, etc. — just forward them to [email protected] using your email program's standard forwarding feature. Nothing fancier than that needed.

    Way cool. Although I bet there will be a lot of noise, especially from the smaller, less data-savvy campaigns.

  • Hans Rosling one-minute TED talk

    May 30, 2012  |  Statistics

    Screw the sword swallowing and giant screen of moving bubbles. Just get Rosling a handful of rocks and he draws a crowd.

    [via infosthetics]

  • Why are so many men pregnant?

    May 17, 2012  |  Mistaken Data  |  Kim Rees

    Garbage in, garbage out, the old adage goes. Nigel Hawkes, Director of Straight Statistics, describes a sort of statistical whistleblowing letter to the British Medical Journal.

    A team from Imperial College found that in 2009-10, nearly 20,000 adults were coded as having attended paediatric outpatient services, and 3,000 patients under 19 were apparently treated in geriatric clinics. Even more striking, between 15,000 and 20,000 men have been admitted to obstetric wards each year since 2003, and almost 10,000 to gynaecology wards.

    It's hard to put your faith in analysis, visualization, policy, and anything else that comes out of data with reports like these. With human error being a known issue, we have to find better ways of inputting and double-checking data. Unfortunate mistakes at the outset only lead to bigger problems down the line.

  • A Future Without Key Social and Economic Statistics for the Country

    May 13, 2012  |  Data Sources

    Robert Groves, director of the U.S. Census Bureau, on the Appropriations Bill:

    The Appropriations Bill eliminates the Economic Census, which measures the health of our economy. It terminates the American Community Survey, which produces the social and demographic information that monitors the impact of economic trends on communities throughout the country. It halts crucial development of ways to save money on the next decennial census. In the last three years the Census Bureau has reacted to budget and technological challenges by mounting aggressive operational efficiency programs to make these key statistical cornerstones of the country more cost efficient. Eliminating them halts all the progress to build 21st century statistical tools through those innovations. This bill thus devastates the nation’s statistical information about the status of the economy and the larger society.

    A lot of the negative comments following the post are from people who have never used Census data, or any substantial amount of data for that matter, and have no clue how a dataset can feed into a model to make other estimates. Then there are the people who don't want to answer questions about their toilets. I wonder what their Facebook profiles look like.

  • TV anachronisms

    May 11, 2012  |  Statistics

    Modern to period use ratio

    Princeton history graduate student Benjamin Schmidt explores changes in language through TV anachronisms. In Schmidt's most recent analysis, he examines Megan's use of "callback" in the last episode of Mad Men. Above is the ratio of modern use to period use. Notice callback sticking out in the top left.

    The big one from the charts: Megan gets "a callback for" an audition. This is, the data says, a candidate for the worst anachronism of the season. The word "callback" is about 100x more common by the 1990s, and "callback for" is even worse. The OED doesn't have any examples of a theater-oriented use of "callback" until the 1970s; although I bet one could find some examples somewhere earlier in the New York theater scene, that may not save it. It wouldn't really suit Megan's generally dilettantish attitude towards the theater, or the office staff's lack of knowledge of it, for them to be so au courant. "call-back" and "call back" don't seem much more likely.

    Other anachronisms include the use of "pay phone" and a frequent use of "on the phone with" which didn't peak until the 1970s.

    Don't miss the look into Downton Abbey anachronisms. Also, more details from Schmidt on his methodology.

    [via Revolutions]

  • Why the American Community Survey is worth keeping

    May 10, 2012  |  Statistics

    Jerzy Wieczorek, a statistician with the U.S. Census Bureau, explains why the American Community Survey is worthwhile.

    Besides the direct estimates from the ACS itself, the Census Bureau uses ACS data as the backbone of several other programs. For example, the Small Area Income and Poverty Estimates program provides annual data to the Department of Education for use in allocating funds to school districts, based on local counts and rates of children in poverty. Without the ACS we would be limited to using smaller surveys (and thus less accurate information about poverty in each school district) or older data (which can become outdated within a few years, such as during the recent recession). Either way, it would hurt our ability to allocate resources fairly to schoolchildren nationwide.

    Similarly, the Census Bureau uses the ACS to produce other timely small-area estimates required by Congressional legislation or requested by other agencies: the number of people with health insurance, people with disabilities, minority language speakers, etc. The legislation requires a data source like the ACS not only so that it can be carried out well, but also so its progress can be monitored.

Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.