I knew things were bad, but I didn't know they were this bad. Obama has his work cut out for him. [Thanks, @adamsinger]
By way of Rafa Irizarry from Simply Statistics, a plot of Nate Silver's probabilities for Barack Obama winning a state versus the percentage of vote in each state, as of midnight EST.
I guess that's pretty (100%) good. Looks like the folks at Princeton didn't do half bad either. It's a win for Obama and a win for statistics. Well, good statistics, at least. (Looking at you, University of Colorado.)
Political analyst and statistician Nate Silver has gotten some flak lately for consistently projecting a 70-plus percent chance of a Barack Obama win this election. But as Jeff Leek explains, the criticism doesn't stem from Silver being wrong. Rather, it comes from the critics' misunderstanding of statistics. Leek provides a quick lesson on how Silver makes his predictions and how the methods apply to other things, like the weather.
Now, this might seem like a goofy way to come up with a "percent chance" with simulated elections and all. But it turns out it is actually a pretty important thing to know and relevant to those of us on the East Coast right now. It turns out weather forecasts (and projected hurricane paths) are based on the same sort of thing — simulated versions of the weather are run and the "percent chance of rain" is the fraction of times it rains in a particular place.
So Romney may still win and Obama may lose — and Silver may still get a lot of it right. But regardless, the approach taken by Silver is not based on politics, it is based on statistics.
Don't fear the black box.
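To make the "simulated elections" idea concrete, here's a minimal sketch. The state names, electoral votes, and win probabilities are invented for illustration and are not Silver's numbers. It also assumes each state is decided independently, which real forecasting models do not.

```python
import random

# Hypothetical toy map: state -> (electoral votes, chance the candidate wins it).
# These values are made up for illustration only.
TOY_STATES = {
    "A": (10, 0.90),
    "B": (8, 0.55),
    "C": (6, 0.50),
    "D": (5, 0.20),
}

def win_chance(states, votes_needed, n_sims=10_000, seed=0):
    """Fraction of simulated elections in which the candidate reaches votes_needed."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same answer
    wins = 0
    for _ in range(n_sims):
        # Flip a weighted coin per state and add up the electoral votes won.
        total = sum(votes for votes, p in states.values() if rng.random() < p)
        if total >= votes_needed:
            wins += 1
    return wins / n_sims

# 29 votes on the toy map, so 15 is a majority.
print(win_chance(TOY_STATES, votes_needed=15))
```

The "percent chance" is just the share of simulated elections the candidate wins, the same way a "percent chance of rain" is the share of simulated days on which it rains.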
The Los Angeles Times released nearly 5,000 records of allegations from the Boy Scouts of America as a browseable map and searchable list. You can also download the data.
This database contains information on about 5,000 men and a handful of women who were expelled from the Boy Scouts of America between 1947 and January 2005 on suspicion of sexual abuse. The dots on the map indicate the location of troops connected in some way to the accused. The timeline below shows the volume of cases opened by year; however, an unknown number of files were purged by the Scouts prior to the early 1990s.
The interactive map helps you narrow down by city, but it's kind of hard to see cases on a country-wide perspective. Here's a quick look.
The worst part is that a lot of the cases went unreported.
How many people does it take for there to be a 50% chance that a pair in the group has the same birthday? Only 23 people. What about a 99% chance? Maybe even more shocking: 57 people. This is the birthday problem, which every undergrad who's taken a stat course has seen. Steven Strogatz explains the logic and calculations.
Intuitively, how can 23 people be enough? It’s because of all the combinations they create, all the opportunities for luck to strike. With 23 people, there are 253 possible pairs of people (see the notes for why), and that turns out to be enough to push the odds of a match above 50 percent.
Incidentally, if you go up to 43 people — the number of individuals who have served as United States president so far — the odds of a match increase to 92 percent. And indeed two of the presidents do have the same birthday: James Polk and Warren Harding were both born on Nov. 2.
The Johnny Carson clip referenced in the article is worth watching. Carson tries to test the results with the audience, but goes about it the wrong way.
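If you want to check the birthday numbers yourself, here's a quick sketch. It uses the usual textbook assumptions, which are mine and not spelled out in the article: 365 equally likely birthdays, no leap days, and no twins in the room.

```python
from math import comb

def shared_birthday_chance(n, days=365):
    """Chance that at least two of n people share a birthday."""
    p_all_distinct = 1.0
    for k in range(n):
        # Person k+1 must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct

def smallest_group(target, days=365):
    """Smallest group size whose chance of a match reaches target."""
    n = 1
    while shared_birthday_chance(n, days) < target:
        n += 1
    return n

print(comb(23, 2))                            # number of pairs among 23 people
print(smallest_group(0.50))                   # group size for a 50% chance
print(smallest_group(0.99))                   # group size for a 99% chance
print(round(shared_birthday_chance(43), 2))   # chance of a match among 43 people
```

The trick is to compute the chance that nobody matches and subtract from one, which is much easier than counting matches directly.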
I'm so glad there are people like Jake Porway in the world. The founder and executive director of DataKind gives his quick pitch on "using data in the service of humanity."
Thomas H. Davenport and D.J. Patil give the rundown on what a data scientist is, what to look for and how to hire them. It's an article in Harvard Business Review, so it's geared towards managers, and I felt like I was reading a horoscope at times, but there are some interesting tidbits in there.
Data scientists don’t do well on a short leash. They should have the freedom to experiment and explore possibilities. That said, they need close relationships with the rest of the business. The most important ties for them to forge are with executives in charge of products and services rather than with people overseeing business functions. As the story of Jonathan Goldman illustrates, their greatest opportunity to add value is not in creating reports or presentations for senior executives but in innovating with customer-facing products and processes.
I still call myself a statistician. The main difference between data scientist and statistician seems to be programming skills, but if you're doing statistics without code, I'm not sure what you're doing (other than theory).
Update: This recent panel from DataGotham also discusses the data scientist hiring process. [Thanks, Drew]
Nate Silver says the weatherman is not a moron.
Still, most people take their forecasts for granted. Like a baseball umpire, a weather forecaster rarely gets credit for getting the call right. Last summer, meteorologists at the National Hurricane Center were tipped off to something serious when nearly all their computer models indicated that a fierce storm was going to be climbing the Northeast Corridor. The eerily similar results between models helped the center amplify its warning for Hurricane Irene well before it touched down on the Atlantic shore, prompting thousands to evacuate their homes. To many, particularly in New York, Irene was viewed as a media-manufactured nonevent, but that was largely because the Hurricane Center nailed its forecast. Six years earlier, the National Weather Service also made a nearly perfect forecast of Hurricane Katrina, anticipating its exact landfall almost 60 hours in advance. If public officials hadn’t bungled the evacuation of New Orleans, the death toll might have been remarkably low.
I like the bit later in the article that describes the number crunching machine and how humans are involved in the analysis. The National Weather Service has heavy-duty computing power to process data coming from weather stations across the country, but the computer is still bad at doing a lot of things.
To most people, statistics means plugging numbers into an advanced calculator that spits out values, without much thought involved. Those people don't work with data.
Nancy Lublin, CEO of Do Something, gives a five-minute TED talk on the potential in analyzing text messages. During a texting campaign, Do Something started to receive texts from troubled teenagers about problems that ranged from bullying to rape, which led to the organization's work in setting up a texting hotline. Lublin hopes that, once the system is built, the data gathered from these messages can be used as a census of problems, and perhaps in the same way that Target uses data to figure out if women are pregnant, but to save lives instead of figuring out what coupons to send.
Randal Heeb convinced a New York City judge that poker isn't a game of luck, using a 120-page report full of analysis and charts.
Judge Weinstein, relying on the research of Randal Heeb - an economist, statistician and poker player - found that while luck determines what cards a player gets, skill plays the bigger role in a player's ultimate success. With such charts as "Win-Rate Comparison: King Nine Offsuit," the 91-year-old judge delved into the complexities of the argument more thoroughly than any past court has. John Pappas was pleased with the result, and impressed with the methodology.
The king nine offsuit chart he refers to is above. Simulated earnings for better players are shown on the left in blue, and earnings for not so good players are shown on the right. Although both groups are likely to lose with the hand, skill appears to decrease expected losses.
Check out the full report here [pdf].
Edwin Chen, a data scientist at Twitter, took an in-depth look at what people are more inclined to tweet on Twitter and like on Facebook. He used FlowingData as his main data source, but also analyzed Quora, xkcd, and New Scientist. The main finding:
Twitter is still for the techies: articles where the number of tweets greatly outnumber FB likes tend to revolve around software companies and programming. Facebook, on the other hand, appeals to everyone else: yeah, to the masses, and to non-software technical folks in general as well.
I saw the analysis when it was posted over a year ago but never got around to sharing it. It crossed my desktop again recently. The results still seem to apply.
From a practical standpoint, I don't think about whether or not people are going to share something more on Twitter or Facebook before I post it. I just link to what I think is interesting. However, when I post something with a poop or fart joke in it (so half the time, basically), I make sure I share it on Facebook, which I have to do manually. Because you know, bowel movements have universal appeal.
Brian Ball and M. E. J. Newman analyzed friendship data from a high school and junior high, and found a hierarchy similar to the one in Mean Girls.
Here we analyze a large collection of such networks representing friendships among students at US high and junior-high schools and show that the pattern of unreciprocated friendships is far from random. In every network, without exception, we find that there exists a ranking of participants, from low to high, such that almost all unreciprocated friendships consist of a lower-ranked individual claiming friendship with a higher-ranked one.
So someone higher up on the totem pole had more people saying they were friends with him or her, but the popular one didn't necessarily feel the same.
I told my wife this, and her reaction was basically, "Uh, yeah. And?"
Fox News tried to show the change in the top tax rate if the Bush tax cuts expire, so they showed the rate now and what it'd be in 2013. Wow, it'll be around five times higher. Wait. No.
The value axis starts at 34 percent instead of zero, which you don't do with bar charts, because length is the visual cue. That is to say, when you look at this chart, you compare how high each bar is. Fox News might as well have started the vertical axis at 34.9 percent. That would've been more dramatic.
Here's what the bar chart is supposed to look like:
With a difference of 4.6 percentage points, the change doesn't look so crazy.
[via Effective Graphs]
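Here's the arithmetic behind the distortion, as a rough sketch. The two rates, 35 and 39.6 percent, are my assumption: the post only gives the 4.6-point difference and the axis starting at 34, but those rates fit both.

```python
def bar_length_ratio(low, high, baseline):
    """How many times longer the taller bar is drawn when the axis starts at baseline."""
    return (high - baseline) / (low - baseline)

# Assumed top tax rates (percent): now vs. 2013 if the cuts expire.
RATE_NOW, RATE_2013 = 35.0, 39.6

truncated = bar_length_ratio(RATE_NOW, RATE_2013, baseline=34.0)
honest = bar_length_ratio(RATE_NOW, RATE_2013, baseline=0.0)

print(f"Axis starts at 34: second bar drawn {truncated:.1f}x as tall")
print(f"Axis starts at 0:  second bar drawn {honest:.2f}x as tall")
```

Because the visual cue is bar length, the ratio of drawn lengths is what the eye reads, and it only matches the ratio of the actual values when the baseline is zero.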
Carnegie Mellon statistics professor Cosma Shalizi considers the differences and similarities between statistics and data science.
If people want to call those who do such jobs "data scientists" rather than "statisticians" because it sounds more dignified, or gets them more money, or makes them easier to hire, then more power to them. If they want to avoid the suggestion that you need a statistics degree to do this work, they have a point but it seems a clumsy way to make it. If, however, the name "statistician" is avoided because that connotes not a powerful discipline which transforms profound ideas about learning from experience into practical tools, but rather, a meaningless conglomeration of rituals better conducted with twenty-sided dice, then we as a profession have failed ourselves and, more importantly, the public, and the blame lies with us. Since what we have to offer is really quite wonderful, we should not let that happen.
Some time during the past couple of years, statistics became data science's older, more boring sibling that always plays by the rules. There are a lot of statisticians who now call themselves data scientists. I still call myself a statistician.
But I think we're getting closer to that part in the movie when the older, more stuffy character learns from the young whippersnapper that loosening up could be a good thing, and when the young one realizes that some elbow grease and tradition can go a long way.
If you want to learn visualization, you should learn data. To learn data, you should learn statistics. Where to begin? The free analysis courses offered on Coursera by Johns Hopkins professors are probably a good place to start. Currently available: Computing for Data Analysis with biostatistics professor Roger D. Peng and Data Analysis with Jeff Leek, also a biostatistics professor.
There's also a handful of data-related courses from other university professors that might be worth a look.
Derek Thompson for The Atlantic on how retail uses our numeric biases to their advantage:
Now that I've just told you that consumers try to avoid additional payments, I should add that there are two additional payments we love: rebates and warranties. The first buys the illusion of wealth ("I'm being paid money to spend money!"). The second buys peace of mind ("Now I can own this thing forever without worrying about it!"). Both are basically tricks. "Instead of buying something and getting a rebate," Poundstone writes, "why not just pay a lower price in the first place?"
"[Warranties] make no rational sense," Harvard economist David Cutler told the Washington Post. "The implied probability that [a product] will break has to be substantially greater than the risk that you can't afford to fix it or replace it. If you're buying a $400 item, for the overwhelming number of consumers that level of spending is not a risk you need to insure under any circumstances."
Other tidbits: our obsession with prices ending with a nine and how we justify purchases of things that are more expensive but aren't necessarily better than the cheaper item.
Kevin Drum on data is or data are:
Now, I know that lots of people continue to foolishly disagree with me about this, but I'm curious how far they're willing to push things. If you had, say, five bits of information, would you say I only have five data? If you really, truly believe that data is a plural noun, you'd have no problem with this. But does anyone actually do it?
This was in response to the Wall Street Journal's style guy saying that they can go either way, as the word has evolved to also mean a singular collection of numbers.
Here's what the New York Times style guide has to say about it:
[D]ata is acceptable as a singular term for information: The data was persuasive. In its traditional sense, meaning a collection of facts and figures, the noun can still be plural: They tabulate the data, which arrive from bookstores nationwide. (In this sense, the singular is datum, a word both stilted and deservedly obscure.)
I say data is. The plural version sounds weird to me.
Edwin Chen, a data scientist at Twitter, explored the geographic differences in language usage of soda, pop, and coke. We've seen this before, so it shouldn't be surprising to see that in the United States soda is dominant on the coasts, pop in the Midwest, and coke in the Southeast. The global view is new, with coke basically penetrating almost all of Europe.
What I think is most interesting though is the idea of tweets and status updates as data that represents cultures. There are applications that keep track of tweet volume, number of replies, and when the best time to share a link is, but in ten years none of that will matter. These miniature data time capsules on the other hand will be worth another look.
The Mitt Romney campaign put this Venn diagram up a few days ago, aiming to show the "promise gap." On the left is an Obama promise, and on the right is the result. In the middle, the combination of the promise and the result, is the gap. Wait, that's not right.
The odds of getting pregnant after a certain time trying are surprisingly hard to come by. There are statistics here and there, but none provide a good overview of the probabilities. Mathematician Richie Cotton crunched some numbers using monthly fecundity rate (the monthly chance of getting pregnant) to estimate about how long it would take for him and his girlfriend to conceive.
[A]lmost half of the (healthy) 25 year olds get pregnant in the first month, and after two years (the point when doctors start considering you to have fertility problems) more than 90% of 35 year olds should conceive. By contrast, just over 20% of 45 year old women will. In fact, even this statistic is over-optimistic: at this age, fertility is rapidly decreasing, and a 1% MFR at age 45 will mean a much lower MFR at age 47 and the negative binomial model breaks down.
Obviously, there are other factors to consider like male fertility and how often a couple has sex, but there you go.
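For a feel of how a monthly fecundity rate turns into a chance of conceiving within some number of months, here's a simplified sketch. It assumes a constant MFR and independent months, so waiting for the first success is a geometric model (the one-success case of the negative binomial mentioned in the quote). This is my simplification, not necessarily Cotton's exact calculation, and the only rate taken from the quote is the 1% MFR at age 45.

```python
def chance_within(mfr, months):
    """Chance of conceiving within the given number of months at a constant MFR."""
    # Complement of failing every single month.
    return 1.0 - (1.0 - mfr) ** months

def expected_months(mfr):
    """Average number of months until conception under the same constant-rate assumption."""
    return 1.0 / mfr

MFR_AT_45 = 0.01  # the 1% monthly rate quoted for age 45

print(f"Within two years at a 1% MFR: {chance_within(MFR_AT_45, 24):.1%}")
print(f"Expected wait at a 1% MFR: {expected_months(MFR_AT_45):.0f} months")
```

As the quote points out, the constant-rate assumption is the weak spot at older ages, since the rate itself keeps dropping while you wait.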