The newest episode of This American Life is on the game of Blackjack. Years ago one summer, I was a recent college graduate with a degree in engineering and a minor in statistics, making seven bucks and some change an hour and waiting for grad school to start. My idle mind grew obsessed with card counting. It didn't work out so well, but needless to say I found this episode fascinating.
MIT Technology Review profiles the Facebook Data Science Team, described as a gathering of grad students at a top school and headed by Cameron Marlow, the "young professor."
Back at Facebook, Marlow isn't the one who makes decisions about what the company charges for, even if his work will shape them. Whatever happens, he says, the primary goal of his team is to support the well-being of the people who provide Facebook with their data, using it to make the service smarter. Along the way, he says, he and his colleagues will advance humanity's understanding of itself. That echoes Zuckerberg's often doubted but seemingly genuine belief that Facebook's job is to improve how the world communicates. Just don't ask yet exactly what that will entail. "It's hard to predict where we'll go, because we're at the very early stages of this science," says Marlow. "The number of potential things that we could ask of Facebook's data is enormous."
Many want to get rid of the American Community Survey, a Census program which releases region-specific data annually. University of Michigan professor William Frey explains why cutting the survey would be a mistake.
After noticing the Obama campaign was sending variations of the same email to different voters, ProPublica identified six distinct versions, matched them to recipient demographics, and showed the differences. It was called the Message Machine. Now ProPublica is taking it a step further, hoping to dissect every email from every 2012 campaign.
Today, we are relaunching the Message Machine, and expanding it from handling just one mailing to handling every email from all of the campaigns in the 2012 election. It will seek a broad understanding in real time of the new and sophisticated ways modern campaigns are targeting voters.
It's a big puzzle, and to solve it, we need a big sample of political emails, and an understanding of who received them. That’s where you come in.
If you get campaign emails on any subject — donation, get-out-the-vote, volunteering, events, etc. — just forward them to [email protected] using your email program's standard forwarding feature. Nothing fancier than that needed.
Way cool. Although I bet there will be a lot of noise, especially from the smaller, less data-savvy campaigns.
Screw the sword swallowing and giant screen of moving bubbles. Just get Rosling a handful of rocks and he draws a crowd.
A team from Imperial College found that in 2009-10, nearly 20,000 adults were coded as having attended paediatric outpatient services, and 3,000 patients under 19 were apparently treated in geriatric clinics. Even more striking, between 15,000 and 20,000 men have been admitted to obstetric wards each year since 2003, and almost 10,000 to gynaecology wards.
It's hard to put your faith in analysis, visualization, policy, and anything else that comes out of data with reports like these. With human error being a known issue, we have to find better ways of inputting and double-checking data. Unfortunate mistakes at the outset only lead to bigger problems down the line.
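A minimal sketch of the kind of automated consistency check that would catch records like the ones above: flag any record whose age or sex conflicts with the specialty it was coded under. The field names and rules here are hypothetical, just to illustrate the idea.

```python
# Hypothetical consistency rules: each specialty gets a predicate that a
# plausible record should satisfy.
RULES = {
    "paediatrics": lambda r: r["age"] < 19,
    "geriatrics":  lambda r: r["age"] >= 65,
    "obstetrics":  lambda r: r["sex"] == "F",
    "gynaecology": lambda r: r["sex"] == "F",
}

def flag_inconsistent(records):
    """Return records that violate the age/sex rule for their specialty."""
    return [r for r in records
            if r["specialty"] in RULES and not RULES[r["specialty"]](r)]

# Toy records mimicking the errors in the Imperial College findings
records = [
    {"id": 1, "age": 45, "sex": "M", "specialty": "paediatrics"},  # adult in paeds
    {"id": 2, "age": 8,  "sex": "F", "specialty": "paediatrics"},  # fine
    {"id": 3, "age": 30, "sex": "M", "specialty": "obstetrics"},   # man in obstetrics
]

print([r["id"] for r in flag_inconsistent(records)])  # [1, 3]
```

Checks like this don't fix sloppy data entry, but run at intake they would at least surface the 20,000 adults coded into paediatric clinics before the numbers propagate downstream.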
Robert Groves, director of the U.S. Census Bureau, on the Appropriations Bill:
The Appropriations Bill eliminates the Economic Census, which measures the health of our economy. It terminates the American Community Survey, which produces the social and demographic information that monitors the impact of economic trends on communities throughout the country. It halts crucial development of ways to save money on the next decennial census. In the last three years the Census Bureau has reacted to budget and technological challenges by mounting aggressive operational efficiency programs to make these key statistical cornerstones of the country more cost efficient. Eliminating them halts all the progress to build 21st century statistical tools through those innovations. This bill thus devastates the nation’s statistical information about the status of the economy and the larger society.
A lot of the negative comments following the post are from people who have never used Census data, or any substantial amount of data for that matter, and have no clue how a dataset can feed into a model to make other estimates. Then there are the people who don't want to answer questions about their toilets. I wonder what their Facebook profiles look like.
Princeton history graduate student Benjamin Schmidt explores changes in language through TV anachronisms. In Schmidt's most recent analysis, he examines Megan's use of "callback" in the last episode of Mad Men. Above is the ratio of modern use to period use. Notice callback sticking out in the top left.
The big one from the charts: Megan gets "a callback for" an audition. This is, the data says, a candidate for the worst anachronism of the season. The word "callback" is about 100x more common by the 1990s, and "callback for" is even worse. The OED doesn't have any examples of a theater-oriented use of "callback" until the 1970s; although I bet one could find some examples somewhere earlier in the New York theater scene, that may not save it. It wouldn't really suit Megan's generally dilettantish attitude towards the theater, or the office staff's lack of knowledge of it, for them to be so au courant. "call-back" and "call back" don't seem much more likely.
Other anachronisms include the use of "pay phone" and a frequent use of "on the phone with" which didn't peak until the 1970s.
Jerzy Wieczorek, a statistician with the U.S. Census Bureau, explains why the American Community Survey is worthwhile.
Besides the direct estimates from the ACS itself, the Census Bureau uses ACS data as the backbone of several other programs. For example, the Small Area Income and Poverty Estimates program provides annual data to the Department of Education for use in allocating funds to school districts, based on local counts and rates of children in poverty. Without the ACS we would be limited to using smaller surveys (and thus less accurate information about poverty in each school district) or older data (which can become outdated within a few years, such as during the recent recession). Either way, it would hurt our ability to allocate resources fairly to schoolchildren nationwide.
Similarly, the Census Bureau uses the ACS to produce other timely small-area estimates required by Congressional legislation or requested by other agencies: the number of people with health insurance, people with disabilities, minority language speakers, etc. The legislation requires a data source like the ACS not only so that it can be carried out well, but also so its progress can be monitored.
Last month Republicans were pushing a bill to get rid of the American Community Survey, an 11-page questionnaire about housing, education, and other topics. Yesterday, a bill to cut the survey passed in a 232 to 190 vote.
Republicans, acknowledging its usefulness, attacked the survey as an unconstitutional invasion of privacy, arguing that the government has no business knowing how many flush toilets someone has, for instance.
"It would seem that these questions hardly fit the scope of what was intended or required by the Constitution," said Rep. Daniel Webster (R-Fla.), author of the amendment.
"This survey is inappropriate for taxpayer dollars," Webster added. "It's the definition of a breach of personal privacy. It's the picture of what's wrong in Washington, D.C. It's unconstitutional."
The ACS is the picture of what's wrong in Washington? This is idiocy.
Thanks to the Internet Archive and CNN, thirteen years of transcripts, about a gigabyte compressed, are available to download as one file.
For over a decade, CNN (Cable News Network) has been providing transcripts of shows, events and newscasts from its broadcasts. The archive has been maintained and the text transcripts have been dependably available at transcripts.cnn.com. This is a just-in-case grab of the years of transcripts for later study and historical research.
Changes in news coverage and CNN's focus over the years, anyone?
I've been reading papers on how people learn statistics (and thoughts on teaching the subject) and came across the frequently-cited work of mathematical psychologists Amos Tversky and Daniel Kahneman. In 1972, they studied statistical misconceptions. It doesn't seem much has changed. Joan Garfield (1995) summarizes in How Students Learn Statistics [pdf].
It was bound to happen at some point. Doctor and statistician Hans Rosling, best known for his sword-swallowing TED talk, among plenty of other things, made the Time 100 Most Influential list this year.
What does Rosling make of his statistical analysis of worldwide trends? "I am not an optimist," he says. "I'm a very serious possibilist. It's a new category where we take emotion apart and we just work analytically with the world." We can all, Rosling thinks, become healthy and wealthy. What a promising thought, so eloquently rendered with data.
Five and a half years ago, Netflix offered data and a $1 million prize to anyone who could improve their recommendation system by at least ten percent. In 2009, a statistics team at AT&T Labs, BellKor, did just that. Unfortunately, Netflix never integrated the winning algorithm into production.
If you followed the Prize competition, you might be wondering what happened with the final Grand Prize ensemble that won the $1M two years later. This is a truly impressive compilation and culmination of years of work, blending hundreds of predictive models to finally cross the finish line. We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment. Also, our focus on improving Netflix personalization had shifted to the next level by then.
That's too bad. Netflix knows their business better than anyone, but I sure wish Keeping Up with the Kardashians wasn't listed in my top 10 right now.
George E.P. Box, a statistician known for his body of work in time series analysis and Bayesian inference (and his quotes), recounts how he became a statistician while trying to solve actual problems. He was a 19-year-old college student studying chemistry. Instead of finishing, he joined the army, fed up with how little the British government was doing to stop Hitler.
Before I could actually do any of that I was moved to a highly secret experimental station in the south of England. At the time they were bombing London every night and our job was to help to find out what to do if, one night, they used poisonous gas.
Some of England's best scientists were there. There were a lot of experiments with small animals, I was a lab assistant making biochemical determinations, my boss was a professor of physiology dressed up as a colonel, and I was dressed up as a staff sergeant.
The results I was getting were very variable and I told my colonel that what we really needed was a statistician.
He said "we can't get one, what do you know about it?" I said "Nothing, I once tried to read a book about it by someone called R. A. Fisher but I didn't understand it". He said "You've read the book so you better do it", so I said, "Yes sir".
Box eventually worked with Fisher, studied under E. S. Pearson in college after his discharge from the army, and started the Statistical Techniques Research Group at Princeton at the insistence of one John Tukey.
After seeing his friend's CoinStar receipt for 27 pounds of coins that equaled $256.14, Dan Kozikowski dug deeper and estimated what a pound of change is worth, on average.
Now, to finish out the analysis, let's tie this back to weight. Fortunately, the U.S. Mint standardizes and publishes the weight of each coin here. With that in hand... drumroll please... we’d expect about 34.9 quarters, 19.8 dimes, 11.5 nickels, and 61.2 pennies in a New York pound of coins, for a total value of $12.00. A Boston pound is worth slightly less, at $11.81.
I love it when people analyze the everyday. (Although I'm sure CoinStar looks at distributions like this all the time for storage supply something or other.)
Alas, the coin distribution of Kozikowski's friend didn't quite match the estimate, as shown in the graph above. He attributes the difference to the friend spending quarters, dimes, and nickels before going to the CoinStar. But only the quarters came up short; dimes and nickels showed up at almost twice the expected count, so the model needs some refining.
Beyond the decennial population count, the United States Census Bureau collects information every year about how people in the country live through the American Community Survey. It's an eleven-page survey [pdf] that asks about your housing situation, education, and job, and 60 Republican members of Congress want to make this currently mandatory survey optional.
The ACS will reach 3.5 million households this year, using dozens of detailed questions—including asking about a household's use of flush toilets, wood fuel and carpools—to determine the need for various government programs. The survey's mandatory status, along with telephone and in-person follow-ups to initial mailings, helps keep response rates near 100%.
Now, 60 Republican members of Congress, including presidential candidate and Texas Rep. Ron Paul, are challenging the survey's mandatory status, with a bill that would make it voluntary to complete the ACS. The push is fueled by privacy concerns and the very detailed nature of the questions.
Whoa. What did I just read?
I think most of you know of Freakonomics, but in case you don't, it started as a book in 2005, by economist Steven Levitt and journalist Stephen Dubner. The book examines corners of life (like cheating in sumo) through data. It's a good read. SuperFreakonomics was the follow-up in 2009. Freakonomics has since grown up into a media company, complete with documentary, radio show, and blog. Needless to say, it's had a lot of success.
In the latest issue of American Scientist, statisticians Kaiser Fung and Andrew Gelman wrote a strong critique of Levitt and Dubner's work.
In our analysis of the Freakonomics approach, we encountered a range of avoidable mistakes, from back-of-the-envelope analyses gone wrong to unexamined assumptions to an uncritical reliance on the work of Levitt’s friends and colleagues. This turns accessibility on its head: Readers must work to discern which conclusions are fully quantitative, which are somewhat data driven and which are purely speculative.
Fung and Gelman then cite examples that they believe are erroneous.
It's not mean-spirited, but Gelman has a way of offending even if he doesn't mean to, so I knew a third of the way through that this could not end well.
Dubner replied. (Skip part II, which addresses a different issue that shouldn't have been an issue in the first place.) He assesses — after explaining why almost everything that Fung and Gelman wrote is wrong — that they were blinded by their desire to disprove.
[O]nce they'd picked up a hammer, did everything look like a nail?
I can certainly understand why Freakonomics is an appealing target for someone like Gelman-Fung. As I noted earlier, there are strong incentives to attack, particularly in the public sphere, where one can get a ton of attention in a blink by assailing the reputation of someone who's been plugging away for years. Whether in the academy, the media, the political arena, or elsewhere, public discourse these days often seems little more than a tit-for-tat game in which you wait for someone or something to achieve a certain momentum and then shout as loudly as you can that it’s "wrong!" Or, in written form: Epic fail.
I've only read the first book, which, like I said, was a good read, so I can't side with either camp, but Dubner provides some compelling arguments, and I have a feeling most people will believe him.
From Gizmodo, this shows battery size in the new iPad versus that of the iPad 2. The battery in the former is 70 percent bigger than that of the latter. Something's not right here.