  • House votes to cut the American Community Survey

    May 10, 2012  |  Statistics

    Last month Republicans were pushing a bill to get rid of the American Community Survey, an 11-page questionnaire about housing, education, and other things. Yesterday, a bill passed to cut the survey in a 232 to 190 vote.

    Republicans, acknowledging its usefulness, attacked the survey as an unconstitutional invasion of privacy, arguing that the government has no business knowing how many flush toilets someone has, for instance.

    "It would seem that these questions hardly fit the scope of what was intended or required by the Constitution," said Rep. Daniel Webster (R-Fla.), author of the amendment.

    "This survey is inappropriate for taxpayer dollars," Webster added. "It's the definition of a breach of personal privacy. It's the picture of what's wrong in Washington, D.C. It's unconstitutional."

    The ACS is the picture of what's wrong in Washington? This is idiocy.

  • CNN transcript collection, 2000-2012

    May 9, 2012  |  Data Sources

    Thanks to the Internet Archive and CNN, thirteen years of transcripts, about a gigabyte compressed, are available to download as one file.

    For over a decade, CNN (Cable News Network) has been providing transcripts of shows, events and newscasts from its broadcasts. The archive has been maintained and the text transcripts have been dependably available at transcripts.cnn.com. This is a just-in-case grab of the years of transcripts for later study and historical research.

    Changes in news coverage and CNN's focus over the years, anyone?

    [via @A_L]

  • Common statistical fallacies

    May 3, 2012  |  Statistics

    I've been reading papers on how people learn statistics (and thoughts on teaching the subject) and came across the frequently-cited work of mathematical psychologists Amos Tversky and Daniel Kahneman. In 1972, they studied statistical misconceptions. It doesn't seem much has changed. Joan Garfield (1995) summarizes in How Students Learn Statistics [pdf].

  • Hans Rosling makes Time 100 Most Influential

    April 18, 2012  |  Statistics

    It was bound to happen at some point. Doctor and statistician Hans Rosling, best known for his sword-swallowing TED talk, among plenty of other things, made the Time 100 Most Influential list this year.

    What does Rosling make of his statistical analysis of worldwide trends? "I am not an optimist," he says. "I'm a very serious possibilist. It's a new category where we take emotion apart and we just work analytically with the world." We can all, Rosling thinks, become healthy and wealthy. What a promising thought, so eloquently rendered with data.

    [Thanks, wife]

  • Why $1m Netflix algorithm never went to production

    April 17, 2012  |  Statistics

    Five and a half years ago, Netflix offered data and a $1 million prize to improve their recommendation system by at least ten percent. In 2009, a statistics team at AT&T Labs, BellKor, did that. Unfortunately, Netflix never integrated the algorithm into production.

    If you followed the Prize competition, you might be wondering what happened with the final Grand Prize ensemble that won the $1M two years later. This is a truly impressive compilation and culmination of years of work, blending hundreds of predictive models to finally cross the finish line. We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment. Also, our focus on improving Netflix personalization had shifted to the next level by then.

    That's too bad. Netflix knows their business better than anyone, but I sure wish Keeping Up with the Kardashians wasn't listed in my top 10 right now.

    [via Techdirt]

  • The Accidental Statistician

    April 6, 2012  |  Statistics

    George E.P. Box, a statistician known for his body of work in time series analysis and Bayesian inference (and his quotes), recounts how he became a statistician while trying to solve actual problems. He was a 19-year-old college student studying chemistry. Instead of finishing, he joined the army, fed up with what the British government was doing to stop Hitler.

    Before I could actually do any of that I was moved to a highly secret experimental station in the south of England. At the time they were bombing London every night and our job was to help to find out what to do if, one night, they used poisonous gas.

    Some of England's best scientists were there. There were a lot of experiments with small animals, I was a lab assistant making biochemical determinations, my boss was a professor of physiology dressed up as a colonel, and I was dressed up as a staff sergeant.

    The results I was getting were very variable and I told my colonel that what we really needed was a statistician.

    He said "we can't get one, what do you know about it?" I said "Nothing, I once tried to read a book about it by someone called R. A. Fisher but I didn't understand it". He said "You've read the book so you better do it", so I said, "Yes sir".

    Box eventually worked with Fisher, studied under E. S. Pearson in college after his discharge from the army, and started the Statistical Techniques Research Group at Princeton at the insistence of one John Tukey.

  • What a Pound of Change is Worth

    April 5, 2012  |  Statistics


    After seeing his friend's CoinStar receipt for 27 pounds of coins that equalled $256.14, Dan Kozikowski dug deeper and estimated what a pound of change is worth, on average.

    Now, to finish out the analysis, let's tie this back to weight. Fortunately, the U.S. Mint standardizes and publishes the weight of each coin here. With that in hand... drumroll please... we’d expect about 34.9 quarters, 19.8 dimes, 11.5 nickels, and 61.2 pennies in a New York pound of coins, for a total value of $12.00. A Boston pound is worth slightly less: $11.81.

    I love it when people analyze the everyday. (Although I'm sure CoinStar looks at distributions like this all the time for storage supply something or other.)

    Alas, the coin distribution of Kozikowski's friend didn't quite match the estimate, as shown in the graph above. He attributes it to the friend spending quarters, dimes, and nickels before going to the CoinStar. Only the quarters came in under the estimate, though, and there were almost twice as many dimes and nickels as expected, so the model needs to be refined.
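
    The arithmetic behind the quoted estimate can be sketched in a few lines of Python. This is a rough check, not Kozikowski's actual method. The coin weights are the U.S. Mint's published specifications (assuming modern 2.5 g pennies), the counts are the rounded figures from the quote, and the function names are made up for illustration. Because the counts are rounded, the computed value need not match the quoted total exactly.

```python
# U.S. Mint coin weights in grams and face values in dollars.
# Assumption: pennies are the modern 2.5 g composition.
COINS = {
    "quarter": {"grams": 5.670, "value": 0.25},
    "dime": {"grams": 2.268, "value": 0.10},
    "nickel": {"grams": 5.000, "value": 0.05},
    "penny": {"grams": 2.500, "value": 0.01},
}
GRAMS_PER_POUND = 453.592

# Rounded expected counts for a "New York pound", copied from the quote.
ny_counts = {"quarter": 34.9, "dime": 19.8, "nickel": 11.5, "penny": 61.2}


def total_grams(counts):
    """Total weight in grams of a mix of coins."""
    return sum(n * COINS[coin]["grams"] for coin, n in counts.items())


def total_value(counts):
    """Total face value in dollars of a mix of coins."""
    return sum(n * COINS[coin]["value"] for coin, n in counts.items())


def value_per_pound(dollars, pounds):
    """Average dollars per pound for a weighed batch of change."""
    return dollars / pounds


if __name__ == "__main__":
    print(f"weight: {total_grams(ny_counts):.1f} g (a pound is {GRAMS_PER_POUND} g)")
    print(f"value:  ${total_value(ny_counts):.2f}")
    # The friend's CoinStar receipt: 27 pounds of coins worth $256.14.
    print(f"receipt: ${value_per_pound(256.14, 27):.2f} per pound")
```

    The same `value_per_pound` helper applied to the friend's receipt gives a figure noticeably below the estimate, which is consistent with the mismatch described above.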

  • Fear of Big Brother and Government Surveys

    April 3, 2012  |  Statistics

    In addition to its ten-year population counts, the United States Census Bureau annually collects information about how people live in the country through the American Community Survey. It's an eleven-page survey [pdf] that asks about your housing situation, education, and job. Sixty Republican members of Congress want to make this currently mandatory survey optional.

    The ACS will reach 3.5 million households this year, using dozens of detailed questions—including asking about a household's use of flush toilets, wood fuel and carpools—to determine the need for various government programs. The survey's mandatory status, along with telephone and in-person follow-ups to initial mailings, helps keep response rates near 100%.

    Now, 60 Republican members of Congress, including presidential candidate and Texas Rep. Ron Paul, are challenging the survey's mandatory status, with a bill that would make it voluntary to complete the ACS. The push is fueled by privacy concerns and the very detailed nature of the questions.

    Find the full details of the bill on the Library of Congress site. Things got interesting when I searched for this link.

  • 1940 Census Individual Records Released

    April 3, 2012  |  Data Sources

    Racial ethnic diversity

    The 72-year mark has arrived, and the United States Census released individual records from 1940 yesterday. So you can now, for example, see that J.D. Salinger lived at 1133 Park Avenue.

  • Freakonomics Critique and Rebuttal

    March 22, 2012  |  Statistics

    Whoa. What did I just read?

    I think most of you know of Freakonomics, but in case you don't, it started as a book in 2005, by economist Steven Levitt and journalist Stephen Dubner. The book examines corners of life (like cheating in sumo) through data. It's a good read. SuperFreakonomics was the follow-up in 2009. Freakonomics has since grown up into a media company, complete with documentary, radio show, and blog. Needless to say, it's had a lot of success.

    In the latest issue of American Scientist, statisticians Kaiser Fung and Andrew Gelman wrote a strong critique of Levitt and Dubner's work.

    In our analysis of the Freakonomics approach, we encountered a range of avoidable mistakes, from back-of-the-envelope analyses gone wrong to unexamined assumptions to an uncritical reliance on the work of Levitt’s friends and colleagues. This turns accessibility on its head: Readers must work to discern which conclusions are fully quantitative, which are somewhat data driven and which are purely speculative.

    Fung and Gelman then cite examples that they believe are erroneous.

    It's not mean-spirited, but Gelman has a way of offending even if he doesn't mean to, so I knew a third of the way through that this could not end well.

    Dubner replied. (Skip part II, which addresses a different issue that shouldn't have been an issue in the first place.) After explaining why almost everything that Fung and Gelman wrote is wrong, he concludes that they were blinded by their desire to disprove.

    [O]nce they'd picked up a hammer, did everything look like a nail?

    Dubner continues:

    I can certainly understand why Freakonomics is an appealing target for someone like Gelman-Fung. As I noted earlier, there are strong incentives to attack, particularly in the public sphere, where one can get a ton of attention in a blink by assailing the reputation of someone who's been plugging away for years. Whether in the academy, the media, the political arena, or elsewhere, public discourse these days often seems little more than a tit-for-tat game in which you wait for someone or something to achieve a certain momentum and then shout as loudly as you can that it’s "wrong!" Or, in written form: Epic fail.

    I've only read the first book, which like I said was really good, so I can't really go with either side, but Dubner provides some compelling arguments, and I have a feeling most people will believe him more.

    Update: Gelman replies to the reply and Fung adds to that.

  • New iPad battery size is huge

    March 16, 2012  |  Mistaken Data

    ipad expanded battery

    From Gizmodo, this shows battery size in the new iPad versus that of the iPad 2. The battery in the former is 70 percent bigger than that of the latter. Something's not right here.

    [Thanks, David]

  • Stephen Colbert on Target and predictive analytics

    February 27, 2012  |  Statistics

    "Target doesn't just know when you're buying sheets. They know what you're doing in between them."

    [Comedy Central via @alexlundry]

  • Companies learn your secrets with data about you

    February 16, 2012  |  Statistics

    In the 1980s, students and researchers at UCLA, led by marketing professor Alan Andreasen, found some interesting spending patterns when people approach major life events.

    [W]hen some customers were going through a major life event, like graduating from college or getting a new job or moving to a new town, their shopping habits became flexible in ways that were both predictable and potential gold mines for retailers. The study found that when someone marries, he or she is more likely to start buying a new type of coffee. When a couple move into a new house, they're more apt to purchase a different kind of cereal. When they divorce, there’s an increased chance they'll start buying different brands of beer.

    These findings turned out to be the backbone of work by statistician Andrew Pole, who was hired by Target to analyze their data and increase sales. Somewhere along the way, the marketing department at Target asked Pole if there was a way to predict that a customer was expecting a child. Birth records are freely available, so it's easy to send baby-related coupons and advertisements to new mothers, but Target wanted first dibs, before that baby came out.

    As you might expect, Pole found 25 products that were strong indicators, and soon he could estimate pregnancies with a pregnancy prediction score.

    Pole applied his program to every regular female shopper in Target's national database and soon had a list of tens of thousands of women who were most likely pregnant. If they could entice those women or their husbands to visit Target and buy baby-related products, the company’s cue-routine-reward calculators could kick in and start pushing them to buy groceries, bathing suits, toys and clothing, as well. When Pole shared his list with the marketers, he said, they were ecstatic. Soon, Pole was getting invited to meetings above his paygrade. Eventually his paygrade went up.

    Creepy or just good marketing? I say the latter.

    [New York Times | Thanks, Paul]

  • Jeremy Lin is no fluke

    February 11, 2012  |  Statistics

    Nate Silver looks at past players who have scored 20 or more points, had 6 or more assists, and shot better than 50 percent in four or more games in a row. It's an illustrious list of all-stars, including Jordan, Bird, and Magic, with only a handful who were just so-so.

    Like everyone else, I was skeptical. I saw him play with the Warriors, and it was never that impressive. However, watching last night's game against the Lakers, it was hard not to buy in to Linsanity. We'll see if he can extend the streak tonight against Minnesota, but even if the Knicks do win, should we read that much into it? Remember, there aren't that many other scoring options on the Knicks right now. Two of the past four wins were against horrible teams (New Jersey and Washington), and the other two, the Lakers and the Jazz, were teams just slightly above .500.
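
    The filter described above (20 or more points, 6 or more assists, better than 50 percent shooting, four or more games in a row) amounts to a streak search over a game log. Here is a minimal sketch of that idea. It is not Silver's code, the `Game` fields and function names are hypothetical, and the sample log is invented for illustration rather than taken from real box scores.

```python
from dataclasses import dataclass


@dataclass
class Game:
    points: int
    assists: int
    fg_made: int
    fg_attempted: int


def qualifies(game):
    """20+ points, 6+ assists, and better than 50 percent from the field."""
    if game.fg_attempted == 0:
        return False
    shooting = game.fg_made / game.fg_attempted
    return game.points >= 20 and game.assists >= 6 and shooting > 0.5


def longest_streak(games):
    """Length of the longest run of consecutive qualifying games."""
    best = current = 0
    for game in games:
        current = current + 1 if qualifies(game) else 0
        best = max(best, current)
    return best


def has_streak(games, min_games=4):
    """True if the log contains a qualifying run of at least min_games."""
    return longest_streak(games) >= min_games


# Invented game log for illustration only (not real statistics).
example_log = [
    Game(points=12, assists=3, fg_made=5, fg_attempted=11),
    Game(points=22, assists=6, fg_made=8, fg_attempted=15),
    Game(points=30, assists=9, fg_made=11, fg_attempted=20),
    Game(points=21, assists=7, fg_made=9, fg_attempted=16),
    Game(points=26, assists=6, fg_made=10, fg_attempted=18),
    Game(points=24, assists=8, fg_made=7, fg_attempted=20),
]

if __name__ == "__main__":
    print(longest_streak(example_log), has_streak(example_log))
```

    Running the same check over every player's game log would produce the kind of list the post describes.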

    [Nate Silver]

  • An action plan for data science, a decade ago

    February 3, 2012  |  Statistics

    Data science has been covered at length during the past couple of years, and we tend to think of it as a field of study just a couple of years older than that. Jeff Hammerbacher and DJ Patil have played roles in further propagating the term as an actual profession in roughly the same timespan. So I was surprised to come across this rarely-cited 2001 paper by statistician William Cleveland, Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics [pdf].

    This document describes a plan to enlarge the major areas of technical work of the field of statistics. Because the plan is ambitious and implies substantial change, the altered field will be called "data science."

    For those unfamiliar with the name, Cleveland's work on graphical perception might ring a bell.

  • Challenges measuring crime worldwide

    February 1, 2012  |  Statistics

    You would think that something so concrete, carefully recorded by authorities, wouldn't be too tough to tabulate, even if at a large scale. Not so.

    Homicide is a "serious crime that many people are concerned with, it is well-measured, and it is to a large degree well-reported and -recorded," says Alfred Blumstein, a criminologist at Carnegie Mellon University. "That is not to say that there aren't a variety of ways for fudging the measurement."

    Among the factors that cloud homicide numbers: gaps between police-reported numbers and counts by public-health organizations. The discrepancy is wide in many African countries and some Caribbean ones. The United Nations attributes the disparity to several factors, including definitional differences—whether honor killings should count—a lack of public-health infrastructure in some countries, and undercounting—possibly deliberate—by police.

    I think this is something the general public often doesn't understand about data. The numbers are entered and analyzed on a computer, so it's easy to mistake data for mechanical output. It must be accurate, right? That's usually not the case though, especially when it comes to data collection outside a controlled lab setting.

    The game always changes when humans are involved. Not everyone responds to surveys, definitions of events vary across organizations, estimation methods change every year, and the list goes on.

    If you work with data, you have to deal with that uncertainty, and as a data consumer, you have to remember that numbers don't automatically mean fact.

    [Wall Street Journal]

  • Texting on the toilet

    January 30, 2012  |  Data Sources

    I thought this riveting post on the New York Times Bits blog about the rise of the toilet texter deserved a graphic. Since their graphics department is no doubt busy with elections, I took the liberty. I am — the 91 percent.

    I got the numbers straight from the Bits post, but you can download the full report from 11mark for all the demographics. You have to register though, and I didn't want to be the guy who creates an online account to just read a report on what people do while they make dooty. I have standards.

  • More people want to learn statistics

    January 27, 2012  |  Statistics

    Data is hot right now, so as you would expect, more people are signing up and applying to learn about it. Quentin Hardy for The New York Times reports.

    At North Carolina State, an advanced analytics program lasting 10 months has, since its founding in 2006, placed over 90 percent of its students annually. The average graduate’s starting salary for an entry-level job is $73,000. Its current class of 40 students had 185 applicants, and next year’s applications are already twice that. In 2009, Harvard awarded four undergraduate degrees in statistics. Two graduates went into finance, one to political polling and one became a substitute teacher. There were nine graduates in 2010, 13 last year. They headed into Google, biosciences and Wall Street, as well as Stanford's literature department.

    And in 2011, just about everywhere.

    [New York Times via @jsteeleeditor]

  • The Fixie Bike Index and hipsters

    January 27, 2012  |  Statistics

    Hipster places in America

    Priceonomics takes the association of fixie bikes with hipsters and creates the Fixie Bike Index. After starting with New York, they branch out to national numbers.

    In short, fixed gear bikes = hipsters, and New York boroughs that have more fixies per capita should have more hipsters per capita. We sampled our data to see the number of used bikes for sale per capita in each borough with the term "fixie" or "fixed gear" in the product title to create the Fixie Index.

    I don't know about these numbers. I lived in Modesto for a year and don't remember people riding bikes — or hipsters, and riding your bike in Los Angeles kind of sucks.
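
    The index as quoted is a per-capita count of used-bike listings whose title mentions "fixie" or "fixed gear". A rough sketch of that calculation follows. It is an assumption about how such an index could be computed, not Priceonomics' actual pipeline. The function name, the per-100,000 scaling, and the placeholder areas, listings, and populations are all invented for illustration.

```python
def fixie_index(listings, population, per=100_000):
    """Matching listings per `per` residents for each area.

    listings: iterable of (area, product_title) pairs.
    population: dict mapping area to number of residents.
    """
    keywords = ("fixie", "fixed gear")
    counts = {}
    for area, title in listings:
        # Case-insensitive match on the product title.
        if any(keyword in title.lower() for keyword in keywords):
            counts[area] = counts.get(area, 0) + 1
    return {
        area: per * counts.get(area, 0) / residents
        for area, residents in population.items()
    }


# Placeholder data, invented for illustration only.
sample_listings = [
    ("Area A", "Fixie bike, barely used"),
    ("Area A", "Road bike 54cm"),
    ("Area B", "Fixed Gear frame set"),
    ("Area B", "Blue fixie with bullhorn bars"),
]
sample_population = {"Area A": 200_000, "Area B": 100_000}

if __name__ == "__main__":
    print(fixie_index(sample_listings, sample_population))
```

    Dividing by population is what keeps a large area from topping the list just because it has more listings overall.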


  • Social network analysis used to convict slumlords

    January 19, 2012  |  Statistics

    social network analysis

    In working with tenants to help their city attorney convict a group of slumlords, an economic justice organization collected public data on housing violations that were going unfixed. They tried standard mind mapping and organization software, but the relationships were too complex to unearth anything useful. So they eventually used social network analysis, revealing money changing hands in a way that allowed owners to strip the value from buildings without actually fixing them.

    The analysis results, combined with the city's investigation, led to key convictions and court-awarded funds for tenants to move elsewhere.

    Sounds like a good reason for Data Without Borders.

    [Valdis Krebs via kottke]

Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.