• Fear of Big Brother and Government Surveys

    April 3, 2012  |  Statistics

    In addition to the once-a-decade population count, the United States Census Bureau collects information every year about how people live in the country through the American Community Survey. It's an eleven-page survey [pdf] that asks about your housing situation, education, and job, and 60 Republican members of Congress want to make this currently mandatory survey optional.

    The ACS will reach 3.5 million households this year, using dozens of detailed questions—including asking about a household's use of flush toilets, wood fuel and carpools—to determine the need for various government programs. The survey's mandatory status, along with telephone and in-person follow-ups to initial mailings, helps keep response rates near 100%.

    Now, 60 Republican members of Congress, including presidential candidate and Texas Rep. Ron Paul, are challenging the survey's mandatory status, with a bill that would make it voluntary to complete the ACS. The push is fueled by privacy concerns and the very detailed nature of the questions.

    Find the full details of the bill on the Library of Congress site. Things got interesting when I searched for this link.

  • 1940 Census Individual Records Released

    April 3, 2012  |  Data Sources

    Racial ethnic diversity

    The 72-year mark has arrived, and the United States Census Bureau released individual records from 1940 yesterday. So you can now, for example, see that J.D. Salinger lived at 1133 Park Avenue.

  • Freakonomics Critique and Rebuttal

    March 22, 2012  |  Statistics

    Whoa. What did I just read?

    I think most of you know of Freakonomics, but in case you don't, it started as a book in 2005, by economist Steven Levitt and journalist Stephen Dubner. The book examines corners of life (like cheating in sumo) through data. It's a good read. SuperFreakonomics was the follow-up in 2009. Freakonomics has since grown up into a media company, complete with documentary, radio show, and blog. Needless to say, it's had a lot of success.

    In the latest issue of American Scientist, statisticians Kaiser Fung and Andrew Gelman wrote a strong critique of Levitt and Dubner's work.

    In our analysis of the Freakonomics approach, we encountered a range of avoidable mistakes, from back-of-the-envelope analyses gone wrong to unexamined assumptions to an uncritical reliance on the work of Levitt’s friends and colleagues. This turns accessibility on its head: Readers must work to discern which conclusions are fully quantitative, which are somewhat data driven and which are purely speculative.

    Fung and Gelman then cite examples that they believe are erroneous.

    It's not mean-spirited, but Gelman has a way of offending even if he doesn't mean to, so I knew a third of the way through that this could not end well.

    Dubner replied. (Skip part II, which addresses a different issue that shouldn't have been an issue in the first place.) After explaining why almost everything that Fung and Gelman wrote is wrong, he concludes that they were blinded by their desire to disprove.

    [O]nce they'd picked up a hammer, did everything look like a nail?

    Dubner continues:

    I can certainly understand why Freakonomics is an appealing target for someone like Gelman-Fung. As I noted earlier, there are strong incentives to attack, particularly in the public sphere, where one can get a ton of attention in a blink by assailing the reputation of someone who's been plugging away for years. Whether in the academy, the media, the political arena, or elsewhere, public discourse these days often seems little more than a tit-for-tat game in which you wait for someone or something to achieve a certain momentum and then shout as loudly as you can that it’s "wrong!" Or, in written form: Epic fail.

    I've only read the first book, which, like I said, was really good, so I can't really take either side. Still, Dubner provides some compelling arguments, and I have a feeling most people will believe him more.

    Update: Gelman replies to the reply and Fung adds to that.

  • New iPad battery size is huge

    March 16, 2012  |  Mistaken Data

    ipad expanded battery

    From Gizmodo, this shows battery size in the new iPad versus that of the iPad 2. The battery in the former is 70 percent bigger than that of the latter. Something's not right here.

    [Thanks, David]

  • Stephen Colbert on Target and predictive analytics

    February 27, 2012  |  Statistics

    "Target doesn't just know when you're buying sheets. They know what you're doing in between them."

    [Comedy Central via @alexlundry]

  • Companies learn your secrets with data about you

    February 16, 2012  |  Statistics

    In the 1980s, students and researchers at UCLA, led by marketing professor Alan Andreasen, found some interesting spending patterns when people approach major life events.

    [W]hen some customers were going through a major life event, like graduating from college or getting a new job or moving to a new town, their shopping habits became flexible in ways that were both predictable and potential gold mines for retailers. The study found that when someone marries, he or she is more likely to start buying a new type of coffee. When a couple move into a new house, they're more apt to purchase a different kind of cereal. When they divorce, there’s an increased chance they'll start buying different brands of beer.

    These findings turned out to be the backbone of work by statistician Andrew Pole, who was hired by Target to analyze their data and increase sales. Somewhere along the way, the marketing department at Target asked Pole if there was a way to predict that a customer was expecting a child. Birth records are freely available, so it's easy to send baby-related coupons and advertisements to new mothers, but Target wanted first dibs, before that baby came out.

    As you might expect, Pole found 25 products that were strong indicators of pregnancy, and soon he could assign shoppers a pregnancy prediction score.

    Pole applied his program to every regular female shopper in Target's national database and soon had a list of tens of thousands of women who were most likely pregnant. If they could entice those women or their husbands to visit Target and buy baby-related products, the company’s cue-routine-reward calculators could kick in and start pushing them to buy groceries, bathing suits, toys and clothing, as well. When Pole shared his list with the marketers, he said, they were ecstatic. Soon, Pole was getting invited to meetings above his paygrade. Eventually his paygrade went up.
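
    The article doesn't say how the pregnancy prediction score was actually computed, so the following is only a rough sketch of the general idea: purchases of indicator products each nudge a score up, and the total is squashed into a value between 0 and 1. The logistic form, the product names, the weights, and the baseline are all hypothetical and not taken from Target or the article.

```python
import math

# Hypothetical weights for three made-up indicator products.
# The source mentions 25 indicator products but does not list
# them or say how they were weighted.
INDICATOR_WEIGHTS = {
    "indicator_a": 1.2,
    "indicator_b": 0.8,
    "indicator_c": 0.5,
}

# Hypothetical baseline: a shopper with no indicator purchases
# starts with a low score.
BASELINE = -2.0


def prediction_score(basket):
    """Return a score in (0, 1) from the set of products a shopper bought.

    Each indicator product found in the basket adds its weight to the
    baseline, and a logistic function maps the total to (0, 1).
    """
    total = BASELINE + sum(
        weight
        for product, weight in INDICATOR_WEIGHTS.items()
        if product in basket
    )
    return 1.0 / (1.0 + math.exp(-total))
```

    Ranking shoppers by a score like this and keeping the top of the list is one way to end up with the kind of "most likely" list the quote describes.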

    Creepy or just good marketing? I say the latter.

    [New York Times | Thanks, Paul]

  • Jeremy Lin is no fluke

    February 11, 2012  |  Statistics

    Nate Silver looks at past players who have scored 20 or more points, had 6 or more assists, and shot better than 50 percent in four or more games in a row. It's an illustrious list of all-stars, including Jordan, Bird, and Magic, with only a handful who were just so-so.
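
    For the curious, here's a minimal sketch of that kind of filter: flag each game that meets all three thresholds, then look for the longest run of consecutive qualifying games. The `GameLine` fields and the sample numbers are made up for illustration and are not Silver's data or code.

```python
from dataclasses import dataclass


@dataclass
class GameLine:
    points: int
    assists: int
    fg_made: int
    fg_attempted: int


def meets_criteria(game):
    """20+ points, 6+ assists, and better than 50 percent shooting."""
    if game.fg_attempted == 0:
        return False
    shooting = game.fg_made / game.fg_attempted
    return game.points >= 20 and game.assists >= 6 and shooting > 0.5


def longest_streak(games):
    """Length of the longest run of consecutive qualifying games."""
    best = current = 0
    for game in games:
        current = current + 1 if meets_criteria(game) else 0
        best = max(best, current)
    return best


def has_qualifying_streak(games, min_length=4):
    """True if a player did it in min_length or more games in a row."""
    return longest_streak(games) >= min_length


# Made-up box scores, in chronological order.
sample_games = [
    GameLine(points=22, assists=6, fg_made=8, fg_attempted=15),
    GameLine(points=20, assists=9, fg_made=9, fg_attempted=16),
    GameLine(points=31, assists=7, fg_made=11, fg_attempted=20),
    GameLine(points=24, assists=6, fg_made=10, fg_attempted=19),
    GameLine(points=18, assists=11, fg_made=7, fg_attempted=12),
]
```

    Note that shooting exactly 50 percent doesn't count here, since the criterion is "better than."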

    Like everyone else, I was skeptical. I saw him play with the Warriors, and it was never that impressive. However, watching last night's game against the Lakers, it was hard not to buy into Linsanity. We'll see if he can extend the streak tonight against Minnesota, but even if the Knicks do win, should we read that much into it? Remember, there aren't that many other scoring options on the Knicks right now, two of the past four wins were against horrible teams (New Jersey and Washington), and the other two, the Lakers and the Jazz, were teams just slightly above .500.

    [Nate Silver]

  • An action plan for data science, a decade ago

    February 3, 2012  |  Statistics

    Data science has been covered at length during the past couple of years, and we tend to think of it as a field of study just a couple of years older than that. Jeff Hammerbacher and DJ Patil have played roles in further propagating the term as an actual profession in roughly the same timespan. So I was surprised to come across this rarely-cited 2001 paper by statistician William Cleveland, Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics [pdf].

    This document describes a plan to enlarge the major areas of technical work of the field of statistics. Because the plan is ambitious and implies substantial change, the altered field will be called "data science."

    For those unfamiliar, Cleveland's work on graphical perception might ring a bell.

  • Challenges measuring crime worldwide

    February 1, 2012  |  Statistics

    You would think that something so concrete, carefully recorded by authorities, wouldn't be too tough to tabulate, even if at a large scale. Not so.

    Homicide is a "serious crime that many people are concerned with, it is well-measured, and it is to a large degree well-reported and -recorded," says Alfred Blumstein, a criminologist at Carnegie Mellon University. "That is not to say that there aren't a variety of ways for fudging the measurement."

    Among the factors that cloud homicide numbers: gaps between police-reported numbers and counts by public-health organizations. The discrepancy is wide in many African countries and some Caribbean ones. The United Nations attributes the disparity to several factors, including definitional differences—whether honor killings should count—a lack of public-health infrastructure in some countries, and undercounting—possibly deliberate—by police.

    I think this is something the general public often doesn't understand about data. The numbers are entered and analyzed on a computer, so it's easy to mistake data for mechanical output. It must be accurate, right? That's usually not the case though, especially when it comes to data collection outside a controlled lab setting.

    The game always changes when humans are involved. Not everyone responds to surveys, definitions of events vary across organizations, estimation methods change every year, and the list goes on.

    For those who do stuff with data, you have to deal with that uncertainty, and as data consumers, you have to remember that numbers don't automatically mean fact.

    [Wall Street Journal]

  • Texting on the toilet

    January 30, 2012  |  Data Sources

    I thought this riveting post on the New York Times Bits blog about the rise of the toilet texter deserved a graphic. Since their graphics department is no doubt busy with elections, I took the liberty. I am — the 91 percent.

    I got the numbers straight from the Bits post, but you can download the full report from 11mark for all the demographics. You have to register though, and I didn't want to be the guy who creates an online account to just read a report on what people do while they make dooty. I have standards.

  • More people want to learn statistics

    January 27, 2012  |  Statistics

    Data is hot right now, so as you would expect, more people are signing up and applying to learn about it. Quentin Hardy for The New York Times reports.

    At North Carolina State, an advanced analytics program lasting 10 months has, since its founding in 2006, placed over 90 percent of its students annually. The average graduate’s starting salary for an entry-level job is $73,000. Its current class of 40 students had 185 applicants, and next year’s applications are already twice that. In 2009, Harvard awarded four undergraduate degrees in statistics. Two graduates went into finance, one to political polling and one became a substitute teacher. There were nine graduates in 2010, 13 last year. They headed into Google, biosciences and Wall Street, as well as Stanford's literature department.

    And in 2011, just about everywhere.

    [New York Times via @jsteeleeditor]

  • The Fixie Bike Index and hipsters

    January 27, 2012  |  Statistics

    Hipster places in America

    Priceonomics takes the association between fixie bikes and hipsters and creates the Fixie Bike Index. After starting with New York, they branch out to national numbers.

    In short, fixed gear bikes = hipsters, and New York boroughs that have more fixies per capita should have more hipsters per capita. We sampled our data to see the number of used bikes for sale per capita in each borough with the term "fixie" or "fixed gear" in the product title to create the Fixie Index.
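
    The calculation described there can be sketched roughly as follows: count listings whose title mentions "fixie" or "fixed gear," then divide by population. The function name, the per-100,000 scaling, and the input shapes are assumptions for illustration, not Priceonomics' actual method or data.

```python
import re
from collections import Counter

# Matches "fixie", "fixies", "fixed gear", "fixed-gear", or "fixedgear".
FIXIE_PATTERN = re.compile(r"\bfixies?\b|\bfixed[\s-]?gear\b", re.IGNORECASE)


def fixie_index(listings, population, per=100_000):
    """Fixie listings per `per` residents for each area.

    listings: iterable of (area, product_title) pairs for used bikes.
    population: dict mapping area name to number of residents.
    """
    counts = Counter(
        area for area, title in listings if FIXIE_PATTERN.search(title)
    )
    # Areas with no matching listings get an index of 0.
    return {
        area: counts[area] * per / residents
        for area, residents in population.items()
    }


# Made-up listings and populations for two hypothetical areas.
sample_listings = [
    ("Area A", "Fixie bike, barely used"),
    ("Area A", "Mountain bike with new tires"),
    ("Area B", "Fixed gear track bike"),
    ("Area B", "Steel fixed-gear commuter"),
    ("Area B", "Road bike 54cm"),
]
sample_population = {"Area A": 200_000, "Area B": 100_000}
```

    An index like this only measures what's listed for sale, which is part of why the rankings deserve some skepticism.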

    I don't know about these numbers. I lived in Modesto for a year and don't remember people riding bikes — or hipsters, and riding your bike in Los Angeles kind of sucks.

    [Priceonomics]

  • Social network analysis used to convict slumlords

    January 19, 2012  |  Statistics

    social network analysis

    In working with tenants to help their city attorney convict a group of slumlords, an economic justice organization collected public data on housing violations that were going unfixed. They tried standard mind mapping and organization software, but the relationships were too complex to unearth anything useful. So they eventually used social network analysis, revealing money exchanging hands in such a way that allowed owners to strip the value from buildings without actually fixing them.

    The analysis results, combined with the city's investigation, led to key convictions and court-awarded funds for tenants to move elsewhere.

    Sounds like a good reason for Data Without Borders.

    [Valdis Krebs via kottke]

  • Lego mathematics and growing complexity in networks

    January 12, 2012  |  Statistics

    lego curve

    Legos are the best toys ever invented. That's indisputable fact. So it's no surprise that Mark Changizi et al. at Duke University used the toys in their study of growing complexity of systems and networks. They looked at 389 Lego sets and compared the number of pieces in the set to the number of piece types, as shown above.

  • Predicting the future of prediction

    January 9, 2012  |  Statistics

    Tarot cards don't cut it anymore as predictors. We turn to data for a look to the future:

    "We're finally in a position where people volunteer information about their specific activities, often their location, who they're with, what they're doing, how they're feeling about what they're doing, what they're talking about," said Johan Bollen, a professor at the School of Informatics and Computing at Indiana University Bloomington who developed a way to predict the ups and downs of the stock market based on Twitter activity. "We've never had data like that before, at least not at that level of granularity." Bollen added: "Right now it’s a gold rush."

    Or you could just get yourself a flux capacitor and save yourself some time.

    [Boston]

  • Teamwork and collaboration that built Watson

    January 8, 2012  |  Statistics

    Team lead David Ferrucci recalls the early days of putting together the team that built Watson:

    Likewise, the scientists would have to reject an ego-driven perspective and embrace the distributed intelligence that the project demanded. Some were still looking for that silver bullet that they might find all by themselves. But that represented the antithesis of how we would ultimately succeed. We learned to depend on a philosophy that embraced multiple tracks, each contributing relatively small increments to the success of the project.

    As I sit here reading about egos within IBM, with the NFL playoffs in front of me, I can't help but smirk.

    [New York Times via Simply Statistics]

  • Algorithm estimates who’s in control

    January 4, 2012  |  Statistics

    Jon Kleinberg, whose work influenced Google's PageRank, is working on ranking something else. Kleinberg et al. developed an algorithm that ranks people, based on how they speak to each other.

    "We show that in group discussions, power differentials between participants are subtly revealed by how much one individual immediately echoes the linguistic style of the person they are responding to," say Kleinberg and co.

    The key to this is an idea called linguistic co-ordination, in which speakers naturally copy the style of their interlocutors. Human behaviour experts have long studied the way individuals can copy the body language or tone of voice of their peers, some have even studied how this effect reveals the power differences between members of the group.

    Now Kleinberg and co say the same thing happens with language style.

    That's why I just don't talk at all. Introvert to the max.

    [Technology Review]

  • When numbers are too factual

    December 19, 2011  |  Statistics

    Carl Bialik, for The Wall Street Journal, reports on PSAs and the use of scary numbers:

    The Ad Council usually avoids statistics in PSAs. "We know from our experience that effective advertising has to have an emotional component and statistics-based campaigns can be very rational," Conlon said. "We’ve also found that people tend not to believe statistics."

    And sometimes they just don’t care much about them. "When we were developing our underage drinking prevention campaign," Conlon recalled, "we found that it doesn't resonate with parents to learn about how many children are drinking underage. It's too easy for them to say 'it's not my child.' We found that it was much more compelling to include a statistic that was more about the consequences of underage drinking: Those who start drinking before age 15 are six times more likely to have alcohol problems as adults than those who start drinking at age 21 or older."

    The well-known Stalin quote comes to mind.

    [The Numbers Guy]

  • Causation is real, people

    December 15, 2011  |  Statistics

    amusing correlations

    Stop global warming. Decrease the National Science Foundation's R&D budget. It's so easy. More lessons on correlation and causation found here.

  • What Facebook knows about you

    December 14, 2011  |  Data Sources

    Facebook privacy

    Facebook logs and saves a lot of data about you and what you do on their site. This shouldn't be surprising, given that the more time people spend on Facebook, the greater the cash flow, but just how much data do they store? Because European law gives citizens the right to do so, Austrian law student Max Schrems requested all the data Facebook had about him. He got back a CD with 1,222 PDF files.

Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.