  • A new kind of resource

    December 3, 2012  |  Statistics

    Jer Thorp talks ethics in the data-as-new-oil metaphor:

    [W]e need to change the way that we collectively think about data, so that it is not a new oil, but instead a new kind of resource entirely. For this to occur we need to foster a deep understanding of data in society. As it happens, humanity has a mechanism for this kind of broad cultural change: the arts. As we proceed towards profit and progress with data, let us encourage artists, novelists, performers and poets to take an active role in the conversation. In doing so we may avoid some of the mistakes that we made with the old oil.

    See also: Jer's talk on the human side of data.

  • Machines and built-in morality

    November 29, 2012  |  Statistics

    With Google's driverless cars now street legal in California, Florida, and Nevada, Gary Marcus for the New Yorker ponders a world where machines need a built-in morality system.

    That moment will be significant not just because it will signal the end of one more human niche, but because it will signal the beginning of another: the era in which it will no longer be optional for machines to have ethical systems. Your car is speeding along a bridge at fifty miles per hour when an errant school bus carrying forty innocent children crosses its path. Should your car swerve, possibly risking the life of its owner (you), in order to save the children, or keep going, putting all forty kids at risk? If the decision must be made in milliseconds, the computer will have to make the call.

    Data analysis seems to be headed in the same direction. Just as machines will have to start making human-like decisions, data represents more of the real world and looks less like snippets in time. As the gap between numbers and what they represent shrinks, we have to think more about ethics, privacy, and whether or not what we're doing is right.

  • Archive of datasets bundled with R

    November 20, 2012  |  Data Sources

    R comes with a lot of datasets, some with the core distribution and others with packages, but you'd never know which ones unless you went through all the examples found at the end of help documents. Luckily, Vincent Arel-Bundock cataloged 596 of them in an easy-to-read page, and you can quickly download them as CSV files.

    Many of the datasets are dated, going back to the original distribution of R, but it's a great resource for teaching or if you're just looking for some data to play with.

  • Incredibly divided nation in a map

    November 9, 2012  |  Mistaken Data

    Divided nation

    I knew things were bad, but I didn't know they were this bad. Obama has his work cut out for him. [Thanks, @adamsinger]

  • How Silver predictions performed

    November 7, 2012  |  Statistics

    By way of Rafa Irizarry from Simply Statistics, a plot of Nate Silver's probabilities for Barack Obama winning a state versus the percentage of vote in each state, as of midnight EST.

    I guess that's pretty (100%) good. Looks like the folks at Princeton didn't do half bad either. It's a win for Obama and a win for statistics. Well, good statistics, at least. (Looking at you, University of Colorado.)

    Update: Drew Linzer at Emory and the Huffington Post Pollster also did well. All in all, it was a good night for statistics.

  • A quick lesson on making predictions

    October 31, 2012  |  Statistics

    Political analyst and statistician Nate Silver has gotten some flak lately for consistently projecting a 70-plus percent chance of a Barack Obama win this election. But as Jeff Leek explains, the criticism doesn't stem from Silver being wrong. Rather, it comes from the critics' misunderstanding of statistics. Leek provides a quick lesson on how Silver makes his predictions and how the methods apply to other things, like the weather.

    Now, this might seem like a goofy way to come up with a "percent chance" with simulated elections and all. But it turns out it is actually a pretty important thing to know and relevant to those of us on the East Coast right now. It turns out weather forecasts (and projected hurricane paths) are based on the same sort of thing — simulated versions of the weather are run and the "percent chance of rain" is the fraction of times it rains in a particular place.

    So Romney may still win and Obama may lose — and Silver may still get a lot of it right. But regardless, the approach taken by Silver is not based on politics, it is based on statistics.

    Don't fear the black box.
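
    The "simulated elections" idea in the quote can be sketched in a few lines. This is only an illustration, not Silver's model: the five states, their electoral votes, and their win probabilities are made up, and each state is simulated independently, which a real forecast would not assume.

```python
import random

# Hypothetical inputs: state -> (electoral votes, chance the candidate wins it).
STATES = {
    "A": (20, 0.90),
    "B": (15, 0.55),
    "C": (10, 0.50),
    "D": (25, 0.20),
    "E": (30, 0.70),
}


def simulate_once(states, rng):
    """Run one simulated election and return the candidate's electoral votes."""
    return sum(votes for votes, p in states.values() if rng.random() < p)


def win_probability(states, n_sims=10_000, seed=0):
    """Fraction of simulated elections in which the candidate wins a majority."""
    rng = random.Random(seed)
    total = sum(votes for votes, _ in states.values())
    needed = total // 2 + 1  # strict majority of electoral votes
    wins = sum(simulate_once(states, rng) >= needed for _ in range(n_sims))
    return wins / n_sims


if __name__ == "__main__":
    print(f"Chance of winning: {win_probability(STATES):.1%}")
```

    The printed "percent chance" is just the share of simulated elections the candidate won, the same way a "percent chance of rain" is the share of simulated forecasts in which it rains.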

  • Data on decades of Boy Scout expulsions released

    October 22, 2012  |  Data Sources

    Allegations in the Boy Scouts

    The Los Angeles Times released nearly 5,000 records of allegations from the Boy Scouts of America as a browseable map and searchable list. You can also download the data.

    This database contains information on about 5,000 men and a handful of women who were expelled from the Boy Scouts of America between 1947 and January 2005 on suspicion of sexual abuse. The dots on the map indicate the location of troops connected in some way to the accused. The timeline below shows the volume of cases opened by year; however, an unknown number of files were purged by the Scouts prior to the early 1990s.

    The interactive map helps you narrow down by city, but it's kind of hard to see cases from a country-wide perspective. Here's a quick look.

    The worst part is that a lot of the cases went unreported.

  • The birthday problem explained

    October 5, 2012  |  Statistics

    How many people does it take for there to be a 50% chance that a pair in the group has the same birthday? Only 23 people. What about a 99% chance? Maybe even more shocking: 57 people. This is the birthday problem, which every undergrad who's taken a stat course has seen. Steven Strogatz explains the logic and calculations.

    Intuitively, how can 23 people be enough? It’s because of all the combinations they create, all the opportunities for luck to strike. With 23 people, there are 253 possible pairs of people (see the notes for why), and that turns out to be enough to push the odds of a match above 50 percent.

    Incidentally, if you go up to 43 people — the number of individuals who have served as United States president so far — the odds of a match increase to 92 percent. And indeed two of the presidents do have the same birthday: James Polk and Warren Harding were both born on Nov. 2.

    The Johnny Carson clip referenced in the article is worth watching. Carson tries to test the results with the audience, but goes about it the wrong way.
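
    The figures quoted above (23, 43, and 57 people, and the 253 pairs) can be checked with a short calculation. This is a minimal sketch that assumes 365 equally likely birthdays and ignores leap years; the function names are my own.

```python
from math import comb


def p_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday.

    Assumes birthdays are independent and uniform over `days` days.
    """
    p_all_distinct = 1.0
    for k in range(n):
        # Person k+1 must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


def smallest_group(target, days=365):
    """Smallest group size whose chance of a shared birthday reaches `target`."""
    n = 1
    while p_shared_birthday(n, days) < target:
        n += 1
    return n


if __name__ == "__main__":
    print("pairs among 23 people:", comb(23, 2))
    for n in (23, 43, 57):
        print(n, "people:", f"{p_shared_birthday(n):.1%}")
    print("50% reached at", smallest_group(0.50), "people")
    print("99% reached at", smallest_group(0.99), "people")
```

    The loop multiplies the chances that each new person misses every birthday already taken, then subtracts that product from one.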

  • Data for good, not bad

    September 21, 2012  |  Statistics

    I'm so glad there are people like Jake Porway in the world. The founder and executive director of DataKind gives his quick pitch on "using data in the service of humanity."

  • Hiring a data scientist

    September 19, 2012  |  Statistics

    Thomas H. Davenport and D.J. Patil give the rundown on what a data scientist is, what to look for and how to hire them. It's an article in Harvard Business Review, so it's geared towards managers, and I felt like I was reading a horoscope at times, but there are some interesting tidbits in there.

    Data scientists don’t do well on a short leash. They should have the freedom to experiment and explore possibilities. That said, they need close relationships with the rest of the business. The most important ties for them to forge are with executives in charge of products and services rather than with people overseeing business functions. As the story of Jonathan Goldman illustrates, their greatest opportunity to add value is not in creating reports or presentations for senior executives but in innovating with customer-facing products and processes.

    I still call myself a statistician. The main difference between data scientist and statistician seems to be programming skills, but if you're doing statistics without code, I'm not sure what you're doing (other than theory).

    Update: This recent panel from DataGotham also discusses the data scientist hiring process. [Thanks, Drew]

  • Humans predicting the weather

    September 10, 2012  |  Statistics

    Nate Silver says the weatherman is not a moron.

    Still, most people take their forecasts for granted. Like a baseball umpire, a weather forecaster rarely gets credit for getting the call right. Last summer, meteorologists at the National Hurricane Center were tipped off to something serious when nearly all their computer models indicated that a fierce storm was going to be climbing the Northeast Corridor. The eerily similar results between models helped the center amplify its warning for Hurricane Irene well before it touched down on the Atlantic shore, prompting thousands to evacuate their homes. To many, particularly in New York, Irene was viewed as a media-manufactured nonevent, but that was largely because the Hurricane Center nailed its forecast. Six years earlier, the National Weather Service also made a nearly perfect forecast of Hurricane Katrina, anticipating its exact landfall almost 60 hours in advance. If public officials hadn’t bungled the evacuation of New Orleans, the death toll might have been remarkably low.

    I like the bit later in the article that describes the number crunching machine and how humans are involved in the analysis. The National Weather Service has heavy-duty computing power to process data coming from weather stations across the country, but the computer is still bad at doing a lot of things.

    To most people, statistics means plugging numbers into an advanced calculator that spits out values, without much thought involved. Those people don't work with data.

  • Analyzing text messages to save lives

    September 5, 2012  |  Statistics

    Nancy Lublin, CEO of Do Something, gives a five-minute TED talk on the potential in analyzing text messages. During a texting campaign, Do Something started to receive texts from troubled teenagers about problems that ranged from bullying to rape, which led to the organization's work in setting up a texting hotline. Lublin hopes that, once the system is built, the data gathered from these messages can serve as a census of problems, and perhaps be used in the same way that Target uses data to figure out if women are pregnant, but to save lives instead of figuring out what coupons to send.

    [Thanks, Tommy]

  • Poker is a game of skill, not luck

    August 28, 2012  |  Statistics

    King nine offsuit

    Randal Heeb convinced a New York City judge that poker isn't a game of luck, using a 120-page report full of analysis and charts.

    Judge Weinstein, relying on the research of Randal Heeb - an economist, statistician and poker player - found that while luck determines what cards a player gets, skill plays the bigger role in a player's ultimate success. With such charts as "Win-Rate Comparison: King Nine Offsuit," the 91-year-old judge delved into the complexities of the argument more thoroughly than any past court has. John Pappas was pleased with the result, and impressed with the methodology.

    The king nine offsuit chart he refers to is above. Simulated earnings for better players are shown on the left in blue, and earnings for not so good players are shown on the right. Although both groups are likely to lose with the hand, skill appears to decrease expected losses.

    Check out the full report here [pdf].

  • Twitter vs. Facebook: What people share

    August 23, 2012  |  Statistics

    Edwin Chen, a data scientist at Twitter, took an in-depth look at what people are more inclined to tweet on Twitter and like on Facebook. He used FlowingData as his main data source, but also analyzed Quora, xkcd, and New Scientist. The main finding:

    Twitter is still for the techies: articles where the number of tweets greatly outnumber FB likes tend to revolve around software companies and programming. Facebook, on the other hand, appeals to everyone else: yeah, to the masses, and to non-software technical folks in general as well.

    I saw the analysis when it was posted over a year ago but never got around to sharing it. It crossed my desktop again recently. The results still seem to apply.

    From a practical standpoint, I don't think about whether or not people are going to share something more on Twitter or Facebook before I post it. I just link to what I think is interesting. However, when I post something with a poop or fart joke in it (so half the time, basically), I make sure I share it on Facebook, which I have to do manually. Because you know, bowel movements have universal appeal.

  • Network analysis on high school hierarchy of friends

    August 13, 2012  |  Statistics

    Brian Ball and M. E. J. Newman analyzed friendship data from a high school and junior high, and found a hierarchy similar to the one in Mean Girls.

    Here we analyze a large collection of such networks representing friendships among students at US high and junior-high schools and show that the pattern of unreciprocated friendships is far from random. In every network, without exception, we find that there exists a ranking of participants, from low to high, such that almost all unreciprocated friendships consist of a lower-ranked individual claiming friendship with a higher-ranked one.

    So someone higher up on the totem pole had more people saying they were friends with him or her, but the popular one didn't necessarily feel the same.

    I told my wife this, and her reaction was basically, "Uh, yeah. And?"

  • Fox News continues charting excellence

    August 6, 2012  |  Mistaken Data

    Bush cuts

    Fox News tried to show the change in the top tax rate if the Bush tax cuts expire, so they showed the rate now and what it'd be in 2013. Wow, it'll be around five times higher. Wait. No.

    The value axis starts at 34 percent instead of zero, which you don't do with bar charts, because length is the visual cue. That is to say, when you look at this chart, you compare how high each bar is. Fox News might as well have started the vertical axis at 34.9 percent. That would've been more dramatic.

    Here's what the bar chart is supposed to look like:

    With a difference of 4.6 percentage points, the change doesn't look so crazy.
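
    To see how much the truncated axis exaggerates, compare the drawn bar lengths under different baselines. A rough sketch: the rates of 35 and 39.6 percent are my assumption (they match the 4.6-point difference above), and the function name is made up for illustration.

```python
def bar_length_ratio(low, high, baseline=0.0):
    """Ratio of drawn bar lengths when the value axis starts at `baseline`."""
    if baseline >= low:
        raise ValueError("baseline must be below the smaller value")
    return (high - baseline) / (low - baseline)


# Assumed top tax rates, in percent: now versus after the cuts expire.
now, expired = 35.0, 39.6

if __name__ == "__main__":
    # Axis starting at 34 percent, as in the Fox News chart.
    print(round(bar_length_ratio(now, expired, baseline=34.0), 1))  # 5.6
    # Axis starting at zero, as a bar chart should.
    print(round(bar_length_ratio(now, expired, baseline=0.0), 2))   # 1.13
    # Axis starting at 34.9 percent, for extra drama.
    print(round(bar_length_ratio(now, expired, baseline=34.9), 1))  # 47.0
```

    With the axis at zero, the taller bar is only about 13 percent longer than the shorter one, which is what the data actually says.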

    [via Effective Graphs]

  • From statistics to data science, and vice versa

    July 26, 2012  |  Statistics

    Carnegie Mellon statistics professor Cosma Shalizi considers the differences and similarities between statistics and data science.

    If people want to call those who do such jobs "data scientists" rather than "statisticians" because it sounds more dignified, or gets them more money, or makes them easier to hire, then more power to them. If they want to avoid the suggestion that you need a statistics degree to do this work, they have a point but it seems a clumsy way to make it. If, however, the name "statistician" is avoided because that connotes not a powerful discipline which transforms profound ideas about learning from experience into practical tools, but rather, a meaningless conglomeration of rituals better conducted with twenty-sided dice, then we as a profession have failed ourselves and, more importantly, the public, and the blame lies with us. Since what we have to offer is really quite wonderful, we should not let that happen.

    Some time during the past couple of years, statistics became data science's older, more boring sibling that always plays by the rules. There are a lot of statisticians who now call themselves data scientists. I still call myself a statistician.

    But I think we're getting closer to that part in the movie when the older, stuffier character learns from the young whippersnapper that loosening up could be a good thing, and when the young one realizes that some elbow grease and tradition can go a long way.

  • Computing for data analysis

    July 24, 2012  |  Statistics

    If you want to learn visualization, you should learn data. To learn data, you should learn statistics. Where to begin? The free analysis courses offered on Coursera by Johns Hopkins professors are probably a good place to start. Currently available: Computing for Data Analysis with biostatistics professor Roger D. Peng and Data Analysis with Jeff Leek, also a biostatistics professor.

    There's also a handful of data-related courses from other university professors that might be worth a look.

  • How consumers suck at math

    July 18, 2012  |  Statistics

    Derek Thompson for The Atlantic on how retail uses our numeric biases to their advantage:

    Now that I've just told you that consumers try to avoid additional payments, I should add that there are two additional payments we love: rebates and warranties. The first buys the illusion of wealth ("I'm being paid money to spend money!"). The second buys peace of mind ("Now I can own this thing forever without worrying about it!"). Both are basically tricks. "Instead of buying something and getting a rebate," Poundstone writes, "why not just pay a lower price in the first place?"

    "[Warranties] make no rational sense," Harvard economist David Cutler told the Washington Post. "The implied probability that [a product] will break has to be substantially greater than the risk that you can't afford to fix it or replace it. If you're buying a $400 item, for the overwhelming number of consumers that level of spending is not a risk you need to insure under any circumstances."

    Other tidbits: our obsession with prices ending with a nine and how we justify purchases of things that are more expensive but aren't necessarily better than the cheaper item.

  • Data plural versus data singular

    July 12, 2012  |  Statistics

    Kevin Drum on data is or data are:

    Now, I know that lots of people continue to foolishly disagree with me about this, but I'm curious how far they're willing to push things. If you had, say, five bits of information, would you say I only have five data? If you really, truly believe that data is a plural noun, you'd have no problem with this. But does anyone actually do it?

    This was in response to the Wall Street Journal's style guy saying that they can go either way, as the word has evolved to also mean a singular collection of numbers.

    Here's what the New York Times style guide has to say about it:

    [D]ata is acceptable as a singular term for information: The data was persuasive. In its traditional sense, meaning a collection of facts and figures, the noun can still be plural: They tabulate the data, which arrive from bookstores nationwide. (In this sense, the singular is datum, a word both stilted and deservedly obscure.)

    I say data is. The plural version sounds weird to me.

Unless otherwise noted, graphics and words by me are licensed under Creative Commons BY-NC. Contact original authors for everything else.