  • Lies people tell in online dating

    August 5, 2010  |  Statistics

    Male height distribution graph on OkCupid

    Online dating site OkCupid continues its amusing yet thorough analysis of its 1.51 million users. This time around, they cover the lies people tell:

    People do everything they can in their OkCupid profiles to make themselves seem awesome, and surely many of our users genuinely are. But it's very hard for the casual browser to tell truth from fiction. With our behind-the-scenes perspective, we're able to shed some light on some typical claims and the likely realities behind them.

    Among the findings:

    • People exaggerate their height by about two inches.
    • If someone says they make $100k per year, they probably mean $80k.
    • The more attractive a picture, the older it is.
    • Most self-identified bisexuals (80%) only like one gender.

    Buyer beware.

  • Afghanistan war logs revealed and mapped

    July 27, 2010  |  Data Sources, Mapping

    Afghanistan incidents from war logs

    This past Sunday, well-known whistle-blower site Wikileaks released over 91,000 secret US military reports, covering the war in Afghanistan. Each report contains the time, geographic location, and details of an event the US military thought was important enough to put on paper.

  • Tardiness solves statistics theorems

    July 21, 2010  |  Statistics

    Yeah, you read that right. Tardiness makes the world go 'round:

    One day in 1939, Berkeley doctoral candidate George Dantzig arrived late for a statistics class taught by Jerzy Neyman. He copied down the two problems on the blackboard and turned them in a few days later, apologizing for the delay — he’d found them unusually difficult. Distracted, Neyman told him to leave his homework on the desk.

    On a Sunday morning six weeks later, Neyman banged on Dantzig’s door. The problems that Dantzig had assumed were homework were actually unproved statistical theorems that Neyman had been discussing with the class — and Dantzig had proved both of them. Both were eventually published, with Dantzig as coauthor.

    Other benefits include more hours of sleep, exercise while power-walking to your destination, and all-around warm, fuzzy feelings knowing that you live by nobody's schedule. You might also supposedly inspire films like Good Will Hunting.

    Who knew?

    [via Bobulate]

  • Open data doesn’t empower communities

    July 5, 2010  |  Data Sharing

    internet.artizans reflects on the usefulness of open data:

    I'm inspired by the idea that nuggets of opened data could seed guerilla public services, plugging gaps left by government, but i don't see any of that in the data.gov.uk apps list. The reasons aren't technical but psychosocial - the people and communities who could use this data to help tackle their own disadvantage and marginalisation don't have the self-confident sense of entitlement that makes for successful civic hacktivism.

    The groups that really need it often also lack the tech or know-how to make use of data, or even to collect useful data, to make a case for anything. People like us, the data- and tech-savvy, can help.

    [via migurski]

  • Data and its impact on journalism

    June 7, 2010  |  Data Sources, Statistics

    With regard to the UK's recent boom in open data, Simon Rogers of the Guardian ponders data's role in journalism and the opportunities this newfound information could bring:

    The impact on journalism is expected to be great. The Chicago-based web developer and founder of the neighbourhood news site EveryBlock, Adrian Holovaty, says it's going to be challenging but exciting for journalists. "As more governments open their data, journalists lose privileged status as gatekeepers of information – but the need for their work as curators and explainers increases. The more data that's available in the world, the more essential it is for somebody to make sense of it."

    This need not only creates a fresh brand of news, but also a new type of journalist:

    I once prided myself on my lack of maths knowledge. Now I find myself editing a datajournalism site, the Guardian's datablog: a site where we use Google Spreadsheets to post key datasets. We make the data properly accessible, then encourage our users to take the numbers, produce graphics and applications and help us look for stories.

    Priding yourself on not knowing how to deal with data is a little weird, but okay.

    In any case, people always ask me how to get into information design, infographics, visualization etc. Journalism is one of those choices, and there's a lot of opportunity there if you've got the skills.

  • Egregious Citations Issued to BP

    June 6, 2010  |  Data Sources

    BP processes about 1.5 million barrels of crude oil per day, across six refineries in the United States. In total, 150 refineries in the United States process just under 18 million barrels per day, so BP processes about 8.5 percent of it. However, as reported by the Center for Public Integrity, 97 percent of the most dangerous violations found by OSHA were on BP properties.
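
    The share quoted above is easy to sanity-check. Here is a minimal sketch using only the figures in the paragraph. The 18 million total is an assumed upper bound, since the post only says "just under 18 million," and the names are made up for illustration.

```python
# Figures quoted in the post above (barrels of crude oil per day).
BP_BARRELS_PER_DAY = 1.5e6
# The post says "just under 18 million"; 18 million is an assumed upper bound.
US_BARRELS_PER_DAY_UPPER = 18e6
# The share quoted in the post.
QUOTED_SHARE = 0.085


def share_percent(part, total):
    """Return part as a percentage of total."""
    return 100.0 * part / total


# BP's share if the US total were exactly 18 million barrels per day.
lower_bound_share = share_percent(BP_BARRELS_PER_DAY, US_BARRELS_PER_DAY_UPPER)

# US total that would make BP's share exactly 8.5 percent.
implied_total = BP_BARRELS_PER_DAY / QUOTED_SHARE

print(f"Share at 18M barrels/day: {lower_bound_share:.1f}%")
print(f"Total implied by 8.5%: {implied_total / 1e6:.2f}M barrels/day")
```

    At a flat 18 million barrels the share comes out a bit above 8.3 percent, so "about 8.5 percent" fits a total somewhat under 18 million, which matches the wording of the post.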

  • Data Science is catching on

    June 2, 2010  |  Statistics

    Maybe there's something to this whole data science thing after all. Mike Loukides describes data science and where it's headed on O'Reilly Radar. It's a good read, but statisticians get clumped into suits crunching numbers like actuarial drones:

    Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well-defined kinds of analysis. What differentiates data science from statistics is that data science is a holistic approach. We're increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.

    What is data science? It's what well-rounded statisticians do.

  • Live webcast: Community Health Data Initiative

    June 2, 2010  |  Data Sources, News

    Health and Human Services (HHS) is about to announce the launch of its Community Health Data Initiative in DC. The point is to make health data more usable for consumers and communities.

    Today, from about 9:30 to 10:30 (as in right now), groups will present how they've made use of the data over the past few weeks. I've embedded the live webcast below.

    They're just going through the formalities of thank yous and intros right now, but the good stuff should start soon.

  • Junk food equivalents of sugary drinks

    May 28, 2010  |  Statistics

    Men's Health takes a look at America's most sugary drinks and their junk food equivalents. A Peppermint White Chocolate Mocha with whipped cream (venti size) from Starbucks has the same amount of sugar as 8½ scoops of Edy’s Slow Churned Rich and Creamy Coffee Ice Cream. Calorie-wise, the picture might look a little different. Still though, that's a lot of sugar.

    Be careful what you drink, boys and girls.

    [via Boing Boing]

  • Instant electric bike and data collector

    May 26, 2010  |  Data Sharing, Self-surveillance

    When you ride your bicycle around, I bet you always wish for two things. First: "I wish this was electric so that I didn't have to pedal so much." Second: "I wish I could use my bicycle as a data collection device." Well guess what. Your dreams have come true. The Copenhagen Wheel, conceived by the MIT SENSEable City Lab, will do just that. With everything rolled up into one hub, a quick and simple installation turns your plain old bicycle into an electric data collection device.

  • Why context is as important as the data itself

    May 21, 2010  |  Design, Statistics

    John Allen Paulos, a math professor at Temple University, explains in the New York Times the importance of what happens before and after you get that data blobby thing in your hands.

    The problem isn’t with statistical tests themselves but with what we do before and after we run them. First, we count if we can, but counting depends a great deal on previous assumptions about categorization. Consider, for example, the number of homeless people in Philadelphia, or the number of battered women in Atlanta, or the number of suicides in Denver. Is someone homeless if he’s unemployed and living with his brother’s family temporarily? Do we require that a woman self-identify as battered to count her as such? If a person starts drinking day in and day out after a cancer diagnosis and dies from acute cirrhosis, did he kill himself?

    In a nutshell, statistics is a game of estimation. More often than not, the numbers in front of you aren't an exact count. They could easily change if you shift the criteria of what was counted. As a result, there's always some amount of uncertainty attached to your data, and it's the statistician, analyst, and data scientist's job to minimize that uncertainty.

    So the next time you see a list of rankings like "fattest city" or "dumbest town," don't take it for absolute truth. Instead, think of it as an educated guess. Similarly, when you analyze and visualize, remember the context of your data.

    Catch Paulos' full article here.

  • Wait. Something isn’t right here…

    May 14, 2010  |  Mistaken Data

    All the same

    No clue where this is from, but something seems sort of off, no? I guess we should take the title literally. By the numbers... only.

    I'm going to give the benefit of the doubt though, and assume this was just an honest mistake. Here's my guess about what happened. A deadline was coming up quick, and a graphics editor put this together to get a feel for what the final design would look like. He then saved it as a different file, and then went to work. Except when it came time to send the file to the printers, the editor sent the wrong file. Actually, now that I think about it, I'm surprised this doesn't happen more often.

    [via @EagerEyes]

  • Write your own TED talk with lies, damned lies and statistics

    May 12, 2010  |  Statistics

    Sebastian Wernicke, an engagement manager at Oliver Wyman and former bioinformatics researcher, explains the results from his pseudo-analysis of TED talks. The result: a guide on how to give the ultimate TED talk. Go as long as you can, grow your hair out and wear glasses, and cover happy ideas that are easy to relate to. Or better yet, use Wernicke's tedPAD to formulaically write your own talk to drive the audience wild - or boo at you emphatically.

  • How open data saved $3.2 billion

    May 12, 2010  |  Statistics

    This is a story of fake charities and tax shelters. In an analysis of data from the Canada Revenue Agency (CRA), it was found that billions of dollars in donations were collected by fraudulent organizations, with only a tiny portion going to the actual causes. In one case, only $1 out of every $100 went to helping the homeless. The rest of the money went to a tax shelter. Shameful.

    All told, my colleague estimated that these illegally operating charities alone sheltered roughly half a billion dollars in 2005. Indeed, newspapers later confirmed that in 2007, fraudulent donations were closer to a billion dollars a year, with some 3.2 billion dollars illegally sheltered, a sum that accounts for 12% of all charitable giving in Canada.

    This exposed not only the fraud but also negligence on the part of the CRA charity division (now under new leadership). How did this go on for so long? A simple sort on the data would have raised questions immediately. Instead, it took a freelance consultant, poking around out of curiosity, and journalists, who were aware of fishy behavior, to move things along.

    [via @datamarket]

  • How men and women label colors

    May 4, 2010  |  Infographics, Statistics

    Along the same lines as Dolores Labs' color experiment, Randall Munroe of xkcd reveals the results of his color survey. He took a slightly different approach though. Here are some of the basic findings:

    If you ask people to name colors long enough, they go totally crazy.

    “Puke” and “vomit” are totally real colors.

    Colorblind people are more likely than non-colorblind people to type “fuck this” (or some variant) and quit in frustration.

    Indigo was totally just added to the rainbow so it would have 7 colors and make that “ROY G. BIV” acronym work, just like you always suspected. It should really be ROY GBP, with maybe a C or T thrown in there between G and B depending on how the spectrum was converted to RGB.

    A couple dozen people embedded SQL ‘drop table’ statements in the color names. Nice try, kids.

    Nobody can spell “fuchsia”.


  • Twitter data buffet is back in business

    April 28, 2010  |  Data Sources

    Almost a year and a half ago, Infochimps, the data repository slash marketplace, released a giant scrape of Twitter data representing 2.7 million users, 10 million tweets, and 58 million connections. Twitter soon requested that they take it down while they figured out how they wanted to handle licensing, privacy, etc.

    That was in 2008, before Twitter really started booming. Fast forward to now. Twitter and Infochimps have figured out what they want to do, and the Twitter census data is back up. It's no longer a measly 2.7 million users though. The population has grown to 35 million.

  • R is an ‘epic fail’ – or how to make statisticians mad

    April 22, 2010  |  Software, Statistics

    Statisticians are mad and out for blood. Someone called R an epic fail and said it wasn't the next big thing.

    I know that R is free and I am actually a Unix fan and think Open Source software is a great idea. However, for me personally and for most users, both individual and organizational, the much greater cost of software is the time it takes to install it, maintain it, learn it and document it. On that, R is an epic fail. It does NOT fit with the way the vast majority of people in the world use computers. The vast majority of people are NOT programmers. They are used to looking at things and clicking on things.

    How dare she, right? Here's the thing. She's right. Wait, wait, hear me out. R is not for the general audience, the people who use Excel as their analysis tool. In that case, where the goal is to appeal to non-statistician analysts, R is, as they say, an epic fail (and that is the last time I will say that stupid phrase).

    However, R wasn't designed to enable everyday users to dig into data. It was designed to enable statisticians with computing power. It's a statistical computing language largely based on S, which was developed in the 1970s by the super smart John Chambers of Bell Labs. The 1970s. Weren't people using slide rules still? Or maybe it was the abacus. Can't remember. Oh wait, I wasn't born yet. In any case, there's really no need to get into the whole R-for-general-audience conversation — just like we don't need to talk about why The SpongeBob SquarePants Movie lacked emotional depth.

  • World data released ‘is a dream come true’

    April 20, 2010  |  Data Sources

    In another step towards open data and all that jazz, the World Bank released World Development Indicators 2010 today, which is meant to serve as a progress report of the world.

    "The WDI provides a valuable statistical picture of the world and how far we've come in advancing development," said Justin Yifu Lin, the World Bank’s Chief Economist and Senior Vice President for Development Economics. "Making this comprehensive data free for all is a dream come true."

    More importantly though, this comes with the launch of the freely available online database and public API to 1,000+ indicators. There used to be a big fee for this data. I can't speak for the API, but the website is well-designed. It has profile pages for each country, links to download the indicators in Excel and XML, and hey, are those graphs implemented in HTML5? I spy <canvas> tags.

  • TransparencyData makes campaign finance data easier to access

    April 14, 2010  |  Data Sources

    Anyone who's looked at campaign finance data knows it can get messy really quick (especially if you're getting it directly from the FEC). Sunlight Labs' newly launched TransparencyData aims to make the process a lot easier.

    They've merged state data from FollowTheMoney and federal data from OpenSecrets and made it easy to search with a clickable interface. Select from a number of filters such as amount, recipient, or contributor, and then download data in bulk or make use of the API.

  • Twitter predicts the future?

    April 13, 2010  |  Statistics

    A recent study [pdf] by Sitaram Asur and Bernardo A. Huberman at HP Labs found that it's possible to use Twitter chatter to predict first-weekend box office revenues simply based on volume of tweets. The predictions were even more accurate when they introduced sentiment analysis (i.e. classified tweets as positive or negative).
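
    To get a feel for what a volume-based prediction might look like, here is a minimal sketch that fits a straight line to tweet rate versus opening-weekend revenue with ordinary least squares. The numbers are made-up toy values, not data from the study, and the study's actual model and variables may differ. The names fit_line and predict are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and the x-y cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# MADE-UP toy numbers for illustration only (not from the HP Labs study).
tweets_per_hour = [100, 250, 400, 800]
opening_weekend_musd = [6.0, 13.5, 21.0, 41.0]  # millions of US dollars

slope, intercept = fit_line(tweets_per_hour, opening_weekend_musd)


def predict(rate):
    """Predict opening-weekend revenue (millions USD) from a tweet rate."""
    return slope * rate + intercept


print(f"Predicted revenue at 600 tweets/hour: {predict(600):.1f}M USD")
```

    A single predictor like this only captures volume. Adding a sentiment measure, as the study describes, would mean a second input variable and a multiple regression rather than a single line.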

Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.