  • Stock market predictions with Twitter

    February 3, 2011  |  Statistics

    Apparently moods on Twitter can be used to predict the ups and downs of the stock market, according to work from Johan Bollen and Huina Mao of Indiana University-Bloomington: "Measuring how calm the Twitterverse is on a given day can foretell the direction of changes to the Dow Jones Industrial Average three days later with an accuracy of 86.7 percent."

    I can't wait until Twitter is used to predict when I want to eat and sleep, and my robot can cook me gourmet meals and provide turndown service accordingly. And it better be accurate to the minute. Anything less is failure.

  • Predicting crime before it happens

    February 3, 2011  |  Statistics

    Christopher Beam for Slate explains research being done at UCLA in collaboration with the LAPD on predictive policing:

    Predictive policing is based on the idea that some crime is random—but a lot isn't. For example, home burglaries are relatively predictable. When a house gets robbed, the likelihood of that house or houses near it getting robbed again spikes in the following days. Most people expect the exact opposite, figuring that if lightning strikes once, it won't strike again. "This type of lightning does strike more than once," says Brantingham. Other crimes, like murder or rape, are harder to predict. They're more rare, for one thing, and the crime scene isn't always stationary, like a house. But they do tend to follow the same general pattern. If one gang member shoots another, for example, the likelihood of reprisal goes up.

    This happened in my neighborhood when I was in fifth grade. We lived in a pretty quiet neighborhood, but one morning a window was open. Someone had come into our house while we were sleeping and stolen whatever was in immediate reach. They also stole my dad's brand new bicycle from the garage. The same thing happened to my neighbor two days later.

    [Slate via @amstatnews]

  • Find more of the data you need with DataMarket

    January 31, 2011  |  Data Sources, Online Applications

    Add another online destination to find the data that you need. DataMarket launched back in May with Icelandic data, but just a few days ago relaunched with data of the international variety. They tout 100 million time series datasets and 600 million facts. I'm not totally sure what that means (100 million lines, sets of lines?), but I take it that means a lot.

    Just over 2 years and countless cups of coffee after we started coding, DataMarket.com launches with international data. You can now find, visualize and download data from many of the world’s most important data providers on our site.

    At first glance, DataMarket feels a lot like the now-defunct Swivel. Search for the data you want and you get back a list of datasets. The focus on time series alone is actually a plus, though, because it lets them provide more specific tools to visualize and explore. The current toolset isn't going to blow you away, but it's not bad.

  • Open thread: Charts during the State of the Union address

    January 26, 2011  |  Discussion, Mistaken Data

    Bubble chart during SOTU

    President Barack Obama delivered his State of the Union address yesterday, and this year it was "enhanced" by charts and graphs. Basically, as Obama spoke, graphics that you could equate to PowerPoint slides showed up on the side. What'd you think of the enhancement? Did it add or detract from the message? Were the graphics used honestly and effectively?

    One thing's for sure: there's something wrong with that bubble chart. Uh oh.

  • Tracking space garbage with Space Fence

    January 20, 2011  |  Statistics

    Space Fence

    Lockheed Martin's Space Fence, expected to be in initial operation in 2015, will track the junk floating in space:

    Space Fence is envisioned as a network of ground-based S-band radars that will detect, track, measure and catalog thousands of objects in low-Earth orbit. Expected to begin initial operation in 2015, the system will replace the existing Air Force Space Surveillance System, or VHF Fence, which has been in service since the early 1960s. A leader in S-band radar development, Lockheed Martin's high-powered radar systems will find and follow the course of thousands of pieces of space debris to an accuracy of just meters.

    They provide this video (below) to explain the concept, most of which I'm pretty sure is fake, but let's pretend it's real. It's more exciting that way.

  • A guide for scraping data

    January 17, 2011  |  Data Sources

    Data is rarely in the format you want. Dan Nguyen, for ProPublica, provides a thorough guide on how to scrape data from Flash, HTML, and PDF. [via @JanWillemTulp]
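
    To give a feel for the HTML case, here is a minimal sketch that pulls the rows out of an HTML table using only Python's standard library. It is not taken from Nguyen's guide; the TableScraper class, the scrape_table helper, and the sample page are all made up for illustration, and a real page would likely need more care (nested tables, messy markup).

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row.

    Hypothetical helper for illustration; assumes simple, non-nested tables.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None when outside a <tr>
        self._cell = None  # text pieces of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table rows found in an HTML string."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample page, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>Company</th><th>Payments</th></tr>
  <tr><td>Acme</td><td>1,200</td></tr>
  <tr><td>Globex</td><td>950</td></tr>
</table>
"""

rows = scrape_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

    From here the rows could be written out with the csv module, which is usually the format you wanted in the first place.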

  • The Joy of Stats available in its entirety

    December 30, 2010  |  Statistics

    The Joy of Stats with Hans Rosling

    The Joy of Stats, hosted by Hans Rosling, is now viewable in its entirety (video below):

    Hans Rosling says there’s nothing boring about stats, and then goes on to prove it. Only with statistics can we make sense of the world and harness the data deluge to serve us rather than drown in its confusion.
    A one-hour long documentary produced by Wingspan Productions and broadcast by BBC, 2010.

    Originally, it was only viewable in the UK, and then there were some clips, but finally, you can watch the whole thing. It's an hour long so you might want to bookmark it for later, but it's entertaining all the way through. Plus, it's the week between Christmas and New Year's so I know you're not working.

  • Why the other lines always seem to move faster than yours

    December 24, 2010  |  Statistics

    Waiting lines

    Why does it almost always seem like you're in the slow line at the grocery store or in the driving lane with the most cars on the freeway? Bill "Engineer Guy" Hammack explains in terms of queuing theory in the video below:

    Bill reveals how "queueing theory" - developed by engineers to route phone calls - can be used to find the most efficient arrangement of cashiers and check out lines. He reports on the work of Agner Erlang, a Danish engineer who, at the opening of the 20th century, helped the Copenhagen Telephone Company provide the best level of service at the lowest price.

    Erlang found out how many telephone lines the company needed, given the average number of calls per hour. Similarly, you can figure out how many checkout lines you need, given the average number of customers. It turns out the best arrangement is to have a single line, and the next customer goes to the next available register. There's less chance of blockage from a single delay.
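
    As a rough sketch of the telephone-line calculation, the snippet below uses the Erlang B formula, which gives the chance that a call arrives to find every line busy. The function names and the call volumes are my own assumptions, not numbers from the video. Note that Erlang B assumes blocked calls are simply lost, so it fits the phone case better than a checkout line where customers wait.

```python
def erlang_b(offered_load, lines):
    """Probability that an arriving call finds all `lines` busy (Erlang B).

    offered_load is in erlangs: calls per hour times average call length in hours.
    Uses the standard recurrence B(0) = 1, B(m) = E*B(m-1) / (m + E*B(m-1)).
    """
    blocking = 1.0  # with zero lines, every call is blocked
    for m in range(1, lines + 1):
        blocking = offered_load * blocking / (m + offered_load * blocking)
    return blocking


def lines_needed(offered_load, max_blocking):
    """Smallest number of lines that keeps blocking at or below max_blocking."""
    if not 0 < max_blocking <= 1:
        raise ValueError("max_blocking must be in (0, 1]")
    lines = 0
    blocking = 1.0
    while blocking > max_blocking:
        lines += 1
        blocking = offered_load * blocking / (lines + offered_load * blocking)
    return lines


# Made-up example: 120 calls per hour, 3 minutes per call on average.
offered_load = 120 * 3 / 60  # erlangs
print(lines_needed(offered_load, 0.01), "lines keep blocking at or under 1%")
```

    Swap in customers per hour and average checkout time and you get a first guess at the number of registers, with the caveat above about customers who wait instead of leaving.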

    But apparently people don't like doing that, so assuming you pick a line at random, ending up in the slow line comes down to simple probability.
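
    The arithmetic is short enough to write down. This sketch assumes every line is equally likely to be the fastest, which is my simplification of "random selection"; the function name is made up.

```python
from fractions import Fraction


def chance_another_line_is_fastest(num_lines):
    """Chance that some line other than yours is the fastest.

    Assumes each of the num_lines lines is equally likely to be the fastest.
    """
    if num_lines < 1:
        raise ValueError("need at least one line")
    return 1 - Fraction(1, num_lines)


for n in (2, 3, 5):
    print(n, "lines:", chance_another_line_is_fastest(n))
```

    With three lines open, the chance that one of the other two beats yours is two in three, and it only gets worse as more registers open.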

  • Right versus wrong bubble size

    December 17, 2010  |  Mistaken Data

    Subsidize This from Good Magazine

    I was going to post this graphic from Good when it came out, but decided not to. I made the same mistake when I first started out. It was another case of wrongly sized bubbles. But they fixed the problem, so now we can see what a big difference it makes.
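
    A common way bubbles end up wrongly sized is scaling the radius by the value instead of the area. Whether that was exactly what happened in this graphic is my assumption. The sketch below, with made-up function names and numbers, compares the two approaches.

```python
import math


def radius_for_area_scaling(value, max_value, max_radius):
    """Radius that makes the bubble's AREA proportional to value."""
    return max_radius * math.sqrt(value / max_value)


def radius_for_linear_scaling(value, max_value, max_radius):
    """The mistaken version: RADIUS proportional to value."""
    return max_radius * value / max_value


# Made-up numbers: a value of 25 drawn next to a value of 100 (radius 40).
right = radius_for_area_scaling(25, 100, 40)
wrong = radius_for_linear_scaling(25, 100, 40)

# Area of each bubble relative to the big one; it should equal 25/100.
print("area-scaled bubble covers", (right / 40) ** 2, "of the big bubble")
print("radius-scaled bubble covers", (wrong / 40) ** 2, "of the big bubble")
```

    Scaling by radius makes the smaller value look far smaller than it is, because area shrinks with the square of the radius.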

  • Data analysis is the future of journalism

    December 8, 2010  |  Statistics

    Tim Berners-Lee, credited with inventing the Web, says analyzing data is the future of journalism:

    "Journalists need to be data-savvy. It used to be that you would get stories by chatting to people in bars, and it still might be that you'll do it that way some times.

    "But now it's also going to be about poring over data and equipping yourself with the tools to analyse it and picking out what's interesting. And keeping it in perspective, helping people out by really seeing where it all fits together, and what's going on in the country."

    The Guardian post focuses on current journalists learning new skills, but what we're also going to see is a new type of person — computer scientists, statisticians, and interaction designers — become the storytellers.

  • Jon Stewart explains Wikileaks’ Cablegate

    December 2, 2010  |  Data Sources, News

    You've probably already heard and read about Wikileaks' Cablegate. If not, Andy Baio has a fine roundup with significant coverage and events to get you caught up quick. Alternatively, you can watch Jon Stewart and The Daily Show explain in the clip below (slightly NSFW, because it mentions a body part).

  • The Joy of Stats with Hans Rosling

    November 30, 2010  |  Statistics, Visualization

    Hans Rosling on development

    The Joy of Stats, a one-hour documentary, hosted by none other than the charismatic Hans Rosling, explores the growing importance of statistics:

    [W]ithout statistics we are cast adrift on an ocean of confusion, but armed with stats we can take control of our lives, hold our rulers to account and see the world as it really is. What's more, Hans concludes, we can now collect and analyse such huge quantities of data and at such speeds that scientific method itself seems to be changing.

    From the description, it sounds like they'll touch on Crimespotting by Stamen and Google Translate, among other data-driven projects. Whatever they cover, it's bound to be interesting with Rosling at the front.

  • How do people use Firefox?

    November 30, 2010  |  Data Sources, News

    Mozilla Labs just released a bunch of anonymized browsing data for their open data visualization competition:

    This competition is based on Mozilla's own open data program, Test Pilot. Test Pilot is a user research platform that collects structured user data through Firefox. All data is gathered through pre-defined Test Pilot studies, which aim to explore how people use their web browser and the Internet.

    There are two datasets in various formats. The first is browsing behavior from 27,000 users, including on/off private browsing that we saw a few months ago. The second dataset is from 160,000 users and is on how they actually use the Firefox interface.

    Additionally, both sets have survey answers to questions like "How long have you used Firefox?" which could make for some fun and interesting breakdowns.

    The deadline is December 17.

    [Mozilla Labs]

  • Statistics vs. Stories

    November 29, 2010  |  Statistics

    John Allen Paulos, professor of mathematics at Temple University, describes the differences between statistics and stories:

    [T]here is a tension between stories and statistics, and one under-appreciated contrast between them is simply the mindset with which we approach them. In listening to stories we tend to suspend disbelief in order to be entertained, whereas in evaluating statistics we generally have an opposite inclination to suspend belief in order not to be beguiled.

    And he concludes:

    The focus of stories is on individual people rather than averages, on motives rather than movements, on point of view rather than the view from nowhere, context rather than raw data. Moreover, stories are open-ended and metaphorical rather than determinate and literal.

    Which way do we go when we start telling stories with data?

    [New York Times via @joandimicco]

  • R is the need-to-know stat software

    November 17, 2010  |  Software, Statistics

    This Forbes post on the greatness that is R is being passed around by every statistician and his mother today.

    It's not that this type of analysis wasn't possible before — statisticians have existed, and commercial software has been available to support them, for decades. The fact that R is free to use, free to modify, and its source is open to view, extend and improve means students, stock traders-in-training and fantasy football junkies can familiarize themselves with the software. They can write programs against it. They're likely to continue that usage into their professional lives. When they share their work, the community, down the line, benefits. And the virtuous cycle strengthens.

    What's your favorite (graphical) use of R?

  • Recalls for March

    Making recalls and market withdrawals more accessible

    Last week I found out that the FDA has a feed for all product recalls and market withdrawals since 2009 and an RSS feed with…
  • Simple analysis makes Expedia extra $12m

    November 5, 2010  |  Statistics

    There was a problem on Expedia where a lot of people were choosing their itinerary, entering their information and then dropping off after they clicked on the Buy Now button. It's like getting to the cash register at a store, and the cashier says they can't take your money.

    So analysts took a look and found that the field to enter your company was confusing people, leading to the input of an incorrect address. "After we realised that we just went onto the site and deleted that field — overnight there was a step function [change], resulting in $12m of profit a year, simply by deleting a field."

    Not bad for a little bit of data digging. I hope the analysts got a bonus.

    That said, not every decision has to be driven by data. Balance is good.

    [Silicon via @jpmarcum]

  • Stat concepts to the tune of Gershwin

    October 29, 2010  |  Statistics

    Stat people will probably find this amusing. For the rest, this might make your head explode. Gurdeep Stephens and Michael Greenacre perform classic songs but use statistical concepts for lyrics. Here's Summertime, originally by George Gershwin, turned into a song about statistical modeling (video below).

    It's summertime,
    Statistical modelling is easy,
    Data are fitting,
    Explained variance is high.
    Your data are rich,
    And your model's good-looking,
    So hush, statisticians, don't you cry...


  • Opportunities in Government 2.0

    October 27, 2010  |  Data Sources

    Vivek Wadhwa talks government data and the (financial) opportunities ripe for the picking:

    What is happening with the opening up of government data is nothing less than a silent revolution. There are literally thousands of new opportunities to improve government and to improve society—and to make a fortune while doing it. Unlike the Web 2.0 space, which is overcrowded, Gov 2.0 is uncharted territory: a new frontier to explore, grow things on, and settle on. It’s fresh soil for unlikely seedling ideas that, if they take root, could lead to very successful ventures. So I encourage entrepreneurs to stake their claims as soon as they can.

    Wait a minute. Hold up. You can do more with government data than awkward dashboards? Bring it.

    [TechCrunch via @ucdatalab]

  • A different analytical wall

    October 14, 2010  |  Statistics

    In reference to the wall between reporting data and understanding it, Martin Theus proposes a different one:

    Once you start to explore the data, the whole thing stops to be linear but gets to be very iterative, jumping over the wall every now and then. I.e., you may find out that the data cleaning is insufficient, or the model you have in mind needs some other transformation of the data, or you might want to collect additional or other data altogether.

    The wall does exist, but I think it is more separating two kinds of people / thinking.

    Theus finishes:

    One thing is for sure: we won’t succeed if analysts continue to build useful but technically insufficient tools and computer scientists still build fancy tools that merely help the analysts.

    Or even better: analyst and tool builder become the same person. That'll take much longer though, so communication is a good place to start.


Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.