  • Most underrated films

    May 6, 2014  |  Data Sources

    Rotten Tomatoes film ratings

    Ben Moore was curious about overrated and underrated films.

    "Overrated" and "underrated" are slippery terms to try to quantify. An interesting way of looking at this, I thought, would be to compare the reviews of film critics with those of Joe Public, reasoning that a film which is roundly-lauded by the Hollywood press but proved disappointing for the real audience would be "overrated" and vice versa.

    Through the Rotten Tomatoes API, he found data to make such a comparison. Then he plotted one against the other, along with a quick calculation of the difference between the percentage of critics who liked each film and the percentage of the Rotten Tomatoes audience who did. The most underrated: Facing the Giants, Diary of a Mad Black Woman, and Grandma's Boy. The most overrated: Spy Kids, 3 Backyards, and Stuart Little 2.

    The plot would be better without the rainbow color scheme and with a simple reference line along the even-rating diagonal. But this gets bonus points for sharing the code snippet to access the Rotten Tomatoes API in R, which you can generalize.
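
    The comparison itself is a per-film subtraction followed by a sort. Here is a minimal Python sketch of that step, not Moore's R code. The film titles and scores are made-up placeholders rather than Rotten Tomatoes values, and the sign convention (audience minus critics) is an assumption.

```python
# Hypothetical placeholder scores (percent who liked each film); not real data.
films = [
    {"title": "Film A", "critics": 90, "audience": 45},
    {"title": "Film B", "critics": 20, "audience": 85},
    {"title": "Film C", "critics": 60, "audience": 62},
]


def rating_gaps(films):
    """Return (title, gap) pairs sorted from most underrated to most overrated.

    gap = audience score - critic score (an assumed convention), so a large
    positive gap means the audience liked the film far more than critics did.
    """
    gaps = [(f["title"], f["audience"] - f["critics"]) for f in films]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


ranked = rating_gaps(films)
print(ranked)
```

    Films near the top of the list would count as "underrated" under this convention, and films near the bottom as "overrated."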

  • Bike share data in New York, animated

    April 1, 2014  |  Data Sources

    Citi Bike, also known as NYC Bike Share, is releasing monthly data dumps for station check-outs and check-ins, which gives you a sense of where and when people move about the city. Jeff Ferzoco, Sarah Kaufman, and Juan Francisco Saldarriaga mapped 24 hours of activity in the video below.

    [Thanks, Jeff]

  • ProPublica opened a data store

    March 4, 2014  |  Data Sources

    One of the main challenges of any data project is getting the data. It seems obvious, but the effort to get the right data to answer a question seems to catch people off guard. Even data that's "free" to download can be a huge pain that ends up completely useless. ProPublica, the non-profit newsroom, deals with this stuff on a regular basis and hopes that some of their efforts can turn into a source of funding through the Data Store.

    Like most newsrooms, we make extensive use of government data — some downloaded from "open data" sites and some obtained through Freedom of Information Act requests. But much of our data comes from our developers spending months scraping and assembling material from web sites and out of Acrobat documents. Some data requires months of labor to clean or requires combining datasets from different sources in a way that's never been done before.

    In the Data Store you'll find a growing collection of the data we've used in our reporting. For raw, as-is datasets we receive from government sources, you'll find a free download link that simply requires you agree to a simplified version of our Terms of Use. For datasets that are available as downloads from government websites, we've simply linked to the sites to ensure you can quickly get the most up-to-date data.

    For datasets that are the result of significant expenditures of our time and effort, we're charging a reasonable one-time fee: In most cases, it's $200 for journalists and $2,000 for academic researchers.

    I hope it works.

  • Texting data to save lives

    February 6, 2014  |  Data Sources

    Remember that TED talk from a couple of years ago on texting patterns to a crisis hotline? The speaker, Nancy Lublin, proposed analyzing these text messages to help the people sending them. Her group, the Crisis Text Line, plans to release anonymized aggregates in the coming months.

    Ms. Lublin said texts also provided real-time information that showed patterns for people in crisis.

    Crisis Text Line's data, she said, suggests that children with eating disorders seek help more often Sunday through Tuesday, that self-cutters do not wait until after school to hurt themselves, and that depression is reported three times as much in El Paso as in Chicago.

    This spring, Crisis Text Line intends to make the aggregate data available to the public. "My dream," Ms. Lublin said, "is that public health officials will use this data and tailor public policy solutions around it."

    Keeping an eye on this.

  • Cancer data for the U.S. released

    November 1, 2013  |  Data Sources

    The Centers for Disease Control and Prevention released their most recent cancer data a few days ago. It's the numbers for 2010, which feels dated. However, the annual data goes back to 1999, across demographics and states, which makes this data worth a look. You can download the delimited files here.

    A browser accompanies the release, as shown below. It's really just that though, leaving analysis up to you, and it's rough around the edges.

    Cancer statistics

    So if you're looking for a weekend project, this is a good place to go. I'd probably start with the age breakdowns and work from there.

  • Government data shutdown

    October 2, 2013  |  Data Sources

    Census shutdown

    When you go to the United States Census site, Data.gov, or similar government-run sites, you see this. "Due to the lapse in government funding, census.gov sites, services, and all online survey collection requests will be unavailable until further notice." Now it's personal.

  • Data.gov revamp

    July 31, 2013  |  Data Sources

    Data dot gov revamp

    After budget cuts a couple of years ago, I assumed Data.gov was all but dead, but apparently there's a new site in the works.

    The original version of Data.gov was hard to use, and you rarely found the data you wanted. I always ended up on Google and landed on the department's source instead. It looks like they improved the interface, and their aim is towards a community built around the data where people can share projects and analyses.

    However, the data available on the site still looks slim and dated, which was a challenge with the original version. I mean the homepage says you can search 100s of APIs and over 75,000 datasets, but then click over to the Data Catalog and it says only 409 datasets found. So there's still work to be done.

    I'm glad the project is still alive though. We'll have to see where this goes.

  • Medicare provider charge data released

    May 28, 2013  |  Data Sources

    NYT hospital browser

    The Centers for Medicare and Medicaid Services released billing data for more than 3,000 U.S. hospitals, showing high variance in the cost of health care across the country and even between nearby hospitals.

    As part of the Obama administration’s work to make our health care system more affordable and accountable, data are being released that show significant variation across the country and within communities in what hospitals charge for common inpatient services.

    The data provided here include hospital-specific charges for the more than 3,000 U.S. hospitals that receive Medicare Inpatient Prospective Payment System (IPPS) payments for the top 100 most frequently billed discharges, paid under Medicare based on a rate per discharge using the Medicare Severity Diagnosis Related Group (MS-DRG) for Fiscal Year (FY) 2011. These DRGs represent almost 7 million discharges or 60 percent of total Medicare IPPS discharges.

    The data is downloadable as CSV or Excel files and is surprisingly usable and worth a look.

    The New York Times has a useful per-hospital browser and The Washington Post provides quick comparisons by state.
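
    If you grab the CSV, a first look at that variance could be the range of charges per procedure. The Python sketch below uses a tiny in-memory sample; the column names (drg, hospital, avg_charge) and all values are hypothetical placeholders, not the actual CMS schema or figures.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for the downloaded CSV; column names and values are
# placeholders, not the real CMS schema or figures.
sample_csv = """drg,hospital,avg_charge
DRG X,Hospital 1,10000
DRG X,Hospital 2,40000
DRG Y,Hospital 1,5000
DRG Y,Hospital 2,6000
"""


def charge_ranges(csv_text):
    """Map each DRG to (min charge, max charge, max/min ratio)."""
    charges = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        charges[row["drg"]].append(float(row["avg_charge"]))
    return {
        drg: (min(vals), max(vals), max(vals) / min(vals))
        for drg, vals in charges.items()
    }


ranges = charge_ranges(sample_csv)
print(ranges)
```

    A high max/min ratio flags procedures where hospitals charge very different amounts for the same discharge group.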

  • Archive of datasets bundled with R

    November 20, 2012  |  Data Sources

    R comes with a lot of datasets, some with the core distribution and others with packages, but you'd never know which ones unless you went through all the examples found at the end of help documents. Luckily, Vincent Arel-Bundock cataloged 596 of them in an easy-to-read page, and you can quickly download them as CSV files.

    Many of the datasets are dated, going back to the original distribution of R, but it's a great resource for teaching or if you're just looking for some data to play with.

  • Data on decades of Boy Scout expulsions released

    October 22, 2012  |  Data Sources

    Allegations in the Boy Scouts

    The Los Angeles Times released nearly 5,000 records of allegations from the Boy Scouts of America as a browseable map and searchable list. You can also download the data.

    This database contains information on about 5,000 men and a handful of women who were expelled from the Boy Scouts of America between 1947 and January 2005 on suspicion of sexual abuse. The dots on the map indicate the location of troops connected in some way to the accused. The timeline below shows the volume of cases opened by year; however, an unknown number of files were purged by the Scouts prior to the early 1990s.

    The interactive map helps you narrow down by city, but it's kind of hard to see cases on a country-wide perspective. Here's a quick look.

    The worst part is that a lot of the cases went unreported.

  • Losing American Community Survey would be ‘disastrous’

    June 11, 2012  |  Data Sources

    Many want to get rid of the American Community Survey, a Census program which releases region-specific data annually. University of Michigan professor William Frey explains why cutting the survey would be a mistake.

  • A Future Without Key Social and Economic Statistics for the Country

    May 13, 2012  |  Data Sources

    Robert Groves, director of the U.S. Census Bureau, on the Appropriations Bill:

    The Appropriations Bill eliminates the Economic Census, which measures the health of our economy. It terminates the American Community Survey, which produces the social and demographic information that monitors the impact of economic trends on communities throughout the country. It halts crucial development of ways to save money on the next decennial census. In the last three years the Census Bureau has reacted to budget and technological challenges by mounting aggressive operational efficiency programs to make these key statistical cornerstones of the country more cost efficient. Eliminating them halts all the progress to build 21st century statistical tools through those innovations. This bill thus devastates the nation’s statistical information about the status of the economy and the larger society.

    A lot of the negative comments following the post are from people who have never used Census data, or any substantial amount of data for that matter, and have no clue how a dataset can feed into a model to make other estimates. Then there are the people who don't want to answer questions about their toilets. I wonder what their Facebook profiles look like.

  • CNN transcript collection, 2000-2012

    May 9, 2012  |  Data Sources

    Thanks to the Internet Archive and CNN, thirteen years of transcripts, about a gigabyte compressed, are available to download as one file.

    For over a decade, CNN (Cable News Network) has been providing transcripts of shows, events and newscasts from its broadcasts. The archive has been maintained and the text transcripts have been dependably available at transcripts.cnn.com. This is a just-in-case grab of the years of transcripts for later study and historical research.

    Changes in news coverage and CNN's focus over the years, anyone?

    [via @A_L]

  • 1940 Census Individual Records Released

    April 3, 2012  |  Data Sources

    Racial ethnic diversity

    The 72-year mark has arrived, and the United States Census released individual records from 1940 yesterday. So you can now, for example, see that J.D. Salinger lived at 1133 Park Avenue.

  • Texting on the toilet

    January 30, 2012  |  Data Sources

    I thought this riveting post on the New York Times Bits blog about the rise of the toilet texter deserved a graphic. Since their graphics department is no doubt busy with elections, I took the liberty. I am — the 91 percent.

    I got the numbers straight from the Bits post, but you can download the full report from 11mark for all the demographics. You have to register though, and I didn't want to be the guy who creates an online account to just read a report on what people do while they make dooty. I have standards.

  • What Facebook knows about you

    December 14, 2011  |  Data Sources

    Facebook privacy

    Facebook logs and saves a lot of data about you and what you do on their site. This shouldn't be surprising, given that the more time people spend on Facebook, the greater the cash flow, but just how much data do they store? Austrian law student Max Schrems requested all the data Facebook had about him, which European law entitles citizens to do. He got back a CD with 1,222 PDF files.

  • Geo API from Infochimps brings you closer to mapping fun

    August 31, 2011  |  Data Sources

    Summarizer from infochimps

    Mostly because of the popularity of smartphones, location data is all the rage nowadays. You're almost always connected no matter where you are. Rich location data can give you a new sense of place, and at the same time, this sort of data can paint an interesting picture of what's going on in your country or around the world. Hence, Infochimps, the one-stop shop for data folk and developers, just announced their new Geo API.

  • Reporters make it easier to access Census data

    August 29, 2011  |  Data Sources

    Census data can provide valuable information, but the datasets are not always the easiest to access. So you often end up spending a lot of time getting your data in order before you actually get to do anything with it. Investigative Reporters and Editors has released the next phase in their Census project to make Census 2010 more accessible via a simple interface. Easily download data in bulk as CSV or shapefiles or build it into your applications with the API.

    [census.ire.org via @bryanboyer]

  • Get a coffee, give a coffee API

    August 7, 2011  |  Data Sources

    Jonathan Stark, a mobile application consultant, is running an interesting social experiment with his Starbucks card:

    Jonathan's Card is an experiment in social sharing of physical goods using digital currency on mobile phones. I stumbled on the idea while doing research for a blog post about Broadcasting Mobile Currency.

    Based on the similarity to the "take a penny, leave a penny" trays at convenience stores in the US, I've adopted a similar "get a coffee, give a coffee" terminology for Jonathan's Card.

    Simply save the picture of Jonathan's Starbucks card onto your smartphone and use it to buy your coffee. If you like, add money to the card so that someone else can buy a coffee.

    The best part is that Stark provides a simple API that returns the balance on the card every minute. When do people buy coffee? How do people give and take? Are people more likely to give when there's a large balance or when there's nothing left? Lots of fun things to look at.
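
    One way to start on those questions is to turn the per-minute balances into "give" and "take" events by differencing consecutive readings. The Python sketch below assumes you have already collected (timestamp, balance) pairs; the readings are hypothetical placeholders, and nothing here calls Stark's actual API.

```python
from datetime import datetime

# Hypothetical (timestamp, balance in dollars) readings, one per minute.
readings = [
    (datetime(2011, 8, 7, 9, 0), 20.00),
    (datetime(2011, 8, 7, 9, 1), 16.50),
    (datetime(2011, 8, 7, 9, 2), 16.50),
    (datetime(2011, 8, 7, 9, 3), 41.50),
    (datetime(2011, 8, 7, 9, 4), 37.25),
]


def classify_changes(readings):
    """Label each balance change as a 'give' (increase) or 'take' (decrease).

    Returns (timestamp, kind, amount) tuples; minutes with no change are
    skipped.
    """
    events = []
    for (_, prev), (when, curr) in zip(readings, readings[1:]):
        delta = round(curr - prev, 2)
        if delta > 0:
            events.append((when, "give", delta))
        elif delta < 0:
            events.append((when, "take", -delta))
    return events


events = classify_changes(readings)
for when, kind, amount in events:
    print(when.strftime("%H:%M"), kind, amount)
```

    A deposit and a purchase within the same minute would net out in this scheme, so treat the counts as a lower bound on activity.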

    [Jonathan's Card via @kn0thing]

  • Pew Research raw survey data now available

    May 25, 2011  |  Data Sources

    The Pew Research Center churns out a lot of interesting results from a number of surveys about online and American culture, but it has usually shared only aggregated results as pre-made charts and graphs. This is well and good for the information-consuming public; however, these results can spawn curiosities that are fun to dig into. Luckily, the center launched a Data Sets section that provides raw survey responses and the questions in a variety of easy-to-use data formats.

    Our raw data, previously posted only as SPSS files, is now available in comma-delimited (.csv) format for all reports going back to 2003. We hope that making our data available in this open-source format will make analysis easier for researchers who don’t own a copy of SPSS to analyze our data.

    This should be fun. Recent datasets include the social side of the Internet, health tracking habits, and reputation management.

    [Pew Research via @kzickhur]

Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.