Guide for dealing with bad data

Posted to Statistics  |  Nathan Yau

Enter the real world of data and statistics, and you find that files aren’t always neatly wrapped with a bow and delimited fields. Christopher Groskopf, who recently joined Quartz, provides an “exhaustive reference” to deal with the real stuff.

Most of these problems can be solved. Some of them can’t be solved and that means you should not use the data. Others can’t be solved, but with precautions you can continue using the data. In order to allow for these ambiguities, this guide is organized by who is best equipped to solve the problem: you, your source, an expert, etc. In the description of each problem you may also find suggestions for what to do if that person can’t help you.

The guide is aimed at journalists but easily applies to general data meanderings. I think we can all easily relate to problems such as missing data (“Where did the rest go?”), sample bias (“The population is who?”), and data in a difficult-to-manage format (“They gave you how many PDF files?”).

Bookmark it, read it, and keep it in your digital pocket.
