Data cleaning tips

Posted to Statistics  |  Nathan Yau

When you first learn statistics, visualization, or any data-related subject, the data is usually given to you in a ready-to-use format. This is so that you can spend most of your time on the topic of interest. But once you step outside the learning bubble, data rarely comes in the format you want.

Marc Bellemare, an associate professor in the Department of Applied Economics at the University of Minnesota, provides some practical tips on how to deal with this. Bellemare’s parting advice:

Really, there is no big secret to cleaning data other than “Document everything” and to save everything in different files and in different locations (i.e., your computer, Dropbox, Google Drive), and there is no other way to learn data cleaning than by doing it.

Yep.

Some of the tips are in the context of a specific software environment, but you can easily apply them to more general situations.
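For a rough idea of what "document everything" and "save everything in different files and in different locations" might look like outside of any one tool, here is a minimal Python sketch. It is my own illustration, not Bellemare's method. The function name `save_step`, the file names, the log format, and the sample rows are all made up for the example, and the two temporary folders stand in for wherever you actually keep your working copy and your backup.

```python
import csv
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_step(rows, fieldnames, step, note, out_dir, backup_dir):
    """Write one cleaning step to its own CSV, log it, and copy both elsewhere.

    Hypothetical helper for illustration only.
    """
    out_dir, backup_dir = Path(out_dir), Path(backup_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    backup_dir.mkdir(parents=True, exist_ok=True)

    # Each step gets a new file, so earlier versions are never overwritten.
    out_path = out_dir / f"{step}.csv"
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Append a timestamped note describing what this step did.
    log_path = out_dir / "cleaning_log.txt"
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with log_path.open("a", encoding="utf-8") as log:
        log.write(f"{stamp}\t{out_path.name}\t{len(rows)} rows\t{note}\n")

    # Keep a second copy of the data file and the log in another location.
    shutil.copy2(out_path, backup_dir)
    shutil.copy2(log_path, backup_dir)
    return out_path


# Made-up sample data; temporary folders stand in for real locations.
base = Path(tempfile.mkdtemp())
work, backup = base / "work", base / "backup"

raw = [{"name": " Ann ", "age": "34"}, {"name": "Bob", "age": ""}]
save_step(raw, ["name", "age"], "01_raw", "Data as received", work, backup)

trimmed = [{**r, "name": r["name"].strip()} for r in raw]
save_step(trimmed, ["name", "age"], "02_trimmed",
          "Stripped whitespace from name", work, backup)
```

The point is less the code than the habit: the raw file stays untouched, every change lands in a new file, and the log tells you later what you did and when.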
