Large-ish data packages in R

Jul 24, 2014

If you’ve played around with R enough, there comes a time when you just need some data to mess around with. Maybe it’s to learn a new method or to make one of your own. R offers some small-ish, clean datasets to poke at, but sometimes you need bigger, messier data. Hadley Wickham from RStudio released four popular large-ish datasets in package form to help you with that.

I’ve released four new data packages to CRAN: babynames, fueleconomy, nasaweather and nycflights13. The goal of these packages is to provide some interesting, and relatively large, datasets to demonstrate various data analysis challenges in R. The package source code (on github, linked above) is fully reproducible so that you can see some data tidying in action, or make your own modifications to the data.



Visualizing the Uncertainty in Data

Data is an abstraction, and it’s impossible to encapsulate everything it represents in real life. So there is uncertainty. Here are ways to visualize the uncertainty.

Causes of Death

There are many ways to die. Cancer. Infection. Mental. External. This is how different groups of people died over the past 10 years, visualized by age.

Years You Have Left to Live, Probably

The individual data points of life are much less predictable than the average. Here’s a simulation that shows you how much time is left on the clock.

How to Spot Visualization Lies

Many charts don’t tell the truth. This is a simple guide to spotting them.