Download data for 1.7 billion Reddit comments

Jul 21, 2015

There’s been all sorts of weird stuff going on at Reddit lately, but who’s got time for that when you can download 1.7 billion comments left on Reddit from 2007 through May 2015?

This is an archive of Reddit comments from October of 2007 until May of 2015 (complete month). This reflects 14 months of work and a lot of API calls. This dataset includes nearly every publicly available Reddit comment. Approximately 350,000 comments out of ~1.65 billion were unavailable due to Reddit API issues.

You get timestamps, comment ids, controversiality scores, and of course the comment text. It’s 5 gigabytes compressed and available over torrent.
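If you want to poke at the data once it's downloaded, here is a minimal Python sketch. It assumes the dump is bz2-compressed, newline-delimited JSON with one comment per line, and that the fields are named `id`, `created_utc`, `controversiality`, and `body`. None of that is stated in this post, so check it against the actual files first. The helper names are made up for illustration.

```python
import bz2
import json
from typing import Iterable, Iterator, TextIO


def open_dump(path: str) -> TextIO:
    """Open a bz2-compressed dump as text, streaming instead of unpacking to disk.

    Assumption: the archive is bz2-compressed UTF-8 text.
    """
    return bz2.open(path, mode="rt", encoding="utf-8")


def iter_comments(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one comment dict per non-empty line.

    Assumption: each line is a standalone JSON object (newline-delimited JSON).
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_controversial(comments: Iterable[dict]) -> int:
    """Count comments whose (assumed) `controversiality` field is above zero."""
    return sum(1 for c in comments if c.get("controversiality", 0) > 0)
```

To use it, pass the file handle straight through, for example `with open_dump("comments.bz2") as f: print(count_controversial(iter_comments(f)))`, where `comments.bz2` is a placeholder file name. Reading line by line keeps memory use flat, which matters with this many comments.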

Git er done.
