Subreddit math with r/The_Donald helps show topic breakdowns

Posted to Statistics  |  Tags: ,  |  Nathan Yau

Trevor Martin for FiveThirtyEight used latent semantic analysis to do math with subreddits, specifically r/The_Donald.

We’ve adapted a technique that’s used in machine learning research — called latent semantic analysis — to characterize 50,323 active subreddits based on 1.4 billion comments posted from Jan. 1, 2015, to Dec. 31, 2016, in a way that allows us to quantify how similar in essence one subreddit is to another. At its heart, the analysis is based on commenter overlap: Two subreddits are deemed more similar if many commenters have posted often to both. This also makes it possible to do what we call “subreddit algebra”: adding one subreddit to another and seeing if the result resembles some third subreddit, or subtracting out a component of one subreddit’s character and seeing what’s left.



Best Data Visualization Projects of 2016

Here are my favorites for the year.

Reviving the Statistical Atlas of the United States with New Data

Due to budget cuts, there is no plan for an updated atlas. So I recreated the original 1870 Atlas using today’s publicly available data.

How to Spot Visualization Lies

Many charts don’t tell the truth. This is a simple guide to spotting them.

How We Spend Our Money, a Breakdown

We know spending changes when you have more money. Here’s by how much.