Subreddit math with r/The_Donald helps show topic breakdowns

Posted to Statistics  |  Nathan Yau

Trevor Martin for FiveThirtyEight used latent semantic analysis to do math with subreddits, specifically r/The_Donald.

We’ve adapted a technique that’s used in machine learning research — called latent semantic analysis — to characterize 50,323 active subreddits based on 1.4 billion comments posted from Jan. 1, 2015, to Dec. 31, 2016, in a way that allows us to quantify how similar in essence one subreddit is to another. At its heart, the analysis is based on commenter overlap: Two subreddits are deemed more similar if many commenters have posted often to both. This also makes it possible to do what we call “subreddit algebra”: adding one subreddit to another and seeing if the result resembles some third subreddit, or subtracting out a component of one subreddit’s character and seeing what’s left.
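To make the "subreddit algebra" idea concrete, here is a minimal sketch in Python. It is not the FiveThirtyEight pipeline: it skips the latent semantic analysis step and works directly on small commenter-overlap vectors. The subreddit names, the reference columns, and every count below are invented for illustration. Only the general recipe (represent each subreddit as a vector, add or subtract, then rank by cosine similarity) follows the description above.

```python
import math

# Hypothetical toy data: each row is a subreddit, each column counts how many
# of its commenters also posted in a reference subreddit. All names and
# numbers are made up for this sketch.
REFERENCE = ["news", "sports", "gaming"]
VECTORS = {
    "politics":   [9.0, 1.0, 1.0],
    "nba":        [1.0, 9.0, 1.0],
    "esports":    [0.0, 5.0, 5.0],
    "videogames": [1.0, 0.0, 9.0],
}


def normalize(v):
    """Scale a vector to unit length so that large subreddits don't dominate."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def cosine(a, b):
    """Cosine similarity: 1.0 means the same direction, 0.0 means orthogonal."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def combine(plus, minus=()):
    """Add and subtract unit-normalized subreddit vectors."""
    result = [0.0] * len(REFERENCE)
    for name in plus:
        result = [r + x for r, x in zip(result, normalize(VECTORS[name]))]
    for name in minus:
        result = [r - x for r, x in zip(result, normalize(VECTORS[name]))]
    return result


def rank(vector, exclude=()):
    """Return subreddit names sorted from most to least similar to `vector`."""
    scores = {
        name: cosine(vector, v)
        for name, v in VECTORS.items()
        if name not in exclude
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # "nba + videogames": which remaining subreddit is closest to the sum?
    print(rank(combine(["nba", "videogames"]), exclude=["nba", "videogames"]))
    # "esports - nba": what is left after removing the sports component?
    print(rank(combine(["esports"], minus=["nba"]), exclude=["esports", "nba"]))
```

With this toy data, adding nba and videogames ranks esports first, and subtracting nba from esports ranks videogames first. A real analysis would start from a much larger commenter co-occurrence matrix and reweight it before comparing vectors.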


