Subreddit math with r/The_Donald helps show topic breakdowns

Posted to Statistics  |  Nathan Yau

Trevor Martin for FiveThirtyEight used latent semantic analysis to do math with subreddits, specifically r/The_Donald.

We’ve adapted a technique that’s used in machine learning research — called latent semantic analysis — to characterize 50,323 active subreddits based on 1.4 billion comments posted from Jan. 1, 2015, to Dec. 31, 2016, in a way that allows us to quantify how similar in essence one subreddit is to another. At its heart, the analysis is based on commenter overlap: Two subreddits are deemed more similar if many commenters have posted often to both. This also makes it possible to do what we call “subreddit algebra”: adding one subreddit to another and seeing if the result resembles some third subreddit, or subtracting out a component of one subreddit’s character and seeing what’s left.
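To make the "subreddit algebra" idea concrete, here is a minimal sketch in Python. It is not the FiveThirtyEight pipeline: the subreddit names (`sub_a` through `sub_d`) and the overlap counts are invented for illustration, and plain cosine similarity on raw counts stands in for the full latent semantic analysis. Each subreddit is treated as a vector of commenter overlap, two subreddits are added or subtracted, and the result is compared against the rest.

```python
import math

# Hypothetical toy data: each subreddit is described by how many of its
# commenters also posted in three made-up reference communities.
# Names and counts are invented; they are not from the article.
vectors = {
    "sub_a": [9.0, 1.0, 0.0],
    "sub_b": [0.0, 1.0, 9.0],
    "sub_c": [5.0, 1.0, 5.0],
    "sub_d": [1.0, 9.0, 1.0],
}


def normalize(v):
    """Scale a vector to unit length so large subreddits don't dominate."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity: 1.0 means the same direction, 0.0 means unrelated."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))


def combine(u, v, sign=1.0):
    """Add (sign=1.0) or subtract (sign=-1.0) two unit-length vectors."""
    u, v = normalize(u), normalize(v)
    return [a + sign * b for a, b in zip(u, v)]


def nearest(query, exclude=()):
    """Return the name of the subreddit most similar to the query vector."""
    candidates = {k: v for k, v in vectors.items() if k not in exclude}
    return max(candidates, key=lambda k: cosine(query, candidates[k]))


# "sub_a + sub_b": which remaining subreddit looks like the combination?
added = nearest(
    combine(vectors["sub_a"], vectors["sub_b"]),
    exclude=("sub_a", "sub_b"),
)

# "sub_c - sub_a": what is left once sub_a's character is removed?
subtracted = nearest(
    combine(vectors["sub_c"], vectors["sub_a"], sign=-1.0),
    exclude=("sub_c", "sub_a"),
)

print(added, subtracted)
```

In this toy setup, `sub_c` shares commenters with both `sub_a` and `sub_b`, so the sum of those two lands closest to it, and subtracting `sub_a` from `sub_c` leaves something closest to `sub_b`. The real analysis works on tens of thousands of subreddits, so treat this only as a picture of the arithmetic.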


