Subreddit math with r/The_Donald helps show topic breakdowns

Posted to Statistics  |  Tags: ,  |  Nathan Yau

Trevor Martin for FiveThirtyEight used latent semantic analysis to do math with subreddits, specifically r/The_Donald.

We’ve adapted a technique that’s used in machine learning research — called latent semantic analysis — to characterize 50,323 active subreddits based on 1.4 billion comments posted from Jan. 1, 2015, to Dec. 31, 2016, in a way that allows us to quantify how similar in essence one subreddit is to another. At its heart, the analysis is based on commenter overlap: Two subreddits are deemed more similar if many commenters have posted often to both. This also makes it possible to do what we call “subreddit algebra”: adding one subreddit to another and seeing if the result resembles some third subreddit, or subtracting out a component of one subreddit’s character and seeing what’s left.



Causes of Death

There are many ways to die. Cancer. Infection. Mental. External. This is how different groups of people died over the past 10 years, visualized by age.

Shifting Incomes for American Jobs

For various occupations, the difference between the person who makes the most and the one who makes the least can be significant.

A Day in the Life of Americans

I wanted to see how daily patterns emerge at the individual level and how a person’s entire day plays out. So I simulated 1,000 of them.

The Changing American Diet

See what we ate on an average day, for the past several decades.