How Netflix creates movie micro-genres

Posted to Statistics  |  Nathan Yau

Alexis Madrigal and Ian Bogost for The Atlantic reverse-engineered the Netflix genre generator, analyzed the data, and then made their own. Then they talked to Todd Yellin, the guy at Netflix who created the micro-genre system. It's no accident when you see altgenres like "Visually-striking Goofy Action & Adventure" and "Sentimental set in Europe Dramas from the 1970s" in your browser.

"The Netflix Quantum Theory doc spelled out ways of tagging movie endings, the 'social acceptability' of lead characters, and dozens of other facets of a movie. Many values are 'scalar,' that is to say, they go from 1 to 5. So, every movie gets a romance rating, not just the ones labeled 'romantic' in the personalized genres. Every movie's ending is rated from happy to sad, passing through ambiguous. Every plot is tagged. Lead characters' jobs are tagged. Movie locations are tagged. Everything. Everyone.

"That's the data at the base of the pyramid. It is the basis for creating all the altgenres that I scraped. Netflix's engineers took the microtags and created a syntax for the genres, much of which we were able to reproduce in our generator."
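
To make the idea of scalar tags plus a genre syntax concrete, here is a minimal Python sketch. It is not Netflix's system or the Atlantic generator: the `MovieTags` class, the `altgenre` function, the threshold of 4, and the word order (adjectives, then genre, then decade) are all assumptions made up for illustration. The only details taken from the article are that tags are scored from 1 to 5 and that genre names are assembled from them.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MovieTags:
    """Hypothetical container for one movie's microtags."""
    title: str
    genre: str                                   # noun genre, e.g. "Dramas"
    scores: Dict[str, int] = field(default_factory=dict)  # adjective -> 1..5
    decade: Optional[str] = None                 # e.g. "1970s"


def altgenre(movie: MovieTags, threshold: int = 4) -> str:
    """Build a genre label from tags scoring at or above the threshold.

    The ordering here is an assumed, simplified syntax:
    adjectives + noun genre + optional "from the <decade>".
    """
    adjectives = [adj for adj, score in movie.scores.items() if score >= threshold]
    parts = adjectives + [movie.genre]
    if movie.decade:
        parts.append(f"from the {movie.decade}")
    return " ".join(parts)


# Made-up movie and scores, purely for illustration.
example = MovieTags(
    title="Hypothetical Movie",
    genre="Action & Adventure",
    scores={"Visually-striking": 5, "Goofy": 4, "Romantic": 1},
)
print(altgenre(example))
```

Because every movie gets a score on every tag, low scores such as the "Romantic" value above simply fall below the cutoff instead of being missing, which is what lets one set of tags drive many different genre labels.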

Be sure to play around with Bogost's generator at the top. It will amuse.

