Statistics
More than mean and standard deviation.

Machine learning a cappella on overfitting
From the machine learning course on Udacity, an a cappella group sings a Thriller parody on overfitting. At first you’re like, “Is this real? Am I dreaming?” Then you’re like,…

A majority of your email in Gmail, even if you don’t use it
For reasons of autonomy, control, and privacy, Benjamin Mako Hill runs his own email server. After a closer look though, he realized that much of the email he sends ends…

Newborn false positives
Shutterfly sent promotional emails that congratulate new parents and encourage them to send thank you cards. The problem: a lot of people on that list weren’t new parents. Several tipsters…

Random things that correlate
This is fun. Tyler Vigen wrote a program that attempts to automatically find things that correlate. As of writing this, 4,000 correlations were found so far (and actually over 100…

Type I and II errors simplified
“Type I” and “Type II” errors, names first given by Jerzy Neyman and Egon Pearson to describe rejecting a null hypothesis when it’s true and accepting one when it’s not,…
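To make the two definitions concrete, here is a minimal sketch in Python. The coin-flip setup is an assumption chosen for illustration and is not from the linked post: 20 flips, a null hypothesis of a fair coin that gets rejected at 15 or more heads, and a truly biased coin with heads probability 0.7 for the alternative.

```python
from math import comb


def binom_tail(n, p, k_min):
    """Return P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# Hypothetical test setup (all four values are illustrative assumptions).
N_FLIPS = 20      # flips per experiment
REJECT_AT = 15    # reject "the coin is fair" at this many heads or more
P_NULL = 0.5      # heads probability if the null hypothesis is true
P_ALT = 0.7       # heads probability of the biased coin we hope to detect

# Type I error: the coin is fair, but we reject the null anyway.
type_i = binom_tail(N_FLIPS, P_NULL, REJECT_AT)

# Type II error: the coin is biased, but we fail to reject the null.
type_ii = 1 - binom_tail(N_FLIPS, P_ALT, REJECT_AT)

print(f"Type I error rate:  {type_i:.4f}")
print(f"Type II error rate: {type_ii:.4f}")
```

Lowering REJECT_AT makes the biased coin easier to catch (fewer Type II errors) at the cost of flagging more fair coins (more Type I errors), which is the usual trade-off between the two.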

Naked Statistics
Naked Statistics by Charles Wheelan promises a fun, non-boring introduction to statistics that doesn’t leave you drifting off into space, thinking about anything that is not statistics. From the book…

Most underrated films
Ben Moore was curious about overrated and underrated films. “Overrated” and “underrated” are slippery terms to try to quantify. An interesting way of looking at this, I thought, would be…

Hip hop vocabulary compared between artists
Matt Daniels compared rappers’ vocabularies to find out who knows the most words. Literary elites love to rep Shakespeare’s vocabulary: across his entire corpus, he uses 28,829 words, suggesting he…

Hiding a pregnancy from advertisers
You probably remember how Target used purchase histories to predict pregnancies among their customer base (although, don’t forget the false positives). Janet Vertesi, an assistant professor of sociology at Princeton…

A principal component analysis step-by-step
Sebastian Raschka offers a step-by-step tutorial for a principal component analysis in Python. The main purposes of a principal component analysis are the analysis of data to identify patterns and…

Analysis of Bob Ross paintings
As a lesson on conditional probability for himself, Walt Hickey watched 403 episodes of “The Joy of Painting” with Bob Ross, tagged them with keywords on what Ross painted, and…

Porn views for red versus blue states
Pornhub continues their analysis of porn viewing demographics in their latest comparison of pageviews per capita between red and blue states (SFW for most, I think). The main question: Who…

Using Census survey data properly
The American Community Survey, an ongoing survey that the Census administers to millions per year, provides detailed information about how Americans live now and decades ago. There are tons of…

Bracket picks of the masses versus sports pundits
Stephen Pettigrew and Reuben Fischer-Baum, for Regressing, compared 11 million brackets on ESPN.com against those of pundits. To evaluate how much better (or worse) the experts were at predicting this…

Fox News bar chart gets it wrong
Because Fox News. See also this, this, and this. [Thanks, Meron]…

Big data, same statistical challenges
Tim Harford for Financial Times on big data and how the same problems for small data still apply: The multiple-comparisons problem arises when a researcher looks at many possible patterns.…
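A quick bit of arithmetic shows why looking at many patterns is risky. The sketch below assumes the tests are independent and that every null hypothesis is true. The 0.05 threshold, the 20 tests, and the Bonferroni correction are illustrative choices, not something taken from Harford's article.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests independent
    tests, each run at significance level alpha, when every null is true."""
    return 1 - (1 - alpha) ** n_tests


ALPHA = 0.05   # per-test significance level (assumed)
N_TESTS = 20   # number of patterns examined (assumed)

# No correction: each pattern is tested at ALPHA.
uncorrected = family_wise_error_rate(ALPHA, N_TESTS)

# Bonferroni correction: test each pattern at ALPHA / N_TESTS instead.
corrected = family_wise_error_rate(ALPHA / N_TESTS, N_TESTS)

print(f"Uncorrected: {uncorrected:.3f}")
print(f"Bonferroni:  {corrected:.3f}")
```

With 20 patterns, the chance of at least one spurious "finding" is well over half without a correction, and stays under the original threshold with one.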

Bike share data in New York, animated
Citi Bike, also known as NYC Bike Share, is releasing monthly data dumps for station checkouts and check-ins, which gives you a sense of where and when people move about…

Dead links on the Million Dollar Homepage
Remember the Million Dollar Homepage from 2005? It sold ad space to anyone who was interested for one dollar per pixel, and there were one million pixels available. All spots…

Gambling data as a proxy for excitement in sports
After he noticed gambling odds fluctuate wildly at the end of a football game, Todd Schneider realized a correlation between betting odds and game excitement. The Gambletron 2000 is a…

Where time comes from
The Atlantic interviewed Dr. Demetrios Matsakis, Chief Scientist for Time Services at the US Naval Observatory, about where time comes from, the precision required and how they obtain it, and…

How people really read and share online
Tony Haile discusses how we read and share online, based on actual data. It’s not as click- and pageview-based as you might think. A widespread assumption is that the more…

The important parts of data analysis
There’s plenty of software to muck around with data, but to gain the skills to really get something out of it, that takes time and experience. Mikio Braun, a post…

Statistical concepts explained through dance
Forget bell curves, jellybeans, and coin flips to explain statistical concepts. Dancing Statistics is a video series that demonstrates variance, correlation, and sampling through choreographed movements. The dance below explains…

ProPublica opened a data store
One of the main challenges of any data project is getting the data. It seems obvious, but the effort to get the right data to answer a question seems to…

Game theory to win game shows
I like how a little bit of game theory has crept into Jeopardy! with contestant Arthur Chu. He bounces around the board in search of Daily Doubles and bets to…

A visual explanation of conditional probability
Victor Powell, who has visualized the Central Limit Theorem and Simpson’s Paradox, most recently provided a visual explainer for conditional probability. Two bars, one blue and one red, represent two…

Basketball analytics
Kirk Goldsberry talks about the rise of analytics usage in the NBA. With cameras above every court recording player movements, there’s a higher-granularity analysis that is now possible, beyond the…

Texting data to save lives
Remember that TED talk from a couple of years ago on texting patterns to a crisis hotline? The TED talker Nancy Lublin proposed the analysis of these text messages to…

How R came to be
Statistician John Chambers, the creator of S and a core member of R, talks about how R came to be in the short video below. Warning: Super nerdy waters ahead.…

Facebook debunks Princeton study
Researchers at Princeton released a study that said that Facebook was on the way out, based primarily on Google search data. Naturally, Facebook didn’t appreciate it much and followed up…