Posted to Statistics

Lessons from improperly anonymized taxi logs

Through a Freedom of Information request, Chris Whong received and eventually released NYC taxi logs starting in 2013 (about 173 million trips). Vijay Pandurangan looked at the data a little…

Data grab bag

When you deal with data, you can think like a statistician, even if you don't know the math (although it will certainly help a lot). Jonathan Stray brings up…

How to Make Government Data Sites Better

Accessing government data from the source is frustrating. If you've done it, or at least tried to, you know the pain that is oddly formatted files, search that doesn't work,…

What a few thousand fake followers gets you

There are a lot of fake, spammy accounts on Twitter that come in a variety of forms. Some tweet links to junk, some serve as retweet and faving bots, and…

GDP rises in the UK after spending on illegal activities counted

The gross domestic product for the United Kingdom rose by 5% seemingly overnight, after spending on cocaine and prostitution was (roughly) accounted for. Naturally there's been a bit of fuss…

What pregnant women want

In another take on the game of what Google suggests while searching, Seth Stephens-Davidowitz for The New York Times looked at queries related to pregnant women. Some searches were similar…

Strava Metro aims to help cities improve biking routes

Last month, Strava, which allows users to track their bike rides and runs, launched an interactive map that shows where people move worldwide. That seems to be a lead-in to…

Machine learning a cappella on overfitting

From the machine learning course on Udacity, an a cappella group sings a Thriller parody on overfitting. At first you're like, "Is this real? Am I dreaming?" Then you're like,…

A majority of your email in Gmail, even if you don’t use it

For reasons of autonomy, control, and privacy, Benjamin Mako Hill runs his own email server. After a closer look though, he realized that much of the email he sends ends…

Newborn false positives

Shutterfly sent promotional emails that congratulate new parents and encourage them to send thank-you cards. The problem: a lot of people on that list weren't new parents. Several tipsters…

Random things that correlate

This is fun. Tyler Vigen wrote a program that attempts to automatically find things that correlate. As of this writing, 4,000 correlations have been found (and actually over 100…

Type I and II errors simplified

"Type I" and "Type II" errors, names first given by Jerzy Neyman and Egon Pearson to describe rejecting a null hypothesis when it's true and accepting one when it's not,…

Naked Statistics

Naked Statistics by Charles Wheelan promises a fun, non-boring introduction to statistics that doesn't leave you drifting off into space, thinking about anything that is not statistics. From the book…

Most underrated films

Ben Moore was curious about overrated and underrated films. "Overrated" and "underrated" are slippery terms to try to quantify. An interesting way of looking at this, I thought, would be…

Hip hop vocabulary compared between artists

Matt Daniels compared rappers' vocabularies to find out who knows the most words. Literary elites love to rep Shakespeare's vocabulary: across his entire corpus, he uses 28,829 words, suggesting he…

Hiding a pregnancy from advertisers

You probably remember how Target used purchase histories to predict pregnancies among their customer base (although, don't forget the false positives). Janet Vertesi, an assistant professor of sociology at Princeton…

A principal component analysis step-by-step

Sebastian Raschka offers a step-by-step tutorial for a principal component analysis in Python. The main purposes of a principal component analysis are the analysis of data to identify patterns and…

Analysis of Bob Ross paintings

As a lesson on conditional probability for himself, Walt Hickey watched 403 episodes of "The Joy of Painting" with Bob Ross, tagged them with keywords on what Ross painted, and…

Porn views for red versus blue states

Pornhub continues their analysis of porn viewing demographics in their latest comparison of pageviews per capita between red and blue states (SFW for most, I think). The main question: Who…

Using Census survey data properly

The American Community Survey, an ongoing survey that the Census administers to millions per year, provides detailed information about how Americans live now and how they lived decades ago. There are tons of…

Bracket picks of the masses versus sports pundits

Stephen Pettigrew and Reuben Fischer-Baum, for Regressing, compared 11 million brackets on ESPN.com against those of pundits. To evaluate how much better (or worse) the experts were at predicting this…

Fox News bar chart gets it wrong

Because Fox News. See also this, this, and this. [Thanks, Meron]…