Lessons from improperly anonymized taxi logs

Posted to Statistics  |  Nathan Yau

Through a Freedom of Information request, Chris Whong received and eventually released NYC taxi logs starting in 2013 (about 173 million trips). Vijay Pandurangan took a closer look at the data and deanonymized the logs to link hashed license numbers to driver names. It didn’t take much to do it. Pandurangan described the process and the lessons organizations can learn when they release data.

Someone on Reddit pointed out that one specific driver seemed to be doing an incredible amount of business. When faced with anomalous data like that, it’s good practice to weed out data error before jumping to conclusions about cheating taxi drivers. Also, I couldn’t shake the feeling that there was something about that encoded id number: “CFCD208495D565EF66E7DFF9F98764DA.” After a little bit of poking around, I realised that that code is actually the MD5 hash of the character ‘0’. This proved my suspicion that this was actually a data collection error, but also made me immediately realise that the entire anonymization process was flawed and could easily be reversed.

He also provided the code snippet he used to do it.
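
To see why a plain MD5 hash offers so little protection here, consider a rough sketch. This is not Pandurangan's code. It assumes, purely for illustration, that the identifiers are short numeric strings, and the helper names `md5_hex` and `build_lookup` are made up for this example. When the set of possible inputs is small, you can hash every candidate once and read the original value straight out of a lookup table.

```python
import hashlib


def md5_hex(text):
    """Return the uppercase hex MD5 digest of an ASCII string."""
    return hashlib.md5(text.encode("ascii")).hexdigest().upper()


def build_lookup(max_digits):
    """Map hash -> original for every digit string up to max_digits long.

    Assumption for illustration only: identifiers are numeric strings,
    possibly zero-padded. The real identifier formats may differ.
    """
    table = {}
    for n_digits in range(1, max_digits + 1):
        for value in range(10 ** n_digits):
            candidate = str(value).zfill(n_digits)
            table[md5_hex(candidate)] = candidate
    return table


# Hash every candidate once, then "reverse" any hash with a dict lookup.
lookup = build_lookup(max_digits=4)

# The suspicious id quoted above is just the hash of the character '0'.
print(lookup["CFCD208495D565EF66E7DFF9F98764DA"])
```

The table for four digits holds only 11,110 entries and builds almost instantly. A larger but still enumerable key space only takes more time, which is why hashing alone is not anonymization when the inputs follow a known, limited format.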
