Consequences of big data exclusions

Posted to Statistics  |  Tags: ,  |  Nathan Yau

Big data, in all its glory, promises insights into the soul of humankind. There’s a hefty restriction though. Data only tells you about the population and actions of individuals it represents, which inevitably excludes part of the population. Jonas Lerman considers two hypothetical people. The first one:

The first is a thirty-year-old white-collar resident of Manhattan. She participates in modern life in all the ways typical of her demographic: smartphone, Google, Gmail, Netflix, Spotify, Amazon. She uses Facebook, with its default privacy settings, to keep in touch with friends. She dates through the website OkCupid. She travels frequently, tweeting and posting geotagged photos to Flickr and Instagram. Her wallet holds a debit card, credit cards, and a MetroCard for the subway and bus system. On her keychain are plastic barcoded cards for the “customer rewards” programs of her grocery and drugstore. In her car, a GPS sits on the dash, and an E‑ZPass transponder (for bridge, tunnel, and highway tolls) hangs from the windshield.

That’s a lot of data. The second person:

He lives two hours southwest of Manhattan, in Camden, New Jersey, America’s poorest city. He is underemployed, working part-time at a restaurant, paid under the table in cash. He has no cell phone, no computer, no cable. He rarely travels and has no passport, car, or GPS. He uses the Internet, but only at the local library on public terminals. When he rides the bus, he pays the fare in cash.

The second person has fewer data flows.

These days, big data exclusion almost sounds like a good thing — if you’re intent on avoiding all marketing-related data collection — but when policy-making, fund allocation, etc. come into play, it’s possible the excluded aren’t counted. That’s not to say people should hurriedly sign up for Facebook and opt-in to every tracking study. It’s the opposite. Those in charged of the data and those who decide based on what they see in the data are responsible for knowing the background of their source.

Favorites

Real Chart Rules to Follow

There are rules—usually for specific chart types meant to be read in a specific way—that you shouldn’t break. When they are, everyone loses. This is that small handful.

A Day in the Life of Americans

I wanted to see how daily patterns emerge at the individual level and how a person’s entire day plays out. So I simulated 1,000 of them.

Jobs Charted by State and Salary

Jobs and pay can vary a lot depending on where you live, based on 2013 data from the Bureau of Labor Statistics. Here’s an interactive to look.

Think Like a Statistician – Without the Math

I call myself a statistician, because, well, I’m a statistics graduate student. However, the most important things I’ve learned are less formal, but have proven extremely useful when working/playing with data.