Tim Harford for Financial Times on big data and how the classic problems of small data still apply:
The multiple-comparisons problem arises when a researcher looks at many possible patterns. Consider a randomised trial in which vitamins are given to some primary schoolchildren and placebos are given to others. Do the vitamins work? That all depends on what we mean by “work”. The researchers could look at the children’s height, weight, prevalence of tooth decay, classroom behaviour, test scores, even (after waiting) prison record or earnings at the age of 25. Then there are combinations to check: do the vitamins have an effect on the poorer kids, the richer kids, the boys, the girls? Test enough different correlations and fluke results will drown out the real discoveries.
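The trap Harford describes is easy to simulate: test enough truly null effects and some will come up "significant" by chance. A minimal sketch, using a made-up vitamin trial where the vitamins do nothing at all:

```python
import random

random.seed(1)

def fake_trial(n=200):
    """One outcome measure where vitamins truly do nothing:
    both groups are draws from the same distribution."""
    treated = [random.gauss(0, 1) for _ in range(n)]
    control = [random.gauss(0, 1) for _ in range(n)]
    # crude z-test on the difference in means
    diff = sum(treated) / n - sum(control) / n
    se = (2 / n) ** 0.5  # both groups have variance 1
    return abs(diff / se) > 1.96  # "significant" at the 5% level

# Harford's combinations: outcomes crossed with subgroups
outcomes = ["height", "weight", "tooth decay", "behaviour", "test scores"]
subgroups = ["all", "boys", "girls", "richer", "poorer"]
comparisons = [(o, s) for o in outcomes for s in subgroups]  # 25 tests

# Each comparison gets its own noise draw; flukes accumulate
flukes = sum(fake_trial() for _ in comparisons)
print(f"{flukes} 'significant' findings out of {len(comparisons)} tests of a useless vitamin")
```

At a 5 percent threshold, you should expect roughly one fluke per 20 comparisons even though nothing is real.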
You're usually in for a fluffy article about social media when 'big data' is in the title. This one is worth the full read.
Citi Bike, also known as NYC Bike Share, is releasing monthly data dumps for station check-outs and check-ins, which gives you a sense of where and when people move about the city. Jeff Ferzoco, Sarah Kaufman, and Juan Francisco Saldarriaga mapped 24 hours of activity in the video below.
Remember the Million Dollar Homepage from 2005? It sold ad space to anyone who was interested for one dollar per pixel, and there were one million pixels available. All spots were filled, and it spawned a bunch of other million dollar homepages that turned out to be zero dollar homepages.
David Yanofsky for Quartz returned to the homepage to look at link rot. 22 percent of links on the homepage are dead.
After noticing gambling odds fluctuate wildly at the end of a football game, Todd Schneider saw a connection between betting odds and game excitement. The Gambletron 2000 is a fun look at the proxy.
It occurred to me then that variance in gambling market odds is a good way to quantify how exciting a game is. Modern betting exchanges allow gamblers to bet throughout the course of a game. The odds, which can also be expressed as win probabilities, continually readjust as the game progresses. My claim is that the more the odds fluctuate during a game, the more exciting that game is.
Games and odds update automatically up to the minute, with a highlight on the "hotness" of games, or the amount of variation over time. A blowout shows a line that heads steadily towards a 100 percent win probability for one team, whereas a comeback shows a line that nears 100 percent for one team and then swings back towards 100 percent for the opposition.
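Schneider's proxy boils down to a dispersion measure on the win-probability time series. A rough sketch of the idea (my own simplification using total movement; the site's actual "hotness" metric may differ):

```python
def hotness(win_probs):
    """Quantify excitement as the total movement in one team's
    win probability (0-1 scale) over the course of a game."""
    return sum(abs(b - a) for a, b in zip(win_probs, win_probs[1:]))

# Invented probability paths for illustration
blowout  = [0.50, 0.65, 0.78, 0.88, 0.95, 0.99]  # steady march to a win
comeback = [0.50, 0.75, 0.90, 0.70, 0.35, 0.10]  # near-certain win, then collapse

print(hotness(blowout), hotness(comeback))  # the comeback moves far more
```

The blowout's line barely travels; the comeback racks up more than twice the movement, matching the intuition that swingy games are the exciting ones.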
I had the odds for the Golden State-Portland game open for part of the time tonight, and it was kind of a fun accompaniment.
Mobile alert app for sports, anyone? Current offerings are abysmal.
The Atlantic interviewed Dr. Demetrios Matsakis, Chief Scientist for Time Services at the US Naval Observatory, about where time comes from, the precision required and how they obtain it, and why we need such precision. Five seconds into it, my wife commented, "That sounds nerdy." That's how you know it's gonna be good.
Tony Haile discusses how we read and share online, based on actual data. It's not as click- and pageview-based as you might think.
A widespread assumption is that the more content is liked or shared, the more engaging it must be, the more willing people are to devote their attention to it. However, the data doesn’t back that up. We looked at 10,000 socially-shared articles and found that there is no relationship whatsoever between the amount a piece of content is shared and the amount of attention an average reader will give that content.
When we combined attention and traffic to find the story that had the largest volume of total engaged time, we found that it had fewer than 100 likes and fewer than 50 tweets. Conversely, the story with the largest number of tweets got about 20% of the total engaged time that the most engaging story received.
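A claim like "no relationship whatsoever" is the kind of thing you would check with a simple correlation between shares and engaged time per article. A sketch of that check, with invented numbers:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient, from the textbook formula."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented per-article numbers for illustration, not Chartbeat's data
shares       = [12, 400, 35, 2200, 90, 150]
engaged_secs = [310, 45, 280, 60, 500, 120]

print(round(pearson_r(shares, engaged_secs), 2))
```

A coefficient near zero across the 10,000 articles would be the statistical version of the finding in the excerpt.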
There's plenty of software to muck around with data, but to gain the skills to really get something out of it, that takes time and experience. Mikio Braun, a post doc in machine learning, explains.
For a number of reasons, I don't think you can "toolify" data analysis that easily. I wish you could, but from my hard-won experience with my own work and with teaching people this stuff, I'd say it takes a lot of experience to do properly, and you need to know what you're doing. Otherwise you will do stuff that breaks horribly once put into action on real data.
And I don't write this because I don't like the projects that exist, but because I think it is important to understand that you can't just give a few coders new tools and expect them to produce something that works. And depending on how you want to use data analysis in your company, this might make or break it.
Braun breaks it down into four bullet points worth a read, but the tl;dr version is that analysis isn't simple, and no tool is going to do everything for you. It's simple with simple data, but you can almost always go deeper with more data, and it takes experience to ask the right questions. So try not to be too content with that software output.
One of the main challenges of any data project is getting the data. It seems obvious, but the effort to get the right data to answer a question seems to catch people off guard. Even data that's "free" to download can be a huge pain that ends up completely useless. ProPublica, the non-profit newsroom, deals with this stuff on a regular basis and hopes that some of their efforts can turn into a source of funding through the Data Store.
Like most newsrooms, we make extensive use of government data — some downloaded from "open data" sites and some obtained through Freedom of Information Act requests. But much of our data comes from our developers spending months scraping and assembling material from web sites and out of Acrobat documents. Some data requires months of labor to clean or requires combining datasets from different sources in a way that's never been done before.
For datasets that are the result of significant expenditures of our time and effort, we're charging a reasonable one-time fee: In most cases, it's $200 for journalists and $2,000 for academic researchers.
I hope it works.
I like how a little bit of game theory has crept into Jeopardy! with contestant Arthur Chu. He bounces around the board in search of Daily Doubles and bets to tie in Final Jeopardy. Chu doesn't know much about game theory himself but applies rules promoted by a past contestant.
The ultimate champion, Ken Jennings, praises Chu on Slate.
But in fact, plenty of nice white boys on Jeopardy! have been pilloried by viewers for using Arthur Chu's signature technique: bopping around the game board seemingly at whim, rather than choosing the clues from top to bottom, as most contestants do. This is Chu's great crime, the kind of anarchy that hard-core Jeopardy! fans will not countenance. The technique was pioneered in 1985 by a five-time champ named Chuck Forrest, whose law school roommate suggested it. The "Forrest bounce," as fans still call it, kept opponents off balance. He would know ahead of time where the next clue would pop up; they’d be a second slow.
I don't watch Jeopardy! much, but it's pretty fun to watch Chu dominate.
Then there's the most recent RadioLab. The first part talks about a game show called Golden Balls and the prisoner's dilemma, and how a guy — who plays and wins game shows for a living — won this one. The whole show is entertaining as usual, but this first part is of particular interest. After listening to that, watch the Golden Balls clip to see how it played out.
Two bars, one blue and one red, represent two events that can happen together or independently of each other. When a ball hits a bar, the corresponding event occurs. What is the probability that one event occurs given that the other does, and vice versa? If the probability that both events occur increases or decreases, how do the separate probabilities change? Sliders and options let you experiment, and the visuals and counters update to help you learn.
A fun one to tinker with.
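The arithmetic behind the visualization is plain conditional probability. A small sketch with made-up numbers, treating the blue and red bars as events A and B:

```python
# Made-up event probabilities for illustration
p_a = 0.4      # P(A): ball lands on the blue bar
p_b = 0.5      # P(B): ball lands on the red bar
p_both = 0.3   # P(A and B): the overlap, the thing the slider adjusts

# Conditioning restricts attention to the runs where the other event happened
p_a_given_b = p_both / p_b   # P(A|B)
p_b_given_a = p_both / p_a   # P(B|A)

# Independence check: if A and B were independent, P(A and B) = P(A)P(B)
print(p_a_given_b, p_b_given_a, p_a * p_b)
```

Moving the overlap slider changes `p_both`, and both conditional probabilities shift with it even though `p_a` and `p_b` stay put, which is exactly the behavior the balls-and-bars demo makes visible.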
Kirk Goldsberry talks about the rise of analytics in the NBA. With cameras above every court recording player movements, a much more granular analysis is now possible, beyond the box score. One of the key metrics is expected possession value, or EPV, which estimates the number of points a possession is worth, given where everyone is on the court and where the ball is.
But the clearest application of EPV is quantifying a player's overall offensive value, taking into account every single action he has performed with the ball over the course of a game, a road trip, or even a season. We can use EPV to collapse thousands of actions into a single value and estimate a player's true value by asking how many points he adds compared with a hypothetical replacement player, artificially inserted into the exact same basketball situations. This value might be called "EPV-added" or "points added."
As a basketball fan, I hope this makes the game more fun and interesting to watch, and as a statistician, I hope this work can be applied to other facets of life like traffic or local movements. If just the latter, that'd be fine too.
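The "points added" idea in the excerpt can be sketched as a running sum of how each decision changes the possession's expected value, relative to a replacement-level baseline. This is a toy version with invented numbers, not Goldsberry's actual model:

```python
# Each tuple: (EPV before the action, EPV after), where EPV is the
# expected points of the possession at that moment. Numbers invented.
player_actions = [
    (0.95, 1.10),  # drive that bends the defense
    (1.10, 1.35),  # kick-out pass to an open shooter
    (0.90, 0.80),  # contested pull-up early in the clock
]

def epv_added(actions, replacement_delta=0.0):
    """Sum of EPV changes attributable to the player's decisions,
    relative to what a replacement player would add (assumed 0 here)."""
    return sum(after - before - replacement_delta for before, after in actions)

print(round(epv_added(player_actions), 2))  # 0.3 expected points added
```

Summed over a season's worth of touches, that single number is the "EPV-added" comparison to a hypothetical replacement player.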
Remember that TED talk from a couple of years ago on texting patterns to a crisis hotline? In it, Nancy Lublin proposed analyzing these text messages to potentially help the individuals texting. Her group, the Crisis Text Line, plans to release anonymized aggregates in the coming months.
Ms. Lublin said texts also provided real-time information that showed patterns for people in crisis.
Crisis Text Line's data, she said, suggests that children with eating disorders seek help more often Sunday through Tuesday, that self-cutters do not wait until after school to hurt themselves, and that depression is reported three times as much in El Paso as in Chicago.
This spring, Crisis Text Line intends to make the aggregate data available to the public. "My dream," Ms. Lublin said, "is that public health officials will use this data and tailor public policy solutions around it."
Keeping an eye on this.
Statistician John Chambers, creator of S and a member of the R Core Team, talks about how R came to be in the short video below. Warning: Super nerdy waters ahead.
I've heard this story before, but it was nice to hear it again, since it is about something I use almost every day. I would also like to hear about the invention of the toilet. [via Revolutions]
Researchers at Princeton released a study that said that Facebook was on the way out, based primarily on Google search data. Naturally, Facebook didn't appreciate it much and followed up with their own "study" that debunks the Princeton analysis, blasted with a healthy dose of sarcasm. They also showed that Princeton is on its way to zero enrollment.
This trend suggests that Princeton will have only half its current enrollment by 2018, and by 2021 it will have no students at all, agreeing with the previous graph of scholarly scholarliness. Based on our robust scientific analysis, future generations will only be able to imagine this now-rubble institution that once walked this earth.
While we are concerned for Princeton University, we are even more concerned about the fate of the planet — Google Trends for "air" have also been declining steadily, and our projections show that by the year 2060 there will be no air left.
Crud. Dibs on the oxygen tanks.
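Facebook's joke rests on a real (mis)technique: fit a straight line to a declining series and solve for the zero crossing. A minimal sketch with invented "search interest" numbers:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b  # intercept, slope

years = [2010, 2011, 2012, 2013]
interest = [100, 85, 70, 55]       # invented, steadily declining search index

a, b = fit_line(years, interest)
doom_year = -a / b                 # where the fitted line crosses zero
print(round(doom_year, 1))
```

Extrapolating a fitted line far outside the observed range is exactly the abuse both "studies" commit, which is the point of the satire.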
Remember when Amy Webb created a bunch of fake male profiles to scrape data from two dating sites and analyze it to find a husband? Mathematician Chris McKinlay took a similar route to find a girlfriend (and now fiancée). However, unlike Webb, who used a relatively small sample, McKinlay scraped data for thousands of profiles in his area and analyzed the data more thoroughly, in search of the perfect mate.
For McKinlay's plan to work, he’d have to find a pattern in the survey data—a way to roughly group the women according to their similarities. The breakthrough came when he coded up a modified Bell Labs algorithm called K-Modes. First used in 1998 to analyze diseased soybean crops, it takes categorical data and clumps it like the colored wax swimming in a Lava Lamp. With some fine-tuning he could adjust the viscosity of the results, thinning it into a slick or coagulating it into a single, solid glob.
He played with the dial and found a natural resting point where the 20,000 women clumped into seven statistically distinct clusters based on their questions and answers. "I was ecstatic," he says. "That was the high point of June."
He selected the two clusters most to his liking, looked at what interested the women, and then adjusted his profile accordingly. He didn't lie. He just emphasized the traits that he possessed and that women tended to like. Then he waited for women to notice him.
It's kind of like he built a targeted advertising system for himself and then cast a really wide net. Even though McKinlay is engaged now, I still wonder if it actually worked or if something similar might have happened had he left it to chance. I like to believe the latter. He did, after all, go on dates with 87 other people before finding a match.
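K-modes is essentially k-means for categorical data: distance is a count of mismatched answers, and each cluster's center is the per-question mode. A bare-bones sketch of the algorithm on a toy survey (nothing like McKinlay's actual data or code):

```python
from collections import Counter
import random

random.seed(0)

def mismatches(a, b):
    """Hamming-style distance: number of answers that differ."""
    return sum(x != y for x, y in zip(a, b))

def mode_center(profiles):
    """New cluster center: the most common answer to each question."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*profiles))

def k_modes(profiles, k, iters=10):
    centers = random.sample(profiles, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in profiles:
            best = min(range(k), key=lambda i: mismatches(p, centers[i]))
            clusters[best].append(p)
        centers = [mode_center(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Toy survey answers: (smokes?, wants kids?, religious?, likes dogs?)
profiles = [
    ("no", "yes", "no", "yes"), ("no", "yes", "no", "yes"),
    ("no", "yes", "yes", "yes"), ("yes", "no", "no", "no"),
    ("yes", "no", "yes", "no"), ("yes", "no", "no", "no"),
]
centers, clusters = k_modes(profiles, k=2)
print(centers)
```

The "viscosity" dial McKinlay describes corresponds roughly to how many clusters you ask for and how the algorithm is tuned; here the similar survey-takers simply clump together.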
You can now wear a MagicBand when you enter Walt Disney World to get a more personalized experience, and in return, the park gets to know what its customers are up to. John Foreman, the chief data scientist at MailChimp, describes the new data toy after a trip to the happiest place on Earth.
What does Disney get out of the deal? In short, it tracks everything you do, everything you buy, everything you eat, everything you ride, everywhere you go in the park. If the goal is to keep you in the park longer so you’ll spend more money, it can build AI models on itineraries, show schedules, line length, weather, etc., to figure out what influences stay length and cash expenditure. Perhaps there are a few levers they can pull to get money out of you.
I knew Disney imagineers kept track of park activity, such as line length and congestion areas, but this takes it to the next level. Is it weird that I'm curious how this would work at home?
Alexis Madrigal and Ian Bogost for The Atlantic reverse engineered the Netflix genre generator, analyzed the data, and then made their own. Then they talked to Todd Yellin, the guy at Netflix who created the micro-genre system. It's no accident when you see altgenres like "Visually-striking Goofy Action & Adventure" and "Sentimental set in Europe Dramas from the 1970s" in your browser.
The Netflix Quantum Theory doc spelled out ways of tagging movie endings, the "social acceptability" of lead characters, and dozens of other facets of a movie. Many values are "scalar," that is to say, they go from 1 to 5. So, every movie gets a romance rating, not just the ones labeled "romantic" in the personalized genres. Every movie's ending is rated from happy to sad, passing through ambiguous. Every plot is tagged. Lead characters' jobs are tagged. Movie locations are tagged. Everything. Everyone.
That's the data at the base of the pyramid. It is the basis for creating all the altgenres that I scraped. Netflix's engineers took the microtags and created a syntax for the genres, much of which we were able to reproduce in our generator.
Be sure to play around with Bogost's generator at the top. It will amuse.
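The syntax Madrigal describes is essentially a fill-in-the-blanks grammar over the microtags. A toy version with an invented tag vocabulary (nothing like Netflix's real lists or template):

```python
import random

random.seed(7)

# Invented microtag vocabularies standing in for Netflix's real ones;
# empty strings let a slot drop out of the final genre name
adjectives = ["Visually-striking", "Sentimental", "Gritty", "Goofy"]
regions    = ["set in Europe", "set in Asia", ""]
genres     = ["Action & Adventure", "Dramas", "Romantic Movies"]
periods    = ["from the 1970s", "from the 1980s", ""]

def altgenre():
    """Assemble one micro-genre by slotting tags into a fixed template."""
    parts = [random.choice(adjectives), random.choice(regions),
             random.choice(genres), random.choice(periods)]
    return " ".join(p for p in parts if p)

for _ in range(3):
    print(altgenre())
```

Even this four-slot toy produces over a hundred combinations, which hints at how a few dozen scalar tags explode into tens of thousands of altgenres.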
Luba Gloukhov of Revolution Analytics used k-means clustering to find groups of single malt Scotch whiskies. Because you know, New Year's morning is when whisky is on everyone's mind.
The first time I had an Islay single malt, my mind was blown. In my first foray into the world of whiskies, I took the plunge into the smokiest, peatiest beast of them all — Laphroaig. That same night, dreams of owning a smoker were replaced by the desire to roam the landscape of smoky single malts.
As an Islay fan, I wanted to investigate whether distilleries within a given region do in fact share taste characteristics. For this, I used a dataset profiling 86 distilleries based on 12 flavor categories.
The result is essentially a mini recommendation system for the fine liquor, and the code is there, so you can see how it works.
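Gloukhov's analysis is standard k-means on the 86-distillery, 12-flavor matrix. Her post is in R; here is a minimal Python sketch of the same idea on a few invented flavor profiles:

```python
import random

random.seed(2)

def k_means(points, k, iters=20):
    """Plain k-means: assign each point to the nearest center,
    recompute centers as cluster means, repeat."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            dist = lambda c: sum((a - b) ** 2 for a, b in zip(p, c))
            clusters[min(range(k), key=lambda i: dist(centers[i]))].append(p)
        centers = [tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Invented (smoky, medicinal, sweetness, fruity) scores on a 0-4 scale
whiskies = {
    "Islay-ish A":    (4, 4, 1, 1),
    "Islay-ish B":    (4, 3, 1, 2),
    "Speyside-ish A": (1, 0, 3, 4),
    "Speyside-ish B": (0, 1, 4, 3),
}
centers, clusters = k_means(list(whiskies.values()), k=2)
print(centers)
```

With real flavor data, the clusters become the recommendation: pick any bottle from the cluster that contains the one you already like.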