• Facebook debunks Princeton study

    January 24, 2014  |  Statistics

    Princeton declining interest

    Researchers at Princeton released a study that said that Facebook was on the way out, based primarily on Google search data. Naturally, Facebook didn't appreciate it much and followed up with their own "study" that debunks the Princeton analysis, laced with a healthy dose of sarcasm. They also showed that Princeton is on its way to zero enrollment.

    This trend suggests that Princeton will have only half its current enrollment by 2018, and by 2021 it will have no students at all, agreeing with the previous graph of scholarly scholarliness. Based on our robust scientific analysis, future generations will only be able to imagine this now-rubble institution that once walked this earth.

    While we are concerned for Princeton University, we are even more concerned about the fate of the planet — Google Trends for "air" have also been declining steadily, and our projections show that by the year 2060 there will be no air left

    Crud. Dibs on the oxygen tanks.

  • Using data to find a girlfriend

    January 23, 2014  |  Statistics

    Remember when Amy Webb created a bunch of fake male profiles to scrape data from two dating sites and analyze it to find a husband? Mathematician Chris McKinlay took a similar route to find a girlfriend (and now fiancée). However, unlike Webb, who used a relatively small sample, McKinlay scraped data for thousands of profiles in his area and analyzed the data more thoroughly, in search of the perfect mate.

    For McKinlay's plan to work, he’d have to find a pattern in the survey data—a way to roughly group the women according to their similarities. The breakthrough came when he coded up a modified Bell Labs algorithm called K-Modes. First used in 1998 to analyze diseased soybean crops, it takes categorical data and clumps it like the colored wax swimming in a Lava Lamp. With some fine-tuning he could adjust the viscosity of the results, thinning it into a slick or coagulating it into a single, solid glob.

    He played with the dial and found a natural resting point where the 20,000 women clumped into seven statistically distinct clusters based on their questions and answers. "I was ecstatic," he says. "That was the high point of June."

    He selected the two clusters most to his liking, looked at what interested the women, and then adjusted his profile accordingly. He didn't lie. He just emphasized the traits that he possessed and that women tended to like. Then he waited for women to notice him.

    It's kind of like he built a targeted advertising system for himself and then cast a really wide net. Even though McKinlay is engaged now, I still wonder if it actually worked or if something similar might have happened if he had left it to chance. I like to believe the latter. He did, after all, go on dates with 87 other people before finding a match.
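
    For the curious, here is a minimal sketch of the K-Modes idea from the excerpt above: categorical records are grouped by counting mismatched answers, and each cluster is summarized by its most common answer per question. This is not McKinlay's modified version. The survey answers are made up, and treating the "dial" as the number of clusters k is my assumption.

```python
import random
from collections import Counter


def mismatches(a, b):
    """Simple matching distance: how many answers differ between two records."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Most common value in each column of a non-empty list of records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(records, k, n_iter=20, seed=0):
    """Cluster categorical records. Returns (modes, labels)."""
    rng = random.Random(seed)
    # Start from k distinct records picked at random.
    modes = rng.sample(sorted(set(records)), k)
    labels = None
    for _ in range(n_iter):
        # Assign every record to the mode it disagrees with least.
        new_labels = [
            min(range(k), key=lambda j: mismatches(r, modes[j])) for r in records
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each mode from its current members.
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:
                modes[j] = column_modes(members)
    return modes, labels


# Hypothetical survey answers (made up for illustration).
answers = [
    ("yes", "often", "cats"),
    ("yes", "often", "cats"),
    ("yes", "sometimes", "cats"),
    ("no", "never", "dogs"),
    ("no", "never", "dogs"),
    ("no", "rarely", "dogs"),
]
modes, labels = k_modes(answers, k=2)
```

    Turning k up or down is one way to get the thinning and coagulating effect described in the excerpt, although the actual tuning knobs he used aren't spelled out here.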

  • Disney MagicBands track your theme park activities

    January 21, 2014  |  Statistics

    You can now wear a MagicBand when you enter Disneyland to get a more personalized experience, and in return, the park gets to know what their customers are up to. John Foreman, the chief data scientist at MailChimp, describes the new data toy after a trip to the happiest place on Earth.

    What does Disney get out of the deal? In short, it tracks everything you do, everything you buy, everything you eat, everything you ride, everywhere you go in the park. If the goal is to keep you in the park longer so you’ll spend more money, it can build AI models on itineraries, show schedules, line length, weather, etc., to figure out what influences stay length and cash expenditure. Perhaps there are a few levers they can pull to get money out of you.

    I knew Disney imagineers kept track of park activity, such as line length and congestion areas, but this takes it to the next level. Is it weird that I'm curious how this would work at home?

  • How Netflix creates movie micro-genres

    January 3, 2014  |  Statistics

    Alexis Madrigal and Ian Bogost for The Atlantic reverse engineered the Netflix genre generator, analyzed the data, and then made their own. Then they talked to Todd Yellin, the guy at Netflix who created the micro-genre system. It's no accident when you see altgenres like "Visually-striking Goofy Action & Adventure" and "Sentimental set in Europe Dramas from the 1970s" in your browser.

    The Netflix Quantum Theory doc spelled out ways of tagging movie endings, the "social acceptability" of lead characters, and dozens of other facets of a movie. Many values are "scalar," that is to say, they go from 1 to 5. So, every movie gets a romance rating, not just the ones labeled "romantic" in the personalized genres. Every movie's ending is rated from happy to sad, passing through ambiguous. Every plot is tagged. Lead characters' jobs are tagged. Movie locations are tagged. Everything. Everyone.

    That's the data at the base of the pyramid. It is the basis for creating all the altgenres that I scraped. Netflix's engineers took the microtags and created a syntax for the genres, much of which we were able to reproduce in our generator.

    Be sure to play around with Bogost's generator at the top. It will amuse.

  • Clusters of single malt Scotch whiskies

    January 1, 2014  |  Statistics

    Luba Gloukhov of Revolution Analytics used k-means clustering to find groups of single malt Scotch whiskies. Because you know, New Year's morning is when whisky is on everyone's mind.

    The first time I had an Islay single malt, my mind was blown. In my first foray into the world of whiskies, I took the plunge into the smokiest, peatiest beast of them all: Laphroaig. That same night, dreams of owning a smoker were replaced by the desire to roam the landscape of smoky single malts.

    As an Islay fan, I wanted to investigate whether distilleries within a given region do in fact share taste characteristics. For this, I used a dataset profiling 86 distilleries based on 12 flavor categories.

    The result is essentially a mini recommendation system for the fine liquor, and the code is there, so you can see how it works.
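
    If you want a feel for the mechanics without opening the original code, here is a rough k-means sketch in plain Python. It is not Gloukhov's code. The distillery names, the three flavor categories, and the scores are all made up for illustration, whereas the real dataset has 86 distilleries and 12 categories.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two flavor profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of profiles."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def k_means(points, k, n_iter=50, seed=0):
    """Plain k-means. Returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k of the observed profiles
    labels = None
    for _ in range(n_iter):
        # Assign each profile to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = centroid(members)
    return centers, labels


# Hypothetical profiles: (smoky, sweet, floral), made up for illustration.
profiles = {
    "Distillery A": (4, 1, 0),
    "Distillery B": (4, 2, 0),
    "Distillery C": (3, 1, 1),
    "Distillery D": (0, 3, 3),
    "Distillery E": (1, 4, 2),
    "Distillery F": (0, 3, 4),
}
names = list(profiles)
centers, labels = k_means([profiles[n] for n in names], k=2)
clusters = dict(zip(names, labels))
```

    To use it as a recommender, look up the cluster of a whisky you like and try the others that share its label.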

  • Statistics Done Wrong, a guide to common analysis mistakes

    December 31, 2013  |  Statistics

    Alex Reinhart, a PhD statistics student at Carnegie Mellon University, covers some of the common analysis mistakes in Statistics Done Wrong.

    Statistics Done Wrong is a guide to the most popular statistical errors and slip-ups committed by scientists every day, in the lab and in peer-reviewed journals. Many of the errors are prevalent in vast swathes of the published literature, casting doubt on the findings of thousands of papers. Statistics Done Wrong assumes no prior knowledge of statistics, so you can read it before your first statistics course or after thirty years of scientific practice.

    The text is available for free online, and there's a physical book version on the way.

  • Iron Maiden uses piracy data for tour locations

    December 24, 2013  |  Statistics

    When you hear "piracy data" and "music" in the same sentence, it usually ends with exorbitant fines. Iron Maiden took a different route.

    In the case of Iron Maiden, still a top-drawing band in the U.S. and Europe after thirty years, it noted a surge in traffic in South America. Also, it saw that Brazil, Venezuela, Mexico, Colombia, and Chile were among the top 10 countries with the most Iron Maiden Twitter followers. There was also a huge amount of BitTorrent traffic in South America, particularly in Brazil.

    Rather than send in the lawyers, Maiden sent itself in. The band has focused extensively on South American tours in recent years, one of which was filmed for the documentary "Flight 666." After all, fans can't download a concert or t-shirts. The result was massive sellouts. The São Paulo show alone grossed £1.58 million (US$2.58 million).

  • Data scientist surpasses statistician on Google Trends

    December 18, 2013  |  Statistics

    Statistician vs Data Scientist on Google Trends

    The relative interest in data scientist surpassed statistician this month. It was also higher in April and September of this year, so it's not new, but it does seem like it's ready to be a consistent thing, at least for a little while. That said, it doesn't seem like statistician is losing interest to data scientist, as the former has been fairly consistent for the past few years, so take that how you want.

  • Easy text classification

    December 10, 2013  |  Statistics

    etcML

    Text can be a great source of data, but it can be a challenge to glean information from an analysis standpoint. etcML can help you with that. Browse Twitter trends, classify your own text with existing machine learning classifiers, or upload your own training data.

    But most importantly, you can use etcML to learn interesting new things about whatever text data you're already working with in your job or research. Say you're a social scientist with written and multiple-choice survey responses — you can quickly see how well participants' written text allows you to guess their multiple-choice response. Or say you're a literary scholar who wants to know what distinguishes an author's early and late periods — you can train a classifier and visualize the most predictive words for each category.

    Saved for later.
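
    As a rough idea of what "most predictive words for each category" can mean, here is a small sketch that ranks words by smoothed log-odds between two groups of text. This is my own stand-in, not how etcML works under the hood, and the early and late passages are made up.

```python
import math
from collections import Counter


def word_counts(texts):
    """Count lowercase whitespace-separated words across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def predictive_words(texts_a, texts_b, top=3):
    """Return (words most typical of A, words most typical of B)."""
    a, b = word_counts(texts_a), word_counts(texts_b)
    vocab = set(a) | set(b)
    # Add-one smoothing so unseen words don't produce log(0).
    total_a = sum(a.values()) + len(vocab)
    total_b = sum(b.values()) + len(vocab)
    log_odds = {
        w: math.log((a[w] + 1) / total_a) - math.log((b[w] + 1) / total_b)
        for w in vocab
    }
    # Sort ascending by log-odds, breaking ties alphabetically.
    ranked = sorted(vocab, key=lambda w: (log_odds[w], w))
    return ranked[-top:][::-1], ranked[:top]


# Hypothetical "early" and "late" passages (made up for illustration).
early = ["the sea the storm the ship", "storm and sea and sail"]
late = ["the city the train the crowd", "city lights and crowd noise"]
early_words, late_words = predictive_words(early, late)
```

    Words that show up about equally in both groups, like "the" here, land near zero and drop out of both lists.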

  • Prediction of sexual orientation through Facebook friends

    November 13, 2013  |  Statistics

    Carter Jernigan and Behram F.T. Mistree found that sexual orientation of an individual is strongly correlated to the sexual orientation of the individual's friends on Facebook.

    After analyzing 4,080 Facebook profiles from the MIT network, we determined that the percentage of a given user's friends who self–identify as gay male is strongly correlated with the sexual orientation of that user, and we developed a logistic regression classifier with strong predictive power. Although we studied Facebook friendship ties, network data is pervasive in the broader context of computer–mediated communication, raising significant privacy issues for communication technologies to which there are no neat solutions.

    Gay men had around seven times more gay friends than straight men on average. I imagine similar results for other demographics such as age and race. Their predictions weren't foolproof, though: 17 percent of gay men in their test data were misclassified as straight and 22 percent of straight men were misclassified as gay, which carries plenty of real-world problems.

    In any case, the results aren't surprising. Just filing this under the things that others can know about you without actually knowing you.

  • Up all night to get data, a music video parody

    November 12, 2013  |  Statistics

    Neuroscience students at the University of California, San Diego made a music video parody of Daft Punk's "Get Lucky." It's about gathering data in the lab. Graduate students are such nerds.

  • Global status tracker for open government data

    November 5, 2013  |  Data Sharing

    Open Data Index

    The Open Knowledge Foundation launched the Open Data Index, so you can see what data countries provide to their citizens.

    An increasing number of governments have committed to open up data, but how much key information is actually being released? Is the available data legally and technically usable so that citizens, civil society and businesses can realise the full benefits of the information? Which countries are the most advanced and which are lagging in relation to open data? The Open Data Index has been developed to help answer such questions by collecting and presenting information on the state of open data around the world - to ignite discussions between citizens and governments.

    Based on community editor contributions, the index assesses the availability of datasets such as transportation timetables, election results, and legislation, and provides a single-number score. The higher the score is, the more data a government makes available to the public. Of the 70 participating countries, the UK leads the way, followed by the United States and Denmark.

  • Cancer data for the U.S. released

    November 1, 2013  |  Data Sources

    The Centers for Disease Control and Prevention released their most recent cancer data a few days ago. It's the numbers for 2010, which feels dated. However, the annual data goes back to 1999, across demographics and states, which makes this data worth a look. You can download the delimited files here.

    A browser accompanies the release, as shown below. It's really just that though, leaving analysis up to you, and it's rough around the edges.

    Cancer statistics

    So if you're looking for a weekend project, this is a good place to go. I'd probably start with the age breakdowns and work from there.

  • U.S. Open Data Institute

    October 29, 2013  |  Data Sharing

    With a $250,000 grant from the Knight Foundation, Waldo Jaquith pushes forward with the U.S. Open Data Institute, an effort to link government data sources and organizations over the next year.

    I'm convinced that we already have many of the right people, organizations and businesses working on open data in the United States. They just don't know about each other. (The organization certainly won't duplicate any of the efforts of the folks in this space.) And we have nearly all of the necessary software, but so much of it is only known within its narrow domain, despite its broad applicability. The institute will connect all of these entities, promote the work of those who are leading the way and provide supportive, nonjudgmental assistance to those who need help. We don't have all the answers, but we know the folks who do. We want to amplify their message and connect them to new collaborators and clients.

    This could be fun.

  • Degrees of separation between athletes from different sports

    October 23, 2013  |  Statistics

    You've probably heard of the six degrees of Kevin Bacon. The idea is that you can name any actor and trace back to Kevin Bacon through actors who have worked together. Ben Blatt for Slate applied this idea to sports and put together an interactive that finds the number of degrees between athletes. The fun part is that you can enter two athletes from different professional sports: basketball, football, and baseball.

    What's even more remarkable is that it's possible to connect players who didn't even play the same sport. Cross-sport athletes like Deion Sanders and Bo Jackson are exceedingly rare, and some combinations of sports are hardly seen at all. Of these 18 athletes, all but one—Bud Grant—played baseball as one of his two pro careers, proving either that the stars of the diamond are athletic enough to master other sports or that anyone athletic enough to play basketball or football can also handle baseball. Hockey is the opposite, as there has never been a pro hockey player who also played top-level basketball, football, or baseball. As a result, hockey is a closed system. But once you get off the ice, it's possible to link every pro baseball, basketball, and football star.

    I like how it only takes 18 players (well, actually probably fewer) to pull double-time to make this possible. To link Yao Ming (basketball) and Joe Montana (football), it only took six hops, with Mark Hendrickson as a link between basketball and baseball and Deion Sanders as the link between baseball and football.

    Surprising? Kind of, but then again, in 2011, almost all pairs of people on Facebook could be linked with just six hops, too. The barebones interactive is still a lot of fun to play with though if you follow sports.
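
    Finding the fewest hops between two players is a shortest-path search on a teammate graph, and breadth-first search is the usual tool for that. Below is a minimal sketch of it. I don't know how Slate's interactive is built. The four named players come from the example above, but every intermediate "teammate" is a placeholder, not a real roster link.

```python
from collections import deque


def shortest_chain(teammates, start, goal):
    """Breadth-first search. Returns the shortest chain of names, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for other in teammates.get(current, ()):
            if other in previous:
                continue
            previous[other] = current
            if other == goal:
                # Walk backwards from the goal to rebuild the chain.
                chain = [goal]
                while previous[chain[-1]] is not None:
                    chain.append(previous[chain[-1]])
                return chain[::-1]
            queue.append(other)
    return None  # the two players are not connected


# Placeholder links for illustration only; not real roster data.
edges = [
    ("Yao Ming", "NBA teammate (placeholder)"),
    ("NBA teammate (placeholder)", "Mark Hendrickson"),
    ("Mark Hendrickson", "MLB teammate (placeholder)"),
    ("MLB teammate (placeholder)", "Deion Sanders"),
    ("Deion Sanders", "NFL teammate (placeholder)"),
    ("NFL teammate (placeholder)", "Joe Montana"),
    ("Hockey player (placeholder)", "Hockey teammate (placeholder)"),
]
teammates = {}
for a, b in edges:
    teammates.setdefault(a, set()).add(b)
    teammates.setdefault(b, set()).add(a)

chain = shortest_chain(teammates, "Yao Ming", "Joe Montana")
hops = len(chain) - 1
closed_off = shortest_chain(teammates, "Yao Ming", "Hockey player (placeholder)")
```

    The hockey placeholders sit in their own disconnected piece of the graph, so the search comes back empty, which is the "closed system" from the excerpt.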

  • Government data shutdown

    October 2, 2013  |  Data Sources

    Census shutdown

    When you go to the United States Census site, Data.gov, or similar government-run sites, you see this. "Due to the lapse in government funding, census.gov sites, services, and all online survey collection requests will be unavailable until further notice." Now it's personal.

  • Consequences of big data exclusions

    October 2, 2013  |  Statistics

    Big data, in all its glory, promises insights into the soul of humankind. There's a hefty restriction though. Data only tells you about the population and actions of individuals it represents, which inevitably excludes part of the population. Jonas Lerman considers two hypothetical people. The first one:

    The first is a thirty-year-old white-collar resident of Manhattan. She participates in modern life in all the ways typical of her demographic: smartphone, Google, Gmail, Netflix, Spotify, Amazon. She uses Facebook, with its default privacy settings, to keep in touch with friends. She dates through the website OkCupid. She travels frequently, tweeting and posting geotagged photos to Flickr and Instagram. Her wallet holds a debit card, credit cards, and a MetroCard for the subway and bus system. On her keychain are plastic barcoded cards for the “customer rewards” programs of her grocery and drugstore. In her car, a GPS sits on the dash, and an E‑ZPass transponder (for bridge, tunnel, and highway tolls) hangs from the windshield.

    That's a lot of data. The second person:

    He lives two hours southwest of Manhattan, in Camden, New Jersey, America’s poorest city. He is underemployed, working part-time at a restaurant, paid under the table in cash. He has no cell phone, no computer, no cable. He rarely travels and has no passport, car, or GPS. He uses the Internet, but only at the local library on public terminals. When he rides the bus, he pays the fare in cash.

    The second person has fewer data flows.

    These days, big data exclusion almost sounds like a good thing (if you're intent on avoiding all marketing-related data collection), but when policy-making, fund allocation, etc. come into play, it's possible the excluded aren't counted. That's not to say people should hurriedly sign up for Facebook and opt in to every tracking study. It's the opposite. Those in charge of the data and those who decide based on what they see in the data are responsible for knowing the background of their source.

  • Tracking criminal movements and predicting hot spots

    September 23, 2013  |  Statistics

    Finding crime hot spots

    In the latest SIAM Journal on Applied Mathematics, Chaturapruek, et al. describe modeling criminal movements based on where potential criminals live and areas of interest.

    Data available on distance between criminals' homes and their targets shows that burglars are willing to travel longer distances for high-value targets, and tend to employ different means of transportation to make these long trips. Of course, this tendency differs among types of criminals. Professionals and older criminals may travel further than younger amateurs. A group of professional burglars planning to rob a bank, for instance, would reasonably be expected to follow a Lévy flight.

    "There is actually a relationship between how far these criminals are willing to travel for a target and the ability for a hotspot to form," explain Kolokolnikov and McCalla.
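
    In case "Lévy flight" is new to you: it's a random walk where most steps are short but a few are very long. Here is a toy sketch that uses Pareto-distributed step lengths as a stand-in for the heavy tail. It is not the model from the paper, and the `alpha` value is an arbitrary choice of mine.

```python
import math
import random


def levy_flight(n_steps, alpha=1.5, seed=0):
    """2D random walk with heavy-tailed (Pareto) step lengths.

    Returns (path, steps): visited points and the length of each step.
    """
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    path = [(x, y)]
    steps = []
    for _ in range(n_steps):
        length = rng.paretovariate(alpha)        # always >= 1, occasionally huge
        angle = rng.uniform(0.0, 2.0 * math.pi)  # direction chosen uniformly
        x += length * math.cos(angle)
        y += length * math.sin(angle)
        steps.append(length)
        path.append((x, y))
    return path, steps


path, steps = levy_flight(1000)
longest = max(steps)
median = sorted(steps)[len(steps) // 2]
```

    Comparing `longest` to `median` shows the pattern: lots of local wandering, punctuated by the occasional long trip to a faraway target.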

    I hear the RV and Pontiac Aztek are the preferred modes of transportation among high school chemistry teachers turned meth cooks.

    Full paper here, if you're into that.

  • A visual explanation of Simpson’s Paradox

    September 19, 2013  |  Statistics

    Simpson paradox

    When you look for overall trends, you often poke around the data in aggregate, but when you zoom out too far, you could miss details or within-category variation. Sometimes when you zoom in, you see a trend completely opposite to what you saw overall. This is known as Simpson's Paradox. Lewis Lehe and Victor Powell explain in a series of small, interactive charts.

    Why does this matter?

    Simpson's paradox usually fools us on tests of performance. In a famous example, researchers concluded that a newer treatment for kidney stones was more effective than traditional surgery, but it was later revealed that the newer treatment was more often being used on small kidney stones. More recently, on elementary school tests, minority students in Texas outperform their peers in Wisconsin, but Texas has so many minority students that Wisconsin beats it in state rankings. It would be a shame if Simpson's paradox led doctors to prescribe ineffective treatments or Texas schools to waste money copying Wisconsin.

    The takeaway lesson: Remember to look at the details. [Thanks, Victor]
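
    The kidney stone example is easy to check with arithmetic. The excerpt doesn't give numbers, so the counts below are the figures usually quoted for that study, which I'm bringing in as an assumption. The labels "traditional" and "newer" are mine.

```python
# (successes, patients) by treatment and stone size.
# Counts are the commonly quoted figures for the kidney stone example,
# an assumption here since the excerpt gives no numbers.
results = {
    "traditional": {"small": (81, 87), "large": (192, 263)},
    "newer": {"small": (234, 270), "large": (55, 80)},
}


def success_rate(groups):
    """Pooled success rate over a list of (successes, patients) pairs."""
    groups = list(groups)
    successes = sum(s for s, _ in groups)
    patients = sum(n for _, n in groups)
    return successes / patients


# Aggregate view: pool both stone sizes for each treatment.
overall = {t: success_rate(sizes.values()) for t, sizes in results.items()}

# Zoomed-in view: compare treatments within each stone size.
within = {
    size: {t: success_rate([results[t][size]]) for t in results}
    for size in ("small", "large")
}

for treatment, rate in overall.items():
    print(f"overall {treatment}: {rate:.0%}")
for size, rates in within.items():
    for treatment, rate in rates.items():
        print(f"{size} stones, {treatment}: {rate:.0%}")
```

    The newer treatment wins in aggregate but loses within both stone sizes, because it was used mostly on the small stones, which are easier to treat either way.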

  • Dialect quiz shows where others talk like you do

    September 18, 2013  |  Statistics

    Dialect quiz and survey

    North Carolina State statistics graduate student Joshua Katz already mapped dialect across the United States, and now there's a fun addition in quiz form. Answer the 25-question survey (or the more detailed 140-question version if you dare), and you get a map of language similarity. More specifically, the resulting map shows the probability that someone in that area understands what you're saying.

    My results were dead on.

Unless otherwise noted, graphics and words by me are licensed under Creative Commons BY-NC. Contact original authors for everything else.