  • What data brokers know about you

    March 11, 2013  |  Statistics

    Lois Beckett for ProPublica has a thorough piece on data brokers — companies that collect and sell information about you — and what they know and where they get the data from.

    They start with the basics, like names, addresses and contact information, and add on demographics, like age, race, occupation and "education level," according to consumer data firm Acxiom's overview of its various categories.

    But that's just the beginning: The companies collect lists of people experiencing "life-event triggers" like getting married, buying a home, sending a kid to college — or even getting divorced.

    Credit reporting giant Experian has a separate marketing services division, which sells lists of "names of expectant parents and families with newborns" that are "updated weekly."

    The companies also collect data about your hobbies and many of the purchases you make. Want to buy a list of people who read romance novels? Epsilon can sell you that, as well as a list of people who donate to international aid charities.

    So if you're wondering why you received that catalog in the mail, it was probably because a store sold your purchase data to a broker.

  • Using search data to find drug side effects

    March 8, 2013  |  Statistics

    Along the same lines as Google Flu Trends, researchers at Microsoft, Stanford and Columbia University are investigating whether search data can be used to find interactions between drugs. They recently found an interaction.

    Using automated software tools to examine queries by six million Internet users taken from Web search logs in 2010, the researchers looked for searches relating to an antidepressant, paroxetine, and a cholesterol lowering drug, pravastatin. They were able to find evidence that the combination of the two drugs caused high blood sugar.

    The idea is that people are searching for symptoms and medications, and this data is stored in anonymized search logs. The researchers then followed a suspicion that using the two drugs at the same time might cause hyperglycemia. Those who searched for both drugs were more likely to search for hyperglycemia than the control group (presumably those who searched for only one of the drugs).
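
    For a rough feel of that comparison, here's a minimal sketch in Python. The records are made up, and the comparison group (people who searched for only one of the two drugs) is my assumption. It shows the shape of the calculation, not the researchers' actual method.

```python
# Toy, made-up search logs: one record per user (not real data).
# "symptom" marks whether the user also searched for hyperglycemia terms.
BOTH = frozenset({"paroxetine", "pravastatin"})

logs = [
    {"drugs": {"paroxetine", "pravastatin"}, "symptom": True},
    {"drugs": {"paroxetine", "pravastatin"}, "symptom": True},
    {"drugs": {"paroxetine", "pravastatin"}, "symptom": False},
    {"drugs": {"paroxetine", "pravastatin"}, "symptom": False},
    {"drugs": {"paroxetine"}, "symptom": True},
    {"drugs": {"paroxetine"}, "symptom": False},
    {"drugs": {"pravastatin"}, "symptom": False},
    {"drugs": {"pravastatin"}, "symptom": False},
    {"drugs": {"pravastatin"}, "symptom": False},
]


def symptom_rate(records, in_group):
    """Share of users in a group who also searched for hyperglycemia terms."""
    group = [r for r in records if in_group(r["drugs"])]
    if not group:
        return 0.0
    return sum(r["symptom"] for r in group) / len(group)


# Users who searched for both drugs vs. users who searched for exactly one.
both_rate = symptom_rate(logs, lambda drugs: drugs == BOTH)
one_rate = symptom_rate(logs, lambda drugs: len(drugs & BOTH) == 1)

print(f"both drugs: {both_rate:.2f}, one drug: {one_rate:.2f}")
```

    A real analysis would also need a significance test and a much more careful definition of what counts as a symptom search.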

    The work is still in its infancy, but it'll be interesting to see how this sort of data can be used to supplement existing work by the Food and Drug Administration.

  • Netflix data and puppets

    March 4, 2013  |  Statistics

    Andrew Leonard for Salon fears what might become of the creative process if movies are based on algorithms and data, and worries that we might turn into puppets.

    For years Netflix has been analyzing what we watched last night to suggest movies or TV shows that we might like to watch tomorrow. Now it is using the same formula to prefabricate its own programming to fit what it thinks we will like. Isn't the inevitable result of this that the creative impulse gets channeled into a pre-built canal?

    Because tastes never change? We don't have any choice but to watch what is handed to us? Will creators stop making things that go against the norm? Leonard concludes with us stuck in a trance, in front of our televisions.

    The companies that figure out how to generate intelligence from that data will know more about us than we know ourselves, and will be able to craft techniques that push us toward where they want us to go, rather than where we would go by ourselves if left to our own devices. I'm guessing this will be good for Netflix's bottom line, but at what point do we go from being happy subscribers, to mindless puppets?

    Again, the assumption is that we have no say in the matter. But when a company or service suggests that we buy or watch something, we don't have to follow.

    Netflix in particular thrives by providing a service that shows us what they think we might want to watch from a selection of thousands of options. Part of that algorithm depends on our own movie ratings and preferences. If Netflix offers poor suggestions, you can leave the service. Yeah. You can stop paying 8 bucks a month.

    Let's turn it around. What if Netflix analyzed viewing data not to offer their best viewing suggestions or to make shows and movies that people like but to expand people's viewing windows? Let's say that the data shows that you watch a lot of "witty, critically acclaimed comedies", so Netflix suggests you watch more "romantic dramas" to make you more well-rounded. Are you a mindless puppet if you take the suggestion, even if you end up hating the movie? Are you a mindless puppet if you ignore the suggestion and continue watching what you know you like?

    From the production perspective, it makes sense to try to make something a lot of people like. From the consumer perspective, we still get to decide what we want to spend our money on.

    It's good to be concerned about how companies use personal data. Data privacy, ownership, and ethics are important issues, but it shouldn't mean a fear of all things data.

  • This pie chart is amazing.

    March 1, 2013  |  Mistaken Data

    Best part of Super Bowl

    From the Winnipeg Sun. Something isn't right here. [via]

  • Porn star demographics

    February 15, 2013  |  Statistics

    Porn star hair color

    Jon Millward explored porn star demographics using a data scrape from the Internet Adult Film Database: hair color, race, and birthplace, among other things. (There aren't any dirty pictures, but there's some terminology that might be NSFW.)

    The average measurements?

    I thought that maybe if the women are overestimating how light they are, they might also be a bit too generous when reporting their measurements. It turns out they probably aren’t though, because the most common bra size for a female porn star is a surprisingly handleable 34B. Not double-D, not even a D. Double-D actually came in 4th, behind B, C and D. The most common set of measurements for the women was 34-24-34.

    So, if the average female porn star is a 5'5" woman who weighs 117lbs and has B-cup breasts, what colour is her hair? Blonde, presumably, if my friends' guesses were anything to go by.

    Apparently not. Dark-haired porn stars outnumber blonde ones almost 2-to-1.

    Millward doesn't look at changes over time a whole lot, but if the BMI of Playboy playmates is any indicator, I bet those measurements have changed over the years.

  • Analysis of LEGO brick prices over the years

    February 7, 2013  |  Statistics

    Cost of LEGO bricks

    Reality Prose has an excellent analysis on the changing price of LEGO bricks over the years and a misconception that cost has gone up. According to the chart above, based on data from BrickSet and adjusted for inflation, the average cost per brick has come down.

  • Philosophy of data

    February 6, 2013  |  Statistics

    David Brooks for The New York Times on the philosophy of data and what the future holds:

    If you asked me to describe the rising philosophy of the day, I’d say it is data-ism. We now have the ability to gather huge amounts of data. This ability seems to carry with it certain cultural assumptions — that everything that can be measured should be measured; that data is a transparent and reliable lens that allows us to filter out emotionalism and ideology; that data will help us do remarkable things — like foretell the future.

    Be sure to read the comments. There's actually quite a bit of anti-data talk.

  • The most poisoned name in US history

    January 31, 2013  |  Statistics

    Poisoned names

    Biostatistics PhD candidate Hilary Parker dived into the most poisoned names in US history. Her own name topped the list. There were several fad names such as Deneen, Catina, and Farrah that saw a quick spike and then a plummet, but the trend for Hilary is different.

    "Hilary", though, was clearly different than these flash-in-the-pan names. The name was growing in popularity (albeit not monotonically) for years. So to remove all of the fad names from the list, I chose only the names that were in the top 1000 for over 20 years, and updated the graph (note that I changed the range on the y-axis).

    I think it's pretty safe to say that, among the names that were once stable and then had a sudden drop, "Hilary" is clearly the most poisoned.

    There it is minding its own business, enjoying a steady rise in popularity over a few decades, and then boom, Bill Clinton is elected, and the name dies a quick death.

    Be sure to check out the rest of the analysis. Good stuff. [Thanks, @hspter]

  • Using data to find a husband

    January 15, 2013  |  Statistics

    When it was time to settle down with the right man, Amy Webb joined two dating sites, created a profile, and went on some horrible dates. Her solution was to create fake male profiles and then scrape and analyze data to find out how she could improve her chances.

    Posing as these men, I spent a month using JDate. I interacted with 96 women, cataloging how they behaved and presented themselves online and scraping data from their profiles (such as the language they used or the number of hours they waited before emailing back one of my profiles). Wanting to learn everything I could about my competition, I kept a detailed database, and I recorded which female profiles were popular. While JDate doesn't publicly release its algorithms, at the time of my experiment I observed that the more popular profiles come up higher in search results, allowing one to get a quick-and-dirty ranking of who's hot (or not). I quickly realized that the popular women seemed to know something I didn't; they were clearly attracting the sort of smart, attractive professionals who had been ignoring my profile. Being hypercompetitive, I wasn't about to let some bubblegum-popping blonde steal the neurotic Jewish doctor of my mother's dreams.

    Basically, she pulled an OkCupid for herself. It worked.

  • Data Analysis (with R) on Coursera

    December 21, 2012  |  Statistics

    Jeff Leek, an Assistant Professor of Biostatistics at the Johns Hopkins Bloomberg School of Public Health, is teaching a course on data analysis on Coursera, appropriately named Data Analysis.

    This course is an applied statistics course focusing on data analysis. The course will begin with an overview of how to organize, perform, and write-up data analyses. Then we will cover some of the most popular and widely used statistical methods like linear regression, principal components analysis, cross-validation, and p-values. Instead of focusing on mathematical details, the lectures will be designed to help you apply these techniques to real data using the R statistical programming language, interpret the results, and diagnose potential problems in your analysis.

    The course starts on January 22, 2013.

    You might also be interested in Computing for Data Analysis taught by Roger Peng, who is also a biostatistics professor at Johns Hopkins. Leek's course is focused on statistical methods, whereas Peng's course is focused on programming. Better take both. [via Revolutions]

  • Statistical network of basketball

    December 12, 2012  |  Statistics

    LA Laker network analysis

    By now, everyone's heard of Moneyball: applying statistics to baseball to build the best team for the buck. Naturally, there's a lot of interest these days in applying the same data-based philosophy to other sports. Jennifer Fewell and Dieter Armbruster used network analysis to model gameplay in basketball.

    To analyze basketball plays, Fewell and Armbruster used a technique called network analysis, which turns teammates into nodes and exchanges — passes — into paths. From there, they created a flowchart of sorts that showed ball movement, mapping game progression pass by pass: Every time one player sent the ball to another, the flowchart lines accumulated, creating larger and larger arrows.

    Using data from the 2010 playoffs, Fewell and Armbruster’s team mapped the ball movement of every play. Using the most frequent transactions — the inbound pass to shot-on-basket — they analyzed the typical paths the ball took around the court.

    The challenge with basketball is that play is continuous, whereas baseball events are discrete, so you can't apply the same methods. But if you can model the game properly, you know where to optimize and areas that need work.
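
    To make the nodes-and-paths idea concrete, here's a minimal sketch in Python. The pass log and position labels are made up, and the greedy "follow the heaviest arrow" walk is my own simplification, not the method from the paper.

```python
from collections import Counter

# Hypothetical pass log: (from, to) pairs across two possessions.
# "inbound" and "shot" are placeholder start and end nodes.
passes = [
    ("inbound", "PG"), ("PG", "SG"), ("SG", "PG"), ("PG", "C"), ("C", "shot"),
    ("inbound", "PG"), ("PG", "C"), ("C", "shot"),
]

# Each directed edge accumulates weight every time that pass happens,
# which is the "larger and larger arrows" part.
network = Counter(passes)


def heaviest_path(network, start, end, max_steps=10):
    """Greedy walk: from each node, follow the most-used unvisited edge."""
    path = [start]
    node = start
    while node != end and len(path) <= max_steps:
        options = {
            dst: weight
            for (src, dst), weight in network.items()
            if src == node and dst not in path
        }
        if not options:
            break  # dead end before reaching the target node
        node = max(options, key=options.get)
        path.append(node)
    return path


print(network.most_common(3))
print(heaviest_path(network, "inbound", "shot"))
```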

  • The differences between machine learning, data mining, and statistics

    December 10, 2012  |  Statistics

    From machine learning to data mining. From statistics to probability. A lot of it seems similar, so what are the differences? Statistician William Briggs explains in an FAQ.

    What's the difference between machine learning, deep learning, big data, statistics, decision & risk analysis, probability, fuzzy logic, and all the rest?

    None, except for terminology, specific goals, and culture. They are all branches of probability, which is to say the understanding and sometime quantification of uncertainty. Probability itself is an extension of logic.

    I was surprised he didn't throw data science into the mix, but you could and the document would pretty much be the same.

  • A new kind of resource

    December 3, 2012  |  Statistics

    Jer Thorp talks ethics in the data-as-new-oil metaphor:

    [W]e need to change the way that we collectively think about data, so that it is not a new oil, but instead a new kind of resource entirely. For this to occur we need to foster a deep understanding of data in society. As it happens, humanity has a mechanism for this kind of broad cultural change: the arts. As we proceed towards profit and progress with data, let us encourage artists, novelists, performers and poets to take an active role in the conversation. In doing so we may avoid some of the mistakes that we made with the old oil.

    See also: Jer's talk on the human side of data.

  • Machines and built-in morality

    November 29, 2012  |  Statistics

    With Google's driverless cars now street legal in California, Florida, and Nevada, Gary Marcus for the New Yorker ponders a world where machines need a built-in morality system.

    That moment will be significant not just because it will signal the end of one more human niche, but because it will signal the beginning of another: the era in which it will no longer be optional for machines to have ethical systems. Your car is speeding along a bridge at fifty miles per hour when an errant school bus carrying forty innocent children crosses its path. Should your car swerve, possibly risking the life of its owner (you), in order to save the children, or keep going, putting all forty kids at risk? If the decision must be made in milliseconds, the computer will have to make the call.

    Data analysis seems to be headed in the same direction. Just as machines will have to start making human-like decisions, data increasingly represents the real world and looks less like snippets in time. As the gap between numbers and what they represent shrinks, we have to think more about ethics, privacy, and whether or not what we're doing is right.

  • Archive of datasets bundled with R

    November 20, 2012  |  Data Sources

    R comes with a lot of datasets, some with the core distribution and others with packages, but you'd never know which ones unless you went through all the examples found at the end of help documents. Luckily, Vincent Arel-Bundock cataloged 596 of them in an easy-to-read page, and you can quickly download them as CSV files.

    Many of the datasets are dated, going back to the original distribution of R, but it's a great resource for teaching or if you're just looking for some data to play with.

  • Incredibly divided nation in a map

    November 9, 2012  |  Mistaken Data

    Divided nation

    I knew things were bad, but I didn't know they were this bad. Obama has his work cut out for him. [Thanks, @adamsinger]

  • How Silver predictions performed

    November 7, 2012  |  Statistics

    By way of Rafa Irizarry from Simply Statistics, a plot of Nate Silver's probabilities for Barack Obama winning a state versus the percentage of vote in each state, as of midnight EST.

    I guess that's pretty (100%) good. Looks like the folks at Princeton didn't do half bad either. It's a win for Obama and a win for statistics. Well, good statistics, at least. (Looking at you, University of Colorado.)

    Update: Drew Linzer at Emory and the Huffington Post Pollster also did well. All in all, it was a good night for statistics.

  • A quick lesson on making predictions

    October 31, 2012  |  Statistics

    Political analyst and statistician Nate Silver has gotten some flak lately for consistently projecting a 70-plus percent chance of a Barack Obama win this election. But as Jeff Leek explains, the criticism doesn't stem from Silver being wrong. Rather, it comes from the critics' misunderstanding of statistics. Leek provides a quick lesson on how Silver makes his predictions and how the methods apply to other things, like the weather.

    Now, this might seem like a goofy way to come up with a "percent chance" with simulated elections and all. But it turns out it is actually a pretty important thing to know and relevant to those of us on the East Coast right now. It turns out weather forecasts (and projected hurricane paths) are based on the same sort of thing — simulated versions of the weather are run and the "percent chance of rain" is the fraction of times it rains in a particular place.

    So Romney may still win and Obama may lose — and Silver may still get a lot of it right. But regardless, the approach taken by Silver is not based on politics, it is based on statistics.
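
    Here's a minimal sketch in Python of how simulated elections turn into a "percent chance." The states, electoral votes, and win probabilities are all made up, and each state is treated as independent, which a real model wouldn't do. This illustrates the idea rather than Silver's model.

```python
import random

# Hypothetical inputs: state -> (electoral votes, chance candidate A wins it).
HYPOTHETICAL_STATES = {
    "State 1": (20, 0.90),
    "State 2": (15, 0.55),
    "State 3": (10, 0.50),
    "State 4": (25, 0.20),
    "State 5": (9, 0.75),
}


def simulate_election(states, rng):
    """Return the electoral votes candidate A wins in one simulated election."""
    return sum(votes for votes, p_win in states.values() if rng.random() < p_win)


def win_chance(states, n_sims=10_000, seed=0):
    """Fraction of simulated elections in which candidate A gets a majority."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    needed = sum(votes for votes, _ in states.values()) // 2 + 1
    wins = sum(simulate_election(states, rng) >= needed for _ in range(n_sims))
    return wins / n_sims


print(f"Chance of winning: {win_chance(HYPOTHETICAL_STATES):.1%}")
```

    The "percent chance" is just the share of simulated elections that the candidate wins, the same way a chance of rain is the share of simulated forecasts where it rains.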

    Don't fear the black box.

  • Data on decades of Boy Scout expulsions released

    October 22, 2012  |  Data Sources

    Allegations in the Boy Scouts

    The Los Angeles Times released nearly 5,000 records of allegations from the Boy Scouts of America as a browseable map and searchable list. You can also download the data.

    This database contains information on about 5,000 men and a handful of women who were expelled from the Boy Scouts of America between 1947 and January 2005 on suspicion of sexual abuse. The dots on the map indicate the location of troops connected in some way to the accused. The timeline below shows the volume of cases opened by year; however, an unknown number of files were purged by the Scouts prior to the early 1990s.

    The interactive map helps you narrow down by city, but it's kind of hard to see cases from a country-wide perspective. Here's a quick look.

    The worst part is that a lot of the cases went unreported.

  • The birthday problem explained

    October 5, 2012  |  Statistics

    How many people does it take for there to be a 50% chance that a pair in the group has the same birthday? Only 23 people. What about a 99% chance? Maybe even more shocking: 57 people. This is the birthday problem, which every undergrad who's taken a stat course has seen. Steven Strogatz explains the logic and calculations.

    Intuitively, how can 23 people be enough? It’s because of all the combinations they create, all the opportunities for luck to strike. With 23 people, there are 253 possible pairs of people (see the notes for why), and that turns out to be enough to push the odds of a match above 50 percent.

    Incidentally, if you go up to 43 people — the number of individuals who have served as United States president so far — the odds of a match increase to 92 percent. And indeed two of the presidents do have the same birthday: James Polk and Warren Harding were both born on Nov. 2.
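
    If you want to check the numbers yourself, here's a minimal sketch in Python. It uses the usual simplifying assumptions: 365 equally likely birthdays, no leap years, and independent birthdays. The function names are mine.

```python
from math import comb


def p_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday."""
    p_all_distinct = 1.0
    for k in range(n):
        # Person k+1 must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


def smallest_group(target, days=365):
    """Smallest group size whose match probability reaches the target."""
    n = 1
    while p_shared_birthday(n, days) < target:
        n += 1
    return n


print(comb(23, 2))  # number of distinct pairs among 23 people
for n in (23, 43, 57):
    print(n, round(p_shared_birthday(n), 3))
print(smallest_group(0.5), smallest_group(0.99))
```

    It's easier to compute the chance that nobody matches and subtract from one than to count all the ways a match could happen.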

    The Johnny Carson clip referenced in the article is worth watching. Carson tries to test the results with the audience, but goes about it the wrong way.

Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.