• # Convergence of Miss Korea faces

May 20, 2013 to Statistics by Nathan Yau

After seeing a Reddit post on the convergence of Miss Korea faces, supposedly due to high rates of plastic surgery, graduate student Jia-Bin Huang analyzed the faces of 20 contestants. Below is a short video of each face slowly transitioning into the next.

From the video and pictures it's pretty clear that the faces look similar, but Huang took it a step further, using a handful of computer vision techniques to quantify the likeness between faces. Again, the analysis shows similarity between the photos, so the gut reaction is that the contestants are nearly identical.

However, you have to assume that the pictures are accurate representations of the contestants, which doesn't seem to pan out at all. It's amazing what some makeup, hair, and photoshop can do.

You gotta consider your data source before you make assumptions about what that data represents.

• # Length of the average dissertation

May 8, 2013 to Statistics by Nathan Yau

On R is My Friend, as a way to procrastinate on his own dissertation, beckmw took a look at dissertation length via the digital archives at the University of Minnesota.

I've selected the top fifty majors with the highest number of dissertations and created boxplots to show relative distributions. Not many differences are observed among the majors, although some exceptions are apparent. Economics, mathematics, and biostatistics had the lowest median page lengths, whereas anthropology, history, and political science had the highest median page lengths. This distinction makes sense given the nature of the disciplines.

I was on the long end of the statistics distribution, around 180 pages. Probably because I had a lot of pictures.

As I was working on my dissertation, people often asked me how many pages I had written and how many pages I had left to write. I never had a good answer, because there's no page limit or required page count. It's just whenever you (and your adviser) feel like there's enough to get a point across. Sometimes that takes 50 pages. Other times it takes 200.

So for those who get that dreaded page-count question, you can wave your finger at this chart and tell people you're somewhere in the distribution.
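The comparison behind those boxplots boils down to summarizing a distribution of page counts per major. A minimal sketch of that summary step, using invented numbers rather than the actual University of Minnesota archive data:

```python
# Toy version of the boxplot comparison: page counts grouped by
# major, summarized by median and quartiles. All counts are made up.
from statistics import median, quantiles

pages = {
    "economics":    [90, 105, 110, 120, 135],
    "statistics":   [100, 120, 140, 160, 180],
    "anthropology": [180, 220, 250, 280, 320],
}

for major, counts in pages.items():
    q1, med, q3 = quantiles(counts, n=4)  # quartile cut points
    print(f"{major:12s} median={med:.0f} IQR=({q1:.0f}, {q3:.0f})")
```

The median and interquartile range are exactly what the box and whiskers in beckmw's chart encode, so a table like this is the non-graphical equivalent.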

• # The Numbers Game on National Geographic

April 29, 2013 to Statistics by Nathan Yau

Jake Porway, the founder of DataKind, has a new show on the National Geographic channel called The Numbers Game. I unfortunately don't have the channel, so the clips on the site will have to suffice for now.

Keep in mind this show is for a wide audience though. Jake notes:

Now for those of you who have been writing to me excited that Big Data is finally getting its own TV show, I should point out that this show is a lot more like a science show than a show about data. You won’t find discussions about Hadoop, machine learning, or even the basics of correlation vs. causation here. Instead, the show tries to make the latest statistics accessible to a wide audience of people who may just be dipping their toes in to this new world of data. It’s more Guy Fieri than Carl Sagan, but it’s a blast.

The first of three episodes aired last week, and the second is on tonight. You should watch it.

• # Flexible data

April 17, 2013 to Statistics by Nathan Yau

Data is an abstraction of something that happened in the real world. How people move. How they spend money. How a computer works. The tendency is to approach data, and by extension visualization, as rigid facts stripped of joy, humor, conflict, and sadness, because that makes analysis easier. Visualization is easier when you can strip the data down to unwavering fact and then reduce the process to a set of unwavering rules.

The world is complex though. There are exceptions, limitations, and interactions that aren't expressed explicitly through data. So we make inferences with uncertainty attached. We make an educated guess and then compare it to the thing that was actually measured to see if the data and our findings make sense.

Data isn't rigid, so neither is visualization.

Are there rules? There are, just like there are in statistics. And you should learn them.

However, in statistics, you eventually learn that there's more to analysis than hypothesis tests and normal distributions, and in visualization you eventually learn that there's more to the process than efficient graphical perception and avoidance of all things round. Design matters, no doubt, but your understanding of the data matters much more.

• # Problematic databases used to track employee theft

April 3, 2013 to Data Sharing by Nathan Yau

Employee theft accounts for billions of dollars of lost merchandise per year, so it's a huge concern for retailers, but it often goes unreported as a crime. If only there were reference databases where business owners could report offenders and look up potential employees to see if they have ever stolen anything. It turns out there are, but the systems have proved problematic.

"We're not talking about a criminal record, which either is there or is not there — it's an admission statement which is being provided by an employer," said Irv Ackelsberg, a lawyer at Langer, Grogan & Diver who represents Ms. Goode.

Such statements may contain no outright admission of guilt, like one submitted after Kyra Moore, then a CVS employee, was accused of stealing: "picked up socks left them at the checkout and never came back to buy them," it read. When Ms. Moore later applied for a job at Rite Aid, she was deemed "noncompetitive." She is suing Esteem.

On paper, the data sounds great for business owners, and keeping such data also seems like a fine business to run. Thefts go down and owners can focus on other aspects of their business. The challenge and complexity come when we remember that people are involved.

• # How to become a password cracker in a day

March 26, 2013 to Statistics by Nathan Yau

Nate Anderson, deputy editor at Ars Technica, was curious whether he could learn to crack passwords in a day. Although there's definitely a difference between advanced and beginner crackers, openly available software and resources make it easy to get started and do some damage.

After my day-long experiment, I remain unsettled. Password cracking is simply too easy, the tools too sophisticated, the CPUs and GPUs too powerful for me to believe that my own basic attempts at beefing up my passwords are a long-term solution. I've resisted password managers in the past over concerns about storing data in the cloud or about the hassle of syncing with other computers or about accessing passwords from a mobile device or because dropping \$50 bucks never felt quite worth it—hacks only happen to other people, right?

But until other forms of authentication take root, the humble password will form a primary defense of our personal information. The time has come for me to find a better solution to generating, storing, and handling them.

• # Odds of a perfect NCAA March Madness bracket

March 22, 2013 to Statistics by Nathan Yau

Math professor Jeff Bergen explains the odds of picking a perfect bracket.

The first probability is based on a 50/50 split of correct picks, which is like using fair coin flips to pick winners. Bergen doesn't really go into how he calculated the second probability, but that smaller number comes up by bumping up the probability of picking the right team for each game. I think he's using an average probability of slightly less than 70% (based on simulation results from this old Wall Street Journal column).
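The arithmetic is easy to sketch. A standard 64-team bracket has 63 games, so coin-flip picks give odds of 1 in 2^63; the ~0.7 per-game probability below is the assumption from the column mentioned above, not a number Bergen states:

```python
# Odds of a perfect bracket under two per-game pick probabilities.
games = 63  # games in a standard 64-team bracket

# Coin-flip picks: every game is 50/50.
coin_flip_odds = 2 ** games  # 1 in ~9.2 quintillion

# Informed picks: assume an average probability p of calling each game.
p = 0.7
informed_odds = (1 / p) ** games  # 1 in ~5.7 billion

print(f"coin flips: 1 in {coin_flip_odds:.3g}")
print(f"p = {p}:    1 in {informed_odds:.3g}")
```

Even a modest bump in per-game accuracy shrinks the odds by nine orders of magnitude, which is why the favorites-heavy strategy matters so much in these contests.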

That's why businesses can offer up million-dollar prizes. In all likelihood, no one is going to win, which turns out to be great business for the insurance companies that back these contests:

If millions of people enter a particular contest, it might seem like the chance of someone winning is suddenly in the realm of possibility. But there's a catch: This scenario assumes everyone maximized their chances by picking mostly favorites, so those with the best shot at winning are likely to have identical entries. These contests generally protect themselves from big losses by stating they'll divvy up the loot if there are multiple perfect brackets.

These favorable conditions make insuring these prize offers a good business, as the Dallas company SCA Promotions has discovered. SCA, founded by 11-time world bridge champion Robert D. Hamman, has taken on the insurance risk for roughly 50 perfect-bracket prizes -- including a Sporting News offer of \$1 million in 2001, according to vice president Chris Hamman, the founder's son. In the 12 years it has been doing so, SCA has never had to pay out a claim.

• # Declining songwriter ratings with age

March 21, 2013 to Statistics by Nathan Yau

Do singer-songwriters age well like a fine wine, or does quality decline with age? Kyle Biehle analyzed fan ratings by age.

I understand all of the reasons for not comparing artists in this way. Despite twenty-one Academy Award nominations, Woody Allen never attends the Oscars. His reason is that art isn't competition — judging art is so subjective who's to say who or what is best? After all one man's Poison is another man's Cream. Similarly, Elvis Costello (featured in the viz) is famously credited with saying: "Writing about music is like dancing about architecture - It's a really stupid thing to want to do." I agree that using ratings - whether from fans or critics — to judge artistic merit is at best flawed and at worst a fool's exercise.

But I wanted to do it anyway.

Most songwriters peak in their 20s and then either stabilize or continue to decline. Occasionally, as in the case of Bob Dylan, there's some see-sawing. Take a look at the Tableau interactive for a closer look. [via Waxy]

• # Data hackathon challenges and why questions are important

March 12, 2013 to Statistics by Nathan Yau

Jake Porway, executive director of DataKind, on data hackathons and why they require careful planning to actually work:

Any data scientist worth their salary will tell you that you should start with a question, NOT the data. Unfortunately, data hackathons often lack clear problem definitions. Most companies think that if you can just get hackers, pizza, and data together in a room, magic will happen. This is the same as if Habitat for Humanity gathered its volunteers around a pile of wood and said, "Have at it!" By the end of the day you'd be left with a half of a sunroom with 14 outlets in it.

Without subject matter experts available to articulate problems in advance, you get results like those from the Reinvent Green Hackathon. Reinvent Green was a city initiative in NYC aimed at having technologists improve sustainability in New York. Winners of this hackathon included an app to help cyclists "bikepool" together and a farmer's market inventory app. These apps are great on their own, but they don't solve the city's sustainability problems. They solve the participants' problems because as a young affluent hacker, my problem isn't improving the city's recycling programs, it's finding kale on Saturdays.

Without clear direction on what to do with the data or questions worth answering, hackathons can end up being a bust from all angles. From the organizer side, you end up with a hodgepodge of projects that vary a lot in quality and purpose. From the participant side, you're left to your own devices and have to approach the data blind, without a clear starting point. From the judging side, you almost always end up having to pick a winner when there isn't a clear one, because the criteria of the contest were fuzzy to begin with.

This also applies to hiring freelancers for visualization work. You should have a clear goal or story to tell with your data. If you expect the hire to analyze your data and produce a graphic, you better get someone with a statistics background. Otherwise, you end up with a design-heavy piece with little substance.

Basically, the more specific you can be about what you're looking for, the better.

• # What data brokers know about you

March 11, 2013 to Statistics by Nathan Yau

Lois Beckett for ProPublica has a thorough piece on data brokers — companies that collect and sell information about you — and what they know and where they get the data from.

They start with the basics, like names, addresses and contact information, and add on demographics, like age, race, occupation and "education level," according to consumer data firm Acxiom's overview of its various categories.

But that's just the beginning: The companies collect lists of people experiencing "life-event triggers" like getting married, buying a home, sending a kid to college — or even getting divorced.

Credit reporting giant Experian has a separate marketing services division, which sells lists of "names of expectant parents and families with newborns" that are "updated weekly."

The companies also collect data about your hobbies and many of the purchases you make. Want to buy a list of people who read romance novels? Epsilon can sell you that, as well as a list of people who donate to international aid charities.

So if you're wondering why you received that catalog in the mail, it was probably because a store sold your purchase data to a broker.

• # Using search data to find drug side effects

March 8, 2013 to Statistics by Nathan Yau

Along the same lines as Google Flu Trends, researchers at Microsoft, Stanford, and Columbia University are investigating whether search data can be used to find interactions between drugs. They recently found one.

Using automated software tools to examine queries by six million Internet users taken from Web search logs in 2010, the researchers looked for searches relating to an antidepressant, paroxetine, and a cholesterol lowering drug, pravastatin. They were able to find evidence that the combination of the two drugs caused high blood sugar.

The idea is that people search for symptoms and medications, and these queries are stored in anonymized search logs. The researchers followed a suspicion that using the two drugs at the same time might cause hyperglycemia. Those who searched for both drugs were more likely to also search for hyperglycemia than a control group, presumably users who searched for only one of the drugs or for neither.
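The comparison described above can be sketched as a simple rate comparison between the two groups of searchers. This is only an illustration of the idea, not the researchers' actual method, and every count below is made up:

```python
# Hypothetical sketch: do users who searched for both drugs go on to
# search for hyperglycemia more often than a control group?
def symptom_rate(symptom_searches, total_users):
    """Fraction of a group that later searched for the symptom."""
    return symptom_searches / total_users

both_drugs = symptom_rate(1200, 12000)  # searched paroxetine AND pravastatin
control    = symptom_rate(400, 40000)   # searched one drug or neither

relative_risk = both_drugs / control
print(f"rate (both drugs): {both_drugs:.3f}")
print(f"rate (control):    {control:.3f}")
print(f"relative risk:     {relative_risk:.1f}")
```

A relative risk well above 1 is the kind of signal that would flag a candidate interaction for proper follow-up study.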

The work is still in its infancy, but it'll be interesting to see how this sort of data can be used to supplement existing work by the Food and Drug Administration.

• # Netflix data and puppets

March 4, 2013 to Statistics by Nathan Yau

For years Netflix has been analyzing what we watched last night to suggest movies or TV shows that we might like to watch tomorrow. Now it is using the same formula to prefabricate its own programming to fit what it thinks we will like. Isn't the inevitable result of this that the creative impulse gets channeled into a pre-built canal?

Because tastes never change? We don't have any choice but to watch what is handed to us? Will creators stop making things that go against the norm? Leonard concludes with us stuck in a trance, in front of our televisions.

The companies that figure out how to generate intelligence from that data will know more about us than we know ourselves, and will be able to craft techniques that push us toward where they want us to go, rather than where we would go by ourselves if left to our own devices. I'm guessing this will be good for Netflix's bottom line, but at what point do we go from being happy subscribers, to mindless puppets?

Again, the assumption is that we have no say in the matter. But when a company or service suggests that we buy or watch something, we don't have to follow.

Netflix in particular thrives by providing a service that shows us what they think we might want to watch from a selection of thousands of options. Part of that algorithm depends on our own movie ratings and preferences. If Netflix offers poor suggestions, you can leave the service. Yeah. You can stop paying 8 bucks a month.

Let's turn it around. What if Netflix analyzed viewing data not to offer their best viewing suggestions or to make shows and movies that people like but to expand people's viewing windows? Let's say that the data shows that you watch a lot of "witty, critically acclaimed comedies", so Netflix suggests you watch more "romantic dramas" to make you more well-rounded. Are you a mindless puppet if you take the suggestion, even if you end up hating the movie? Are you a mindless puppet if you ignore the suggestion and continue watching what you know you like?

From the production perspective, it makes sense to try to make something a lot of people like. From the consumer perspective, we still get to decide what we want to spend our money on.

It's good to be concerned about how companies use personal data. Data privacy, ownership, and ethics are important issues, but it shouldn't mean a fear of all things data.

• # This pie chart is amazing.

March 1, 2013 to Mistaken Data by Nathan Yau

From the Winnipeg Sun. Something isn't right here. [via]

• # Porn star demographics

February 15, 2013 to Statistics by Nathan Yau

Jon Millward explored porn star demographics using a data scrape from the Internet Adult Film Database: hair color, race, and birthplace, among other things. (There aren't any dirty pictures, but there's some terminology that might be NSFW.)

The average measurements?

I thought that maybe if the women are overestimating how light they are, they might also be a bit too generous when reporting their measurements. It turns out they probably aren’t though, because the most common bra size for a female porn star is a surprisingly handleable 34B. Not double-D, not even a D. Double-D actually came in 4th, behind B, C and D. The most common set of measurements for the women was 34-24-34.

So, if the average female porn star is a 5'5" woman who weighs 117lbs and has B-cup breasts, what colour is her hair? Blonde, presumably, if my friends' guesses were anything to go by.

Apparently not. Dark-haired porn stars outnumber blonde ones almost 2-to-1.

Millward doesn't look at changes over time a whole lot, but if the BMI of Playboy playmates is any indicator, I bet those measurements have changed over the years.

• # Analysis of LEGO brick prices over the years

February 7, 2013 to Statistics by Nathan Yau

Reality Prose has an excellent analysis of the changing price of LEGO bricks over the years and the misconception that cost has gone up. According to the chart above, based on data from BrickSet and adjusted for inflation, the average cost per brick has come down.

• # Philosophy of data

February 6, 2013 to Statistics by Nathan Yau

David Brooks for The New York Times on the philosophy of data and what the future holds:

If you asked me to describe the rising philosophy of the day, I’d say it is data-ism. We now have the ability to gather huge amounts of data. This ability seems to carry with it certain cultural assumptions — that everything that can be measured should be measured; that data is a transparent and reliable lens that allows us to filter out emotionalism and ideology; that data will help us do remarkable things — like foretell the future.

Be sure to read the comments. There's actually quite a bit of anti-data talk.

• # The most poisoned name in US history

January 31, 2013 to Statistics by Nathan Yau

Biostatistics PhD candidate Hilary Parker dived into the most poisoned names in US history. Her own name topped the list. There were several fad names such as Deneen, Catina, and Farrah that saw a quick spike and then a plummet, but the trend for Hilary is different.

"Hilary", though, was clearly different than these flash-in-the-pan names. The name was growing in popularity (albeit not monotonically) for years. So to remove all of the fad names from the list, I chose only the names that were in the top 1000 for over 20 years, and updated the graph (note that I changed the range on the y-axis).

I think it's pretty safe to say that, among the names that were once stable and then had a sudden drop, "Hilary" is clearly the most poisoned.
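Parker's filtering step can be sketched in a few lines: keep only names with staying power in the top 1000, then rank the survivors by the size of their drop. This is not her actual code, and the toy data is invented:

```python
# Years each name spent in the top 1000, and the relative drop in
# popularity from its peak (1.0 = vanished entirely). All made up.
years_in_top_1000 = {"Hilary": 30, "Deneen": 6, "Farrah": 4, "Emma": 80}
relative_drop = {"Hilary": 0.90, "Deneen": 0.95, "Farrah": 0.97, "Emma": 0.10}

# Filter out flash-in-the-pan names, then find the biggest drop.
stable = [n for n, y in years_in_top_1000.items() if y > 20]
most_poisoned = max(stable, key=lambda n: relative_drop[n])
print(most_poisoned)  # prints Hilary; the fad names never make the cut
```

The filter is what makes the result meaningful: Deneen and Farrah fell harder, but they were fads, so they never qualify as "once stable."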

There it is minding its own business, enjoying a steady rise in popularity over a few decades, and then boom, Bill Clinton is elected, and the name dies a quick death.

Be sure to check out the rest of the analysis. Good stuff. [Thanks, @hspter]

• # Using data to find a husband

January 15, 2013 to Statistics by Nathan Yau

When it was time to settle down with the right man, Amy Webb joined two dating sites, created a profile, and went on some horrible dates. Her solution was to create fake male profiles and then scrape and analyze data to find out how she could improve her chances.

Posing as these men, I spent a month using JDate. I interacted with 96 women, cataloging how they behaved and presented themselves online and scraping data from their profiles (such as the language they used or the number of hours they waited before emailing back one of my profiles). Wanting to learn everything I could about my competition, I kept a detailed database, and I recorded which female profiles were popular. While JDate doesn't publicly release its algorithms, at the time of my experiment I observed that the more popular profiles come up higher in search results, allowing one to get a quick-and-dirty ranking of who's hot (or not). I quickly realized that the popular women seemed to know something I didn't; they were clearly attracting the sort of smart, attractive professionals who had been ignoring my profile. Being hypercompetitive, I wasn't about to let some bubblegum-popping blonde steal the neurotic Jewish doctor of my mother's dreams.

Basically, she pulled an OKCupid for herself. It worked.

• # Data Analysis (with R) on Coursera

December 21, 2012 to Statistics by Nathan Yau

Jeff Leek, an Assistant Professor of Biostatistics at the Johns Hopkins Bloomberg School of Public Health, is teaching a course on data analysis on Coursera, appropriately named Data Analysis.

This course is an applied statistics course focusing on data analysis. The course will begin with an overview of how to organize, perform, and write-up data analyses. Then we will cover some of the most popular and widely used statistical methods like linear regression, principal components analysis, cross-validation, and p-values. Instead of focusing on mathematical details, the lectures will be designed to help you apply these techniques to real data using the R statistical programming language, interpret the results, and diagnose potential problems in your analysis.

The course starts on January 22, 2013.

You might also be interested in Computing for Data Analysis, taught by Roger Peng, who is also a biostatistics professor at Johns Hopkins. Leek's course focuses on statistical methods, whereas Peng's course focuses on programming. Better take both. [via Revolutions]

• # Statistical network of basketball

December 12, 2012 to Statistics by Nathan Yau

By now, everyone's heard of Moneyball, which applied statistics to baseball to build the best team for the buck. Naturally, there's a lot of interest these days in applying the same data-based philosophy to other sports. Jennifer Fewell and Dieter Armbruster used network analysis to model gameplay in basketball.

To analyze basketball plays, Fewell and Armbruster used a technique called network analysis, which turns teammates into nodes and exchanges — passes — into paths. From there, they created a flowchart of sorts that showed ball movement, mapping game progression pass by pass: Every time one player sent the ball to another, the flowchart lines accumulated, creating larger and larger arrows.

Using data from the 2010 playoffs, Fewell and Armbruster’s team mapped the ball movement of every play. Using the most frequent transactions — the inbound pass to shot-on-basket — they analyzed the typical paths the ball took around the court.

The challenge with basketball is that play is continuous, whereas baseball events are discrete, so you can't apply the same methods directly. But if you can model the game properly, you know where to optimize and which areas need work.
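The node-and-edge structure described above is easy to sketch: players (plus pseudo-nodes like the inbound pass and the shot) are nodes, and each pass is a directed edge whose count sets the arrow thickness. The pass sequence below is invented, not from Fewell and Armbruster's playoff data:

```python
# Toy pass network: count weighted directed edges between players.
from collections import Counter

passes = [("inbound", "PG"), ("PG", "SG"), ("SG", "PG"),
          ("PG", "C"), ("C", "PF"), ("PG", "SG"), ("SG", "shot")]

# Each repeated (passer, receiver) pair thickens that arrow.
edge_weights = Counter(passes)

for (src, dst), weight in edge_weights.most_common():
    print(f"{src} -> {dst}: {weight}")
```

From a table like this you can read off the most common transactions, which is exactly the inbound-to-shot path analysis the researchers ran at scale.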

Unless otherwise noted, graphics and words by me are licensed under Creative Commons BY-NC. Contact original authors for everything else.