• An analysis after watching a year’s worth of SportsCenter

    August 1, 2013  |  Statistics

    Winning and mentions

    Patrick Burns for Deadspin watched 23,000 minutes of SportsCenter, keeping track of what the show covered over the year, such as which teams and players were mentioned and how players were described.

    The graphic above, by my fellow Deadspinner Reuben Fischer-Baum, shows the correlation between winning percentage—or points, in the case of the possibly nonexistent NHL—and SportsCenter mentions for teams across the four major leagues.* Our focus here is on just what about a team attracted the attention of SportsCenter's all-seeing eyebeams over the course of a normal season. Our conclusion is that there was a reasonably strong correlation between winning percentage and SportsCenter mentions. It was statistically significant for all leagues except the NHL.

    Unfortunately, they did not track how many times commentator predictions were completely wrong.

  • Data.gov revamp

    July 31, 2013  |  Data Sources

    Data dot gov revamp

    After budget cuts a couple of years ago, I assumed Data.gov was all but dead, but apparently there's a new site in the works.

    The original version of Data.gov was hard to use, and you rarely found the data you wanted. I always ended up on Google and landed on the department's source instead. It looks like they improved the interface, and they're aiming for a community built around the data, where people can share projects and analyses.

    However, the data available on the site still looks slim and dated, which was a challenge with the original version. I mean the homepage says you can search 100s of APIs and over 75,000 datasets, but then click over to the Data Catalog and it says only 409 datasets found. So there's still work to be done.

    I'm glad the project is still alive though. We'll have to see where this goes.

  • Datalandia, the fictional town saved by data

    July 26, 2013  |  Statistics

    GE has a short video series on a fictional town called Datalandia where machines talk to each other and data is exchanged in a hero-like fashion. "This summer the most cliched movie plots won't be coming to a theatre near you. This summer the most cliched movie plots are about to collide with big data!"

    It's like IBM's Smarter Planet commercials combined with Team America. [Thanks, Chris]

  • Predicting riots

    July 18, 2013  |  Statistics

    Hannah Fry and her group at University College London investigated data from the 2011 London riots and found that the complex activity of rioters is reminiscent of shopping behavior and contagion. They propose a mathematical model for riots that could help prevent escalation.

    In August 2011, several areas of London experienced episodes of large-scale disorder, comprising looting, rioting and violence. Much subsequent discourse has questioned the adequacy of the police response, in terms of the resources available and strategies used. In this article, we present a mathematical model of the spatial development of the disorder, which can be used to examine the effect of varying policing arrangements. The model is capable of simulating the general emergent patterns of the events and focusses on three fundamental aspects: the apparently-contagious nature of participation; the distances travelled to riot locations; and the deterrent effect of policing.

    The video above explains in more general terms. [via Spatial.ly]

  • Dictionary of Numbers extension adds context to numbers

    July 8, 2013  |  Statistics

    We read and hear numbers in the news all the time, but it can be hard to imagine what those numbers mean. For example, big numbers, on the scale of billions, are hard to picture in our head, because we don't typically handle that many things at one time. Most of us have never seen a billion dollars plopped in front of us. The Dictionary of Numbers, a Google Chrome extension by Glen Chiacchieri, can help you out in this department.

    I noticed that my friends who were good at math generally rely on "landmark quantities", quantities they know by heart because they relate to them in human terms. They know, for example, that there are about 315 million people in the United States and that the most damaging Atlantic hurricanes cost anywhere from $20 billion to $100 billion. When they explain things to me, they use these numbers to give me a better sense of context about the subject, turning abstract numbers into something more concrete.

    When I realized they were doing this, I thought this process could be automated, that perhaps through contextual descriptions people could become more familiar with quantities and begin evaluating and reasoning about them.

    Install the extension, and as shown in the video above, it injects inline descriptions next to numbers in articles. You can also use the search box. Enter "100 meters" and you get "about the height of the Statue of Liberty." Although still rough around the edges (it seems to find descriptions for only a limited index of numbers), the Dictionary is an interesting experiment in making numbers more relatable.

  • Statistics jokes

    June 27, 2013  |  Statistics

    There's a fun CrossValidated thread on statistics jokes. Here's the one with the top votes:

    A statistician's wife had twins. He was delighted. He rang the minister who was also delighted. "Bring them to church on Sunday and we'll baptize them," said the minister. "No," replied the statistician. "Baptize one. We'll keep the other as a control."

    This line by George Burns is my favorite though:

    If you live to be one hundred, you've got it made. Very few people die past that age.

    Any other good ones?

  • Beer recommendation system in R

    June 21, 2013  |  Statistics

    Using data from Beer Advocate, in the form of 1.5 million reviews, yhat shows how to build a recommendation system in R.

    The goal for our system will be for a user to provide us with a beer that they know and love, and for us to recommend a new beer which they might like. To accomplish this, we're going to use collaborative filtering. We're going to compare 2 beers by ratings submitted by their common reviewers. Then, when one user writes similar reviews for two beers, we'll then consider those two beers to be more similar to one another.
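
    The idea of comparing two beers through their common reviewers can be sketched in a few lines. This is a minimal Python illustration rather than yhat's R code: the `ratings` table is made-up toy data, and using Pearson correlation as the similarity measure is my assumption, so the original article may score similarity differently.

```python
from math import sqrt

# Hypothetical toy data: reviewer -> {beer: overall score}. Not Beer Advocate data.
ratings = {
    "ann": {"Pale Ale": 4.5, "IPA": 4.0, "Stout": 2.0},
    "bob": {"Pale Ale": 4.0, "IPA": 4.5, "Stout": 2.5},
    "cat": {"Pale Ale": 2.0, "IPA": 2.5, "Stout": 5.0},
    "dan": {"Pale Ale": 3.0, "Stout": 3.5},
}


def common_reviewers(beer_a, beer_b):
    """Reviewers who scored both beers."""
    return [r for r, scores in ratings.items() if beer_a in scores and beer_b in scores]


def similarity(beer_a, beer_b):
    """Pearson correlation of the two beers' scores over their common reviewers."""
    reviewers = common_reviewers(beer_a, beer_b)
    if len(reviewers) < 2:
        return 0.0  # not enough overlap to compare
    a = [ratings[r][beer_a] for r in reviewers]
    b = [ratings[r][beer_b] for r in reviewers]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant scores carry no information about agreement
    return cov / sqrt(var_a * var_b)


def recommend(beer, top_n=3):
    """Other beers, ranked from most to least similar to the given one."""
    beers = {b for scores in ratings.values() for b in scores}
    others = sorted(beers - {beer})
    return sorted(others, key=lambda b: similarity(beer, b), reverse=True)[:top_n]


print(recommend("Pale Ale"))
```

    With real review data you would also want a minimum number of common reviewers before trusting a similarity score, since a correlation over two or three people is mostly noise.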

    The simple recommender is at the end of the article. Select a beer you like, a type of beer you want to try, and you get a handful of beers you might like.

    Obviously, the method isn't exclusive to beer reviews, and this is just a start to a more advanced system that you can tailor to your own data. The good news is that the code to scrape data and recommend things is there for your disposal. [via @drewconway]

  • Twitter trend detection algorithm

    June 19, 2013  |  Statistics

    Detecting twitter trends

    Stuff happens, and people tweet about it. Something major happens, and a lot of people tweet about it. Master's student Stanislav Nikolov and his adviser Devavrat Shah are working on ways to algorithmically detect the latter.

    People acting in social networks are reasonably predictable. If many of your friends talk about something, it's likely that you will as well. If many of your friends are friends with person X, it is likely that you are friends with them too. Because the underlying system has, in this sense, low complexity, we should expect that the measurements from that system are also of low complexity. As a result, there should only be a few types of patterns that precede a topic becoming trending. One type of pattern could be "gradual rise"; another could be "small jump, then a big jump"; yet another could be "a jump, then a gradual rise", and so on. But you'll never get a sawtooth pattern, a pattern with downward jumps, or any other crazy pattern.

    And with that, the algorithm compares current patterns to the ones above. If they look like a trending pattern, the algorithm marks something as a trend with some probability. In testing with past trending topics, the algorithm was able to pick correctly over 90 percent of the time.
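
    As a rough sketch of what "compare to known patterns and mark a trend with some probability" could look like, the Python below lets labeled example series vote on a new series, with closer examples counting more. Everything specific here is an assumption for illustration: the squared-distance measure, the `gamma` weighting, and the toy example series are not taken from Nikolov and Shah's implementation.

```python
from math import exp

# Hypothetical example patterns (normalized activity over five time steps).
TRENDING_EXAMPLES = [
    [0.1, 0.2, 0.4, 0.7, 1.0],  # gradual rise
    [0.1, 0.3, 0.3, 0.9, 1.0],  # small jump, then a big jump
]
QUIET_EXAMPLES = [
    [0.2, 0.2, 0.2, 0.2, 0.2],  # flat
    [0.3, 0.1, 0.3, 0.1, 0.3],  # low-level noise
]


def distance(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def trend_probability(recent, trending=TRENDING_EXAMPLES, quiet=QUIET_EXAMPLES, gamma=5.0):
    """Share of distance-weighted votes that come from trending examples."""
    trend_votes = sum(exp(-gamma * distance(recent, ex)) for ex in trending)
    quiet_votes = sum(exp(-gamma * distance(recent, ex)) for ex in quiet)
    total = trend_votes + quiet_votes
    if total == 0:
        return 0.5  # nothing is close enough to vote; treat as undecided
    return trend_votes / total


rising = [0.1, 0.25, 0.45, 0.7, 0.95]
print(trend_probability(rising) > 0.5)
```

    A larger `gamma` makes the score depend almost entirely on the nearest example, while a smaller one averages over all of them.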

    The best part is that this method can be applied to other time series data. "We can try this on traffic data to predict the duration of a bus ride, on movie ticket sales, on stock prices, or any other time-varying measurements."

  • Non-statistician analysts are the new norm

    June 17, 2013  |  Statistics

    As data grows cheaper and more easily accessible, the people who analyze it aren't always statisticians. Many likely haven't had any statistical training at all. Biostatistics professor Jeff Leek says we need to adapt to this broader audience.

    What does this mean for statistics as a discipline? Well it is great news in that we have a lot more people to train. It also really drives home the importance of statistical literacy. But it also means we need to adapt our thinking about what it means to teach and perform statistics. We need to focus increasingly on interpretation and critique and away from formulas and memorization (think English composition versus grammar). We also need to realize that the most impactful statistical methods will not be used by statisticians, which means we need more fool proofing, more time automating, and more time creating software. The potential payout is huge for realizing that the tide has turned and most people who analyze data aren't statisticians.

    Yep.

    Those who disagree tend to worry about what might happen, and what kind of data-based decisions will be made, when non-statisticians do the analysis. That should definitely be a priority as we move forward. Non-statisticians often make incorrect assumptions about the data, forget about uncertainty, and don't know much about collection methodologies.

    However, as a statistician (or someone who knows statistics), you can shoo everyone else away from the data and gripe when they come back, or you can help them get things right.

  • The differences between a geek and a nerd

    June 14, 2013  |  Statistics

    Geek vs nerd

    Curious about how people use "geek" and "nerd" to describe themselves and if there was any difference between the two terms, Burr Settles analyzed words used in tweets that contained the two. Settles used pointwise mutual information (PMI), which essentially provided a measure of the geekness or nerdiness of a term. The plot above shows the results.
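
    For reference, the pointwise mutual information between a word $w$ and a label $c$ (here "geek" or "nerd") is $\log \frac{p(w, c)}{p(w)\,p(c)}$: positive when the two co-occur more than chance would predict, negative when less. A minimal Python sketch follows, with made-up tweet counts and a base-2 logarithm as my assumptions; Settles' own counts and any smoothing he applied are not reproduced here.

```python
from math import log2


def pmi(joint_count, word_count, label_count, total):
    """Pointwise mutual information (in bits) between a word and a label.

    joint_count: tweets containing both the word and the label
    word_count:  tweets containing the word
    label_count: tweets containing the label
    total:       all tweets in the sample
    """
    if joint_count == 0:
        return float("-inf")  # the pair never co-occurs
    p_joint = joint_count / total
    p_word = word_count / total
    p_label = label_count / total
    return log2(p_joint / (p_word * p_label))


# Hypothetical counts: 1,000 tweets, half mentioning "geek" and half "nerd".
# A word that shows up in 50 tweets, 40 of them geek tweets, leans geeky.
geekiness = pmi(joint_count=40, word_count=50, label_count=500, total=1000)
nerdiness = pmi(joint_count=10, word_count=50, label_count=500, total=1000)
print(geekiness > 0 > nerdiness)
```

    Plotting each word's geek score against its nerd score gives the kind of two-axis picture described in the post.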

    In broad strokes, it seems to me that geeky words are more about stuff (e.g., “#stuff”), while nerdy words are more about ideas (e.g., “hypothesis”). Geeks are fans, and fans collect stuff; nerds are practitioners, and practitioners play with ideas. Of course, geeks can collect ideas and nerds play with stuff, too. Plus, they aren’t two distinct personalities as much as different aspects of personality. Generally, the data seem to affirm my thinking.

    Or maybe pop culture (geek) versus education (nerd).

  • Hans Rosling explains population growth and climate change

    June 7, 2013  |  Statistics

    Because every day is a good day to listen to Hans Rosling talk numbers. In this short video, Rosling uses Lego bricks to explain population growth and the gaps in wealth and carbon footprint.

  • Myths of big data

    June 4, 2013  |  Statistics

    Microsoft researcher Kate Crawford describes several myths of big data. Myth #4: It makes cities smarter.

    "It's only as good as the people using it," Ms. Crawford said. Many of the sensors that track people as they manage their urban lives come from high-end smartphones, or cars with the latest GPS systems. "Devices are becoming the proxies for public needs," she said, "but there won't be a moment where everyone has access to the same technology." In addition, moving cities toward digital initiatives like predictive policing, or creating systems where people are seen, whether they like it or not, can promote lots of tension between individuals and their governments.

    Yep. I hear those people things can introduce a lot of challenges.

  • Medicare provider charge data released

    May 28, 2013  |  Data Sources

    NYT hospital browser

    The Centers for Medicare and Medicaid Services released billing data for more than 3,000 U.S. hospitals, showing high variance in the cost of health care across the country and even between nearby hospitals.

    As part of the Obama administration’s work to make our health care system more affordable and accountable, data are being released that show significant variation across the country and within communities in what hospitals charge for common inpatient services.

    The data provided here include hospital-specific charges for the more than 3,000 U.S. hospitals that receive Medicare Inpatient Prospective Payment System (IPPS) payments for the top 100 most frequently billed discharges, paid under Medicare based on a rate per discharge using the Medicare Severity Diagnosis Related Group (MS-DRG) for Fiscal Year (FY) 2011. These DRGs represent almost 7 million discharges or 60 percent of total Medicare IPPS discharges.

    The data is downloadable as CSV or Excel files and is surprisingly usable and worth a look.

    The New York Times has a useful per-hospital browser and The Washington Post provides quick comparisons by state.

  • Convergence of Miss Korea faces

    May 20, 2013  |  Statistics

    After seeing a Reddit post on the convergence of Miss Korea faces, supposedly due to high rates of plastic surgery, graduate student Jia-Bin Huang analyzed the faces of 20 contestants. Below is a short video of each face slowly transitioning to the next.

    From the video and pictures it's pretty clear that the photos look similar, but Huang took it a step further with a handful of computer vision techniques to quantify the likeness between faces. And again, the analysis shows similarity between the photos, so the gut reaction is that the contestants are nearly identical.

    However, you have to assume that the pictures are accurate representations of the contestants, which doesn't seem to pan out at all. It's amazing what some makeup, hair, and photoshop can do.

    You gotta consider your data source before you make assumptions about what that data represents.

  • Length of the average dissertation

    May 8, 2013  |  Statistics

    Average dissertation

    On R is My Friend, as a way to procrastinate on his own dissertation, beckmw took a look at dissertation length via the digital archives at the University of Minnesota.

    I've selected the top fifty majors with the highest number of dissertations and created boxplots to show relative distributions. Not many differences are observed among the majors, although some exceptions are apparent. Economics, mathematics, and biostatistics had the lowest median page lengths, whereas anthropology, history, and political science had the highest median page lengths. This distinction makes sense given the nature of the disciplines.

    I was on the long end of the statistics distribution, around 180 pages. Probably because I had a lot of pictures.

    As I was working on my dissertation, people often asked me how many pages I had written and how many pages I had left to write. I never had a good answer, because there's no page limit or required page count. It's just whenever you (and your adviser) feel like there's enough to get a point across. Sometimes that takes 50 pages. Other times it takes 200.

    So for those who get that dreaded page-count question, you can wave your finger at this chart and tell people you're somewhere in the distribution.

  • The Numbers Game on National Geographic

    April 29, 2013  |  Statistics

    Jake Porway, the founder of DataKind, has a new show on the National Geographic channel called The Numbers Game. I unfortunately don't have the channel, so the clips on the site will have to suffice for now.

    Keep in mind this show is for a wide audience though. Jake notes:

    Now for those of you who have been writing to me excited that Big Data is finally getting its own TV show, I should point out that this show is a lot more like a science show than a show about data. You won’t find discussions about Hadoop, machine learning, or even the basics of correlation vs. causation here. Instead, the show tries to make the latest statistics accessible to a wide audience of people who may just be dipping their toes in to this new world of data. It’s more Guy Fieri than Carl Sagan, but it’s a blast.

    The first of three episodes aired last week, and the second is on tonight. You should watch it.

  • Flexible data

    April 17, 2013  |  Statistics

    Data is an abstraction of something that happened in the real world. How people move. How they spend money. How a computer works. The tendency is to approach data, and by extension visualization, as rigid facts stripped of joy, humor, conflict, and sadness, because that makes analysis easier. Visualization is easier when you can strip the data down to unwavering fact and then reduce the process to a set of unwavering rules.

    The world is complex though. There are exceptions, limitations, and interactions that aren't expressed explicitly through data. So we make inferences with uncertainty attached. We make an educated guess and then compare to the actual thing or stuff that was measured to see if the data and our findings make sense.

    Data isn't rigid, so neither is visualization.

    Are there rules? There are, just like there are in statistics. And you should learn them.

    However, in statistics, you eventually learn that there's more to analysis than hypothesis tests and normal distributions, and in visualization you eventually learn that there's more to the process than efficient graphical perception and avoidance of all things round. Design matters, no doubt, but your understanding of the data matters much more.

  • Problematic databases used to track employee theft

    April 3, 2013  |  Data Sharing

    Employee theft accounts for billions of dollars of lost merchandise per year, so it's a huge concern for retailers, but it often goes unreported as a crime. If only there were reference databases where business owners could report offenders and look up potential employees to see if they have ever stolen anything. It turns out there are, but the systems have proved to be problematic.

    "We're not talking about a criminal record, which either is there or is not there — it's an admission statement which is being provided by an employer," said Irv Ackelsberg, a lawyer at Langer, Grogan & Diver who represents Ms. Goode.

    Such statements may contain no outright admission of guilt, like one submitted after Kyra Moore, then a CVS employee, was accused of stealing: "picked up socks left them at the checkout and never came back to buy them," it read. When Ms. Moore later applied for a job at Rite Aid, she was deemed "noncompetitive." She is suing Esteem.

    On paper, the data sounds great for business owners, and keeping such data also seems like a fine business to run. Thefts go down and owners can focus on other aspects of their business. The challenge and complexity come when we remember that people are involved.

  • How to become a password cracker in a day

    March 26, 2013  |  Statistics

    Deputy editor at Ars Technica Nate Anderson was curious if he could learn to crack passwords in a day. Although there's definitely a difference between advanced and beginner crackers, openly available software and resources make it easy to get started and do some damage.

    After my day-long experiment, I remain unsettled. Password cracking is simply too easy, the tools too sophisticated, the CPUs and GPUs too powerful for me to believe that my own basic attempts at beefing up my passwords are a long-term solution. I've resisted password managers in the past over concerns about storing data in the cloud or about the hassle of syncing with other computers or about accessing passwords from a mobile device or because dropping $50 bucks never felt quite worth it—hacks only happen to other people, right?

    But until other forms of authentication take root, the humble password will form a primary defense of our personal information. The time has come for me to find a better solution to generating, storing, and handling them.

    I use 1Password.

  • Odds of a perfect NCAA March Madness bracket

    March 22, 2013  |  Statistics

    Math professor Jeff Bergen explains the odds of picking a perfect bracket.

    The first probability is based on a 50/50 split of correct picks, which is like using fair coin flips to pick winners. Bergen doesn't really go into how he calculated the second probability, but that smaller number comes up by bumping up the probability of picking the right team for each game. I think he's using an average probability of slightly less than 70% (based on simulation results from this old Wall Street Journal column).
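
    The arithmetic is easy to reproduce. The sketch below assumes a 64-team bracket with 63 games and treats every pick as independent with the same chance of being right; both are simplifications of mine, and the 70% figure is only the rough guess from the paragraph above, not Bergen's stated method.

```python
GAMES = 63  # assumption: 64-team single-elimination bracket, one pick per game


def perfect_bracket_odds(p_correct, games=GAMES):
    """Return N such that a perfect bracket has a 1-in-N chance.

    Assumes each pick is independent and correct with probability p_correct.
    """
    return 1 / (p_correct ** games)


coin_flip = perfect_bracket_odds(0.5)  # 2**63, about 9.2 quintillion
informed = perfect_bracket_odds(0.7)   # several billion

print(f"coin flips:    1 in {coin_flip:,.0f}")
print(f"70% per pick:  1 in {informed:,.0f}")
```

    Small changes in the per-game probability move the result a lot, because the value is raised to the 63rd power. That sensitivity is why the two quoted probabilities are so far apart.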

    That's why businesses can offer up million dollar prizes. In all likelihood, no one is going to win, which turns out to be a great business model for insurance companies who back these contests:

    If millions of people enter a particular contest, it might seem like the chance of someone winning is suddenly in the realm of possibility. But there's a catch: This scenario assumes everyone maximized their chances by picking mostly favorites, so those with the best shot at winning are likely to have identical entries. These contests generally protect themselves from big losses by stating they'll divvy up the loot if there are multiple perfect brackets.

    These favorable conditions make insuring these prize offers a good business, as the Dallas company SCA Promotions has discovered. SCA, founded by 11-time world bridge champion Robert D. Hamman, has taken on the insurance risk for roughly 50 perfect-bracket prizes -- including a Sporting News offer of $1 million in 2001, according to vice president Chris Hamman, the founder's son. In the 12 years it has been doing so, SCA has never had to pay out a claim.

Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.