  • UK Census at risk

    September 5, 2013  |  Statistics

    The UK Census might be scrapped next year in favor of cheaper options.

    The census faces its biggest shake-up in its 200-year history under Office for National Statistics proposals.

    An online survey could replace the study - carried out every 10 years - or information could instead be collated using data already held by government.

    The plans will be fleshed out and put out to consultation this month before Parliament makes a decision in 2014.

    This is both surprising and expected. On the one hand, a decennial Census provides a granular view of a nation that is hard to match. However, on the other, it can be expensive to count everyone, and there are more data sources now than there were 200 years ago that can be drawn upon. The catch is that a lot of data sources use the Census as a baseline or a seed to make estimates.

    The main issue seems to be cost, which is estimated at £482 million over a ten-year period, or about $1.10 per person per year after converting to US dollars. In contrast, the 2010 United States Census cost $13 billion, which comes out to about $4.20 per person per year. So it'll be interesting to see what the ONS decides, as I'm sure it's going to get US officials thinking about prices, too, especially since the US Census cost almost four times as much per capita.
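
    To sanity-check the per-person figures, here's a minimal Python sketch of the arithmetic. The function name is made up for illustration, and the population of 308,745,538 is the 2010 US Census resident count, which isn't stated above, so treat it as my assumption. The ratio simply uses the two per-person numbers quoted above.

```python
def cost_per_person_per_year(total_cost, population, years=10):
    """Spread a census budget evenly over people and years."""
    return total_cost / population / years


# Assumption: 2010 US Census resident count (not given in the post).
US_POPULATION_2010 = 308_745_538

us_cost = cost_per_person_per_year(13e9, US_POPULATION_2010)

# Ratio of the two per-person figures quoted in the post.
ratio = 4.20 / 1.10

print(f"US: ${us_cost:.2f} per person per year")
print(f"US-to-UK ratio: {ratio:.1f}x")
```

    With those inputs, the US works out to roughly $4.21 per person per year and the ratio to about 3.8, which is where "almost four times" comes from.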

    In any case, hopefully the UK Census sticks around. A complete cut of the program can't possibly be beneficial at this point.

  • Data in the service of humanity

    September 2, 2013  |  Statistics

    For this rainy Labor Day, here's an uplifting talk by DataKind founder Jake Porway. He talks data and how it can make a worthwhile difference in areas that could use a change.

  • The inventor of modern probability

    August 27, 2013  |  Statistics

    Andrei Kolmogorov is a name unfamiliar to most, but his work had lasting impact. Slava Gerovitch profiled the mathematician, describing the shift in thinking about probability theory, which was once more of a joke than a serious approach to evaluating the world. I especially liked the bit about Kolmogorov's appreciation for the arts.

    Music and literature were deeply important to Kolmogorov, who believed he could analyze them probabilistically to gain insight into the inner workings of the human mind. He was a cultural elitist who believed in a hierarchy of artistic values. At the pinnacle were the writings of Goethe, Pushkin, and Thomas Mann, alongside the compositions of Bach, Vivaldi, Mozart, and Beethoven—works whose enduring value resembled eternal mathematical truths. Kolmogorov stressed that every true work of art was a unique creation, something unlikely by definition, something outside the realm of simple statistical regularity. "Is it possible to include [Tolstoy's War and Peace] in a reasonable way into the set of 'all possible novels' and further to postulate the existence of a certain probability distribution in this set?" he asked, sarcastically, in a 1965 article.

  • A master’s degree in statistics is worthwhile

    August 26, 2013  |  Statistics

    Statistician (and brand new PhD student) Jerzy Wieczorek explains the usefulness of a master's degree in statistics.

    There's a huge difference between undergraduate Stats 101 (apply a few standard procedures to nice clean datasets) and real data analysis work (figure out how to clean the data and modify your procedures to the messy context in front of you). So a masters-level mathematical/theoretical stats course, where you learn to prove which estimators have desirable properties or to derive tests that are appropriate in a given situation, is invaluable when you run into non-standard problems. The masters degree will also expose you to many techniques that you probably didn't cover as an undergrad: designing good experiments, computer-intensive methods like the bootstrap, special-use techniques like time series or spatial statistics, other inference philosophies like Bayesian statistics, etc.

    Yep.

    Of course, Jerzy and I are slightly biased. Saying a master's degree in statistics isn't worthwhile is like saying we wasted our time, but if you really want to learn data (whether it's for analysis, visualization, journalism, or whatever), statistics helps you get there.

    And whereas the PhD route takes a certain type of person, most master's degrees take only two years to finish, and your analysis skills improve dramatically compared to those of an undergrad. Graduate statistics is also way more interesting, because you focus more on practical usage and less on hypothesis tests.

  • Pickle Index for population estimation

    August 21, 2013  |  Statistics

    Pickle index

    As China moves forward with a plan to move 250 million people to cities, officials need to keep track of how many people are still in rural areas. The problem is that local data from the provinces is unreliable, so instead they use what they call a "pickle index," which banks on the correlation between the amount of pickles eaten by rural residents and population.

    According to the South China Morning Post, the country’s National Development and Reform Commission has found that sales of zha cai, a pickled mustard tuber, provide a better guide to population flows than often unreliable provincial statistical data. As an unnamed planner explained to the Economic Observer:

    "Under normal circumstances, urban consumption levels of convenience foods such as instant noodles and pickled mustard is essentially constant. Therefore, we can assume that volume changes are mainly caused by a city’s floating population."
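
    The planner's reasoning boils down to one line of arithmetic, so here's a rough Python sketch of it. The function name, the sales figures, and the per-person consumption rate are all hypothetical numbers picked to show the idea, not anything reported by the commission.

```python
def floating_population_change(sales_before_kg, sales_after_kg, kg_per_person):
    """Attribute a change in sales entirely to people arriving or leaving."""
    return (sales_after_kg - sales_before_kg) / kg_per_person


# Hypothetical figures for one city, chosen only to show the arithmetic.
change = floating_population_change(
    sales_before_kg=100_000,
    sales_after_kg=112_000,
    kg_per_person=2.0,
)

print(f"Estimated change in floating population: {change:+,.0f} people")
```

    The whole thing hinges on the assumption in the quote: if resident consumption really is constant, extra sales mean extra people.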

    Maybe we can work out a hot dog index for the US. [via Waxy]

  • Why everyone is more popular than you

    August 15, 2013  |  Statistics

    Mathematician Hannah Fry is back with another video. She explains why it seems like everyone in your network — on Twitter, Facebook, and in real life — is more popular than you and how we can use this idea to predict the spread of diseases. Fry's understated presentation style totally enhances the interesting subject matter.
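
    If you want to see the effect for yourself, here's a small Python sketch. The friendship network is made up, and the code is only my illustration of the idea, not anything from the video. It compares how many friends people have on average with how many friends their friends have.

```python
from statistics import mean

# Hypothetical friendship network: each pair is a mutual friendship.
friendships = [
    ("ann", "bob"), ("ann", "cat"), ("ann", "dan"), ("ann", "eve"),
    ("bob", "cat"), ("dan", "eve"), ("eve", "fay"),
]

# Build an adjacency list: person -> set of friends.
friends = {}
for a, b in friendships:
    friends.setdefault(a, set()).add(b)
    friends.setdefault(b, set()).add(a)

# Average number of friends per person.
avg_friends = mean(len(fs) for fs in friends.values())

# For each person, the average friend count among their own friends.
avg_friends_of_friends = mean(
    mean(len(friends[f]) for f in fs) for fs in friends.values()
)

# People whose friends have more friends, on average, than they do.
outshone = [
    person
    for person, fs in friends.items()
    if mean(len(friends[f]) for f in fs) > len(fs)
]

print(f"Average friends: {avg_friends:.2f}")
print(f"Average friends of friends: {avg_friends_of_friends:.2f}")
print(f"Outshone by their friends: {len(outshone)} of {len(friends)}")
```

    In this toy network, most people have fewer friends than their friends do, because the one well-connected person shows up in almost everyone's friend list.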

  • Google search suggestions by country

    August 8, 2013  |  Statistics

    Search suggestions by country

    Google search suggestions have transformed into a never-ending source of entertainment and a candid peek into what people look for in the world. We've seen insecurities change with age and stereotypes of states in the US. Noah Veltman banked on the locality of suggestions for a country-specific view of the world. He shows suggestions for the same query for the United States, Canada, the United Kingdom, Australia, and New Zealand.

    For example, a search for "why is America" in each country depicts stereotypes and national curiosities about why America is so fat, rich, and better than Canada. Scroll down and you see suggestions for "how to", "why is there", and "why does everyone", which interestingly show many of the same wonderings.

    Now if you'll excuse me, I have to go eat bacon and swim in my pool of gold coins while I browse through my vastly superior Netflix selection.

  • Behind the Netflix recommendation system

    August 8, 2013  |  Statistics

    Wired has a fun Netflix interview on the behind-the-scenes work on the recommendation engine.

    If you liked 1960s Star Trek, the first non-Trek title that Netflix is likely to suggest to you is the original Mission: Impossible series (the one with the cool Lalo Schifrin soundtrack). Streaming the latest Doctor Who is likely to net you the supernatural TV drama Being Human (the UK version). Watch From Dusk Till Dawn and 300 and say hello to a new row on your homepage: Visually Striking Violent Action & Adventure. Trying to understand the invisible array of algorithms that power your Netflix suggestions has long been a favorite sport, but what’s actually going on in that galaxy of big data, those billions and billions of ratings stars? Turns out there are 800 Netflix engineers working behind the scenes at their Silicon Valley HQ. The company estimates that 75 percent of viewer activity is driven by recommendation.

    Some days you just want to slouch back on the couch after a long day's work and watch Hot Tub Time Machine.

  • An analysis after watching a year’s worth of SportsCenter

    August 1, 2013  |  Statistics

    Winning and mentions

    Patrick Burns for Deadspin watched 23,000 minutes of SportsCenter, keeping track of what the show covered over the year, such as which teams and players were mentioned and how players were described.

    The graphic above, by my fellow Deadspinner Reuben Fischer-Baum, shows the correlation between winning percentage—or points, in the case of the possibly nonexistent NHL—and SportsCenter mentions for teams across the four major leagues.* Our focus here is on just what about a team attracted the attention of SportsCenter's all-seeing eyebeams over the course of a normal season. Our conclusion is that there was a reasonably strong correlation between winning percentage and SportsCenter mentions. It was statistically significant for all leagues except the NHL.

    Unfortunately, they did not track how many times commentator predictions were completely wrong.

  • Data.gov revamp

    July 31, 2013  |  Data Sources

    Data dot gov revamp

    After budget cuts a couple of years ago, I assumed Data.gov was all but dead, but apparently there's a new site in the works.

    The original version of Data.gov was hard to use, and you rarely found the data you wanted. I always ended up on Google and landed on the department's source instead. It looks like they improved the interface, and they're aiming for a community built around the data where people can share projects and analyses.

    However, the data available on the site still looks slim and dated, which was a challenge with the original version. I mean the homepage says you can search 100s of APIs and over 75,000 datasets, but then click over to the Data Catalog and it says only 409 datasets found. So there's still work to be done.

    I'm glad the project is still alive though. We'll have to see where this goes.

  • Datalandia, the fictional town saved by data

    July 26, 2013  |  Statistics

    GE has a short video series on a fictional town called Datalandia where machines talk to each other and data is exchanged in a hero-like fashion. "This summer the most cliched movie plots won't be coming to a theatre near you. This summer the most cliched movie plots are about to collide with big data!"

    It's like IBM's Smarter Planet commercials combined with Team America. [Thanks, Chris]

  • Predicting riots

    July 18, 2013  |  Statistics

    Hannah Fry and her group at University College London investigated data from the 2011 London riots and found that the complex activity of rioters is reminiscent of shopping behavior and contagion. They propose a mathematical model for riots that could help prevent escalation.

    In August 2011, several areas of London experienced episodes of large-scale disorder, comprising looting, rioting and violence. Much subsequent discourse has questioned the adequacy of the police response, in terms of the resources available and strategies used. In this article, we present a mathematical model of the spatial development of the disorder, which can be used to examine the effect of varying policing arrangements. The model is capable of simulating the general emergent patterns of the events and focusses on three fundamental aspects: the apparently-contagious nature of participation; the distances travelled to riot locations; and the deterrent effect of policing.

    The video above explains in more general terms. [via Spatial.ly]

  • Dictionary of Numbers extension adds context to numbers

    July 8, 2013  |  Statistics

    We read and hear numbers in the news all the time, but it can be hard to imagine what those numbers mean. For example, big numbers, on the scale of billions, are hard to picture in our heads, because we don't typically handle that many things at one time. Most of us have never seen a billion dollars plopped in front of us. The Dictionary of Numbers, a Google Chrome extension by Glen Chiacchieri, can help you out in this department.

    I noticed that my friends who were good at math generally rely on "landmark quantities", quantities they know by heart because they relate to them in human terms. They know, for example, that there are about 315 million people in the United States and that the most damaging Atlantic hurricanes cost anywhere from $20 billion to $100 billion. When they explain things to me, they use these numbers to give me a better sense of context about the subject, turning abstract numbers into something more concrete.

    When I realized they were doing this, I thought this process could be automated, that perhaps through contextual descriptions people could become more familiar with quantities and begin evaluating and reasoning about them.

    Install the extension, and as shown in the video above, it injects inline descriptions next to numbers in articles. You can also use the search box. Enter "100 meters" and you get "about the height of the Statue of Liberty." Although still rough around the edges (it seems to find descriptions for only a limited index of numbers), the Dictionary is an interesting experiment in making numbers more relatable.

  • Statistics jokes

    June 27, 2013  |  Statistics

    There's a fun CrossValidated thread on statistics jokes. Here's the one with the top votes:

    A statistician's wife had twins. He was delighted. He rang the minister who was also delighted. "Bring them to church on Sunday and we'll baptize them," said the minister. "No," replied the statistician. "Baptize one. We'll keep the other as a control."

    This line by George Burns is my favorite though:

    If you live to be one hundred, you've got it made. Very few people die past that age.

    Any other good ones?

  • Beer recommendation system in R

    June 21, 2013  |  Statistics

    Using data from Beer Advocate, in the form of 1.5 million reviews, yhat shows how to build a recommendation system in R.

    The goal for our system will be for a user to provide us with a beer that they know and love, and for us to recommend a new beer which they might like. To accomplish this, we're going to use collaborative filtering. We're going to compare 2 beers by ratings submitted by their common reviewers. Then, when one user writes similar reviews for two beers, we'll then consider those two beers to be more similar to one another.
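
    To make the idea concrete, here's a minimal Python sketch of comparing items through their common reviewers. This is not yhat's R code. The ratings are made up, the function names are mine, and the similarity score (one over one plus the mean absolute rating difference) is just a simple stand-in I picked for illustration.

```python
# Hypothetical ratings on a 1-5 scale: reviewer -> {beer: rating}.
ratings = {
    "r1": {"Pale Ale": 5, "IPA": 4, "Stout": 1},
    "r2": {"Pale Ale": 4, "IPA": 5, "Stout": 2},
    "r3": {"Pale Ale": 2, "IPA": 2, "Stout": 5, "Porter": 5},
    "r4": {"Stout": 4, "Porter": 4, "IPA": 1},
}


def similarity(beer_a, beer_b):
    """Score two beers by how closely their common reviewers rated them."""
    diffs = [
        abs(r[beer_a] - r[beer_b])
        for r in ratings.values()
        if beer_a in r and beer_b in r
    ]
    if not diffs:
        return 0.0  # no common reviewers, so no evidence of similarity
    return 1.0 / (1.0 + sum(diffs) / len(diffs))


def recommend(liked_beer, top_n=2):
    """Rank every other beer by similarity to the one the user likes."""
    beers = {b for r in ratings.values() for b in r}
    others = sorted(beers - {liked_beer})
    ranked = sorted(others, key=lambda b: similarity(liked_beer, b), reverse=True)
    return ranked[:top_n]


print(recommend("Pale Ale"))
print(recommend("Stout"))
```

    With these toy ratings, liking a Pale Ale points you to the IPA first, and liking a Stout points you to the Porter, since the same reviewers rated each pair alike.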

    The simple recommender is at the end of the article. Select a beer you like, a type of beer you want to try, and you get a handful of beers you might like.

    Obviously, the method isn't exclusive to beer reviews, and this is just a start to a more advanced system that you can tailor to your own data. The good news is that the code to scrape data and recommend things is there at your disposal. [via @drewconway]

  • Twitter trend detection algorithm

    June 19, 2013  |  Statistics

    Detecting twitter trends

    Stuff happens, and people tweet about it. Something major happens, and a lot of people tweet about it. Master's student Stanislav Nikolov and his adviser Devavrat Shah are working on ways to algorithmically detect the latter.

    People acting in social networks are reasonably predictable. If many of your friends talk about something, it's likely that you will as well. If many of your friends are friends with person X, it is likely that you are friends with them too. Because the underlying system has, in this sense, low complexity, we should expect that the measurements from that system are also of low complexity. As a result, there should only be a few types of patterns that precede a topic becoming trending. One type of pattern could be "gradual rise"; another could be "small jump, then a big jump"; yet another could be "a jump, then a gradual rise", and so on. But you'll never get a sawtooth pattern, a pattern with downward jumps, or any other crazy pattern.

    And with that, the algorithm compares current patterns to the ones above. If they look like a trending pattern, the algorithm marks something as a trend with some probability. In testing with past trending topics, the algorithm was able to pick correctly over 90 percent of the time.
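
    Here's a rough Python sketch of that compare-to-known-patterns idea. It's my own simplified take, not Nikolov and Shah's implementation. The reference patterns are made-up numbers rather than Twitter data, and the distance measure, the exponential weighting, and the gamma parameter are all assumptions on my part.

```python
import math


def distance(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def trend_probability(observed, trending_examples, quiet_examples, gamma=1.0):
    """Weighted vote: nearby reference patterns count more than distant ones."""
    pos = sum(math.exp(-gamma * distance(observed, ref)) for ref in trending_examples)
    neg = sum(math.exp(-gamma * distance(observed, ref)) for ref in quiet_examples)
    return pos / (pos + neg)


# Hypothetical, normalized tweet-volume patterns (not real Twitter data).
trending = [
    [0.1, 0.2, 0.4, 0.7, 1.0],  # gradual rise
    [0.1, 0.3, 0.3, 0.9, 1.0],  # small jump, then a big jump
]
quiet = [
    [0.2, 0.2, 0.2, 0.2, 0.2],  # flat
    [0.3, 0.2, 0.3, 0.2, 0.3],  # noise around a constant level
]

rising = trend_probability([0.1, 0.2, 0.5, 0.8, 1.0], trending, quiet)
flat = trend_probability([0.2, 0.2, 0.2, 0.2, 0.2], trending, quiet)

print(f"Rising series: {rising:.2f}")
print(f"Flat series: {flat:.2f}")
```

    The rising series lands closer to the trending examples and scores above 0.5, while the flat one scores below it. A real system would need far more reference patterns and a threshold tuned on past data.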

    The best part is that this method can be applied to other time series data. "We can try this on traffic data to predict the duration of a bus ride, on movie ticket sales, on stock prices, or any other time-varying measurements."

  • Non-statistician analysts are the new norm

    June 17, 2013  |  Statistics

    As data grows cheaper and more easily accessible, the people who analyze it aren't always statisticians. Many of them likely haven't had any statistical training at all. Biostatistics professor Jeff Leek says we need to adapt to this broader audience.

    What does this mean for statistics as a discipline? Well it is great news in that we have a lot more people to train. It also really drives home the importance of statistical literacy. But it also means we need to adapt our thinking about what it means to teach and perform statistics. We need to focus increasingly on interpretation and critique and away from formulas and memorization (think English composition versus grammar). We also need to realize that the most impactful statistical methods will not be used by statisticians, which means we need more fool proofing, more time automating, and more time creating software. The potential payout is huge for realizing that the tide has turned and most people who analyze data aren't statisticians.

    Yep.

    Those who disagree tend to worry about what kind of data-based decisions non-statisticians will make, and that should definitely be a priority as we move forward. Non-statisticians often make incorrect assumptions about the data, forget about uncertainty, and don't know much about collection methodologies.

    However, as a statistician (or someone who knows statistics), you can shoo everyone else away from the data and gripe when they come back, or you can help them get things right.

  • The differences between a geek and a nerd

    June 14, 2013  |  Statistics

    Geek vs nerd

    Curious about how people use "geek" and "nerd" to describe themselves and if there was any difference between the two terms, Burr Settles analyzed words used in tweets that contained the two. Settles used pointwise mutual information (PMI), which essentially provided a measure of the geekness or nerdiness of a term. The plot above shows the results.
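
    For a sense of how PMI works, here's a small Python sketch using the standard definition, the log of the ratio between how often two things co-occur and how often they would by chance. The counts are made up for illustration and are not Settles's data, and the layout of the counts table is my own assumption.

```python
import math

# Hypothetical counts of tweets containing each word, split by whether the
# tweet also contained "geek" or "nerd" (not Settles's actual data).
counts = {
    "geek": {"#stuff": 30, "hypothesis": 5, "gadget": 25},
    "nerd": {"#stuff": 10, "hypothesis": 20, "gadget": 10},
}

total = sum(sum(words.values()) for words in counts.values())


def pmi(word, label):
    """log2 of how much more often word and label co-occur than chance predicts."""
    p_joint = counts[label][word] / total
    p_word = sum(words[word] for words in counts.values()) / total
    p_label = sum(counts[label].values()) / total
    return math.log2(p_joint / (p_word * p_label))


for word in ("#stuff", "hypothesis"):
    print(f"{word}: geek {pmi(word, 'geek'):+.2f}, nerd {pmi(word, 'nerd'):+.2f}")
```

    A positive score means the word shows up with that label more than chance would suggest, and a negative score means less. In this toy table "#stuff" leans geek and "hypothesis" leans nerd.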

    In broad strokes, it seems to me that geeky words are more about stuff (e.g., “#stuff”), while nerdy words are more about ideas (e.g., “hypothesis”). Geeks are fans, and fans collect stuff; nerds are practitioners, and practitioners play with ideas. Of course, geeks can collect ideas and nerds play with stuff, too. Plus, they aren’t two distinct personalities as much as different aspects of personality. Generally, the data seem to affirm my thinking.

    Or maybe pop culture (geek) versus education (nerd).

  • Hans Rosling explains population growth and climate change

    June 7, 2013  |  Statistics

    Because every day is a good day to listen to Hans Rosling talk numbers. In this short video, Rosling uses Lego bricks to explain population growth and the gaps in wealth and carbon footprint.

  • Myths of big data

    June 4, 2013  |  Statistics

    Microsoft researcher Kate Crawford describes several myths of big data. Myth #4: It makes cities smarter.

    "It's only as good as the people using it," Ms. Crawford said. Many of the sensors that track people as they manage their urban lives come from high-end smartphones, or cars with the latest GPS systems. "Devices are becoming the proxies for public needs," she said, "but there won't be a moment where everyone has access to the same technology." In addition, moving cities toward digital initiatives like predictive policing, or creating systems where people are seen, whether they like it or not, can promote lots of tension between individuals and their governments.

    Yep. I hear those people things can introduce a lot of challenges.

Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.