• A principal component analysis step-by-step

    April 17, 2014  |  Statistics

    Sebastian Raschka offers a step-by-step tutorial for a principal component analysis in Python.

    The main purposes of a principal component analysis are to analyze data to identify patterns, and to use those patterns to reduce the dimensionality of the dataset with minimal loss of information.

    Here, our desired outcome of the principal component analysis is to project a feature space (our dataset consisting of n x d-dimensional samples) onto a smaller subspace that represents our data "well". A possible application would be a pattern classification task, where we want to reduce the computational costs and the error of parameter estimation by reducing the number of dimensions of our feature space by extracting a subspace that describes our data "best".

    That is, imagine you have a dataset with a lot of variables, some of them important and some of them not so much. A PCA helps you identify which is which, so the data is less unwieldy and the computational overhead is reduced.
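    The core steps are easy to sketch in a few lines of NumPy. This is a minimal illustration on made-up data, not Raschka's actual code:

```python
import numpy as np

# Illustrative data: 200 samples, 5 dimensions (not from the tutorial)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)  # a correlated column

# 1. Center the data
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features
cov = np.cov(X_centered, rowvar=False)

# 3. Eigendecomposition: eigenvectors are the principal axes
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort by explained variance (largest eigenvalue first)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Project onto the top k components
k = 2
X_reduced = X_centered @ eigvecs[:, :k]
print(X_reduced.shape)  # (200, 2)
```

    The projection keeps the directions with the most variance and drops the rest, which is the "minimal loss of information" part.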

  • Analysis of Bob Ross paintings

    April 17, 2014  |  Statistics

    Bob Ross keywords

    As a lesson on conditional probability for himself, Walt Hickey watched 403 episodes of "The Joy of Painting" with Bob Ross, tagged them with keywords on what Ross painted, and examined Ross's tendencies.

    I analyzed the data to find out exactly what Ross, who died in 1995, painted for more than a decade on TV. The top-line results are to be expected — wouldn't you know, he did paint a bunch of mountains, trees and lakes! — but then I put some numbers to Ross's classic figures of speech. He didn't paint oaks or spruces, he painted "happy trees." He favored "almighty mountains" to peaks. Once he'd painted one tree, he didn't paint another — he painted a "friend."

    Other findings include cumulus and cirrus cloud breakdowns, hill frequency, and Steve Ross (son of Bob Ross) patterns.
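    The conditional-probability angle is simple to sketch. With hypothetical episode tags (not Hickey's actual data), you can ask, say, how often a mountain shows up given that a tree does:

```python
# Hypothetical episode tags -- illustrative, not Hickey's dataset
episodes = [
    {"tree", "mountain", "lake"},
    {"tree", "cloud"},
    {"mountain", "cloud", "snow"},
    {"tree", "lake"},
    {"cabin", "tree", "mountain"},
]

def conditional(tag_a, tag_b, episodes):
    """P(tag_a | tag_b): fraction of episodes with tag_b that also have tag_a."""
    with_b = [e for e in episodes if tag_b in e]
    if not with_b:
        return 0.0
    return sum(tag_a in e for e in with_b) / len(with_b)

print(conditional("mountain", "tree", episodes))  # 0.5: 2 of the 4 tree episodes have a mountain
```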

  • Porn views for red versus blue states

    April 14, 2014  |  Statistics

    Top ten viewing states

    Pornhub continues their analysis of porn viewing demographics in their latest comparison of pageviews per capita between red and blue states (SFW for most, I think). The main question: Who watches more?

    Assuming the porn consumption per capita is normally distributed for each state and that different states have independent distribution of porn consumption per capita, we can say with 99% confidence the hypothesis that the per capita porn consumption of democratic states is higher than the republican states.

    Okay, the result statement sounds a little weird, but when you look at the rates, the conclusion seems clear. The states with the highest viewing per capita are shown above, and for some reason Kansas is significantly higher than everyone else. Way to go.
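    A test like the one quoted can be sketched as a one-sided two-sample t-test. The numbers below are synthetic stand-ins, not Pornhub's data:

```python
import numpy as np
from scipy import stats

# Synthetic per-capita pageview figures -- stand-ins, not Pornhub's data
rng = np.random.default_rng(1)
blue = rng.normal(loc=510, scale=30, size=26)  # Democratic-leaning states
red = rng.normal(loc=480, scale=30, size=24)   # Republican-leaning states

# One-sided Welch t-test: is the blue-state mean higher?
t_stat, p_two_sided = stats.ttest_ind(blue, red, equal_var=False)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

print(f"t = {t_stat:.2f}, one-sided p = {p_one_sided:.4f}")
# Reject the null at the 1% level when p_one_sided < 0.01
```

    The normality and independence assumptions in the quote are exactly what this kind of test leans on, which is worth keeping in mind with 50 data points.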

    For a clearer view, Christopher Ingraham charted the same data but incorporated the percent of Obama voters for each state. Interpret as you wish:

    Obama voting and porn

    Again, note Kansas high on the vertical axis.

    Update: Be sure to read this critique for a better picture of what you see here.

  • Using Census survey data properly

    April 11, 2014  |  Statistics

    The American Community Survey, an ongoing survey that the Census administers to millions per year, provides detailed information about how Americans live now and how they lived decades ago. There are tons of data tables on topics such as housing situations, education, and commute. The natural thing to do is to download the data, take it at face value, and carry on with your analysis or visualization.

    However, as is usually the case with data, there's more to it than that. Paul Overberg, a database editor at USA Today, explains in a practical guide on how to get the most out of the survey data (which can be generalized to other survey results).

    Journalists who use ACS a lot have a helpful slogan: "Don't make a big deal out of small differences." Journalists have all kinds of old-fashioned tools to deal with this kind of challenge, starting with adverbs: "about," "nearly," "almost," etc. It's also a good idea to round ACS numbers as a signal to users and to improve readability.

    In tables and visualizations, the job is tougher. These introduce ranking and cutpoints, which create potential pitfalls. For tables, it's often better to avoid rankings and instead create groups—high, middle, low. In visualizations, one workaround is to adapt high-low-close stock charts to show a number and its error margins. Interactive data can provide important details on hover or click.
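    The "small differences" caution can be made concrete. A common rule of thumb for comparing two ACS estimates is that they differ meaningfully only if the gap exceeds their combined margin of error; the figures below are hypothetical:

```python
import math

def significantly_different(est_a, moe_a, est_b, moe_b):
    """Two survey estimates differ significantly (at the confidence level of the
    published MOEs, typically 90% for ACS) only if the gap between them exceeds
    the combined margin of error."""
    combined_moe = math.sqrt(moe_a ** 2 + moe_b ** 2)
    return abs(est_a - est_b) > combined_moe

# Hypothetical median-commute estimates (minutes) for two counties
print(significantly_different(27.1, 1.4, 28.0, 1.6))  # False: gap of 0.9 < combined MOE of ~2.1
```

    In other words, a ranking that splits these two counties would be making a big deal out of a small difference.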

    If you do any kind of data reporting, whatever field it's in, you should be familiar with most of what Overberg describes. If not, better get your learn on.

  • Bracket picks of the masses versus sports pundits

    April 11, 2014  |  Statistics

    NCAA bracket picking

    Stephen Pettigrew and Reuben Fischer-Baum, for Regressing, compared 11 million brackets on ESPN.com against those of pundits.

    To evaluate how much better (or worse) the experts were at predicting this year's tournament, I considered three criteria: the number of games correctly predicted, the number of points earned for correct picks, and the number of Final Four teams correctly identified. Generally the experts' brackets were slightly better than the non-expert ones, although the evidence isn't especially overwhelming. The analysis suggests that next year you'll have just as good a chance of winning your office pool if you make your own picks as if you follow the experts.

    Due to availability, the expert sample size is a small 53, but it does appear the expert brackets perform in roughly the same range as those of the masses. Still too noisy to know for sure though.

    If anything, this speaks more to the randomness of the tournament than it does about people knowing what teams to pick. It's the same reason why my mom, who knows nothing about basketball or any sports for that matter, often comes out ahead in the work pool. The expert picks are just a point of reference.

  • Fox News bar chart gets it wrong

    April 4, 2014  |  Mistaken Data

    Fox News bar chart

    Because Fox News. See also this, this, and this. [Thanks, Meron]

  • Big data, same statistical challenges

    April 4, 2014  |  Statistics

    Tim Harford for Financial Times on big data and how the same problems for small data still apply:

    The multiple-comparisons problem arises when a researcher looks at many possible patterns. Consider a randomised trial in which vitamins are given to some primary schoolchildren and placebos are given to others. Do the vitamins work? That all depends on what we mean by “work”. The researchers could look at the children’s height, weight, prevalence of tooth decay, classroom behaviour, test scores, even (after waiting) prison record or earnings at the age of 25. Then there are combinations to check: do the vitamins have an effect on the poorer kids, the richer kids, the boys, the girls? Test enough different correlations and fluke results will drown out the real discoveries.
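    A quick simulation shows the problem Harford describes. Run many tests where there is no real effect and some will come out "significant" by chance; a Bonferroni-style correction tightens the threshold. (Illustrative only; nothing here is from the article.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, alpha = 100, 0.05

# 100 outcomes with NO true effect: both groups drawn from the same distribution
p_values = np.array([
    stats.ttest_ind(rng.normal(size=50), rng.normal(size=50)).pvalue
    for _ in range(n_tests)
])

print("flukes at alpha = 0.05:", (p_values < alpha).sum())  # around 5 expected by chance
print("after Bonferroni:", (p_values < alpha / n_tests).sum())  # usually 0
```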

    You're usually in for a fluffy article about drowning in data and social media when 'big data' is in the title. This one is worth the full read.

  • Bike share data in New York, animated

    April 1, 2014  |  Data Sources

    Citi Bike, also known as NYC Bike Share, is releasing monthly data dumps for station check-outs and check-ins, which gives you a sense of where and when people move about the city. Jeff Ferzoco, Sarah Kaufman, and Juan Francisco Saldarriaga mapped 24 hours of activity in the video below.

    [Thanks, Jeff]

  • Dead links on the Million Dollar Homepage

    April 1, 2014  |  Statistics

    Million dollar homepage

    Remember the Million Dollar Homepage from 2005? It sold ad space to anyone who was interested for one dollar per pixel, and there were one million pixels available. All spots were filled, and it spawned a bunch of other million dollar homepages that turned out to be zero dollar homepages.

    David Yanofsky for Quartz returned to the homepage to look at link rot. 22 percent of links on the homepage are dead.

  • Gambling data as a proxy for excitement in sports

    March 17, 2014  |  Statistics

    Gambling data as a proxy

    After he noticed gambling odds fluctuate wildly at the end of a football game, Todd Schneider saw a connection between betting odds and game excitement. The Gambletron 2000 is a fun look into the proxy.

    It occurred to me then that variance in gambling market odds is a good way to quantify how exciting a game is. Modern betting exchanges allow gamblers to bet throughout the course of a game. The odds, which can also be expressed as win probabilities, continually readjust as the game progresses. My claim is that the more the odds fluctuate during a game, the more exciting that game is.

    Games and odds update automatically up to the minute, with a highlight on the "hotness" of games, or the amount of variation over time. A blowout game shows a line that heads steadily towards 100 percent probability that a team will win, whereas a comeback game shows a line that approaches 100 percent for one team and then swings back towards 100 percent for the opposition.
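    One plausible way to score "hotness" (not necessarily Gambletron's exact formula) is the variance of the swings in win probability over a game:

```python
import numpy as np

def hotness(win_prob):
    """Variance of the step-to-step changes in win probability:
    one plausible way to score how much the odds swung."""
    return float(np.var(np.diff(win_prob)))

# Toy win-probability traces for the home team (0-1 scale), made up for illustration
blowout = np.linspace(0.5, 0.99, 50)                      # steady march to a win
comeback = np.concatenate([np.linspace(0.5, 0.95, 25),    # nearly over...
                           np.linspace(0.95, 0.05, 25)])  # ...then a huge swing

print(hotness(blowout) < hotness(comeback))  # True: the comeback swings far more
```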

    I had the odds for the Golden State-Portland game open for part of the time tonight, and it was kind of a fun accompaniment.

    Mobile alert app for sports, anyone? Current offerings are abysmal.

  • Where time comes from

    March 13, 2014  |  Statistics

    The Atlantic interviewed Dr. Demetrios Matsakis, Chief Scientist for Time Services at the US Naval Observatory, about where time comes from, the precision required and how they obtain it, and why we need such precision. Five seconds into it, my wife commented, "That sounds nerdy." That's how you know it's gonna be good.

  • How people really read and share online

    March 12, 2014  |  Statistics

    Reading and social activity

    Tony Haile discusses how we read and share online, based on actual data. It's not as click- and pageview-based as you might think.

    A widespread assumption is that the more content is liked or shared, the more engaging it must be, the more willing people are to devote their attention to it. However, the data doesn’t back that up. We looked at 10,000 socially-shared articles and found that there is no relationship whatsoever between the amount a piece of content is shared and the amount of attention an average reader will give that content.

    When we combined attention and traffic to find the story that had the largest volume of total engaged time, we found that it had fewer than 100 likes and fewer than 50 tweets. Conversely, the story with the largest number of tweets got about 20% of the total engaged time that the most engaging story received.

  • The important parts of data analysis

    March 11, 2014  |  Statistics

    There's plenty of software to muck around with data, but to gain the skills to really get something out of it, that takes time and experience. Mikio Braun, a post doc in machine learning, explains.

    For a number of reasons, I don't think that you can "toolify" data analysis that easily. I wish it were that easy, but from my hard-won experience with my own work and teaching people this stuff, I'd say it takes a lot of experience to be done properly and you need to know what you're doing. Otherwise you will do stuff which breaks horribly once put into action on real data.

    And I don't write this because I don't like the projects which exist, but because I think it is important to understand that you can't just give a few coders new tools and they will produce something which works. And depending on how you want to use data analysis in your company, this might break or make your company.

    Braun breaks it down into four bullet points worth a read, but the tl;dr version is that analysis isn't simple, and no tool is going to do everything for you. It's simple with simple data, but you can almost always go deeper with more data, and it takes experience to ask the right questions. So try not to be too content with that software output.

  • Statistical concepts explained through dance

    March 7, 2014  |  Statistics

    Forget bell curves, jellybeans, and coin flips to explain statistical concepts. Dancing Statistics is a video series that demonstrates variance, correlation, and sampling through choreographed movements. The dance below explains variance.

    Watch the full playlist here. [via infosthetics]

  • ProPublica opened a data store

    March 4, 2014  |  Data Sources

    One of the main challenges of any data project is getting the data. It seems obvious, but the effort to get the right data to answer a question seems to catch people off guard. Even data that's "free" to download can be a huge pain that ends up completely useless. ProPublica, the non-profit newsroom, deals with this stuff on a regular basis and hopes that some of their efforts can turn into a source of funding through the Data Store.

    Like most newsrooms, we make extensive use of government data — some downloaded from "open data" sites and some obtained through Freedom of Information Act requests. But much of our data comes from our developers spending months scraping and assembling material from web sites and out of Acrobat documents. Some data requires months of labor to clean or requires combining datasets from different sources in a way that's never been done before.

    In the Data Store you'll find a growing collection of the data we've used in our reporting. For raw, as-is datasets we receive from government sources, you'll find a free download link that simply requires you agree to a simplified version of our Terms of Use. For datasets that are available as downloads from government websites, we've simply linked to the sites to ensure you can quickly get the most up-to-date data.

    For datasets that are the result of significant expenditures of our time and effort, we're charging a reasonable one-time fee: In most cases, it's $200 for journalists and $2,000 for academic researchers.

    I hope it works.

  • Game theory to win game shows

    February 26, 2014  |  Statistics

    I like how a little bit of game theory has crept into Jeopardy! with contestant Arthur Chu. He bounces around the board in search of Daily Doubles and bets to tie in Final Jeopardy! Chu doesn't know much about game theory himself but applies rules promoted by a past contestant.

    The ultimate champion, Ken Jennings, praises Chu on Slate.

    But in fact, plenty of nice white boys on Jeopardy! have been pilloried by viewers for using Arthur Chu's signature technique: bopping around the game board seemingly at whim, rather than choosing the clues from top to bottom, as most contestants do. This is Chu's great crime, the kind of anarchy that hard-core Jeopardy! fans will not countenance. The technique was pioneered in 1985 by a five-time champ named Chuck Forrest, whose law school roommate suggested it. The "Forrest bounce," as fans still call it, kept opponents off balance. He would know ahead of time where the next clue would pop up; they’d be a second slow.

    I don't watch Jeopardy! much, but it's pretty fun to watch Chu dominate.

    Then there's the most recent RadioLab. The first part talks about a game show called Golden Balls and the prisoner's dilemma, and how a guy — who plays and wins game shows for a living — won this one. The whole show is entertaining as usual, but this first part is of particular interest. After listening to that, watch the Golden Balls clip to see how it played out.

  • A visual explanation of conditional probability

    February 18, 2014  |  Statistics

    Conditional probability

    Victor Powell, who has visualized the Central Limit Theorem and Simpson's Paradox, most recently provided a visual explainer for conditional probability.

    Two bars, one blue and one red, represent two events that can happen together or separately. When a ball hits a bar, the corresponding event occurs. What is the probability that one event occurs given that the other does, and vice versa? As the probability of each event increases or decreases, how do the conditional probabilities change? Sliders and options let you experiment, and the visual and counters update to help you learn.
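    The idea behind the visual is easy to mimic with a simulation: drop many balls, let each event occur with some assumed probability, and count the conditionals among the balls where one event happened:

```python
import numpy as np

rng = np.random.default_rng(7)
n_balls = 100_000

# Each ball independently hits the red and/or blue bar (assumed probabilities)
p_red, p_blue = 0.4, 0.25
red = rng.random(n_balls) < p_red
blue = rng.random(n_balls) < p_blue

# P(red | blue): among balls that hit blue, the share that also hit red
p_red_given_blue = red[blue].mean()
print(round(p_red_given_blue, 2))  # close to 0.4, since the events here are independent
```

    Making the events dependent instead, as Powell's sliders allow, is what pulls the conditional probabilities away from the marginals.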

    A fun one to tinker with.

  • Basketball analytics

    February 12, 2014  |  Statistics

    Kirk Goldsberry talks about the rise of analytics usage in the NBA. With cameras above every court recording player movements, a higher-granularity analysis is now possible, beyond the box score. One of the key metrics is expected possession value, or EPV, which estimates the number of points a possession is worth, given where everyone is on the court and where the ball is.

    But the clearest application of EPV is quantifying a player's overall offensive value, taking into account every single action he has performed with the ball over the course of a game, a road trip, or even a season. We can use EPV to collapse thousands of actions into a single value and estimate a player's true value by asking how many points he adds compared with a hypothetical replacement player, artificially inserted into the exact same basketball situations. This value might be called "EPV-added" or "points added."
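    The "points added" arithmetic reduces to a simple sum. Here's a sketch with made-up EPV numbers, not Goldsberry's model:

```python
# Sketch of the "points added" idea: compare a player's estimated possession
# values against a replacement-level baseline. All numbers are made up.
player_epv = [1.12, 0.98, 1.30, 1.05, 0.87]       # model's EPV after each touch
replacement_epv = [1.02, 1.01, 1.08, 1.00, 0.95]  # same situations, replacement player

points_added = sum(p - r for p, r in zip(player_epv, replacement_epv))
print(round(points_added, 2))  # 0.26 points above replacement over these touches
```

    The hard part, of course, is the model that produces the EPV estimates in the first place; the aggregation is the easy step.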

    As a basketball fan, I hope this makes the game more fun and interesting to watch, and as a statistician, I hope this work can be applied to other facets of life like traffic or local movements. If just the latter, that'd be fine too.

  • Texting data to save lives

    February 6, 2014  |  Data Sources

    Remember that TED talk from a couple of years ago on texting patterns to a crisis hotline? The TED talker Nancy Lublin proposed the analysis of these text messages to potentially help the individuals texting. Her group, the Crisis Text Line, plans to release anonymized aggregates in the coming months.

    Ms. Lublin said texts also provided real-time information that showed patterns for people in crisis.

    Crisis Text Line's data, she said, suggests that children with eating disorders seek help more often Sunday through Tuesday, that self-cutters do not wait until after school to hurt themselves, and that depression is reported three times as much in El Paso as in Chicago.

    This spring, Crisis Text Line intends to make the aggregate data available to the public. "My dream," Ms. Lublin said, "is that public health officials will use this data and tailor public policy solutions around it."

    Keeping an eye on this.

  • How R came to be

    January 30, 2014  |  Statistics

    Statistician John Chambers, the creator of S and a core member of R, talks about how R came to be in the short video below. Warning: Super nerdy waters ahead.

    I've heard this story before, but it was nice to hear it again, since it is about something I use almost every day. I would also like to hear about the invention of the toilet. [via Revolutions]

Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.