Forget bell curves, jellybeans, and coin flips to explain statistical concepts. Dancing Statistics is a video series that demonstrates variance, correlation, and sampling through choreographed movements. The dance below explains variance.
Justin Blinder used New York's city planning dataset and Google Street View for a before-and-after view of vacant lots.
Vacated mines and combines different datasets on vacant lots to present a sort of physical facade of gentrification, one that immediately prompts questions by virtue of its incompleteness: “Vacated by whom? Why? How long had they been there? And who’s replacing them?” Are all these changes instances of gentrification, or just some? While we usually think of gentrification in terms of what is new or has been displaced, Vacated highlights the momentary absence of such buildings, either because they’ve been demolished or have not yet been built. All images depicted in the project are both temporal and ephemeral, since they draw upon image caches that will eventually be replaced.
Based on reviews from BeerAdvocate, Beer Viz, a visualization class project, asks you to choose a general style of beer and a beer that you like. Then it shows you beers that are similar, based on appearance, taste, aroma, and overall score. It's like a visual version of the beer recommendation system we saw last year.
The NBA has been kind of gaga over data the past few years, and they recently announced that all 30 teams would have player tracking installed so they can see where they go at night after games. Wait, no. I mean so that there is data on where each player is on the court at any given time. Fathom Information Design played with some of this data for an Oklahoma City versus San Antonio game, with some sketches.
Above are the movements of power forward Tim Duncan, who sticks around the middle of the court throughout a game. A guard, on the other hand, runs around the court more. This is obvious if you've watched Duncan play, but sketches like this coupled with spatiotemporal analysis could be interesting.
Also, I get the sense that there are more people who want to know about this data than there are people who know how to analyze it, so if you're a statistician on the job hunt, there's that.
One of the main challenges of any data project is getting the data. It seems obvious, but the effort to get the right data to answer a question seems to catch people off guard. Even data that's "free" to download can be a huge pain that ends up completely useless. ProPublica, the non-profit newsroom, deals with this stuff on a regular basis and hopes that some of their efforts can turn into a source of funding through the Data Store.
Like most newsrooms, we make extensive use of government data — some downloaded from "open data" sites and some obtained through Freedom of Information Act requests. But much of our data comes from our developers spending months scraping and assembling material from web sites and out of Acrobat documents. Some data requires months of labor to clean or requires combining datasets from different sources in a way that's never been done before.
For datasets that are the result of significant expenditures of our time and effort, we're charging a reasonable one-time fee: In most cases, it's $200 for journalists and $2,000 for academic researchers.
I hope it works.
After noting the later dinner time in Spain, Stefano Maggiolo pointed to relatively late sunsets, compared to standard time, as one of the possible reasons. Then he mapped the difference between solar time and standard time around the world.
Looking for other regions of the world having the same peculiarity of Spain, I edited a world map from Wikipedia to show the difference between solar and standard time. It turns out, there are many places where the sun rises and sets late in the day, like in Spain, but not a lot where it is very early (highlighted in red and green in the map, respectively). Most of Russia is heavily red, but mostly in zones with very scarce population; the exception is St. Petersburg, with a discrepancy of two hours, but the effect on time is mitigated by the high latitude. The most extreme example of Spain-like time is western China: the difference reaches three hours against solar time. For example, today the sun rises there at 10:15 and sets at 19:45, and solar noon is at 15:01.
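The gap on the map can be roughly approximated from longitude and time zone alone. Here is a minimal sketch, assuming mean solar time (15 degrees of longitude per hour) and ignoring daylight saving time and the equation of time. The function name and the approximate longitudes are my own assumptions for illustration, not something from Maggiolo's post.

```python
def solar_offset_minutes(longitude_deg, utc_offset_hours):
    """Minutes by which clock time runs ahead of mean solar time.

    Positive values mean the sun rises and sets "late" by the clock.
    Assumes mean solar time; ignores daylight saving and the equation of time.
    """
    # The earth turns 15 degrees per hour, so mean solar time is UTC + longitude / 15.
    solar_utc_offset_hours = longitude_deg / 15.0
    return (utc_offset_hours - solar_utc_offset_hours) * 60.0


# Approximate longitudes (assumed for illustration), east positive.
places = {
    "Madrid (UTC+1)": (-3.7, 1),
    "Kashgar, western China (UTC+8)": (76.0, 8),
    "London (UTC+0)": (-0.1, 0),
}

for name, (lon, tz) in places.items():
    offset = solar_offset_minutes(lon, tz)
    print(f"{name}: clock is {offset:+.0f} min vs. solar time")
```

Under these assumptions Madrid comes out at a bit over an hour and western China at just under three hours, which is in line with the figures quoted above.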
Because you get more pizza to eat, and if you don't finish it, you'll have breakfast tomorrow. Other than that fine reason, well, it's geometrically the better deal. Planet Money explains with an interactive that shows the price per square inch for 3,678 pizza places across the United States, based on data from Grubhub.
The math of why bigger pizzas are such a good deal is simple: A pizza is a circle, and the area of a circle increases with the square of the radius.
So, for example, a 16-inch pizza is actually four times as big as an 8-inch pizza.
And when you look at thousands of pizza prices from around the U.S., you see that you almost always get a much, much better deal when you buy a bigger pizza.
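The geometry is easy to check yourself. A quick sketch follows; the menu prices are hypothetical numbers I made up for illustration and are not taken from the Grubhub data.

```python
import math


def pizza_area(diameter_in):
    """Area in square inches of a round pizza with the given diameter."""
    return math.pi * (diameter_in / 2) ** 2


def price_per_square_inch(price, diameter_in):
    """Dollars paid per square inch of pizza."""
    return price / pizza_area(diameter_in)


# Doubling the diameter quadruples the area.
ratio = pizza_area(16) / pizza_area(8)
print(f"16-inch vs. 8-inch area ratio: {ratio:.1f}")

# Hypothetical menu prices, not from the Planet Money dataset.
menu = {8: 8.00, 12: 12.00, 16: 16.00}
for diameter, price in menu.items():
    cost = price_per_square_inch(price, diameter)
    print(f"{diameter}-inch: ${cost:.3f} per square inch")
```

Even when the price grows in step with the diameter, as in this made-up menu, the cost per square inch keeps dropping because the area grows faster.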
You get more pizza, and the business gets more money with minimal extra pizza-making effort. Win-win. Although, keep going on the horizontal axis and I bet that curve starts to curl up. Where can I get a ten-foot pizza?
The histogram is one of my favorite chart types, and for analysis purposes, it's probably the one I use most. Devised by Karl Pearson (the father of mathematical statistics) in the late 1800s, it's simple geometrically, robust, and allows you to see the distribution of a dataset.
If you don't understand what's driving the chart though, it can be confusing, which is probably why you don't see it often in general publications.
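What drives the chart is just counting: split the value range into equal-width bins and tally how many observations land in each. A minimal sketch in plain Python, where the function name, the bin width, and the sample data are all made up for illustration:

```python
def histogram_counts(values, bin_width, start=None):
    """Count how many values fall into each fixed-width bin.

    Returns a list of (bin_start, count) pairs. Bins are half-open,
    [bin_start, bin_start + bin_width), and empty bins are kept.
    Assumes start is no larger than the smallest value.
    """
    if start is None:
        start = min(values)
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)
        counts[index] = counts.get(index, 0) + 1
    return [(start + i * bin_width, counts.get(i, 0))
            for i in range(max(counts) + 1)]


# Made-up sample data.
data = [2, 3, 3, 5, 7, 8, 8, 8, 9, 12, 13, 21]
bins = histogram_counts(data, bin_width=5, start=0)

# Draw one text bar per bin; bar length is the count.
for bin_start, count in bins:
    print(f"{bin_start:>3} to {bin_start + 5:<3} {'#' * count}")
```

Change the bin width and the shape changes with it, which is a big part of why the chart can mislead if you don't know how it was built.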
Looking for a job in data science, visualization, or statistics? There are openings on the board.
Senior Associate Director, Analytics for the University of Chicago in Chicago, Illinois.
Data Scientist for Thumbtack in San Francisco, California.
Communications Officer, Measurement and Analysis for the Bill and Melinda Gates Foundation in Seattle, Washington.
Senior Graphics Editor for The Wall Street Journal in New York, New York.
Basketball Analyst for the Philadelphia 76ers in Philadelphia, Pennsylvania.
I like how a little bit of game theory has crept into Jeopardy! with contestant Arthur Chu. He bounces around the board in search of Daily Doubles and bets to tie in Final Jeopardy. Chu doesn't know much about game theory himself but applies rules promoted by a past contestant.
The ultimate champion, Ken Jennings, praises Chu on Slate.
But in fact, plenty of nice white boys on Jeopardy! have been pilloried by viewers for using Arthur Chu's signature technique: bopping around the game board seemingly at whim, rather than choosing the clues from top to bottom, as most contestants do. This is Chu's great crime, the kind of anarchy that hard-core Jeopardy! fans will not countenance. The technique was pioneered in 1985 by a five-time champ named Chuck Forrest, whose law school roommate suggested it. The "Forrest bounce," as fans still call it, kept opponents off balance. He would know ahead of time where the next clue would pop up; they’d be a second slow.
I don't watch Jeopardy! much, but it's pretty fun to watch Chu dominate.
Then there's the most recent RadioLab. The first part talks about a game show called Golden Balls and the prisoner's dilemma, and how a guy — who plays and wins game shows for a living — won this one. The whole show is entertaining as usual, but this first part is of particular interest. After listening to that, watch the Golden Balls clip to see how it played out.
Selfiecity, from Lev Manovich, Moritz Stefaner, and a small group of analysts and researchers, is a detailed visual exploration of 3,200 selfies from five major cities around the world. The project is both a broad look at demographics and trends and a chance to look closer at the individual observations.
Global Forest Watch uses satellite imagery and other technologies to estimate forest usage, change, and tree cover (among other things). These estimates, and any action taken in response, used to be slow. Now they're near-real-time.
This is about to change with the launch of Global Forest Watch—an online forest monitoring system created by the World Resources Institute, Google and a group of more than 40 partners. Global Forest Watch uses technologies including Google Earth Engine and Google Maps Engine to map the world’s forests with satellite imagery, detect changes in forest cover in near-real-time, and make this information freely available to anyone with Internet access.
Many layers and high granularity. Take your time with this one.
Maris Jensen just made SEC filings readable by humans. The motivation:
But in the twenty years since, despite hundreds of millions invested in rounds of contracted EDGAR modernization efforts and interactive data false starts, the SEC's EDGAR has remained almost untouched. In 2014, the SEC is quite literally doing less with SEC filings than their predecessors had planned for 1984. Data tagging is the red-headed stepchild of the Commission -- out of hundreds of forms, only about a dozen are filed as structured data -- and the first program to automate the selection of SEC filings for review, the Division of Economic and Risk Analysis (DERA)'s 'Robocop', has been 'aspirational' for years. The academics in the division responsible for the SEC's interactive data initiatives write papers about information asymmetry, using EDGAR data they repurchase in usable form for millions each year, but do nothing to fix it. Companies are chastised for insufficient and inefficient disclosure, while the SEC fails to help retail investors navigate corporate disclosures at all.
Look up a company and see their financials, ownership, influences, and board members, among other things typically not so straightforward to look up.
This is all sorts of neat. Researchers Andrew Adamatzky and Ramon Alonso-Sanz are using a slime mold, P. polycephalum, to find the most efficient road routes, which could provide guidance on how to rework existing ones. P. polycephalum is a single-celled organism that forages for food by sending out branches, and when it finds the most efficient path to food, it backs away from the others. The video above is a sped-up version of it in action. Adamatzky and Alonso-Sanz put a map underneath.
We cut agar plates in a shape of Iberian peninsula, place oat flakes at the sites of major urban areas and analyse the foraging network developed. We compare the plasmodial network with principle motorways and also analyse man-made and plasmodium networks in a framework of planar proximity graphs.
Nick Danforth for Al Jazeera delves into the history books for why north is typically on the top of our maps. There's no single reason for it, but Ptolemy might have had something to do with it.
The north's position was ultimately secured by the beginning of the 16th century, thanks to Ptolemy, with another European discovery that, like the New World, others had known about for quite some time. Ptolemy was a Hellenic cartographer from Egypt whose work in the second century A.D. laid out a systematic approach to mapping the world, complete with intersecting lines of longitude and latitude on a half-eaten-doughnut-shaped projection that reflected the curvature of the earth. The cartographers who made the first big, beautiful maps of the entire world, Old and New — men like Gerardus Mercator, Henricus Martellus Germanus and Martin Waldseemuller — were obsessed with Ptolemy. They turned out copies of Ptolemy's Geography on the newly invented printing press, put his portrait in the corners of their maps and used his writings to fill in places they had never been, even as their own discoveries were revealing the limitations of his work.
Ptolemy put north on top. Although, we don't know why he put it there.
Two bars, one blue and one red, represent two events that can happen together or independently of each other. When a ball hits a bar, the corresponding event occurs. What is the probability that one event occurs given that the other does, and vice versa? If the probability of both events increases or decreases, how does that change the separate probabilities? Sliders and options let you experiment, and the visual and counters change to help you learn.
A fun one to tinker with.
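You can mimic the idea with a small simulation. This is only a sketch under my own assumptions, not how the interactive is built: balls land at uniform random positions between 0 and 1, each bar covers a stretch of that range, and the bar positions below are arbitrary. The conditional probability is then the share of balls hitting both bars out of those hitting the given bar.

```python
import random


def simulate(red, blue, n_balls=100_000, seed=1):
    """Drop balls at uniform random positions in [0, 1) and count hits.

    red and blue are (left, right) extents of the two bars; a ball that
    lands within a bar's extent triggers that bar's event.
    Assumes each bar is hit at least once, so the divisions are safe.
    """
    rng = random.Random(seed)
    n_red = n_blue = n_both = 0
    for _ in range(n_balls):
        x = rng.random()
        hit_red = red[0] <= x < red[1]
        hit_blue = blue[0] <= x < blue[1]
        n_red += hit_red
        n_blue += hit_blue
        n_both += hit_red and hit_blue
    return {
        "P(red)": n_red / n_balls,
        "P(blue)": n_blue / n_balls,
        "P(red | blue)": n_both / n_blue,
        "P(blue | red)": n_both / n_red,
    }


# Arbitrary bar positions: the bars overlap between 0.3 and 0.5.
result = simulate(red=(0.1, 0.5), blue=(0.3, 0.9))
for label, value in result.items():
    print(f"{label}: {value:.3f}")
```

With these bars the overlap covers 0.2 of the range, so the estimates should land near 1/3 for red given blue and 1/2 for blue given red. Slide the bars around by changing the extents and watch the two conditionals move differently.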
As most of us know, it's not easy getting by on minimum wage, and in some places it's not possible. The New York Times provides a calculator to see how challenging it can be.
A simple visual on the right shows dollars made per year, one box per dollar colored green initially and then red to signal debt. It's a good way to make the numbers more relatable. Select a state, enter expenses, and watch dollars disappear, and most likely you'll end up in the red early.
— Jerzy Wieczorek describes his first semester as a stat PhD student.
— Apparently there's a stochastic process in probability theory called the Chinese Restaurant Process, and a closely related Indian Buffet Process. Matt Dickenson made a quick visualization to demonstrate the latter. These require more investigation for their names alone.
— Stravinsky's The Rite of Spring visualized.
— Why didn't anyone tell me Arduino was so fun and accessible? It's like LEGOs raised to the nerdeth degree.
Nicholas Felton, Drew Breunig, and Friends of the Web released Reporter for iPhone. The app—$3.99 on the app store—prompts you with quizzes, such as who you're with or what you're doing, sparsely throughout the day to help you collect data about yourself and surroundings. You can also create your own survey questions to collect data on what interests you and use your phone's existing capabilities to record location, sound levels, weather, and photo counts automatically.
Kirk Goldsberry talks about the rise of analytics in the NBA. With cameras above every court recording player movements, analysis at a much finer granularity than the box score is now possible. One of the key metrics is expected possession value, or EPV, which estimates the number of points a possession is worth, given where everyone is on the court and where the ball is.
But the clearest application of EPV is quantifying a player's overall offensive value, taking into account every single action he has performed with the ball over the course of a game, a road trip, or even a season. We can use EPV to collapse thousands of actions into a single value and estimate a player's true value by asking how many points he adds compared with a hypothetical replacement player, artificially inserted into the exact same basketball situations. This value might be called "EPV-added" or "points added."
As a basketball fan, I hope this makes the game more fun and interesting to watch, and as a statistician, I hope this work can be applied to other facets of life like traffic or local movements. If just the latter, that'd be fine too.