Last year, fitness tracking app Strava released a high detail map of public activity data. Looking more closely, security student Nathan Ruser noticed activity in various parts of the globe that revealed secret US army bases.
-
We love complete and nicely formatted data. That’s not what we get a lot of the time.
-
This fake follower piece by Nicholas Confessore, Gabriel J.X. Dance, Richard Harris, and Mark Hansen for The New York Times is tops. In search of shortcuts to greater influence, many buy followers, likes, and retweets on Twitter. The numbers go up, but a lot of extra “influence” is just automated fluff.
The Times focuses on one company, Devumi, and investigates the follower pattern of some of the customers, as shown above. The scroll-y explanation is good. It’s even got pseudocode in there to explain the type of bots.
-
Inspired by Dear Data, the data drawing pen pal project, designers Josefina Bravo, Sol Kawage, and Tomoko Furukawa use the postcard medium to send each other weekly how-to instructions for a wide variety of everyday things. The only rule is that they can’t use words.
As of writing this, they’re on week 37, which covered how to roll maki, how to eat an apple like a boss, and how to make mayonnaise.
-
Evie Liu and William Davis, reporting for MarketWatch, looked at release strategies of Oscar nominees over the past few years. Some go for the wide release with the movie playing in over 1,500 theaters, whereas others choose a platform release with the movie playing in fewer than 50 theaters. Seven of the last eight Best Picture winners went with the latter route.
-
Max Fisher and Amanda Taub, for The New York Times, answer the question with a video and charts. And if you’re wondering how they generated a high resolution chart to video, Adam Pearce has you covered.
-
Population data typically comes in the context of boundaries. City data. County data. Country data. With their Population Estimate Service, NASA provides data at higher granularity. You can request estimated population in the context of a world grid.
Here’s an interactive map to demonstrate the API. Click and drag a shape across any region in the world and get an estimate of the population within that shape. [via kottke]
-
I think we can all benefit from knowing a little more about others these days. This is a glimpse of how different groups live.
-
According to NASA estimates, 2017 was the second warmest year on record since 1880. Henry Fountain, Jugal K. Patel, and Nadja Popovich reporting for The New York Times:
What made the numbers unexpected was that last year had no El Niño, a shift in tropical Pacific weather patterns that is usually linked to record-setting heat and that contributed to record highs the previous two years. In fact, last year should have benefited from a weak version of the opposite phenomenon, La Niña, which is generally associated with lower atmospheric temperatures.
Good times ahead.
-
New data dump from the Wikimedia Foundation:
The Wikimedia Foundation’s Analytics team is releasing a monthly clickstream dataset. The dataset represents—in aggregate—how readers reach a Wikipedia article and navigate to the next. Previously published as a static release, this dataset is now available as a series of monthly data dumps for English, Russian, German, Spanish, and Japanese Wikipedias.
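If you want to poke at a dump, here is a minimal Python sketch of one thing you could do with it: tally where readers came from before landing on a given article. It assumes each row is tab-separated with four fields (previous page, current page, link type, count), which is my understanding of the layout rather than something stated in the quote above. The sample rows and the `top_referrers` helper are made up for illustration.

```python
import csv
from collections import Counter
from io import StringIO

# Made-up sample rows in the assumed layout: prev, curr, type, n
SAMPLE = (
    "other-search\tStatistics\texternal\t500\n"
    "Data_science\tStatistics\tlink\t120\n"
    "other-empty\tStatistics\texternal\t80\n"
    "Statistics\tProbability\tlink\t60\n"
)


def top_referrers(tsv_file, article, k=3):
    """Return the k largest sources of traffic into `article`."""
    counts = Counter()
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip rows that don't match the assumed layout
        prev, curr, _link_type, n = row
        if curr == article:
            counts[prev] += int(n)
    return counts.most_common(k)


result = top_referrers(StringIO(SAMPLE), "Statistics")
print(result)
```

Swap the `StringIO` sample for an opened dump file and the same function should work, provided the real files match the assumed columns.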
-
PornHub compared minute-to-minute traffic on their site before and after the missile alert to an average Saturday (okay for work). Right after the alert there was a dip as people rushed for shelter, but not long after the false alarm notice, traffic appears to spike.
Some interpret this as people rushing to porn after learning that a missile was not headed towards their home. Maybe that’s part of the reason, but my guess is that Saturday morning porn consumers woke earlier than usual.
-
From ABC News, this is a clever comparison between people’s worst fears and the number of deaths caused by the things that people fear. It starts by getting the reader to think about his or her fears and then places them in the context of causes of death.
-
This is a fun ditty by Vasco Asturiano. I’m a little too far out from my eighth grade jazz band days, but it’s still fun to mess around with. Notes can be arranged in different ways, and then you just mouse over the hexagons to play.
-
From The Malaria Atlas Project, a global map of estimated accessibility to cities:
In the present study, we quantify and validate global accessibility to high-density urban centres at a resolution of 1×1 kilometre for 2015, as measured by travel time. The last global mapping effort to measure accessibility was for the year 2000, a time that predates both substantial investment and expansion of transportation infrastructure and an extraordinary improvement in the data quantity and quality of accessibility measures. The game-changing improvement underpinning this work is the first-ever, global-scale synthesis of two leading roads datasets – Open Street Map (OSM) data and distance-to-roads data derived from the Google roads database – which resulted in a nearly five-fold increase in the mapped road area relative to that used to produce the circa 2000 map.
The dark areas are the most fascinating.
-
Dan Hurley, reporting for The New York Times, describes the use of statistical software to assist call screeners:
[T]he decision to screen out or in was not Byrne’s alone. In August 2016, Allegheny County became the first jurisdiction in the United States, or anywhere else, to let a predictive-analytics algorithm — the same kind of sophisticated pattern analysis used in credit reports, the automated buying and selling of stocks and the hiring, firing and fielding of baseball players on World Series-winning teams — offer up a second opinion on every incoming call, in hopes of doing a better job of identifying the families most in need of intervention. And so Byrne’s final step in assessing the call was to click on the icon of the Allegheny Family Screening Tool.
I’m glad Hurley highlights the challenges of the inherent biases in the data and the algorithms later in the article. It’s one thing to use data to estimate player value in sports. It’s another thing to use data to decide whether or not to send help to someone calling the police. [Thanks, Jennifer]
-
The past few days in California have been non-stop rain, but in the months before that, there were unprecedented wildfires in the state. Lauren Tierney, reporting for The Washington Post, provides an overview along with a scale comparison of 2017’s biggest fire against anywhere on the globe.
-
FiveThirtyEight used a dataset on broadband as the basis for a couple of stories. The data appears to be flawed, which makes for a flawed analysis. From their post mortem:
We should have been more careful in how we used the data to help guide where to report out our stories on inadequate internet, and we were reminded of an important lesson: that just because a data set comes from reputable institutions doesn’t necessarily mean it’s reliable.
Then, from Andrew Gelman and Michael Maltz, there was the closer look at data collected by the Murder Accountability Project, which has its merits but also some holes:
if you’re automatically sifting through data, you have to be concerned with data quality, with the relation between the numbers in your computer and the underlying reality they are supposed to represent. In this case, we’re concerned, given that we did not trawl through the visualizations looking for mistakes; rather, we found a problem in the very first place we looked.
There’s also the ChestXray14 dataset, which is a large set of x-rays used to train medical artificial intelligence systems. Radiologist Luke Oakden-Rayner looked closer, and the dataset appears to have its issues as well:
In my opinion, this paper should have spent more time explaining the dataset. Particularly given the fact that many of the data users will be computer scientists without the clinical knowledge to discover any pitfalls. Instead, the paper describes text mining and computer vision tasks. There is one paragraph (in eight pages), and one table, about the accuracy of their labeling.
For data analysis to be meaningful, for it to actually work, you need that first part to be legit. The data. If the data collection process rates poorly, missing data outnumbers observations, or computer-generated estimates aren’t vetted by a person, then there’s a good chance anything you do afterwards produces questionable results.
Obviously this isn’t to say avoid data altogether. Every abstraction of real life comes with its pros and cons. Just don’t assume too much about a dataset before you examine it.
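One quick way to examine a dataset before trusting it is to count how much of each column is actually there. The Python sketch below is just that, a sketch: it flags columns where missing values outnumber observations. The records, the column names, and the `is_missing` and `missing_report` helpers are all hypothetical, and what counts as "missing" (here `None`, empty strings, and NaN) is an assumption you would adjust for your own data.

```python
import math


def is_missing(value):
    """Treat None, empty strings, and NaN as missing (an assumption)."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def missing_report(rows, columns):
    """Return {column: (missing_count, observed_count)} for a list of dicts."""
    report = {}
    for col in columns:
        missing = sum(1 for row in rows if is_missing(row.get(col)))
        report[col] = (missing, len(rows) - missing)
    return report


# Made-up records standing in for a real dataset
rows = [
    {"county": "A", "speed_mbps": 25.0},
    {"county": "B", "speed_mbps": None},
    {"county": "C", "speed_mbps": float("nan")},
    {"county": "", "speed_mbps": None},
]

report = missing_report(rows, ["county", "speed_mbps"])

# Columns where missing data outnumbers observations deserve a closer look
flagged = [col for col, (missing, observed) in report.items() if missing > observed]
print(report)
print(flagged)
```

A check like this only catches gaps. It says nothing about whether the values that are present reflect reality, which still takes a person looking at the data.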
-
How to Make Venn Diagrams in R
The usually abstract, qualitative and sometimes quantitative chart type shows relationships. You can make them in R, if you must.