The Rush Hour puzzle game was invented by Nob Yoshigahara in the 1970s and made its way to the United States in the 1990s. There are vehicles of varying length in a parking lot, and you have to figure out how to get one of the cars out by shifting all the others inside a six-by-six grid. Michael Fogleman wrote a solver and generator for the game, resulting in a database of 1.5 million puzzles.
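For a feel of what a solver does, here is a minimal breadth-first search sketch. It is not Fogleman's implementation. The vehicle encoding `(row, col, length, horizontal)`, the convention that vehicle 0 is the car to free and exits to the right, and the choice to count each single-cell slide as one move are all assumptions made for illustration.

```python
from collections import deque

SIZE = 6  # six-by-six grid


def occupied(vehicles):
    """Return the set of (row, col) cells covered by any vehicle."""
    cells = set()
    for row, col, length, horizontal in vehicles:
        for k in range(length):
            cells.add((row, col + k) if horizontal else (row + k, col))
    return cells


def neighbors(vehicles):
    """Yield every state reachable by sliding one vehicle by one cell."""
    taken = occupied(vehicles)
    for i, (row, col, length, horizontal) in enumerate(vehicles):
        for step in (-1, 1):
            if horizontal:
                new_col = col + step
                # The single cell that the vehicle newly covers after the slide.
                front = (row, new_col if step < 0 else new_col + length - 1)
                moved = (row, new_col, length, horizontal)
            else:
                new_row = row + step
                front = (new_row if step < 0 else new_row + length - 1, col)
                moved = (new_row, col, length, horizontal)
            in_bounds = 0 <= front[0] < SIZE and 0 <= front[1] < SIZE
            if in_bounds and front not in taken:
                yield vehicles[:i] + (moved,) + vehicles[i + 1:]


def solve(start):
    """Breadth-first search: fewest single-cell slides to free vehicle 0, or None."""
    start = tuple(start)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        state, moves = queue.popleft()
        _, col, length, _ = state[0]  # vehicle 0 is the car to free (assumption)
        if col + length == SIZE:  # its right end has reached the exit edge
            return moves
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, moves + 1))
    return None


# Hypothetical toy puzzle: a car on row 2 blocked by one vertical truck.
puzzle = ((2, 0, 2, True), (0, 2, 3, False))
print(solve(puzzle))
```

Because breadth-first search explores states in order of distance from the start, the first solved state it reaches uses the fewest slides. A generator could, in principle, run the same search over many candidate layouts and keep the ones that need the most moves.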
Earlier this year, The New York Times investigated fake followers on Twitter and showed very clearly that they were a problem. It’s hard to believe that Twitter didn’t already know about the scale of the issue, but after the story, the company finally started to work on it.
An investigation by The New York Times in January demonstrated that just one small Florida company sold fake followers and other social media engagement to hundreds of thousands of users around the world, including politicians, models, actors and authors. The revelations prompted investigations in at least two states and calls in Congress for intervention by the Federal Trade Commission. In interviews this week, Twitter executives said that The Times’s reporting pushed them to look more closely at steps the company could take to clamp down on the market for fakes, which is fueled in part by the growing political and commercial value of a widely followed Twitter account.
This is statistics driving positive change instead of just advertising. I’m ready for more of this.
Using OpenStreetMap data, Geoff Boeing charted the orientation distributions of major cities:
Each of the cities above is represented by a polar histogram (aka rose diagram) depicting how its streets orient. Each bar’s direction represents the compass bearings of the streets (in that histogram bin) and its length represents the relative frequency of streets with those bearings.
So you can easily spot the gridded street networks, and then there’s Boston and Charlotte that are a bit nutty. Check out Boeing’s other chart for orientation of major non-US cities.
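The binning behind a polar histogram like the one described above can be sketched in a few lines. This is only an illustration, not Boeing's code: the 36-bin default, the bins centered on due north, the added reverse bearings for two-way streets, and the sample bearings are all assumptions.

```python
from collections import Counter


def with_reverse(bearings):
    """Treat each street as two-way by adding its opposite bearing (assumption)."""
    return list(bearings) + [(b + 180) % 360 for b in bearings]


def bearing_histogram(bearings, num_bins=36):
    """Return the relative frequency of bearings in each compass bin.

    Bins are centered on 0, 10, 20, ... degrees when num_bins is 36, so
    bearings of 359 and 1 degrees land in the same north-facing bin.
    """
    bin_width = 360 / num_bins
    counts = Counter()
    for b in bearings:
        # Shift by half a bin so that bin 0 is centered on due north.
        idx = int(((b % 360) + bin_width / 2) // bin_width) % num_bins
        counts[idx] += 1
    total = sum(counts.values())
    return [counts[i] / total if total else 0.0 for i in range(num_bins)]


# Made-up bearings for a mostly gridded neighborhood.
sample = [0, 90, 1, 89, 359, 91, 45]
hist = bearing_histogram(with_reverse(sample))
for i, freq in enumerate(hist):
    if freq > 0:
        print(f"{i * 10:3d} degrees: {freq:.2f}")
```

Each value in `hist` would become one bar's length, drawn at the bin's compass direction. A gridded city puts nearly all of its weight in four bins, while a city like Boston spreads it around the circle.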
The trade war started in January of this year when the administration imposed tariffs on 18 solar panel and washing machine products. Then the United States imposed more, and countries returned the favor on U.S. products, which ballooned the product count to 10,000. Keith Collins and Jasmine C. Lee for The New York Times chronicled the shifts with force-directed bubbles.
So many bubbles. Maybe we should just get it over with and impose tariffs on all the things now.
Mike Loukides, Hilary Mason, and DJ Patil published a first post in a series on data ethics on O’Reilly.
We particularly need to think about the unintended consequences of our use of data. It will never be possible to predict all the unintended consequences; we’re only human, and our ability to foresee the future is limited. But plenty of unintended consequences could easily have been foreseen: for example, Facebook’s “Year in Review” that reminded people of deaths and other painful events. Moving fast and breaking things is unacceptable if we don’t think about the things we are likely to break. And we need the space to do that thinking: space in project schedules, and space to tell management that a product needs to be rethought.
Because data might just be computer output — cold and mechanical — but what data represents and the things it leads to are not.
On July 24-25 from 10am-5pm ET, Metis will host its free Demystifying Data Science live online conference for aspiring data scientists and data-curious business professionals. Attendees will experience a total of 28 interactive data science talks from industry-leading speakers.
Day 1 (July 24): For Aspiring Data Scientists
Hear talks on the training, tools, and career paths to the best job in the United States, featuring a keynote by Lillian Pierson, CEO of Data-Mania LLC.
Day 2 (July 25): For Data-Curious Business Leaders
Speakers explain how to integrate data science into your organization and how it all applies to you. The day includes a keynote from Beth Comstock, author and former Vice Chair of General Electric.
Each talk is an 18-minute live presentation followed by a Q&A session with questions from the audience. All registrants will have access to recorded versions of the presentations post-conference. Register for free here!
The eighth Thai boy was rescued from the flooded cave recently. Great news. The South China Morning Post has a series of graphics to explain the rescue path and strategy.
Things have a way of repeating themselves, and it can be useful to highlight these patterns in data.
Benjamin Pavard from France made a low-probability goal the other day. Seth Blanchard and Reuben Fischer-Baum for The Washington Post explain the rarity and use it as a segue into expected versus actual goals to gauge how teams have played.
This statistic can also tell us which teams are over and under-producing given their level of play so far, by comparing their expected goals and actual results. Surprise quarterfinalist Russia is the biggest overproducer, with an actual goal differential of +4 compared with an expected goal differential of -1.7. This can mean a lot of things. The team could be getting a bit lucky, or just playing extremely well in such a way that they finish more hard challenges than you would normally expect.
Seems right, I think. I mean, I have to take it at face value, as the sports world is essentially dead to me until basketball season starts again.
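The over- and under-producing comparison in the quote comes down to a subtraction. A tiny sketch using the two Russia figures from the excerpt (the function name is made up for illustration):

```python
def overproduction(actual_goal_diff, expected_goal_diff):
    """Goal differential above (positive) or below (negative) expectation."""
    return round(actual_goal_diff - expected_goal_diff, 1)


# Figures quoted above: actual +4, expected -1.7.
russia = overproduction(actual_goal_diff=4, expected_goal_diff=-1.7)
print(f"Russia is {russia} goals ahead of its expected goal differential")
```

So the gap is 5.7 goals, which by itself doesn't say whether it comes from luck or from finishing hard chances well.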
In the early 1990s, the CIA published internal survey results for how people within the organization interpreted probabilistic words such as “probable” and “little chance”. Participants were asked to attach a probability percentage to the words. Andrew Mauboussin and Michael J. Mauboussin ran a public survey more recently to see how people interpret the words now.
The main point, as in the CIA poll, was that words matter. Some words like “usually” and “probably” are vague, whereas “always” and “never” are more certain.
I wonder what the results would look like if, instead of showing a word and asking for a probability, you flipped it around. Show a probability and then ask people for a word to describe it. I’d like to see that spectrum.
Pedro M. Cruz, John Wihbey, Avni Ghael and Felipe Shibuya from Northeastern University used a tree metaphor to represent a couple of centuries of immigration in the United States:
Like countries, trees can be hundreds, even thousands, of years old. Cells grow slowly, and the pattern of growth influences the shape of the trunk. Just as these cells leave an informational mark in the tree, so too do incoming immigrants contribute to the country’s shape.
After an unsuccessful battery search, the natural next step was of course to look up battery sizes and chart all of them.
Microsoft released a comprehensive dataset for computer-generated building footprints in the United States. The method:
We developed a method that approximates the prediction pixels into polygons making decisions based on the whole prediction feature space. This is very different from standard approaches, e.g. Douglas-Peucker algorithm, which are greedy in nature. The method tries to impose some of a priori building properties, which are, at the moment, manually defined and automatically tuned.
The GeoJSON files for each state are available for download, released under the Open Data Commons Open Database License. Nice.
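For context on the "standard approach" the quoted method is contrasted with, here is a minimal sketch of Douglas-Peucker line simplification. To be clear, this is the textbook algorithm and not Microsoft's method. The sample outline and tolerance are made up, and the distance is measured to the infinite line through the endpoints, a common simplification.

```python
import math


def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (x, y), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:  # a and b coincide
        return math.hypot(x - x1, y - y1)
    return abs(dy * (x - x1) - dx * (y - y1)) / math.hypot(dx, dy)


def douglas_peucker(points, tolerance):
    """Drop points that lie within tolerance of the simplified line."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    # Greedy step: find the single point farthest from the first-last chord.
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax <= tolerance:
        return [first, last]
    # Keep that point and simplify each side independently.
    left = douglas_peucker(points[: index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right  # the split point appears in both halves


# Hypothetical noisy outline.
outline = [(0, 0), (1, 0.1), (2, 0), (3, 5), (4, 0)]
print(douglas_peucker(outline, tolerance=1.0))
```

The greedy part is the split: each recursion commits to the farthest point and never revisits that choice, and nothing in it knows that buildings tend to have straight walls and right angles. That is presumably the gap the quoted method tries to close.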
LeBron James decides where he takes his talents this summer, and the sports news outlets continue to review every scenario as rumors trickle in. Neil Paine and Gus Wezerek for FiveThirtyEight present their quantitative solution, sending James to the Philadelphia 76ers.
On one hand, they consider the chances of winning a championship in the next four years, based on projection models. On the other hand, they consider a more subjective rating in legacy-building. All in good fun of course.
I always wonder what it’s like for professional athletes who have to make these sorts of decisions. Much of their job is seemingly data-driven, but does someone like James even care about this stuff? Or is it all by feel? I imagine switching jobs to a new city, and I think I’d look at a few numbers initially, but it’d all filter down to the place where my family was happiest, data be damned.
It's important to consider the reasons so that we don't overreact. Otherwise, we're just berating, pointing, and laughing all of the time, and that's not good for anyone.
ProPublica compiled spending data from a wide range of sources to calculate the total, which is still an undercount:
The vast majority of the money — at least $13.5 million, or more than 84 percent of what we tracked — was spent by Trump’s presidential campaign (including on Tag Air, the entity that operates Trump’s personal airplane). Republican Senate and House political committees and campaigns have shelled out at least another $2.1 million at Trump properties. At least $400,000 has been spent by federal, state and local agencies. (For example, the Florida Police Chiefs Association held its summer conference last year at the Trump National Doral Miami.) The state and local tally appears to be a gross undercount because of the agencies’ spotty disclosures and reporting.
Messy headed into the presidency and messy still.
Catch the interactive visualization by ProPublica and Fathom Information Design. It shows the available records in more detail, and you can download the data for yourself.
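As a rough check on the quoted shares, you can add up the three figures in the excerpt. This assumes the three categories cover everything tracked, and each figure is an "at least" lower bound, so treat the result as approximate.

```python
# Lower-bound figures from the quoted passage, in dollars.
spending = {
    "Trump presidential campaign": 13_500_000,
    "Republican committees and campaigns": 2_100_000,
    "Federal, state and local agencies": 400_000,
}

total = sum(spending.values())
shares = {name: amount / total for name, amount in spending.items()}

print(f"Tracked total (approximate): ${total:,}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The campaign's share comes out a little over 84 percent of roughly $16 million, which lines up with the "more than 84 percent" in the quote.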
Showing geospatial data in a single web interface, kepler.gl helps users quickly validate ideas and glean insights from these visualizations. Using kepler.gl, a user can drag and drop a CSV or GeoJSON file into the browser, visualize it with different map layers, explore it by filtering and aggregating it, and eventually export the final visualization as a static map or an animated video.
It plays nice with Mapbox if that’s your jam.
The U.S. Department of Education constantly investigates school districts and colleges for civil rights violations. Lena Groeger and Annie Waldman for ProPublica made the data more accessible, providing the status of past, present, and pending investigations. Search for the place of interest, and you get a calendar and list view of all the cases on record.
Condé Nast Traveler got 70 people from 70 different countries to count money on camera. Many times I found myself wondering, “Why would you ever do it like that?” There’s a metaphor for data and its interpretation somewhere in there.