• Back in 2008, the New York Times rolled out a campaign finance API so that you could easily access data based on Federal Election Commission filings. (If you’ve tried grabbing data directly from the source, you know this is a pain.) ProPublica took the reins a few days ago as we lead up to this year’s elections.

    Like millions around the world, you’re probably like, “What the what? I thought the FEC released their own API recently!” They did. But:

    One big difference is timeliness: the FEC API is updated nightly, while ours will be updated throughout each day. For many users of campaign finance data, that distinction may not be a big deal, but on filing days, when thousands of filings are submitted to the FEC, timeliness can matter a lot. Another is the source data: the FEC considers electronic filings to be “unofficial” in the sense that data from them is then brought into agency databases before being published as bulk data. The FEC API publishes data only from those official tables, while the ProPublica API has data from both the official tables and the raw electronic filings.

    I’d trust the ProPublica one more for now.

  • There’s a lot of data on criminal justice (prison populations, crime rates, police policies, etc.), but it can be hard to find, because it’s scattered across and buried deep within thousands of local sites. Hall of Justice from the Sunlight Foundation is an effort to catalog a significant portion of reports and datasets.

    While not comprehensive, Hall of Justice contains nearly 10,000 datasets and research documents from all 50 states, the District of Columbia, U.S. territories and the federal government. The data was collected between September 2014 and October 2015. We have tagged datasets so that users can search across the inventory for broad topics, ranging from death in custody to domestic violence to prison population. The inventory incorporates government as well as academic data.

    Dealing with those pesky government PDF files is up to you. At least there’s an app for that.

  • On the PolicyViz podcast, Kim Rees of Periscopic and Mushon Zer-Aviv of Shual Design Studio discuss whether or not empathy plays a role in visualization. Stuff on this topic tends to be annoyingly dismissive or hand-wavy, but this is a good chat worth listening to.

    There’s a little bit of swearing, so maybe put on headphones if you’re in an area where that is frowned upon.

  • Adam Pearce charted minute-by-minute point differentials for NBA games during the 2014-15 season.

    To squeeze distribution in, I had to make a couple of trade offs. Instead of being able to encode point differentials with vertical position like I did with my Golden State’s win streak chart, I used color for point difference and saved vertical position for distribution. Since there have been more score differences (GSW was beating by MEM 52 at one point) than can be usefully encoded as unique colors, I bucketed the score differences into 7 colors.

    Nifty.

    And go Warriors.
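
    The bucketing step Pearce describes can be sketched in a few lines. This is only an illustration: the cutoffs and color names below are made up, since the post doesn't say which thresholds or palette the chart actually uses.

```python
from bisect import bisect_right

# Hypothetical cutoffs in points; the chart's real thresholds aren't given.
# Six cutoffs split the number line into seven buckets.
CUTOFFS = [-20, -10, -3, 4, 11, 21]

# Placeholder names for a diverging color scale, one per bucket.
COLORS = ["dark red", "red", "light red", "gray",
          "light blue", "blue", "dark blue"]


def bucket_color(point_diff: int) -> str:
    """Map a point differential to one of seven colors."""
    # bisect_right counts how many cutoffs are <= point_diff,
    # which is exactly the index of the bucket it falls in.
    return COLORS[bisect_right(CUTOFFS, point_diff)]


if __name__ == "__main__":
    for diff in (-25, -3, 0, 4, 52):
        print(diff, bucket_color(diff))
```

    With these made-up cutoffs, anything within three points either way lands in the neutral bucket, and a 52-point lead falls in the same top bucket as a 21-point one.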

  • Guess the Correlation is a straightforward game where you do just that, and it’s surprisingly fun. You get a scatterplot and you guess the correlation coefficient. That’s it. If you’re off by too much, you lose a life, and if you’re almost spot on, you gain a life. If you’re somewhat right, you get a coin. Bonus points for streaks of correct guesses.

    Have at it.
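
    If you want to check your guesses against your own scatterplots, the number you're guessing is the Pearson correlation coefficient. A minimal sketch in plain Python follows; the function name is mine, and the game's scoring thresholds aren't covered because they aren't stated here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance, up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances, up to the same factor).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    print(pearson_r([1, 2, 3, 4], [1, 3, 2, 4]))
```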

  • What if you relived life’s activities in big clumps? Thirty years of sleeping in one go. Five months sitting on the toilet. Based on David Eagleman’s Sum: Forty Tales from the Afterlives, this short film by Temujin Doran imagines such a life. Watch to the end.

    [via Brain Pickings]

  • Jaakko Seppälä drew ten comic characters, each in its original style and in the style of the other nine. It’s like the same source material can be shown and seen in different ways, communicating different moods and themes. Imagine that.

    See all 10.

  • Charlie Loyd, who works with satellite imagery at Mapbox, put together a 12-second time-lapse of Earth using a day of data from Japan’s weather satellite Himawari-8. The experiment is called Glittering Blue. Derek Watkins used the data to similar effect last year, but Glittering Blue is bigger and higher resolution, which makes it all the more mesmerizing.

  • The Upshot, the data analysis-centric site from the New York Times, has a new editor, and her name is Amanda Cox.

    I have asked Amanda to take on this job because she is the best person to lift The Upshot to new heights. But I also want to note an underlying message in her appointment. Visual journalism – graphics, interactives, photography, video, virtual reality – is a growing part of our report, and it’s an area where we excel. In the future, visual journalists, and those, like Amanda, whose background spans both words and visuals, are a crucial part of the future leadership of The Times.

    So great and well-deserved.

    If you read FlowingData, you’ve seen her work, but if not, here’s a refresher.

  • Erik Bernhardsson downloaded 50,000 fonts and then fed them to a neural network to see what sort of letters a model might come up with.

    These are all characters drawn from the test set, so the network hasn’t seen any of them during training. All we’re telling the network is (a) what font it is (b) what character it is. The model has seen other characters of the same font during training, so what it does is to infer from those training examples to the unseen test examples.

    I especially like the part where you can see a spectrum of generated fonts through varying parameters.

  • David Hagan looked closer at why the 11th of the month appeared to be missing in books. As with many modern curiosities, it began with an xkcd comic.

    First I confirmed that the 11th is actually interesting. There are 31 days and one of them has to be smallest. Maybe the 11th isn’t an outlier; it’s just on the smaller end and our eyes are picking up on a pattern that doesn’t exist. To confirm this is real, I compared actual numbers, not text size. The Ngrams database returns the total number times a phrase is mentioned in a given year normalized by the total number of books published that year. The database only goes up to the year 2008, so it is presumably unchanged from when Randall queried it in 2012.

  • While we’re on the topic of life expectancy, Tim Urban of Wait But Why used a simplified estimate of average life span and then extrapolated for various events in one’s life.

    For example, Urban is 34 years old, so that number of Super Bowls has passed. Then assume a 90-year life span, and you have the number of Super Bowls left in his lifetime. Other extrapolations include winters left set in snow flakes, dumplings to eat set in a dumpling emoji, and time left with parents set with stick figure icons.

    The math is simple, and you can easily do it in your head, but somehow seeing it as icons carries more emotional weight. [Thanks, David]
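
    For the record, the arithmetic behind the charts is just a subtraction. Here is a rough sketch using the 34-year-old and 90-year life span from the post; the function name and the per-year parameter are my own additions for illustration.

```python
def events_left(current_age: int, life_span: int = 90, per_year: int = 1) -> int:
    """Estimate how many more times an event happens in a lifetime.

    Assumes a fixed life span and a constant number of events per year,
    which is the same simplification the post describes.
    """
    years_left = max(life_span - current_age, 0)
    return years_left * per_year


if __name__ == "__main__":
    # One Super Bowl per year, age 34, assumed 90-year life span.
    print(events_left(34))
```

    Swap in a different rate for things that happen more or less often than once a year.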

  • Kaggle just opened up a Datasets section to download and analyze public data.

    At Kaggle, we want to help the world learn from data. This sounds bold and grandiose, but the biggest barriers to this are incredibly simple. It’s tough to access data. It’s tough to understand what’s in the data once you access it. We want to change this. That’s why we’ve created a home for high quality public datasets, Kaggle Datasets.

    It’s still really new and only has a handful of datasets, but it looks interesting. The key is that it’s not just a place to download data. They also provide analysis environments and make it easy to share both the code that uses the data and the results.

    Oftentimes, it’s the getting-started hurdle that gets in the way of working with a large-ish dataset. Maybe this will help set things on the right path.

  • It took forever and it’s way overdue, but the United States Census Bureau has committed to an open source policy, which seems pretty sweet.

    • Foster a community around Census data and tools by encouraging and responding to real-time feedback on how our data products are used by researchers, non-profit, and for profit organizations.
    • Increase our organizational capacity to do more open source by delivering more Free and Open Source Software (FOSS) to the community. FOSS is software that does not charge users a purchase or licensing fee for modifying or redistributing the source code, in our projects and contribute back to the open source community.
    • Identify opportunities to publish existing code under an open source license that may benefit the public.
    • Identify opportunities to create new open source projects, and develop those projects in the open alongside community participation.
    • Adopt industry best practices for managing the lifecycle of our open source projects including standard release management and continuous integration approaches.
    • Encourage “Issues” and accept “Pull Requests” (PRs) from the community.
    • Ensure that new Code Releases and Community Contributions meet the specified guidelines, detailed in the sections below.
      Where feasible to do so, we will automate and also open source any testing procedures and encourage contributors to execute their own tests.

    Of course it all comes down to execution. The organization is not especially speedy, but it’s worth keeping an eye on this. See the current open source projects here.

  • So far we’ve seen when you will die and how other people tend to die. Now let’s put the two together to see how and when you will die, given your sex, race, and age.