  • Extract CSV data from PDF files with Tabula

    April 8, 2014  |  Software


    Tabula, by Manuel Aristarán, came out months ago, but I've been poking at government data recently and came back to this useful piece of free software to get the data tables out of countless free-floating PDF files.

    If you've ever tried to do anything with data provided to you in PDFs, you know how painful this is — you can't easily copy-and-paste rows of data out of PDF files. Tabula allows you to extract that data in CSV format, through a simple interface.

    It's not the fastest software in the world, but it really is simple to use and it sure beats manual entry. You just load a PDF file into Tabula, which runs on your computer, highlight the table to extract, and the program does the rest. Save as a CSV and do what you want with it.
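
    As for doing what you want with the CSV, here is a minimal sketch in Python of a first cleanup pass. The sample data, the column names, and the to_number helper are all made up for illustration; they stand in for whatever table you pull out of your own PDF, and none of this is part of Tabula itself.

```python
import csv
import io

# Invented sample standing in for a CSV file saved from Tabula.
exported = io.StringIO(
    "County,Population,Median income\n"
    'Alpha,"12,345","$41,200"\n'
    'Beta,"8,900","$38,750"\n'
)


def to_number(text):
    """Strip the commas and dollar signs that may come along with the values."""
    return int(text.replace(",", "").replace("$", ""))


# Read each row by header name and convert the numeric columns.
rows = [
    {
        "county": row["County"],
        "population": to_number(row["Population"]),
        "median_income": to_number(row["Median income"]),
    }
    for row in csv.DictReader(exported)
]

total_population = sum(row["population"] for row in rows)
print(total_population)
```

    With a real export, you would swap the in-memory sample for open("your-file.csv", newline="") and adjust the column names.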

    Download Tabula here. Find out a little more about it on Source.

  • Interactive maps with R

    February 11, 2014  |  Software

    You can make static maps in R relatively well, if you know what packages to use and what to look for, but there isn't much direct interaction with your graphics. rMaps is a package that helps you create maps that you can mouse over and zoom in to.

    Don't get too excited though. A scan of the docs shows that it's basically a wrapper around the JavaScript libraries Leaflet, DataMaps and Crosslet, so you could learn those directly instead, and you'd be better for it in the long run if you plan to make more maps. But if you're just working on a one-off or must stay in R because your life depends on it, rMaps might be an option.

  • Learn R interactively with the swirl package

    January 29, 2014  |  Software

    R, the statistical computing language of choice and what I use the most, can seem odd to those new to the language or programming. And I think this is what holds a lot of people back and what keeps people stuck in limited software. The swirl package for R helps beginners get over that first hurdle by teaching you within R itself.

    swirl is a software package for the R statistical programming language. Its purpose is to teach users statistics and R simultaneously and interactively. It attempts to do this in the most authentic learning environment possible by guiding users through interactive lessons directly within the R console.

    Assuming you installed R on your computer already, install the package (and the other packages it depends on), make a call to swirl(), and you get a guide through the basics.

  • Introducing R to a non-programmer, in an hour

    January 7, 2014  |  Coding

    Biostatistics PhD candidate Alyssa Frazee was tasked with teaching her sister, an undergraduate in sociology, how to use R. She had only one hour.

    Once you load in a dataset, things start to get fun. We learned a whole bunch of stuff from this data frame, like how to do basic tabulations and calculate summary statistics, how to figure out if you have missing data, and how to fit a simple linear model. This part was pretty fun because my sister started leading the session: instead of me saying "I'm going to show you how to do this," it was her asking "Hey, could we make a scatterplot?" or "Do you think we could put the best-fit line on that plot?" I was really glad this happened — I hope it meant she was engaged and enjoying herself!

    This is the nice thing about R. There are so many built-in functions and packages that you can get something useful with a few lines of code, and you don't really even have to know what a function is to get started (although you should eventually). Then you can go as far down the rabbit hole as you want.

  • Bokeh, a Python library for interactive visualization

    November 22, 2013  |  Software


    Bokeh, a Python library by Continuum Analytics, helps you visualize your data on the web.

    Bokeh is a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data to thin clients.

    If you're new to this stuff, you might just want to start with D3.js simply to avoid the Python setup, but if you use Python exclusively already, this might fit well in your workflow.

  • Databases for lazy people, a Python library

    November 15, 2013  |  Software

    Friedrich Lindenberg and Gregor Aisch recently released dataset, a Python library to take the grunt work out of using databases in Python.

    Although managing data in relational database has plenty of benefits, they’re rarely used in day-to-day work with small to medium scale datasets. But why is that? Why do we see an awful lot of data stored in static files in CSV or JSON format, even though they are hard to query and update incrementally?

    The answer is that programmers are lazy, and thus they tend to prefer the easiest solution they find. And in Python, a database isn't the simplest solution for storing a bunch of structured data. This is what dataset is going to change!

    So many times I start with a dataset, try to avoid the busy work in creating a database for a smallish project, and eventually dig up an old script or the most recent version of it. Saving this one for later.
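
    To give a sense of the busy work in question, here is a rough sketch of the general idea using only Python's built-in sqlite3 module: insert plain dictionaries and let the table and columns get created as needed. The insert_row helper is hypothetical and written just for this illustration. It is not the dataset library's API or implementation, so check the library's own documentation for the real thing.

```python
import sqlite3


def insert_row(conn, table, row):
    """Insert a dict, creating the table and any missing columns first.

    Hypothetical helper for illustration only; not part of the dataset library.
    Table and column names are assumed to be trusted, simple identifiers.
    """
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" (id INTEGER PRIMARY KEY)')
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    existing = {info[1] for info in conn.execute(f'PRAGMA table_info("{table}")')}
    for column in row:
        if column not in existing:
            conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{column}"')
    columns = ", ".join(f'"{c}"' for c in row)
    placeholders = ", ".join("?" for _ in row)
    conn.execute(
        f'INSERT INTO "{table}" ({columns}) VALUES ({placeholders})',
        list(row.values()),
    )


conn = sqlite3.connect(":memory:")
insert_row(conn, "people", {"name": "Ada", "age": 36})
# A new key shows up later: the column is added on the fly.
insert_row(conn, "people", {"name": "Grace", "city": "Arlington"})

rows = conn.execute("SELECT name, age, city FROM people ORDER BY id").fetchall()
print(rows)
```

    The point is that you never write a schema up front, which is usually the part that pushes a small project back toward CSV files.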

  • Responsive maps with D3.js

    October 18, 2013  |  Software

    Responsive maps

    A challenge these days with visualization is that a piece might look great on a computer monitor and then break on a tablet or phone. However, if you design your software with that in mind so that it adapts to the device it's on — so that it's responsive — your audience loves you more for it. Chris Amico explains how to get started in D3.js: responsive maps, charts, and legends.

  • R plotting package ggplot2 ported to Python

    October 16, 2013  |  Software

    ggplot2 ported to Python

    Those who use the ggplot2 package in R and do everything else in Python will appreciate this Python port of the package from yhat.

    Excel makes some great looking plots, but I wouldn't be the first to say that creating charts in Excel involves a lot of manual work. Data is messy, and exploring it requires considerable effort to clean it up, transform it, and rearrange it from one format to another. R and Python make these tasks easier, allowing you to visually inspect data in several ways quickly and without tons of effort.

    The preeminent graphics packages for R and Python are ggplot2 and matplotlib respectively. Both are feature-rich, well maintained, and highly capable. Now, I've always been a ggplot2 guy for graphics, but I'm a Python guy for everything else. As a result, I'm constantly toggling between the two languages which can become rather tedious.

    Once you get the Python library installed (and its dependencies), you'll be able to use the same layered graphics approach as the R package, with a similar syntax.

  • Raw, a tool to turn spreadsheets to vector graphics

    October 8, 2013  |  Online Applications

    Sometimes it can be a challenge to produce data graphics in vector format, which is useful for high-resolution prints. Raw, an alpha-version tool by Density Design, helps make the process smoother.

    Primarily conceived as a tool for designers and vis geeks, Raw aims at providing a missing link between spreadsheet applications (e.g. Microsoft Excel, Apple Numbers, OpenRefine) and vector graphics editors (e.g. Adobe Illustrator, Inkscape, Sketch). In this sense, it is not intended to be a full “visualization tool” like Tableau or other similar products: as the name suggests it is a sketch tool, useful for quick and preliminary data explorations as well as for generating editable visualizations.

    Although still in its early stages, Raw is actually quite usable. Start with a dataset copied and pasted from your spreadsheet, select a visualization format, and then click-and-drag how you want to represent values. Modify options as you see fit and download in the format you need.

  • Easier Census data browsing with CensusReporter

    September 17, 2013  |  Online Applications


    Census data can be interesting and super informative, but getting the data out of the dreaded American FactFinder is often a pain, especially if you don't know the exact table you want. (This is typically the case.) CensusReporter, currently in beta, tries to make the process easier.

    CensusReporter is a Knight News Challenge-funded project to make it easy for journalists to write stories using U.S. Census data. Expanding upon the volunteer-built Census.ire.org, Census Reporter will simplify finding and using data from the decennial census and the American Community Survey. The goal of the new site is to include much more data, to provide a friendlier interface for navigating all of that data, and, as much as possible, to use visualizations to provide a more useful first look at the data.

    Although the application is still a work-in-progress, it's usable and clearly on its way to an improvement over the painful default. CensusReporter is faster and easier to use, and the graphics provide a visual summary that helps you decide if the current table is actually what you want.

  • Excel paintings

    September 16, 2013  |  Software

    Tatsuo Horiuchi wanted to learn something new before retiring, so he bought a computer and booted up Microsoft Excel. Traditional graphics software from companies like Adobe was too expensive. Horiuchi's beautiful results look like nothing you normally see come out of the spreadsheet software.

    People like to knock Excel or <insert software program here> as if it's the leading cause of their challenges with data. Then you come across something like this. Maybe it's not the easiest way to go about making something, but if you can draw based on data, you've got yourself a way to visualize data. Maybe the software isn't your problem. [via ReadWrite]

  • Introduction to R, a video series by Google

    August 13, 2013  |  Software

    Google released a 21-part short video series that introduces R. Most of the videos are about two minutes, with none of them going over six, and each one is focused on a single task or concept. So this could be a good way to start. Just open R, start a video, and follow along.

    The first video in the series shows you how to write a simple script and navigate.

    [via Revolutions]

  • Beer Mapper: An experimental app to find the right beer for you

    April 30, 2013  |  Software

    Beer map

    Kevin Jamieson, an electrical and computer engineering graduate student at the University of Wisconsin-Madison, put his work in active ranking into practice. The experimental app is called Beer Mapper.

    The application presents a pair of beers, one pair at a time, from a list of beers that you have indicated you know or have access to and then asks you to select which one you prefer. After you have provided a number of answers, the application shows you a heat map of your preferences over the "beer space."

    Around 10,000 beers with at least 50 reviews on RateBeer were used as the foundation of the recommendation system. The reviews were reduced to just the individual words and counts, which gives sort of a profile for each beer (or a "weighted bag of words"). You rate beers, and the system tries to find profiles that are mathematically most similar.
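
    Here is a toy sketch in Python of what "profiles that are mathematically most similar" could look like. The review text is invented, raw word counts stand in for the weighting, and cosine similarity is my assumption for the similarity measure. The actual app may well do something different, so treat this as an illustration of the bag-of-words idea rather than Jamieson's method.

```python
import math
import re
from collections import Counter

# Invented review snippets; the real system drew on RateBeer reviews.
reviews = {
    "Hoppy IPA": "bitter citrus hops pine bitter hops",
    "Pale Ale": "citrus hops light bitter crisp",
    "Stout": "roasted coffee chocolate dark roasted",
}


def bag_of_words(text):
    """Reduce text to individual lowercase words and their counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count profiles (0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


profiles = {name: bag_of_words(text) for name, text in reviews.items()}


def most_similar(name):
    """Return the other beer whose profile is closest to the named one."""
    others = (other for other in profiles if other != name)
    return max(
        others,
        key=lambda other: cosine_similarity(profiles[name], profiles[other]),
    )


print(most_similar("Hoppy IPA"))
```

    The IPA and the pale ale share words like "hops" and "citrus," so they end up close together, while the stout shares none and scores zero.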

    Two caveats. The first is that it looks like the app just gives you a heat map of the styles of beer you might like. A recommended list of actual beers would be way better. Second, the app is a research project that likely won't be in the app store any time soon, so the first point is moot. Sad face. Maybe Untappd should read Jamieson's paper. [via Fast Company]

  • Binify for hexagon binning in Python

    April 25, 2013  |  Software

    Hexagon binning

    As an alternative to dot density maps, Binify by Kevin Schaul allows you to map with hexagon binning in Python.

    Dot density maps are a straightforward way to visualize location data, but when you have too many locations, points can overlap and obscure clusters and trends. That's where binning comes in. Generally speaking, the goal is to look at an area on a map and then count how many points are within that area. Do that across the entire area.
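
    The counting step can be sketched in a few lines of plain Python. This is not how Binify is implemented. It's a minimal version under a few assumptions of my own: coordinates are treated as flat x and y values, hexagons are pointy-topped with a made-up size of 1, and the function names are just for illustration.

```python
import math
from collections import Counter


def hex_round(q, r):
    """Round fractional axial hexagon coordinates to the nearest hexagon."""
    x, z = q, r
    y = -x - z
    rx, ry, rz = round(x), round(y), round(z)
    dx, dy, dz = abs(rx - x), abs(ry - y), abs(rz - z)
    # Recompute the coordinate with the largest rounding error so x + y + z == 0.
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return rx, rz


def point_to_hex(x, y, size):
    """Map a planar point to the (q, r) hexagon that contains it."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    return hex_round(q, r)


def hex_bin(points, size=1.0):
    """Count how many points fall inside each hexagon."""
    return Counter(point_to_hex(x, y, size) for x, y in points)


# Invented points: two near the origin, three near the neighboring hexagon.
points = [(0.1, 0.1), (-0.2, 0.05), (1.7, 0.0), (1.8, 0.1), (1.75, -0.1)]
counts = hex_bin(points)
print(counts)
```

    Each key is a hexagon and each value is the number of points inside it, which is what you would then map to color.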

    Grab the package on GitHub and go to town.

  • Vega: A visualization grammar to create without programming

    April 2, 2013  |  Software

    Population with Vega

    Visualization online can be a challenge if you don't know how to program. Analytics startup Trifacta just lightened the load with Vega, a "visualization grammar" that lets you create and share by editing a JSON file. Check out the demo live editor to see how this works. Select different chart types from the drop down menu on the top left, which you can render in HTML5 Canvas or SVG.

    Of note: Vega is built on top of Data-Driven Documents.

    To get right to the point: Vega is NOT intended as a "replacement" for D3. D3 is intentionally a low-level system. During the early design of D3, we even referred to it as a "visualization kernel" rather than a "toolkit" or "framework". In addition to custom design, D3 is intended as a supporting layer for higher-level visualization tools. Vega is one such tool, and leverages D3 heavily within its implementation.

    Gonna keep an eye on this one.

  • Forecast: A weather site that’s easier to read

    March 27, 2013  |  Online Applications


    When you go to one of the major sites to look up the weather, it's often hard to find what you're looking for. The sites feel dated, there isn't much hierarchy to the information, and navigation gets buried in the show-as-much-information-as-possible-on-the-same-page approach. Forecast, a site by the makers of the Dark Sky app, hopes to improve that experience during those times you need more than the highs and lows for the day from the nearest widget.

    When you visit Forecast, you notice a difference right away. There's a map with local, regional, and global views, the temperature in large print on the right, and there are descriptions about what to expect that are easy to understand.

    From there, you get your daily forecasts below the map with details on demand. So you can get a lot of the same information that you get from larger sites, but you don't get hit with a bunch of data at once, and when you request more information, you get it quickly.

    There's also an API. Forecast and the Dark Sky app both run on it, which is the cherry on top of the goodness.

    I usually go to Matthew Ericson's minimalist weather page when I'm figuring out when to ride my bike or mow the lawn. Forecast might be my new weather destination for a while.

  • Learn about politics in your state with Open States

    February 26, 2013  |  Online Applications

    Open States

    It's not especially straightforward to know or find out what's going on with your state's government. Sites aren't maintained, are unusable, or just don't provide much information. Open States, a project by the Sunlight Foundation, aims to change that.

    After more than four years of work from volunteers and a full-time team here at Sunlight we're immensely proud to launch the full Open States site with searchable legislative data for all 50 states, D.C. and Puerto Rico. Open States is the only comprehensive database of activities from all state capitols that makes it easy to find your state lawmaker, review their votes, search for legislation, track bills and much more.

    Just click on a state or enter an address, and you can quickly get information that's relevant to where you are. There are also iPhone and iPad apps if you prefer those, and all the data on the site is accessible via an API or a bulk data dump.

  • Sitegeist: A mobile app that tells you about your data surroundings

    December 14, 2012  |  Software

    From businesses to demographics, there's data for just about anywhere you are. Sitegeist, a mobile application by the Sunlight Foundation, puts the sources into perspective.

    Sitegeist is a mobile application that helps you to learn more about your surroundings in seconds. Drawing on publicly available information, the app presents solid data in a simple at-a-glance format to help you tap into the pulse of your location. From demographics about people and housing to the latest popular spots or weather, Sitegeist presents localized information visually so you can get back to enjoying the neighborhood. The application draws on free APIs such as the U.S. Census, Yelp! and others to showcase what's possible with access to data.

    Available for free on both Android and iPhone. Data just a flick and a scroll away. [Thanks, Nicko]

  • Shiny allows web applications with R

    November 13, 2012  |  Software

    Shiny for R

    RStudio, the folks behind the IDE for R released last year, continues to expand their offerings for current and future R users. Shiny is RStudio's most recent release, and it aims to make R web applications easier to make and share.

    The main advantage is that you can create user interfaces that show R output, without HTML and JavaScript. There are essentially two parts to each app that you write: the client and the server. You load the Shiny package, create a client and server, and you're off to the races.

    However, don't get too excited about R on the Web yet. The apps are meant to run locally, so to share an application with someone, you have to send them the code for them to run on their own. RStudio is working on a paid service that lets you host your apps online. Or, because Shiny is open source, you can try running it on your own, if you like.

  • xkcd-style charts in R, JavaScript, and Python

    October 19, 2012  |  Software

    xkcd-style plots

    The ports and packages to make your charts look like they came out of the web comic xkcd are coming out in rapid fashion. Dan Foreman-Mackey stylized charts in JavaScript using D3, Mark Bulling did the same in R, and Jake Vanderplas described how he did it in Python. Still waiting for a Gangnam theme.

Unless otherwise noted, graphics and words by me are licensed under Creative Commons BY-NC. Contact original authors for everything else.