PDF data woes

Posted to Data Sharing  |  Nathan Yau

We do not provide these tables in Excel or CSV format. You will have to cut and paste from the pdf.

— A government group that provides a lot of data

If you’re going to provide a dataset to the public, or anyone for that matter, please don’t use PDF as your one and only format. At the very least, provide it in Excel. You can easily export spreadsheets to PDF. I don’t hold anything against the person who sent me this message. She was just doing her job. But organizations need to get with the times and provide data in a way that is actually usable.


  • I actually had this issue with my past employer. We had to take about 600 pages’ worth of a survey across maybe 12 waves and literally copy and paste each line individually into Excel using the copy table function. There is a way of getting around this now, and we were able to export the PDFs, for the most part, almost instantaneously. There were two or three exceptions, but it took us about 3-4 years to actually figure out that you can export a PDF into Microsoft Word. For the other surveys we had to do a heck of a lot of copying and pasting.

  • Wow – what is this, 1998? What government group was that? The best way to spark change is to communicate it so more people can request the real thing.

    • In my case it was a private company that provided the codebooks only in PDF format. We were attempting to catalogue the entire survey for use. So we really got no help there converting from PDF to Excel.

  • Don’t get me started! :)

    State of California, we’re looking at you!

  • I’ve dealt with this frustration so often trying to get FOI data on behalf of journalists lately that we’ve batted around the idea of making some kind of “How to get your data when stumped with a PDF” flow-chart, one that would be partly for humour (xkcd-stylez http://xkcd.com/844/ ) and partly for education …

  • Earlier this year I saw a tweet from a city agency boasting about the “staggering” amount of data they produced. But the document the tweet referenced was a PDF, and one that was obviously produced from an Excel file in the first place. They were keen on sending around the PDF, and only after I prodded them a bit did they think about sharing the raw data (to my knowledge they still haven’t shared the underlying data files). Here’s my writeup of my PDF woes: http://wp.me/pBfcP-9M

    The embarrassing thing is that the agency is in New York City, the self-avowed leading digital city in the US. I think New York still has quite a way to go before legitimately achieving that title.

  • The Buzzdata blog actually had a good write-up suggesting things to try when you hit a roadblock accessing gov data.

  • I get this all the time working with state transportation agencies. It’s incredibly frustrating. The lack of knowledge individuals have about their own agencies’ data is mind-boggling. People actually tell me that they maintain the data in no other format! … Really, you only maintain your databases in PDF, really???

  • I hate this as much as anyone, but it’s not like there aren’t solutions out there.

    Just Google “convert pdf to Excel”.

    • Do you know a program that works well every time? I’ve tried a few and they always get tripped up on the headers and footers of a PDF document. Not to mention scanned PDFs; if that’s the case, forget about it!

      • Having dealt with a lot of government agencies and other sources that send data only as PDF, I’ve tried a bunch of different methods and programs to convert them into some sort of actual data. I’ve had the best luck with Able2Extract, tho depending upon the format of the PDF, the layout of the table and the structure of the data, it may or may not do the trick.

    • The conversion is never that straightforward, and there are too many things that can go wrong.

      • Agreed. In fact it’s a super pain if the files are not formatted in the exact same manner. I had to bug our IT guy for months to figure out a consistent way for us to do this in the future.

  • Why Excel? Why not tab-delimited text file?

    • Tab-delimited would be fine, too. I only say at least Excel, because most of these PDFs are generated by the software anyways. So it’d actually be less work.

      • Nathan,
        There are a few reasons for this: government-types convert data to .pdf files in order to create an electronic equivalent of a paper file (which is what most bureaucrats would rather give you), in a commonly accessible format that can be emailed or posted on a web 1.0 site. Government doesn’t trust *anybody* not to alter the original file and claim it to be the original data. It is waaaaaaay beyond the technical capabilities of most people in the government workforce to tag a file with appropriate metadata or sign a file (using their government-issued key).

  • Short of maybe using a .jpg with an image of the data, I can’t imagine any file format worse for export than the PDF.

    • Try looking up the software Choices 2. I had to use it. It no longer exists and ran on Excel 3.0.
      Literally the worst program/file I have ever had to work with.

  • It’s incredibly widespread and just about every government worker has that program on their desktop. Chances are very good that even an unskilled person will understand when you say that.

    Saying “Excel” is just plain more understandable, and easier for the lowest common denominator of skill to accomplish.

  • Similar discussion from the Sunlight labs, 2009:

    Adobe is Bad for Open Government

  • About 3 years ago, I did a side job with a TV station who wanted to catalog information on daycare facilities around their city, which was in 3 counties. 2 counties provided CSV files, one provided PDF. I got hired to extract and table 900 PDF reports, which all had pretty much the same layout, by converting the PDF into a text file and then using a mess of regular expressions to extract the data into a tabular form. It was a fun exercise for me that got me some coffee money (I don’t do this regularly, I was a friend of someone there who knew I had done something similar before for my own personal stuff) and some more experience as to how to handle problems like this that shouldn’t exist but do.

  • Has anyone considered that maybe they *deliberately* put the information in PDF to make it harder to use? They can still tick the box saying that they’ve released the data….

    • I think that is a given. In general, the Govt wants you to think that you’re free to dig deeper but they don’t really want you to dig.

  • I second Stephen’s comment. I went through a six-month FOIA battle over aid data, and the resulting document was five pages, formatted in such a manner that it had *clearly* just been exported from .xls, and I’d put 2:1 odds on a wager that they were trying to make it more difficult to analyze.

    Two hours of manual entry later… voila! http://haitijustice.wordpress.com/2011/08/23/how-the-government-used-our-money-in-haiti-foia-request/

  • Tips for coders working on this:

    First convert the PDF to XML with Poppler (http://poppler.freedesktop.org/), for example with its pdftohtml -xml tool. Then parse the XML and load up a table data structure. Sort the text boxes by position on the page, because otherwise they are likely to be out of order. Don’t get fooled by the fonts. For example, a math font has the Greek letter mu as an m. In my data, this meant that the numeric prefix for micro was read as milli, causing an error of a factor of 1000! Keep track of the font in order to watch out for this.

    PDF appears to be the native file format for Adobe Illustrator.

    As a last resort you can print and OCR. Accuracy from a fresh printout on decent paper is not too bad. Much better than you get from an old book.

    If there is much interest from this, I can write up more about it and put it on tomacorp.

  • Please don’t advise people to use proprietary or pseudo-proprietary formats like Excel as a data interchange format. There are just so many problems with proprietary formats and Excel/OOXML in particular which can often make it a nightmare to work with.

  • “CUT and paste from the PDF”: maybe that person also cut the PDF from her computer and pasted it onto the interwebs. That government group doesn’t seem to be a bunch of computer experts anyway.

  • I haven’t done it myself, but Stack Overflow has a question of how to convert PDF to Excel and the answers might be helpful: http://stackoverflow.com/questions/704287/converting-pdf-to-excel

  • This is really common in the insurance industry, where brokers and even agencies seem to deliberately provide only hard copies…

  • Excel is a proprietary format and equals fail too. Data needs to be offered, at the very least, in a CSV or flat-file format.

  • Margaret Egan September 15, 2011 at 9:11 am

    If I had a buck for every funder who needs to STEP AWAY from the PDF!! Redundancy for data entry is the bane of our existence, no?

  • Henceforth, every FOIA-type boilerplate request should include the phrase “…in the original format used to create the file(s)” and “Please tell us the program or application — and the version of those programs — used to create the original file(s).”
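
For anyone who wants to try the “Tips for coders” comment above, here is a rough sketch of the parse, sort, and font-tracking steps. It assumes the XML came from Poppler’s pdftohtml -xml, which writes one text element per text box with top, left, and font attributes. The sample XML, the 3-pixel row tolerance, and the list of suspect font names are all made up for illustration; real PDFs will need their own tuning.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape that Poppler's `pdftohtml -xml` emits.
# The boxes are deliberately out of reading order.
SAMPLE_XML = """<pdf2xml>
<page number="1" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="12" family="Helvetica" color="#000000"/>
<fontspec id="1" size="12" family="Symbol" color="#000000"/>
<text top="140" left="200" width="30" height="12" font="0">12</text>
<text top="100" left="50" width="60" height="12" font="0">Sample</text>
<text top="141" left="50" width="20" height="12" font="0">A</text>
<text top="100" left="200" width="40" height="12" font="0">Value</text>
<text top="100" left="300" width="30" height="12" font="0">Unit</text>
<text top="140" left="300" width="10" height="12" font="1">m</text>
</page>
</pdf2xml>"""

# Assumption: font families containing these words may map ordinary
# letters to other glyphs (for example "m" drawn as a Greek mu).
SUSPECT_FONTS = ("symbol", "math")


def read_boxes(xml_text):
    """Return one dict per text box, with its position and font family."""
    root = ET.fromstring(xml_text)
    # Font specs are collected document-wide, since a later page can
    # reuse a font id declared on an earlier page.
    fonts = {f.get("id"): f.get("family") for f in root.iter("fontspec")}
    boxes = []
    for page in root.iter("page"):
        for t in page.iter("text"):
            boxes.append({
                "page": int(page.get("number")),
                "top": int(t.get("top")),
                "left": int(t.get("left")),
                "font": fonts.get(t.get("font"), "unknown"),
                # itertext() also picks up text nested in <b> or <i>.
                "text": "".join(t.itertext()).strip(),
            })
    return boxes


def group_rows(boxes, tolerance=3):
    """Sort boxes by position and group those with nearly equal tops."""
    rows = []
    for box in sorted(boxes, key=lambda b: (b["page"], b["top"], b["left"])):
        last = rows[-1] if rows else None
        same_row = (
            last is not None
            and last[0]["page"] == box["page"]
            and abs(box["top"] - last[0]["top"]) <= tolerance
        )
        if same_row:
            last.append(box)
        else:
            rows.append([box])
    # Within a row, read left to right.
    return [sorted(row, key=lambda b: b["left"]) for row in rows]


def cell_text(box):
    """Flag cells set in a suspect font so a human can check them."""
    if any(word in box["font"].lower() for word in SUSPECT_FONTS):
        return f'{box["text"]} [font: {box["font"]}]'
    return box["text"]


table = [[cell_text(b) for b in row] for row in group_rows(read_boxes(SAMPLE_XML))]
for row in table:
    print("\t".join(row))
```

Flagging the font rather than silently translating the character is a deliberate choice here: whether an “m” in a symbol font really means micro depends on the font, so it seems safer to surface it for review.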
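
The approach in the daycare-reports comment above (convert the PDF to text, then pull fields out with regular expressions) might look something like the sketch below. The report layout, field names, and patterns are hypothetical, invented only to show the shape of the technique; the commenter’s actual reports and expressions are not known. It also skips “Page N of N” footer lines, since headers and footers were mentioned as a common stumbling block.

```python
import csv
import io
import re

# Hypothetical text of two reports after PDF-to-text conversion.
REPORT_TEXT = """\
Facility: Sunny Days Daycare
License No: 12345
Inspection Date: 03/14/2010
Violations: 2
Page 1 of 1
Facility: Little Oaks Center
License No: 67890
Inspection Date: 11/02/2010
Violations: 0
Page 1 of 1
"""

# One pattern per field; the labels are assumptions about the layout.
LINE_PATTERNS = {
    "facility": re.compile(r"^Facility:\s*(.+)$"),
    "license": re.compile(r"^License No:\s*(\d+)$"),
    "date": re.compile(r"^Inspection Date:\s*(\d{2}/\d{2}/\d{4})$"),
    "violations": re.compile(r"^Violations:\s*(\d+)$"),
}
PAGE_FOOTER = re.compile(r"^Page \d+ of \d+$")


def extract_records(text):
    """Turn labelled lines into one dict per report."""
    records, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line or PAGE_FOOTER.match(line):
            continue  # ignore blank lines and page footers
        for field, pattern in LINE_PATTERNS.items():
            match = pattern.match(line)
            if not match:
                continue
            if field == "facility":
                # A "Facility:" line starts a new report.
                current = {"facility": match.group(1)}
                records.append(current)
            elif current is not None:
                current[field] = match.group(1)
            break
    return records


def to_csv(records):
    """Write the records as CSV text, one row per report."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(LINE_PATTERNS), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


records = extract_records(REPORT_TEXT)
print(to_csv(records))
```

This only works when every report shares pretty much the same layout, as the commenter’s did; lines that match no pattern are dropped silently, so a real run would want to log them.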

