Let’s say you have this idea for a visualization or application, or you’re just curious about some trend. But you have a problem. You can’t find the data, and without the data, you can’t even start. This is a guide to where you can find the data you’re looking for, along with a list of sources. There’s a lot out there.
Being a graduate student, I always look to the library for books and resources. Many libraries are amping up their technology and have some expansive data archives. Many statistics departments also tend to keep a list of data somewhere.
- Data and Story Library – An online library of datafiles and stories that illustrate the use of basic statistics methods, from Carnegie Mellon
- Berkeley Data Lab – Part of the UC Berkeley library system. Hey, they’re even on Twitter.
- UCLA Statistics Data Sets – Some of the data that UCLA stat uses in their labs and assignments.
I’m sure you’ve seen a graphic in the paper or, more likely, on a news site, and wondered about another aspect of the data. Major news organizations always cite their sources, either somewhere on the graphic or in the accompanying article. It’s usually not a direct link, but a quick online search will get you to the right place. Sometimes, you’ll have to email someone to get the same data, but those people are usually happy that you’re interested in their data or analysis.
- The New York Times – They also have several data-rich APIs
- Wall Street Journal
- Guardian Datablog – Provides a lot of free-to-use data via Google spreadsheets.
Got some mapping software, but no geographic data? You’re in luck. There are plenty of shapefiles, etc. at your disposal.
- TIGER – From the US Census Bureau, detailed data about roads, railroads, rivers, and zipcodes. Probably the most extensive you’re going to find.
- OpenStreetMap – One of the best examples of data and community effort.
- Geocommons – Both data and a map maker.
- Flickr Shapefiles – Boundaries as defined by Flickr users.
America loves its sports, and thus has decades of sports data. You’ll find it on Sports Illustrated or the sports organizations’ sites, but you’ll find even more on sites dedicated to the data.
There are several noteworthy international organizations that keep data about the world, mainly health and development indicators. It does take some sifting though, because a lot of the data sets are pretty sparse. It’s not easy to get standardized data across countries with varied methods.
- Global Health Facts
- UNdata – Most of the data I used for Progress came from this data search engine from the United Nations.
- World Health Organization
- OECD Statistics
Government and Politics
With the new administration, there’s been a fresh emphasis on data and transparency, so there are a lot of government organizations that supply data. They’ve been doing this for a while, but with the launch of data.gov, much of the data is ending up in one place. There are also plenty of non-governmental sites that aim to make politicians more accountable.
- Census Bureau – Incredibly important data about the country with more effect on your life than you probably know
- DataSF – San Francisco recently launched their own data site. Hopefully, other cities follow suit. Check out the showcase.
- Follow the Money
- OpenSecrets – The interesting site MAPlight is powered by data from OpenSecrets.
You’re usually going to find the best data straight from the source, but there are lots of applications and sites that try to make all data easier to find or easier to access.
- Freebase – Free data and a community effort. For some types, the data are kind of sparse, but it continues to get better.
- Many Eyes – More of a visualization and exploratory site than for data, but they do have a data section.
- Infochimps – Did you get your invite?
- Amazon Public Data Sets
- DBpedia – Allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data.
- Wikipedia – Lots of HTML tables. Copy and paste into Excel.
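If copy and paste gets tedious, the HTML tables can also be pulled out with a short script. Here is a minimal sketch using only Python’s standard library. The names `TableParser`, `html_tables`, and `to_csv` are my own, the sample table is made up, and the parser assumes every cell and row has a closing tag and that tables aren’t nested, so treat it as a starting point rather than a general solution.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <table> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def html_tables(html):
    """Return all tables in the HTML as lists of rows of cell strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


def to_csv(rows):
    """Serialize rows as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up sample markup, standing in for a saved page.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>Alpha</td><td>1</td></tr>
</table>
"""

first_table = html_tables(SAMPLE)[0]
csv_text = to_csv(first_table)
```

Write `csv_text` to a file and it opens in Excel or anything else that reads CSV.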
Get it From an API
Plenty of sites and applications make their data freely available via APIs. Twitter has an API (duh). Google has lots of APIs. Yahoo does too. So on and so forth. Visit Programmable Web for a detailed catalog of what’s available.
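Most of these APIs follow the same basic pattern: build a URL with query parameters, request it, and decode the JSON that comes back. Here is a rough sketch of that pattern with Python’s standard library. The endpoint `https://api.example.com/v1/items`, the parameter names, and the record fields are all hypothetical placeholders; a real API will have its own URLs and parameters, and often requires a key.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base, params):
    """Append URL-encoded query parameters to a base URL."""
    return base + "?" + urlencode(params)


def fetch_json(url, timeout=10):
    """Download a URL and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def to_rows(records, fields):
    """Flatten a list of JSON objects into rows, one column per field.

    Fields missing from a record come back as None.
    """
    return [[record.get(field) for field in fields] for record in records]


# Hypothetical endpoint and parameters, for illustration only.
url = build_url("https://api.example.com/v1/items", {"q": "data viz", "page": 2})

# A made-up response body, shaped like what fetch_json(url) might return.
records = json.loads('[{"id": 1, "name": "a"}, {"id": 2}]')
rows = to_rows(records, ["id", "name"])
```

Once the records are flattened into rows, they are easy to write out as CSV or load into whatever tool you use for the visualization.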
Scrape the Data
I’m still figuring out how to scrape AJAX-based sites though. I’d be happy to hear any tips from anyone who has experience with that.
Did I miss anything? Where do you get your data from?