30 Resources to Find the Data You Need

October 1, 2009

Topic

Data Sources, Guides / resources

Let’s say you have this idea for a visualization or application, or you’re just curious about some trend. But you have a problem. You can’t find the data, and without the data, you can’t even start. This is a guide and a list of sources for where you can find the data you’re looking for. There’s a lot out there.

Universities

Being a graduate student, I always look to the library for books and resources. Many libraries are amping up their technology and have some expansive data archives. Many statistics departments also tend to keep a list of data somewhere.

DATA SOURCES:

Data and Story Library – An online library of datafiles and stories that illustrate the use of basic statistics methods, from Carnegie Mellon
Berkeley Data Lab – Part of the UC Berkeley library system. Hey, they’re even on Twitter.
UCLA Statistics Data Sets – Some of the data that UCLA stat uses in their labs and assignments.

News Sources

I’m sure you’ve seen a graphic in the paper or, more likely, on a news site, and wondered about another aspect of the data. Major news organizations always put their sources somewhere on the graphic or mention them in the accompanying article. It’s usually not a direct link, but a quick online search will get you to the right place. Sometimes, you’ll have to email someone to get the same data, but those people are usually happy that you’re interested in their data or analysis.

DATA SOURCES:

The New York Times – They also have several data-rich APIs
Wall Street Journal
Guardian Datablog – Provides a lot of free-to-use data via Google spreadsheets.

Geographic Data

Got some mapping software, but no geographic data? You’re in luck. There are plenty of shapefiles, etc. at your disposal.

DATA SOURCES:

TIGER – From the US Census Bureau, detailed data about roads, railroads, rivers, and zipcodes. Probably the most extensive you’re going to find.
OpenStreetMap – One of the best examples of data and community effort.
Geocommons – Both data and a map maker.
Flickr Shapefiles – Boundaries as defined by Flickr users.

Sports

America loves its sports, and thus, has decades of sports data. You’ll find it on Sports Illustrated or the sports organizations’ sites, but you’ll also find more on sites dedicated to the data.

DATA SOURCES:

World

There are several noteworthy international organizations that keep data about the world, mainly health and development indicators. It does take some sifting though, because a lot of the data sets are pretty sparse. It’s not easy to get standardized data across countries with varied methods.

DATA SOURCES:

Global Health Facts
UNdata – Most of the data I used for Progress came from this data search engine from the United Nations.
World Health Organization
OECD Statistics

Government and Politics

With the new administration, there’s been a fresh emphasis on data and transparency, so there are a lot of government organizations that supply data. They’ve been doing this for a while, but with the launch of data.gov, much of the data is finding itself in one place. There are also plenty of non-governmental sites that aim to make politicians more accountable.

DATA SOURCES:

Census Bureau – Incredibly important data about the country with more effect on your life than you probably know
Data.gov
DataSF – San Francisco recently launched their own data site. Hopefully, other cities follow suit. Check out the showcase.
Follow the Money
OpenSecrets – Interesting site MAPlight is powered by data from OpenSecrets.

General Sources

You’re usually going to find the best data straight from the source, but there are lots of applications and sites that try to make all data easier to find or easier to access.

DATA SOURCES:

Freebase – Free data and a community effort. For some types, the data are kind of sparse, but it continues to get better.
Numbrary
Many Eyes – More of a visualization and exploratory site than for data, but they do have a data section.
Infochimps – Did you get your invite?
Swivel
Amazon Public Data Sets
DBpedia – Allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data.
Wikipedia – Lots of HTML tables. Copy and paste in Excel.

Get it From an API

Plenty of sites and applications make their data freely available via APIs. Twitter has an API (duh). Google has lots of APIs. Yahoo does too. So on and so forth. Visit Programmable Web for a detailed catalog of what’s available.
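As a rough illustration of what pulling data from an API looks like, here is a minimal Python sketch using only the standard library. The endpoint URL, the `q` and `limit` parameters, and the `results` key are all hypothetical placeholders rather than any real service's API, so check the documentation of whichever API you use for the actual names.

```python
import json
import urllib.parse
import urllib.request


def build_url(base_url, params):
    """Append URL-encoded query parameters to a base URL."""
    return base_url + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Download a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def to_rows(records, fields):
    """Flatten a list of JSON objects into rows, keeping only the named fields.

    Missing fields become None so every row has the same length.
    """
    return [[record.get(field) for field in fields] for record in records]


# Hypothetical usage (the endpoint, parameters, and "results" key are made up):
#   url = build_url("https://api.example.com/v1/search",
#                   {"q": "unemployment", "limit": 10})
#   payload = fetch_json(url)
#   rows = to_rows(payload.get("results", []), ["year", "value"])
```

Keeping the download step separate from the flattening step means you can save the raw response once and rework the rows as often as you like without hitting the API again.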

Scrape the Data

When all else fails, you can always find a site that serves the data through HTML pages, and then scrape the data with Python, JavaScript, or whatever language you’re comfortable with. I use Python with the Beautiful Soup library, which makes parsing pretty easy.

For example, I scraped weather data from Weather Underground a while back (although I don’t think the script works anymore). I also used it to gather television sizes from CNet.
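To give a sense of what a scraper does, here is a minimal sketch that pulls the rows out of an HTML table. It uses Python's built-in `html.parser` in place of Beautiful Soup so that it runs without installing anything. The sample HTML and its values are made up for illustration, and this is not the script used for the Weather Underground or CNet data.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table cells in an HTML string as a list of rows."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample page; a real scraper would download the HTML first.
SAMPLE = """
<table>
  <tr><th>Date</th><th>High (F)</th></tr>
  <tr><td>2009-10-01</td><td>71</td></tr>
  <tr><td>2009-10-02</td><td>68</td></tr>
</table>
"""

for row in scrape_table(SAMPLE):
    print(row)
```

Real pages are messier than this sample (nested tables, missing closing tags), which is where a more forgiving parser like Beautiful Soup earns its keep.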

I’m still figuring out how to scrape AJAX-based sites though. I’d be happy to hear any tips from anyone who has experience with that.
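One approach, suggested by readers in the comments below, is to watch the browser's network traffic, find the URL the page requests its data from, and read that response directly. The sketch below assumes you have already found such a URL and that it returns either JSON or XML. The example payloads, field names, and the record layout are invented for illustration.

```python
import json
import xml.etree.ElementTree as ET


def parse_response(body, content_type):
    """Turn an AJAX response body into a list of dicts.

    Assumes JSON responses are a list of objects and XML responses are a
    root element whose children each hold one record as attributes.
    """
    if "json" in content_type:
        return json.loads(body)
    if "xml" in content_type:
        root = ET.fromstring(body)
        return [dict(child.attrib) for child in root]
    raise ValueError("unsupported content type: " + content_type)


# Invented payloads standing in for what a data URL might return.
json_body = '[{"city": "Oakland", "temp": 64}]'
xml_body = '<readings><reading city="Oakland" temp="64"/></readings>'

print(parse_response(json_body, "application/json"))
print(parse_response(xml_body, "text/xml"))
```

Note that XML attribute values always come back as strings, so numeric fields need converting before you chart them.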

Did I miss anything? Where do you get your data from?

68 Comments

Ben Hosken — October 1, 2009 at 3:28 am

One great source worth supporting that was just opened up this week!! is http://data.australia.gov.au/

As for scraping AJAX sites… I suggest using Firebug in Firefox which can identify the URL that the data is being pulled from. Then you can “generally” use traditional scraping to grab the data directly.
João Paulo de Paiva — October 1, 2009 at 4:55 am

Rock+Roll Nathan
jerome cukier — October 1, 2009 at 4:58 am

OECD statistics:
http://stats.oecd.org/Index.aspx
http://www.oecdilibrary.org/oecd/content/statistics
Stef — October 1, 2009 at 5:04 am

UNEP’s (United Nations Environment Programme) GEO Data Portal gives access to some 500 different variables, as national, subregional, regional and global statistics or as geospatial data sets (maps), covering themes like Freshwater, Population, Forests, Emissions, Climate, Disasters, Health and GDP. One can display them on-the-fly as maps, graphs, data tables or download the data in different formats.
Bob — October 1, 2009 at 5:56 am

Another source of data (that is more specific to machine learning) is UCI’s Machine Learning Repository – as of Oct 1, 2009, there are 185 data sets.
Bob — October 1, 2009 at 5:58 am

Apologies – forgot to include the link to the UCI ML Repository.

http://archive.ics.uci.edu/ml/
Drew Conway — October 1, 2009 at 7:55 am

One big omission re: university/gov’t politics data is the Inter-University Consortium for Political and Social Research (ICPSR)

http://www.icpsr.umich.edu/icpsrweb/ICPSR/access/index.jsp

The primary source for most heavily cited data sources in the social sciences
- Hadley — October 1, 2009 at 2:39 pm
  
  But their access policies are vicious!
- Frank — October 16, 2009 at 6:23 am
  
  You must be an academic researcher to get access
Drew Conway — October 1, 2009 at 7:58 am

Also, a heads up, your link to the databaseFootball points to the wrong site
Tracy Boyer — October 1, 2009 at 8:15 am

StatSheet definitely needs to be included under Sports. It is the only site I have seen that does stats and visualizations for a variety of sports:

http://statsheet.com/
Pieter — October 1, 2009 at 8:36 am

I don’t want to destroy the party here but maybe you could add something around copyright as well? Scraping data from a website is (in most cases) illegal. Even if you use an API you should read the license to see how it can be used.

I know it’s annoying, but you should take these things into account. That’s why Freebase, OpenStreetMap, and similar sites are nice: they have liberal licenses that allow you to use the data in almost any way you want. Google is much stricter, for example; you cannot use Google Maps without the Google logo or even change its colors.

Just my 2 cents…
- Nathan Yau — October 1, 2009 at 2:13 pm
  
  i agree with you 100%
- Hadley — October 1, 2009 at 2:38 pm
  
  That’s not correct (at least in the US) – data is not copyrightable.
  - Nathan Yau — October 1, 2009 at 2:48 pm
    
    but what about permissions? so you can scrape and use whatever you like?
  - jcukier — October 1, 2009 at 5:19 pm
    
    data is -often- not copyrightable under US law.
  - Andy H — October 4, 2009 at 3:18 pm
    
    Here is a decent summary of data copyright laws: http://answers.google.com/answers/threadview/id/778789.html
    
    It is quite murky.
- alex — October 15, 2009 at 6:16 pm
  
  illegal now, itunes later.
  - John W. Palmer — October 27, 2009 at 11:35 pm
    
    Remember this comment, Alex. Seriously. I think there is an incredible weight in what you just said as it pertains to the future of this field.
Peter F. Couvares — October 1, 2009 at 12:37 pm

We’ve also aggregated a large repository of social and government data (much of it auto-updated from the primary sources) at http://verifiable.com . All of it can be visualized using our software or downloaded in raw form (for free) for use with other tools.
Roynr — October 1, 2009 at 1:09 pm

Wow I didn’t realize there was this much data in the world. I wonder what the redundancy rate is?
Jp — October 1, 2009 at 1:24 pm

Someone said that Intel is the new data inside. Or is it the other way round? Seriously, I already bookmarked this post under 3 user accounts on del.icio.us, just to make sure. Such a wealth of data resources is to be saved until indefinite posterity. :-)
annie — October 1, 2009 at 2:15 pm

agreed on statsheet. their data is pretty awesome.
Andrew — October 1, 2009 at 2:25 pm

Another very rich source for data that most universities will subscribe to is the World Bank’s World Development Indicators. I believe that they also allow access to limited data series for those without institutional subscriptions.
Glen Barnes — October 1, 2009 at 3:08 pm

I would use Charles for helping you scrape AJAX sites. It sits between any browser and the website and lets you see what traffic is going between them which then makes it a lot easier to work out what you need to call. IMHO it is easier to interpret the data than using Firebug.

What you need to look for is the call it is making (format and parameters) and then what the response data and format is. Typically this will be JSON, XML or HTML content returned. The great thing is that it is more likely to be structured in a machine readable format.

Disclosure: Charles is written by a friend of mine.
- Nathan Yau — October 1, 2009 at 3:20 pm
  
  @Glen – link?
  - Glen Barnes — October 1, 2009 at 3:33 pm
    
    The word Charles is hyperlinked above but it doesn’t show that well with the stylesheet on this site.
    
    http://www.charlesproxy.com/
  - Nathan Yau — October 1, 2009 at 3:51 pm
    
    thanks. it was hyperlinked, but no link inside the anchor tag.
Chris — October 1, 2009 at 6:46 pm

For US health data, WONDER http://wonder.cdc.gov/ is WONDERFUL.

To contrast with UK Health, http://www.dh.gov.uk/en/Publicationsandstatistics/Statistics/index.htm
Mx — October 1, 2009 at 10:44 pm

Cool stuff! btw, also checkout Feedity – http://feedity.com – I use it a lot these days for creating custom RSS feeds from various webpages. It is simple to use and gives great results. Hope it helps! Chao :)
dan — October 1, 2009 at 11:57 pm

Also, WRI’s EarthTrends database has country-level data on lots of economic, environmental and social indicators:

http://earthtrends.wri.org
troels — October 2, 2009 at 8:02 am

If you really feel up for the task of taking on vast amounts of high-dimensional data, you should take a look at the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/). Put simply, it is a repository of how much the genes of a given organism, e.g. human, are expressed under a wide variety of conditions.

All the data can be easily extracted into R using tools from Bioconductor, such as GEOquery.
- Andy H — October 4, 2009 at 3:23 pm
  
  For those more comfortable in Perl, I would suggest IEAutomation for scraping AJAX sites.
  
  It is a nice module that automatically controls actions in Internet Explorer. The nice part for AJAX is that when you use the code:
  
  $ie->Content();
  
  to get the source of the page, it gives you the post-AJAX source, instead of just the original page source you would get if viewing the source in a browser.
  
  http://search.cpan.org/~prashant/Win32-IEAutomation-0.4/lib/Win32/IEAutomation.pm
Abdul — October 5, 2009 at 5:53 am

We are into lead generation and marketing. This post is very useful for us to find the data and how to collect data. It’s really helpful.

Thanks for sharing information,
cheers
Abdul
aidian — October 9, 2009 at 3:43 am

Get the state code/code of state regulations for wherever you’re working. The code is in many ways the rules that are set up to follow the laws passed on the political side. In there you’ll find all sorts of details about what exactly the agency is legally required to collect. Then ask the agency for it. If they tell you no, file an open records request, and be willing to fight for it.

Two other tips to see what’s out there:
records retention schedules (try the secretary of state or state archivist)
and a tip a teacher gave me: every time you see a government form, there’s a database out there for it
James Saunders — October 13, 2009 at 9:46 am

I just came across this site today: http://www.factual.com/

It sounds like they are pretty open with their rights and API, but I haven’t had enough time to do much investigation yet.
rjs — October 16, 2009 at 4:55 am

worldcat.org – world library catalog
Stephen McDaniel — October 16, 2009 at 9:59 am

Data and facts can NOT be copyrighted or otherwise protected.

However, you can be bound by a contract that is enforceable in court (civil suit) if you agree to the Terms of Service of a site or application. Generally, this must be explicit acceptance of such terms. In other words, if Google can find it and openly displays it -AND- if it is a fact (fictional subjects don’t work like this) then you can use it for analysis at will.

Note that this is a continuum, as is often the case with legal questions. So scraping stock prices is free game, analyst ratings somewhat free game, and customer reviews at Amazon likely a loser if you do it. However, aggregating Amazon results might get by, it’s hard to say. How much do you have for court costs against Google, Amazon, etc.???
JenniC — October 21, 2009 at 5:05 pm

Nice article. You forgot a big category – stock and business data. People make money by collecting, organizing and selling this data.

Take a look at http://forum.codecall.net/html-programming/21123-web-scraping-get-stock-info.html?mode=threaded#post205686 for stock data download and parsing. That script is in biterscripting, but any scripting language will do.
rjs — October 21, 2009 at 5:27 pm

i forgot to add the world factbook from the CIA; continuously updated with all declassified info…

https://www.cia.gov/library/publications/the-world-factbook/index.html
- kelly — October 27, 2009 at 1:05 am
  
  you can’t forget about IPUMS for the University section.
  
  http://www.ipums.umn.edu/
  
  great data sets
Seeking Alpha — October 28, 2009 at 9:06 am

We at Seeking Alpha think your blog is great and would love to have you join our team. Please contact me at [email protected].

Thanks,
Boaz
Joni Saunders — October 30, 2009 at 6:09 pm

Hi All,
I thought there might be some interest in these short web lectures from the Center for Research Libraries in Chicago :

Political Science, Sociology, and Economics
In the fields of political science, sociology, and economics, digital technology has led to an explosion of data and information. This webcast will examine how both nonprofit and commercial organizations aggregate and distribute information on public opinion, populations, and finance, and how researchers use those sources. The presentation will feature three case studies:

• Cline Center for Democracy, Societal Infrastructures and Development Project
• Dow Jones Factiva
• National Opinion Research Center General Social Survey
Monday, November 9 2:00 pm
Tuesday, November 10 10:00 am
Wednesday, November 11 12:00 Noon
Lyndie — November 12, 2009 at 7:11 pm

Check out http://www.researchpipeline.com
This is a wiki that attempts to catalog all the free sources of data out there…

Thanks,
Lyndie Chiou
Panos Ipeirotis — November 19, 2009 at 12:49 pm

The NYC Data Mine repository http://www.nyc.gov/html/datamine/html/home/home.shtml “supplies many sets of public data produced by City agencies. The data sets are available in a variety of machine-readable formats and are updated often.”

This can also go under the datasets listed under “Government and Politics”
