Data Sources
-
A guide for scraping data
Data is rarely in the format you want it in. Dan Nguyen, for ProPublica, provides a thorough guide on how to scrape data from Flash, HTML, and PDF. [via @JanWillemTulp]
-
Search how phrases have been used via Google Ngram Viewer
Language changes. Culture changes. And we can see some of these changes via what authors write about in books over the years. Google's Book Ngram Viewer lets you search through this data, and shows a graph similar to the output of Google Trends. Above are the trends for nursery school, kindergarten, and child care:
This shows trends in three ngrams from 1950 to 2000: "nursery school" (a 2-gram or bigram), "kindergarten" (a 1-gram or unigram), and "child care" (another bigram). What the y-axis shows is this: of all the bigrams contained in our sample of books written in English and published in the United States, what percentage of them are "nursery school" or "child care"? Of all the unigrams, what percentage of them are "kindergarten"? Here, you can see that use of the phrase "child care" started to rise in the late 1960s, overtaking "nursery school" around 1970 and then "kindergarten" around 1973. It peaked shortly after 1990 and has been falling steadily since.
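To make the y-axis description concrete, here is a minimal sketch of that calculation: count every n-gram of a given length, then report what percentage of them match the phrase. The function names, the whitespace tokenization, and the sample sentences are all assumptions for illustration. This is not Google's actual pipeline, which also splits the counts by publication year.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_share(texts, phrase):
    """Percentage of all n-grams of the phrase's length that equal the phrase."""
    target = tuple(phrase.lower().split())
    n = len(target)
    counts = Counter()
    for text in texts:
        # Naive whitespace tokenization (an assumption, not Google's method).
        counts.update(ngrams(text.lower().split(), n))
    total = sum(counts.values())
    return 100.0 * counts[target] / total if total else 0.0


# Made-up sample "books", for illustration only.
sample = [
    "the nursery school opened near the kindergarten",
    "child care costs rose while kindergarten enrollment fell",
]

# "kindergarten" is compared against all unigrams, "child care" against all bigrams.
print(ngram_share(sample, "kindergarten"))
print(ngram_share(sample, "child care"))
```

Note that a unigram and a bigram are measured against different totals, which is why the quote describes them separately.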
Find anything interesting?
Continue Reading -
Jon Stewart explains Wikileaks’ Cablegate
You've probably already heard and read about Wikileaks' Cablegate. If not, Andy Baio has a fine roundup with significant coverage and events to get you caught up quick. Alternatively, you can watch Jon Stewart and The Daily Show explain in the clip below (slightly NSFW, because it mentions a body part).
Continue Reading -
How do people use Firefox?
Mozilla Labs just released a bunch of anonymized browsing data for their open data visualization competition:
This competition is based on Mozilla's own open data program, Test Pilot. Test Pilot is a user research platform that collects structured user data through Firefox. All data is gathered through pre-defined Test Pilot studies, which aim to explore how people use their web browser and the Internet.
There are two datasets in various formats. The first is browsing behavior from 27,000 users, including on/off private browsing that we saw a few months ago. The second dataset is from 160,000 users and is on how they actually use the Firefox interface.
Additionally, both sets have survey answers to questions like "How long have you used Firefox?" which could make for some fun and interesting breakdowns.
The deadline is December 17.
-
Making recalls and market withdrawals more accessible
Last week I found out that the FDA has a feed for all product recalls and market withdrawals since 2009... -
Opportunities in Government 2.0
Vivek Wadhwa talks government data and the (financial) opportunities ripe for the picking:
What is happening with the opening up of government data is nothing less than a silent revolution. There are literally thousands of new opportunities to improve government and to improve society—and to make a fortune while doing it. Unlike the Web 2.0 space, which is overcrowded, Gov 2.0 is uncharted territory: a new frontier to explore, grow things on, and settle on. It’s fresh soil for unlikely seedling ideas that, if they take root, could lead to very successful ventures. So I encourage entrepreneurs to stake their claims as soon as they can.
Wait a minute. Hold up. You can do more with government data than awkward dashboards? Bring it.
[TechCrunch via @ucdatalab]
-
How people use private browsing
Private browsing. All the modern browsers have it. Turn it on, and the browser won't keep your history during the session. Sometimes it's used to pay bank bills on a public computer. Sometimes it's used for other stuff. In an opt-in study looking at a week in the life of a browser, Mozilla looked at how people use private browsing.
Again, it's worth noting that people opted in to this study (about 4,000 of them), and Mozilla only recorded when users started and stopped private browsing. Nothing in between.
That said, they came up with two basic findings. The first is when people typically use private browsing (above).
They saw usage spikes during the lunch hours as well as just before the work day ended. The other spike is after the dinner hours and then finally, in the late hours of the night.
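If you want to look for spikes like these yourself, a rough sketch is to bin session start times by hour of day and see where the counts pile up. The function name and the timestamps below are hypothetical and not taken from the Mozilla data, and the ISO timestamp format is an assumption about how the start events might be stored.

```python
from collections import Counter
from datetime import datetime


def sessions_by_hour(start_times):
    """Count private-browsing session starts for each hour of the day (0-23)."""
    counts = Counter(datetime.fromisoformat(ts).hour for ts in start_times)
    return [counts.get(h, 0) for h in range(24)]


# Hypothetical start timestamps, not taken from the Mozilla study.
starts = [
    "2010-08-02T12:05:00",
    "2010-08-02T12:40:00",
    "2010-08-02T16:55:00",
    "2010-08-02T21:10:00",
    "2010-08-03T01:30:00",
]

hourly = sessions_by_hour(starts)
# Hour with the most session starts (the earliest hour wins a tie).
peak_hour = max(range(24), key=lambda h: hourly[h])
print(peak_hour, hourly[peak_hour])
```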
Continue Reading -
How weather data became open data
Weather in the private sector is an industry worth more than $1.5 billion, and it's largely because of the government's open weather data. You can find what the weather is just about anywhere with just a few clicks of the mouse. It wasn't always like that though. Clay Johnson, former director of Sunlight Labs, describes the history of open weather data, starting with Thomas Jefferson in the late 1700s.
Continue Reading -
Afghanistan war logs revealed and mapped
This past Sunday, well-known whistle-blower site Wikileaks released over 91,000 secret US military reports, covering the war in Afghanistan. Each report contains the time, geographic location, and details of an event the US military thought was important enough to put on paper.
Continue Reading -
Data and its impact on journalism
In regard to the UK's recent boom in open data, Simon Rogers of the Guardian ponders data's role in journalism and the opportunities this newfound information could bring:
The impact on journalism is expected to be great. The Chicago-based web developer and founder of the neighbourhood news site EveryBlock, Adrian Holovaty, says it's going to be challenging but exciting for journalists. "As more governments open their data, journalists lose privileged status as gatekeepers of information – but the need for their work as curators and explainers increases. The more data that's available in the world, the more essential it is for somebody to make sense of it."
This need not only creates a fresh brand of news, but also a new type of journalist:
I once prided myself on my lack of maths knowledge. Now I find myself editing a datajournalism site, the Guardian's datablog: a site where we use Google Spreadsheets to post key datasets. We make the data properly accessible, then encourage our users to take the numbers, produce graphics and applications and help us look for stories.
Priding yourself on a lack of know-how on how to deal with data is a little weird, but okay.
In any case, people always ask me how to get into information design, infographics, visualization etc. Journalism is one of those choices, and there's a lot of opportunity there if you've got the skills.
-
Egregious Citations Issued to BP
BP processes about 1.5 million barrels of crude oil per day, across six refineries in the United States. In total, 150 refineries in the United States process just under 18 million barrels per day, so BP processes about 8.5 percent of it. However, as reported by the Center for Public Integrity, 97 percent of the most dangerous violations found by OSHA were on BP properties.
Continue Reading -
Live webcast: Community Health Data Initiative

Health and Human Services (HHS) is about to announce the launch of their Community Health Data Initiative over in DC right now. The point is to make health data more usable for consumers and communities.
Today, from about 9:30 to 10:30 (as in right now), groups will be presenting how they've made use of the data in the past few weeks. I've embedded the live webcast below.
They're just going through the formalities of thank yous and intros right now, but the good stuff should start soon.
Continue Reading -
Twitter data buffet is back in business

Almost a year and a half ago, Infochimps, the data repository slash marketplace, released a giant scrape of Twitter data representing 2.7 million users, 10 million tweets, and 58 million connections. Twitter soon requested that they take it down while they figured out how they wanted to handle licensing, privacy, etc.
That was in 2008, before Twitter really started booming. Fast forward to now. Twitter and Infochimps have figured out what they want to do, and the Twitter census data is back up. It's no longer a measly 2.7 million users though. The population has grown to 35 million.
Continue Reading -
World data released ‘is a dream come true’
In another step towards open data and all that jazz, the World Bank released World Development Indicators 2010 today, which is meant to serve as a progress report of the world.
"The WDI provides a valuable statistical picture of the world and how far we've come in advancing development," said Justin Yifu Lin, the World Bank's Chief Economist and the Senior Vice President for Development Economics. "Making this comprehensive data free for all is a dream come true."
More importantly though, this comes with the launch of the freely available online database and public API to 1,000+ indicators. There used to be a big fee for this data. I can't speak for the API, but the website is well-designed. It has profile pages for each country, links to download the indicators in Excel and XML, and hey, are those graphs implemented in HTML5? I spy <canvas> tags.
Continue Reading -
TransparencyData makes campaign finance data easier to access

Anyone who's looked at campaign finance data knows it can get messy really quick (especially if you're getting it directly from the FEC). Sunlight Labs' newly launched TransparencyData aims to make the process a lot easier.
They've merged state data from FollowTheMoney and federal data from OpenSecrets and made it easy to search with a clickable interface. Select from a number of filters such as amount, recipient, or contributor, and then download data in bulk or make use of the API.
Continue Reading -
Buy and sell data at Data Marketplace

Add another site to the list of places to find the data you need. Data Marketplace connects people who want data to people who can find, scrape, and cull data.
Here's how it works. If you want data, you put in a request and optionally, a deadline and budget. A provider can then go find that data for you, maybe through scraping a difficult-to-parse website, and then post it online. You then have the option to purchase the tabular data.
There are three big humps to get over though for Data Marketplace to work.
Continue Reading -
Data.gov.uk versus Data.gov – Which wins?
Back in May last year, the US government launched Data.gov as a statement of transparency, and the Internet rejoiced. After... -
Data.gov.uk Gearing Up For Launch, er, Does Launch
Update: I had scheduled this post for next week, but apparently, Data.gov.uk launched today. The site isn't loading for me right now though. I guess they weren't prepared for traffic.
Data.gov, a catalog of US data, launched last year. Now it's the UK's turn. Well, not yet. But soon. Data.gov.uk is still under lock and key, but it has granted access to some developers. Ito Labs, the group behind mapping a year of OpenStreetMap edits, posted screenshots of their maps that show vehicle counts (above).
Here are some comparison maps between 2001 and 2008, by vehicle type.
Continue Reading -
Unemployment, 2004 to Present – The Country is Bleeding
The Bureau of Labor Statistics released the most recent unemployment numbers last week. Things aren't looking good for the unemployed, I'm afraid.
I showed my younger sister the maps. Her response: "It looks like the country is bleeding."
Continue Reading -
Target Store Openings Since the First in 1962 – Data Now Available
FlowingData readers who have been around for a while will remember I made a map early this year that showed the growth of Target stores across America. It starts with the first one in 1962 and then goes from there. It was a follow-up to the Walmart map, which I shared the code and data for.
Continue Reading