Data Sources

  • A Future Without Key Social and Economic Statistics for the Country

    May 13, 2012 to Data Sources  •  Nathan Yau

    Robert Groves, director of the U.S. Census Bureau, on the Appropriations Bill:

    The Appropriations Bill eliminates the Economic Census, which measures the health of our economy. It terminates the American Community Survey, which produces the social and demographic information that monitors the impact of economic trends on communities throughout the country. It halts crucial development of ways to save money on the next decennial census. In the last three years the Census Bureau has reacted to budget and technological challenges by mounting aggressive operational efficiency programs to make these key statistical cornerstones of the country more cost efficient. Eliminating them halts all the progress to build 21st century statistical tools through those innovations. This bill thus devastates the nation’s statistical information about the status of the economy and the larger society.

    A lot of the negative comments following the post are from people who have never used Census data, or any substantial amount of data for that matter, and have no clue how a dataset can feed into a model to make other estimates. Then there are the people who don't want to answer questions about their toilets. I wonder what their Facebook profiles look like.

  • CNN transcript collection, 2000-2012

    May 9, 2012 to Data Sources  •  Nathan Yau

    Thanks to the Internet Archive and CNN, thirteen years of transcripts, about a gigabyte compressed, are available to download as one file.

    For over a decade, CNN (Cable News Network) has been providing transcripts of shows, events and newscasts from its broadcasts. The archive has been maintained and the text transcripts have been dependably available at transcripts.cnn.com. This is a just-in-case grab of the years of transcripts for later study and historical research.

    Changes in news coverage and CNN's focus over the years, anyone?

    [via @A_L]

  • 1940 Census Individual Records Released

    April 3, 2012 to Data Sources  •  Nathan Yau

    [Image: Racial ethnic diversity]

    The 72-year mark has arrived, and the United States Census released individual records from 1940 yesterday. So you can now, for example, see that J.D. Salinger lived at 1133 Park Avenue.

  • Texting on the toilet

    January 30, 2012 to Data Sources  •  Nathan Yau

    I thought this riveting post on the New York Times Bits blog about the rise of the toilet texter deserved a graphic. Since their graphics department is no doubt busy with elections, I took the liberty. I am — the 91 percent.

    I got the numbers straight from the Bits post, but you can download the full report from 11mark for all the demographics. You have to register though, and I didn't want to be the guy who creates an online account to just read a report on what people do while they make dooty. I have standards.

  • What Facebook knows about you

    December 14, 2011 to Data Sources  •  Nathan Yau

    [Image: Facebook privacy]

    Facebook logs and saves a lot of data about you and what you do on their site. This shouldn't be surprising, given that the more time people spend on Facebook, the greater the cash flow, but just how much data do they store? Austrian law student Max Schrems requested all the data Facebook had about him, which European law entitles citizens to do. He got back a CD with 1,222 PDF files.

  • Geo API from Infochimps brings you closer to mapping fun

    August 31, 2011 to Data Sources  •  Nathan Yau

    [Image: Summarizer from infochimps]

    Mostly because of the popularity of smartphones, location data is all the rage nowadays. You're almost always connected no matter where you are. Rich location data can give you a new sense of place, and at the same time, this sort of data can paint an interesting picture of what's going on in your country or around the world. Hence, Infochimps, the one-stop shop for data folk and developers, just announced their new Geo API.

  • Reporters make it easier to access Census data

    August 29, 2011 to Data Sources  •  Nathan Yau

    Census data can provide valuable information, but the datasets are not always the easiest to access. So you often end up spending a lot of time getting your data in order before you actually get to do anything with it. Investigative Reporters and Editors has released the next phase in their Census project to make Census 2010 more accessible via a simple interface. Easily download data in bulk as CSV or shapefiles or build it into your applications with the API.

    [census.ire.org via @bryanboyer]

  • Get a coffee, give a coffee API

    August 7, 2011 to Data Sources  •  Nathan Yau

    Jonathan Stark, a mobile application consultant, is running an interesting social experiment with his Starbucks card:

    Jonathan's Card is an experiment in social sharing of physical goods using digital currency on mobile phones. I stumbled on the idea while doing research for a blog post about Broadcasting Mobile Currency.

    Based on the similarity to the "take a penny, leave a penny" trays at convenience stores in the US, I've adopted a similar "get a coffee, give a coffee" terminology for Jonathan's Card.

    Simply save the picture of Jonathan's Starbucks card onto your smartphone and use it to buy your coffee. If you like, add money to the card so that someone else can buy a coffee.

    The best part is that Stark provides a simple API that returns the balance on the card every minute. When do people buy coffee? How do people give and take? Are people more likely to give when there's a large balance or when there's nothing left? Lots of fun things to look at.
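    As a rough starting point for the give-and-take questions, here is a minimal sketch that turns a series of balance readings into "give" and "take" events by looking at the change between consecutive polls. The sample values and the classify_changes helper are made up for illustration. The actual API endpoint and response format aren't described here, so the sketch assumes you have already collected (timestamp, balance) pairs.

```python
from datetime import datetime

# Hypothetical balance samples (ISO timestamp, dollars), one per poll.
# These values are invented for illustration; they are not real card data.
samples = [
    ("2011-08-07T08:00:00", 30.00),
    ("2011-08-07T08:01:00", 26.45),
    ("2011-08-07T08:02:00", 26.45),
    ("2011-08-07T08:03:00", 46.45),
    ("2011-08-07T08:04:00", 42.20),
]


def classify_changes(samples):
    """Label each change between consecutive polls as a give or a take."""
    events = []
    for (_, prev), (stamp, curr) in zip(samples, samples[1:]):
        delta = round(curr - prev, 2)
        if delta == 0:
            continue  # balance unchanged between polls
        kind = "give" if delta > 0 else "take"
        events.append((datetime.fromisoformat(stamp), kind, abs(delta)))
    return events


events = classify_changes(samples)
for when, kind, amount in events:
    print(when.strftime("%H:%M"), kind, f"${amount:.2f}")
```

    One caveat: with one reading per minute, a purchase and a reload inside the same minute collapse into a single net change, so the counts are a lower bound on actual activity.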

    [Jonathan's Card via @kn0thing]

  • Pew Research raw survey data now available

    May 25, 2011 to Data Sources  •  Nathan Yau

    Pew Research churns out a lot of interesting results from a number of surveys about online and American culture, but until now it has usually shared only aggregated results and pre-made charts and graphs. This is well and good for the information-consuming public; however, these results can spawn curiosities that are fun to dig into. Luckily, the Pew Research Center launched a Data Sets section that provides raw survey responses and the questions in a variety of easy-to-use data formats.

    Our raw data, previously posted only as SPSS files, is now available in comma-delimited (.csv) format for all reports going back to 2003. We hope that making our data available in this open-source format will make analysis easier for researchers who don’t own a copy of SPSS to analyze our data.

    This should be fun. Recent datasets include the social side of the Internet, health tracking habits, and reputation management.
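    To give a sense of what raw responses in CSV make possible, here is a minimal sketch that tabulates one answer by one demographic group using only the Python standard library. The column names and rows are invented for illustration. Real Pew files have their own variable names and codes, and typically need survey weights applied, which this sketch ignores.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt; real Pew files have their own column names and codes.
raw = """respondent_id,age_group,uses_social_sites
1,18-29,yes
2,18-29,yes
3,30-49,no
4,30-49,yes
5,50-64,no
"""


def share_yes_by_group(csv_text, group_col, answer_col):
    """Return the unweighted share of 'yes' answers within each group."""
    totals, yes = Counter(), Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[group_col]] += 1
        if row[answer_col] == "yes":
            yes[row[group_col]] += 1
    return {group: yes[group] / totals[group] for group in totals}


shares = share_yes_by_group(raw, "age_group", "uses_social_sites")
for group, share in shares.items():
    print(group, f"{share:.0%}")
```

    Swap the in-memory string for an open file handle to run the same tabulation on a downloaded dataset.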

    [Pew Research via @kzickhur]

  • Map your location – that your iPhone secretly records

    April 20, 2011 to Data Sources, Mapping  •  Nathan Yau

    [Image: iphone gps trace]

    Researchers Alasdair Allan and Pete Warden have found that the iPhone records cell tower access, and hence your location, in an easy-to-read file that is transferred as you switch devices. And it does this whether you like it or not.

    The more fundamental problem is that Apple are collecting this information at all. Cell-phone providers collect similar data almost inevitably as part of their operations, but it’s kept behind their firewall. It normally requires a court order to gain access to it, whereas this is available to anyone who can get their hands on your phone or computer.

    Allan and Warden provide an open-source application, iPhone Tracker, that maps that data. The good news is that the data doesn't seem to go anywhere other than your own backups and devices. Privacy concerns aside, this kind of makes me wish I had an iPhone, although I suspect my map would be painfully boring.

    [iPhone Tracker via Marco]

  • Data.gov and other transparency sites to be shut down due to budget cuts

    March 31, 2011 to Data Sources  •  Nathan Yau

    Last week, there were rumblings over the end of the Statistical Abstract, and I suggested that it was just a sign of changing technologies. I thought that Data.gov and similar sites were the natural progression. Here's the problem with that argument. Congress is planning on shutting down Data.gov and other transparency sites in the next few months.

  • Tell-all telephone reveals politician’s life

    March 30, 2011 to Data Sources, Mapping  •  Nathan Yau

    [Image: Tell-all telephone]

    Not many people understand the importance of data privacy. They don't get how little bits of information sent from your phone every now and then can show a lot about your day-to-day life.

    As the German government tries to come to a consensus about its data retention rules, Green party politician Malte Spitz retrieved six months of phone data from Deutsche Telekom (by suing them), to show what you can get from a little bit of private mobile data. He handed the data to Zeit Online, and they in turn mapped and animated practically every one of Spitz' moves over half a year and combined it with publicly available information from sources such as his appointment website, blog, and Twitter feed for more context.

  • Lots of health data released via Health Indicators Warehouse

    March 1, 2011 to Data Sources  •  Nathan Yau

    The government has been making a big push for more open health-related data, and a couple of weeks ago, they released a whole bunch of it with the launch of HealthData.gov. It's the same interface as Data.gov, but for health. Additionally, the Health Indicators Warehouse launched with different data and a slightly more useable interface.

    A quick scan of the data available, however, does seem to indicate that a lot of it is spotty or outdated (like on data.gov), which doesn't make it especially useful. For example, some data sets are only one data point, while others are only a single year. At least it's a start.

    [Health Indicators Warehouse via @periscopic]

  • Million song dataset available for download

    February 24, 2011 to Data Sources  •  Nathan Yau

    Need music data? Get all the data you want and more from the freely available million song dataset, offered by LabROSA at Columbia University and Echo Nest. There's lots of metadata on song features and your standard stuff like year and artist. There are also several code wrappers and samples to help researchers make use of the data right away.

    [Million Song Dataset via @MacDivaONA]

  • Sunlight Labs opens up Real Time Congress API

    February 17, 2011 to Data Sources  •  Nathan Yau

    Sunlight Labs continues its work for a more open government with its recent release of the Real Time Congress API.

    Today we're making available the Real Time Congress API, a service we've been working on for several months, and will be continuing to expand.

    The Real Time Congress API (RTC) is a RESTful API over the artifacts of Congress, kept up to date in as close to real time as possible. It consists of several live feeds of data, available in JSON or XML. These feeds are filterable and sortable and sliceable in all sorts of different ways, and you can read the docs to see how.

    There are seven data types the API will report:

    • Bills
    • Votes
    • Amendments
    • Videos
    • Floor Updates
    • Committee Hearings
    • Documents

    Now someone has to do something with all of this data coming in. Can you think of a useful application for what is essentially an automated government Twitter feed?

    [Real Time Congress API]

  • Find more of the data you need with DataMarket

    January 31, 2011 to Data Sources, Online Applications  •  Nathan Yau

    Add another online destination to find the data that you need. DataMarket launched back in May with Icelandic data, but just a few days ago relaunched with data of the international variety. They tout 100 million time series datasets and 600 million facts. I'm not totally sure what that means (100 million lines, sets of lines?), but I take it that means a lot.

    Just over 2 years and countless cups of coffee after we started coding, DataMarket.com launches with international data. You can now find, visualize and download data from many of the world’s most important data providers on our site.

    At first glance DataMarket feels a lot like the now-defunct Swivel. Search for the data you want and you get back a list of datasets. The focus on only time series, though, is actually a plus in that they can provide more specific tools to visualize and explore. The current toolset isn't going to blow you away, but it's not bad.

  • A guide for scraping data

    January 17, 2011 to Data Sources  •  Nathan Yau

    Data is rarely in the format you want it. Dan Nguyen, for ProPublica, provides a thorough guide on how to scrape data from Flash, HTML, and PDF. [via @JanWillemTulp]

  • Search how phrases have been used via Google Ngram Viewer

    December 20, 2010 to Data Sources, Online Applications  •  Nathan Yau

    [Image: Ngram - kindergarten]

    Language changes. Culture changes. And we can see some of these changes via what authors write about in books over the years. Google's Book Ngram Viewer lets you search through this data, and shows a graph similar to the output of Google Trends. The above shows the trends for nursery school, kindergarten, and child care:

    This shows trends in three ngrams from 1950 to 2000: "nursery school" (a 2-gram or bigram), "kindergarten" (a 1-gram or unigram), and "child care" (another bigram). What the y-axis shows is this: of all the bigrams contained in our sample of books written in English and published in the United States, what percentage of them are "nursery school" or "child care"? Of all the unigrams, what percentage of them are "kindergarten"? Here, you can see that use of the phrase "child care" started to rise in the late 1960s, overtaking "nursery school" around 1970 and then "kindergarten" around 1973. It peaked shortly after 1990 and has been falling steadily since.
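    The y-axis arithmetic is easy to try on a small scale. Below is a minimal sketch that computes, for each year, a phrase's count divided by the count of all n-grams of the same length. The two-sentence corpus and the ngrams and share helpers are invented for illustration; this is not Google's data or code, just the same ratio on toy text.

```python
from collections import Counter

# Toy corpus keyed by publication year; invented text, not Google's data.
books = {
    1960: "the nursery school opened and the kindergarten grew",
    1990: "child care costs rose while child care centers and kindergarten classes filled",
}


def ngrams(text, n):
    """Return every run of n consecutive words in the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def share(text, phrase):
    """Fraction of all same-length n-grams in the text that match the phrase."""
    n = len(phrase.split())
    grams = ngrams(text, n)
    return Counter(grams)[phrase] / len(grams) if grams else 0.0


for year, text in sorted(books.items()):
    for phrase in ("nursery school", "kindergarten", "child care"):
        print(year, phrase, f"{share(text, phrase):.1%}")
```

    Note that bigrams are compared only against other bigrams and unigrams only against other unigrams, which is why "kindergarten" and "child care" sit on the same chart even though they are counted against different totals.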

    Find anything interesting?

  • Jon Stewart explains Wikileaks’ Cablegate

    December 2, 2010 to Data Sources, News  •  Nathan Yau

    You've probably already heard and read about Wikileaks' Cablegate. If not, Andy Baio has a fine roundup with significant coverage and events to get you caught up quick. Alternatively, you can watch Jon Stewart and The Daily Show explain in the clip below (slightly NSFW, because it mentions a body part).

  • How do people use Firefox?

    November 30, 2010 to Data Sources, News  •  Nathan Yau

    Mozilla Labs just released a bunch of anonymized browsing data for their open data visualization competition:

    This competition is based on Mozilla's own open data program, Test Pilot. Test Pilot is a user research platform that collects structured user data through Firefox. All data is gathered through pre-defined Test Pilot studies, which aim to explore how people use their web browser and the Internet.

    There are two datasets in various formats. The first is browsing behavior from 27,000 users, including on/off private browsing that we saw a few months ago. The second dataset is from 160,000 users and is on how they actually use the Firefox interface.

    Additionally, both sets have survey answers to questions like "How long have you used Firefox?" which could make for some fun and interesting breakdowns.

    The deadline is December 17.

    [Mozilla Labs]

Copyright © 2007-2012 by FlowingData. Hosted by Media Temple.