• All You Can Eat at the Twitter Data Buffet

    December 24, 2008  |  Data Sources

    Philip from infochimps posts the results of some heavy Twitter scraping. Data for 2.7 million users, 10 million tweets, and 58 million edges (i.e. connections between users) are available for download to satisfy your data hunger. I know a lot of you social network researchers will especially appreciate the big dataset, and best of all, Twitter gave Philip permission to release it. Yes, you could use the Twitter API, but isn't it better when someone does it for you?

    Download the data here. The password is the Ramanujan taxicab number followed by the word
    'kennedy' - all one word. Google is your friend, if that doesn't make sense.
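
    If you would rather compute than Google, here is a minimal sketch that works out the taxicab number by brute force, taking it to be the smallest number that can be written as a sum of two positive cubes in two different ways. The function names and the search limit are my own assumptions, not anything from the infochimps post.

```python
from collections import defaultdict


def two_way_cube_sums(limit):
    """Return {n: [(a, b), ...]} for every n <= limit that is a sum of
    two positive cubes (a <= b) in at least two different ways."""
    ways = defaultdict(list)
    a = 1
    while 2 * a**3 <= limit:
        b = a
        while a**3 + b**3 <= limit:
            ways[a**3 + b**3].append((a, b))
            b += 1
        a += 1
    return {n: pairs for n, pairs in ways.items() if len(pairs) >= 2}


def taxicab_password(suffix="kennedy", limit=5000):
    """Join the smallest two-way cube sum and the suffix into one word.
    The limit of 5000 is an arbitrary assumption that is large enough here."""
    hits = two_way_cube_sums(limit)
    return f"{min(hits)}{suffix}"


if __name__ == "__main__":
    print(taxicab_password())
```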

    [Thanks, Tim]

  • Amazon Gets In On the Public Data Arena

    December 5, 2008  |  Data Sources

    It was really only a matter of time, but Amazon now hosts public data sets. Not small data sets though - more like the ones in between 1 gigabyte and 1 terabyte:

    Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications. An initial list of data sets is already available, and more will be added soon.

    Previously, large data sets such as the mapping of the Human Genome and the US Census data required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. For example, users can produce or use prebuilt server images with tools and applications to analyze the data sets. By hosting this important and useful data with cost-efficient services such as Amazon EC2, AWS hopes to provide researchers across a variety of disciplines and industries with tools to enable more innovation, more quickly.

    There's the human genome data set, US Census data from the past 3 decades, labor statistics, and some others. Still waiting on Google to follow through with their data hosting plans.

    [via TechCrunch | Thanks, David]

  • Neighborhood Boundaries with Flickr Shapefiles

    November 28, 2008  |  Data Sources, Mapping

    Neighborhood Boundaries by Tom Taylor uses Flickr Shapefiles and Yahoo! Geoplanet "to show you where the world thinks its neighbors are." Yahoo! provides access to the Where on Earth (WOE) database, which attempts to describe locations as a hierarchy. For example, a town belongs to a city, a city to a county, and a county to a state. The Flickr API stores shapefiles identified by WOE ID. Here's the punchline: the shapefiles are built using only the latitude and longitude from geotagged photos on Flickr. There's no GIS involved here.
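
    To get a feel for how a boundary can come out of nothing but photo coordinates, here is a rough sketch. I don't describe above how Flickr actually derives its shapes, so this swaps in a plain convex hull (Andrew's monotone chain) as a simpler stand-in, and the sample points in the usage example are made up.

```python
def convex_hull(points):
    """Return the convex hull of (lon, lat) points in counter-clockwise
    order, using Andrew's monotone chain. Treats coordinates as planar,
    which is only a rough approximation for small areas."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b makes a counter-clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]


if __name__ == "__main__":
    # Hypothetical geotagged photo locations for one neighborhood.
    photos = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
    print(convex_hull(photos))
```

    A convex hull will overstate any neighborhood with a concave outline, so treat this as an illustration of the idea rather than the method behind the Flickr shapefiles.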

    Why this matters, I can't really say. I think it's mostly to show how much data is stored in geotagged Flickr photos. I'm no GIS expert though. Anyone care to comment on the significance?

    [Thanks, @couch]

  • US Oil Doesn’t Come From Where You Think it Does

    November 21, 2008  |  Data Sources, Mapping

    Where do you think the US imports the most oil from? Most of us would probably say somewhere in the Middle East, but Jon Udell does some number crunching and shows that's a misconception. Canada supplies us with the most oil (according to the US Department of Energy).

    This realization, however, isn't the post's punchline. It's how easy it was for Jon to figure this stuff out. With some help from Dabble DB (an app that lets you easily use a database without too much technical fuss), Jon was able to parse the data and map it by region with a few swift clicks.

    We’re really close to the point where non-specialists will be able to find data online, ask questions of it, produce answers that bear on public policy issues, and share those answers online for review and discussion. A few more turns of the crank, and we’ll be there. And not a moment too soon.

    We're gettin' there.

    [Thanks, Tim]

  • New York Times Visualization Lab – Collaboration with Many Eyes

    October 28, 2008  |  Data Sources

    It was just a little over a week ago that The New York Times announced their Developer Network i.e. Campaign Finance API. Yesterday, they announced something more - the Visualization Lab. In collaboration with the Many Eyes group, the Times has rolled out a Many Eyes for data used by Times writers. You can visualize, explore, and comment on data posted at the Visualization Lab in the same way that you can at Many Eyes.

    Today, we’re taking the next step in reader involvement with the launch of The New York Times Visualization Lab, which allows readers to create compelling interactive charts, graphs, maps and other types of graphical presentations from data made available by Times editors. NYTimes.com readers can comment on the visualizations, share them with others in the form of widgets and images, and create topic hubs where people can collect visualizations and discuss specific subjects.

    A Few More Steps

    I said the API was a good step forward. The Visualization Lab is more than a step. No doubt The Times heard what I said about their API and decided to roll with it since I am the head authority on everything. Yes, I'm totally kidding, in case that didn't come across as a joke. Come on now.

    I'm looking forward to seeing how well Times readers take to this new way of interacting.

    [Thanks, William]

  • Playboy Playmate Curves and the State of the Economy

    October 24, 2008  |  Data Sources, Economics

    Terry Pettijohn and Brian Jungeberg of Mercyhurst College took a very close look at the curves, um, measurements of past Playboy Playmates of the Year in relation to the state of the economy.

  • New York Times Rolls Out Campaign Finance API

    October 16, 2008  |  Data Sources

    The New York Times announced the opening of their Developer Network a couple of days ago. It's their "API clearinghouse and community." It might seem kind of weird that a newspaper company has an API, but as many FlowingData readers know, the Times prides itself on innovation.

    The Campaign Finance API is currently available:

    With the Campaign Finance API, you can retrieve contribution and expenditure data based on United States Federal Election Commission filings. Campaign finance data is public and is therefore available from a variety of sources, but the developers of the Times API have distilled the data into aggregates that answer most campaign finance questions. Instead of poring over monthly filings or searching a disclosure database, you can use the Times Campaign Finance API to quickly retrieve totals for a particular candidate, see aggregates by ZIP code or state, or get details on a particular donor.

    Anyone who has tried to play with FEC data, myself included, knows that this API is cool. You could get the data directly from the FEC, but it's a bit of a painstaking process. Now you don't have to sift through a bunch of reports or an awkward user interface.

    The Movie Review API is next in line. After that, who knows, but it's a good step forward for The Times.

    [via serial consign]

  • OneGeology Wants to Be Geological Equivalent of Google Maps

    September 11, 2008  |  Data Sources, Mapping

    There's lots of free geographical data about what's going on at the surface of our planet. It's a different story for what's going on underneath though. OneGeology aims to be the solution to that problem.

    OneGeology is an international initiative of the geological surveys of the world and a flagship project of the 'International Year of Planet Earth'. Its aim is to create dynamic geological map data of the world available via the web. This will create a focus for accessing geological information for everyone.

    I've never been one for geology, but if the data (and interactive maps) were easily accessible, there certainly would be a spike in interest.

    [via msnbc | Thanks, Samantha]

  • FlowingData Cited in Forbes Magazine?

    June 28, 2008  |  Data Sources

    Whaaa? Cool beans.

  • What Do People Want to Do With Their Lives?

    June 17, 2008  |  Data Sources, Projects, Visualization

    43 Things is a goal-setting community where people set goals, cheer each other on, and connect with others who are trying to achieve the same thing. Even if you're not setting goals yourself, it's still interesting and often amusing to see what others have set out to do e.g. go skinny dipping, have a one night stand, and be myself.

  • U.S. Census Bureau’s 2008 Statistical Abstract – Looking at America’s Data

    May 21, 2008  |  Data Sources

    The U.S. Census Bureau released their 2008 Statistical Abstract, the National Data Book, not too long ago (um, like in January). There are state rankings and data in 30 categories and many more sub-categories. All this data is in the form of PDFs and Excel spreadsheets, which doesn't lend much to readability, but still, it's nice to have access to all the information.

    Maybe FlowingData readers can put together a giant statistical abstract all conveyed through graphics. That would be cool. Above are six data sets that I picked from the billion or so available.

  • World Internet City-to-City Connections and Density Maps

    April 1, 2008  |  Data Sources, Mapping

    Chris Harrison put together a series of Internet maps that show how cities are interconnected by router configuration. Similar to Aaron Koblin's Flight Patterns, Chris chose to map only the data, which makes an image that looks a lot like strands of silk stretched from city to city. With these maps, viewers gain a sense of connectivity in the world - and as expected the U.S. and Europe are a lot brighter than the rest.

  • Six Years of Piracy Data Available for Download – Shiver Me Timbers

    March 18, 2008  |  Data Sources

    I stumbled across this dataset covering piracy of Oscar-nominated films over the last 6 years and a short analysis.

    Piracy by the Numbers

    The Academy's efforts to crack down on bootlegging haven't done a whole lot. Focus on stopping one area, like downloading, and another area, like Region 5 DVDs from overseas, just grows more prolific. A quick search in the right places will show you that piracy isn't going away any time soon.

    I even met someone whose job it was to find people who were "seeding" films through BitTorrent and to report them to police. I got the impression that it was a really tedious process and that people go uncaught most of the time. I'm, uh, not condoning this, but if you don't want to get caught, just make sure you stop the torrent once you've got your file.

    Bootlegging on Seinfeld

    Bootlegging always reminds me of the Seinfeld episode when Jerry somehow gets caught up in a bootlegging scheme:

    [T]here was a kid couldn't have been more than ten years old. He was asking a street vendor if he had any other bootlegs as good as Death Blow. That's who I care about. The little kid who needs bootlegs, because his parent or guardian won't let him see the excessive violence and strong sexual content you and I take for granted.

    For those interested (and I know you are), the term bootleg originates from hiding flasks of liquor in the legging of boots. Ahoy, matey.

    Photo by mumelopics

  • 10 Largest Data Breaches Since 2000 – Millions Affected

    March 14, 2008  |  Data Sources

    In light of the MySpace photo breach (due to their negligence) a couple of months ago, I got to wondering about other recent data breaches. It turns out Attrition.org keeps a Data Loss Archive and Database that contains known data breaches since 2000. Records include date, number affected, groups involved, summaries, and links to reported stories and updates. It's surprisingly detailed and even better, it's all available for download.

    The above graphic shows the 10 largest data breaches which affected millions. I thought the 800,000 records thieved from UCLA a couple of years ago (that my information was unfortunately a part of) was a lot. That's nothing compared to these.

    Notice the higher frequency as we get closer to the present?

    [Thanks Ryan | Welcome, Boing Boing readers]

  • A World of Information – United Nations Data Just Became Accessible

    March 7, 2008  |  Data Sources

    For our Humanflows project, we used the United Nations Common Database for our demographic numbers. Anyone who has used the common database knows that it's not especially user-friendly. You have to go through a series of non-intuitive dropdown menus to get the data you want. You then have to decipher the downloaded data's CSV format. The recently released UNdata relieves a lot of these problems.

  • Estimate Financial Impact of Risk and Uncertainty for a Living

    March 1, 2008  |  Data Sources

    I stumbled across a data table from the Social Security Administration that shows the probability of death. It's an actuarial life table estimating the probability that you will die within one year given your age.
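
    A table like this is easy to put to work. As a minimal sketch, assuming the table gives q(x), the probability of dying within one year at age x, the chance of surviving several years is the product of (1 - q) over those ages. The function names and the numbers in the usage example are made up for illustration and are not values from the Social Security Administration table.

```python
from math import prod


def survival_probability(q_by_age, start_age, years):
    """Probability of surviving `years` more years from `start_age`,
    given one-year death probabilities keyed by age."""
    return prod(1.0 - q_by_age[age] for age in range(start_age, start_age + years))


def death_probability(q_by_age, start_age, years):
    """Probability of dying at some point within the next `years` years."""
    return 1.0 - survival_probability(q_by_age, start_age, years)


if __name__ == "__main__":
    # Hypothetical one-year death probabilities, NOT real SSA figures.
    q = {30: 0.001, 31: 0.002}
    print(survival_probability(q, 30, 2))
    print(death_probability(q, 30, 2))
```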

  • Rambo Kill Counts From Parts I, II, III, and IV

    February 22, 2008  |  Data Sources

    I don't think I've seen a single Rambo all the way through nor do I remember the premise of any of the movies, but I still found these kill counts amusing. Notice the near doubling of deaths each sequel. Yo, Adrian!!! Yeah, I know, wrong movie, but come on, is there really a difference?

    Here's a graph showing kill counts (mostly for my own entertainment):

    Rambo Kill Counts Graph

    Mr. Rambo may have gotten more violent in the latest installment, but it looks like he also grew more modest.

    [via Geekstir]

  • Speed Dating Data – Attractiveness, Sincerity, Intelligence, Hobbies

    February 6, 2008  |  Data Sources

    In their paper Gender Differences in Mate Selection: Evidence from a Speed Dating Experiment, Fisman et al. had a bit of fun with a speed dating dataset. Here's what they found:

    Women put greater weight on the intelligence and the race of partner, while men respond more to physical attractiveness. Moreover, men do not value women's intelligence or ambition when it exceeds their own. Also, we find that women exhibit a preference for men who grew up in affluent neighborhoods. Finally, male selectivity is invariant to group size, while female selectivity is strongly increasing in group size.

    The dataset is substantial, with over 8,000 observations for answers to twenty-something survey questions. With questions like "How do you measure up?" and "What do you look for in the opposite sex?", this dataset is definitely high on human element and should be fun to play with.

    [via Statistical Modeling]

  • Weekend Minis – Government, Environment & Angry Employee

    February 2, 2008  |  Data Sources

    FedStats - Provides access to the full range of official statistical information produced by the Federal Government, including population, education, crime, and health care.

    MAPLight - A detailed database that brings together information on campaign contributions and votes in the California legislature. Check out the video tour.

    EarthTrends - A collection of information regarding the environmental, social, and economic trends that shape our world.

    Angry Employee Deletes All of Company's Data - A woman about to "lose" her job goes to the office at night and deletes 7 years' worth of data. Can we say backup, please?

  • Google Decides to Host a Whole Lot of Scientific Data – Palimpsest Project

    January 21, 2008  |  Data Sources

    In its continued efforts for absolute power over all information ever created in the world, Google will be hosting open-source scientific datasets at its research section. Here are the presentation slides from Google's Jon Trowbridge:

    In the next few weeks, terabytes of data will be made available to the public. For example, all 120 terabytes of Hubble Space Telescope data is going to be online. That's kind of cool but kind of scary too. Such a large amount of data is bound to affect lots of people on many different levels.

    For scientists, data will be available for deeper research. For the scientists who generated the data, their research could be placed under more critical scrutiny. Existing data applications might be eclipsed by the data giant, or it could go the other way such that the general public grows more aware of data-type things. Mashups will in turn spring up as well as more visualization, I am sure.

    All of this Doesn't Matter If...

    Of course, all of this depends on what data end up on the Google servers and how easily accessible the data are. Knowing Google, I don't think accessibility will be a problem. Getting data will be the super hard part. Who will be willing to contribute their data? What type of data will get contributed? Will it be the good, raw data or more cleaned and processed data? Do researchers even want to share their data with the rest of the world?

    It's going to be interesting to see what goes up on Google Research in these coming weeks.

    [via Wired and Pimm]

Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.