Data Sources

  • What Do People Want to Do With Their Lives?

    43 Things is a goal-setting community where people set goals, cheer each other on, and connect with others who are trying to achieve the same thing. Even if you're not setting goals yourself, it's still interesting and often amusing to see what others have set out to do, e.g., go skinny dipping, have a one-night stand, and be myself.

  • U.S. Census Bureau’s 2008 Statistical Abstract – Looking at America’s Data

    May 21, 2008 to Data Sources by Nathan Yau

    The U.S. Census Bureau released their 2008 Statistical Abstract, the National Data Book, not too long ago (um, like in January). There are state rankings and data in 30 categories and many more sub-categories. All this data is in the form of PDFs and Excel spreadsheets, which doesn't lend much to readability, but still, it's nice to have access to all the information.

    Maybe FlowingData readers can put together a giant statistical abstract all conveyed through graphics. That would be cool. Above are six data sets that I picked from the billion or so available.

  • World Internet City-to-City Connections and Density Maps

    April 1, 2008 to Data Sources, Mapping by Nathan Yau

    Chris Harrison put together a series of Internet maps that show how cities are interconnected by router configuration. Similar to Aaron Koblin's Flight Patterns, Chris chose to map only the data, which makes an image that looks a lot like strands of silk stretched from city to city. With these maps, viewers gain a sense of connectivity in the world - and as expected the U.S. and Europe are a lot brighter than the rest.

  • Six Years of Piracy Data Available for Download – Shiver Me Timbers

    March 18, 2008 to Data Sources by Nathan Yau

    I stumbled across this dataset covering piracy of Oscar-nominated films over the last 6 years and a short analysis.

    Despite the Academy's efforts to crack down on bootlegging, its attempts haven't done a whole lot. Focus on stopping one area, like downloading, and another area, like Region 5 DVDs from overseas, just grows more prolific. A quick search in the right places will show you that piracy isn't going away any time soon.

    I even met someone whose job it was to find people who were "seeding" films over BitTorrent and to report them to police. I got the impression that it was a really tedious process and that people go uncaught most of the time. I'm, uh, not condoning this, but if you don't want to get caught, just make sure you stop the torrent once you've got your file.

    Bootlegging on Seinfeld

    Bootlegging always reminds me of the Seinfeld episode when Jerry somehow gets caught up in a bootlegging scheme:

    [T]here was a kid couldn't have been more than ten years old. He was asking a street vendor if he had any other bootlegs as good as Death Blow. That's who I care about. The little kid who needs bootlegs, because his parent or guardian won't let him see the excessive violence and strong sexual content you and I take for granted.

    For those interested (and I know you are), the term bootleg originates from hiding flasks of liquor in the legging of boots. Ahoy, matey.

    Photo by mumelopics

  • 10 Largest Data Breaches Since 2000 – Millions Affected

    March 14, 2008 to Data Sources by Nathan Yau

    In light of the MySpace photo breach (due to their negligence) a couple of months ago, I got to wondering about other recent data breaches. It turns out Attrition.org keeps a Data Loss Archive and Database that contains known data breaches since 2000. Records include date, number affected, groups involved, summaries, and links to reported stories and updates. It's surprisingly detailed and even better, it's all available for download.

    The above graphic shows the 10 largest data breaches which affected millions. I thought the 800,000 records thieved from UCLA a couple of years ago (that my information was unfortunately a part of) was a lot. That's nothing compared to these.

    Notice the higher frequency as we get closer to the present?

    [Thanks Ryan | Welcome, Boing Boing readers]

  • A World of Information – United Nations Data Just Became Accessible

    March 7, 2008 to Data Sources by Nathan Yau

    For our Humanflows project, we used the United Nations Common Database for our demographic numbers. Anyone who has used the common database knows that it's not especially user-friendly. You have to go through a series of non-intuitive dropdown menus to get the data you want. You then have to decipher the downloaded data's CSV format. The recently released UNdata relieves a lot of these problems.

  • Estimate Financial Impact of Risk and Uncertainty for a Living

    March 1, 2008 to Data Sources by Nathan Yau

    I stumbled across a data table from the Social Security Administration that shows the probability of death. It's an actuarial life table estimating the probability that you will die within one year given your age.
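
    A minimal sketch of how numbers like these chain together, assuming made-up one-year death probabilities (not the SSA's actual figures): the chance of surviving several years in a row is the product of the yearly survival probabilities, one minus the probability of death at each age. The function name and the values are hypothetical.

```python
# Hypothetical one-year death probabilities by age (illustrative only,
# NOT taken from the Social Security Administration table).
death_prob = {30: 0.001, 31: 0.002, 32: 0.003}


def survival_probability(q_by_age, start_age, years):
    """Probability of surviving `years` more years starting at `start_age`."""
    p = 1.0
    for age in range(start_age, start_age + years):
        # Surviving the year at this age has probability 1 - q(age).
        p *= 1.0 - q_by_age[age]
    return p


p_survive_3 = survival_probability(death_prob, 30, 3)
print(round(p_survive_3, 6))
```

    With a real life table you would swap in the published probabilities for your own age range.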

  • Rambo Kill Counts From Parts I, II, III, and IV

    February 22, 2008 to Data Sources by Nathan Yau

    I don't think I've seen a single Rambo all the way through nor do I remember the premise of any of the movies, but I still found these kill counts amusing. Notice the near doubling of deaths each sequel. Yo, Adrian!!! Yeah, I know, wrong movie, but come on, is there really a difference?

    Here's a graph showing kill counts (mostly for my own entertainment):

    [Figure: Rambo Kill Counts Graph]

    Mr. Rambo may have gotten more violent in the latest installment, but it looks like he also grew more modest.

    [via Geekstir]

  • Speed Dating Data – Attractiveness, Sincerity, Intelligence, Hobbies

    February 6, 2008 to Data Sources by Nathan Yau

    In their paper "Gender Differences in Mate Selection: Evidence from a Speed Dating Experiment," Fisman et al. had a bit of fun with a speed dating dataset. Here's what they found:

    Women put greater weight on the intelligence and the race of partner, while men respond more to physical attractiveness. Moreover, men do not value women's intelligence or ambition when it exceeds their own. Also, we find that women exhibit a preference for men who grew up in affluent neighborhoods. Finally, male selectivity is invariant to group size, while female selectivity is strongly increasing in group size.

    The dataset is substantial, with over 8,000 observations for answers to twenty-something survey questions. With questions like "How do you measure up?" and "What do you look for in the opposite sex?", this dataset is definitely high on human element and should be fun to play with.

    [via Statistical Modeling]

    Photo by bilateral

  • Weekend Minis – Government, Environment & Angry Employee

    February 2, 2008 to Data Sources by Nathan Yau

    FedStats - Provides access to the full range of official statistical information produced by the Federal Government, including population, education, crime, and health care.

    MAPLight - A detailed database that brings together information on campaign contributions and votes in the California legislature. Check out the video tour.

    EarthTrends - A collection of information regarding the environmental, social, and economic trends that shape our world.

    Angry Employee Deletes All of Company's Data - A woman about to "lose" her job goes to the office at night and deletes 7 years' worth of data. Can we say backup, please?

  • Google Decides to Host a Whole Lot of Scientific Data – Palimpsest Project

    January 21, 2008 to Data Sources by Nathan Yau

    In its continued efforts for absolute power over all information ever created in the world, Google will be hosting open-source scientific datasets at its research section. Here are the presentation slides from Google's Jon Trowbridge:

    In the next few weeks, terabytes of data will be made available to the public. For example, all 120 terabytes of Hubble Space Telescope data is going to be online. That's kind of cool but kind of scary too. Such a large amount of data is bound to affect lots of people on many different levels.

    For scientists, data will be available for deeper research. For the scientists who generated the data, their research could be placed under more critical scrutiny. Existing data applications might be eclipsed by the data giant, or it could go the other way such that the general public grows more aware of data-type things. Mashups will in turn spring up as well as more visualization, I am sure.

    All of this Doesn't Matter If...

    Of course, all of this depends on what data end up on the Google servers and how easily accessible the data are. Knowing Google, I don't think accessibility will be a problem. Getting data will be the super hard part. Who will be willing to contribute their data? What type of data will get contributed? Will it be the good, raw data or more cleaned and processed data? Do researchers even want to share their data with the rest of the world?

    It's going to be interesting to see what goes up on Google Research in these coming weeks.

    [via Wired and Pimm]

  • Iraq Body Count: A Human Security Project

    January 17, 2008 to Data Sources by Nathan Yau

    Iraq Body Count keeps track of civilian deaths by cross-checking media reports and hospital, morgue, and NGO figures. Along with a widget counter that you can post on your blog or site, IBC also makes their database available for download.

    Systematically extracted details about deadly incidents and the individuals killed in them are stored with every entry in the database. The minimum details always extracted are the number killed, where, and when.

    The data comes in two sets -- incident reports and individuals who have lost their lives -- in the form of CSV files.

    The data is a little depressing, but still very necessary.

  • 25 Highest Grossing Films of All Time (Wallpaper)

    January 2, 2008 to Data Sources by Nathan Yau

    I love to look at how the current week's movies are doing at the box office. I'm not really sure what it is. I think it's kind of like a gauge for what good movies are out; or maybe I'm just constantly amazed by the millions of dollars that movies make; or I think it could be my addiction to numbers?

    Something that always strikes me as interesting is how movies are always breaking records at the box office. So and so movie just broke the record for most money made over a single weekend or a month or a long holiday weekend or for a Thursday when there was at least 2 inches of rain and a dog skateboarded two miles.

    I took a look at the 25 highest grossing American films, adjusted for inflation. I'm so tired of hearing statistics for money comparisons over time that don't adjust for inflation. Wow, gasoline prices are at an all time high. Well guess what -- so are milk, bread, burgers, televisions, light bulbs, paper, cars, and everything else on the planet. Sorry, slight tangent.
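
    For the curious, the adjustment itself is just a ratio of price levels. Here is a rough sketch, assuming you have some price index (such as a consumer price index) for the release year and for today. The function name and every figure below are hypothetical, not the data behind the wallpaper.

```python
def adjust_for_inflation(nominal, index_then, index_now):
    """Convert a past dollar amount into today's dollars using a price index."""
    return nominal * index_now / index_then


# Hypothetical figures for illustration only (not real box office or CPI data).
gross_then = 100_000_000  # nominal gross in the release year
index_then = 50.0         # price index in the release year
index_now = 200.0         # price index today

adjusted = adjust_for_inflation(gross_then, index_then, index_now)
print(f"${adjusted:,.0f}")
```

    Prices quadrupled in this made-up example, so the old gross counts for four times as much in today's dollars.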

    Download the Wallpaper

    As an early birthday gift to you, here are my results in wallpaper form:

    1024 x 768

    1280 x 1024

    1440 x 900

    The movie titles are color coded for genre and the higher grossing films are in a larger font. Drama and action/adventure clearly dominate -- The hills are alive. Luke, I am your father. Phone home. I'll never go hungry again.

    Surprisingly (at least to me), only 7 of the top 25 films won the Oscar for best picture and of the top 50, only 9 won best picture.

  • Download Detailed Baseball Statistics from the DataBank

    December 21, 2007 to Data Sources by Nathan Yau

    Baseball (or all sports for that matter) statistics are all over the place. You can easily find data for pretty much whatever sport and for whichever player you want at any given time. The problem is that if you want to download all of the data at once, you usually have to write a script and do some parsing. Who wants to do that? I don't.

  • Migration/Demographics Database Available for Download

    December 14, 2007 to Data Sources by Nathan Yau

    For our humanflows visualization, we used data from the United Nations Common Database and the Migration Information Source. The great thing about these types of sources is that they are publicly available so that everyone gets to have fun with the data. The downside is that the data is accessible via a user interface that often makes it a chore to get all of the data.

    Hence, to save you some time, you can now download the migration database that we used. I don't see any reason why you have to go through the whole data importing process when we already did it. Enjoy!

    Disclaimer: Keep in mind that the data is from the United Nations and Migration Information Source, so you should refer to the two sites for any documentation. In a nutshell, the inflows table is from MIS and the rest is from United Nations. If you're looking for more, you might also want to check out OECD. I really wanted to use their data at the time, but was having trouble accessing it from Spain.

  • Fast Food Restaurant Menu Items Compared

    November 14, 2007 to Data Sources by Nathan Yau

    We all know fast food is incredibly bad for us and yet we still eat it. Why? Because it has tons of fat and tastes delicious. Never mind that we will die a few days earlier for every French fry we eat.

    Over at Calorie Counter, they try to make us feel guilty with numbers. Check out the Carl's Jr. Double Six Dollar Burger with 1,520 Calories and a delicious 111 grams of fat. I'm a little surprised that it beat out the Burger King Triple Whopper with cheese. I shudder just thinking about eating one of those.

    Anyways, there's a whole lot of numbers here but not an incredible amount of meaning. How bad is bad? How much fat should I consume per day? Is 111 grams of fat bad? If yes, how will it directly affect me? Yes, 111 grams of fat is bad for you. You will directly feel the effects as you sit on the toilet in the morning wondering why it is taking you so long to take a dump. Now that's context.

    Also, with all the numbers, I bet all the tables would benefit from some kind of chart or, at the least, a simple infographic. Any takers? We should have a contest for who can make fast food the least appealing using nutritional data and without bending the truth.

  • Education Statistics Free, Available, and Waiting for You

    October 15, 2007 to Data Sources by Nathan Yau

    Raw, fine-grain data is still a bit hard to come by. Summary statistics (i.e. data that came from some analysis), on the other hand, are often easy to find. A lot of the time the data is already online or just a simple phone call away.

    The National Center for Education Statistics, a part of the U.S. Department of Education, offers a bunch of data including, but not at all limited to, poverty and math achievement, average science scores overall and by grade level, and quantitative literacy.

  • 360 Variables Describing the United States

    September 5, 2007 to Data Sources by Nathan Yau

    [Figure: Order From Randomness Data Browser]

    Order From Randomness has an extensive data collection featuring 360 variables describing all 50 states. The indicators are placed in 25 groups including birth rates, death rates, disease, environment, energy, nutrition, and education.

    Most of the data seems to range somewhere between 1999 and 2005, and I believe there are four variables that run to 2007. There's also a simple data browser featuring a distribution curve and some summary statistics. Generally, students seem to like the extensive set of variables, says one of my professors.

  • U.S. News & World Report College Rankings are Now Available

    August 17, 2007 to Data Sources by Nathan Yau

    The well-known college rankings are now available for your viewing pleasure. Whether the ranking system is legit or not, I'll let you be the judge, but I think everyone should take note that UC Berkeley was again the number one ranked public national university and UCLA was ranked number three. Go Calee-forn-ee-ah! In a nutshell, here's how U.S. News ranks the universities:

    • Peer Assessment - 25%
    • Retention - 20% in national universities and liberal arts colleges and 25% in master's and baccalaureate colleges
    • Faculty Resources - 20%
    • Student Selectivity - 15%
    • Financial Resources - 10%
    • Graduation Rate Performance - 5%; only in national universities and liberal arts colleges
    • Alumni Giving Rate - 5%
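
    To see how weights like these combine, here is a minimal sketch using the national-university percentages from the list above. It assumes each category has already been scored on a 0 to 100 scale, which is my simplification and not necessarily how U.S. News normalizes anything. The school and its subscores are made up.

```python
# Weights for national universities, taken from the list above (percent).
WEIGHTS = {
    "peer_assessment": 25,
    "retention": 20,
    "faculty_resources": 20,
    "student_selectivity": 15,
    "financial_resources": 10,
    "graduation_rate_performance": 5,
    "alumni_giving": 5,
}


def composite_score(subscores):
    """Weighted average of category subscores (each assumed to be 0-100)."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS) / total_weight


# Hypothetical school: perfect peer assessment, 80 in everything else.
example = {name: 80 for name in WEIGHTS}
example["peer_assessment"] = 100

score = composite_score(example)
print(score)
```

    Because peer assessment carries a quarter of the total, a 20-point edge there moves the composite by 5 points.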

    I wonder how much bias is in peer assessment.

  • Making Public Data Public

    July 11, 2007 to Data Sources by Nathan Yau

    As Jon Udell has mentioned, there's a ton of data online, but we often can't find it because it's hidden in the deep, dark basement of some website. He has proposed that people bookmark public datasets on del.icio.us under the tag "publicdata". I think this is a great idea. In turn, you can subscribe to the feed with the url http://del.icio.us/tag/publicdata.

    I've been doing this already for a while, but I had been just tagging with "data". So I'm going to join in on the party and start tagging with publicdata, and I hope others will too. Until sites like Many Eyes and Swivel get more wind beneath their wings, I think it's necessary.

Unless otherwise noted, graphics and words by me are licensed under Creative Commons BY-NC. Contact original authors for everything else.