All You Can Eat at the Twitter Data Buffet

Philip from infochimps posts the results of some heavy Twitter scraping. Data for 2.7 million users, 10 million tweets, and 58 million edges (i.e. connections between users) are available for download to satisfy your data hunger. I know a lot of you social network researchers will especially appreciate the big dataset, and best of all, Twitter gave Philip permission to release it. Yes, you could use the Twitter API, but isn’t it better when someone does it for you?

Download the data here. The password is the Ramanujan taxicab number followed by the word ‘kennedy’, all one word. Google is your friend, if that doesn’t make sense.
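
If you would rather compute than Google: the taxicab number is the smallest number that can be written as a sum of two positive cubes in two different ways. Here is a minimal brute-force sketch of that definition. The function name and the search limit of 20 are my own choices for illustration, not anything from infochimps.

```python
from collections import defaultdict

def smallest_two_way_cube_sum(limit=20):
    """Find the smallest n = a^3 + b^3 (1 <= a <= b <= limit) with two distinct (a, b) pairs."""
    ways = defaultdict(list)
    for a in range(1, limit + 1):
        for b in range(a, limit + 1):
            ways[a**3 + b**3].append((a, b))
    # Keep only the sums that can be reached in at least two different ways.
    candidates = [n for n, pairs in ways.items() if len(pairs) >= 2]
    best = min(candidates)
    return best, ways[best]

number, pairs = smallest_two_way_cube_sum()
password = f"{number}kennedy"  # the number followed by the word, all one word
```

A limit of 20 is enough here because any smaller answer would need cube roots below 13, so the search cannot miss it.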

[Thanks, Tim]

20 Comments

  • err, what’s the username?

  • Username is “theinfo.org” … it’s in the auth dialog :-)

  • oh wow. nerd fail. *facepalm*

  • The more mature release will go up on archive.org and most likely the Amazon AWS free datasets pool. While it’s on the oh-so-rickety staging server, a geek check in front of the 20 GB package seemed perfectly cromulent.

    If anyone would like to add to the data:
    * geocode all the geocodable location fields
    * Do a basic latent semantic analysis to give keyword vectors for people / clusters / demographic slices
    * I just posted half a million expanded tinyurls from the tweet_url list – tie both the tweet_urls and the personal urls in with info from the web at large, such as Alexa/Technorati metrics or Google’s Social Graph API
    * convert it to FOAF
    * what else?

  • @Barry – we’ve all been there :)

    @Flip – I think you’ve covered most bases. Maybe I’ll give geocoding a try… maybe

  • Merry Christmas Nathan, but I can’t open my gift :-) The default app for this is the archive utility on the Mac, and I have downloaded some data, but I see I need an interface to interpret it. I am not a nerd; I’m a communication strategist. I would love to do some research and writing about the way Twitter has evolved into a conversation with replies, rather than just a place to announce what I am doing now. I’d love to work on this with someone…

  • @francine hardaway:

    The files will, in principle, work with anything that can import a spreadsheet-style TSV file. But if you try to load a 56-million row dataset into Excel it will burst into flames. So will most tools; even opening an explorer/finder window on the ripd/_xxxx directories will fill you with regret.

    To work with the dataset at full scale you will probably need to work with a specialist. If that specialist has access to a cluster with Hadoop and Pig, using it is highly, highly recommended (Amazon EC2 if you have the coin). Otherwise, the files will load straight into MySQL using LOAD DATA INFILE (and presumably into other DBs as well). Industrial-strength products such as Mathematica and Cytoscape will struggle with the full set, but can handle good-sized subsets of the data.

    If you have a budget for your project, the get.theinfo google group and the Flowing Data forums might turn up a consultant who can help.
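
For anyone who wants to carve out one of those good-sized subsets without a cluster, here is a rough sketch in Python that streams a TSV file one row at a time instead of loading it all into memory. It assumes an edge file whose first two tab-separated columns are the two user ids; that layout, the function name, and the demo data are assumptions for illustration, so check the real files before relying on it.

```python
import csv
import io

def filter_edges(tsv_file, keep_ids):
    """Yield (source, target) pairs whose endpoints are both in keep_ids.

    Rows are read one at a time, so memory use stays flat however
    large the file is. The two-column layout is an assumption.
    """
    reader = csv.reader(tsv_file, delimiter="\t")
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        source, target = row[0], row[1]
        if source in keep_ids and target in keep_ids:
            yield source, target

# Tiny in-memory stand-in for a real edge file (hypothetical data).
demo = io.StringIO("1\t2\n2\t3\n3\t4\n\n1\t3\n")
subset = list(filter_edges(demo, {"1", "2", "3"}))
```

With a real file you would pass an open file handle in place of `demo` and write the surviving rows out to a smaller TSV that a spreadsheet or Cytoscape can open.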

  • Am I too late to the buffet? I tried to access the data file, but received the message “account was suspended.” Is it mirrored anywhere else? I am a researcher studying social networking, and would love exactly this kind of data for my work.

  • @Kristina — lemme say it’s been an interesting week for infochimps.

    Our backend was on Bluehost.com, who used to have excellent service and a generous plan — but let’s just say my opinion’s changed. We’re scrambling for an alternate solution.

    On the other end of things: Twitter, after initially giving us full permission, has since asked us to hold off on redistribution while they figure out some terms of service to apply. (I think all of us were a bit surprised at the scope of the demand for this data.) The timescale for that may be… who knows.

    Even if I can’t send you the data, since you’re a researcher they may work something out; or there might be a collaboration opportunity. Shoot me an email, flip at infochimps dot org.

  • Ugh, that sucks. FlowingData uses westhost.com for its hosting – in case you’re still looking. I haven’t had any problems, and they have really good support.

  • Let us know how you get on.

    We are looking for some large datasets to play with; they make for good visuals and help give the servers a workout.

  • If size and interestingness is what you’re after I can hook you up with a bunch of big datasets besides this one.

    Here are some of the funner ones: 30 years’ daily stock market open/close/low/hi/volume; 50 years’ global hourly weather; the Enron email corpus; every MLB baseball game outcome, plus every play of every game for 40 years, plus the trajectory and game state for every pitch thrown in 2008 and most of 2007; answers to every question by every student (grades 3-11) on the Texas state exam for the past 5 years; a wealth of large text corpora; others. For silly transitional reasons not all are indexed yet on infochimps, but if you email me (flip at infochimps org) with the structure you’re after I can recommend something from the extended menu.

  • Hi Flip,

    I’ve bookmarked the site and am perusing it now, might play with some of the stocks, and get playing with Flex or something to graph them.

  • @flip How on Earth did you get the answers from every student on the TAAS? LOL I imagine they must have scrapped the exam and put a new one in place? Standardized testing usually has nearly Fort Knox security.

  • I logged in, but the site is suspended. Anyone have the data set?

  • I would like to download the dataset, but I cannot. I do not have any idea what happened. The link is broken?

    Please let me know how to download it.

    Thanks. :)

  • Please let me know when it’s back!

  • I’d be very interested in the data set. Please let me know if it will be available again.

  • Mark Gruëger March 16, 2009 at 6:17 am

    I’m HIGHLY interested too for my thesis. Please.