All You Can Eat at the Twitter Data Buffet

December 24, 2008

Topic

Data Sources

Philip from infochimps posts the results of some heavy Twitter scraping. Data for 2.7 million users, 10 million tweets, and 58 million edges (i.e. connections between users) are available for download to satisfy your data hunger. I know a lot of you social network researchers will especially appreciate the big dataset, and best of all, Twitter gave Philip permission to release it. Yes, you could use the Twitter API, but isn’t it better when someone does it for you?

Download the data here. The password is the Ramanujan taxicab number followed by the word ‘kennedy’ – all one word. Google is your friend, if that doesn’t make sense.

[Thanks, Tim]

20 Comments

barry — December 24, 2008 at 3:33 am

err, what’s the username?
NathanaelB — December 24, 2008 at 3:38 am

Username is “theinfo.org” … it’s in the auth dialog :-)
barry — December 24, 2008 at 3:50 am

oh wow. nerd fail. *facepalm*
Flip Kromer — December 24, 2008 at 5:01 am

The more mature release will go up on archive.org and most likely the Amazon AWS free datasets pool. While it’s on the oh-so-rickety staging server, a geek check in front of the 20 GB package seemed perfectly cromulent.

If anyone would like to add to the data:
* geocode all the geocodable location fields
* Do a basic latent semantic analysis to give keyword vectors for people / clusters / demographic slices
* I just posted half a million expanded tinyurls from the tweet_url list – tie both the tweet_urls and the personal urls in with info from the web at large, such as Alexa/Technorati metrics or Google’s Social Graph API
* convert it to FOAF
* what else?
Nathan Yau — December 24, 2008 at 12:18 pm

@Barry – we’ve all been there :)

@Flip – I think you’ve covered most bases. Maybe I’ll give geocoding a try… maybe
francine hardaway — December 25, 2008 at 10:28 am

Merry Christmas Nathan, but I can’t open my gift :-) The default app for this is the archive utility on the Mac, and I have downloaded some data, but I see I need an interface to interpret it. I am not a nerd; I’m a communication strategist. I would love to do some research and writing about the way Twitter has evolved into a conversation with replies, rather than just a place to announce what I am doing now. I’d love to work on this with someone…
Philip (flip) Kromer — December 29, 2008 at 10:12 pm

@francine hardaway:

The files will, in principle, work with anything that can import a spreadsheet-style TSV file. But if you try to load a 56-million row dataset into Excel it will burst into flames. So will most tools; even opening an explorer/finder window on the ripd/_xxxx directories will fill you with regret.

To work with the dataset at full scale you will probably need to work with a specialist. If that specialist has access to a cluster with Hadoop and Pig, using it is highly, highly recommended (Amazon EC2 if you have the coin). Otherwise, the files will load straight into MySQL using LOAD DATA INFILE (and presumably other DBs as well). Industrial-strength products such as Mathematica and Cytoscape will struggle, but can handle good-sized subsets of the data.
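For anyone who only needs one of those good-sized subsets, here is a minimal sketch of streaming a TSV edge list row by row instead of loading it all at once. The column layout (source user id, then target user id) and the tiny in-memory sample are assumptions for illustration only; the real files may be laid out differently.

```python
import csv
import io
from collections import Counter


def filter_edges(lines, keep_users):
    """Yield (src, dst) pairs whose endpoints are both in keep_users.

    Assumes each row is tab-separated with the source id in column 0
    and the target id in column 1 (a hypothetical layout).
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        src, dst = row[0], row[1]
        if src in keep_users and dst in keep_users:
            yield src, dst


def out_degree(edges):
    """Count how many outgoing edges each source user has."""
    counts = Counter()
    for src, _dst in edges:
        counts[src] += 1
    return counts


# Tiny in-memory stand-in for a real edge file (made-up ids).
demo = io.StringIO("1\t2\n1\t3\n2\t3\n4\t1\n")
subset = list(filter_edges(demo, {"1", "2", "3"}))
degrees = out_degree(subset)
print(subset)
print(dict(degrees))
```

With a real file, the same functions would take an open file handle (for example `open("edges.tsv", newline="")`, where the filename is hypothetical), so only one row is held in memory at a time.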

If you have a budget for your project, the get.theinfo google group and the Flowing Data forums might turn up a consultant who can help.
Kristina — January 13, 2009 at 5:04 pm

Am I too late to the buffet? I tried to access the data file, but received the message “account was suspended.” Is it mirrored anywhere else? I am a researcher studying social networking, and would love exactly this kind of data for my work.
Philip (flip) Kromer — January 14, 2009 at 5:59 am

@Kristina — lemme say it’s been an interesting week for infochimps.

Our backend was on Bluehost.com, who used to have excellent service and a generous plan — but let’s just say my opinion’s changed. We’re scrambling for an alternate solution.

On the other end of things: Twitter, after initially giving us full permission, has since asked us to hold off on redistribution while they figure out some terms of service to apply. (I think all of us were a bit surprised at the scope of the demand for this data.) The timescale for that may be… who knows.

Even if I can’t send you the data, since you’re a researcher they may work something out; or there might be a collaboration opportunity. Shoot me an email, flip at infochimps dot org.
Nathan Yau — January 14, 2009 at 10:00 am

Ugh, that sucks. FlowingData uses westhost.com for its hosting – in case you’re still looking. I haven’t had any problems, and they have really good support.
Manonfire — January 15, 2009 at 12:35 pm

Let us know how you get on.

We are looking for some large datasets to play with; they make for good visuals and help give the servers a workout.
Philip (flip) Kromer — January 15, 2009 at 2:49 pm

If size and interestingness are what you’re after I can hook you up with a bunch of big datasets besides this one.

Here are some of the funner ones: 30 years’ daily stock market open/close/low/hi/volume; 50 years’ global hourly weather; the Enron email corpus; every MLB baseball game outcome, plus every play of every game for 40 years, plus the trajectory and game state for every pitch thrown in 2008 and most of 2007; answers to every question by every student (grades 3-11) on the Texas state exam for the past 5 years; a wealth of large text corpora; others. For silly transitional reasons not all are indexed yet on infochimps, but if you email me (flip at infochimps org) with the structure you’re after I can recommend something from the extended menu.
Pingback: Shared data services - examples from Thomson Reuters and Salesforce | The Equity Kicker
Manonfire — January 17, 2009 at 6:21 am

Hi Flip,

I’ve bookmarked the site and am perusing it now, might play with some of the stocks, and get playing with Flex or something to graph them.
Ryan — January 18, 2009 at 2:00 pm

@flip How on Earth did you get the answers from every student on the TAAS? LOL I imagine they must have scrapped the exam and put a new one in place? Standardized testing usually has nearly Fort Knox security.
Terry Martin — January 27, 2009 at 1:44 pm

I logged in, but the site is suspended. Anyone have the data set?
sang — February 5, 2009 at 4:39 pm

I would like to download the dataset, but I cannot. I do not have any idea what happened. The link is broken?

Please let me know how to download it.

Thanks. :)
Cassio — March 3, 2009 at 5:02 am

Please let me know when it’s back!
Jorg Unger — March 11, 2009 at 10:27 am

I’d be very interested in the data set. Please let me know if it will be available again.
Mark Gruëger — March 16, 2009 at 6:17 am

I’m HIGHLY interested too for my thesis. Please.
