CNN transcript collection, 2000-2012

Thanks to the Internet Archive and CNN, thirteen years of transcripts, about a gigabyte compressed, are available to download as one file.

For over a decade, CNN (Cable News Network) has been providing transcripts of shows, events and newscasts from its broadcasts. The archive has been maintained and the text transcripts have been dependably available at transcripts.cnn.com. This is a just-in-case grab of the years of transcripts for later study and historical research.

Changes in news coverage and CNN’s focus over the years, anyone?

[via @A_L]

5 Comments

simon — May 10, 2012 at 12:21 am

really cool! Would have been cooler if they had 2 versions: one with advertising and one without. :) anyway it’s amazing what we will be able to do with this data!
B — May 10, 2012 at 6:49 am

And it would be great if other news organizations follow suit! Here’s an interesting paper analyzing newspaper data: http://faculty.chicagobooth.edu/matthew.gentzkow/research/competition.pdf
T — May 10, 2012 at 9:54 am

I’m a bit of a newbie, so forgive the question:

How can I search through all of these files for key words or phrases if they are all HTML files?
- Kevin Carlson — May 10, 2012 at 3:28 pm
  
  Maybe parse using Python BeautifulSoup, then process using NLTK and report with MatPlotLib?
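A rough sketch of the first step in that suggestion, answering the search question above: strip the HTML down to visible text, then count a phrase in each file. To keep it dependency-free, this uses Python's standard-library `html.parser` in place of BeautifulSoup. The function names, the `.html` extension, and the UTF-8 encoding are assumptions for illustration, not anything known about how the archive is actually laid out.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def search_transcripts(root, phrase):
    """Return (path, count) pairs for .html files under root containing phrase.

    Matching is case-insensitive. Assumes the archive has been extracted
    into a directory of .html files (hypothetical layout).
    """
    needle = phrase.lower()
    hits = []
    for path in sorted(Path(root).rglob("*.html")):
        # errors="replace" avoids crashing on files that are not valid UTF-8
        text = html_to_text(path.read_text(encoding="utf-8", errors="replace"))
        count = text.lower().count(needle)
        if count:
            hits.append((path, count))
    return hits
```

Something like `search_transcripts("transcripts", "health care")` would then list the matching files with their hit counts, where `transcripts` is a placeholder for wherever the download was unpacked. From there, the extracted text could be handed to NLTK or plotted, as suggested.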
alexander — May 12, 2012 at 8:03 pm

this seems to work better: http://transcripts.cnn.com/TRANSCRIPTS/
alas, no easy search i can find
