Thanks to the Internet Archive and CNN, thirteen years of transcripts, about a gigabyte compressed, are available to download as one file.
For over a decade, CNN (Cable News Network) has been providing transcripts of shows, events and newscasts from its broadcasts. The archive has been maintained and the text transcripts have been dependably available at transcripts.cnn.com. This is a just-in-case grab of the years of transcripts for later study and historical research.
Changes in news coverage and CNN’s focus over the years, anyone?
Really cool! It would have been cooler if they had two versions: one with advertising and one without. :) Anyway, it’s amazing what we will be able to do with this data!
And it would be great if other news organizations followed suit! Here’s an interesting paper analyzing newspaper data: http://faculty.chicagobooth.edu/matthew.gentzkow/research/competition.pdf
I’m a bit of a newbie, so forgive the question:
How can I search through all of these files for keywords or phrases if they are all HTML files?
Maybe parse using Python's BeautifulSoup, then process using NLTK and report with Matplotlib?
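To make the parsing step concrete, here is a minimal sketch of a keyword search over a folder of HTML files. It uses only Python's standard library (html.parser) rather than BeautifulSoup, so it runs without installing anything. The function names and the folder layout are assumptions for illustration; nothing here reflects how the archive file is actually organized once unpacked.

```python
import os
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags from an HTML string and collapse runs of whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps adjacent paragraphs apart; the trade-off is
    # that a word split by an inline tag would also be split here.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def search_transcripts(root_dir, phrase):
    """Yield (path, count) for each HTML file under root_dir containing phrase.

    Matching is case-insensitive. root_dir is a hypothetical folder holding
    the unpacked transcripts.
    """
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if not name.lower().endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" avoids crashing on files in an unexpected encoding.
            with open(path, encoding="utf-8", errors="replace") as fh:
                text = html_to_text(fh.read())
            count = len(pattern.findall(text))
            if count:
                yield path, count
```

Looping over search_transcripts("transcripts", "health care") and printing each pair would list every matching file with its hit count, where "transcripts" stands in for wherever the download was unpacked. The plain text returned by html_to_text could then be handed to NLTK for tokenizing or to Matplotlib for charting counts over time.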
This seems to work better: http://transcripts.cnn.com/TRANSCRIPTS/
Alas, no easy search that I can find.