<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>FlowingData &#187; Exploratory Data Analysis</title>
	<atom:link href="http://flowingdata.com/category/statistics/exploratory-data-analysis/feed/" rel="self" type="application/rss+xml" />
	<link>http://flowingdata.com</link>
	<description>Strength in Numbers</description>
	<lastBuildDate>Sun, 12 Feb 2012 01:23:22 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<atom:link rel="next" href="http://flowingdata.com/category/statistics/exploratory-data-analysis/feed/?page=2" />

		<item>
		<title>Data Visualization is Only Part of the Answer to Big Data</title>
		<link>http://flowingdata.com/2009/03/20/data-visualization-is-only-part-of-the-answer-to-big-data/</link>
		<comments>http://flowingdata.com/2009/03/20/data-visualization-is-only-part-of-the-answer-to-big-data/#comments</comments>
		<pubDate>Fri, 20 Mar 2009 09:27:31 +0000</pubDate>
		<dc:creator>Nathan Yau</dc:creator>
				<category><![CDATA[Design]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>

		<guid isPermaLink="false">http://flowingdata.com/?p=1456</guid>
		<description><![CDATA[How can we now cope with a large amount of data and still do a thorough job of analysis so &#8230;]]></description>
			<content:encoded><![CDATA[<blockquote class="quote"><p>How can we now cope with a large amount of data and still do a thorough job of analysis so that we don't miss the Nobel Prize?</p>
<div class="cite">&mdash; Bill Cleveland, <a href="http://seedmagazine.com/content/article/getting_past_the_pie_chart/">Getting Past the Pie Chart</a>, SEED Magazine, 2.18.2009</div>
</blockquote>
<p>For the past year, I've been slowly drifting away from my statistical roots - more interested in design and aesthetics than in whether a particular graphic works or in the more numeric tools at my disposal. I've always had more fun experimenting with a bunch of different things than really knuckling down on a particular problem. This works for a lot of things - like online musings - but you miss a lot of the important technical points in the process, so I've been (slowly) working my way back to the analytical side of the river.</p>
<p>If you really want to learn about a large dataset, visualization is only part of the answer. It's an exploratory process. You create a graph. You create a whole bunch of graphs. Notice anything interesting? Okay, let's look over there. This process is called <a href="http://en.wikipedia.org/wiki/Exploratory_data_analysis">exploratory data analysis</a>, a term coined by famed statistician <a href="http://flowingdata.com/2008/01/01/john-tukey-and-the-beginning-of-interactive-graphics/">John Tukey</a> back in the 1970s. Too often we settle on a particular graphic because it looks pretty, or worse, because it helps prove our point. We get so blinded by outside motivations that we forget to listen and look at what else the data have to say. On the flip side, we often like to visualize everything at once and leave it at that. This works to an extent, but we miss out on a lot of details.</p>
<p>Basically, what I'm trying to say is that design can do wonders for visualization, yes, but so can analysis. Put the two together, and you're going to gain a much better understanding of a dataset than if you were to have just one or the other. In my experience, designers are afraid of statistical methods and statisticians are oblivious to design. I say - put the two together. Learn both, and we'll all be that much better at understanding the even bigger data to come.</p>
]]></content:encoded>
			<wfw:commentRss>http://flowingdata.com/2009/03/20/data-visualization-is-only-part-of-the-answer-to-big-data/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>John Tukey and the Beginning of Interactive Graphics</title>
		<link>http://flowingdata.com/2008/01/01/john-tukey-and-the-beginning-of-interactive-graphics/</link>
		<comments>http://flowingdata.com/2008/01/01/john-tukey-and-the-beginning-of-interactive-graphics/#comments</comments>
		<pubDate>Tue, 01 Jan 2008 09:20:37 +0000</pubDate>
		<dc:creator>Nathan Yau</dc:creator>
				<category><![CDATA[Exploratory Data Analysis]]></category>

		<guid isPermaLink="false">http://flowingdata.com/2008/01/01/john-tukey-and-the-beginning-of-interactive-graphics/</guid>
		<description><![CDATA[More than 30 years ago, visualization cracked its way into stat.]]></description>
			<content:encoded><![CDATA[<p><img src='http://flowingdata.com/wp-content/uploads/2007/12/tukey.png' alt='John Tukey' class='imgright' />With the start of a new year, it only seems right to open with John Tukey and his work with interactive graphics. In 1972, when computers were giant and screens were green, John Tukey came up with PRIM-9, the first program to use interactive dynamic graphics to explore multivariate data. PRIM-9 allowed picturing, rotation, isolation, and masking. In other words, PRIM-9 allowed users to see multivariate data from different angles and identify structures in a dataset that might otherwise have gone undiscovered (kind of like the more recent <a href="http://ggobi.org">GGobi</a>). </p>
<blockquote><p>To fully appreciate the revolutionary nature of PRIM-9 one has to view it against the backdrop of its time. When Statistics was widely taken to be synonymous with inference and hypotheses testing, PRIM-9 was a purely descriptive instrument designed for data exploration. When statistics research meant research in statistical theory, employing the tools of mathematics, the research content of PRIM-9 was in the area of computer-human interfaces, drawing on tools from computer science. When the product of statistical research was theorems published in journals, PRIM-9 was a program documented in a movie.</p>
<p><cite><em>John W. Tukey's Work on Interactive Graphics</em>. The Annals of Statistics, Vol. 30 No. 6. 2002.</cite>
</p></blockquote>
<p>Luckily, you can appreciate Tukey's work <a href="http://stat-graphics.org/movies/prim9.html">here</a> at the ASA video library. It's even more amazing when you consider where computers and technology were back then. Who knows where Statistics would be if it weren't for Tukey and his brilliance and creativity? I can't imagine, or maybe I just don't want to.</p>
<p>Tukey was someone who truly understood data -- structure, patterns, and what to look for -- and because of that, he was able to create something amazing.</p>
]]></content:encoded>
			<wfw:commentRss>http://flowingdata.com/2008/01/01/john-tukey-and-the-beginning-of-interactive-graphics/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Netflix Prize Dataset Visualization</title>
		<link>http://flowingdata.com/2007/12/11/netflix-prize-dataset-visualization/</link>
		<comments>http://flowingdata.com/2007/12/11/netflix-prize-dataset-visualization/#comments</comments>
		<pubDate>Tue, 11 Dec 2007 09:12:05 +0000</pubDate>
		<dc:creator>Nathan Yau</dc:creator>
				<category><![CDATA[Exploratory Data Analysis]]></category>

		<guid isPermaLink="false">http://flowingdata.com/2007/12/11/netflix-prize-dataset-visualization/</guid>
		<description><![CDATA[<p><a href="http://flowingdata.com/2007/12/11/netflix-prize-dataset-visualization/"><img width="520" height="213" src="http://flowingdata.com/wp-content/uploads/2007/12/netflix-prize-1.png" class="attachment-medium wp-post-image" alt="netflix-prize (1)" title="netflix-prize (1)" /></a></p>One million dollars goes to whoever can understand the Netflix ratings dataset best.]]></description>
			<content:encoded><![CDATA[<p><a href="http://flowingdata.com/2007/12/11/netflix-prize-dataset-visualization/"><img width="520" height="213" src="http://flowingdata.com/wp-content/uploads/2007/12/netflix-prize-1.png" class="attachment-medium wp-post-image" alt="netflix-prize (1)" title="netflix-prize (1)" /></a></p><p>Most are familiar with the <a href="http://netflixprize.com">Netflix Prize</a>. If you're not, Netflix has offered a one million dollar prize to whoever improves the accuracy of their movie recommendation system by a certain amount. It's been going on for a little over a year with still no grand prize winner. The dataset contains about 100 million ratings.</p>
<p>The above is a <a href="http://abeautifulwww.com/2007/04/03/an-interactive-visualization-of-the-netflix-prize-dataset/">visualization of the Netflix dataset</a>. Each dot represents a movie, and the closer two dots are, the more similar the two corresponding movies are, based on Netflix ratings. I'm guessing the placement of the dots was decided by some variant of multidimensional scaling.</p>
<p>It's kind of fun to scroll over the clusters. In the bottom right, for example, we see Babylon 5, Buffy the Vampire Slayer, Alias, and Battlestar Galactica clumped together. The giant blob in the middle, however, is pretty useless; it'd probably benefit from some zoom functionality.</p>
<h2>The Need to Explore</h2>
<p>I'm kind of surprised that I haven't seen more Netflix visualizations like this (or ones better than this), because I'm pretty sure they would help reveal some relationships that typical analysis won't. I was browsing the forum and saw someone ask if others had had success loading the 100 million observation dataset into <a href="http://r-project.org">R</a>. Silly undergrad.</p>
<p>A computer scientist, designer, and statistician walk into a bar; they discuss how they would boost the Netflix recommendation system. The punchline is that they win a million dollars, but I'm not sure what happens in between.</p>
]]></content:encoded>
			<wfw:commentRss>http://flowingdata.com/2007/12/11/netflix-prize-dataset-visualization/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Transcript Analyzer for Republican Debate</title>
		<link>http://flowingdata.com/2007/12/04/transcript-analyzer-for-republican-debate/</link>
		<comments>http://flowingdata.com/2007/12/04/transcript-analyzer-for-republican-debate/#comments</comments>
		<pubDate>Tue, 04 Dec 2007 09:42:54 +0000</pubDate>
		<dc:creator>Nathan Yau</dc:creator>
				<category><![CDATA[Exploratory Data Analysis]]></category>

		<guid isPermaLink="false">http://flowingdata.com/2007/12/04/transcript-analyzer-for-republican-debate/</guid>
		<description><![CDATA[<p><a href="http://flowingdata.com/2007/12/04/transcript-analyzer-for-republican-debate/"><img width="520" height="378" src="http://flowingdata.com/wp-content/uploads/2007/12/transcript-analyzer.png" class="attachment-medium wp-post-image" alt="New York Times Transcript Analyzer" title="New York Times Transcript Analyzer" /></a></p>The New York Times recently put up a cool data exploration tool to sift through the transcript of the most &#8230;]]></description>
			<content:encoded><![CDATA[<p><a href="http://flowingdata.com/2007/12/04/transcript-analyzer-for-republican-debate/"><img width="520" height="378" src="http://flowingdata.com/wp-content/uploads/2007/12/transcript-analyzer.png" class="attachment-medium wp-post-image" alt="New York Times Transcript Analyzer" title="New York Times Transcript Analyzer" /></a></p><p>The New York Times recently put up a cool data exploration tool to sift through the transcript of the most recent Republican debate. They call it the <a href="http://www.nytimes.com/interactive/2007/11/28/us/politics/20071128_DEBATE_GRAPHIC.html#transcript">transcript analyzer</a>. There are three key features:</p>
<ol>
<li>View where candidates put in their two cents, indicated by the blue highlighted rectangles</li>
<li>Read the actual chunks of transcript for each block</li>
<li>Search the transcript to see when specific words and phrases were used, indicated by the smaller gray highlighted rectangles</li>
</ol>
<p>My particular favorite is the search feature because it really lets readers dig into the transcript. A reader can find out which candidate is (or isn't) talking about his or her point of interest, and when in the debate the topic was discussed. The intuitive text scrolling is pretty awesome too. Good job, New York Times!</p>
<p>[via <a href="http://blog.jonudell.net/2007/11/29/excellent-debate-visualizer-at-nytimescom/">Jon Udell</a>]</p>
]]></content:encoded>
			<wfw:commentRss>http://flowingdata.com/2007/12/04/transcript-analyzer-for-republican-debate/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Exploring Twitter with Blocks</title>
		<link>http://flowingdata.com/2007/09/02/exploring-twitter-with-blocks/</link>
		<comments>http://flowingdata.com/2007/09/02/exploring-twitter-with-blocks/#comments</comments>
		<pubDate>Sun, 02 Sep 2007 06:57:42 +0000</pubDate>
		<dc:creator>Nathan Yau</dc:creator>
				<category><![CDATA[Exploratory Data Analysis]]></category>

		<guid isPermaLink="false">http://flowingdata.com/2007/09/02/exploring-twitter-with-blocks/</guid>
		<description><![CDATA[<p><a href="http://flowingdata.com/2007/09/02/exploring-twitter-with-blocks/"><img width="500" height="183" src="http://flowingdata.com/wp-content/uploads/2007/09/twitter-blocks1.png" class="attachment-medium wp-post-image" alt="twitter-blocks" title="twitter-blocks" /></a></p>On their new exploration section, Twitter blocks is available for viewing and use. The viz is in Flash and is &#8230;]]></description>
			<content:encoded><![CDATA[<p><a href="http://flowingdata.com/2007/09/02/exploring-twitter-with-blocks/"><img width="500" height="183" src="http://flowingdata.com/wp-content/uploads/2007/09/twitter-blocks1.png" class="attachment-medium wp-post-image" alt="twitter-blocks" title="twitter-blocks" /></a></p><p>On their new <a href="http://explore.twitter.com/">exploration section</a>, Twitter Blocks is available for viewing and use. The viz is in Flash and is supposed to allow you to explore your neighbors as well as your neighbors' neighbors. I think the higher up the blocks are, the more recent they are, but it's kind of hard to say. Other than that, I'm actually not really sure what I'm looking at. I thought it might be because I'm not following that many people, but I viewed the blocks for the public timeline and still had trouble deciphering them. Maybe others will have better luck.</p>
<p><strong>Update:</strong> Michal <a href="http://mike.teczno.com/notes/uselessness.html">posted about the feedback</a> they've been getting on Twitter Blocks, and it's certainly worth reading:</p>
<blockquote><p>So we get this a lot: "Beautiful! But useless!". We've heard it in response to most projects we've done over the past few years (one exception has been Oakland Crimespotting, whose stock yokel response is: "no way am I moving to Oakland!").</p></blockquote>
<p>This kinda surprises me. I think their other projects are pretty useful and informative.</p>
]]></content:encoded>
			<wfw:commentRss>http://flowingdata.com/2007/09/02/exploring-twitter-with-blocks/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

