Data Visualization is Only Part of the Answer to Big Data

How can we now cope with a large amount of data and still do a thorough job of analysis so that we don’t miss the Nobel Prize?

— Bill Cleveland, Getting Past the Pie Chart, SEED Magazine, 2.18.2009

For the past year, I’ve been slowly drifting away from my statistical roots – more interested in design and aesthetics than in whether or not a particular graphic works, or in the more numeric tools at my disposal. I’ve always had more fun experimenting on a bunch of different things than really knuckling down on a particular problem. This works for a lot of things – like online musings – but you miss a lot of the important technical points in the process, so I’ve been (slowly) working my way back to the analytical side of the river.

If you really want to learn about a large dataset, visualization is only part of the answer. It’s an exploratory process. You create a graph. You create a whole bunch of graphs. Notice anything interesting? Okay, let’s look over there. This process is called exploratory data analysis, a term coined by famed statistician John Tukey back in the 1970s. Too often we settle on a particular graphic because it looks pretty, or worse, because it helps prove our point. We get so blinded by outside motivations that we forget to look at what else the data have to say. On the flip side, we often like to visualize everything at once and leave it at that. That works to an extent, but we miss out on a lot of details.
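
To make that loop concrete, here’s a minimal sketch of the graph-look-graph-again cycle in Python with pandas and matplotlib – my choice of tools, not anything prescribed above, and mydata.csv is just a placeholder for whatever dataset you happen to be exploring.

```python
# A minimal exploratory pass: summarize, plot everything, then zoom in.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("mydata.csv")  # placeholder filename

# Start with plain numbers: ranges, means, quartiles.
print(df.describe())

# Then a whole bunch of graphs at once: a histogram per numeric column.
df.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()

# Notice anything interesting? Okay, look over there: pairwise
# scatterplots expose relationships that single-variable charts hide.
pd.plotting.scatter_matrix(df.select_dtypes(include="number"), figsize=(10, 10))
plt.show()
```

The point isn’t these particular plots; it’s the habit of letting one view suggest the next.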

Basically, what I’m trying to say is that design can do wonders for visualization, yes, but so can analysis. Put the two together, and you’ll gain a much better understanding of a dataset than you would with just one or the other. In my experience, designers are afraid of statistical methods, and statisticians are oblivious to design. Learn both, and we’ll all be that much better at understanding the even bigger data to come.

8 Comments

  • While charts and other data visualizations are probably sometimes intentionally manipulated, I think a lot of the time the errors are simply mistakes or oversights. Like you said, the designers and the statisticians like their jobs, but not each other’s.

    One site that I’ve found quite interesting has fun with the errors in informational graphics. Perhaps Flowing Data readers will find it interesting too.

    Check out Junk Charts at http://junkcharts.typepad.com/

  • simianmenace March 20, 2009 at 7:18 am

    The visualisations that work best for me are those where the presentation layer elucidates relationships within the data – easy on the eye, yet still a lens. Large datasets are often samples of even more massive populations, with sampling error still present.

  • himan powered March 20, 2009 at 3:32 pm

    Great intro to the subject. I suspect the human factors field has already started looking into this. Certainly, as we continue to need to make sense of huge datasets, there will be real research into which visualizations do the best job of increasing usability versus which are just pretty design. I wish I had the time to do this myself.

  • I suggest you have a look at HCE (http://www.cs.umd.edu/hcil/hce/) – it is a nice tool that tries to fill this gap between statistics and visualization.

  • I have come to the exact same conclusion: use raw data visualization to define hypotheses before doing the statistical analysis, which leads to new visualizations again, and so on. http://saaientist.blogspot.com/2008/11/visualize-or-summarize.html

  • I couldn’t agree more. We (a design research group) recently formed a partnership with a department of statistics in order to work on both analysis and visualization, integrating these domains as much as possible. I can say that the relationship works very well, and there are more common elements than one would expect.

  • @Paolo – that’s definitely something I’d like to hear more about as that relationship develops.