# Visualizing uncertainty still unsolved problem

Posted to Statistical Visualization  |  Nathan Yau

Data from an experiment may appear rock solid. Upon further examination, the data may morph into something much less firm. A knee-jerk reaction to this conundrum may be to try and hide uncertain scientific results, which are unloved fellow travelers of science. After all, words can afford ambiguity, but with visuals, “we are damned to be concrete,” says Bang Wong, who is the creative director of the Broad Institute of MIT and Harvard. The alternative is to face the ambiguity head-on through visual means.

I still struggle with uncertainty and visualization. I haven’t seen many worthwhile solutions other than the old standbys, boxplots and histograms, which show distributions. But how many people understand spread, skew, etc.? It’s a small proportion, which poses an interesting challenge.

• You can do uncertainty heatmapping with Oneslate’s logic-sharing tool.

• Visualising uncertainty is a difficult problem in general because different situations call for different metrics, and no plot type shows all metrics equally well.

• For my bar charts, I’m still using an overlay of the top and bottom end of the range for uncertainty and I’m always on the lookout for something more elegant. I really like the concept of the fading colours in the link. But I feel like I need glasses as the whole image appears blurred. The effect of all the shading seems to affect my eyesight. Is it just me, early in the morning, or do others see the same thing? Would different colours and intensities improve (or make worse) this effect? Have you seen other similar examples to compare and contrast please?

• Amelia Bellamy-Royds

I think it’s important that we recognize there are two situations where uncertainty may come into play in a data visualization.

In the first situation, there are the times when we want to visualize a data set that comes with nice numerical uncertainty values. These values could represent measurement error in the original data or statistical uncertainty in estimated parameters (e.g., averages or lines of best fit), but either way we have a quantity to visualize and an attached measure of its imprecision. In this case, the uncertainty just becomes another dimension of the data. Whether or not it is easy to visualize depends on how many data dimensions you are already visualizing, and how dense your data is on the page/screen size you are using.

Scientific readers are already used to error bars, but I would encourage people making visualizations for less technical audiences to try to introduce this level of information. For a simple one-dimensional dataset, I like the use of fading colours (as in the example linked by Till Keyling above) rather than more conventional error bars or box plots. I think they are more intuitive, and more accurately represent the nature of (most) data distributions as compared to a blocky 95% confidence cut-off.
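A minimal sketch of the fading-colour idea, assuming the estimate is roughly normally distributed with a known mean and standard deviation. The function names `fade_alphas` and `plot_fade` are made up for illustration, and the plotting part assumes matplotlib is installed. Opacity follows the shape of the density, so the colour is strongest at the estimate and fades with distance, with a conventional 95% interval drawn on top for comparison.

```python
import math


def fade_alphas(mean, sd, lo, hi, steps=9):
    """Return (x, alpha) pairs across [lo, hi].

    alpha is the normal density at x, rescaled so the peak equals 1.
    """
    xs = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return [(x, math.exp(-0.5 * ((x - mean) / sd) ** 2)) for x in xs]


def plot_fade(mean, sd, width=4.0, steps=201):
    """Draw the estimate as a band that fades with distance from the mean.

    Requires matplotlib, imported here so fade_alphas works without it.
    """
    import matplotlib.pyplot as plt

    lo, hi = mean - width * sd, mean + width * sd
    dx = (hi - lo) / (steps - 1)
    fig, ax = plt.subplots(figsize=(6, 1.5))
    for x, alpha in fade_alphas(mean, sd, lo, hi, steps):
        # One thin vertical strip per x value, opacity taken from the density.
        ax.axvspan(x - dx / 2, x + dx / 2, color="steelblue",
                   alpha=alpha, linewidth=0)
    # Conventional 95% interval (mean +/- 1.96 sd) for comparison.
    ax.errorbar(mean, 0.5, xerr=1.96 * sd, fmt="o", color="black", capsize=4)
    ax.set_xlim(lo, hi)
    ax.set_ylim(0, 1)
    ax.set_yticks([])
    return fig
```

Calling `plot_fade(10, 2)` would draw one such band. The hard edges of the error bar and the gradual fade describe the same estimate, which makes the contrast with a blocky cut-off easy to see.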

You may need to add annotations explaining what the bars or blurred area signify, but the more examples people are exposed to the more natural it will be to understand. Annotations/captions are a good idea anyways, since error bars can sometimes represent population variance (prediction error) and sometimes represent standard error on the estimated mean.
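To make the distinction concrete, here is a small sketch that computes both quantities for the same sample. The numbers are invented for illustration, and `spread_and_standard_error` is a hypothetical helper name.

```python
import math
import statistics


def spread_and_standard_error(values):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # spread of individual observations
    se = sd / math.sqrt(n)          # uncertainty of the estimated mean
    return mean, sd, se


# Made-up measurements, purely for illustration.
sample = [4.1, 5.3, 4.8, 6.0, 5.2, 4.6, 5.9, 5.1, 4.4]
mean, sd, se = spread_and_standard_error(sample)
print(f"mean = {mean:.2f}")
print(f"bars showing +/- 1 SD: {mean - sd:.2f} to {mean + sd:.2f}")
print(f"bars showing +/- 1 SE: {mean - se:.2f} to {mean + se:.2f}")
```

With nine observations the standard-error bars are a third the width of the standard-deviation bars, so two charts of the same data can look very different depending on which one the bars represent. That is why the caption needs to say.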

That said, if your data visualization is already quite complex — with colours or sizes of points already having assigned meanings — or if points overlap on the page, then it becomes more difficult to add in a transparent or fuzzy nature to the points without detracting from other aspects of meaning.

It may also be difficult to clearly attribute the uncertainty to the correct quantity; imagine a stacked bar graph where you have different uncertainty values for each category and for the total. You may need to completely rethink your visualization design; instead of a stacked bar graph, perhaps a rectangle graph where total values (and their uncertainty) are represented in one direction, while the categorical distribution (and its uncertainty) is represented in another. But if you’ve got more than two categories, blurring the boundaries between them will still not properly represent the uncertainty.

(Anyone who is having trouble picturing what such a rectangle plot would look like can check out this article, which also discusses a use of a colour code to represent statistical confidence values:
http://www.math.yorku.ca/SCS/sugi/sugi17-paper.html#Fig_mosaica )

All of that assumes that you have a number derived from a statistical analysis to quantify the uncertainty of your data. The other situation where uncertainty is often not communicated comes when the visualization itself is the analysis.

Many types of visualizations, from maps to word clouds, display individual data points and then implicitly make use of the human brain’s ability to identify patterns in visual data. However, the human brain is so good at identifying visual patterns (and patterns in general) that it often identifies patterns in random variation. We see animals in the clouds, a face on the moon, lucky streaks in casino games, and cancer hot-spots in the geographic juxtaposition of random unfortunate events.

Nearly the entire field of statistics is based on determining whether or not a particular pattern in the data could reasonably have occurred by random variation in the sample. But with the accessibility of data and visualization tools, many people are going straight from data to final presentation without any statistical analysis in between.

In order to accurately represent uncertainty in these cases, you first have to do the analysis, whether it is a simple Poisson test or a more complex cluster analysis, to distinguish random patterns or clusters from those that may be significant. But the statistical analysis required is often many times more complicated than the data visualization, and may be beyond the skill level of the person involved. (This gets into Nathan’s “Non-statistician analysts are the new norm” post from June.) And that still assumes that the data was collected in ways suitable for statistical analysis. The uncertainty of biased and incomplete samples can rarely be properly quantified, so how could it be properly visualized?
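As a rough sketch of what the simple Poisson test might look like, suppose a region is expected to see a certain number of cases if events occur at random, and you want to know how surprising the observed count is. The function name `poisson_upper_tail` and the example numbers are assumptions made for illustration.

```python
import math


def poisson_upper_tail(observed, expected):
    """Probability of seeing `observed` or more events when the count
    follows a Poisson distribution with mean `expected`."""
    if observed <= 0:
        return 1.0
    term = math.exp(-expected)      # P(X = 0)
    below = 0.0
    for k in range(observed):       # accumulate P(X = 0) .. P(X = observed - 1)
        below += term
        term *= expected / (k + 1)  # step from P(X = k) to P(X = k + 1)
    return max(0.0, 1.0 - below)


# Hypothetical region: 4 cases expected under random variation, 9 observed.
p_value = poisson_upper_tail(9, 4.0)
print(f"P(9 or more cases | 4 expected) = {p_value:.4f}")
```

A small probability for a single region is only part of the story. If a map shows many regions, a few of them will exceed their expected counts by chance alone, so the threshold needs to account for how many regions were examined.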

But it is an important topic, one that I think should always be included in a discussion of data visualization. And I do truly believe that if more visualizations, particularly in the media, included uncertainty values, it would become more natural to understand them, and even to expect them in any good visualization.

That’s a long post, and comes two weeks after the main blog, but hopefully it will help or inspire someone, sometime, who stumbles on this in the future.

–ABR

• I recommend interactive simulation, which provides an experiential feel for probability distributions. It can be done in native Excel (see the SIPmath page of ProbabilityManagement.org).
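The comment above refers to Excel. As a stand-in, the same idea can be sketched in Python: draw many random trials and look at the shape of the results. Everything here is an assumption for illustration, including the function names, the choice of summing three uniformly distributed parts, and the text histogram.

```python
import math
import random
from collections import Counter


def simulate_totals(n_trials, low, high, n_parts=3, seed=0):
    """Simulate totals, each the sum of n_parts uniform draws in [low, high]."""
    rng = random.Random(seed)
    return [sum(rng.uniform(low, high) for _ in range(n_parts))
            for _ in range(n_trials)]


def text_histogram(values, bin_width):
    """Return one text line per bin, with one '#' per simulated value."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    lines = []
    for b in range(min(counts), max(counts) + 1):
        lines.append(f"{b * bin_width:6.1f} | {'#' * counts.get(b, 0)}")
    return lines


totals = simulate_totals(200, low=1.0, high=5.0)
for line in text_histogram(totals, bin_width=1.0):
    print(line)
```

Changing `low`, `high`, or `seed` and rerunning shows how much the picture shifts from run to run, which gives a feel for the distribution before any formal summary is read off it.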