This is a guest post by Martin Krzywinski who develops Circos, a GPL-licensed (free) visualization tool that can help you show relationships in data. This article is based on a longer writeup which you can find here.
Suppose that you are reading an article and the text refers you to a table on the next page. Before you turn the page, what are your expectations of the table? Chances are, you would like it to communicate trends and patterns. Chances are, too, that it will fail and simply deliver numerical minutiae. You are left hunting around the numbers for a while, only to return to the text in hopes that the table’s data trends will be communicated elsewhere.
Imagine if, instead, the table were replaced by a visual representation that was agnostic to the data domain, sufficiently quantitative to identify patterns and descriptive statistics, and made no assumptions about the kind of patterns that might exist. In this article, I outline one such representation.
Tables are Visual Obstacles
As the saying goes – it’s not the table, it’s you. We are notoriously bad at evaluating quantitative information when it is presented in its raw numerical form. We reach our limit in the ability to glean trends from a table very quickly. Consider the five tables below – the 1×1 table is trivial to interpret and the 5×5 table impossible. Somewhere in between is where you reach numerical overload.
Unfortunately, most published tables are larger than these examples, and because of their size many fail to communicate their information effectively. They provide the numerical minutiae from which visual representations can be generated, but on their own they obscure any patterns that such representations would reveal.
An Uninterpretable Table
Even prestigious journals are not exempt from poorly communicated data. Frequently it is not an issue of poor communication, as much as no communication. The reader is left frustrated, without a sense of what is important in the data and which differences are meaningful.
Consider the table below (Horvath, J. E. et al. Development and application of a phylogenomic toolkit: resolving the evolutionary history of Madagascar’s lemurs. Genome Res 18, 489-99 (2008)), which suffers from two extremes of the same problem: an inappropriate amount of information.
On the left half of the first table there is nearly no information – almost all values are 1.0. The right half of the table, on the other hand, is packed so tightly with numbers that they are visually unparsable. The second table is even worse, suffering not only from information overload but also from poor layout and inconsistent precision (e.g. 7 (4.74-9.24)).
Poorly designed tables can suffer from visual noise (lots of ink, but no information), obscured statistics (descriptive statistics are hidden in numbers), unparsable content (too much information), misguided sightlines (poor row and column spacing), and burden of significance (reported precision is much higher than required for visual inspection). Such tables do not help the reader understand the scale and tolerance inherent in the data. Instead, they leave the reader to fend for themselves in a deluge of numbers.
Visualization of Tabular Data
The method presented here offers an alternative that mitigates the problems outlined above. It is a visual approach that uses Circos [http://mkweb.bcgsc.ca/circos] to represent rows and columns in a circular fashion, and ribbons to represent cell values. Does it solve every table’s problems? No. It does, however, provide a way to capture the essence of the table and present it quantitatively and attractively.
In this approach, relationships between data elements (e.g. a row and a column) are encoded by ribbons that join segments that correspond to these elements.
The ribbons can have different end thicknesses to represent a ratio between the elements. By coloring the ribbons (and/or adding transparency), such as shown below, the representation can focus on the flow of information in a particular direction (e.g. from A (left), or to A (right)).
In practice, a visualization of a table based on this scheme might look like the figure below. Whether to normalize the segments to equal size depends on whether absolute or relative relationships are more important.
Practical Example – Preference for Hair Color in Relationships
To illustrate this visual approach with a small data set, consider how one could visualize dating preference for hair color. You might have information about the relationship history of a large number of individuals and want to visualize the probabilities of transitions between hair colors in successive relationships.
The data might look like this, where each cell represents the number of cases in which someone moved from a partner with one hair color (row) to another (column). For example, 2,868 individuals dated someone with red hair right after someone with black hair.
These data are synthetic (drawn from my own stereotypes) and visually represented in the image below.
Several trends, not immediately discernible from the table, are made clear in the figure. Moreover, given that we can simultaneously process more visual details than numerical ones, this image can communicate many patterns at the same time and therefore enhance both interpretation and retention of information.
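The transition probabilities mentioned at the start of this example can be obtained from a table of counts by dividing each cell by its row total. The snippet below is a minimal sketch of that step. The function name `row_proportions` is my own, and the counts are placeholders chosen for easy arithmetic, not the values from the table above.

```python
def row_proportions(counts):
    """Divide each cell of a nested dict of counts by its row total."""
    result = {}
    for row, cells in counts.items():
        total = sum(cells.values())
        result[row] = {
            col: (value / total if total else 0.0)
            for col, value in cells.items()
        }
    return result


# Placeholder counts (rows = previous partner, columns = next partner).
# These are NOT the article's data; they only illustrate the calculation.
counts = {
    "black": {"black": 10, "red": 30},
    "red": {"black": 20, "red": 20},
}

probs = row_proportions(counts)
print(probs["black"]["red"])  # share of "black" rows that moved to "red"
```

Each row of `probs` sums to 1, so a row can be read as the distribution of next partners given the previous one.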
Practical Example – Reactivity of Chemical Elements in Minerals
The hair color data set was both small and synthetic. Let’s turn to something much more complicated to see how a visual representation can help avoid visual burden.
For this example, I used a database of mineral formulae [http://un2sg4.unige.ch/athena/mineral/minppmi.html] to extract all pairwise element ratios from each mineral. For example, Zabuyelite is Li2CO3 and would therefore contribute +2 (Li,C), +2 (Li,O), +1 (C,Li), +1 (C,O), +3 (O,Li), +3 (O,C). In other words, each ordered pair (X,Y) is credited with the number of X atoms in the formula. The resulting table was a 77 x 77 matrix [http://mkweb.bcgsc.ca/circos/export/mineral-element-ratio-table.txt] of ratios of elements.
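To make the counting rule concrete, here is a minimal Python sketch that reproduces the Zabuyelite contributions. It is not the script used to build the matrix. The function names are hypothetical, and the parser assumes a flat formula such as Li2CO3, without parentheses, hydrates, or fractional subscripts, which real mineral formulae often contain.

```python
import re
from collections import Counter
from itertools import permutations


def parse_formula(formula):
    """Count atoms in a flat formula such as 'Li2CO3'.

    Assumption: no parentheses, hydrate dots, or fractional subscripts.
    """
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


def pair_contributions(formula):
    """Map each ordered pair (x, y) of distinct elements to the count of x."""
    counts = parse_formula(formula)
    return {(x, y): counts[x] for x, y in permutations(counts, 2)}


contrib = pair_contributions("Li2CO3")
for pair, value in contrib.items():
    print(pair, value)
```

Summing these dictionaries over every mineral in the database would give a table of the same shape as the one described above.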
To start, I condensed the table by combining elements of the same classification (e.g. alkali metal, transition, etc). In the table below, the counts are in units of 1,000.
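Condensing amounts to summing every cell whose row and column elements fall into the same pair of classifications. The sketch below illustrates this with the Zabuyelite contributions from the previous paragraph. The `CLASS_OF` lookup is a hypothetical, partial mapping, the class names may differ from those in the table, and keeping pairs within the same class is my assumption.

```python
from collections import defaultdict

# Hypothetical, partial element -> classification lookup (illustration only).
CLASS_OF = {
    "Li": "alkali metal",
    "C": "nonmetal",
    "O": "nonmetal",
}


def condense(pair_counts, class_of):
    """Sum element-pair counts into classification-pair counts."""
    grouped = defaultdict(int)
    for (x, y), value in pair_counts.items():
        grouped[(class_of[x], class_of[y])] += value
    return dict(grouped)


# Contributions of Li2CO3, as listed in the text.
zabuyelite = {
    ("Li", "C"): 2, ("Li", "O"): 2,
    ("C", "Li"): 1, ("C", "O"): 1,
    ("O", "Li"): 3, ("O", "C"): 3,
}

by_class = condense(zabuyelite, CLASS_OF)
print(by_class)
```

Applied to the full 77 x 77 matrix with a complete lookup, the same function would yield a much smaller table with one row and one column per classification.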
The image of the table below presents the trends in the data well. By keeping the segment size for each classification in absolute units, the representation also communicates information about abundance. By using relative tick marks (every 10%) for each segment, however, it is possible to quickly evaluate the extent of each ribbon's contribution to its segments.
By greying out ribbons that provide a minor contribution, and varying the opacity of the remaining ribbons as a function of their percentile rank, major patterns can be accentuated (image below, left). Alternatively, ribbons’ percentile rank can be mapped onto a rainbow color palette (image below, right).
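The percentile-based styling can be sketched in a few lines of Python. This is only an illustration of the idea, not how Circos implements it. The `cutoff` and `min_alpha` parameters and the linear opacity ramp are assumptions of mine, since the thresholds used for the figures are not stated here.

```python
def percentile_ranks(values):
    """Percentile rank (0-100): share of values less than or equal to each value."""
    n = len(values)
    return [100.0 * sum(v <= x for v in values) / n for x in values]


def ribbon_styles(values, cutoff=50.0, min_alpha=0.3):
    """Grey out ribbons ranked below `cutoff`; ramp opacity linearly above it.

    `cutoff` and `min_alpha` are hypothetical settings, not values from the article.
    """
    styles = []
    for rank in percentile_ranks(values):
        if rank < cutoff:
            styles.append(("grey", min_alpha))
        else:
            scale = (rank - cutoff) / (100.0 - cutoff)
            styles.append(("color", min_alpha + (1.0 - min_alpha) * scale))
    return styles


# Four ribbon values chosen only to make the ranks easy to follow.
styles = ribbon_styles([1, 2, 3, 4])
for style in styles:
    print(style)
```

Mapping the rank onto a rainbow palette instead would replace the opacity ramp with a lookup into an ordered list of colors.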
Now what happens when the data for individual elements are drawn? It is no surprise that the result is a very complicated image.
However, even at this level of detail, the image is visually parsable. First, relative sizes of ribbons quickly indicate which segments provide the majority of contribution to the table. The thin ribbons, which correspond to small values in the table, do not distract the eye to the same extent as a sea of small numbers in a table.
Oxygen’s abundance in minerals is reflected in the fact that its segment occupies half of the figure. To explore how oxygen combines with elements as a function of their abundance, the image below shows all segments normalized to equal size (except oxygen, which is shown at 20x) and uses color to focus on pairings between oxygen and other elements.
The manner in which the ribbons transit across the figure, and in places cross, indicates a difference between the order of reactivity and the order of abundance for the elements. For example, look at the ribbon between sulphur (S) and oxygen, indicated by the black arrow. Sulphur is 4th most abundant, but 12th in terms of the number of O atoms that combine with it. Similarly, calcium (Ca) is 7th most abundant but 3rd in terms of reactivity with oxygen (red arrow).
Another treatment of the figure is shown below, with the oxygen segment removed, and the ribbons that correspond to element pairs that have the highest relative affinity (strong preference) for one another shown in color.
While it is possible to apply information design principles to a table to ensure that it communicates its content clearly, sometimes tables are not the best way to present data.
I hope that in this short writeup I have given you ideas that will be useful in your quest to articulate your own data sets.
Martin is a scientist who specializes in bioinformatics at the Genome Sciences Centre in Vancouver. Visit his site for more on Circos and some of Martin’s other data musings.