Visualizing data that looks like it came straight out of Statistics 101 text book is nice and all — for teaching and learning purposes. You gotta learn to stand before you can run a marathon. Once you’re ready for the real data though, which is fuzzier and more irregular, you run into data points that don’t quite fit in with the rest. The outliers.
There are various ways to incorporate outliers into your visualization, but you have to understand them first.
Why is the outlier there in the first place? Maybe it’s a recording error or a kink in methodology. For example, PornHub claimed that a disproportionate percentage of traffic came from Kansas. However, location was based on IP addresses, and any locations that could not be identified defaulted to the center of the country. That spot was in Kansas.
Sometimes outliers might be an exception or something extraordinary. We see this in sports a lot, like when Stephen Curry broke the single-season three-point record. Or when Usain Bolt ran faster than everyone.
In one case, the outlier is noise relative to the rest of the data. In another the outliers deserve a closer look.
With your own data, figure out which is which first. Then decide if the outlier belongs in the background or foreground. The visualization options below will be much more useful.
Point of Focus
Focus on the outlier directly and show how it stands out from the rest. Visually, differences outweigh similarities.
The outlier — or as my mom would say, the thing that sticks out like a sore thumb — draws attention away from the averages. Instead, the reader’s eyes head straight to a single point.
Showing an outlier on the same scale can overly obscure the rest of the data. Maybe a distribution gets squished into a few bins or a scatter plot shows most of the data squished into a corner. If you don’t want to highlight an outlier, try a different visualization route.
Outliers are about major differences from the norm, so they tend to come up on the news a lot. For example, The Upshot highlighted gun homicides per day in the United States, as compared to other Western democracies:
Visualize the data as you normally would for an overview, and then zoom in or highlight outliers to explain.
You can get a sense of the overall distribution of the data instead of immediately focusing on what doesn’t belong.
The outliers might end up in obscurity or overlooked. If you present the data, it’s your job to draw attention to outliers if they’re not obvious.
When I mapped commute times by county, some tended to start the day a lot earlier or later than the rest. So I mapped the country first, and then zoomed in what stood out.
Sometimes outliers are viewed better on a different scale that allows for extremes and averages to display at the same time. For example, a logarithmic scale is often useful with large counts. Or, with a color scale, the outlier might fall into the last bin by default.
You can show the full dataset without obscuring too much, if anything at all. An outlier on one scale might be normal on another, so it’s worth trying.
Fuss around too much, and what was a pro might end up a con. You don’t want to visualize an outlier as average if it’s an outlier.
I can also be useful to let users switch between scales to see how the result changes. With the occupation matchmaker, I looked for how those with one job were more likely to be married to others. On an absolute scale, the occupations with the a lot of workers (such as cashiers) always stand out. On a relative, the specificity is unique to each occupation.
See also: Using a logarithmic scale to show emergency room visits.
Use the outlier as a point of comparison for a sense of scale or to make the data more relatable.
Outliers are often really large or really small, so the scale can sometimes get lost in the mix. By using the outlier as a reference point against something familiar, the data also becomes more familiar.
This route highlights differences between the outlier and the other data points. Be careful not to lose the overall distribution in the process.
In the Diary of a Food Tracker, the story focuses on the weight loss of an individual over three years. But as many weight loss stories go, readers can easily relate to the data. At the end of the scrolly-telling, the individual’s time series is plotted against a sample of others’ experience with food tracking.
Maybe you don’t want to highlight the outlier. Maybe it’s not as important as the rest of the dataset. In this case, use it as context or background.
The patterns in the full dataset don’t get lost in scale adjustments, which can make for easier reading.
The outlier could become a side thought or ends up too far in the background that it is forgotten. Use your best judgement.
In this barebones chart that shows age distribution for Olympic sports, there are obvious outliers with really young or old competitors. But the main point is the general differences across sports.
The general theme here is about purpose. Figure out what aspect of the outlier you want to show, and then choose your visualization accordingly. Your design choice should stem from the meaning behind the outlier. Is it statistical noise? Is it worth a deeper look? Once you figure this out, the visualization of the outlier is much easier.
Become a member. Learn to visualize your data. Gain instant access to tutorials.Join Today
This is for people who want to learn to make and design data graphics. Your support goes directly to FlowingData, an independently run site.
What You Get
- Instant access to tutorials on how to make and design data graphics
- Source code and files to use with your own data
- In-depth courses on visualization in R
- Hand-picked links and resources from around the web
Think Like a Statistician – Without the Math
I call myself a statistician, because, well, I’m a statistics graduate student. However, the most important things I’ve learned are less formal, but have proven extremely useful when working/playing with data.
Visualizing the Uncertainty in Data
Data is an abstraction, and it’s impossible to encapsulate everything it represents in real life. So there is uncertainty. Here are ways to visualize the uncertainty.