How to: make a scatterplot with a smooth fitted line
Oftentimes, you’ll want to fit a line to a bunch of data points. This tutorial will show you how to do that quickly and easily using R, which is free and open source.
Maybe you have observations over time, or maybe you have two variables that might be related. In either case, a scatterplot on its own might not be enough to see something useful. A fitted line can let you see a trend or relationship more easily.
As an example, we’ll take a look at monthly unemployment data, from 1948 to February 2010, according to the Bureau of Labor Statistics.
What LOESS is
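First, let’s briefly go over what we’re actually doing with this loess thing. LOESS is a form of local regression, and the name is often expanded as locally estimated scatterplot smoothing (its close relative, LOWESS, stands for locally weighted scatterplot smoothing). William Cleveland introduced the method in 1979, and he and Susan Devlin developed it further in 1988. It’s a way to fit a smooth curve to a dataset.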
If we plot unemployment without any lines or anything fancy, it looks like this:
Dot plot showing unemployment over time
Most of us are familiar with fitting just a plain old straight line. The end result is a slope and an intercept. Remember the whole y = mx + b equation from back in middle school?
Scatterplot with a linear fit, y = mx + b
So without going into the nitty-gritty, the above fit looks at all the data at once and then fits a single line. Loess, however, moves along the dataset and looks at one chunk at a time, fitting a bunch of smaller lines that connect to make one smooth curve.
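To make the difference concrete, here is a minimal sketch on made-up data. The wavy series and the variable names are assumptions for illustration only, not the unemployment data. lm() fits one straight line to everything, while loess() runs many local regressions and joins the results into a curve.

```r
# Made-up data: a wavy signal plus noise (illustration only)
set.seed(1)
x <- 1:200
y <- sin(x / 30) + rnorm(200, sd = 0.3)

# One global straight line: a single intercept and a single slope
linear_fit <- lm(y ~ x)
coef(linear_fit)

# LOESS: local regressions over neighborhoods of points.
# span is the fraction of the data used for each local fit.
loess_fit <- loess(y ~ x, span = 0.5)

# Draw the points, then overlay both fits
plot(x, y, col = "#CCCCCC")
abline(linear_fit, lty = 2)        # dashed straight line
lines(x, predict(loess_fit))       # smooth loess curve
```

The straight line can only tilt one way, so it misses the rise and fall that the loess curve follows.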
Alright, enough background. On to the how-to.
Step 0. Download R
You’ve already done this, right? If not, you can download it for Windows, Mac, or Linux. Don’t let the outdated site fool you. You can get a lot done with the free software, and it’ll be a simple one-click install for most.
Step 1. Load the data
Like I said, I got the data from the Bureau of Labor Statistics. You can download it here in CSV format if you like, but we’ll load it directly into R with the following:
unemployment <- read.csv("http://datasets.flowingdata.com/unemployment-rate-1948-2010.csv", sep=",")
You’re basically telling R to load the data from the given URL into the unemployment variable, and that the columns are separated by commas.
Once it’s loaded, take a brief look at the first ten rows by typing unemployment[1:10,]. Your screen will look something like this:
As usual, you load your data in R before you start anything else
There are four columns, but we’re actually just going to use the last one, Value.
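If you want a few more ways to peek at a data frame, here is a small sketch. It uses a stand-in data frame with made-up numbers so that it runs without a download. Only the column name Value comes from this tutorial, and Period is a hypothetical label, so the real file will look different.

```r
# Stand-in data frame with made-up numbers. Only the column name
# Value comes from this tutorial; Period is a hypothetical label.
unemployment <- data.frame(
  Period = paste0("M", 1:24),
  Value  = round(seq(4, 6, length.out = 24), 1)
)

head(unemployment, 10)       # first ten rows, like unemployment[1:10, ]
str(unemployment)            # column names and types
summary(unemployment$Value)  # min, quartiles, mean, max
nrow(unemployment)           # number of observations
```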
Step 2. Time to plot
Yup, it’s already time to make the scatterplot with the fitted curve:
scatter.smooth(x=1:length(unemployment$Value), y=unemployment$Value)
Since we’re only looking at unemployment, the x-axis is just a sequence from 1 to the total number of observations. Here’s what the above line will give you.
Fit a LOESS curve to the dots
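How smooth the curve looks depends mostly on the span argument of scatter.smooth, which defaults to 2/3. The sketch below is an aside on made-up data (the random-walk series is an assumption standing in for unemployment$Value). It also uses seq_along() to build the 1-to-n index for the x-axis.

```r
# Made-up monthly series standing in for unemployment$Value
set.seed(42)
value <- 5 + cumsum(rnorm(120, sd = 0.2))

# seq_along(value) gives the same index as 1:length(value),
# and it also behaves sensibly if the vector is empty
x <- seq_along(value)

# Smaller span = each local fit uses fewer points = wigglier curve
par(mfrow = c(1, 2))
scatter.smooth(x, value, span = 2/3, main = "span = 2/3 (default)")
scatter.smooth(x, value, span = 0.2, main = "span = 0.2")
par(mfrow = c(1, 1))
```

A large span gives you the big-picture trend, and a small span follows the short-term bumps. Which one you want depends on the story you’re after.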
Not bad, right? Two lines of code, and you’ve already got your plot. We can do a little better though. Let’s fix it up a bit.
Step 3. Modify axis limits
It’s usually a good idea to start your value axis at zero if you can. The above graph doesn’t start at zero, so let’s fix that using the ylim argument to make it go from 0 to 11.
scatter.smooth(x=1:length(unemployment$Value), y=unemployment$Value, ylim=c(0,11))
Update the axes to start at zero
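If you’d rather not hard-code the upper limit, one option is to compute it from the data. This is a sketch under assumptions: the series is made up, and rounding the maximum up to the next whole number is just one reasonable choice.

```r
# Made-up series standing in for unemployment$Value
set.seed(7)
value <- 6 + 3 * sin(seq(0, 6, length.out = 100)) + rnorm(100, sd = 0.3)

# Start at zero and round the largest value up to a whole number
upper <- ceiling(max(value, na.rm = TRUE))

scatter.smooth(x = seq_along(value), y = value, ylim = c(0, upper))
```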
That’s a little better. Now let’s do something about the color.
Step 4. Modify colors
I want the curve to stand out some more. Everything blends together as it is now. We’ll use the col argument to change the dots to light gray:
scatter.smooth(x=1:length(unemployment$Value), y=unemployment$Value, ylim=c(0,11), col="#CCCCCC")
Make the fitted line the point of interest and put the dots in the background
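You can also style the curve itself from within R. In reasonably recent versions of R, scatter.smooth takes an lpars argument, a list of settings passed on to the call that draws the line. The sketch below uses made-up data, and the dark red color and line width are arbitrary picks, not what was used for the final graphic.

```r
# Made-up series standing in for unemployment$Value
set.seed(7)
value <- 6 + 3 * sin(seq(0, 6, length.out = 100)) + rnorm(100, sd = 0.3)

# col and pch style the dots; lpars styles the fitted curve
scatter.smooth(
  x = seq_along(value), y = value,
  ylim = c(0, 11),
  col = "#CCCCCC", pch = 19,               # light gray, filled dots
  lpars = list(col = "#821122", lwd = 3)   # thicker dark red curve
)
```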
Step 5. Save as PDF and do whatever
So at this point, you can fuss around with arguments to tweak. Just type ?scatter.smooth to read documentation on the function. As many of you know though, I like to take it into Adobe Illustrator at this point. This just happens to be what works for me. There are lots of ways to edit PDF files.
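To get the PDF in the first place, one way is to open a PDF device with pdf(), draw the plot, and close the device with dev.off(). Here is a minimal sketch. The data is made up, and the file name and page size are assumptions, so swap in your own.

```r
# Made-up series standing in for unemployment$Value
set.seed(7)
value <- 6 + 3 * sin(seq(0, 6, length.out = 100)) + rnorm(100, sd = 0.3)

# Hypothetical output path in a temporary folder; change as needed
out_file <- file.path(tempdir(), "unemployment-loess.pdf")

pdf(out_file, width = 8, height = 5)   # size in inches
scatter.smooth(x = seq_along(value), y = value,
               ylim = c(0, 11), col = "#CCCCCC")
dev.off()                              # finish writing the file

out_file                               # where the PDF ended up
```

Nothing shows up on screen while the PDF device is open. The plot goes straight to the file, and it is only complete after dev.off().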
Anyways, after some color changes and label cleanup, we’re done.
Title, color, cite, and fonts
Tada. And it only took two lines of code. How about that? Give it a try for yourself, and happy graphing.
For more examples, guidance, and all-around data goodness like this, pre-order Visualize This, the upcoming FlowingData book.