How to: make a scatterplot with a smooth fitted line
Oftentimes, you’ll want to fit a line to a bunch of data points. This tutorial will show you how to do that quickly and easily using open-source software, R.
Maybe you have observations over time, or maybe you have two variables that are possibly related. In either case, a scatter plot on its own might not be enough to see something useful. A fitted line can let you see a trend or relationship more easily.
As an example, we’ll take a look at monthly unemployment data, from 1948 to February 2010, according to the Bureau of Labor Statistics.
What LOESS is
First, let’s briefly go over what we’re actually doing with this loess thing. LOESS is a form of local regression; the closely related name LOWESS stands for locally weighted scatterplot smoothing. William Cleveland introduced the method in 1979 and developed it further with Susan Devlin in 1988, and it’s a way to fit a curve to a dataset.
If we plot unemployment without any lines or anything fancy, it looks like this:
Dot plot showing unemployment over time
Most of us are familiar with fitting just a plain old straight line. The end result is a slope and an intercept. You know, the whole y = mx + b equation from back in middle school?
Scatterplot with a linear fit, y = mx + b
So without going into the nitty-gritty, the above fit looks at all the data at once and then fits a single line. Loess, however, moves along the dataset and looks at one chunk at a time, fitting a bunch of smaller lines that connect to make one smooth curve.
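To make the difference concrete, here is a minimal sketch that fits both kinds of line to the same dots. The data is synthetic (a made-up wavy series, not the unemployment numbers), and the span value of 0.3 is just an illustrative choice.

```r
# Synthetic data (not the unemployment series): a wavy trend plus noise.
set.seed(1)
x <- 1:200
y <- sin(x / 30) + rnorm(length(x), sd = 0.3)

# Global fit: one slope and one intercept for all of the data.
line_fit <- lm(y ~ x)
coef(line_fit)  # intercept (b) and slope (m)

# Local fit: loess() fits a weighted regression around each point.
# span is the fraction of the data used for each local fit.
curve_fit <- loess(y ~ x, span = 0.3, degree = 1)

plot(x, y, col = "#CCCCCC")
abline(line_fit, lty = 2)        # dashed straight line
lines(x, predict(curve_fit))     # smooth loess curve
```

The straight line can only tilt one way, while the loess curve bends to follow the local ups and downs.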
Alright, enough background. On to the how-to.
Step 0. Download R
You’ve already done this, right? If not, you can download it for Windows, Mac, or Linux. Don’t let the outdated site fool you. You can get a lot done with the free software, and it’ll be a simple one-click install for most.
Step 1. Load the data
Like I said, I got the data from the Bureau of Labor Statistics. You can download it here in CSV format if you like, but we’ll load it directly into R with the following:
unemployment <- read.csv("http://datasets.flowingdata.com/unemployment-rate-1948-2010.csv", sep=",")
You’re basically telling R to load the data from the given URL into the unemployment variable, with columns separated by commas.
Once it’s loaded, take a brief look by typing unemployment[1:10,]. Your screen will look something like this:
As usual, you load your data in R before you start anything else
There are four columns, but we’re actually just going to use the last one, Value.
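If you want a few more ways to poke at a data frame before plotting, here is a small sketch. It uses a tiny made-up data frame so it runs without a network connection; only the Value column name comes from this tutorial, and the other column names and all the numbers are placeholders.

```r
# Stand-in for the real data: made-up rows. Only the Value column name
# comes from the tutorial; the other column names are placeholders.
unemployment_demo <- data.frame(
  Series = rep("demo", 6),
  Year   = rep(1948, 6),
  Period = paste0("M0", 1:6),
  Value  = c(3.4, 3.8, 4.0, 3.9, 3.5, 3.6)
)

head(unemployment_demo, 3)        # first three rows, same idea as [1:3, ]
names(unemployment_demo)          # column names
str(unemployment_demo)            # column types at a glance
summary(unemployment_demo$Value)  # range and quartiles of the column we plot
```

The same calls should work on the real unemployment data frame once it’s loaded.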
Step 2. Time to plot
Yup, it’s already time to make the scatterplot with a fitted curve:
scatter.smooth(x=1:length(unemployment$Value), y=unemployment$Value)
Since we’re only looking at unemployment, the x-axis is just a sequence from 1 to the total number of observations. Here’s what the above line will give you.
Fit a LOESS curve to the dots
Not bad, right? Two lines of code, and you’ve already got your plot. We can do a little better though. Let’s fix it up a bit.
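One thing worth knowing before the cosmetic fixes: how closely the curve hugs the dots is controlled by the span argument of scatter.smooth, which defaults to 2/3. The sketch below compares two spans side by side. It uses a synthetic stand-in series so it runs offline, and the span of 0.2 is an arbitrary pick for illustration.

```r
# Synthetic stand-in series so the sketch runs offline.
set.seed(42)
value <- 5 + 2 * sin((1:300) / 25) + rnorm(300, sd = 0.5)
x <- seq_along(value)  # same as 1:length(value)

# Two panels side by side: a smaller span follows the dots more closely.
op <- par(mfrow = c(1, 2))
scatter.smooth(x, value, span = 2/3, main = "span = 2/3 (default)")
scatter.smooth(x, value, span = 0.2, main = "span = 0.2")
par(op)  # restore the previous plot layout
```

A larger span gives a smoother, flatter curve. A smaller one picks up more of the short-term wiggles, so it’s a judgment call for your data.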
Step 3. Modify axis limits
It’s usually a good idea to start your values axis at zero if you can. The above graph doesn’t start at zero, so let’s fix that using the ylim argument to make it go from 0 to 11.
scatter.smooth(x=1:length(unemployment$Value), y=unemployment$Value, ylim=c(0,11))
Update the axes to start at zero
That’s a little better. Now let’s do something about the color.
Step 4. Modify colors
I want the curve to stand out some more. Everything blends together as it is now. We’ll use the col argument to change the dots to light gray:
scatter.smooth(x=1:length(unemployment$Value), y=unemployment$Value, ylim=c(0,11), col="#CCCCCC")
Make the fitted line the point of interest and put the dots in the background
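If you’d rather style the curve itself in R, recent versions of scatter.smooth take an lpars argument, a list of graphical parameters passed on to the fitted line. This is a sketch on a synthetic stand-in series, and the line color and width are arbitrary choices, not the ones used in the final graphic.

```r
# Synthetic stand-in series so the sketch runs offline.
set.seed(42)
value <- 5 + 2 * sin((1:300) / 25) + rnorm(300, sd = 0.5)

# col styles the dots; lpars styles the fitted curve.
scatter.smooth(x = seq_along(value), y = value,
               ylim = c(0, 11), col = "#CCCCCC",
               lpars = list(col = "#821122", lwd = 3))
```

If your version of R is old enough that lpars isn’t listed in ?scatter.smooth, this call won’t work as written.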
Step 5. Save as PDF and do whatever
So at this point, you can fuss around with arguments to tweak. Just type ?scatter.smooth to read documentation on the function. As many of you know though, I like to take it into Adobe Illustrator at this point. This just happens to be what works for me. There are lots of ways to edit PDF files.
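For the saving part, one common way to get a PDF out of R is to open a pdf() device, draw the plot, and close the device. Here is a minimal sketch. The file name and dimensions are made up, and the data is a synthetic stand-in so the example runs offline.

```r
# Synthetic stand-in series so the sketch runs offline.
set.seed(42)
value <- 5 + 2 * sin((1:300) / 25) + rnorm(300, sd = 0.5)

# Hypothetical output file in a temporary folder; use your own path.
out_file <- file.path(tempdir(), "unemployment-loess.pdf")

pdf(out_file, width = 8, height = 5)  # size in inches
scatter.smooth(x = seq_along(value), y = value,
               ylim = c(0, 11), col = "#CCCCCC")
dev.off()  # close the device so the file is written

file.exists(out_file)
```

The resulting PDF is vector-based, so the dots, curve, and labels stay editable in Illustrator or whatever editor you like.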
Anyways, after some color changes and label cleanup, we’re done.
Title, color, cite, and fonts
Tada. And it only took two lines of code. How about that? Give it a try for yourself, and happy graphing.
For more examples, guidance, and all-around data goodness like this, pre-order Visualize This, the upcoming FlowingData book.