Can You Improve this Mediocre Statistical Graphic?

I’m on my way back home from the workshop Integrating Computing into the Statistics Curricula in Berkeley (and this time I managed to get through the line without getting yelled at). One of the labs included an assignment called Deconstruct-Reconstruct, which was a great way to learn how to improve statistical graphics. Basically, we picked apart (deconstructed) a graphic from Swivel and then created a better version (reconstructed).

Your Mission, If You Choose to Accept it…

As I was making my own version, I thought to myself, “I bet FlowingData readers would do really well with this exercise.” Let’s see if I’m right. Can you deconstruct-reconstruct the above graphic? Here are questions worth considering:

  • What is the graphic (trying) to show?
  • Does the graphic achieve its goal?
  • Are there other data that could make the plot more informative?
  • How can we improve the bar chart?

I’ll post my version a little later… This post will self-destruct in ten seconds…

43 Comments

  • OK, maybe I’m not very familiar with the Deconstruct-Reconstruct term, but I’ll give it a try. I think a vertical axis showing the relative frequencies, plus a label above each bar giving both the absolute and relative frequency, would be a first step toward improvement.

    The labels of the horizontal axis should be in the middle of each pair.

    Is the data continuous? If so we need a histogram.
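A minimal sketch of the suggestion above (relative frequencies on the vertical axis, with each bar labelled by its absolute and relative frequency). It assumes the county counts transcribed in the R code posted elsewhere in this thread, whose year ordering is disputed in the comments. The object names `counts`, `rel`, `mids` and `bar_labels` are illustrative only.

```r
# County counts as transcribed in the R code elsewhere in this thread
# (assumed values; the thread notes the original chart's years may be switched).
years  <- c(1992, 1996, 2000, 2004, 2008)
counts <- rbind(Democratic = c(23, 21, 29, 37, 43),
                Republican = c(35, 37, 29, 21, 15))
colnames(counts) <- years

# Relative frequency within each year, in percent of that year's column total.
rel <- sweep(counts, 2, colSums(counts), "/") * 100

# Clustered bars; barplot() returns the x midpoint of every bar.
mids <- barplot(rel, beside = TRUE, col = c("blue", "red"),
                ylim = c(0, 100), xlab = "Year",
                ylab = "Percent of CA counties",
                legend.text = rownames(rel),
                args.legend = list(x = "top", horiz = TRUE, bty = "n"))

# Label each bar with the absolute count and the relative frequency.
bar_labels <- sprintf("%d (%.0f%%)", as.vector(counts), as.vector(rel))
text(as.vector(mids), as.vector(rel) + 3, labels = bar_labels, cex = 0.7)
```

Because the bars are clustered by year, each pair sits over its own year label, which also covers the point about centring the horizontal-axis labels under each pair.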

  • Nathan, the years are switched around. The readers should check the comment in the original chart to get the link to the right data.

    So much work…

    Readers should also look for inspiration in the original PDF data source (I’ll say no more).

  • Hi,

    OK, this is lame… I am just learning the ABCs of visualisation, so give me some leeway.

    I don’t know why it all adds up to 58. I don’t know which are the Dems and which are the Reps (but I can guess).

    Didn’t know how to send this, hence posted it on my blog here: http://surff.blogspot.com/2008/07/data-visualisation.html

    Well… it remains mediocre anyways…:)

  • * What is the graphic (trying) to show?
    ** It appeared to be a descriptive statistic showing voter registration by county, as well as some “recent” data to give the reader some idea about the trend. Perhaps county data is for some reason more germane to election results than voter registration?
    * Does the graphic achieve its goal?
    ** I’m a big fan of data with a time component being shown as lines or functions (I’d use a light smooth like loess if we had more data; maybe an FDA analysis with cubic splines if we had a longer time series [for the bling factor]); but if I miss the goal, then what good is my analysis?
    * Are there other data that could make the plot more informative?
    ** Are you a democrat or a republican?
    ** If you’re not, then you’re not shown on this plot.
    ** The values should sum to 100%, and this should be reflected.
    * How can we improve the bar chart?
    ** Ditch the bars.
    ** I decided to give it two tries. If I wanted to do it for real, I would have put a gradient (faded left, full color right to show time)

    ## Some S code for R
    # Counties with a Democratic (demc) or Republican (repb) registration majority
    ca.cty <- cbind(year = c(1992, 1996, 2000, 2004, 2008),
                    demc = c(23, 21, 29, 37, 43),
                    repb = c(35, 37, 29, 21, 15))
    rowSums(ca.cty[, c("demc", "repb")])  # data check: every year should total 58 counties

    plot(ca.cty[, "year"], ca.cty[, "demc"] / 58 * 100, lty = 1, col = "blue", lwd = 2, type = "o",
         ylim = c(0, 100), xlab = "Year", ylab = "Percent CA Counties", pch = 18,
         main = "CA Political Party Registration Aggregated by County")
    lines(ca.cty[, "year"], ca.cty[, "repb"] / 58 * 100, lty = 1, col = "red", lwd = 2, type = "o", pch = 18)
    # base-graphics equivalent of gplots' smartlegend("left", "top", ...)
    legend("topleft", c("Democratic Majority", "Republican Majority"),
           lty = 1, lwd = 2, pch = 18, col = c("blue", "red"))

    ## Second Go.
    plot(0, 0, type = "n", xlim = c(1992, 2008), ylim = c(0, 100), xlab = "Year", ylab = "Percent CA Counties",
         main = "CA Political Party Registration Aggregated by County")
    demc.poly <- list()
    demc.poly$x <- c(ca.cty[, "year"], rev(ca.cty[, "year"]))
    demc.poly$y <- c(ca.cty[, "demc"] * 100 / 58, rep(0, nrow(ca.cty)))
    polygon(demc.poly$x, demc.poly$y, col = "blue")  # Democratic share, filled from 0 up to the line
    polygon(demc.poly$x, c(ca.cty[, "demc"] * 100 / 58, rep(100, nrow(ca.cty))), col = "red")  # remainder up to 100

    Let me know if you have any thoughts/improvements, or if I missed the point completely.

  • Christ people. Learn to read. Y-axis title is “Majority of democracts and majority of republicans”. Chart title is “…majority party by county”. That adds up to a chart showing the number of counties that have a republican or democrat majority (in whatever electoral body californian counties have).

    This is not a mediocre graph. It sucks, and badly. Granted, I had to take a peek at the original data+chart to see the extent of the failure… The problems, I think, are: it doesn’t explicitly show the proportion that interests us; the x axis has years that are not in the data set (implying a continuous distribution when none exists); and, possibly worst of all, it has real sucky axis and title descriptions. Fix those and it’d be a bit more decent… To top it off, color the bars red & blue. I think you all know which one’s red, which one’s blue.

    Better yet. Do “the Tufte” and just show the table :)

  • ah, but as jorge pointed out, there’s more data to complete this story that makes graphing worth it.

  • The data points are measured discretely every four years, so a line chart is less imperative than for most time series. If a line chart is used, it should also have markers for the points to emphasize their discrete nature. The lines should be straight, not smoothed.

    The horizontal axis must be redrawn to accurately label the years where the measurements were made.

    I don’t think it’s too tricky, I’ll work out something and post it on my blog.
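The points in this comment (markers on every measured value, straight unsmoothed segments, and a horizontal axis labelled only at the measured years) can be sketched in base R as follows. This is an illustration, not the commenter's own chart. The county counts are assumed from the R code posted elsewhere in this thread, and the name `pct` is hypothetical.

```r
# Assumed data: county counts from the R code elsewhere in this thread,
# converted to percent of California's 58 counties.
years <- c(1992, 1996, 2000, 2004, 2008)
pct   <- cbind(Democratic = c(23, 21, 29, 37, 43),
               Republican = c(35, 37, 29, 21, 15)) / 58 * 100

# type = "b" draws a marker at each measurement joined by straight segments;
# xaxt = "n" suppresses the default axis so no unmeasured years are labelled.
matplot(years, pct, type = "b", pch = 19, lty = 1, lwd = 2,
        col = c("blue", "red"), ylim = c(0, 100), xaxt = "n",
        xlab = "Year", ylab = "Percent of CA counties")

# Tick marks only at the years that were actually measured.
axis(1, at = years)
legend("topleft", legend = colnames(pct), col = c("blue", "red"),
       pch = 19, lty = 1, lwd = 2, bty = "n")
```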

  • Brijesh and Blair – very nice. blair’s second go looks a lot like brijesh’s chart. are we missing anything data-wise? there are uh, others besides dems and reps, i think.

  • Here’s my attempt:

    http://d-randommusings.blogspot.com/

    Please enjoy.

  • Amarkos brought up a point in his first comment that I’d like to address. The actual data is 1992 – 2008 every 4 years. Continuous in the colloquial sense means unbroken. Continually means something that occurs repeatedly over time. The years chosen are every 4 years, but does the data get reported every year? every 2 years? every county election? every 4 years?

    Jake and Amos take a cumulative approach in their construction which I would as well. One charts the graph as continuous and the other as continual.

    I think what many people have said implicitly or directly needs hammering home. The ‘decline to state’ category has the greatest % change from 9.5%(1992) to 19.3%(2008). Without including that data (and all the data), the graph is myopic at best.

    There are so many other problems with this graph that I’m going to use it in my stats class next year, but the one that gets me the most…

    Two-tone green? Outta here!
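For the ‘decline to state’ figures quoted in this comment, a quick worked calculation separates the change in percentage points from the relative change. Only the two shares come from the comment; the variable names are illustrative.

```r
share_1992 <- 9.5    # 'decline to state' share in 1992, percent (from the comment)
share_2008 <- 19.3   # 'decline to state' share in 2008, percent (from the comment)

point_change    <- share_2008 - share_1992          # change in percentage points
relative_change <- point_change / share_1992 * 100  # change relative to the 1992 share

cat(sprintf("%.1f points, %.0f%% relative\n", point_change, relative_change))
```

The share rises by 9.8 percentage points, which roughly doubles the 1992 value.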

  • Stack Lee July 18, 2008 at 3:52 pm

    Mr. Peltier beat me to it… “Keep it Simple”

  • I won’t do my own, but I vote for Jake’s submission. Two points as to conceptual relevance of the original graph: (1) Any graph that doesn’t also include “Independents” and “Other” omits an important shift over the years; (2) The breakdown by county is probably mostly of interest to various county Dem/Rep central committee members. Since most Congressional Districts (and hence Electoral College votes) and State Senate/House districts span multiple counties, I would be a lot more interested in seeing the breakdown by those subdivisions if I were trying to discern the impact of changing registration on actual elections.

  • Conceptually, the data are continuous. You can register to vote, or change your party, at any time*. The numbers could be reported every day, or even every hour. That’s why I went with the line graph rather than the bars.

    *During Board of Elections business hours. Also, in NY, where I live and vote, if you change your party too close before a primary, it won’t take effect until afterward. I don’t know CA that well. But this doesn’t change my point: the numbers could be tabulated at any time.

  • While I like the appearance of the continuous chart better, I’m going to have to disagree and say it should be continual. The data given is from specific points in time (the year of the primary election) and no information is given for those 3 years in between (provided every 4th), which is quite a bit of time…

  • The data are not continuous (I disagree with Amos on this point). There is data only during each presidential primary season (every 4 years).

    I didn’t care for the two area chart submissions for this reason. These charts imply a continuous relationship when what we have is intermittent data. The line charts with markers hint that the lines are interpolating between actual measured data. I also think the stacked nature of the area charts made it difficult to compare between different series.

    I didn’t care for the stacked columns. The left hand one (statewide registration by party) clearly showed the democratic percentage drop over time, but not the republican drop, which was equivalent. On a line chart (see the end of my post), the two parties are represented by parallel lines. I also couldn’t easily compare the two main parties.

    The right hand stacked bar chart didn’t show the information as clearly as a line chart or a clustered column chart.

    To me the sparkline column chart “minimized” the message of the chart (pun intended), seemingly removing the context.

    I liked the simple line charts (Stack’s political icons were a nice touch!) and clustered column charts. I know clustered column charts are in general disfavor, but for depicting two series at regular intervals they are quite suitable. My favorite among the county comparison charts may be David’s balancing bar chart.

  • Jake, I’ll agree to disagree. By your logic, one would virtually never use line charts for social science data (well, except for financial market data). And yet I find them much easier on the eye for judging trends than bar charts. I feel like my eye has to fight with a bar chart to extract information.

  • it could go either way on continuity – depending on your definition. if you had an observation once per second, you wouldn’t know what happened in between each second.

  • @Tyler Lang: Really great work — I love it, except I think you have reversed Dem and Rep?

  • Either way fun stuff. Nathan… you should do this more often.

  • Jon, the sparklines just tell a different story. It does minimize the chart footprint, but do we need a large chart to display these eight data points?

  • Yeah, sorry, I unknowingly reversed the DEM/REP.

  • Jake – Nice. I like how Unregistered has become the greatest affiliation.

  • Jorge –

    I’m not saying we need a large chart. My point is that I think this particular sparkline was as simple as possible, and then simpler. This data needs either a value axis or some labeling to describe the magnitude of the bars. Also, the sparkline shows a difference, which makes it harder to interpret at a glance. This data set really has two series; assuming that one series is merely the absence of the other leads to oversimplification.

    Here’s a sparkline that works better for me:
    http://peltiertech.com/WordPress/wp-content/img200807/CAsparkline.png

  • Nate – why am I not surprised you’re a graphic designer :)

  • Amos’ was the one that best gave me a clue as to the situation described by the data – I saw that and went ‘aha!’ But Nate’s gets points for being so purdy!