An action plan for data science, a decade ago

Posted to Statistics  |  Nathan Yau

Data science has been covered at length during the past couple of years, and we tend to think of it as a field of study just a couple of years older than that. Jeff Hammerbacher and DJ Patil have helped spread the term as the name of an actual profession in roughly the same timespan. So I was surprised to come across this rarely cited 2001 paper by statistician William Cleveland, Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics [pdf].

This document describes a plan to enlarge the major areas of technical work of the field of statistics. Because the plan is ambitious and implies substantial change, the altered field will be called “data science.”

For those unfamiliar with the name, Cleveland’s work on graphical perception might ring a bell.

The first time I heard “data science” was in 2007 while reading a proposal that my adviser had passed along, outlining an academic program similar to what we think of as data science. Now that I think of it, the proposal probably had a lot of similarities to the program outlined by Cleveland (which I would have signed up for in a heartbeat).

Cleveland outlines six areas and the percentage of focus for each.

  • Multidisciplinary Investigation (25%) — collaboration with subject areas
  • Models and Methods for Data (20%) — more traditional applied statistics
  • Computing with Data (15%) — hardware, software, and algorithms
  • Pedagogy (15%) — how to teach the subject
  • Tool Evaluation (5%) — keeping track of new tech
  • Theory (20%) — the math behind the data

That sounds like what we associate with data science, but current practitioners focus more on tools and less on pedagogy and theory. That’s not to say that data scientists today couldn’t benefit from having more traditional statistical knowledge under their belts. And the same goes for statisticians learning more about how to use the tools available (and how to make the tools themselves).

Cleveland’s overall theme is a melding of various fields that obviously fit well together.

Computer scientists, waking up to the value of the information stored, processed, and transmitted by today’s computing environments, have attempted to fill the void. One current of work is data mining. But the benefit to the data analyst has been limited, because the knowledge among computer scientists about how to think of and approach the analysis of data is limited, just as the knowledge of computing environments by statisticians is limited. A merger of the knowledge bases would produce a powerful force for innovation.

Sounds familiar. Of course, John Tukey seemed to have the right idea in the 1970s. In any case, it was refreshing to find this from a statistician, rather than the canned stat scoff that data science is just statistics.

[William Cleveland via @drewconway]


  • Amid the cool and fun visuals you keep posting (mind you, no one’s complaining about that), this kind of perspective consolidating stuff is refreshing. And valuable. Thanks.

  • Great commentary! I look at that %age breakdown and I see a very good model for a (bio)informatics training programme. The programmes I have been involved in previously lack, as you point out, the pedagogy and theory components – particularly w.r.t. the information theory underlying the Web technologies themselves. I’m going to grab that paper and have a good think about it! Thanks for bringing it to our attention.

  • Nathan, thank you!
    One question. Maybe you know of some great online courses for the first two areas, Multidisciplinary Investigation and Models and Methods for Data?

