An action plan for data science, a decade ago

Data science has been covered at length during the past couple of years, and we tend to think of it as a field of study just a couple of years older than that. Jeff Hammerbacher and DJ Patil have played roles in further propagating the term as an actual profession in roughly the same timespan. So I was surprised to come across this rarely-cited 2001 paper by statistician William Cleveland, Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics [pdf].

This document describes a plan to enlarge the major areas of technical work of the field of statistics. Because the plan is ambitious and implies substantial change, the altered field will be called “data science.”

For those unfamiliar with the name, Cleveland’s work on graphical perception might ring a bell.

The first time I heard “data science” was in 2007 while reading a proposal that my adviser had passed along, outlining an academic program similar to what we think of as data science. Now that I think of it, the proposal probably had a lot of similarities to the program outlined by Cleveland (which I would have signed up for in a heartbeat).

Cleveland outlines six areas and the percentage of focus for each.

  • Multidisciplinary Investigation (25%) — collaboration with subject areas
  • Models and Methods for Data (20%) — more traditional applied statistics
  • Computing with Data (15%) — hardware, software, and algorithms
  • Pedagogy (15%) — how to teach the subject
  • Tool Evaluation (5%) — keeping track of new tech
  • Theory (20%) — the math behind the data

That sounds like what we associate with data science, but current practitioners focus more on tools and less on pedagogy and theory. That’s not to say that data scientists today couldn’t benefit from having more traditional statistical knowledge under their belts. The same goes for statisticians learning more about how to use the tools available (and how to make the tools themselves).

Cleveland’s overall theme is a melding of various fields that obviously fit well together.

Computer scientists, waking up to the value of the information stored, processed, and transmitted by today’s computing environments, have attempted to fill the void. One current of work is data mining. But the benefit to the data analyst has been limited, because the knowledge among computer scientists about how to think of and approach the analysis of data is limited, just as the knowledge of computing environments by statisticians is limited. A merger of the knowledge bases would produce a powerful force for innovation.

Sounds familiar. Of course, John Tukey seemed to have the right idea in the 1970s. In any case, it was refreshing to find this from a statistician, rather than the canned stat scoff that data science is just statistics.

[William Cleveland via @drewconway]


  • Amid the cool and fun visuals you keep posting (mind you, no one’s complaining about that), this kind of perspective consolidating stuff is refreshing. And valuable. Thanks.

  • Great commentary! I look at that %age breakdown and I see a very good model for a (bio)informatics training programme. The programmes I have been involved in previously lack, as you point out, the pedagogy and theory components – particularly w.r.t. the information theory underlying the Web technologies themselves. I’m going to grab that paper and have a good think about it! Thanks for bringing it to our attention.

  • Nathan, thank you!
    One question. Maybe you know some great online courses for the first two areas — Multidisciplinary Investigation and Models and Methods for Data?