Rise of the Data Scientist

June 4, 2009  |  Design, Statistics

Photo by majamarko

As we've all read by now, Google's chief economist Hal Varian commented in January that the next sexy job in the next 10 years would be statisticians. Obviously, I whole-heartedly agree. Heck, I'd go a step further and say they're sexy now - mentally and physically.

However, if you went on to read the rest of Varian's interview, you'd know that he actually meant "statistician" as a general title for someone who is able to extract information from large datasets and then present something of use to non-data experts.

Sexy Skills of Data Geeks

As a follow-up to Varian's now-popular quote among data fans, Michael Driscoll of Dataspora discusses the three sexy skills of data geeks. I won't rehash the post, but here are the three skills that Michael highlights:

  1. Statistics - traditional analysis you're used to thinking about
  2. Data Munging - parsing, scraping, and formatting data
  3. Visualization - graphs, tools, etc.

Oh, but there's more...

These skills actually fit tightly with Ben Fry's dissertation on Computational Information Design (2004). However, Fry takes it a step further and argues for an entirely new field that combines the skills and talents from often disjoint areas of expertise:

  1. Computer Science - acquire and parse data
  2. Mathematics, Statistics, & Data Mining - filter and mine
  3. Graphic Design - represent and refine
  4. Infovis and Human-Computer Interaction (HCI) - interaction
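
For the programmers in the audience, here's a rough sketch of how the first three of those stages might chain together in Python. It's only an illustration under my own assumptions, not anything from Fry's dissertation: the inline data, the function names, and the text bar chart are all made up, and the interaction stage is left out entirely.

```python
import csv
import io
import statistics

# Hypothetical inline data standing in for a downloaded file or API response.
RAW = """city,year,sales
Austin,2008,120
Austin,2009,150
Boston,2008,
Boston,2009,90
"""

def acquire():
    # Acquire: in practice a download or database query; here just a string.
    return io.StringIO(RAW)

def parse(handle):
    # Parse: turn raw text into structured rows (one dict per line).
    return list(csv.DictReader(handle))

def filter_rows(rows):
    # Filter: drop rows with a missing sales value and convert the rest to int.
    return [{"city": r["city"], "sales": int(r["sales"])} for r in rows if r["sales"]]

def mine(rows):
    # Mine: a very simple summary, the mean sales per city.
    by_city = {}
    for r in rows:
        by_city.setdefault(r["city"], []).append(r["sales"])
    return {city: statistics.mean(vals) for city, vals in by_city.items()}

def represent(summary, width=30):
    # Represent: a plain-text bar chart scaled to the largest value.
    top = max(summary.values())
    return [
        f"{city:<8}{'#' * round(width * value / top)} {value:g}"
        for city, value in sorted(summary.items())
    ]

summary = mine(filter_rows(parse(acquire())))
chart = represent(summary)
print("\n".join(chart))
```

Each function maps to one of the verbs in the list, so swapping the toy bar chart for a real graphic, or the inline string for a real data source, only touches one stage.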

And after two years of highlighting visualization on FlowingData, it seems collaborations between the fields are growing more common, but more importantly, computational information design edges closer to reality. We're seeing data scientists - people who can do it all - emerge from the rest of the pack.

Advantages of the Data Scientist

Think about all the visualization stuff you've been most impressed with or the groups that always seem to put out the best work. Martin Wattenberg. Stamen Design. Jonathan Harris. Golan Levin. Sep Kamvar. Why is their work always of such high quality? Because they're not just students of computer science, math, statistics, or graphic design.

They have a combination of skills that not only makes independent work easier and quicker but also makes collaboration more exciting and opens up possibilities in what can be done. Oftentimes, visualization projects are disjoint processes and involve a lot of waiting. Maybe a statistician is waiting for data from a computer scientist, or a graphic designer is waiting for results from an analyst, or an HCI specialist is waiting for layouts from a graphic designer.

Let's say you have several data scientists working together, though. There's going to be less waiting, and the communication gaps between the fields narrow.

How often have we seen a visualization tool that held an excellent concept and looked great on paper but lacked the touch of HCI, which made it hard to use and in turn no one gave it a chance? How many important (and interesting) analyses have we missed because certain ideas could not be communicated clearly? The data scientist can solve your troubles.

An Application

This need for data scientists is quite evident in business applications where educated decisions need to be made swiftly. A delayed decision could mean lost opportunity and profit. Terabytes of data are coming in, whether from websites or from sales across the country. But in an area where Excel is the tool of choice (or force), there are limitations, hence all the tools, applications, and consultancies to help out. This of course applies to areas outside of business as well.

Learn and Prosper

Even if you're not into visualization, you're going to need at least a subset of the skills that Fry highlights if you want to seriously mess with data. Statisticians should know APIs, databases, and how to scrape data; designers should learn to do things programmatically; and computer scientists should know how to analyze and find meaning in data.

Basically, the more you learn, the more you can do, and the higher in demand you will be as the amount of data grows and the more people want to make use of it.

47 Comments

  • i’m curious about how you chose the term “data scientist” to describe this role. that’s precisely the title we used for folks on the data team at facebook, chosen somewhat arbitrarily as a contraction of “data analyst” and “research scientist”, with the same skills in mind as you mention above. i also titled my chapter for “beautiful data” “information platforms and the rise of the data scientist”. quite amazing if you formulated the phrase separately! something in the air…

    • @jeff – ha, yes, there must be something in the air. i read a grant proposal two or three years ago pushing for a new area of study called “data science.” it’s stuck with me ever since.

  • Nathan – Nice synthesis and thanks for the shout-out. Ben Fry’s model captures the various fields that comprise this interdisciplinary ‘data science’ quite elegantly. I’d even venture to add some bidirectional arrows. Between the four core activities — Munging, Modeling, Visualizing, and Interacting — there’s a lot of feedback.

    And as far as sexiness goes, I’m still holding my breath for People magazine to release its Sexiest Data Scientist Alive issue. It still may be a decade or more away.

  • Power to the Data Scientists!

  • @Michael – re:feedback definitely. fry gets into this as it’s one of his arguments for an interdisciplinary field – whereas in a collaboration, a person would have to explain to another what he wanted, have some misunderstandings along the way, and then wait.

    re:sexy we’ll have to start a letter campaign :)

  • I have an open question. My graduate training has a foundation in statistics building to psychometrics. I’ve spent some time in human factors and usability, and I was even an art major for a while. This has all given me a solid base to understand the story data is telling as well as put it into visualizations that facilitate understanding among non-specialists. However, while I appreciate the opportunity to have a sexy job, I am currently on track to senior management in a major corporation. So here’s the question: how do you balance becoming a specialist and developing the breadth necessary to perform as a business manager?

    • I don’t see that you can’t apply the ‘data scientist’ skill set to the ‘senior management in a major corporation’ job.

      Being able to gather/create, parse/analyze, then present data in a meaningful way would likely not hinder you.

      The only problem of being amazing is it makes others uncomfortable in their own skin, ie: “Jeez, I’ve been here 14 years and I never knew THAT. This guy is dangerous.”

  • I like Fry’s approach but in this time and age you’d have to accept that these skills are not necessarily mastered by the same person.

    The best mathematician may not be the best graphic artist, who is not necessarily the best interface designer, who is not always a subject-matter expert, etc.

    Some people have all those skills and then some, but that’s more the exception than the rule.

  • What about “storyteller?” Not trying to trivialize the issue at all, but the ability to effectively communicate the relevance and import of the findings would seem to be the skill that ties it all together. Completing the analysis isn’t the end of the project, getting the HiPPO’s sign-off is. If all the effort doesn’t go toward meeting an organizational goal, it’s wasted.

  • @Craig: The best answer is to look at senior managers whose work you admire, and see how they did it. For example, often specialized knowledge is replaced by the ability to notice, nurture, and exploit technical talent.

    And note well: if you can’t find a senior manager you’d like to imitate, that’s a sign that being an executive will make you unhappy.

  • @Craig, @Jérôme – i think it’s not so much about learning all there is to know about all the fields. Instead, you’re learning a collection of skills from the fields with the primary purpose of visualization (or computational information design). So for example, you might learn graphic design, but you’re not going to learn logo design; you’re going to learn how to display data.

    From a management standpoint, you’ll know what everyone is talking about and the work involved which makes it easier to delegate and to keep things moving.

  • I would add functional expertise to the list of skills. You need to understand the domain you are analyzing if you really want to understand your data.

  • I appreciate Dr. Fry’s explicit recognition and acknowledgement that “cartographers have mastered the ability to successfully organize geographic data in a manner that communicates effectively.” Fry goes on to suggest cartography could serve as a “useful model” and be extended in the “direction of Computational Information Design.” My only comment to Fry (and to Nathan’s post) is that it already has – a field called ‘Geographic Information Science and Technology’ (GIS&T). The GIS&T body of knowledge was outlined by Mike Goodchild and others in the mid 1990s and fully scoped in 2006 by the UCGIS. Perhaps new ‘data scientists’ will consider the groundwork already laid by so many geographers and cartographers as they expand into new territories and exploit emerging technologies.

    NCGIA Core Curriculum in Geographic Information Science
    http://www.ncgia.ucsb.edu/giscc/units/u002/u002.html

    Geographic Information Science & Technology (GIS&T) Body of Knowledge.
    http://www.aag.org/bok/

    University Consortium for Geographic Information Science (UCGIS)
    http://www.ucgis.org

    And a shameless plug for GIS&T at Shippensburg University (PA)
    http://webspace.ship.edu/geog/GIS/

  • Can we predict a theory [like special theory of relativity] using data?
    Can data help us to prove a theorem like Fermat’s Last Theorem?

    • Subhankar Ray,

      I am not sure what you are getting at, but these are quite difficult questions.

      1) Can we predict a theory [like special theory of relativity] using data?

      Understanding you to not be asking the trivial question of whether or not data and analysis methods are used to make discoveries, but rather asking about the algorithmic discovery of new laws, or regularities– yes, but we are not very good at unaided computer discovery yet. Since the inception of AI, computer scientists, philosophers, mathematicians, and other researchers have endeavored to find methods for automated discovery of regularities from empirical data. Herbert Simon (BACON), and Paul Thagard (PI) are two historical examples of automated discovery. Today, much is being done in the area of machine learning and discovery, but that is a book (or ten) by itself.

      2) Can data help us to prove a theorem like Fermat’s Last Theorem?

      While proofs of such theorems can be assisted by computers (see the four color theorem and the Coq proof assistant), theorems are derived through mathematical induction, construction, negation, etc., not through empirical data. Quasi-empirical data, however, is used: the results of enumeration (such as the ever-growing list of prime numbers) or random evaluation of a complex mathematical object (such as Monte Carlo methods on probability density functions) give us approximations or enumerations for further mathematical consideration.

      Hope this addresses your questions.

  • I think what’s missing is a discussion of incentives: How to get people to contribute data, both C2C or C2W.
    See:
    http://facebook.com/SocialDataRevolution,
    http://blogs.harvardbusiness.org/now-new-next/2009/05/the-social-data-revolution.htm (recent Harvard Business article)
    http://stanford2009.wikispaces.com (course wiki of Spring 2009 Data Mining course)
    Andreas Weigend

  • It’s interesting that the graphic you represented in this post has similarity to the job of librarians…
    We acquire books (or information); filter these books, or the information, to the right user or client; or mine the stacks or systems in pursuit of this information; the files are represented via catalog (cards or systems); and the reference service refines and interacts with our users.
    Do you think that to be a Data Scientist is to have a librarian’s expertise too? And vice-versa?

  • Well, given that it was a sexy maths lady who first pointed me to this article in the first place… ^^

  • Great summary. The data-munging/scraping part was something I particularly identified with. Working in the financial services industry (ahead of others when it comes to using data to drive strategy), I am still surprised by the amount of data that just goes untapped and unused. The systems (and the vision) required to capture this data and put it in a form which is ‘analyzable’ just do not exist.

  • Silverfern June 16, 2009 at 8:24 pm

    I would add one other domain/skill to Fry’s descriptions — Library/Information Science — to preserve the data (incl. access and retrieval, etc.) for the indefinite long-term.

  • This post reminds me of a post you made last year, about why Data Visualization isn’t popular. One point you made was that people know, but don’t know what it’s called (so they know.. but they just don’t know they know). But my question is where does ‘data visualization’ stop?

    The post here makes data visualization seem so complex (and I’m sure it is..) that I’m almost afraid to call simple graphs/maps ‘data visualization’.

    Like.. is this visualization?
    http://www.townme.com/san-francisco-ca/yuppie-locator/yuppies
    It’s a map, it uses census data, but it’s certainly not as complex as the typical visualizations on a blog like infosthetics..

    Or, is this blog (that one of your comment-ers suggested on your 37 visualizations post) also visualization?
    http://thisisindexed.com/

    I’m into visualization because reading lots of little words hurts my eyes, because I believe there’s a more efficient way to convey information, and it really is awesome looking sometimes. But, does everything count?

  • More on my previous comment. Here is a definition of visualization by http://eagereyes.org. (http://eagereyes.org/theory/Definition-of-Visualization.html)

    To summarize… Robert Kosara gives a lot of definitions that I think would rule out the links in my previous comment as “data visualizations”.

    And then…the reason data visualization isn’t popular might also be because it’s so exclusive/scientific/precise that people don’t want to touch/enjoy/appreciate it! What do you think?

  • Here is a definition of visualization from (a blog that was also suggested by a reader in your 37 visualizations post) http://eagereyes.org
    http://eagereyes.org/theory/Definition-of-Visualization.html

    To summarize… he gave a lot of very precise definitions and requirements for ‘information that results in a picture’ to be called visualization.

    The definitions (to me at least… casual graph-onlooker) seem pretty intense. And going back to your post about why data visualization wasn’t popular… it might be because data visualization is viewed as a science.. very precise/definite/intense that people are afraid to enjoy/appreciate it at all.

  • This is my favorite unusually good data visualization site: http://www.babynamewizard.com/voyager
    Somebody showed me this one in grad school, and I immediately proceeded to spend several hours riveted to my screen, looking up the popularity and geographic trends associated with the names of everyone I knew.

Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.