Rise of the Data Scientist

June 4, 2009

Topic

Design, Statistics

Photo by majamarko

As we’ve all read by now, Google’s chief economist Hal Varian commented in January that the sexy job of the next 10 years would be statistician. Obviously, I wholeheartedly agree. Heck, I’d go a step further and say statisticians are sexy now – mentally and physically.

However, if you went on to read the rest of Varian’s interview, you’d know that by statisticians, he actually meant it as a general title for someone who is able to extract information from large datasets and then present something of use to non-data experts.

Sexy Skills of Data Geeks

As a follow-up to Varian’s now-popular quote among data fans, Michael Driscoll of Dataspora discusses the three sexy skills of data geeks. I won’t rehash the post, but here are the three skills that Michael highlights:

Statistics – traditional analysis you’re used to thinking about
Data Munging – parsing, scraping, and formatting data
Visualization – graphs, tools, etc.
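
To make the three skills a little more concrete, here’s a minimal sketch in Python that touches each one on a tiny dataset. Everything in it is made up for illustration: the city names, the visit counts, and the helper functions munge, summarize, and text_bars are all hypothetical, and it only uses the standard library. It is not from Michael’s post, just one way the pieces might fit together.

```python
import csv
import io
import statistics

# Hypothetical raw data: stray whitespace and one missing value.
RAW = """city, visits
Springfield, 120
Shelbyville,  80
Ogdenville,
Capital City, 200
"""


def munge(raw):
    """Data munging: parse CSV text, strip whitespace, drop rows without a count."""
    rows = csv.reader(io.StringIO(raw))
    next(rows)  # skip the header row
    cleaned = []
    for row in rows:
        if len(row) != 2:
            continue
        name, count = row[0].strip(), row[1].strip()
        if name and count.isdigit():
            cleaned.append((name, int(count)))
    return cleaned


def summarize(records):
    """Statistics: basic summary numbers for the cleaned counts."""
    values = [count for _, count in records]
    return {"mean": statistics.mean(values), "median": statistics.median(values)}


def text_bars(records, width=20):
    """Visualization: a plain-text bar chart scaled to the largest count."""
    top = max(count for _, count in records)
    lines = []
    for name, count in records:
        bar = "#" * round(count * width / top)
        lines.append(f"{name:<14}{bar} {count}")
    return lines


if __name__ == "__main__":
    data = munge(RAW)
    print(summarize(data))
    print("\n".join(text_bars(data)))
```

In real work each step would be far bigger (scraping instead of an inline string, a proper model instead of a mean, a real chart instead of pound signs), but the order of operations is roughly the same.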

Oh, but there’s more…

These skills actually fit tightly with Ben Fry’s dissertation on Computational Information Design (2004). However, Fry takes it a step further and argues for an entirely new field that combines the skills and talents from often disjoint areas of expertise:

Computer Science – acquire and parse data
Mathematics, Statistics, & Data Mining – filter and mine
Graphic Design – represent and refine
Infovis and Human-Computer Interaction (HCI) – interaction

And after two years of highlighting visualization on FlowingData, it seems collaborations between the fields are growing more common, but more importantly, computational information design edges closer to reality. We’re seeing data scientists – people who can do it all – emerge from the rest of the pack.

Advantages of the Data Scientist

Think about all the visualization stuff you’ve been most impressed with or the groups that always seem to put out the best work. Martin Wattenberg. Stamen Design. Jonathan Harris. Golan Levin. Sep Kamvar. Why is their work always of such high quality? Because they’re not just students of computer science, math, statistics, or graphic design.

They have a combination of skills that not only makes independent work easier and quicker but also makes collaboration more exciting and opens up possibilities in what can be done. Oftentimes, visualization projects are disjoint processes and involve a lot of waiting. Maybe a statistician is waiting for data from a computer scientist, or a graphic designer is waiting for results from an analyst, or an HCI specialist is waiting for layouts from a graphic designer.

Let’s say you have several data scientists working together, though. There’s going to be less waiting, and the communication gaps between the fields narrow.

How often have we seen a visualization tool that held an excellent concept and looked great on paper but lacked the touch of HCI, which made it hard to use and in turn no one gave it a chance? How many important (and interesting) analyses have we missed because certain ideas could not be communicated clearly? The data scientist can solve your troubles.

An Application

This need for data scientists is quite evident in business applications where educated decisions need to be made swiftly. A delayed decision could mean lost opportunity and profit. Terabytes of data are coming in whether it be from websites or from sales across the country, but in an area where Excel is the tool of choice (or force), there are limitations, hence all the tools, applications, and consultancies to help out. This of course applies to areas outside of business as well.

Learn and Prosper

Even if you’re not into visualization, you’re going to need at least a subset of the skills that Fry highlights if you want to seriously mess with data. Statisticians should know APIs, databases, and how to scrape data; designers should learn to do things programmatically; and computer scientists should know how to analyze and find meaning in data.

Basically, the more you learn, the more you can do, and the more in demand you will be as the amount of data grows and more people want to make use of it.

47 Comments

jeff — June 4, 2009 at 3:31 am

i’m curious about how you chose the term “data scientist” to describe this role. that’s precisely the title we used for folks on the data team at facebook, chosen somewhat arbitrarily as a contraction of “data analyst” and “research scientist”, with the same skills in mind as you mention above. i also titled my chapter for “beautiful data” “information platforms and the rise of the data scientist”. quite amazing if you formulated the phrase separately! something in the air…
- Nathan Yau — June 4, 2009 at 10:29 am
  
  @jeff – ha, yes, there must be something in the air. i read a grant proposal two or three years ago pushing for a new area of study called “data science.” it’s stuck with me ever since.
Michael E Driscoll — June 4, 2009 at 3:38 am

Nathan – Nice synthesis and thanks for the shout-out. Ben Fry’s model captures the various fields that comprise this interdisciplinary ‘data science’ quite elegantly. I’d even venture to add some bidirectional arrows. Between the four core activities — Munging, Modeling, Visualizing, and Interacting — there’s a lot of feedback.

And as far as sexiness goes, I’m still holding my breath for People magazine to release its Sexiest Data Scientist Alive issue. It still may be a decade or more away.
Matt — June 4, 2009 at 9:35 am

Power to the Data Scientists!
Nathan Yau — June 4, 2009 at 10:39 am

@Michael – re:feedback definitely. fry gets into this as it’s one of his arguments for an interdisciplinary field – whereas in a collaboration, a person would have to explain to another what he wanted, have some misunderstandings along the way, and then wait.

re:sexy we’ll have to start a letter campaign :)
Pingback: Data scientists « Jabberwocky Ecology
Craig — June 4, 2009 at 12:37 pm

I have an open question. My graduate training has a foundation in statistics building to psychometrics. I’ve spent some time in human factors and usability; I was even an art major for a while. This has all given me a solid base to understand the story data is telling as well as put it into visualizations that facilitate understanding among non-specialists. However, while I appreciate the opportunity to have a sexy job, I am currently on track to senior management in a major corporation. So here’s the question: how do you balance becoming a specialist and developing the breadth necessary to perform as a business manager?
- Mike Figueroa — June 12, 2009 at 12:40 am
  
  I don’t see that you can’t apply the ‘data scientist’ skill set to the ‘senior management in a major corporation’ job.
  
  Being able to gather/create, parse/analyze, then present data in a meaningful way would likely not hinder you.
  
  The only problem of being amazing is it makes others uncomfortable in their own skin, ie: “Jeez, I’ve been here 14 years and I never knew THAT. This guy is dangerous.”
Jérôme Cukier — June 4, 2009 at 1:18 pm

I like Fry’s approach but in this time and age you’d have to accept that these skills are not necessarily mastered by the same person.

The best mathematician might not be the best graphic artist, who is not necessarily the best interface designer, who is not always a subject-matter expert, etc.

Some people have all those skills and then some, but that’s more the exception than the rule.
John — June 4, 2009 at 1:56 pm

What about “storyteller?” Not trying to trivialize the issue at all, but the ability to effectively communicate the relevance and import of the findings would seem to be the skill that ties it all together. Completing the analysis isn’t the end of the project, getting the HiPPO’s sign-off is. If all the effort doesn’t go toward meeting an organizational goal, it’s wasted.
anon — June 4, 2009 at 3:40 pm

@Craig: The best answer is to look at senior managers whose work you admire, and see how they did it. For example, often specialized knowledge is replaced by the ability to notice, nurture, and exploit technical talent.

And note well: if you can’t find a senior manager you’d like to imitate, that’s a sign that being an executive will make you unhappy.
Nathan Yau — June 4, 2009 at 5:24 pm

@Craig, @Jérôme – i think it’s not so much about learning all there is to know about all the fields. Instead, you’re learning a collection of skills from the fields with the primary purpose of visualization (or computational information design). So for example, you might learn graphic design, but you’re not going to learn logo design; you’re going to learn how to display data.

From a management standpoint, you’ll know what everyone is talking about and the work involved which makes it easier to delegate and to keep things moving.
Pingback: Analytics Team » Blog Archive » The role of the data scientist
Enrique — June 5, 2009 at 6:04 am

I would add functional expertise to the list of skills. You need to understand the domain you are analyzing if you really want to understand your data.
Scott Drzyzga — June 5, 2009 at 12:21 pm

I appreciate Dr. Fry’s explicit recognition and acknowledgement that “cartographers have mastered the ability to successfully organize geographic data in a manner that communicates effectively.” Fry goes on to suggest cartography could serve as a “useful model” and be extended in the “direction of Computational Information Design.” My only comment to Fry (and to Nathan’s post) is that it already has – a field called ‘Geographic Information Science and Technology’ (GIS&T). The GIS&T body of knowledge was outlined by Mike Goodchild and others in the mid 1990s and fully scoped in 2006 by the UCGIS. Perhaps new ‘data scientists’ will consider the groundwork already laid by so many geographers and cartographers as they expand into new territories and exploit emerging technologies.

NCGIA Core Curriculum in Geographic Information Science
http://www.ncgia.ucsb.edu/giscc/units/u002/u002.html

Geographic Information Science & Technology (GIS&T) Body of Knowledge.
http://www.aag.org/bok/

University Consortium for Geographic Information Science (UCGIS)
http://www.ucgis.org

And a shameless plug for GIS&T at Shippensburg University (PA)
http://webspace.ship.edu/geog/GIS/
Subhankar Ray — June 5, 2009 at 12:38 pm

Can we predict a theory [like special theory of relativity] using data?
Can data help us to prove a theorem like Fermat’s Last Theorem?
- John L. Taylor — June 10, 2009 at 12:48 pm
  
  Subhankar Ray,
  
  I am not sure what you are getting at, but these are quite difficult questions.
  
  1) Can we predict a theory [like special theory of relativity] using data?
  
  Understanding you to not be asking the trivial question of whether or not data and analysis methods are used to make discoveries, but rather asking about the algorithmic discovery of new laws, or regularities– yes, but we are not very good at unaided computer discovery yet. Since the inception of AI, computer scientists, philosophers, mathematicians, and other researchers have endeavored to find methods for automated discovery of regularities from empirical data. Herbert Simon (BACON), and Paul Thagard (PI) are two historical examples of automated discovery. Today, much is being done in the area of machine learning and discovery, but that is a book (or ten) by itself.
  
  2) Can data help us to prove a theorem like Fermat’s Last Theorem?
  
  While such theorems can be assisted with computers (see the four color theorem and Coq proof assistant), theorems are derived through mathematical induction, construction, negation, etc., not through empirical data. Quasi-empirical data, however, is used, meaning the results of enumeration (such as the ever-growing list of prime numbers) or random evaluation of a complex mathematical object (such as Monte Carlo methods on probability density functions) allow us approximations or enumerations for further mathematical consideration.
  
  Hope this addresses your questions.
Andreas Weigend — June 5, 2009 at 5:53 pm

I think what’s missing is a discussion of incentives: how to get people to contribute data, both C2C and C2W.
See:
http://facebook.com/SocialDataRevolution,
http://blogs.harvardbusiness.org/now-new-next/2009/05/the-social-data-revolution.htm (recent Harvard Business article)
http://stanford2009.wikispaces.com (course wiki of Spring 2009 Data Mining course)
Andreas Weigend
Pingback: Google needs more sexy statisticians « The Bernoulli Trial
Pingback: Bing Community
Sandréa — June 9, 2009 at 2:19 pm

It’s interesting that the graphic you represented in this post has similarity to the job of librarians…
We acquire books (or information); filter these books, or the information, to the right user or client; or mine the stacks or systems in pursuit of this information; the files are represented via catalog (cards or systems); and the reference service refines and interacts with our users.
Do you think that to be a Data Scientist is to have a librarian expertise too? And vice-versa?
- Nathan Yau — June 10, 2009 at 11:52 am
  
  having worked with information scientists, there are definitely several parallels in what we do.
Dave — June 10, 2009 at 11:45 am

Well, given that it was a sexy maths lady who first pointed me to this article in the first place… ^^
Pingback: Science Etcetera, Jupiterday 20090611 | ideonexus.com
Pingback: Data Scientists Apply Within - The Environment - Firefly Ecometrics
Krish Swamy — June 14, 2009 at 6:11 pm

Great summary. The data-munging/scraping part was something I particularly identified with. Working in the financial services industry (ahead of others when it comes to using data to drive strategy), I am still surprised by the amount of data that just goes untapped and unused. The systems (and the vision) required to capture this data and put it in a form which is ‘analyzable’ just do not exist.
Pingback: Cerebral Mastication » Blog Archive » Keeping Technical Talent or Why I Just Quit My Job
Pingback: Four short links: 16 June 2009 | Tech-monkey.info Blogs
Silverfern — June 16, 2009 at 8:24 pm

I would add one other domain/skill to Fry’s descriptions — Library/Information Science — to preserve the data (incl. access and retrieval, etc.) for the indefinite long-term.
Pingback: Visual Communications » Information Graphics cont’d
Pingback: Four short links: 16 June 2009 | Design Website
Pingback: The Rise of the Data Scientist [From Flowing Data] | Computational Legal Studies
Jessica — June 30, 2009 at 6:56 pm

This post reminds me of a post you made last year, about why Data Visualization isn’t popular. One point you made was that people know, but don’t know what it’s called (so they know.. but they just don’t know they know). But my question is where does ‘data visualization’ stop?

The post here makes data visualization seem so complex (and I’m sure it is..) that I’m almost afraid to call simple graphs/maps ‘data visualization’.

Like.. is this visualization?
http://www.townme.com/san-francisco-ca/yuppie-locator/yuppies
It’s a map, it uses census data, but it’s certainly not as complex as the typical visualizations on a blog like infosthetics..

Or, is this blog (that one of your comment-ers suggested on your 37 visualizations post) also visualization?
http://thisisindexed.com/

I’m into visualization because reading lots of little words hurts my eyes, because I believe there’s a more efficient way to convey information, and it really is awesome looking sometimes. But, does everything count?
Jessica — June 30, 2009 at 7:14 pm

More on my previous comment. Here is a definition of visualization by http://eagereyes.org. (http://eagereyes.org/theory/Definition-of-Visualization.html)

To summarize… Robert Kosara gives a lot of definitions that I think would rule out the links in my previous comment as “data visualizations”.

And then…the reason data visualization isn’t popular might also be because it’s so exclusive/scientific/precise that people don’t want to touch/enjoy/appreciate it! What do you think?
Jessica — June 30, 2009 at 7:17 pm

Here is a definition of visualization from (a blog that was also suggested by a reader in your 37 visualizations post) http://eagereyes.org
http://eagereyes.org/theory/Definition-of-Visualization.html

To summarize… he gave a lot of very precise definitions and requirements for ‘information that results in a picture’ to be called visualization.

The definitions (to me at least… casual graph-onlooker) seem pretty intense. And going back to your post about why data visualization wasn’t popular… it might be because data visualization is viewed as a science.. very precise/definite/intense that people are afraid to enjoy/appreciate it at all.
Pingback: The Rise of The Data Scientist « Visualness
Pingback: Rise of the Data Scientist « GIS and Science
Jen — July 7, 2009 at 3:56 pm

This is my favorite unusually good data visualization site: http://www.babynamewizard.com/voyager
Somebody showed me this one in grad school, and I immediately proceeded to spend several hours riveted to my screen, looking up the popularity and geographic trends associated with the names of everyone I knew.
Pingback: In The Tech News « Caintech.co.uk
Pingback: Ivan Frantar (ivanico) 's status on Wednesday, 08-Jul-09 08:14:31 UTC - Identi.ca
Pingback: Daily Links #75 | CloudKnow
Pingback: Flow » Blog Archive » Daily Digest for July 9th - The zeitgeist daily
Pingback: Data Scienist > Data Geek > Designer « Visualizing Economics
Pingback: Coast to Coast Bio Podcast » Blog Archive » Episode 23: So why were you talking about iPhones?
Pingback: The LinkedIn Blog » Blog Archive Data Scientists: Wrangling Data for Professionals «
Pingback: Data Scientists: Finding Patterns in LinkedIn Data | CITI Recruitment
Pingback: 9 Tips for Paid Search Newbies | Wise SEO Service
