As data grows cheaper and more easily accessible, the people who analyze it aren’t always statisticians. Many have had no statistical training at all. Biostatistics professor Jeff Leek says we need to adapt to this broader audience.
What does this mean for statistics as a discipline? It is great news in that we have a lot more people to train, and it drives home the importance of statistical literacy. But it also means we need to rethink what it means to teach and perform statistics. We need to focus increasingly on interpretation and critique and move away from formulas and memorization (think English composition versus grammar). We also need to realize that the most impactful statistical methods will not be used by statisticians, which means we need more foolproofing, more time automating, and more time creating software. The potential payoff is huge for recognizing that the tide has turned and most people who analyze data aren’t statisticians.
Those who disagree tend to worry what might happen — what kind of data-based decisions will be made — by non-statisticians, and that should definitely be a priority as we move forward. Non-statisticians often make incorrect assumptions about the data, forget about uncertainty, and don’t know much about collection methodologies.
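The "forget about uncertainty" failure is easy to make concrete: two samples can have nearly identical means yet support very different conclusions once spread is taken into account. A minimal Python sketch, using only the standard library (the data and variant names are invented for illustration):

```python
import statistics

# Hypothetical conversion rates from two A/B test variants (made-up data).
# Both variants have almost the same mean, but B is far noisier.
a = [0.12, 0.15, 0.11, 0.14, 0.13, 0.12, 0.16, 0.13]
b = [0.02, 0.31, 0.05, 0.25, 0.01, 0.28, 0.04, 0.08]

def mean_with_ci(xs, z=1.96):
    """Sample mean plus an approximate 95% confidence interval
    (normal approximation; fine for a quick sanity check)."""
    m = statistics.mean(xs)
    se = statistics.stdev(xs) / len(xs) ** 0.5
    return m, (m - z * se, m + z * se)

for name, xs in (("A", a), ("B", b)):
    m, (lo, hi) = mean_with_ci(xs)
    print(f"{name}: mean={m:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Reporting only "A averages 0.133 and B averages 0.130" hides the fact that B's interval is many times wider, which is exactly the kind of context a trained statistician would insist on.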
However, as a statistician (or someone who knows statistics), you can shoo everyone else away from the data and gripe when they come back, or you can help them get things right.
Wow, exactly the same sentiment, and the same worries, are ongoing in the cartographic community. And this is the healthiest and best approach I’ve seen, so succinctly put. Good on you, Nathan and Jeff.
There has been plenty of damage already done by non-statisticians.
As a non-statistician analyst with a reasonable mathematical ability, what are the resources that can get me up to a basic working knowledge of statistics?
Having done basic Statistics/Probability courses required by most math/science/engineering undergraduate degrees, what is the next step?
Great article, guys, and excellent question posed by Liam. As a marketer, I’ve noticed that everyone loves talking about how data is making analytical insights once reserved for the Fortune 500 accessible to SMBs, yet no one is connecting the dots between collecting data and reaping its benefits. Where is the Codecademy-like solution for statistics?
Jeff Leek and Roger Peng both teach online courses. Worth a look:
Data Analysis, https://www.coursera.org/course/dataanalysis
Computing for Data Analysis, https://www.coursera.org/course/compdata
I can vouch for Jeff Leek’s Data Analysis course. I hope he’ll teach it again, but in the meantime the course content has been published here
And my own thoughts: http://www.vislives.com/2013/03/coursera-data-analysis-mooc-wrap-up.html
And it looks like Roger Peng’s Computing for Data Analysis materials are available, as well:
(Sorry, didn’t know that would embed itself like that…)
I think this is, on balance, a good thing. People with access to statistics but not the skill to interpret them will be an issue, but as long as that is happening in open communities, mistakes can be corrected and everyone has the opportunity to learn. Offline, behind-closed-doors, “expert” analysis in this environment has its dangers, as it always has. What was once the preserve of specialized skills and a guild mentality is being opened up by open-source software, crowdsourcing, online communities, and open knowledge such as Wikipedia. We are witnessing the democratization of knowledge and the rebirth of the polymath, not seen since the Renaissance.
I’d argue for a slightly different interpretation. Statistics and applying the data should be taught as a liberal art rather than as mathematics. The more data that exists, the more “lies, damn lies, and statistics” holds true. Thus, we should learn it under the umbrella of critical thinking.
We should also be more aware of our biases towards data and statistical thinking. Not only does the lay-person need more education, but so do the professionals! The article says, “Data became so cheap that it couldn’t be confined to just a few highly trained people. So raw data started to trickle out in a number of different ways. … The doctor stopped telling you what to do and started presenting you with options and the risks that went along with each.” Yet the work of Kahneman and Tversky clearly demonstrated that even professionals fail at statistical thinking: the same situation, framed in two different ways, elicits very different responses.
Couldn’t agree more. Spend more time teaching what tests and analyses to use in various situations and how to interpret the results, less time on proofs and formulas. Note I didn’t say eliminate proofs and formulas — there’s still great value in having an understanding of what you’re doing. But in an era when as a practical matter no one is whipping out pencil & paper or basic calculator to determine complex statistics and anyone can generate multiple stats with a few lines of code, the most important issue is using statistics wisely and well.
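The "a few lines of code" point above is easy to demonstrate: in any scripting language, a full battery of descriptive statistics is a handful of standard-library calls. A quick sketch in Python (the data is invented for illustration):

```python
import statistics

# Illustrative sample: response times in seconds (made-up data).
data = [1.2, 0.9, 1.5, 2.1, 1.1, 0.8, 1.3, 1.7, 1.0, 1.4]

summary = {
    "n": len(data),
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "stdev": statistics.stdev(data),
    "quartiles": statistics.quantiles(data, n=4),
}

# Producing the numbers is trivial; knowing whether the mean or the
# median is the right summary for skewed data is the statistical part.
for key, value in summary.items():
    print(key, value)
```

Which is exactly the commenter's point: the computation takes seconds, so the scarce skill is choosing the right statistic and interpreting it honestly.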
RE: “Those who disagree tend to worry WHAT MIGHT HAPPEN — what kind of data-based decisions will be made — by non-statisticians, and that should definitely be a priority as we move forward. Non-statisticians often make incorrect assumptions about the data, forget about uncertainty, and don’t know much about collection methodologies.”
Jeff & Nathan’s point is the right one. No MIGHT HAPPEN about it. Non-statisticians ARE making data-based decisions, often without an adequate background in statistics–not even enough to know what they (OK, we) don’t know.
Statisticians need to help those of us in decision making positions understand the technical aspects of our data, and we need to take responsibility for learning about statistics. (I took Leek’s great course on R via Coursera in my “spare time” as a starting point, and I’m taking additional related courses from others).
Re: Jason’s point on teaching statistics as part of a liberal arts curriculum, I would have been far better served with courses in probability and statistics than the courses in calculus I took in high school and college. It wasn’t until business school that I got my first formal exposure to probability or statistics, and I got KILLED on my exam about interpretation and critique of stats. A sobering experience for a previously high performing student.
Interesting article, thanks for sharing.
I think it helps to recognize two things that are in play in this piece. First, there is more data (often better, more useful data as well) that is easier to access in real time. The second thing is to recognize that there are different roles, i.e. the data analyst/statistician is not always the person using the data to make a decision.
Leek’s examples of when statisticians stopped being data analysts are all situations where the individual person has always been the decision-maker (e.g. where to go to dinner). It’s just that the individual now has more, better data (restaurant ratings of places near you!) that are available anywhere via mobile device.
The real challenge, I believe Leek is saying, is about the larger impacts or risks in situations where expert analysts (statisticians, scientists), interpreted data for experts who made decisions (medicine/doctors, policy/elected officials, etc.)
With better access to more data, it’s easier for non-experts to offer analysis, which in turn puts at risk the quality of decisions based on that analysis. The same phenomenon – the proverbial firehose of data – may also make it more difficult for even expert analysts to distinguish signal from noise.
Leek’s recommendations are spot on: improving statistical literacy (AND visualization literacy as well!), directing statisticians toward interpretation and critique rather than pure technique, and fool-proofing the tools.
I think I’m a prime example of this. I’m not a statistician and I don’t have a technical background. Yet, in my profession as a customer acquisition / lead generation marketer (most recently served as the GM of a 22M+ member B2C company), I have to make highly analytical decisions on a daily basis.
I don’t need to see more data, just the relevant data that I can effectively apply to answer my core needs. This sentiment rings true for non-technical end users of data, across all verticals. We need the meaning of the data, in an actionable format, that allows us to solve real world problems.
To best serve ‘non-technical’ end users and prevent rampant misappropriation of data, tools aimed at this user group should be very limited in scope, addressing a specific functionality. Give us a tool that can do everything and we won’t do anything with it. But make a tool that solves a specific question AND gives me direction on what action to take next, and the combination is unstoppable. I’m currently tackling this idea in my new startup, DataScore.