We’re statisticians. We don’t program.
— Anonymous statistician
I was talking to a small group of statisticians a few months ago, and someone said that to me when I told them how I go about mucking around with data. It still annoys me just thinking about it. It wasn’t that he didn’t know how to program, which is perfectly understandable. It was that he said it as if programming and statistics were so separate that the two could not possibly go together.
Wrong.
Let’s set things straight before this silly idea spreads further. Programming and statistics belong together, and you don’t have to be a coding genius for it to work.
As a mathematician, I feel the same when some mathematicians say “numerical analysis is not analysis”.
Ruben
This attitude is one reason I left academia to work at SAS: I love to combine statistics and programming, and at SAS they appreciate all of my skill sets, not just one.
@Rick – How about putting in a word at work about getting SAS ported to the Mac again… Based on their Q4 earnings report, Apple is here to stay ;^)
And there are plenty of data analysts/viz professionals on Macs out there and the SAS Institute needs to stop ignoring them!
Wow. It is hard to believe someone could have that view. Perhaps some people just get their data handed to them in pristine form?
Your view is absolutely true. At a recent mini-conference w/ other students in Earth sciences, our conversations led to ‘what language do you use for your stats?’ It was really interesting to see how certain languages and tools vary by discipline. The cartographers among us preferred R, while the materials and polymer science students commonly used MATLAB/Octave and SciPy.
That certain languages out-perform others at different types of analyses is proof that programming and stats are linked pretty tightly.
There’s a name for statisticians who don’t program: Unemployed.
Or consultants. A statistician who has a great deal of programming ability is a “Data Scientist.” I am so glad there is finally a term to distinguish us from the SAS/SPSS/pencil and paper statisticians.
What the SAS dude said. As a statistician and former statistical software designer, stats & programming totally go together!
How ironic that a blog about Statistics takes as significant (not in the statistical sense) the comment of a single individual, probably not very bright. *All* the statisticians I know do some coding, even the ones that do only theory. Most of them in R and MATLAB, and some of them, like the Real Programmers™, use FORTRAN.
I’d say you’re barking up the wrong tree.
@gappy — I’m a statistician who programs, and I know lots who do, too. But there are many — not just this one — who I’ve met, and they don’t touch code. Then when you step out of the “statistician” job title and also include “analysts,” then code grows more rare.
Again though, I’m not saying statisticians don’t program. They do and should. It’s just that not everyone thinks that way, which bothers me.
These people irritate me to no end. Statisticians SHOULD program, but I am of the mindset that they are two totally separate skills, because computer science, basic software engineering and programming in general are pooh-poohed in academia.
Nathan’s not wrong here. I work with statisticians who couldn’t program their way out of a paper bag. They think their strength is knowing what SAS procedure to run to get a p-value. And they do expect the data that reaches them to be analysis-ready. This may represent a minority of statisticians, to be sure, but the mindset _is_ out there.
Maybe this is the difference between “applied” statisticians and “mathematical” statisticians.
Although I understand the math grounds statistics (theory and all), I’ve always thought it odd that the science of “making sense of data” is not assumed to be inextricably linked with manipulating data… i.e. programming. It has to be done.
A lot of great statistical problems have been discovered by mucking around in the data. Any statistician who does not program is missing a very important half of the profession.
I thought *all* statisticians were programmers. This is illuminating. Just to understand his point of view though, what does he feel he ought to do? How does he even pull his data together? How does he analyze it?
I’ve known lots of people with that attitude. They looked down on me for “programming my own stunts” because real statisticians had minions to do the mucking about with data. Real statisticians were writing articles about their theories. Like Rick, that attitude is one of the reasons I left academia. Okay, well, that and the money!
I was so disappointed to find out that earth/physical scientists do so much programming, too. Because I love science, but I do not like writing code. Am sure glad there are folks who do, though!
I think the guy is right to a certain extent. With JMP there is no need to program, but you can if you like. So it serves both worlds: those who like programming and those who can’t do it or don’t like it.
If you don’t program, how on earth do you use SPSS properly…??
Maybe they’re making an artificial distinction between scripting and programming…?
Upon further reflection, I have two additional comments:
1) Let’s not forget that an awful lot of great statistics was done before computers were widely available (Fisher and Galton, anyone?). Even Tukey developed many techniques for hand computation (like the box plot and the stem-and-leaf diagram). It’s just that TODAY’s data tends to be so large (and messy!) that computers are necessary to manipulate and visualize the data. Plus, nonparametric methods free the modern statistician from the “procrustean statistics” (to quote John D. Cook) of the mid-20th-century.
2) Many government agencies and industries split data analysis between two groups: the “statistical programmers” who use ETL to extract, clean, and prepare the data, and the statisticians who swoop in to carry out the analysis using pre-approved, established methods. These statisticians can do much of their job by using a software GUI, simple scripting, or a procedure-based analysis in SAS. Since they have a team of programmers working with them, they don’t have to do any programming themselves.
So, yes, it’s POSSIBLE, but I think it’s becoming more rare.
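To make the point about hand computation concrete, here is a minimal sketch of Tukey's stem-and-leaf display, one of the techniques mentioned in point 1. It is only an illustration: the function name `stem_and_leaf` and the sample scores are made up for this example, and it assumes a non-empty list of non-negative integers with the last digit used as the leaf.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf display.

    Assumes a non-empty list of non-negative integers. The stem is the
    value with its last digit removed; the leaf is that last digit.
    """
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)

    lines = []
    # Walk every stem between the smallest and largest so that gaps
    # in the data remain visible as empty rows.
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines


# Made-up scores, for illustration only.
scores = [12, 15, 21, 23, 23, 38, 41, 47]
print("\n".join(stem_and_leaf(scores)))
```

With a few dozen values this is quick to do with pencil and paper, which is the point: the display only becomes impractical by hand once the data gets large or messy.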
I agree with @RickWicklin, but it makes me wonder whether that kind of distinction is actually the difference between reporting and analysis. In other words, the people working in the organizations @RickWicklin describes are generating reports and not necessarily what many of us would consider statistical analysis.
Reporting is a big part of “Business Intelligence.” It is yet another term to throw into the confusion of Statistics and Data Science.
As a statistical programmer, I found this to be an interesting topic. The statisticians for whom I program have rudimentary programming skills, but rely on me to write efficient code and make the output conform to their specs. They typically review my code and I believe they understand it, but their days are full evaluating potential collaborators, haggling over priorities, writing reports and journal articles, travelling to conferences, etc. They are perfectly happy to let me sweat the details of the programming while they focus on the big picture. The work we do is a team effort and I enjoy it.
Where are you and can I hire you now? That said, even in this system, the scientists (statisticians, economists) should be able to write enough code to do the development work and do the proof of concept. The gap between theory and practice once it hits the data can be big.
Clearly, the future belongs to analysts, data scientists and statisticians who code. I couldn’t imagine taking anyone on who couldn’t analyse _and_ code.
This reminds me of my graduate school. My supervisor was using MATLAB and simulation to get the answer. The chair of the department was using pencil and paper to solve the model and get the answer.
I don’t know who’s right or wrong. I’m using R now.
One interesting thing about this discussion is that several commenters essentially equate “programming” with “coding”. Programming languages in the context of statistics are a symbolic abstraction that is used to translate from the traditional language of statistics – symbolic math – to one that a computer can understand. There are visual programming environments like Clementine that attempt to provide a non-text based programming model. Focus on the end result, not the method, I say. SAS folks have often said to me things like “if you MUST have a GUI” and end with some dismissive comments about those who prefer not to master their 20th proprietary text based programming metaphor. As someone who leads a team of data scientists, I want to find ways of exploring ideas as quickly as possible. Some things don’t lend themselves to visual environments for sure, but I submit that “programming” does not necessarily equal “coding”, and that text based programming approaches are not always superior to graphical ones, and vice versa.
You want to find ways of exploring ideas as quickly as possible? Give JMP a try: http://www.jmp.com. By the way, it does both exploratory analysis and programming.
Darwinism will ensure ignorance is punished.
I have no idea how one would manipulate datasets (create, link, clean etc), review the data (exploring), and do some statistical modeling work without coding. And then quality control is also so much easier and quicker. I hate GUIs: in the long run they waste time and it is so much easier to make a mistake. If you love statistics and are developing models, it is well worth the time investment to learn how to program. You will make up all that time really fast and deliver a better product – and have the benefit of drawing upon code you used previously for a new project.
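As a rough sketch of what that workflow can look like in code, the example below creates two small datasets, cleans one, links them by subject id, reviews group means, and fits a simple least-squares line, using only the Python standard library. Everything in it is an assumption made for illustration: the column names, the inline CSV data, and the helper functions `read_rows`, `clean`, `link`, `group_means`, and `fit_line` are all hypothetical.

```python
import csv
import io
from statistics import mean

# Hypothetical raw inputs; in practice these would be files or database tables.
SUBJECTS_CSV = """id,group
1,control
2,treated
3,treated
4,control
5,treated
"""

MEASUREMENTS_CSV = """id,dose,response
1,0,1.0
2,2,5.0
3,4,
4,0,1.0
5,4,9.0
"""


def read_rows(text):
    """Create: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: drop rows with a missing response and convert numeric fields."""
    cleaned = []
    for row in rows:
        if row["response"].strip() == "":
            continue
        cleaned.append(
            {"id": row["id"], "dose": float(row["dose"]), "response": float(row["response"])}
        )
    return cleaned


def link(measurements, subjects):
    """Link: attach each subject's group to its measurement by id."""
    group_by_id = {s["id"]: s["group"] for s in subjects}
    return [dict(m, group=group_by_id[m["id"]]) for m in measurements if m["id"] in group_by_id]


def group_means(rows):
    """Review: mean response per group."""
    groups = {}
    for row in rows:
        groups.setdefault(row["group"], []).append(row["response"])
    return {name: mean(values) for name, values in groups.items()}


def fit_line(rows):
    """Model: ordinary least squares for response = intercept + slope * dose."""
    xs = [r["dose"] for r in rows]
    ys = [r["response"] for r in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


data = link(clean(read_rows(MEASUREMENTS_CSV)), read_rows(SUBJECTS_CSV))
print(group_means(data))
print(fit_line(data))
```

Because each step is a small function, the same code can be rerun when the data changes and reused on the next project, which is much harder to do with a sequence of GUI clicks.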
I am thinking about a blog post on this. It is difficult to explain. Yes, if you use R to do data analysis you have to write code, but it is “boxed” code. It is like writing SPSS, SAS, Mathematica, Matlab code. The skills are not transferable to new situations, although many companies do acknowledge R now. I think the consensus is that “programming” and “coding” mean writing code in a language that is universally accepted such as C, C++, Java, Python etc. IMO, that defines a Data Scientist vs a Statistician. I argue that a Data Scientist builds things that process, utilize and visualize data, and deals with online decision making and its scalability… whereas a Statistician does not.
I agree with you about GUIs. No purpose in research (except for visualization and user experience), too many dependencies (except with frameworks like wx and Swing), and a waste of time. That is why companies hire people to just work on frontend :)
General programming subsumes specialist programming. We can still write the pure ‘stats’ stuff in R and call that into a python environment etc. Which is what scipy does with classic FORTRAN packages etc. Re-use, recycle :)