R is an ‘epic fail’ – or how to make statisticians mad

Posted to Software, Statistics

Statisticians are mad and out for blood. Someone called R an epic fail and said it wasn't the next big thing.

I know that R is free and I am actually a Unix fan and think Open Source software is a great idea. However, for me personally and for most users, both individual and organizational, the much greater cost of software is the time it takes to install it, maintain it, learn it and document it. On that, R is an epic fail. It does NOT fit with the way the vast majority of people in the world use computers. The vast majority of people are NOT programmers. They are used to looking at things and clicking on things.

How dare she, right? Here's the thing. She's right. Wait, wait, hear me out. For the general audience - the people who use Excel as their analysis tool - R is not for them. In this case, where the goal is to appeal to non-statistician analysts, R is, as they say, an epic fail (and that is the last time I will say that stupid phrase).

However, R wasn't designed to enable everyday users to dig into data. It was designed to enable statisticians with computing power. It's a statistical computing language largely based on S, which was developed in the 1970s by the super smart John Chambers of Bell Labs. The 1970s. Weren't people using slide rules still? Or maybe it was the abacus. Can't remember. Oh wait, I wasn't born yet. In any case, there's really no need to get into the whole R-for-general-audience conversation — just like we don't need to talk about why The SpongeBob SquarePants Movie lacked emotional depth.

The Next Big Thing

Instead, let's look at the main point of the post that got lost in the R-bashing. What is the next big thing? The next big thing is "data visualization" and "analysis of unstructured data." Okay, so she sort of got methods and tools mixed up, but we'll let that one pass. There are a growing number of applications that can help you with this, without the need for programming. Check the sidebar for a couple of great options. Even R has a couple of nice user-friendly interfaces. I guess SAS lets you do this to an extent too (I've never used it).

Who's behind the software though? Who are these people who are making others' lives easier with click-and-play analysis tools? It's people who can code. It's the people who know how to dive deep into data.

Matt Blackwell made a good analogy to the iPhone/iPad craze. It's sexy, it's user-friendly, and that is because talented Apple developers created amazing software, and third-party app developers are putting out quality products that consumers can buy in the app store (and making a good living out of it). Similarly, those who can build visualization and analysis tools are the ones who will provide the next big thing.

So don't get too upset, R programmers, or all data scientists for that matter. While the software was bashed, you're getting a thumbs up. R is not the next big thing. You are. Besides, we all know that data is the new sexy, and in the end it's not about the tools that you use, but what you do with the tools.

35 Comments

  • I use JMP for data analysis because it offers a hybrid strategy: point and click with a rich, underlying scripting language. For example, I may spend some time dragging and dropping fields, adjusting the axes and colors for one graph, then simply save the underlying script so I never have to do that again.

    Anything you can do within the GUI, you can do via script, and the automation is key to my work flow.

    And rumor has it, JMP9 will have an R interface. Whatever that means.

  • Nice post. I agree R isn’t for everyone and isn’t a great tool for someone that needs to rely on a point-and-click interface.

    Ironically enough, those kinds of black-box software applications aren’t always the best tools for implementing her two picks for “the next big thing”: Analyzing enormous quantities of unstructured data, and Data visualization.

    Perhaps the more appropriate discussion that should take place is whether or not basic programming skills are essential to doing modern science in the same way that basic mathematical and statistical skills are considered essential?

    Disclaimer: I should be clear that — as you might have already guessed — I’m a pro-programming R fan. ;)

    • oh put me down for that disclaimer too :). all statisticians and anyone who wants to seriously analyze & visualize data need to know how to program. whether it’s R or something else, it doesn’t really matter – as long as it gets the job done. Otherwise, you’re always waiting for someone who does know code.

    • I also don’t know why big data and visualization are given as the next big thing. Aren’t they the current big thing?

  • Does R exist for development in Microsoft technologies like .NET/ASP.NET/C# etc?

  • I’ll admit I have not read the original blog post. Based on your post and the fury it’s caused on Twitter, I get what the author is trying to say, but I think the field is changing. My opinion is that SAS and friends are becoming more and more intro level tools, and “pure statisticians” are using R or some other programming language. Other statisticians in the social sciences, psychometrics etc. prefer SAS and friends for their canned routines. This was apparent in an informal poll that was taken at the ACM Data Mining conference I attended last month.

    I think it’s comical that the person behind “The Julia Group” is from USC ;-), but I digress :).

  • I’ll admit to mostly using R for canned routines (lm, glmer, principal) but it takes a lot of grinding to get data into the shape you want and to integrate data from different sources.

    But if your job is selling this to other people, the reluctance to push R is understandable. Nathan is right: if you think R needs a GUI or needs to be more user friendly, it isn’t the tool for you. Move along, nothing to see here. If the idea of a statistical tool that is also a functional programming language doesn’t turn you on, then R is not for you. There’s nothing wrong with that, and R proponents shouldn’t get so bent out of shape about people not liking it.

    Anyway, a single thing like R could never be “the next thing.” In this game, the next thing is never a tool but rather a new class of problems to solve out of which new tools grow (these were the types of larger problems that De Mars cited).

  • R isn’t an epic fail, but R documentation might be.

    IMO, the key to R’s widespread adoption is clear documentation. I’m not a classically-trained programmer, but I’ve definitely picked up web-based languages quickly because of it (PHP and jQuery come to mind–yes I know, jQuery is a framework not a language).

    If R documentation can approach the clarity that php.net or docs.jquery.com have, I think you’d see a proliferation of easier-to-use analytical tools, for both the basic and power user. Then we could spend less time in how-the-hell-do-i-get-this-to-work mode and more time in analytical mode.

    The next best alternative is to buy a >$40 book on R. And if you’re like me, chances are you’re using R because you’re cheap.

    • I’m not sure why everyone dogs on R documentation. I find it really helpful. Maybe I am just odd. ;-)

      • i think it’s pretty decent too. explanation up top, arguments in the middle, and examples on the bottom. a lot of the time you have to go to the parent class to find available arguments though (e.g. for graphing).

        but yeah, it’d be way cooler if there was a structured place for R docs like for jQuery, etc. someone should update the R site in the process too. the number of R downloads would double simply on a site redesign bringing us into 2010.

  • Matt Platte April 22, 2010 at 9:21 pm

    Yes, we were still using slide rules in the 70s. Now get off my lawn, you punk kid!

  • JustPassingThrough April 22, 2010 at 10:02 pm

    What about RExcel? Does that solve the problem?

    http://rcom.univie.ac.at/

  • Need “data visualization” and “analysis of unstructured data”? Try the Incanter library from the Clojure programming language. It’s more powerful than R.

    • When you say more powerful, I assume you’re comparing R and Clojure as languages.

      I’m definitely behind languages like Clojure taking over from R if they can build up a similarly comprehensive library of functions. Incanter goes a long way toward making this happen, although the chief thing it is missing is the idea that the package is not just a way to run statistical functions, but that it is also an interactive shell for doing analysis, where you want functions that print easy-to-read model output.

      Of course the bigger hurdle right now is that Clojure is a lot more difficult to install than R :).

      I’m still wondering if Fortress will ever go anywhere. Its mathematical syntax would be a great base for statistical computing.

  • Ayush Raman April 23, 2010 at 2:21 am

    I think your blog reminds me of the famous adage that IGNORANCE IS BLISS!!

    Coming up with an example and comparing with the everyday user not using R is ridiculous. Do English/Econ majors use C/C++? No!! But that doesn’t mean that it is not popular or that it is a failure. Similarly, if one is not using R, that doesn’t mean that it is a failure. See how many people are really using it every day, and in how many different ways, to solve problems.

    R has some really good data visualization packages and really nice GUIs. Please read about those packages or do enough of a background check before coming up with a blog post like this, because it really makes you look stupid among those who are actually using it every day.

  • Before I came across this site, I started using python and matplotlib for data analysis and visualization. This seems to do the job fairly well, although I wish the docs for matplotlib were a little clearer.

    Has a comprehensive comparison been done across different data crunching methods?

  • Josh Hemann April 23, 2010 at 7:47 am

    I have been in the commercial analytical software business for the past decade and one thing I don’t hear people discuss a lot is tools/languages for prototyping versus full on application development. R is a great tool to have in your toolbox for doing ad hoc, exploratory data analysis. It is not a good tool for application development. Ihaka said the same thing in the November, 2008 issue of Technometrics. I see this thing a lot in computational finance where Excel still rules in a major way. People in this space bludgeon everything with Excel because that is the tool they know.

    Speaking of Ihaka, he and Duncan Temple Lang have a nice article in the 2008 COMPSTAT Proceedings (in Computational Statistics) where they argue for porting R functionality and syntax to a “Lisp-based engine” to improve speed and scalability. Interestingly, they make a lot of comparisons with Python, a language I use as my mainstay for data analysis. From the authors’ viewpoint, the reason for not picking Python as the “next stat language” is because while it is often faster than R, it is not fast enough (and not nearly as fast as Lisp). The good thing about Python though is that because it has been strongly adopted by computer scientists, the language itself is evolving quickly (e.g. Unladen Swallow and PyPy) as opposed to R (adopted only by statisticians), where the number of libraries is increasing quickly but the R core team is small and overworked with respect to the language itself.

    Ideally, I think we would travel back in time to the early 1980s and all algorithmic development would always take place in C first. Then, people in domain specific areas could take their language of choice (e.g. Python, R) and wrap these C routines. Thus, you’d have the same algorithms available in your prototyping, exploratory language as you would in the language you might build large scale applications with.

    • Lots of R libraries are still written in C.

      I never use C anymore, probably to my detriment.

      • Josh Hemann April 23, 2010 at 6:29 pm

        True, and much of SciPy and other packages have C underneath (and Fortran), not to mention that the R and Python interpreters themselves are written in C. The problem though is that often times it is not realistic to pull out this code and use it in a pure C application. The case is more clear for routines that simply wrap Netlib say, but even then there can be issues.

  • Travis McTeer April 23, 2010 at 9:42 am

    I don’t have a programming background, but I do a lot of statistical work with very large data sets. I’ve generally been able to do anything I need to do using SPSS. I have been meaning to learn R (I like the idea of free), but just haven’t had the time. One thing that does strike me as a bit discordant about this thread is this idea that graphical representation of data is wonderful and sexy, but graphical representation of code (GUI) isn’t. Not necessarily disagreeing since I admittedly don’t know enough about programs like R, just observing.

    • you can do a lot through gui-based stuff, but eventually you have to design your workflow around the software no matter how good it is. When you’re dealing with more complex stuff, you want it to be the other way around.

      • just to be clear. i love gui-based stuff. i use it as much as possible, but programming lets me do more once i can’t do any more with the gui. i don’t have to wait for someone to add a new feature.

    • One commonly cited benefit of code-based (vs. gui-based) analyses is their repeatability. If I want to repeat an analysis on a given data set, I don’t need to replicate all my pointing and clicking. Just swap out the data file and let it run. Not sure if some GUIs allow for such repetition, but it’s the default in scripting platforms.

      • @Paul – totally agree. the gui-based stuff does let you save files, and you have an undo path for some programs, but for one of the projects i’m working on for example, i had generated a bunch of different graphs and then found out i had to do it again with similar but different data. luckily i have all of my scripts set up so i can just run it all over again easily.

      • I just learned this last week, with the GUI-based SPSS. Really surprised me.

        All of the model fitting dialog boxes have a “Paste” button. When you click it, the code that will reproduce the model and settings you’ve specified will be pasted into a new syntax file. It is quite handy, especially since the online help is not as good at explaining the newer modeling syntax compared to the old user manuals from the 80s.

  • I distinctly remember having a fancy TI calculator in the late 70’s. No slide rule for me!

    As for R, I’m surprised by its sudden popularity, although its “free-ness” has a lot to do with it. I have always liked SAS as a pseudo-programming language. I also like the effort that SAS has put into its JMP product. It’s as close to point-and-click as we’re going to get for professional statisticians/data analysts.

  • I can see why R on its own is difficult to master for a non-programming person. I found it difficult at first, but then I discovered the R Commander gui and the R Rattle gui. I now use both to do an initial exploratory analysis of any dataset and they work fine for most of what I do. When some deeper analysis is required I roll up my sleeves and tackle the documentation and plough headlong into it.

    And it’s free! I guess many independent consultants like me can’t afford the steep costs of SAS or SPSS.

  • In the little teaching I’ve done, I’ve found that people tend to struggle with 3 things when they’re “learning R”:

    1. Learning (and learning to apply) basic mathematical and statistical concepts.
    2. Learning basic programming skills like using vectors, loops, functions, etc.
    3. R syntax, functions, etc.

    By far the most painful of the three seem to be the 1st and 2nd, and not so much the third!

  • I’m a grad student veering out of an interdisciplinary wasteland into political science. I think Paul’s right that R isn’t the hardest part, but R can be a frustrating bedfellow. I’m sticking with it (so far with no instruction) because statisticians and others have convinced me I’m going to end up there anyway to answer the questions I want to address.

    I will say that the documentation seems thorough, but is entirely illegible to those who aren’t already very well versed in statistics and accustomed to the way computer documentation is written. Luckily, I’m pretty good at reading UNIX man pages.

  • I’ve spent several decades in commercial statistical software development (working in a variety of R&D roles at SYSTAT, StatView, JMP, and SAS), and I now do custom JMP scripting, etc., to make my prejudices clear.

    I can say with hard-won authority that:

    – good statistical software development is difficult and expensive
    – good quality assurance is more difficult and expensive
    – designing a good graphical user interface is difficult and expensive
    – a good GUI is worthwhile, because the easier it is to try more things, the more things you will try, &
    – creative insight is worth a lot more than programming skill

    Even commercial software tends to be under-supported, and I’ll be the first to admit that my own programming is as buggy as anybody else’s, but if I’m making life-and-death or world-changing decisions, I want to be sure that I’m not the only one who’s looked at my code, tested border cases, considered the implications of missing values, controlled for underflow and overflow errors, done smart things with floating point fuzziness, and generally thought about any given problem in a few more directions than I have. I want to know that when serious bugs are discovered, the knowledge is disseminated and somebody’s job is on the line to fix them.

    For all these reasons, I temper my sincere enthusiasm about the wide open frontiers of open source products like R with a conservative appreciation for software that has a big company’s reputation and future riding on its accuracy, and preferably a big company that has been in the business long enough to develop the paranoia that drives a fierce QA program.

    R is great for what it is, as long as you bear in mind what it isn’t. Your own R code or R code that you find sitting around is only as good as your commitment to testing and understanding of thorny computational gotchas.

    I share the apparently-common opinion that R’s interface leaves a lot to be desired. Confidentiality agreements prevent me from confirming or denying the rumors about JMP 9 interfacing with R, but I will say that if they turn out to be true, both products would benefit from it. JMP, like any commercial product, improves when it faces stiff competition and attends to it, and R, like most open source products, could use a better front end.

    [An expanded version of my comments is cross-posted on Global Pragmatica LLC’s blog at http://globalpragmatica.com/?p=230.]
