Why learning code for data is worthwhile

July 12, 2011

Topic

Guides / programming, R

Plenty of tools have come out in the past couple of years that make data easier to handle, analyze, and visualize. Maybe you’ve used them. I use them all the time. However, no matter what software you use, there will always be a limit to what you can do with it.

Have you ever used an application (not just for data) and wished it could do something else? If you want a new feature, you have to wait for someone else to develop it, but if you can program, you can implement it yourself.

With a little bit of coding know-how, you gain more flexibility — and a little goes a long way.

I think a lot of people avoid programming because it seems scary and they have no idea where to start. I felt the same way when I first started learning code in college. I had no idea what I was doing, and it was actually one of the reasons I wanted to get away from my engineering major and jump into statistics. That programming background came in handy though, and it grew more useful as I played with more data. Now it’s hard for me to imagine working with data without having that flexible tool in my back pocket.

Just think of learning code as learning a new language (because that’s basically what you’re doing). When you start learning a new language, you’re not writing essays the first day. You learn punctuation, grammar, spelling, and other basics, and then you build up to paragraphs, essays, or even books. It’s the same with code. You learn the syntax and logic, and then apply what you learn to bigger problems.

It might feel a little slow at first, but another plus is that you can reuse code, meaning you could end up saving time in the long run.

This is not to say you should abandon the other tools and use code exclusively. Rather, it’s another tool in your box, and a powerful one that becomes more useful the more you use it.

In the end, the more ways you can explore, analyze, and present your data, the less likely you are to get stuck and the more likely you’ll be able to figure out what your data has to say.

20 Comments

Mike — July 12, 2011 at 12:49 am

Which coding language would you suggest, though?
- Nathan Yau — July 12, 2011 at 12:52 am
  
  @Mike – It depends on what you want to do:
  
  https://flowingdata.com/2009/09/03/what-visualization-toolsoftware-should-you-use-getting-started/
- Damian — July 12, 2011 at 7:44 am
  
  Python. Python. Python!
Alessandro — July 12, 2011 at 1:10 am

Thank you Nathan! I’m learning code for this specific purpose at the moment, and your post motivates me a lot!
Ajmal — July 12, 2011 at 1:53 am

Yeah I have the same question as Mike. What language would you suggest or prefer based on your experience?
Somesiena — July 12, 2011 at 5:11 am

I too would love your advice on which language to learn, and also your thoughts on Processing.
Paulo — July 12, 2011 at 5:52 am

Nathan, I’m a statistician and I learnt R at college. I think it’s a great tool for data analysis and manipulation, but you have to avoid loops and write the code in a more, let’s say, R-friendly way. I also learnt C and now I’m trying to understand object-oriented languages, starting with Java.
R. Jordan Prescott — July 12, 2011 at 6:26 am

a timely post – thanks!
Aaron — July 12, 2011 at 7:04 am

I agree completely. I first learned how to use VBA with Excel in undergrad and it changed my whole understanding of what I could do with data. All of a sudden, I didn’t need to rearrange spreadsheets cell by cell! VBA is kind of old and clunky, but for someone who had minimal programming experience, VBA was a great place to start (and it actually helped me feel comfortable when I started with R).
Dan — July 12, 2011 at 7:17 am

MIT’s introduction to computer science course is cogent, and it has a great perspective on computer science, not just programming. Plus the coursebook and video lectures are free!

Though I’m not sure if Python is useful for most data visualization tools.

http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-00-introduction-to-computer-science-and-programming-fall-2008/
- jerome cukier — July 13, 2011 at 2:12 am
  
  Python can be a great help in the “data” part of data visualization.
  It is good, for instance, for getting data from APIs, processing thousands of files, analyzing natural language, etc.
- Damian — July 13, 2011 at 4:49 am
  
  You can call R directly from Python, effectively making Python a data visualization tool.
  
  See modules rpy and rpy2 here: http://rpy.sourceforge.net/
  
  R: http://www.r-project.org/
  - Damian — July 13, 2011 at 4:58 am
    
    And if spatial data is your thing, you can call GRASS-GIS directly from Python too!
    
    http://grass.osgeo.org/wiki/GRASS_and_Python
David Torres — July 12, 2011 at 8:14 am

An excellent post, indeed. Thanks for the link too, as I want to start creating interactive infographics and need a starting point.
Arti K — July 12, 2011 at 11:28 am

General question to everyone – speaking to Nathan’s suggestion about being able to reuse code, has anyone figured out a good system to annotate and retain code you’ve used, so that you can reuse it later? I have been saving each program I write to a text file, but then, when I really need to refer to something, I can never bring up the right one and I end up just googling the concept and starting over from scratch.
- PR — July 12, 2011 at 1:24 pm
  
  Depending on what you’re asking there are two possible answers.
  
  The first is documentation. Taking the time to document/comment what you’re doing is always worthwhile. There are tools that can make this a little less painful, but at the end of the day, it’s just something that has to be done.
  
  But you appear to be specifically talking about reuse (not just re-finding). And simply put, this is one of the most powerful things about using a programming language in the first place. The specifics will depend on the language – but in general – the first way to go about reusing code is to write a library.
  
  For example, I do a bunch of data manipulation in ActionScript (since it allows for my final front-end), but ActionScript does not have a function for Mean or StdDev. So I wrote functions for both and put them in an include-able file. That means my various programs can call in this include file and run my “generic”, reusable code. This way I have one place to edit such things, but any number of individual programs can use them.
  
  So, if you find yourself doing the same thing twice (copying and pasting), it’s almost always worth taking the time to move that doubled-up logic to an external, called-in file of some sort. There’s overhead associated with doing this (both your time and possibly code-execution overhead) but it’s almost always worth it.
  
  That’s a basic first step. The next step is to use an object-oriented language, in which, instead of calling external functions, your new code can inherit the properties (and abilities) of some existing superset of code. This is much more powerful in many ways, but whether it is an option depends on the language.
  
  That’s a start anyway… (maybe?)
  - Nathan Yau — July 13, 2011 at 11:59 am
    
    +1 to this. The main thing for me is documentation. I try to put in a lot of comments in my code, with the intention of coming back to it in a year. That and making sure I use descriptive names for my filenames and folders :)
- Sebastian H. — July 14, 2011 at 8:29 am
  
  I started creating how-to documents for different topics/software packages. I often know when a piece of code might be useful in the future. What I do is to create a minimal example for a problem/solution instead of using the whole code (which usually also includes other stuff that will confuse me after a few weeks).
  
  I have a document (in LaTeX, which automatically creates an index and such) of code snippets for Stata Graphics, Stata in general, LaTeX, Python, and other software. Most code pieces work on their own, i.e. I first define which dataset to load (often the same one, so I don’t have to look up what’s in it) or I create random data in the example so I can be sure it will always work (I also highlight the important parts in the code).
  
  It sounds time-consuming at first, but copy-pasting and adjusting an example usually takes only a few minutes or less, and I have saved ten times the work over the past years.
  - Arti K — July 14, 2011 at 11:14 am
    
    @PR, @Nathan and @Sebastian H. – Thanks for your responses and ideas. As a programming newbie, I felt like I was missing a part of the puzzle and figured that those in the know were taught good note-taking skills along with programming. I’m hearing that’s not quite the case.
    I started tentatively with Excel VBA programming and have been saving the code (all of it, instead of just the relevant, reusable pieces as suggested here) in Google Docs so I have access to it wherever I go. It’s just not the most intuitive code library, and I need to look for a better repository / code-storage method. Thanks, your responses have been really helpful!
  - Nathan Yau — July 14, 2011 at 11:39 am
    
    @Arti – For code repositories, GitHub, SVN, and CVS are popular.
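
PR’s suggestion in the thread above, moving repeated logic into one include-able file, might look roughly like the sketch below. It is written in Python rather than ActionScript, and the file name `stats_helpers.py` and the function names `mean` and `std_dev` are made up for illustration. It also assumes the sample standard deviation (dividing by n - 1), since the comment does not say which version was used. Python already ships a `statistics` module, so this is only meant to show the pattern, not something you would need in practice.

```python
# stats_helpers.py (hypothetical file name): small helpers kept in one place
# so that any number of scripts can import them instead of copy-pasting.
import math


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    if not values:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)


def std_dev(values):
    """Return the sample standard deviation (n - 1 in the denominator).

    Using the sample version is an assumption made for this sketch.
    """
    values = list(values)
    if len(values) < 2:
        raise ValueError("std_dev() needs at least two values")
    center = mean(values)
    squared_diffs = sum((v - center) ** 2 for v in values)
    return math.sqrt(squared_diffs / (len(values) - 1))


if __name__ == "__main__":
    # Quick check when the file is run directly.
    scores = [2, 4, 4, 4, 5, 5, 7, 9]
    print(mean(scores))
    print(std_dev(scores))
```

Any other script saved next to this file could then start with `from stats_helpers import mean, std_dev`, so a fix made in the helper file reaches every program that uses it.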
