How Open Should Open Source Data Visualization Be?

August 29, 2008

Topic

Visualization

I used to ride my bike to school, and I always forgot my U-lock. Instead of riding back for it, I’d just stash my bike unlocked in between a cluster of bikes. I told my friend jokingly, “It’ll be OK. 98% of people are good.” One day I got out of class, and my bike was stolen.

I was cleaning up some Actionscript in preparation for a tutorial post on how to make your own animated Walmart map, but a couple of bad memories involving stolen code and bad knockoffs (of my work) stopped me midway. I had to think:

Is releasing my code the best thing to do?

I’m sure the consensus is a resounding yes, but what’s to stop some lazy person from ripping off my code and pawning it off (or worse, selling it) as their own? What if I want to sell my visualizations? I am after all a lowly graduate student. It’d be nice to have another income stream.

On the other hand, had others before me not released their work under that wonderful BSD license, I would not be able to do what I do. At least not as easily. Modest Maps? Free. TweenFilterLite? Free. Flare Visualization Toolkit? Free. If I don’t follow suit, does that make me selfish? Yes, it does.

Giving Back to the Community

I’ve heard that phrase, giving back, so many times in both the real-life sense and the digital one, but it never made much sense to me. I mean, I got it, but I never really got it.

Perhaps I never understood it, because I wasn’t using much of the community’s resources nor did I have anything to give back. I have something to give back now. I can help people learn in the same way that others before me have and still do. I’m incredibly thankful to those who maintain these open source projects and still help me out from time to time when there’s really nothing in it for them.

The least I can do is continue to promote this idea of openness and help this small field of data visualization flourish into what it deserves to be. It’s why I blog, and it’s why I should give back, but to what extent?

Making the Case for Open Source Data Visualization

My dilemma brought me back to a Data Evolution post on open source data visualization. It highlighted three things:

Open Tools – As in freely available software tools like R and Processing.
Open Code – How often have you seen a visualization and wondered, “How did they do that? If only the code were available.”
Open Data – Oh so important in data visualization. The core. Open data means more people can try out different methods.

It’s not always possible to attain all three. For example, we pay money for software because the companies would not exist otherwise. It’s a business, and to think that software companies would develop a bunch of free software is unrealistic. Also, oftentimes, data just can’t be shared – usually because of privacy issues. Lastly, open code doesn’t make sense a lot of the time. The DE post grades The New York Times with a D for openness, but they’re a news business, not a visualization repository.

While we can’t always attain all three things, there’s no reason why we can’t strive towards that ideal. As someone I know likes to say – strive for perfection. You might not reach that standard, but you could end up with something close.

Open source is a development method for software that harnesses the power of distributed peer review and transparency of process. The promise of open source is better quality, higher reliability, more flexibility, lower cost, and an end to predatory vendor lock-in.

Tweeting Thoughts

I of course tweeted this in the middle of the night while watching the day’s remaining Olympic events – to release code or not to release code. Here are some of the replies:

@rpj: To release, always! (When legally possible.)

@ehrenc: re: code. You could always release half the code :)

@pims to release code. There’s some brilliant people around that can build on top of what you did. Open world :)

As for me, well, let’s just say you should expect to see tutorials – complete with code – in the coming weeks.

20 Comments

Daniel McLaren — August 29, 2008 at 8:08 am

This has always been a dilemma for me since my fledgling business is built on my source code. Like you, I dread knock-offs and underselling. I hope that someday my business will rely on reputation more than “secret algorithms” and I can safely release the code.

In the meanwhile, portions of it plus code from other projects appear in tutorials on my blog as my way of giving back to a community that has given so much.
Moritz Stefaner — August 29, 2008 at 9:40 am

Nice post, I like the development :)

As you know, I know this dilemma as well. On the one hand, I believe good visualization requires craft, experience and expertise. So, to use an analogy, publishing the code of a particular, crafted piece could be compared to a music band publishing their notes. It might help in imitating, someone might pull one or two tricks out of it, or learn from it, but it will not lead to great music if you are not a great musician. And great musicians would not need the notes either. They just “read” the music and can apply what they heard in their own way of making music. Moreover, generic frameworks should be built by engineers not designers.

That said, I fully agree we all owe the “open” scene so much, that, indeed, it would be a shame not to give something back. In this context, tutorials are a great thing, because they can also transport the process and thinking behind the end result. I will take it to heart, thanks for reminding me.
Andrea — August 29, 2008 at 9:42 am

As a researcher, I risk losing “competitive advantage” by releasing my code and data for scientific analysis workflows (e.g. http://www.myexperiment.org/users/384/workflows). However, I believe there’s more potential benefit than risk in sharing my so-called intellectual property. Open science ideals (as exemplified by sharing data, analysis, and results) are highly congruent with the values of the open source communities that I study, and I can’t help but conclude that the institutionalized incentive systems for academics that make us hesitate to share knowledge are overdue for revision.

I also feel ethically bound to share the work when it is funded by public money like NSF grants, and I fully expect that sharing research artifacts will become a requirement for most publicly funded research. Sharing data is already a requirement for NIH projects; for example, new genome sequences must be deposited within 24 hours of discovery, and depositors get 6 months to use the data before others get access. I believe that papers must also be deposited into an open access repository like PubMed. This system may not be perfect, but I think it’s a step in the right direction.

I share my work on principle; science is supposed to be about truth and knowledge, not hoarding data and hiding tools from others for our own personal benefit, to the potential detriment of the greater community. Perhaps that’s just my youthful idealism; we’ll see how things stand in another 10 years.
Pierce Wetter — August 29, 2008 at 10:21 am

Ha Ha Ha Ha.

In most cases, having someone’s source code is like reading a novel in Russian, if you don’t read Russian.

I seriously doubt that there are hordes of people hunting for data visualization code who are going to see what you do and go “aha! now I can make millions!”.

Much, Much more likely is that someone who needs to use your code will pay for you to help them integrate it. But that won’t happen UNLESS you release your source code.

Consider the source code your resume. There are only a small number of people in the world who are paid to work directly on data visualization. Your only chance of ever being hired to work on that field is if your package becomes famous, and for it to become famous, it has to be freely available.

In other words, what I’m trying to say, is market yourself, not your code.
Nathan Yau — August 29, 2008 at 10:34 am

@Andrea – i’m working towards that ideal :)

@Pierce – i agree, which is why i’m opening up. i’ve never thought there were hordes of people, but it’s that bad 2% that i’ve had the experiences with. I’m coming from the graduate student perspective, and when someone writes a paper on my work, I can’t help but get mad.
Robert Kosara — August 29, 2008 at 11:07 am

I agree with Andrea that it’s not a good sign for academics to be sitting on their code rather than sharing it – we should be required to share our code as a condition of publication. That would speed up work considerably, since we could build on the work of others (with attribution, of course), instead of having to redo it. Also, teaching would be so much easier if we could easily get all those programs and let the students use them, instead of just showing pictures.

There will always be people who rip off your stuff. I’ve had articles from my website copied and reposted somewhere else, sometimes with my name, sometimes without. The same has happened with my pictures on Flickr. Do I care? No. What good would it do me? Half those pages don’t have any kind of contact information, it would take a lot of time and grief to get them taken down. People who have an interest in those topics will eventually find me and my work. And the rest might assume for a moment that random user 4986 on geocities came up with those thoughts and move on – what difference would it make for me to claim ownership?

So giving away means letting go. Stop worrying about those 2%, if they don’t rip off your stuff, they’ll rip off somebody else’s. They’re not worth the trouble.
Nathan Yau — August 29, 2008 at 11:19 am

@Robert – “as a condition of publication” for sure. my main problem has been sharing code before publication with the intention to help and then later finding that a paper was written claiming that work with little modification before my own paper.
Greg J. Smith — August 29, 2008 at 11:41 am

@Pierce Wetter

“market yourself, not your code”. – Well said!
Michal Migurski — August 29, 2008 at 12:07 pm

Excellent post, I think Pierce nailed it with “market yourself”.

There’s a right fit for released code: it ought to help others build on and extend your work, and present a programming interface that’s a natural break point for a separation of concerns. “All I want is a decent hammer.” We released Modest Maps because it felt like something that would help people like yourself create beautiful online maps. We don’t, however, release the source code for all the client projects (Trulia, London Olympics, etc.) we do that use Modest Maps. There’s a number of reasons for this, mostly boiling down to economy and perceived usefulness – would it help anyone to see how we placed beveled buttons over a map? It’s not 2000 and we’re not Praystation, so most likely not.

Open data is a whole other thing. I’m not even sure how to begin addressing it. Almost all of the data we use in our work is driven via public API’s of some variety, so it’s really as “open” as it’s reasonably going to get. The cluster of interlocking ownership and copyright issues makes it truly hard to think about “libre” openness and data value.

Much as open source code has been the front line of concern for the past fifteen years, open data will be the front line for the next fifteen.
Hadley — August 29, 2008 at 2:52 pm

@Nathan: call them out on it. If they’ve published in a journal, contact the editor and complain. If you published online, there’s some chance that your work has been captured by archive.org and you can show that you had prior claim.
Hadley — August 29, 2008 at 2:53 pm

@Robert: I’d love to read the source code of your applications. Where can I find it?
Robert Kosara — August 29, 2008 at 8:28 pm

@Nathan: what Hadley said, if somebody really ripped you off, you can’t just let them do that. The Vis conference has somebody who deals with ethics, and organizations like IEEE/ACM/etc. do, too. Talk to them, they can tell you what your options are.

@Hadley: I knew somebody would call me on it ;) I’m going to release the code for our Parallel Sets program by the end of the year, and I am also working on making other things available. I’ve actually already published the source for my recent Presidential Demographics applet on EagerEyes.org on launchpad.net, I just haven’t linked to it yet (search for my name and you’ll find it). Other things will follow.
Hira — August 29, 2008 at 9:14 pm

Great post! You’ve expressed the dilemma well.

In my experience, the creator is fundamentally more valuable than the creation because no two problems are identical. There will always be those who steal, but there will equally be those who appreciate the talent that your creations reveal. Techniques can be stolen, not talent.
Ted Dunning — August 30, 2008 at 2:27 pm

There are a few times when keeping code closed is a good thing.

But for real people, the balance almost always goes the other way, at least in terms of their personal advantage. For graduate students, that balance is overwhelmingly in favor of openness.

The reason for this is two-fold. Firstly, the value to yourself of almost anything you do is massively outweighed by the value of what you will do. Whether you get a chance to do that future work depends critically on whether others will give you that chance. If you are well known, they will; if not, they may well not. This is a different take on the market yourself commentary of others.

The second reason is just as important. If you can keep your work open, you not only leave it available to others, you keep it available for yourself. If your work is proprietary, there is a good chance that the first closed shop you work for (say, the New York Times) will force you somehow to assign key rights to them. If your work is irretrievably public, then that will be very hard for them to do.

My own case is an excellent example of this. I have one publication that has been cited nearly a thousand times (per citeseer) and I released the software associated with that article publicly. In the years since then, the ability to freely use my own code and then use it again at the next place I work has been absolutely key to my success. In addition, I have benefited in other ways from code I released. Some sample programs that I gave away wound up in a text-book used by an intern I was trying to hire. He later said that was a “final-straw” factor in coming to work with me. His efforts definitely helped us be successful.

So even on a very, very selfish basis, releasing my code was one of the best things I ever did. Your case may differ, but I think that the impact will be even more markedly positive for you because of the nature of what you do.
Pingback: Open Data, Open Visualization and a new blog : business|bytes|genes|molecules
Nathan Yau — August 30, 2008 at 4:52 pm

thanks, everyone for all the really valuable input. i think it’s clear now which direction i should go as far as openness is concerned.
Matthijs — August 31, 2008 at 10:03 am

Maybe you’ll have to figure out what parts have the most value. The code itself? The knowledge how to use and apply the code? Or the authority, reputation and network you build by publishing about the code?

By the way, since when do you have these inline text advertisements? They are highly annoying and make reading the text very difficult. If you need the income, please give me the option to not see them, maybe in exchange for me clicking one or two other ads.
Tim / pims — August 31, 2008 at 6:36 pm

I bet you a beer that by releasing your code, you’ll learn something from the community, which will make you a better developer. It’s guaranteed.

If you plan to share a huge amount of data, provide an API. You get best of both worlds :)
Nathan Yau — August 31, 2008 at 9:20 pm

@Tim – i’ll take that bet. i’m looking forward to losing.
Mike D — September 1, 2008 at 4:51 pm

Nathan – Thanks for the shout-out, and for sharing your own thoughts on the question of “to free or not to free” your code.

As someone contemplating a business model in the data / analytics / visualization space, I’ve struggled on the same point. But ultimately, unless software is at the absolute core of one’s competitive advantage (think Adobe), keeping code private is on the losing side of history. I echo Andrea’s, Ted’s, and others’ sentiments: more is to be gained by sharing it than by stashing it.

I look forward to seeing some of the code under the covers here at FlowingData, and I’m aiming to practice what I’ve preached as well.
