Thursday, February 11, 2010

Release, or Not Release (the code)


It's Constitution Day, and everybody has the day off. Which in practical terms of course just means I've spent the day working from home; tide and deadlines wait for no man. While sitting in my study/storage room/garbage dump, I found a good blog post about a problem with software in science. Science geekery ahead; feel free to skip it.

Science basically depends on being open about what you do. You don't just publish your final results, but you also describe your methods and the reasoning that brought you there. That way other researchers can evaluate your claims and determine if your methods and analysis are sound; they can redo the same experiment or analyze the same model you did, or do a similar one and see how the result differs; they can analyze your data with different methods to see if there's more to learn from it, and so on.

However, the last couple of decades have seen two big changes. First, we use computers for everything nowadays. "Analyzing data" is practically always done using tools like R or Matlab, or using code you or people in your lab have written yourselves. Experiments - all kinds, not just robotics or the like - are usually at least partially automated, with data collection and protocols controlled by some combination of off-the-shelf and local code. Computational models used to be rather simple, if only because deriving results wasn't possible with a complex model, but today models can be very large, very complex, and implemented by an equally large, complex stack of code.

The second big change is that the amount of data has exploded. We can grab huge amounts of data today, things that simply weren't possible a generation ago. Think of the possibilities of DNA analysis. Then think of the amount of data it represents. It's the same in all fields. The Large Hadron Collider alone will generate about 15 petabytes of data per year (#1) when fully operational. This process is a two-way street of course; the spread of computers has made the collection and analysis of huge datasets possible, and the enormous increase in data has been a major driver for the spread of software tools in science.

This is good, of course; we're getting a lot more done than we could ever have before, and we have entire avenues of research that would simply not have been possible to pursue without these tools. However, it means that a lot of research results today depend on massive data collections or complex software. The traditional way of publishing your results, with everything summarized in a journal paper of a dozen pages or so, is no longer sufficient.

Researchers already share data when they can; if you have a legitimate request for published data (you're a researcher working in a related field, say) then more often than not they'll send it to you. That doesn't always mean it'll show up quickly, though, or at all.

First, the group may have spent a lot of time, money and effort in generating their data, and they'll rightly want to mine it for all the results they can before giving it to others. Sometimes it's not their data to give away: it may be data from a different group in turn, or confidential company data, and you have to turn to the original owners. Sometimes there are legal restrictions: human medical studies rightly have very strict confidentiality for their participants, so data can only be used for the specific purpose for which it was collected and cannot be shared or passed on to others. Making medical data sharing mandatory would render any human studies impossible. But overall, data is already being shared as needed.

However, as data sets become larger and more complex, the data itself is no longer enough. When the data analysis and modelling can no longer be done by hand you need the accompanying software in order to make sense of it.

We normally describe our software functionality in condensed form - as equations - in our papers, but the software implementation can - and does - have bugs and issues that can affect the final result. And of course, sometimes a model may simply be too complex to summarize fully in a paper. More and more often you really need to release the software along with the data in order to work with the results of a project.

Of course, you don't need to release standard pieces of software; you can just tell people what version and parameters you used, like you would with any lab equipment. In papers describing human and animal experiments you'll often see equipment listed very precisely: "...colliculus were cut by using a Microslicer (DTK2000; Dosaka EM, Kyoto, Japan) and then incubated in...", or "Training sessions [...] were conducted in a Plexiglas chamber enclosed in a sound attenuating box (Med Associates, St. Albans, VT) provided with ...". Standard software is sometimes already listed in the same way.
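For locally run analysis, "what version and parameters you used" can be captured automatically rather than from memory. The following is only a minimal sketch in Python, not a recommended standard: the function name `describe_environment`, the package names and the parameter names are all hypothetical and chosen purely for illustration.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment(packages, parameters):
    """Collect interpreter, OS, package versions and analysis parameters.

    `packages` is a list of installed distribution names to look up;
    `parameters` is a mapping of whatever settings the analysis used.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence instead of failing, so the report is complete.
            versions[name] = "not installed"
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
        "parameters": dict(parameters),
    }


if __name__ == "__main__":
    # Hypothetical package names and parameters, for illustration only.
    record = describe_environment(
        packages=["numpy", "scipy"],
        parameters={"filter_cutoff_hz": 30.0, "window_ms": 250},
    )
    print(json.dumps(record, indent=2, sort_keys=True))
```

The resulting record could be saved next to the results and quoted in the methods section, in the same spirit as the equipment listings above.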

Custom software created locally is sometimes released, but often it's not. This is usually not due to any wish for secrecy. Rather, it's because most custom software is deeply uninteresting. It's very specialized, sometimes written just to analyze one specific experiment or implement one model, and other researchers are generally interested in the overall functionality as described in publications. The interest from outsiders in using or viewing the actual code has generally hovered right around zero. Releasing software can take quite a bit of time and effort: you may need to clean up the code, write usage instructions, get permission from your funding agency, have it vetted for licensed code and for potential patents, and get your collaborators and your university legal department to give you a go-ahead. So people generally don't bother. If it became a general requirement few people would object, I believe; as with data, I've found that people generally will send you code if you ask nicely and promise not to spread it further.

One question is, where do you draw the line for code that should be released? You don't need to release the source for Photoshop, or MS Word, or LaTeX or vim just because you used them to prepare your manuscript. Code you write yourself for data analysis should normally be released. But there are large gray areas where there is as yet no consensus on best practices. For instance:

  • You use a spreadsheet to do some calculations. Can you just write the expressions in your paper, or do you need to make the spreadsheet file available? Does it matter whether you could have done it on paper or using a desk calculator?

  • You use Mathematica or Maxima to derive some equations rather than doing it by hand. Can you give the derivation only, just as if you did it manually, or do you need to save your worksheet and make it available? Do you need to make the tool code available (easy for Maxima, not so for Mathematica)?

  • You use some hideously expensive, closed Matlab toolbox for your data analysis. Your analysis depends heavily on a couple of functions in that toolbox (your own code might be little more than a wrapper to get your data into the toolbox). You make the code available, but is your code really published when you can't give the source to the toolbox code?

    For normal commercial code we'd accept just giving the name and version since people generally can get hold of it, but does that extend to things like a specialist toolbox most users simply can't afford, and would never buy just to check somebody else's result even if they could? What if the software was under export restrictions that exclude many researchers around the world?

  • Your code, now several years old, depends on a closed piece of complex software (that expensive Matlab toolbox for instance) that is no longer available. Your lab has a copy and still uses it, but for all practical purposes your analysis tools can no longer be used by other researchers and your analysis can no longer be fully replicated. Are you now barred from using your own trusted, well-established analysis tools for new data? Does this prevent your previously published data from being used or referred to in future projects?

It's worth noting that some of these issues are solvable simply by avoiding closed software and using open source whenever possible. And quite often it is possible; R, for instance, has arguably become the most widely used statistical analysis system today. Some other areas are only partially covered, unfortunately, so these questions are still relevant.

Also, while publishing our tools is important, the practical impact on research quality will be fairly small. Most one-shot software tools people make for themselves just aren't all that complex, and they tend to get fairly well tested during the course of the project. They may have plenty of glitches and odd corner cases, but serious errors that materially affect the result are likely rare. In an experimental situation you have so many other sources of error, after all, that glitchy software doesn't make much overall difference. And because of experimental error, people generally don't blindly trust the results from only one group and one experiment, so errors will get caught when other groups don't get results that match.

Which brings me to a defence of the one-shot, locally written analysis software: when everybody writes their own software, errors will be diluted. It is extremely unlikely that every group in a field makes the same programming mistake, leading to the same software error. The variety of tools makes sure that errors get filtered out and caught over time. If everybody uses the same analysis tool, one subtle error could taint everyone's results. Also, we learn by doing, and writing your own software makes you understand the analysis at a deeper level than simply using a canned piece of code.
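The dilution argument can be put in rough numbers. This is only a back-of-the-envelope sketch in Python under a strong assumption that is mine, not something established above: each group makes a given mistake independently and with the same probability. The figures used (a 10% chance per group, five groups) are made up for illustration.

```python
def prob_all_groups_affected(p_bug, n_groups):
    """Chance that every one of n independent groups has the same bug.

    Assumes each group independently makes the mistake with probability
    p_bug; real groups share textbooks and habits, so this is optimistic.
    """
    if not 0.0 <= p_bug <= 1.0:
        raise ValueError("p_bug must be a probability between 0 and 1")
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    return p_bug ** n_groups


def prob_all_affected_shared_tool(p_bug):
    """With one shared tool, a single bug reaches every group at once."""
    return p_bug


if __name__ == "__main__":
    p = 0.1  # hypothetical chance that one implementation has the bug
    n = 5    # hypothetical number of groups in the field
    print("independent tools:", prob_all_groups_affected(p, n))
    print("one shared tool:  ", prob_all_affected_shared_tool(p))
```

Under these assumptions the chance that the whole field is wrong in the same way shrinks geometrically with the number of independent implementations, while with a single shared tool it stays at whatever the chance of a bug in that one tool is.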

#1 "15 petabytes" = "A huge freaking amount of information". According to one estimate, the total amount of printed information in the world, ever, is on the order of 200 petabytes. At 15 petabytes per year, the LHC alone will generate that amount of data in a little over a decade (200 / 15 is roughly 13 years).
