Wednesday, November 10, 2010

Saving Data for Posterity

Research generates data. Sometimes lots of data - the LHC accelerator at CERN generates petabytes even after heavy filtering - and sometimes less. All of that data has to be processed and analysed. Analysis leads to results. Results lead to papers. And papers lead to funding, which leads to money for more research as well as food, shelter and clothing for the researcher1.

But once the paper is published and the project is over the data remains. And not just the data, but also the source code for simulation models or analysis tools, hardware specifications for one-off lab equipment, customized reagents and so on. And keeping that data, source code and other material around is important. We need to keep the data so that other researchers can compare results and replicate the work. We need the data so we can resolve any discrepancies between different labs. We need the data so we can use new, better analysis tools to wring out more information when they become available in the future. Data retention can even become political, as with the climate research data controversy at the University of East Anglia2.

Keeping research data is hard, much harder than it may seem at first glance. Ars Technica is running a really interesting series on the issue of data retention in science; here's part 1 and part 2. But a major problem is only hinted at so far: whose responsibility is it to store data over long time frames - years and decades - and who pays for this?
I built a model in a previous project, ran it against some inputs, and ended up with a journal paper. The model, the inputs and the resulting data come to about, oh, one GB in all. If anybody wants to build on our work, that code and data would be quite useful. A GB isn't that much today; you can comfortably store it on a DVD or memory stick, or put it on a web server somewhere. But whose DVD, and whose web server? In our current project both the model and the analysis are a lot more complicated, and the resulting data sets are larger. Where do we park this data so that it'll still be publicly available in five years? Ten years? Thirty years?

The individual researcher - me and my colleagues - can't really have sole responsibility. Your departmental web pages and file stores are only good for as long as you work there. When you leave you lose your access rights so you can no longer maintain the data. I've switched projects and workplaces often enough that I don't even bother setting up a local web page.

We bring our data and code with us when we move, of course - I have my own copy of everything important - but that's not very reliable over the long term. I don't have the discipline or the will to pay for decades of off-site backups of multiple gigabytes of simulation data that nobody is ever likely to ask for, so I have only a local backup. One house fire or earthquake could wipe it all out. When people leave research altogether they may no longer want to keep all that old data around, and when we die our relatives are very unlikely to keep any data or personal online repositories any longer. Besides, even if researchers keep their data, actually tracking them down years or decades after publication can be very difficult.

Putting responsibility on the department or lab is no better. Active researchers can keep their project data around, but web pages and data sets from old projects by long-gone people tend to disappear over time, since nobody has an active interest in them. Lab-level servers are usually maintained by a busy graduate student or faculty member and are none too reliable as a result. Servers die or move, backups fail or get misplaced, and when nobody feels ownership over that old data it just never gets restored or propagated properly over time. Entire departments and labs can disappear, taking any local data with them into oblivion.

Universities can (and should, perhaps) set up a central data repository. They have research libraries and often already have a publication repository, so they have the know-how. But there are a number of obstacles to making a university-wide repository work.

One major problem is getting researchers to actually use such a repository. Our universities already have publication repositories and other such services, and unfortunately most researchers simply don't use them. It takes time and it takes effort, and unless you have a way to compel people it'll go unused. That means the repository must be mandatory, easy to use and completely data-agnostic, and resources (money and manpower) must be available to help, prod and force researchers to use it properly.

It also takes a good deal of money, money that needs to come from somewhere. Using some of the cut universities take from research grants3 is an obvious solution, but that money is already spoken for, so you'd have some ugly fights ahead of you to make that happen. And any university with a large collection of potentially valuable research data will be sorely tempted to lock it down and exploit the money-making opportunities rather than make it open and accessible, making this a non-solution to the original problem. Also, published research happens in places other than universities, so this would not solve the data retention problem in general.

The public funding agencies would be a natural home for a data repository. They are paying for the data in the first place, after all, and already have the clout to specify every aspect of a project. They typically require extensive reports and documentation at the end of a project; it would be just another small step to require a documented data and source code dump as well. Funding agencies typically don't have a direct commercial interest in the results and don't have the same incentives as universities to keep the data to themselves. Of course, not all research is funded this way, so again, it would not be a general solution.

Lastly, the journals themselves should do this. They are the ones publishing the papers based on the data, after all, and have an interest in making the data available for analysis and scrutiny. Many journals already charge researchers high fees for publishing their papers4, and they already have a content repository system for all their published papers and supplementary materials. They could add a data dump to the system and require authors to submit relevant data sets and source code as a precondition for publication. Data and source code would be a legitimate use of supplementary material - more so than the annoying trend among some journals of having the printed paper be only an introduction, with most actual details in the supplementary online section (a topic best left for another post).

So, the journals and the funding agencies should be the primary stakeholders for research data retention. That doesn't let the rest of us off the hook, though; I feel we still have a personal obligation to make data, source and papers available to others as best we can, within the legal and other limits that apply. You can't disclose raw data that would identify patients or experimental subjects, for instance, and you have no obligation to send off a sample of a bacterial strain to somebody who doesn't have the facilities to make use of it. But the default assumption should be one of openness - we should keep and disclose our data unless we have specific reasons not to do so. Not the other way around.

#1 Never underestimate basic necessities and security as motivation for people. There are those who seem to think science should be a calling; a noble intellectual pursuit for truth and the betterment of humanity far removed from any grubby considerations of money. Those people are either not scientists themselves, have tenure, or are independently wealthy. The rest of us have a perfectly sensible interest in seeing that we and our loved ones have food to eat, a home to eat it in and savings to fall back on once we're too old to get funded any longer.

#2 Just to make it crystal clear: the controversy is purely political. There was no fraud or deception, just some sloppy data retention and private emails never meant for a wide audience.

And there really is no doubt any longer that man-made climate change is very real; any scientific debate is about the magnitude, the mechanisms and the nature and distribution of effects over time. If you reject climate change for political reasons, then sorry - reality doesn't care about what we want to be true. Holding your ears and shouting "lalalalalala" won't make it go away.

#3 As little as 35% or as much as 50% of the awarded grant money. It goes to pay for lab and office space, secretarial and administrative services and so on. Data storage and retention would fit right in. However, at most universities the external research funds are an extra income; the cut they take is larger than needed to pay for the research-related costs - you pay this whether you actually use any services or not - and so research actually subsidizes education. Cutting research would, for many universities, mean losing money, not saving it.

#4 Yes, you pay the journal to publish your work. You volunteer to review submitted papers for journals, without pay. If you're a high-level researcher you may get invited to serve as editor - gratis - for an issue of the journal. Then your university library pays dearly for a yearly subscription so that other people can read about the research you've done. Journal publishing is a serious moneymaker.
