by Agnieszka Gautier, NSIDC
A graphic illustrating the code behind big data, which is revolutionizing the way information is stored and discovered. Credit: infocux Technologies, Mexico
Anne Thessen, an oceanographer by training, sits in her home office surrounded by boxes and boxes full of binders with data on paper. “To be honest,” she says, “I don’t have the space anymore to keep all this paper, which is sad because I’d like to keep it.”
“In terms of preservation,” says Thessen, “you can’t beat paper.” But paper is hard to share with others, and digitizing data takes time. Data sheets have to be scanned. Read-me files must be written to explain abbreviations and methods used. It is slow going, but necessary if others are to find your work. Thessen says, “One of the more frustrating parts of research is knowing that a data set is out there, but not being able to access it because someone will not share.”
Dark data doomsday
Then again, there may be a reason for the disenchantment with data management, especially among researchers. Too many have been exposed to the dark data hole. According to Bryan Heidorn, author of “Shedding Light on the Dark Data in the Long Tail of Science,” dark data is data that is not carefully indexed and stored, so it becomes nearly invisible to scientists and other potential users. Underutilized, this data is eventually lost.
“Depositing data is a burden,” Thessen says. “Many scientists see it as a chore and not as something that is the right thing to do because it makes science better.” Often, funding agencies mandate that a particular repository be used. The information gets put in out of obligation, with little thought given to retrieval; the repository becomes a sinkhole. There are also too many repositories, and many don’t catch on. Information becomes disconnected and difficult to find.
For much of science, finding data is key. “If I want a pH measurement from the Gulf of Mexico at some point in time, I’m sure that the measurement exists,” Thessen says. “Unfortunately, it’s sitting on a piece of paper somewhere or a three-inch floppy disk in a desk drawer and I can’t locate it.” That traps researchers in a cycle of collecting the same measurements over and over.
The digital dilemma
In comes the digital age. It offers scientists the ability to share data and to eliminate redundant research, making science more efficient. But will this data be usable in fifty years? “I think the Data Conservancy has really taken a very serious look at the issue of preservation,” Thessen says. The Data Conservancy is asking the tough questions.
Is it enough to just put the information on an external hard drive or another server? “I don’t think it’s that simple,” Thessen adds.
Data preservation should be about reuse, and really about smart reuse. It is not enough to just store data in a repository. Information needs to be categorized. Beyond file formatting issues, there is just the simple issue of language itself. “People use different terms for the same thing, and the same term for different things,” Thessen says. You need software smart enough to sort it all out.
Anne Thessen has not yet used the Data Conservancy’s new software, but she was there in the beginning. “We envisioned a system where a file would be uploaded and key metadata could be automatically extracted.” This would spare researchers the administrative chore of cataloging their research, perhaps enticing them to share more.
The Data Conservancy may not be there yet, but the vision is there. Unlike institutional or disciplinary repositories, it has been specifically architected to archive data. It allows disciplines to cross, making discovery more rounded and integrated, so researchers learn about studies outside their particular field that may still be relevant. It opens up scientific conversations. It truly shares.
Thessen has high hopes for the Data Conservancy’s role in data preservation. “Once I give the Data Conservancy my data,” Thessen says, “I don’t have to worry about preserving it.” She can finally get rid of the boxes of paper in her office. She can feel assured that someone at some point in the future will be able to look at her data and use it.