Tim DiLauro, Johns Hopkins University
(from an abstract submitted to the 2013 AGU Conference)
A modern data archive must support a variety of functions and services for a broad set of stakeholders over a variety of content. Data producers need to deposit this content; data consumers need to find and access it; journal publishers need to link to and from it; funders need to ensure that it is protected and its value enhanced; research institutions need to track it; and the archive itself needs to manage and preserve it. But no single information model is optimal for all of these tasks. The attributes needed to manage format transformations for long-term preservation are different from, for example, those needed to understand provenance relationships among the various entities modeled in the archive. Exposing all possible properties to every function burdens users and makes it difficult to maintain a separation of concerns among the functional components. The Data Conservancy Software (DCS) manages these overlapping information needs by defining strict interfaces between components and providing mappers between the layers of the architecture. Still, work remains to make deposit more intuitive. Currently, depositing content into a DCS instance requires one of three things: very simple objects (e.g., one file equals one data item), significant manual effort, or detailed knowledge of DCS-internal data model serializations. And depositing that same content into another type of archive would mean repeating this effort.
To allow data producers and consumers to interact with data in a more natural manner, the Data Conservancy is developing a packaging approach that eases this burden by layering a semantic overlay atop the more familiar directory/folder and file metaphor. The standards-based packaging scheme augments the payload and validation capabilities of BagIt with the relationship and resource description capabilities of the Open Archives Initiative (OAI) Object Reuse and Exchange (ORE) model. In the absence of the ORE resource description, the DCS instance will be able to provide default mappings for the directories and files within the package payload and support the deposited content at a lower level of service. Internally, the DCS will map these hybrid package serializations to its own internal business objects and their properties. This approach is therefore highly extensible, as other packaging formats could be mapped in a similar manner.
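The hybrid package described above can be illustrated with a minimal sketch. It writes a BagIt-style bag (a `bagit.txt` declaration, a `data/` payload directory, and a SHA-256 payload manifest) and, optionally, a resource description in a separate tag file. Several details here are assumptions made for illustration only and are not drawn from the DCS: the tag file name `ore-rem.json`, the JSON layout of the description, and the default type labels `Collection` and `File`. An actual ORE resource map would use one of the ORE serializations rather than this ad hoc JSON.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Optional

OVERLAY_NAME = "ore-rem.json"  # hypothetical tag file name, not a DCS convention


def write_bag(bag_dir: Path, payload: dict, overlay: Optional[dict] = None) -> None:
    """Write a minimal BagIt-style bag; `payload` maps relative paths to bytes."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    for rel_path, content in payload.items():
        target = data_dir / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)

    # Bag declaration required by BagIt.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <path>" line per payload file.
    lines = [
        f"{hashlib.sha256(p.read_bytes()).hexdigest()}  {p.relative_to(bag_dir).as_posix()}"
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    ]
    (bag_dir / "manifest-sha256.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")

    # Optional semantic overlay, stored outside the payload directory.
    if overlay is not None:
        (bag_dir / OVERLAY_NAME).write_text(json.dumps(overlay, indent=2), encoding="utf-8")


def describe(bag_dir: Path) -> dict:
    """Return the overlay if present; otherwise fall back to default mappings."""
    overlay_file = bag_dir / OVERLAY_NAME
    if overlay_file.exists():
        return json.loads(overlay_file.read_text(encoding="utf-8"))
    # Default mapping (illustrative labels): directories become collections,
    # files become file-level objects.
    return {
        p.relative_to(bag_dir).as_posix(): "Collection" if p.is_dir() else "File"
        for p in sorted((bag_dir / "data").rglob("*"))
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "bag"
        write_bag(bag, {"site-a/readings.csv": b"t,temp\n0,20.5\n"})
        print(describe(bag))
```

Without an overlay, `describe` derives a description from the directory structure alone, which corresponds to the lower level of service mentioned above; with an overlay, the producer's own description takes precedence.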
In addition, this scheme supports establishing the fixity of the payload while still allowing updates to the semantic overlay data. A data producer with scarce resources, or an archivist who acquires a researcher’s data, can therefore package the data for deposit now and augment the resource description in the future.
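The separation between payload fixity and an updatable overlay can be sketched as follows. The sketch assumes that the overlay lives in a tag file outside `data/` and that fixity is recorded only for payload files, which BagIt permits because tag manifests are optional. The file name `ore-rem.json`, the JSON keys, and the helper names are hypothetical and chosen for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path

OVERLAY_NAME = "ore-rem.json"  # hypothetical tag file name, not a DCS convention


def payload_digests(bag_dir: Path) -> dict:
    """Compute {relative path: SHA-256 digest} for every file under data/."""
    data_dir = bag_dir / "data"
    return {
        p.relative_to(bag_dir).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def write_overlay(bag_dir: Path, description: dict) -> None:
    """Replace the semantic overlay without touching the payload."""
    (bag_dir / OVERLAY_NAME).write_text(
        json.dumps(description, indent=2), encoding="utf-8"
    )


def demo() -> tuple:
    """Return (payload intact after overlay update, payload intact after tampering)."""
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "bag"
        (bag / "data").mkdir(parents=True)
        readings = bag / "data" / "readings.csv"
        readings.write_bytes(b"t,temp\n0,20.5\n")

        # Fixity is established once, at packaging time.
        recorded = payload_digests(bag)

        # The description starts sparse and is augmented later.
        write_overlay(bag, {"ore:aggregates": ["data/readings.csv"]})
        write_overlay(
            bag,
            {
                "ore:aggregates": ["data/readings.csv"],
                "dcterms:title": "Site A temperature readings",
            },
        )
        intact_after_overlay_update = payload_digests(bag) == recorded

        # Changing the payload itself is detected by the same check.
        readings.write_bytes(b"t,temp\n0,99.9\n")
        intact_after_payload_change = payload_digests(bag) == recorded

        return intact_after_overlay_update, intact_after_payload_change


if __name__ == "__main__":
    print(demo())
```

Because the recorded digests cover only the payload, the description can be revised any number of times without invalidating the fixity information, while any change to the data files themselves is still caught.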
The Data Conservancy is partnering with the Sustainable Environment Actionable Data project to test the interoperability of this new packaging mechanism.