JHU Data Archive Archival Process

By Betsy Gunia, JHU Data Management Consultant

The Johns Hopkins University Data Management Services (JHU DMS), offered by the Sheridan Libraries, helps researchers develop data management plans and both preserve and share research data. We provide research data archiving through an instance of the Data Conservancy (DC) stack, called the JHU Data Archive. Johns Hopkins University researchers can elect to archive their research data into the JHU Data Archive by including us in the budget of their grant.

Our archiving service, newly launched in October of 2011, has recently reached its first milestone of engaging with researchers in the archival process. Dave Fearon and I, data management consultants for JHU DMS, are currently working with two researchers, both within the field of biomolecular engineering, as an exploratory opportunity to hone our archiving process. This fall, we will deposit the data associated with their recent publications into the JHU Data Archive. This project differs from our anticipated model for grant-funded projects, for which the archival process will begin at the early stages of the research project.

Our general approach is to first understand how research data is created and how it “flows” throughout each study (e.g., what process led to this particular data, which in turn led to this other data). By carefully reading each publication, we diagram this data flow and use it as the basis for discussion with the researchers to gather and organize the specific folders and files associated with the data. As we acquire the data files, we discuss with the researchers existing metadata, file naming conventions, required software for rendering files, and which data should be archived.

One of the first issues that we have come across is inconsistency in how both folders and files are named, even for data generated using the same technique or instrument. Part of the problem is that collaborators often do not discuss appropriate naming conventions prior to data collection, so each researcher devises an individual system. Because we are working with data that has already been collected, we do not have the opportunity to provide guidance on naming conventions. Instead, we have to work with the researchers to document the particular conventions used and ask them how important it is to retain those conventions.

The Data Conservancy system is flexible in how data can be arranged, which requires us to consider how best to organize and package the data for future discoverability and use. Does it make sense to organize the data by method, data type, individual publication, or by results? As this is an opportunity for us to hone our process, we explored which lens worked best to assemble data from the researcher. We started out gathering their data by method, but then switched to an organization based on results, in particular figures and tables, and found it was easier for the researcher to both organize and describe their data to us through this lens. Because we haven’t finished gathering the data, we haven’t firmed up our decisions on how best to arrange the data in the JHU Data Archive. Instead, we are documenting the data so these decisions can be made without having to return to the researchers for more information. It is very important that we be as efficient as possible with our researchers’ time.

This leads to a final issue, and one that everyone can relate to: time constraints. Although our archiving service is not intended to provide in-depth curation of data, our level of involvement does require a significant amount of time from both parties. Between familiarizing ourselves with their data and publications, conducting meetings, and examining the research data that will be deposited into the JHU Data Archive, Dave and I conservatively estimate that we have spent a combined total of 80 hours on the two researchers (~40 hrs/researcher), which does not include the archiving that will take place this fall. The good news is that as we learn more about each investigator’s research, we can accomplish more during each subsequent meeting. In addition to expanding our knowledge of their research, we have also experimented with ways to make the meetings more efficient. For example, as mentioned above, we create diagrams based on how data is produced, and we have since converted these from a purely visual document into a table. The first column lists the data products diagrammed in the flow chart, with one row per product, and in the subsequent columns we record pertinent information about the associated data files. This information includes format, instrument, associated metadata, naming conventions, whether the data should be archived, and any notes that we take while meeting with the researcher. We found that when we sent this table to our researchers, they often took time prior to a meeting to organize their data and even fill in the table, resulting in a more productive meeting.
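For readers who want to try a similar table, here is a minimal sketch of how a blank version could be generated as a CSV file to send ahead of a meeting. This is not the tool we used; the column labels, the function name `build_inventory_template`, and the example data products are all hypothetical and simply mirror the fields listed above.

```python
import csv
import io

# Column labels are assumptions that mirror the fields described in the post.
COLUMNS = [
    "data_product",
    "format",
    "instrument",
    "associated_metadata",
    "naming_convention",
    "archive",
    "notes",
]


def build_inventory_template(data_products):
    """Return CSV text with one row per data product and blank cells to fill in."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for product in data_products:
        # Start every row empty except for the data product name.
        row = {column: "" for column in COLUMNS}
        row["data_product"] = product
        writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical data products, named after figures and tables in a publication.
    print(build_inventory_template(["Figure 1 raw images", "Table 2 measurements"]))
```

The resulting file opens in any spreadsheet program, so a researcher can fill in the blank cells before the meeting without needing special software.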

Our first foray into the archival process has afforded us an opportunity to both hone our workflow process and reflect on ways to improve it. This experience impressed upon us how important it will be for us to work closely with researchers early on in the research process so that the archival process will run smoothly. We look forward to further refining our process as we deposit this data into the JHU Data Archive this fall and continue working with additional scientists in other domains.
