Big Biological Data Demands Value

By David Shorthouse, Marine Biological Laboratory

Biology is an exceptionally diverse science whose pursuit spans timeframes from nanoseconds to millennia, operates on scales from nanometers to thousands of kilometers, describes all life from minute, aeolian fauna that plunge down to mountain tops to bizarre chemotrophs in the deepest ocean trenches, and is communicated in all languages on Earth. There is an immense amount of data and, as aptly pointed out by Forbes’ Eric Savitz, it would be foolish to invest solely in the infrastructure to store and archive more and more data. Wading through these data without the capacity to discover relevancy is “an impossible wild goose chase in which everyone is blindfolded”.

There are few standard points of reference that traverse Life Sciences. As a consequence, practitioners often use unstructured narratives to provide their audience with essential context. Scientific names are additionally used as universal moorings to circumvent the fragmentation of the discipline and the vagaries of language, and to integrate disparate sources of data (Patterson et al., 2010). Although scientific names could be used as a lattice to crisscross the entirety of Life Sciences and provide near real-time indexing value, they are mutable; they are tags that envelop a body of evidence and are best understood as snapshot theories about the organization of life on Earth. Although names are governed by rules that impose stability, relatively new techniques such as DNA sequencing uncover additional evidence for how best to delineate species. The danger of this relentless regrouping and subsequent renaming of species is the potential for obfuscating data that were once attached to older, alternative scientific names. Thus, if we expect scientific names to permit deep, rapid dives into data warehouses, then we must also acknowledge and accommodate the fluidity of scientific names. At the very least, a scientific name index needs to be periodically refreshed to mitigate discovery rot, which is just as damaging to the return on investment for a data warehouse as is bit rot.

The Data Conservancy team at the Marine Biological Laboratory is charged with developing a sound yet flexible indexing architecture for scientific names, capable of scaling to the entirety of biology, that will help organize and make discoverable any biologically meaningful data in an archive. Our team consists of David J. Patterson (PI), Anne Thessen, Dmitry Mozzherin, and David Shorthouse. We have four main areas of activity to tackle this challenge: discovery, resolution, engagement and communication.

Names Discovery and Parsing

The first step in the process is to provide tools and services that construct a simple index of scientific names held within an archive. Thanks to prior work done by others at the Marine Biological Laboratory, we developed a simple command-line based wrapper that locates verbatim scientific names in freeform text. We expose this work as a web page and service, and a Google Chrome extension. This service extracts scientific names from PDFs, images, Microsoft Office-based documents such as Word and Excel, and automatically executes optical character recognition when required. The Google Chrome extension highlights scientific names on any web page and provides a facility to copy these for pasting elsewhere. Within the next few weeks, we will be re-indexing the Biodiversity Heritage Library for scientific names. Throughout the process we will be gathering metrics that will result in requirements for improving the algorithms for increased accuracy, recall and speed. We expect to have a parallelized solution that will scale to rapidly index millions of documents.

In addition to finding scientific names in freeform text or in data sets, we also have a well-tested scientific name parser that accurately decomposes a name into its components. This tool has pervasive uses, such as resolving lists of names whether or not the names within them carry authority information (e.g. Homo sapiens vs. Homo sapiens Linnaeus, 1758 vs. Homo sapiens L.).
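To make the idea of decomposing a name concrete, here is a minimal sketch in Python. It is not our parser: the names `ParsedName` and `parse_name` are hypothetical, and the single regular expression assumes a plain two-part name optionally followed by an author string and a four-digit year. Real names (subgenera, infraspecific ranks, hybrids, parenthesized authorities) need far more grammar than this.

```python
import re
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    """Components of a simple two-part scientific name (illustrative only)."""
    genus: str
    species: str
    authorship: Optional[str]
    year: Optional[str]

    @property
    def canonical(self) -> str:
        # The name stripped of authority information.
        return f"{self.genus} {self.species}"


# Assumed pattern: "Genus species [Author][, Year]". Hypothetical and simplified.
_NAME = re.compile(
    r"^\s*(?P<genus>[A-Z][a-z]+)\s+(?P<species>[a-z]+)"
    r"(?:\s+(?P<authorship>[^,\d]+?))?"
    r"(?:,?\s*(?P<year>\d{4}))?\s*$"
)


def parse_name(verbatim: str) -> Optional[ParsedName]:
    """Split a verbatim name string into components, or return None."""
    match = _NAME.match(verbatim)
    if match is None:
        return None
    return ParsedName(
        genus=match.group("genus"),
        species=match.group("species"),
        authorship=match.group("authorship"),
        year=match.group("year"),
    )


if __name__ == "__main__":
    for text in ("Homo sapiens", "Homo sapiens Linnaeus, 1758", "Homo sapiens L."):
        print(parse_name(text))
```

Under these assumptions, all three spellings of the example share the canonical form "Homo sapiens", which is what lets a list with authority information be compared against one without.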

Names Resolution

Mining data for verbatim scientific names is the first step in intelligently organizing biological data. The next step is to resolve or cross-reference names against alternate lists of names. This is conceptually equivalent to discovering areas of overlap using geospatial queries, but without the luxury of a coordinate system. Although there are emerging standards to help names list providers organize and transport their scientific names packages, there remain substantial idiosyncrasies. Hierarchies and synonymic representations vary across data sets and between the major Kingdoms of life. Notably lacking is an identifier system for scientific names. Nonetheless, we have been developing a resolver that accepts lists of scientific names, algorithmically accommodates their context (e.g. understood to contain only names for snout-nosed beetles), misspellings and lexical variations (e.g. with or without authority information), and returns, for each submitted name, scored matches to equivalent names in a list of interest. Most critically, the response associates optionally provided identifiers with those in the alternate list. Thus, users will be better able to integrate disparate sources of data without having to massage names within their data files.
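The shape of such a request and response can be sketched in a few lines of Python. This is an illustration under loud assumptions, not our resolver: the function names `canonical` and `resolve`, the identifiers, and the 0.85 threshold are all hypothetical, authority information is dropped by simply keeping the first two words, and the score is the generic similarity ratio from Python's standard `difflib` module rather than anything taxonomically aware.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def canonical(name: str) -> str:
    """Crude normalization (assumption): keep the first two words, lowercased."""
    return " ".join(name.split()[:2]).lower()


def resolve(
    queries: Dict[str, str],
    reference: Dict[str, str],
    threshold: float = 0.85,
) -> Dict[str, List[Tuple[float, str, str]]]:
    """Map each query identifier to scored matches in the reference list.

    Each match is (score, reference identifier, reference name), best first.
    """
    ref_canonical = {ref_id: canonical(name) for ref_id, name in reference.items()}
    results: Dict[str, List[Tuple[float, str, str]]] = {}
    for query_id, query_name in queries.items():
        query_canonical = canonical(query_name)
        matches = []
        for ref_id, ref_name in ref_canonical.items():
            # Generic string similarity in [0, 1]; tolerates small misspellings.
            score = SequenceMatcher(None, query_canonical, ref_name).ratio()
            if score >= threshold:
                matches.append((round(score, 3), ref_id, reference[ref_id]))
        results[query_id] = sorted(matches, reverse=True)
    return results


if __name__ == "__main__":
    # Hypothetical identifiers on both sides.
    my_names = {"local:1": "Homo sapeins L.", "local:2": "Curculio nucum"}
    checklist = {
        "ref:1": "Homo sapiens Linnaeus, 1758",
        "ref:2": "Curculio nucum Linnaeus, 1758",
        "ref:3": "Pan troglodytes",
    }
    for query_id, matches in resolve(my_names, checklist).items():
        print(query_id, matches)
```

The point of the sketch is the pairing of identifiers: the misspelled "Homo sapeins L." under the user's own identifier is linked to the checklist's identifier for Homo sapiens, so the user's data file never has to be edited.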

Engagement

Data integration using names discovery and resolution services must have assurances that background lists are thorough and accurately reflect current taxonomic opinion. In other words, authoritative lists upon which resolution services will be executed must have the capacity to be rebuilt, refreshed, and adjusted. Without a universally accepted scientific name registration facility (though these are coming online), the only recourse is to have a delightful names editing interface that gives experts the power to rapidly shuffle, append, rename, and reorganize names within hierarchical structures. We developed and continue to enhance such a tool. It is a fully collaborative environment where users may simultaneously work within branches of life and chat with one another to resolve conflicting opinions or misunderstandings. At any point, they may publish their hierarchical structure, which then becomes available to the resolver service. There are currently 135 users and 133 working classifications being developed. The largest of these trees has 1.07M names organized into a defensible hierarchy.

Communication

We make our code available on GitHub. Likewise, we provided Data Conservancy’s Infrastructure Research and Development (IRD) team with documentation and example code for integrating scientific names discovery into its Feature Extraction Framework. We will also be working with the IRD team to help develop the facility to periodically refresh and resolve their names indices.

We have also reviewed technical and sociological issues facing the Life Sciences and conclude that while the former are approaching resolution, the latter are significantly more challenging to address (Thessen & Patterson, 2011). There are deeply entrenched data cultures within the Life Sciences that require transformation before data can be indexed for scientific names.

Next Steps

Our tools and services are tantalizingly close to permitting the construction of a robust, universal index for the Life Sciences using scientific names. Name discovery and resolution are maturing and are being used by other NSF-funded projects such as iPlant Collaborative. Our efforts will soon concentrate on taxonomic resolution, a far more challenging problem whose solution requires deep knowledge of synonymic relationships among names. In order to provide a scalable solution, we endeavor to engage more partners and to enhance our names editing environment to best capture these complex, semantic relationships. We expect to emerge from this work with the tools and services for data archivists to create and refresh their scientific names-based indices. This means that users will no longer have to wade through Life Sciences data archives as though on wild goose chases.

References

Patterson, D.J., J. Cooper, P.M. Kirk, R.L. Pyle, and D.P. Remsen. 2010. Names are key to the big new biology. Trends in Ecology and Evolution 25(12): 686-691. doi: 10.1016/j.tree.2010.09.004

Thessen, A.E. and D.J. Patterson. 2011. Data issues in the life sciences. ZooKeys 150: 15-51. doi: 10.3897/zookeys.150.1766
