Visualizing Primary Data from Taxonomic Literature


Upload: millerjeremya

Post on 11-Aug-2015



Aggregating data from XML publishing and legacy literature markup

The 41 articles in this body of literature contain a total of 884 species treatments based on 8,172 specimens. Three quarters of the treatments are published in Biodiversity Data Journal, but more than 58% of the specimens are cited in the much larger number of Zootaxa articles. The data can be queried regardless of whether the semantic encoding is applied prospectively, as part of an XML-based publication process, or retrospectively, through markup of the published text.

Visualizing Primary Data from Taxonomic Literature

Taxonomic Research Life Cycle

Researchers need to maintain data structure to conduct meaningful research. By the time manuscripts are ready to submit, data on specimen records, species traits, and more have been compiled, curated, and analyzed by the author. The author selects a publication venue and submits the manuscript for peer review. If the author selects a modern cybertaxonomic journal (such as Biodiversity Data Journal), data structure is retained and data elements are available for sharing with a community of Global Biodiversity Informatics Initiatives. But if a traditional publication is selected, data structure is lost. Plazi has developed the GoldenGATE XML Editor and TaxonX schema to unlock “PDF Prison”, restoring data structure. With structure restored, data elements are shared with Global Biodiversity Informatics Initiatives.
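What "restoring data structure" means in practice can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the element and attribute names below are invented for the example and are not the TaxonX schema, the taxon name "Aus bus" is a placeholder, and every specimen value is made up. The field names in the output dictionaries are Darwin Core terms.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are invented for this
# sketch and are NOT the TaxonX schema. All values are placeholders.
SAMPLE = """
<treatment taxon="Aus bus">
  <specimen institution="INST-A" collector="Collector 1" date="2010-06-15" count="2"/>
  <specimen institution="INST-B" collector="Collector 2" date="2010-07-02" count="1"/>
</treatment>
"""

def extract_records(xml_text):
    """Turn marked-up specimen citations into flat, queryable dictionaries."""
    root = ET.fromstring(xml_text)
    records = []
    for spec in root.iter("specimen"):
        records.append({
            # Keys follow Darwin Core term names.
            "scientificName": root.get("taxon"),
            "institutionCode": spec.get("institution"),
            "recordedBy": spec.get("collector"),
            "eventDate": spec.get("date"),
            "individualCount": int(spec.get("count", "1")),
        })
    return records

records = extract_records(SAMPLE)
print(len(records), "specimen records for", records[0]["scientificName"])
```

Once citations are in this flat form, the same records can be counted, filtered, or charted without reading the article text again.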

Jeremy Miller (1,2), Donat Agosti (1), Lyubomir Penev (3), Guido Sautter (2,4), Teodor Georgiev (3), Terry Catapano (2), David Patterson (5), David King (6), Serrano Pereira (1), Rutger Aldo Vos (1), Soraya Sierra (1)

1 Naturalis Biodiversity Center, Leiden, Netherlands; 2 www.Plazi.org, Bern, Switzerland; 3 Pensoft, Sofia, Bulgaria; 4 KIT, Karlsruhe, Germany; 5 University of Sydney, Sydney, Australia; 6 The Open University, Milton Keynes, United Kingdom

www.eubon.eu

Acknowledgments

Thanks to Slavena Peneva for providing original illustrations.

The Value of Data Extracted from Literature

Aggregated data from taxonomic publications contribute to Global Biodiversity Informatics Initiatives, including GBIF. Although records extracted from literature remain a very small part of GBIF, more than half of the species in the literature we marked up had no previous representation in GBIF. Data structure also helps later researchers verify and build upon earlier work. This potential for scrutiny enhances the scientific quality of the work and supports error correction.

Metrics of taxonomic research activity

The California Academy of Sciences (CAS) was the institution associated with the largest number of specimens in this body of literature. Metrics tracking the activity of individuals and institutions reveal concentrations of taxonomic research, including the most prolific collectors, the most active lending collections, and the authors who examine the most specimens.
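Activity metrics of this kind reduce to tallies over specimen records. The following is a minimal sketch, assuming each record is a dictionary keyed by Darwin Core terms (`institutionCode`, `recordedBy`, `individualCount`). The institution codes, collector names, and counts are invented placeholders, not data from the poster.

```python
from collections import Counter

# Invented placeholder records; keys are Darwin Core term names.
specimens = [
    {"institutionCode": "INST-A", "recordedBy": "Collector 1", "individualCount": 3},
    {"institutionCode": "INST-A", "recordedBy": "Collector 2", "individualCount": 1},
    {"institutionCode": "INST-B", "recordedBy": "Collector 1", "individualCount": 2},
]

def tally(records, field):
    """Sum specimen counts per distinct value of `field`."""
    totals = Counter()
    for record in records:
        # A record without an explicit count is treated as one specimen.
        totals[record[field]] += record.get("individualCount", 1)
    return totals

by_institution = tally(specimens, "institutionCode")
by_collector = tally(specimens, "recordedBy")

# Rank institutions and collectors by the number of specimens attributed to them.
print(by_institution.most_common())
print(by_collector.most_common())
```

The same `tally` helper works for any other field, for example the author of the treatment that cites the specimen, provided that field is present in the records.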

[Figure: Taxonomic Research Life Cycle. Primary data accompany each stage: Collect Specimens, Make Observations, Analyze Data and Write Manuscript, Submit Manuscript and wait… The path then splits. Cybertaxonomic Publication Venue: Semantic Structure Preserved with XML. Traditional Publication Venue: Data Structure Lost ("PDF Prison"); XML Markup with the GoldenGATE XML Editor and TaxonX Schema Restores Semantic Structure.]

XML markup and interoperability

XML markup was applied to all 37 open-access articles published in the journal Zootaxa that contain treatments on spiders (Order: Araneae). These data were combined with data from 5 articles containing treatments on spiders published in Biodiversity Data Journal, where XML structure is part of the routine publication process. XML markup using GoldenGATE can extract structured primary biodiversity data that can be aggregated and jointly queried with data from other Darwin Core-compatible sources.
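Joint querying works because both sources describe specimens with the same field names. A minimal sketch, assuming records are plain dictionaries keyed by Darwin Core terms: the `source` label, the helper names `tag` and `query`, and all values are hypothetical and serve only to show the pattern.

```python
# Invented placeholder records keyed by Darwin Core term names.
from_markup = [  # structure restored retrospectively from legacy articles
    {"scientificName": "Aus bus", "country": "Peru", "institutionCode": "INST-A"},
    {"scientificName": "Aus cus", "country": "Chile", "institutionCode": "INST-B"},
]
from_xml_publishing = [  # structure present from the start of publication
    {"scientificName": "Aus dus", "country": "Peru", "institutionCode": "INST-A"},
]

def tag(records, source):
    """Return copies of the records labelled with where they came from."""
    return [dict(record, source=source) for record in records]

def query(records, **criteria):
    """Keep the records whose fields equal every given criterion."""
    return [
        record for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]

# Because the field names match, the two sources can simply be pooled.
pool = tag(from_markup, "markup") + tag(from_xml_publishing, "xml-first")
hits = query(pool, country="Peru")
for hit in hits:
    print(hit["scientificName"], hit["source"])
```

The query never needs to know which workflow produced a record; the `source` label is kept only so that results can be traced back.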

Rarity and new species descriptions

Rare species are a conspicuous part of diverse communities. More than ⅓ of new species descriptions in this body of literature were based on only one specimen.
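The share of descriptions based on a single specimen is a simple ratio once specimens are counted per species. A minimal sketch, with placeholder species names and invented counts that do not reproduce the poster's data:

```python
def singleton_fraction(specimens_per_species):
    """Fraction of species whose description rests on exactly one specimen."""
    if not specimens_per_species:
        return 0.0
    singletons = sum(1 for n in specimens_per_species.values() if n == 1)
    return singletons / len(specimens_per_species)

# Invented counts for four placeholder species.
counts = {"Aus bus": 1, "Aus cus": 7, "Aus dus": 1, "Aus eus": 12}
share = singleton_fraction(counts)
print(f"{share:.0%} of these species are known from a single specimen")
```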

Treatments

Pardosa zyuzini is one of 92 new species described in this body of literature. Charts efficiently summarize key information about the specimens cited in a treatment. For example, most specimens of this species were collected during June and July, which is useful information for anyone planning fieldwork to recollect it.
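A collecting-season chart like the one described can be derived from the Darwin Core `eventDate` field alone. The sketch below assumes ISO 8601 dates (YYYY-MM-DD) and uses invented dates, not the actual records for Pardosa zyuzini; a text bar chart stands in for a real plot.

```python
from collections import Counter
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def specimens_per_month(event_dates):
    """Count records per calendar month, including months with no records."""
    counts = Counter(date.fromisoformat(d).month for d in event_dates)
    return {name: counts[i] for i, name in enumerate(MONTHS, start=1)}

# Invented collection dates for illustration only.
event_dates = ["2010-06-15", "2010-06-28", "2010-07-02", "2011-09-10"]
per_month = specimens_per_month(event_dates)

# One row per month, one '#' per record.
for month, n in per_month.items():
    print(f"{month} {'#' * n}")
```

Keeping the empty months in the result makes the seasonal gap visible, which is the point of the chart.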

Query and visualize biodiversity data

Aggregated, structured, digital data offer novel ways to explore and query taxonomic research information from geographic, institutional, temporal, individual, and specimen-oriented perspectives. Charts can communicate key information about any institution, country, collector, author, article, treatment, taxonomic rank, type status, or combination of these, among other variables.
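Summarizing by "any combination of these" variables amounts to grouping records on a chosen set of fields. A minimal sketch, assuming dictionary records keyed by Darwin Core terms; the helper name `count_by` and all values are hypothetical.

```python
from collections import Counter

def count_by(records, *fields):
    """Count records for each combination of values in the given fields."""
    return Counter(tuple(record.get(f) for f in fields) for record in records)

# Invented placeholder records.
records = [
    {"country": "Peru", "typeStatus": "holotype", "institutionCode": "INST-A"},
    {"country": "Peru", "typeStatus": "paratype", "institutionCode": "INST-A"},
    {"country": "Chile", "typeStatus": "paratype", "institutionCode": "INST-B"},
    {"country": "Peru", "typeStatus": "paratype", "institutionCode": "INST-B"},
]

# A single variable, then a combination of two.
by_country = count_by(records, "country")
by_country_and_type = count_by(records, "country", "typeStatus")

for key, n in sorted(by_country_and_type.items()):
    print(key, n)
```

Each resulting table of counts is exactly the input a bar or pie chart needs, so one grouping function can feed charts for any of the listed perspectives.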

This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement No 308454.