A generic data import layer for the Berlin Taxonomic Information Model

A generic data import layer for the Berlin Taxonomic Information Model. Anton Güntsch, Andreas Müller & Walter G. Berendsohn, Botanic Garden and Botanical Museum Berlin-Dahlem, Dept. of Biodiversity Informatics and Laboratories

Post on 28-Jan-2016

Page 1: A generic data import layer for the Berlin Taxonomic Information Model

A generic data import layer for the Berlin Taxonomic Information Model

Anton Güntsch, Andreas Müller & Walter G. Berendsohn

Botanic Garden and Botanical Museum Berlin-Dahlem

Dept. of Biodiversity Informatics and Laboratories

Page 2: A generic data import layer for the Berlin Taxonomic Information Model

A. Güntsch: A generic data import layer for the Berlin Model

The Berlin Taxonomic Information Model

Diagram: Name, Concept, Reference, „FactualData“, Relation

• Concepts as name-reference pairs

• Explicit representation of relations between concepts

• Mechanisms for calculating factual data
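
The first two bullets can be illustrated with a minimal Python sketch. It is not the Berlin Model's actual schema: the class names, field names, the second reference and the relation label are assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    genus: str
    epithet: str
    authors: str = ""


@dataclass(frozen=True)
class Reference:
    title: str


@dataclass(frozen=True)
class Concept:
    """A taxon concept: a name as used in one particular reference."""
    name: Name
    reference: Reference


@dataclass(frozen=True)
class Relation:
    """An explicit, typed link between two concepts."""
    source: Concept
    target: Concept
    kind: str  # hypothetical label; the real relation types are not shown on the slide


name = Name("Acrodon", "bellidiflorus", "N.E.Br.")
ref_a = Reference("Aizoaceae database")
ref_b = Reference("Another checklist")  # hypothetical second source

# The same name used in two references yields two distinct concepts.
concept_a = Concept(name, ref_a)
concept_b = Concept(name, ref_b)
relation = Relation(concept_b, concept_a, "is congruent to")
```

Because a concept is identified by the pair rather than by the name alone, `concept_a` and `concept_b` compare as different objects even though they share one name.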

Page 3: A generic data import layer for the Berlin Taxonomic Information Model

Berlin Model used by

• Euro+Med

• Med-Checklist

• IOPI Species Plantarum Initiative

• Algaterra

• Dendroflora of El Salvador

• German Standard List of Vascular Plants and Ferns

• Reference List of the German Mosses

• EDIT WP6

Page 4: A generic data import layer for the Berlin Taxonomic Information Model

Data imports (1)

Heterogeneous sources (e.g. text files, printer-formatted data, spreadsheets, databases)

Complex target model

Imports consume a substantial fraction of project costs, and this fraction is often substantially underestimated.

Page 5: A generic data import layer for the Berlin Taxonomic Information Model

Data imports (2)

• Analyse source

• Identify semantic units

• Transform into appropriate processable format

• Parse to format close to target model

• Duplicate detection and import

• Testing

Page 6: A generic data import layer for the Berlin Taxonomic Information Model

Data imports (2)

Diagram: the workflow steps from the previous slide, from analysing the source through to testing, annotated with two labels:

• Needs a great deal of human input

• Can be automated

Page 7: A generic data import layer for the Berlin Taxonomic Information Model

Step-by-step transformation of taxonomic information: preparation

Diagram: XML source → XML soft schema (Phase I, largely not automatable) → XML strict schema (Phase II, largely automated) → target Berlin Model database (Phase III, fully automated), with feedback

• Identify patterns

• Communicate problems

• Export to simple XML

Page 8: A generic data import layer for the Berlin Taxonomic Information Model

Step-by-step transformation of taxonomic information: preparation

<Aizoaceae xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <AcceptedTaxa>
    <Taxon>
      <ID>7814</ID>
      <Genus>Acrodon</Genus>
      <Epithet>bellidiflorus</Epithet>
      <AllAuthorsString>N.E.Br.</AllAuthorsString>
      <SubSpeciesEpi>v</SubSpeciesEpi>
      <AllAuthorsStringSubSpecies/>
      <SpeciesName>Acrodon bellidiflorus</SpeciesName>
    </Taxon>
    <Taxon>
      <ID>8566</ID>
      <Genus>Acrodon</Genus>
      <Epithet>subulatus</Epithet>
      <AllAuthorsString>(Miller) N.E.Br.</AllAuthorsString>
      <AllAuthorsStringSubSpecies/>
      <SpeciesName>Acrodon subulatus</SpeciesName>
    </Taxon>
  </AcceptedTaxa>
  <SynonymTaxa> […] </SynonymTaxa>
</Aizoaceae>
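
A simple XML export like the one above can be read with standard tooling before any transformation starts. The following Python sketch uses `xml.etree.ElementTree` from the standard library. The function name `read_accepted_taxa` is an assumption, and the sample omits the elided `SynonymTaxa` section.

```python
import xml.etree.ElementTree as ET

# Sample taken from the slide; the elided SynonymTaxa section is left out.
SOURCE_XML = """<Aizoaceae xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <AcceptedTaxa>
    <Taxon>
      <ID>7814</ID>
      <Genus>Acrodon</Genus>
      <Epithet>bellidiflorus</Epithet>
      <AllAuthorsString>N.E.Br.</AllAuthorsString>
      <SubSpeciesEpi>v</SubSpeciesEpi>
      <AllAuthorsStringSubSpecies/>
      <SpeciesName>Acrodon bellidiflorus</SpeciesName>
    </Taxon>
    <Taxon>
      <ID>8566</ID>
      <Genus>Acrodon</Genus>
      <Epithet>subulatus</Epithet>
      <AllAuthorsString>(Miller) N.E.Br.</AllAuthorsString>
      <AllAuthorsStringSubSpecies/>
      <SpeciesName>Acrodon subulatus</SpeciesName>
    </Taxon>
  </AcceptedTaxa>
</Aizoaceae>"""


def read_accepted_taxa(xml_text):
    """Return one dict per accepted Taxon, mapping child tag to its text."""
    root = ET.fromstring(xml_text)
    taxa = []
    for taxon in root.findall("./AcceptedTaxa/Taxon"):
        # Empty elements such as <AllAuthorsStringSubSpecies/> become "".
        taxa.append({child.tag: (child.text or "").strip() for child in taxon})
    return taxa


taxa = read_accepted_taxa(SOURCE_XML)
```

Each record keeps the source's own element names, so nothing is interpreted taxonomically at this stage.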

Page 9: A generic data import layer for the Berlin Taxonomic Information Model

Step-by-step transformation of taxonomic information: phase I

• Transform into soft schema XML

• Re-arrange, lump and split elements

• Don't check „taxonomic integrity“

• Tools: XSLT, Taxonomic Transformation Library (TTL), and others

Page 10: A generic data import layer for the Berlin Taxonomic Information Model

Step-by-step transformation of taxonomic information: phase I

<BMIDataSource xmlns="http://www.bgbm.org/schemas/BMI/s0.7" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.bgbm.org/schemas/BMI/s0.7 P:\XMLSchema\ImportSchicht\BMISoft0.7.xsd">
  <MetaData> […] </MetaData>
  <ConceptReference>
    <RefCategory>database</RefCategory>
    <RefString>Aizoaceae</RefString>
  </ConceptReference>
  <PotentialTaxa>
    <PTaxon>
      <TaxonName>
        <Rank>species</Rank>
        <GenusEpi>Acrodon</GenusEpi>
        <SpeciesEpi>bellidiflorus</SpeciesEpi>
        <AllAuthors>N.E.Br.</AllAuthors>
      </TaxonName>
      <TaxonStatus>Accepted</TaxonStatus>
      <IdInSource>7814</IdInSource>
      <RelatedTaxon ref="21" relType="basionym"/>
    </PTaxon>
    […]
  </PotentialTaxa>
</BMIDataSource>
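
The slides name XSLT and the TTL as the Phase I tools. As a rough illustration of the re-arranging step only, the Python sketch below maps one source `Taxon` element onto a soft-schema `PTaxon`. The element names come from the two XML samples. The function name `to_ptaxon`, the fixed rank "species", and the omission of the BMI namespace and of `RelatedTaxon` are simplifying assumptions.

```python
import xml.etree.ElementTree as ET


def to_ptaxon(taxon, status):
    """Re-arrange a source <Taxon> into a soft-schema <PTaxon> (sketch only)."""
    ptaxon = ET.Element("PTaxon")
    name = ET.SubElement(ptaxon, "TaxonName")
    # Assumption: every record is treated as a species; real data needs rank logic.
    ET.SubElement(name, "Rank").text = "species"
    ET.SubElement(name, "GenusEpi").text = taxon.findtext("Genus", default="")
    ET.SubElement(name, "SpeciesEpi").text = taxon.findtext("Epithet", default="")
    ET.SubElement(name, "AllAuthors").text = taxon.findtext("AllAuthorsString", default="")
    ET.SubElement(ptaxon, "TaxonStatus").text = status
    ET.SubElement(ptaxon, "IdInSource").text = taxon.findtext("ID", default="")
    return ptaxon


source_taxon = ET.fromstring(
    "<Taxon><ID>7814</ID><Genus>Acrodon</Genus>"
    "<Epithet>bellidiflorus</Epithet>"
    "<AllAuthorsString>N.E.Br.</AllAuthorsString></Taxon>"
)
ptaxon = to_ptaxon(source_taxon, "Accepted")
soft_xml = ET.tostring(ptaxon, encoding="unicode")
```

In line with the Phase I bullets, the sketch copies values without checking whether they make taxonomic sense.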

Page 11: A generic data import layer for the Berlin Taxonomic Information Model

Step-by-step transformation of taxonomic information: phase II

• Transform into strict schema XML

• Check data integrity

• Report malformed data

• Tool: TTL
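
The checking and reporting steps might look roughly like the Python sketch below. It does not reproduce the TTL. The function `check_ptaxon` and its two rules (required name parts, known status) are assumptions. Only the mapping of "Accepted" to "A" is taken from the slides, so the status table has a single entry.

```python
import xml.etree.ElementTree as ET

# Only "Accepted" -> "A" appears on the slides; other codes are not invented here.
STATUS_ABBREV = {"Accepted": "A"}


def check_ptaxon(ptaxon):
    """Return a list of problem messages for one soft-schema <PTaxon>."""
    problems = []
    ident = ptaxon.findtext("IdInSource", default="?")
    # Hypothetical rule: a species name needs both genus and species epithet.
    for path in ("TaxonName/GenusEpi", "TaxonName/SpeciesEpi"):
        if not (ptaxon.findtext(path) or "").strip():
            problems.append(f"PTaxon {ident}: missing {path}")
    status = ptaxon.findtext("TaxonStatus", default="")
    if status not in STATUS_ABBREV:
        problems.append(f"PTaxon {ident}: unknown status {status!r}")
    return problems


good = ET.fromstring(
    "<PTaxon><TaxonName><GenusEpi>Acrodon</GenusEpi>"
    "<SpeciesEpi>bellidiflorus</SpeciesEpi></TaxonName>"
    "<TaxonStatus>Accepted</TaxonStatus><IdInSource>7814</IdInSource></PTaxon>"
)
bad = ET.fromstring(
    "<PTaxon><TaxonName><GenusEpi>Acrodon</GenusEpi></TaxonName>"
    "<TaxonStatus>Unclear</TaxonStatus><IdInSource>1</IdInSource></PTaxon>"
)
good_report = check_ptaxon(good)
bad_report = check_ptaxon(bad)
```

Collecting messages instead of stopping at the first error lets all malformed records from one run be reported to the data provider together.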

Page 12: A generic data import layer for the Berlin Taxonomic Information Model

Step-by-step transformation of taxonomic information: phase II

<BMIDataSource xmlns="http://www.bgbm.org/schemas/BMI/0.7" […]>
  <MetaData> […] </MetaData>
  <ConceptReference>
    <RefCategoryAbbrev>BK</RefCategoryAbbrev>
    <RefString>refString</RefString>
    <DatabaseID>4</DatabaseID>
  </ConceptReference>
  <PotentialTaxa>
    <PTaxon>
      <TaxonName>
        <SpeciesName>
          <GenusEpi>Acrodon</GenusEpi>
          <SpeciesEpi>bellidiflorus</SpeciesEpi>
          <AuthorTeam>
            <AuthorTeamCache>N.E.Br.</AuthorTeamCache>
          </AuthorTeam>
        </SpeciesName>
      </TaxonName>
      <TaxonStatusAbbrev>A</TaxonStatusAbbrev>
      <IdInSource>7814</IdInSource>
      <RelatedTaxa> […] </RelatedTaxa>
    </PTaxon>
  </PotentialTaxa>
</BMIDataSource>

Page 13: A generic data import layer for the Berlin Taxonomic Information Model

Step-by-step transformation of taxonomic information: phase III

• Import into database

• Duplicate detection and resolution

• No user interaction required

• Tool: Berlin Model Object Layer (BMOL)
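
The slides do not say how the BMOL recognises duplicates. One possible approach, shown purely as an assumption, is to compare names on a normalised key, so that spacing and capitalisation differences do not create second records. The functions `name_key` and `import_names` are hypothetical.

```python
def name_key(genus, epithet, authors):
    """Build a comparison key that ignores case and whitespace differences."""
    def norm(text):
        return " ".join(text.split()).casefold()
    # Author strings often differ only in spacing ("N.E.Br." vs "N.E. Br.").
    return (norm(genus), norm(epithet), norm(authors).replace(" ", ""))


def import_names(existing, incoming):
    """Split incoming names into new records and duplicates of known ones."""
    index = {name_key(*name): name for name in existing}
    inserted, duplicates = [], []
    for name in incoming:
        key = name_key(*name)
        if key in index:
            duplicates.append((name, index[key]))  # resolve to the stored record
        else:
            index[key] = name
            inserted.append(name)
    return inserted, duplicates


existing = [("Acrodon", "bellidiflorus", "N.E.Br.")]
incoming = [
    ("Acrodon", "bellidiflorus", "N.E. Br."),
    ("Acrodon", "subulatus", "(Miller) N.E.Br."),
]
inserted, duplicates = import_names(existing, incoming)
```

Because the decision is made from the key alone, the loop runs without user interaction, which is the property the phase requires.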

Page 14: A generic data import layer for the Berlin Taxonomic Information Model

Berlin Model Object Layer (BMOL)

• Hides the database key system

• Duplicate detection

• Core-Module provides objects corresponding to database entities

• Mapper-Module interfaces with database

• Persistence-Module manages data flow between core-module and mapper-module
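
The division of labour between the three modules can be sketched in Python as follows. This is not the BMOL's real API: all class and method names are hypothetical, and an in-memory dictionary stands in for the database.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class TaxonName:
    """Core module: a plain object mirroring a database entity, without keys."""
    genus: str
    epithet: str
    authors: str = ""


class Mapper(Protocol):
    """Mapper module: the only place that talks to a concrete database."""
    def find_key(self, name: TaxonName) -> Optional[int]: ...
    def insert(self, name: TaxonName) -> int: ...


class InMemoryMapper:
    """Stand-in mapper; a real one would issue SQL against the target database."""
    def __init__(self) -> None:
        self.rows: Dict[int, TaxonName] = {}
        self._next_key = 1

    def find_key(self, name: TaxonName) -> Optional[int]:
        for key, row in self.rows.items():
            if row == name:
                return key
        return None

    def insert(self, name: TaxonName) -> int:
        key = self._next_key
        self.rows[key] = name
        self._next_key += 1
        return key


class PersistenceManager:
    """Persistence module: moves core objects to the mapper, hiding the keys."""
    def __init__(self, mapper: Mapper) -> None:
        self._mapper = mapper

    def save(self, name: TaxonName) -> bool:
        """Store the name unless it already exists; return True if a row was written."""
        if self._mapper.find_key(name) is not None:
            return False
        self._mapper.insert(name)
        return True


mapper = InMemoryMapper()
manager = PersistenceManager(mapper)
first = manager.save(TaxonName("Acrodon", "bellidiflorus", "N.E.Br."))
second = manager.save(TaxonName("Acrodon", "bellidiflorus", "N.E.Br."))
```

Calling code only ever handles `TaxonName` objects and a true/false result, so supporting another database would mean writing a new mapper and leaving the rest unchanged.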

Page 15: A generic data import layer for the Berlin Taxonomic Information Model

Outlook

• Method has been successfully tested for import of Med-Checklist I, II & IV

• Further imports planned for 2006

• Programming of additional mapper modules desirable

Page 16: A generic data import layer for the Berlin Taxonomic Information Model

www.bgbm.org/biodivinf/