Seed-based Generation of Personalized Bio-Ontologies for Information Extraction

Cui Tao & David W. Embley
Data Extraction Research Group
Department of Computer Science
Brigham Young University

Supported by NSF

Personalized Information Harvesting

• Biology domain huge (other domains too)
• Data collection
  – Many (web) sources
  – Only a tiny subpart wanted
  – Personalized view
• Personalized extraction ontology
  – Creation: Form specification
  – Application: Seed-based harvesting

Example

• Harvest information about large proteins in humans and the functions of these proteins
  – Find proteins in humans that are >20 kDa
  – Find all the proteins in humans that serve as receptors
  – ...
• Information sources: various online repositories
  – NCBI
  – Gene Cards
  – The Gene Ontology
  – GPM Proteomics Database
  – …

Extraction Ontology

Instance: ^\d{1,5}(\.\d{1,2})?

Context: weight|wght|wt\.

Unit: kilodaltons?|kdas?|kds?|das?|daltons?

T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15
…
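The three expressions on this slide (Instance, Context, Unit) can be combined into a single recognizer. The following is a minimal Python sketch, not the authors' implementation. The way the parts are joined (context keyword, at most 20 non-digit characters, value, unit), the name `find_weights`, and dropping the leading `^` anchor so that values can be found mid-sentence are all assumptions made for illustration.

```python
import re

# Patterns copied from the slide; the "^" anchor on the instance pattern is
# dropped here (an assumption) so the value can appear anywhere in the text.
INSTANCE = r"\d{1,5}(?:\.\d{1,2})?"
CONTEXT = r"weight|wght|wt\."
UNIT = r"kilodaltons?|kdas?|kds?|das?|daltons?"

# Hypothetical combination: a context keyword, a short non-digit gap,
# the numeric value, optional whitespace, and a unit.
WEIGHT = re.compile(
    rf"(?:{CONTEXT})\D{{0,20}}?(?P<value>{INSTANCE})\s*(?P<unit>{UNIT})\b",
    re.IGNORECASE,
)


def find_weights(text):
    """Return (value, unit) pairs for molecular weights found in text."""
    return [(m.group("value"), m.group("unit")) for m in WEIGHT.finditer(text)]


print(find_weights("Molecular weight: 59.62 kDa"))
```

Requiring a context keyword keeps unrelated numbers (sequence lengths, identifiers) from being mistaken for weights, at the cost of missing values that appear without one.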

Can We Make Construction Easier?

• Forms
  – General familiarity
  – Reasonable conceptual framework
  – Appropriate correspondence
    • Transformable to ontological descriptions
    • Capable of accepting source data
• Instance recognizers
  – Some pre-existing instance recognizers
  – Lexicons
• Need for a full extraction ontology?

Form Creation User Interface

Basic form-construction facilities:
• single-entry field
• multiple-entry field
• nested form
• …
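To make the link between form facilities and ontological descriptions concrete, here is a small Python sketch of one possible encoding. It is an illustration under stated assumptions, not the tool's actual data model: the classes `Field` and `Form`, the function `to_relationships`, the sample form, and the rule that a single-entry field allows at most one value while multiple-entry fields and nested forms allow many are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Field:
    name: str
    multiple: bool = False  # False: single-entry field, True: multiple-entry field


@dataclass
class Form:
    name: str
    items: List[Union["Field", "Form"]] = field(default_factory=list)


def to_relationships(form: Form) -> List[Tuple[str, str, str]]:
    """Derive (parent, child, max cardinality) triples from a form.

    Assumed rule: single-entry -> "1"; multiple-entry or nested form -> "*".
    """
    rels = []
    for item in form.items:
        if isinstance(item, Form):
            rels.append((form.name, item.name, "*"))
            rels.extend(to_relationships(item))
        else:
            rels.append((form.name, item.name, "*" if item.multiple else "1"))
    return rels


# Hypothetical sample form; the actual form on the next slide is not in the transcript.
protein_form = Form("Protein", [
    Field("Name", multiple=True),
    Field("Molecular Weight"),
    Form("Function", [Field("Description")]),
])

for rel in to_relationships(protein_form):
    print(rel)
```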

Created Sample Form

Generated Ontology View

Source-to-Form Mapping
Establishing a Seed

Almost Ready to Harvest …

• Need reading path: DOM-tree structure
• Need to resolve mapping problems
  – Split/Merge
  – Union/Selection

Voltage-dependent anion-selective channel protein 3
VDAC-3
hVDAC3
Outer mitochondrial membrane protein porin 3

Name

Name

T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15
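The two requirements on this slide, a reading path through the DOM tree and a split of one source node into several form values, can be sketched as follows. This is a simplified Python illustration, not the system's algorithm: the page snippets, the `;` separator, and the functions `find_path`, `follow`, and `harvest` are assumptions, and the path is just a list of child indexes learned from the seed page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragments; real pages are far messier.
SEED_PAGE = ("<html><body><table><tr><td>Name</td>"
             "<td>VDAC-3; hVDAC3</td></tr></table></body></html>")
OTHER_PAGE = ("<html><body><table><tr><td>Name</td>"
              "<td>TCP-1-theta; CCT-theta</td></tr></table></body></html>")


def find_path(elem, seed, path=()):
    """Return child-index steps to the first node whose text contains seed."""
    if elem.text and seed in elem.text:
        return list(path)
    for i, child in enumerate(elem):
        found = find_path(child, seed, path + (i,))
        if found is not None:
            return found
    return None


def follow(root, path):
    """Walk the recorded child indexes from the root to the target node."""
    node = root
    for i in path:
        node = node[i]
    return node


def harvest(page, path, sep=";"):
    """Read the node on the path and split it into separate values."""
    text = follow(ET.fromstring(page), path).text or ""
    return [part.strip() for part in text.split(sep) if part.strip()]


# Learn the reading path from the value the user mapped on the seed page ...
path = find_path(ET.fromstring(SEED_PAGE), "VDAC-3")
# ... then reuse it on a sibling page with the same layout.
print(path, harvest(OTHER_PAGE, path))
```

A path of bare child indexes only works when sibling pages share the seed page's layout exactly; a more tolerant path description would be needed for pages that vary.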

Can Now Harvest

Name
14-3-3 protein epsilon
Mitochondrial import stimulation factor L subunit
Protein kinase C inhibitor protein-1
KCIP-1
14-3-3E

Name
Voltage-dependent anion-selective channel protein 3
VDAC-3
hVDAC3
Outer mitochondrial membrane protein porin 3

Name
Tryptophanyl-tRNA synthetase, mitochondrial precursor
EC 6.1.1.2
Tryptophan--tRNA ligase
TrpRS
(Mt)TrpRS

Harvesting Populates Ontology

Also helps adjust ontology constraints

Can Harvest from Additional Sites

Name

T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15

Larger Picture

• Information Harvesting
  – Not only for biology, but for any application
  – Not only from one site, but from many sites
• Opportunities
  – Extraction ontology creation
  – Automating site-to-site information harvesting
  – Automatic semantic annotation
  – Data/Ontology transformations

Extraction Ontology Creation
Lexicons

Name
14-3-3 protein epsilon
Mitochondrial import stimulation factor L subunit
Protein kinase C inhibitor protein-1
KCIP-1
14-3-3E

Name
T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15

Name
Tryptophanyl-tRNA synthetase, mitochondrial precursor
EC 6.1.1.2
Tryptophan--tRNA ligase
TrpRS
(Mt)TrpRS

…
14-3-3 protein epsilon
Mitochondrial import stimulation factor L subunit
Protein kinase C inhibitor protein-1
KCIP-1
14-3-3E
…
T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15
…
Tryptophanyl-tRNA synthetase, mitochondrial precursor
EC 6.1.1.2
Tryptophan--tRNA ligase
TrpRS
(Mt)TrpRS
…
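The merged name list above suggests how harvested Name values could be pooled into a lexicon and then used as a recognizer on new text. The Python sketch below shows one way this might be done. It is an assumption-laden illustration: the functions `build_lexicon` and `lexicon_pattern`, the longest-entry-first ordering, and the case-insensitive matching are not taken from the slides.

```python
import re

# A few harvested Name values from the slides, grouped per protein page.
HARVESTED = [
    ["14-3-3 protein epsilon", "KCIP-1", "14-3-3E"],
    ["T-complex protein 1 subunit theta", "TCP-1-theta", "CCT-theta"],
    ["Tryptophan--tRNA ligase", "TrpRS", "(Mt)TrpRS"],
]


def build_lexicon(records):
    """Pool the values harvested from every page into one set of entries."""
    return {name for names in records for name in names}


def lexicon_pattern(lexicon):
    """Compile the lexicon into one regex, trying longer entries first."""
    entries = sorted(lexicon, key=len, reverse=True)
    body = "|".join(re.escape(entry) for entry in entries)
    # Lookarounds instead of \b, because some entries start with "(".
    return re.compile(rf"(?<!\w)(?:{body})(?!\w)", re.IGNORECASE)


pattern = lexicon_pattern(build_lexicon(HARVESTED))
print(pattern.findall("CCT-theta, also called TCP-1-theta, is a chaperonin subunit."))
```

Sorting longer entries first means a full name is preferred over a shorter alias that happens to be its prefix.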

Automatic Source-to-Form Mapping

Automatic Semantic Annotation

Extraction Ontology Creation
Instance Recognizers

Number Patterns
Context Keywords and Phrases

Automatic Source-to-Form Mapping

Automatic Semantic Annotation

Recognize and annotate with respect to an ontology
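Annotation with respect to an ontology can be pictured as running one recognizer per concept and recording which span of text belongs to which concept. The Python sketch below is only an illustration of that idea. The concept names `Name` and `MolecularWeight`, the tiny recognizer table, and the span format returned by `annotate` are assumptions, not the annotation format used by the authors.

```python
import re

# Hypothetical recognizers, one per ontology concept: a small lexicon for
# Name and a number-plus-unit pattern for MolecularWeight.
RECOGNIZERS = {
    "Name": re.compile(r"(?<!\w)(?:TCP-1-theta|CCT-theta|VDAC-3)(?!\w)"),
    "MolecularWeight": re.compile(
        r"\d{1,5}(?:\.\d{1,2})?\s*(?:kilodaltons?|kdas?|kds?|das?|daltons?)\b",
        re.IGNORECASE,
    ),
}


def annotate(text):
    """Return (start, end, concept, matched text) tuples in document order."""
    spans = []
    for concept, pattern in RECOGNIZERS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), concept, m.group()))
    return sorted(spans)


sentence = "TCP-1-theta has a weight of 59.62 kDa."
for start, end, concept, matched in annotate(sentence):
    print(f"{concept}: {matched!r} at [{start}, {end})")
```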

Ontology Transformation

OWL & RDF: standard ontology languages

XML & XMLS: data exchange

Forms: form filling to populate an ontology

Transformations to and from all
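As an illustration of one direction only, from a form-derived concept to OWL, the Python sketch below writes a concept and its fields as Turtle text. It is a hedged example rather than the project's transformation: the function `to_turtle`, the `http://example.org/bio#` namespace, the `has<Field>` property naming, and the choice to mark single-entry fields as `owl:FunctionalProperty` are all assumptions.

```python
def to_turtle(concept, fields, base="http://example.org/bio#"):
    """Render a concept and its (field name, is_single_entry) pairs as Turtle."""
    lines = [
        f"@prefix : <{base}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
        f":{concept} a owl:Class .",
    ]
    for name, single in fields:
        # Assumed mapping: a single-entry field becomes a functional property.
        types = "owl:DatatypeProperty"
        if single:
            types += ", owl:FunctionalProperty"
        lines.append(f":has{name} a {types} ; rdfs:domain :{concept} .")
    return "\n".join(lines)


turtle = to_turtle("Protein", [("Name", False), ("MolecularWeight", True)])
print(turtle)
```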

Contributions

• Personalized ontology creation
• Mapping from sources
• Information harvesting
• Opportunities for further work
  – Extraction ontology creation
  – Semantic Annotation
  – Data/Ontology transformations

www.deg.byu.edu