Literature Based Framework for Semantic Descriptions of e-Science Resources


Post on 11-May-2015


Page 1: Literature Based Framework for Semantic Descriptions of e-Science resources

A Literature based framework for semantic descriptions of e-Science resources

Hammad Afzal
PhD, Computer Science
The University of Manchester, UK

Seminar at: National University of Sciences and Technology, Islamabad
Dated: April, 2010

Page 2

Who am I

A former PhD student at the University of Manchester (finished in Dec, 2009).

A former Research Fellow at the Digital Enterprise Research Institute (DERI), National University of Ireland (finished in Dec, 2009).

At the University of Manchester: Text Mining Group.
  Worked to automate the process of semantic service descriptions of bioinformatics resources, using Natural Language Processing (NLP) techniques on the large amount of relevant literature available online.

At DERI: Unit of Natural Language Processing.
  Worked on the development of methods for the semi-automatic and automatic generation of multilingual lexicons for domain ontologies, exploiting Web-based and language resources.

Page 3

e-Science Perspective

Developments in the Web have changed the way research is done.

Resources are now mostly outside a researcher’s office: scientific data, knowledge and computational resources are typically distributed over the Internet. This paradigm is largely known as e-Science.

e-Science is an infrastructure for the systematic development of research methods that involve distributed resources (Web services, data and knowledge resources, and computational resources) and their application to research.

Page 4

e-Science Resources

The resources involved in e-Science are known as e-Science resources. They can be:

Scientific literature databases (e.g. PubMed, PubChem).

Tool repositories (e.g. bioinformatics tools and services provided by the European Bioinformatics Institute (EBI)).

Social-network-like portals where scientists can exchange knowledge and comments (e.g. myExperiment, F1000 Biology).

Page 5

Semantic Web

Provides machine understandability by adding machine-processable semantics to the conventional Web infrastructure.

Has revolutionised the paradigms of resource sharing and service provision by adding meaning to resources (services, data) through associated semantics (formal descriptions of their meaning).

Page 6

Semantic Web

Ontologies
  An integral part of the Semantic Web.
  The specification of conceptualisations, used to help programs and humans share knowledge (Gruber, 1995).
  Capture and computationally present knowledge shared by people in a certain community (Hadzic and Chang, 2004).
  Represent a set of concepts (typically with precise definitions), which are mutually linked through a number of relationships.

Examples: (shown as a figure on the original slide)

Page 7

Bioinformatics e-Resources

Bioinformatics: a pioneer adopter of e-Science
  The use of computational and mathematical techniques to store, manage, and analyse the data from molecular biology in order to answer questions about biological phenomena (Lord et al., 2004).
  Emerged from molecular biology laboratories: an enormous amount of data is produced, along with various tools (Web services) that operate on that data.

Bioinformaticians typically decompose high-level tasks into simpler modules and choose the most appropriate class of service to accomplish each sub-task using different data resources, many of which are distributed (Wroe et al., 2004).

Page 8

Bioinformatics e-Resources

Page 9

Semantic Descriptions of Bioinformatics e-Resources

A large number of bioinformatics tools and resources are available for service use and composition.
  A guesstimate is 3000+ publicly available Web services.
  How to find a service? What is out there to use? What about provenance?

Efficient use of resources requires making them discoverable by potential users.
  Their functional capabilities need to be described, so that they are accessible not only to humans but also to machines (resource crawlers, software agents, etc.).

Page 10

BioCatalogue

Beta version at http://beta.biocatalogue.org/

Launch: June 2009 at ISMB

Page 11

Semantic Descriptions in Bioinformatics Domain

Semantic annotation of bioinformatics services
  annotate functional capabilities
  e.g. Taverna, myGrid, myExperiment, EBI

Not only services and tools
  databases, repositories, corpora

Manual curation
  e.g. myGrid, BioCatalogue, etc.
  e.g. Taverna/Feta: only ~15-20% functionally described
  backlog, and the number of services is growing

Page 12

Our approach – Mine the literature

Literature: Still the largest and most popular source of knowledge.

Hypothesis: The semantic profiles of entities and events can be extracted from the domain literature.

Page 13

Example: Semantically Annotated Web Service

Annotations combine
  textual descriptions
  ontological mappings

Page 14

Detailed approach

Page 15

The rest of the talk

Methodology
  A literature-based methodology to develop and maintain existing domain knowledge representations, in particular controlled vocabularies and ontologies.
  An integrated literature-based methodology for the extraction of resource description profiles.
  Building semantic networks of resources from their descriptions.

What next?

Page 16

1st Module: Building Controlled Vocabulary from Literature

Page 17

Page 18

Terminology Building

First step towards knowledge acquisition from unstructured text.

Structurally organised terms help in
  Information Retrieval (IR)
  Information Extraction (IE)
  Document Summarization, etc.

Used in annotation tasks: predefined and authorised terms, known as controlled vocabularies (CVs), provide domain-specific tags to enrich data or textual resources.

Terms provide the basis for ontologies, controlled vocabularies and taxonomies used in the Semantic Web.

Terms are automatically identified in literature using Automatic Term Recognition (ATR) techniques.

Page 19

Controlled Vocabulary Building – a challenging task

In dynamic domains, new terms representing new domain concepts are continuously introduced.

Generic ATR techniques fail to differentiate between terms related to a specific task and generic domain terms in heterogeneous text (in particular scientific articles in cross-disciplinary domains)

Page 20

Controlled Vocabulary Building: Solution

Term Classification
  Assigning terms to domain-specific classes.
  Narrowing down the specific meaning of a concept described by a given term. For example, in biomedicine, terms can be assigned to classes such as genes, proteins, mRNAs, diseases, etc.
  Can help in building controlled vocabularies by classifying instances of specific and focused sub-classes of interest.

Page 21

Building controlled vocabulary from literature

Page 22

Term Classification driven approach

1) get a corpus

2) get all terms

3) get seed examples

4) find relevant ones using term profiling and comparison to seed examples

Learn bioinformatics terms from literature

Page 23

Bioinformatics terminology

Use seed terms to bootstrap
  e.g. known descriptors used in existing service descriptions, either in literature or service repositories
  250 terms identified, manual pruning after automatic term recognition
  examples of lexical constituents and textual behaviour (pragmatics)
    lexical profiling
    contextual profiling

Page 24

Bioinformatics terminology

Lexical profiling
  what is in the name

Contextual profiling
  characterise sentences in which terms appear (nouns, verbs and context-patterns)

Comparing candidate term profiles to
  average seed term
  best-match

Page 25

Lexical Profile

Term (t): Lexical Profile LP(t)
protein: (1) protein
protein sequence: (1) protein, (2) sequence, (3) protein sequence
protein sequence alignment: (1) protein, (2) sequence, (3) alignment, (4) protein sequence, (5) sequence alignment, (6) protein sequence alignment
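The lexical profiles on this slide look like the set of all contiguous word sub-sequences of a term. A minimal sketch under that assumption follows; the function name `lexical_profile` is illustrative and not taken from the talk, and the term is simply lowercased and split on whitespace.

```python
def lexical_profile(term: str) -> list[str]:
    """Return every contiguous word sub-sequence of a term, shortest first."""
    words = term.lower().split()
    profile = []
    # Grow the window from single words up to the whole term.
    for length in range(1, len(words) + 1):
        for start in range(len(words) - length + 1):
            profile.append(" ".join(words[start:start + length]))
    return profile


print(lexical_profile("protein sequence alignment"))
```

For "protein sequence alignment" this yields the six entries listed on the slide, in the same order.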

Page 26

Contextual Profile

Sentence: Genscan program node can produce a list of nucleotide FASTAs of predicted transcripts
Verb Profile: produce
Noun Profile: genscan, program, list, transcript
Left Pattern (LP), Class-Level (LP1): <Term>, produce, <NP>, of
Right Pattern (RP), Class-Level (RP1): of, <NP>

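Page 27

Profile Comparisons

Overall_Relevance(t) combines LR(t), CNR(t), CVR(t) and CPR(t).

Comparison to the average seed term:

$$\mathrm{CNR}_{\mathrm{avg}}(t) = \frac{1}{n} \sum_{i=1}^{n} \frac{2\,|\mathrm{CNP}(t) \cap \mathrm{CNP}(S_i)|}{|\mathrm{CNP}(t)| + |\mathrm{CNP}(S_i)|}$$

Comparison to the best-matching seed term:

$$\mathrm{CNR}_{\max}(t) = \max_{S_i \in STS} \frac{2\,|\mathrm{CNP}(t) \cap \mathrm{CNP}(S_i)|}{|\mathrm{CNP}(t)| + |\mathrm{CNP}(S_i)|}$$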
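A minimal sketch of the two comparison modes (average seed term and best match), assuming the relevance scores are Dice-style overlaps between a candidate term's profile and each seed term's profile. The Dice reading, the function names and the example profiles are assumptions for illustration, not the author's implementation.

```python
def dice(a: set[str], b: set[str]) -> float:
    """Dice overlap between two profiles; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def relevance_avg(candidate: set[str], seed_profiles: list[set[str]]) -> float:
    """Average overlap of the candidate profile with all seed profiles."""
    return sum(dice(candidate, s) for s in seed_profiles) / len(seed_profiles)


def relevance_max(candidate: set[str], seed_profiles: list[set[str]]) -> float:
    """Overlap of the candidate profile with its best-matching seed profile."""
    return max(dice(candidate, s) for s in seed_profiles)


# Hypothetical context-noun profiles, invented for this example.
candidate = {"program", "list", "transcript"}
seeds = [{"program", "sequence", "list"}, {"database", "record"}]

print(relevance_avg(candidate, seeds))
print(relevance_max(candidate, seeds))
```

The same two functions could be applied to lexical, context-verb and context-pattern profiles; how the four scores are then combined is not specified here.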

Page 28

Bioinformatics terminology

Comparison between Profile based term classification and generic Term Recognition (c-Value method)

Page 29

Statistics about textual corpus

Full Text Articles
# of documents: 2,691
# of distinct candidate terms: 113,280
# of candidate term occurrences: 533,418
# of distinct sentences: 294,614
# of distinct context noun stems: ~79,000
# of distinct context verb stems: ~2,500

Page 30

The Bioinformatics Controlled Vocabulary

Number of Terms
ATR (C-Value), total number of candidate terms: 113,280
Number of terms with lexical similarity to resource terms: 95,437
Number of terms with context noun similarity to resource terms: 103,104
Number of terms with context verb similarity to resource terms: 73,478
Number of terms with context pattern similarity to resource terms: 21,182
Number of terms with combined contextual similarity (Nouns ∪ Verbs ∪ Patterns): 98,307

Page 31

2nd Module: Mining Semantic Descriptions from Literature

Page 32

Page 33

Mining service descriptions

Page 34

Page 35

Semantic classes: myGrid Ontology

Informatics concepts
  general concepts of data, data structures, databases, metadata

Bioinformatics concepts
  domain-specific data sources and algorithms for searching and analysing data
  e.g. Smith-Waterman algorithm

Page 36

Semantic classes: myGrid Ontology

Molecular biology concepts
  higher level concepts used to describe bioinformatics data types, used as inputs and outputs in services
  e.g. protein sequence, nucleic acid sequence

Task concepts
  generic tasks a service operation can perform
  e.g. retrieving, displaying, aligning

Page 37

Semantic classes identification

Engineered from the myGrid bioinformatics sub-ontology

Semantic class: Typical terminological heads
Application: application, tool, service, software, system, program
Algorithm: algorithm, method, approach, procedure, analysis, alignment
Data: data, record, report, sequence, structure
Data Resource: resource, database, dataset, repository

Page 38

Resource mentions

Named-entity recognition (NER) task

Recognition of service mentions using terminological (semantic) heads of automatically recognised terms
  Apollo2Go Web Service is an Application
  BIND database is a Data source
  assign the corresponding semantic class

Hearst patterns (co-ordinations, appositions, enumerations, etc.)

Page 39

Semantic classes and instances

Page 40

Semantic classes and instances

Page 41

Page 42

Extraction/functional rules

Manually designed predicate-driven rules:
  Subject (Arg) – Verb (Predicate) – Object (Arg)

Applied on dependency-parsed sentences
  Stanford parser
  no phrase structures
  complex sentences
  information in sub-clause

“Matrix Global Alignment Tool MatGAT generates similarity/identity matrices for DNA or protein sequences”

“Term_App generates similarity/identity matrices for DNA or protein sequences”

Page 43

Extraction/functional rules

Phrase structures identified and integrated with the dependency parse

Predicate-dependent rules applied to extract specific ‘content’ and profile the services

Profiles collated for all mentions
  service name variation

“Matrix Global Alignment Tool MatGAT generates similarity/identity matrices for DNA or protein sequences”

“Term_App generates similarity/identity matrices for DNA or protein sequences”

Page 44

Extraction/functional rules

Page 45

Extraction/functional rules

Predicate-driven rules: each verb is associated with the type of “information content” it provides

Function: Associated verbs
Generic functionality / Task specification: applied, access, achieve, align, allow, based, developed, implemented, present, provide, used, is a, called
Inputs, outputs: accept, applied, create, provide, query, retrieve, starts with, take, used, generate
Comparison: outperform, perform, compare
Implementation technique, Programming language: implement(ed)
Composition, subtasks: contain(ed), construct(ed), generate(d)
Availability: available

Page 46

Information Extraction

Input sentence: “Matrix Global Alignment Tool MatGAT generates similarity/identity matrices for DNA or protein sequences”

SC instance (resource): Matrix Global Alignment Tool MatGAT
SC: Application
Task: Generate
Predicted input: DNA or protein sequences
Predicted output: similarity/identity matrices
Descriptors: similarity/identity matrices, DNA or protein sequences

Page 47

Page 48

Experiments

2657 BMC Bioinformatics full-text articles (before March 2008)

108 predicates used

Semantic Class: Total # of instances
Algorithm: 5,722
Application: 2,076
Data: 2,662
Data Resource: 1,992
Total: 12,452

Page 49

Example: GeneClass

1) Resource descriptors

Descriptors: Frequency of co-occurrence
motif data: 4
differential gene expression: 3
reliable predictive model: 2
genome-wide protein-DNA binding data: 2
transcriptional gene regulation: 2
gene expression data: 1

2) MyGrid terms

BIND

3) Related resources

Robust GeneClass Algorithm

Page 50

Example: GeneClass

Functional Content: Input/Output
Predicate (Task): predict
Subject: GeneClass Algorithm
Functional Description: predicting differential gene expression starts with a candidate set of motifs μ

Page 51

Example: GeneClass

Sentences

1. We also show how to incorporate genome-wide protein-DNA binding data from ChIP chip experiments into the GeneClass algorithm, and we use an improved noise model for gene expression data [PMC 1810316].

2. The GeneClass algorithm for predicting differential gene expression starts with a candidate set of motifs; representing known or putative regulatory element sequence patterns and a candidate set of regulators or parentSS [PMC 1810316].

3. Target set: We extend the original GeneClass algorithm to use all target genes for which both motif and expression data is available [PMC 1810316].

Page 52

Evaluation of semantic profiles

Evaluated for their capability to be used for semantic description of a given bioinformatics resource
  (0) irrelevant
  (1) partially useful
  (2) useful

HeatMapper: The HeatMapper tool has already proven to be very useful in several studies.

Kalign: To compare Kalign to other MSA programs, the following test sets were used.

Cognitor: To add a new species to the COG system, the annotated protein sequences from the respective genome were compared to the proteins in the COG database by using the BLAST program and assigned to pre-existing COGs by using the COGNITOR program.

Page 53

Evaluation of semantic profiles

Two experiments:
  15 well-known resources with descriptions already available
  15 new resources

Quality comparison of various components of resource description profiles from the two experiments

Page 54

3rd Module: Mining Semantic Networks from Literature

Page 55

Page 56

What next?

Good recall, poor precision
  context needs a better model

Mining parameter values
  sub-language of parameters

Candidate service/resource mentions
  an entity whose profile looks like a service
  comparison of semantic profiles
  network of services [ISMB 2009]

Do we have good service ontologies?

Page 57

What Next? (Proposed in BioHackathon 2010)

Phylogenetic trees are then generated by the ClustalW program by the neighbour-joining method [PMC1973088].

We also used the CLUSTALW program for multialignment as a control process [PMC434493].

Resource1: Phylogenetic Tree
Resource2: ClustalW Program
Resource3: Multialignment

Relations between resources, kept in an RDF store:
  Phylogenetic Tree, generated by, ClustalW Program
  ClustalW Program, is used for, Multialignment

Class labels shown in the figure: #Data, #Task
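The relations on this slide can be read as subject-predicate-object triples. A minimal in-memory sketch follows, assuming a hypothetical `http://example.org/escience/` namespace and predicate names (`generatedBy`, `isUsedFor`) invented here for illustration; a real RDF store and vocabulary would replace both.

```python
BASE = "http://example.org/escience/"  # hypothetical namespace, not from the talk


def iri(label: str) -> str:
    """Turn a label into an IRI under the hypothetical BASE namespace."""
    return "<" + BASE + label.strip().replace(" ", "_") + ">"


class TripleStore:
    """A tiny in-memory stand-in for an RDF store."""

    def __init__(self) -> None:
        self.triples: list[tuple[str, str, str]] = []

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self.triples.append((subject, predicate, obj))

    def objects(self, subject: str, predicate: str) -> list[str]:
        """All objects linked to a subject by the given predicate."""
        return [o for s, p, o in self.triples if s == subject and p == predicate]

    def to_ntriples(self) -> str:
        """Serialise the triples, one N-Triples statement per line."""
        return "\n".join(f"{iri(s)} {iri(p)} {iri(o)} ." for s, p, o in self.triples)


store = TripleStore()
store.add("Phylogenetic Tree", "generatedBy", "ClustalW Program")
store.add("ClustalW Program", "isUsedFor", "Multialignment")

print(store.objects("Phylogenetic Tree", "generatedBy"))
print(store.to_ntriples())
```

Keeping the relations as triples means a question such as "what generates a phylogenetic tree?" becomes a lookup over the network of resources.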

Page 58

Conclusion

Literature mining approach to service description and annotation

Aims
  reduce curation efforts
  provide semantic synopses of services for the Semantic Web

Potential of text mining
  integration with other annotation approaches
  extracting the entire service context is still challenging

Page 59

Related Selected Publications

Hammad Afzal, James Eales, Robert Stevens, Goran Nenadic (2010): Mining Semantic Networks of Bioinformatics Web Resources from the Literature. Journal of Biomedical Semantics.

Hammad Afzal, Robert Stevens, Goran Nenadic (2009): Mining Semantic Descriptions of Bioinformatics Web Resources from the Literature. 6th European Semantic Web Conference (ESWC) on the Semantic Web: Research and Applications, Heraklion, Crete, Greece. Springer-Verlag.

Hammad Afzal, Robert Stevens, Goran Nenadic (2008): Towards Semantic Annotation of Bioinformatics Services: Building a Controlled Vocabulary. Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008).

Page 60

Thanks