A literature-based framework for semantic descriptions of e-Science resources

Hammad Afzal

PhD, Computer Science

The University of Manchester, UK

Seminar at: National University of Sciences and Technology,

Islamabad

Dated: April, 2010



Who am I

A former PhD Student at University of Manchester (Finished in Dec, 2009).

A former Research Fellow at Digital Enterprise Research Institute (DERI), National University of Ireland (Finished in Dec, 2009)

At University of Manchester: Text Mining Group.

Worked to automate the process of semantic service descriptions of bioinformatics resources, using Natural Language Processing (NLP) techniques on the large amount of relevant literature available online.

At DERI: Unit of Natural Language Processing.

Worked on the development of methods for the (semi-)automatic generation of multilingual lexicons for domain ontologies, exploiting Web-based and language resources.

e-Science Perspective

Developments in the Web have changed the way research is done.

Resources are now mostly outside a researcher’s office: scientific data, knowledge and computational resources are typically distributed over the Internet. This paradigm is largely known as e-Science.

e-Science is an infrastructure for the systematic development of research methods that involve distributed resources (Web services, data and knowledge resources, and computational resources) and their application to research.

e-Science Resources

The resources involved in e-Science are known as e-Science resources, which can be:

Scientific literature databases (e.g. PubMed, PubChem etc.).

Tool repositories (e.g. bioinformatics tools and services provided by the European Bioinformatics Institute (EBI) etc.).

Social-network-like portals where scientists can exchange knowledge and comments (e.g. myExperiment, F1000 Biology).

Semantic Web

Provides machine understandability by adding machine-processable semantics to the conventional Web infrastructure.

Has revolutionised the paradigms of resource sharing and service provision by adding meaning to resources (services, data) through associated semantics (formal descriptions of their meaning).

Semantic Web

Ontologies: an integral part of the Semantic Web.

The specification of conceptualisations, used to help programs and humans share knowledge (Gruber, 1995).

Capture and computationally present knowledge shared by people in a certain community (Hadzic and Chang, 2004).

Represent a set of concepts (typically with precise definitions), which are mutually linked through a number of relationships.

Examples:

Bioinformatics e-Resources

Bioinformatics: a pioneer adopter of e-Science

  use of computational and mathematical techniques to store, manage, and analyse the data from molecular biology in order to answer questions about biological phenomena (Lord et al., 2004).

  emerged from molecular biology laboratories, where an enormous amount of data is produced, along with various tools (Web services) that operate on that data.

Bioinformaticians typically decompose high-level tasks into simpler modules and choose the most appropriate class of service to accomplish each sub-task using different data resources, many of which are distributed (Wroe et al., 2004).

Bioinformatics e-Resources

Semantic Descriptions of Bioinformatics e-Resources

A number of bioinformatics tools and resources are available for service use and composition

  guesstimate is 3000+ Web services publicly available

  how to find a service? what is out there to use? provenance?

Efficient use of resources requires making them discoverable by potential users.

Their functional capabilities need to be described, so that they are accessible not only to humans but also to machines (resource crawlers, software agents etc.).

BioCatalogue

Beta version at http://beta.biocatalogue.org/

Launch June 2009 at ISMB

Semantic annotation of bioinformatics services

  annotate functional capabilities, e.g. Taverna, myGrid, myExperiment, EBI

Not only services and tools: databases, repositories, corpora

Manual curation, e.g. myGrid, BioCatalogue etc.

  e.g. Taverna/Feta: only ~15-20% functionally described

  backlog, and the number of services is growing

Semantic Descriptions in Bioinformatics Domain

Our approach – Mine the literature

Literature: Still the largest and most popular source of knowledge.

Hypothesis: The semantic profiles of entities and events can be extracted from the domain literature.


Example: Semantically Annotated Web Service

Annotations combine

  textual descriptions

  ontological mappings

Detailed approach

The rest of the talk

Methodology

A literature-based methodology to develop and maintain existing domain knowledge representations, in particular controlled vocabularies and ontologies.

An integrated literature based methodology for extraction of resource description profiles

Building semantic networks of resources from their descriptions.

What next?

1st Module: Building Controlled Vocabulary from Literature

Terminology Building

First step towards knowledge acquisition from unstructured text.

Structurally organised terms help in Information Retrieval (IR), Information Extraction (IE), Document Summarization etc.

Used in annotation tasks: predefined and authorised terms, known as controlled vocabularies (CVs), provide domain-specific tags to enrich data or textual resources.

Terms provide basis for Ontologies, Controlled Vocabularies, Taxonomies used in Semantic Web

Terms are automatically identified in literature using Automatic Term Recognition (ATR) techniques

Controlled Vocabulary Building – a challenging task

In dynamic domains, new terms representing new domain concepts are continuously introduced.

Generic ATR techniques fail to differentiate between terms related to a specific task and generic domain terms in heterogeneous text (in particular scientific articles in cross-disciplinary domains)

Term Classification: assigning terms to domain-specific classes.

Narrowing down the specific meaning of a concept described by a given term. For example, in biomedicine, terms can be assigned to classes such as genes, proteins, mRNAs, diseases, etc.

Can help in building controlled vocabularies by classifying instances of specific and focused sub-classes of interest.

Controlled Vocabulary Building – Solution

Building controlled vocabulary from literature

Term Classification driven approach

1) get a corpus

2) get all terms

3) get seed examples

4) find relevant ones using term profiling and comparison to seed examples

Learn bioinformatics terms from literature

Bioinformatics terminology

Use seed terms to bootstrap

  e.g. known descriptors used in existing service descriptions, either in literature or service repositories

  250 terms identified, manual pruning after automatic term recognition

  examples of lexical constituents and textual behaviour (pragmatics): lexical profiling, contextual profiling

Lexical profiling: what is in the name

Contextual profiling: characterise sentences in which terms appear (nouns, verbs and context-patterns)

Comparing candidate term profiles to

  average seed term

  best-match

Lexical Profile

Term (t): Lexical Profile LP(t)

protein: (1) protein

protein sequence: (1) protein, (2) sequence, (3) protein sequence

protein sequence alignment: (1) protein, (2) sequence, (3) alignment, (4) protein sequence, (5) sequence alignment, (6) protein sequence alignment
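A minimal sketch of how such a lexical profile could be generated, assuming LP(t) is simply the list of all contiguous word sub-sequences of the term, as the table suggests. The function name `lexical_profile` is illustrative, and no stemming or other normalisation beyond lowercasing is attempted here.

```python
def lexical_profile(term: str) -> list:
    """Return every contiguous word sub-sequence of a term, shortest first."""
    words = term.lower().split()
    profile = []
    # Enumerate sub-sequences by length, then by start position.
    for length in range(1, len(words) + 1):
        for start in range(len(words) - length + 1):
            profile.append(" ".join(words[start:start + length]))
    return profile


for entry in lexical_profile("protein sequence alignment"):
    print(entry)
```

A term of k words yields k(k+1)/2 profile entries, which is why the three-word term above has six.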

Contextual Profile

Sentence: Genscan program node can produce a list of nucleotide FASTAs of predicted transcripts

Verb Profile: produce

Noun Profile: genscan, program, list, transcript

Left Pattern (LP), Class-Level (LP1): <Term>, produce, <NP>, of

Right Pattern (RP), Class-Level (RP1): of, <NP>

Profile Comparisons

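Overall_Relevance(t) combines four relevance scores: LR(t) (lexical), CNR(t) (context nouns), CVR(t) (context verbs) and CPR(t) (context patterns).

\[ \mathrm{CNR}_{\mathrm{avg}}(t) = \frac{1}{n} \sum_{i=1}^{n} \frac{2\,|\mathrm{CNP}(t) \cap \mathrm{CNP}(S_i)|}{|\mathrm{CNP}(t)| + |\mathrm{CNP}(S_i)|} \]

\[ \mathrm{CNR}_{\max}(t) = \max_{S_i \in STS} \frac{2\,|\mathrm{CNP}(t) \cap \mathrm{CNP}(S_i)|}{|\mathrm{CNP}(t)| + |\mathrm{CNP}(S_i)|} \]

where CNP(x) is the context noun profile of x, S_i is a seed term, STS is the set of seed terms and n is the number of seed terms.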

Comparison between Profile based term classification and generic Term Recognition (c-Value method)

Statistics about textual corpus (full-text articles)

# of documents: 2,691
# of distinct candidate terms: 113,280
# of candidate term occurrences: 533,418
# of distinct sentences: 294,614
# of distinct context noun stems: ~79,000
# of distinct context verb stems: ~2,500

The Bioinformatics Controlled Vocabulary

Number of terms

ATR (C-Value), total number of candidate terms: 113,280
Number of terms with lexical similarity to resource terms: 95,437
Number of terms with context noun similarity to resource terms: 103,104
Number of terms with context verb similarity to resource terms:
Number of terms with context pattern similarity to resource terms:
Number of terms with combined contextual similarity (Nouns ∪ Verbs ∪ Patterns): 98,307

2nd Module: Mining Semantic Descriptions from Literature

Mining service descriptions

Informatics concepts: general concepts of data, data structures, databases, metadata

Bioinformatics concepts: domain-specific data sources and algorithms for searching and analysing data, e.g. Smith-Waterman algorithm

Semantic classes – myGrid Ontology

Molecular biology concepts: higher-level concepts used to describe bioinformatics data types, used as inputs and outputs in services, e.g. protein sequence, nucleic acid sequence

Task concepts: generic tasks a service operation can perform, e.g. retrieving, displaying, aligning

Semantic classes – myGrid Ontology

Semantic classes identification

Engineered from the myGrid bioinformatics sub-ontology

Semantic class: Typical terminological heads
Application: application, tool, service, software, system, program
Algorithm: algorithm, method, approach, procedure, analysis, alignment
Data: data, record, report, sequence, structure
Data Resource: resource, database, dataset, repository

Resource mentions

Named-entity recognition (NER) task

Recognition of service mentions using terminological (semantic) heads of automatically recognised terms

  Apollo2Go Web Service is an Application

  BIND database is a Data Resource

  assign the corresponding semantic class

Hearst patterns (co-ordinations, appositions, enumerations, etc.)
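A minimal sketch of head-based class assignment, using the terminological heads listed for the four semantic classes. It assumes the head of an English term is its last word and ignores plurals and Hearst patterns; the names `HEADS` and `semantic_class` are illustrative.

```python
# Terminological heads per semantic class, as listed in the slides.
HEADS = {
    "Application": {"application", "tool", "service", "software", "system", "program"},
    "Algorithm": {"algorithm", "method", "approach", "procedure", "analysis", "alignment"},
    "Data": {"data", "record", "report", "sequence", "structure"},
    "Data Resource": {"resource", "database", "dataset", "repository"},
}


def semantic_class(term: str):
    """Return the semantic class whose head list contains the term's last word."""
    words = term.lower().split()
    if not words:
        return None
    head = words[-1]  # simplifying assumption: the head is the final word
    for cls, heads in HEADS.items():
        if head in heads:
            return cls
    return None


print(semantic_class("Apollo2Go Web Service"))
print(semantic_class("BIND database"))
```

Terms whose head is not in any list return `None`, which is where additional evidence such as Hearst patterns would be needed.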

Semantic classes and instances

Extraction/functional rules

Manually designed predicate-driven rules: Subject (Arg) – Verb (Predicate) – Object (Arg)

Applied on dependency-parsed sentences

  Stanford parser

  no phrase structures

  complex sentences

  information in sub-clause

“Matrix Global Alignment Tool MatGAT generates similarity/identity matrices for DNA or protein sequences”

“Term_App generates similarity/identity matrices for DNA or protein sequences”

Phrase structures identified and integrated with the dependency parse

Predicate-dependent rules applied to extract specific ‘content’ and profile the services

Profiles collated for all mentions

  service name variation


Predicate-driven rules: each verb is associated with the type of “information content” it provides

Function: Associated verbs
Generic functionality / task specification: applied, access, achieve, align, allow, based, developed, implemented, present, provide, used, is a, called
Inputs, outputs: accept, applied, create, provide, query, retrieve, starts with, take, used, generate
Comparison: outperform, perform, compare
Implementation technique, programming language: implement(ed)
Composition, subtasks: contain(ed), construct(ed), generate(d)
Availability: available

Information Extraction

SC instance (resource): Matrix Global Alignment Tool MatGAT
SC: Application
Task: Generate
Predicted input: DNA or protein sequences
Predicted output: similarity/identity matrices
Descriptors: similarity/identity matrices, DNA or protein sequences

Input sentence: “Matrix Global Alignment Tool MatGAT generates similarity/identity matrices for DNA or protein sequences”
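A minimal sketch of one predicate-driven rule applied to the MatGAT sentence. It assumes the sentence has already been dependency-parsed and reduced to a subject, verb lemma, object and prepositional phrases, so no parser is called here. The `Clause` type, the split of verbs into `OUTPUT_VERBS` and `INPUT_VERBS`, and the treatment of a "for" phrase as the predicted input are hypothetical simplifications, not the rule set used in the experiments.

```python
from dataclasses import dataclass, field

# Hypothetical direction of some "Inputs, outputs" verbs (an assumption).
OUTPUT_VERBS = {"generate", "create", "provide"}
INPUT_VERBS = {"accept", "take", "query"}


@dataclass
class Clause:
    subject: str      # resource mention
    predicate: str    # verb lemma
    obj: str          # direct object phrase
    prep_phrases: dict = field(default_factory=dict)  # preposition -> phrase


def extract_profile(clause: Clause, semantic_class: str) -> dict:
    """Fill a resource description profile from one Subject-Verb-Object clause."""
    profile = {
        "instance": clause.subject,
        "semantic_class": semantic_class,
        "task": clause.predicate,
        "predicted_input": None,
        "predicted_output": None,
        "descriptors": [clause.obj, *clause.prep_phrases.values()],
    }
    if clause.predicate in OUTPUT_VERBS:
        # "X generates <output> for <input>"
        profile["predicted_output"] = clause.obj
        profile["predicted_input"] = clause.prep_phrases.get("for")
    elif clause.predicate in INPUT_VERBS:
        # "X accepts <input>"
        profile["predicted_input"] = clause.obj
    return profile


clause = Clause(
    subject="Matrix Global Alignment Tool MatGAT",
    predicate="generate",
    obj="similarity/identity matrices",
    prep_phrases={"for": "DNA or protein sequences"},
)
profile = extract_profile(clause, "Application")
for key, value in profile.items():
    print(f"{key}: {value}")
```

Keeping the verb-to-content mapping in plain sets makes it easy to extend the rule to other predicate groups such as comparison or availability.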

Experiments

2,657 BMC Bioinformatics full-text articles (before March 2008)

108 predicates used

Semantic class: Total # of instances
Algorithm: 5,722
Application: 2,076
Data: 2,662
Data Resource: 1,992
Total: 12,452

Example – GeneClass

1) Resource descriptors

Descriptor: Frequency of co-occurrence
motif data: 4
differential gene expression: 3
reliable predictive model: 2
genome-wide protein-DNA binding data: 2
transcriptional gene regulation: 2
gene expression data: 1

2) MyGrid terms

BIND

3) Related resources

Robust GeneClass Algorithm


Functional content: Input/Output
Predicate (task): predict
Subject: GeneClass Algorithm
Functional description: predicting differential gene expression starts with a candidate set of motifs μ


Sentences

1. We also show how to incorporate genome-wide protein-DNA binding data from ChIP chip experiments into the GeneClass algorithm, and we use an improved noise model for gene expression data [PMC 1810316].

2. The GeneClass algorithm for predicting differential gene expression starts with a candidate set of motifs; representing known or putative regulatory element sequence patterns and a candidate set of regulators or parentSS [PMC 1810316].

3. Target set: We extend the original GeneClass algorithm to use all target genes for which both motif and expression data is available [PMC 1810316].

Evaluated for their capability to be used for semantic description of a given bioinformatics resource

(0) irrelevant

(1) partially useful

(2) useful

HeatMapper: The HeatMapper tool has already proven to be very useful in several studies

Kalign: To compare Kalign to other MSA programs, the following test sets were used.

Cognitor: To add a new species to the COG system, the annotated protein sequences from the respective genome were compared to the proteins in the COG database by using the BLAST program and assigned to pre-existing COGs by using the COGNITOR program

Evaluation of semantic profiles

Two experiments:

  15 well-known resources with descriptions already available

  15 new resources

Evaluation of semantic profiles

Quality comparison of various components of resource description profiles from the two

experiments

3rd Module: Mining Semantic Networks from Literature

What next?

Good recall, poor precision

  context needs a better model

Mining parameter values

  sub-language of parameters

Candidate service/resource mentions

  an entity whose profile looks like a service

  comparison of semantic profiles

  network of services [ISMB 2009]

Do we have good service ontologies?

What next? (Proposed at BioHackathon 2010)

[Figure: Resource1 (Phylogenetic Tree), Resource2 (ClustalW Program) and Resource3 (Multialignment), linked by the relations "Generated by" and "Is used for" and stored in an RDF Store, with the type labels #Data and #Task.]

Phylogenetic trees are then generated by the ClustalW program by the neighbour-joining method [PMC1973088].

We also used the CLUSTALW program for multialignment as a control process [PMC434493].
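A minimal sketch of how the two relations extracted from the sentences above could be written out as triples for an RDF store. The base URI `http://example.org/escience/`, the predicate names and the helper functions are hypothetical; the output follows the simple N-Triples line format and uses only the Python standard library.

```python
import re

BASE = "http://example.org/escience/"  # hypothetical namespace

# (subject, relation, object) as read off the two example sentences.
TRIPLES = [
    ("Phylogenetic Tree", "generated by", "ClustalW Program"),
    ("ClustalW Program", "is used for", "Multialignment"),
]


def uri(label: str) -> str:
    """Turn a label into a bracketed URI under the hypothetical namespace."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", label.strip()).strip("_")
    return f"<{BASE}{slug}>"


def to_ntriples(triples) -> str:
    """Serialise (subject, predicate, object) labels as N-Triples lines."""
    return "\n".join(f"{uri(s)} {uri(p)} {uri(o)} ." for s, p, o in triples)


print(to_ntriples(TRIPLES))
```

Because ClustalW Program appears as the object of one triple and the subject of the other, the two statements join into a small network once loaded into a store.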

Conclusion

Literature mining approach to service description and annotation

Aims

  reduce curation efforts

  provide semantic synopses of services for the Semantic Web

Potential of text mining

  integration with other annotation approaches

  extracting the entire service context is still challenging

Related Selected Publications

Hammad Afzal, James Eales, Robert Stevens, Goran Nenadic (2010): Mining Semantic Networks of Bioinformatics Web Resources from the Literature, Journal of Biomedical Semantics.

Hammad Afzal, Robert Stevens, Goran Nenadic (2009): Mining Semantic Descriptions of Bioinformatics Web Resources from the Literature, 6th European Semantic Web Conference (ESWC) on the Semantic Web: Research and Applications. Heraklion, Crete, Greece, Springer-Verlag.

Hammad Afzal, Robert Stevens, Goran Nenadic (2008): Towards Semantic Annotation of Bioinformatics Services: Building a Controlled Vocabulary, Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008).

Thanks
