Copyright © 2005 by Limsoon Wong

Some Interesting Issues in Constructing Gene/Protein Networks

Limsoon Wong
Institute for Infocomm Research
Singapore

Issues

• Sound:
  – Are the contents of our databases correct?
  – Trying our hands at a data cleansing problem

• Complete:
  – Is the structure of our databases expressive enough to capture critical information explicitly?

• Understandable:
  – Are our databases or search results understandable?

• Other issues relating to NLP/IE

Soundness: Are the contents of our databases correct?

This part is based on the work of Judice Koh and Vladimir Brusic

Categories of errors found

Example Spelling Errors

[Figure: taxonomy of error categories by level (attribute, record, single-source database, multiple-source database): invalid values, ambiguity, incompatible schema, annotation error, dubious sequences, sequence redundancy, data provenance flaws, cross-annotation error, sequence structure violation, vector-contaminated sequence, erroneous data transformation. Entries shown on this slide: spelling errors, format violation.]

• Usually typographical errors

• Occur in different fields of the record

• We identified 569 possibly misspelled words affecting up to 20,505 nucleotide records in Entrez.

Misspelling: asociated
Correction: associated
Context of the misspelling:
  EMBL:Y18050 E.faecium pbp5 gene
  TITLE Modification of penicillin-binding protein 5 asociated with highlevel ampicillin resistance in Enterococcus faecium
  gi|1143442|emb|X92687.1|EFPBP5G

Misspelling: tranmembrane
Correction: transmembrane
Context of the misspelling:
  Swiss-Prot:P03385 Env polyprotein precursor
  DEFINITION Env polyprotein precursor [Contains: Surface protein (SU) (GP70); Tranmembrane protein (TM) (p15E); R protein].
  gi|119478|sp|P03385|ENV_MLVMO

Misspelling: Cassete
Correction: Cassette
Context of the misspelling:
  Patent Database:A76783 Sequence 11 from Patent WO9315210
  CDS <1..150 /note="gene cassete encoding intercalating jun-zipper and linker"
  gi|6088638|emb|A76783.1||pat|WO|9315210|11[6088638]

Misspelling: Immunoglobin
Correction: Immunoglobulin
Context of the misspelling:
  GenBank:AAD26534 nectin-1 [Rattus norvegicus]
  TITLE Nectin/PRR: An Immunogloblin-like Cell Adhesion Molecule Recruited to Cadherin-based Adherens Junctions through Interaction with Afadin, a PDZ Domain-containing Protein
  gi|4590334|gb|AAD26534.1
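A dictionary lookup for misspellings like those on this slide might be sketched as follows. This is only an illustration: the vocabulary, the function name `suggest_corrections`, the minimum word length, and the similarity cutoff are all assumptions, and the sketch uses Python's standard `difflib` rather than whatever method produced the counts on the slide.

```python
import difflib
import re

# Hypothetical reference vocabulary; a real run would need a curated
# biomedical lexicon rather than this handful of words.
VOCABULARY = [
    "associated", "transmembrane", "cassette", "immunoglobulin",
    "protein", "precursor", "gene", "encoding", "resistance",
]


def suggest_corrections(text, vocabulary=VOCABULARY, cutoff=0.85):
    """Map each unknown word in `text` to its closest vocabulary entry.

    Words already in the vocabulary, words shorter than five letters,
    and words with no entry above `cutoff` are left out of the result.
    """
    known = set(vocabulary)
    suggestions = {}
    for word in re.findall(r"[A-Za-z]+", text):
        lower = word.lower()
        if lower in known or len(lower) < 5:
            continue
        matches = difflib.get_close_matches(lower, vocabulary, n=1, cutoff=cutoff)
        if matches:
            suggestions[word] = matches[0]
    return suggestions


if __name__ == "__main__":
    note = "gene cassete encoding intercalating jun-zipper and linker"
    print(suggest_corrections(note))
```

A high cutoff keeps unrelated words from being "corrected", at the price of missing heavier misspellings; flagged words would still need manual review.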

Example Overlapping Intron/Exon Errors

[Figure: taxonomy of error categories by level (attribute, record, single-source database, multiple-source database): invalid values, ambiguity, incompatible schema, annotation error, dubious sequences, sequence redundancy, data provenance flaws, cross-annotation error, sequence structure violation, vector-contaminated sequence, erroneous data transformation. Entry shown on this slide: overlapping intron/exon.]

• Syn7 gene of putative polyketide synthase in NCBI TPA record BN000507 has overlapping intron 5 and exon 6.

• rpb7+ RNA polymerase II subunit in GenBank record AF055916 has overlapping exon 1 and exon 2.
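A check for this kind of sequence structure violation can be sketched as an interval-overlap test over a record's annotated features. The `Feature` type, the function name `find_overlaps`, and the coordinates below are hypothetical; they are not taken from BN000507 or AF055916.

```python
from typing import List, NamedTuple, Tuple


class Feature(NamedTuple):
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def find_overlaps(features: List[Feature]) -> List[Tuple[str, str]]:
    """Return the names of every pair of features whose ranges overlap."""
    ordered = sorted(features, key=lambda f: (f.start, f.end))
    overlaps = []
    for i, left in enumerate(ordered):
        for right in ordered[i + 1:]:
            if right.start > left.end:
                # Sorted by start, so no later feature can overlap `left`.
                break
            overlaps.append((left.name, right.name))
    return overlaps


if __name__ == "__main__":
    # Made-up coordinates: exon 1 runs into exon 2.
    record = [
        Feature("exon 1", 1, 120),
        Feature("exon 2", 100, 250),
        Feature("intron 2", 251, 300),
    ]
    print(find_overlaps(record))
```

Exons and introns of one gene are expected to tile the sequence without overlap, so any pair reported here is a candidate annotation error.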

Example Seqs w/ Identical Info

[Figure: taxonomy of error categories by level (attribute, record, single-source database, multiple-source database): invalid values, ambiguity, incompatible schema, annotation error, dubious sequences, sequence redundancy, data provenance flaws, cross-annotation error, sequence structure violation, vector-contaminated sequence, erroneous data transformation. Entries shown on this slide: replication of sequence information, different views, overlapping annotations of the same sequence.]

• Submission of the same sequence to different databases

• Repeated submission of the same sequence to the same database

• Initially submitted by different groups

• Protein sequences may be translated from duplicate nucleotide sequences

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11692005&dopt=GenPept

Soundness: Trying our hands at a data cleansing problem

This part is based on the work of Judice Koh

[Figure: error types by level, with candidate cleansing methods]

Levels: attribute, record, single-source database, multi-source database

Error types:
• Concatenated values
• Spelling errors
• Format violation
• Synonyms
• Homonyms/Abbreviations
• Misuse of fields
• Mis-fielded values
• Features do not correspond with sequence
• Putative features
• Undersized sequences
• Uninformative sequences
• Replication of sequence information
• Different views
• Overlapping annotations of the same sequence
• Fragments
• Sequence structure violation
• Vector contaminated sequences
• Cross-annotation error

Cleansing methods:
• Schema remapping
• Sequence structure parser
• Dictionary lookup
• Integrity constraints
• Duplicate detection
• Comparative analysis
• Vector screening

Dataset

Source: Entrez (GenBank, GenPept, SwissProt, DDBJ, PIR, PDB)

• Scorpion venom dataset containing 520 records
  – Query: scorpion AND (venom OR toxin)
  – Expert annotation: 251 duplicate pairs

• Snake PLA2 venom dataset containing 780 records
  – Query: serpentes AND venom AND PLA2
  – Expert annotation: 444 duplicate pairs

• 695 duplicate pairs are collectively identified.

Results

Rule 1: S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%)
Identical sequences with the same sequence length and not originating from PDB are 99.7% likely to be duplicates.

Rule 2: S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%)
Identical sequences with the same sequence length and of the same species are 97.1% likely to be duplicates.

Rule 3: S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%)
Identical sequences with the same sequence length, of the same species and not originating from PDB are 96.8% likely to be duplicates.

What else do we learn?

The definitions of the sequence records do not play a role in identifying duplicate records.
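To make the rule notation concrete, here is a minimal sketch that checks a pair of records against Rules 1 and 3 as printed. The record layout (`db`, `species`, `seq`), the function names, and the sample records are hypothetical. The feature definitions are assumptions read off the slide's wording; in particular, M(PDB) is taken here to be 1 only when both records come from PDB.

```python
# Conditions and confidences copied from Rules 1 and 3 on the slide.
RULES = {
    "Rule 1": ({"S(Seq)": 1, "N(Seq Length)": 1, "M(PDB)": 0}, 0.997),
    "Rule 3": (
        {"S(Seq)": 1, "N(Seq Length)": 1, "M(Species)": 1, "M(PDB)": 0},
        0.968,
    ),
}


def pair_features(a, b):
    """Compare two records; each feature is 1 on a match, else 0 (assumed)."""
    return {
        "S(Seq)": int(a["seq"] == b["seq"]),
        "N(Seq Length)": int(len(a["seq"]) == len(b["seq"])),
        "M(Species)": int(a["species"] == b["species"]),
        # Assumption: 1 only when both records originate from PDB.
        "M(PDB)": int(a["db"] == "PDB" and b["db"] == "PDB"),
    }


def matching_rules(a, b, rules=RULES):
    """Return (rule name, confidence) for every rule the pair satisfies."""
    feats = pair_features(a, b)
    return [
        (name, confidence)
        for name, (condition, confidence) in rules.items()
        if all(feats[key] == value for key, value in condition.items())
    ]


if __name__ == "__main__":
    # Made-up records for illustration only.
    r1 = {"db": "GenPept", "species": "species A", "seq": "MKTLLV"}
    r2 = {"db": "SwissProt", "species": "species A", "seq": "MKTLLV"}
    print(matching_rules(r1, r2))
```

A pair that fires a rule is only a likely duplicate at the stated confidence, so the output is best treated as a ranked list for expert review.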

Completeness: Is the structure of our databases expressive enough to capture critical information explicitly?

Expressive Power

• Take a key paper such as the Kohn paper that summarises current knowledge on p53 regulation.

• Is there a structured database that is able to capture all info in that paper explicitly?

• Is there a semi-structured database that is able to capture all info in that paper explicitly?

• How well does this (semi-)structured database generalize to other papers of a similar type?

Understandability: Are our databases or search results understandable?

Self-Organization

• Take a search on p53. You will get on the order of 300k hits on MEDLINE.

• It is not feasible for anyone to go through all of that to find what they want! And this problem is growing as MEDLINE doubles every 1-2 years.

• Need to organize the database and/or the search results into a hierarchy or “semantic” net to make it easier for users to understand or browse the results

• How do we define this hierarchy/net?

• Can this hierarchy/net be self-organized?

Problems relating to NLP/IE

This part is mostly based on work of Chris Tan and See-Kiong Ng

Handling full-length papers

• Source document structure parsing
• Hyper-linked file tracking
• Figure and table processing
• Special symbol handling

Information retrieval

• Document and sentence retrieval
• Relevant interaction filtering

Bio name recognition

• Nomenclature loosely followed
• Frequent use of conjunction and disjunction in bio names, with multiple bio-entity names sharing one head noun
• Long descriptive names
• Names of genes and proteins used interchangeably
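The shared-head-noun problem can be illustrated with a deliberately naive sketch that expands a coordinated phrase into one name per entity. The function name `expand_shared_head` and the example phrases are hypothetical, and the regular expression assumes a single-token head noun, which real bio names often violate.

```python
import re

# Separators between coordinated modifiers: commas (with an optional
# "and"/"or" after them) or a bare "and"/"or".
_SEPARATOR = r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+"


def expand_shared_head(phrase):
    """Expand 'A, B and C head' into ['A head', 'B head', 'C head'].

    Phrases without a conjunction or disjunction are returned unchanged.
    """
    phrase = phrase.strip()
    match = re.fullmatch(
        r"(?P<mods>.+?\s+(?:and|or)\s+\S+)\s+(?P<head>\S+)", phrase
    )
    if not match:
        return [phrase]
    modifiers = re.split(_SEPARATOR, match.group("mods"))
    return [f"{mod} {match.group('head')}" for mod in modifiers if mod]


if __name__ == "__main__":
    print(expand_shared_head("IL-2 and IL-4 receptors"))
```

This heuristic will over-expand phrases where the conjunction joins two complete names, which is one reason name recognition in this setting needs more than pattern matching.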

Bio-interaction extraction

• Inherent complexity of biological interactions
  – Sentences describing them also tend to be complicated

Bio-interaction extraction

• Domain knowledge is often needed for interaction template filling

Extraction of other relevant info

• Contextual information
  – Species, cell type, cellular localisation, etc.

• Negative information

• Speculative & incomplete facts

Information integration

• Bio-name mapping

• Bio-interaction mapping
  – How do you know two complex sentences are talking about the same interaction?

Resource for training & benchmarking

• Is there such a good resource, especially for the more complex tasks?

Acknowledgements

Data Cleansing: Judice Koh, Vladimir Brusic, Mong Li Lee, Asif M. Khan, Paul T.J. Tan, Heiny Tan, Kenneth Lee, Wilson Goh, Songsak Tongchusak, Kavitha Gopalakrishnan

NLP/IE Issues: See-Kiong Ng, Chris Tan