“guilt by association” as a search principle€¦ · 2 invited keynote at sigir2008 3 copyright...

31
1 “Guilt by Association” as a Search Principle Limsoon Wong Invited keynote at SIGIR2008 2 Copyright 2008 © Limsoon Wong A slight detour …

Upload: others

Post on 25-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

1

“Guilt by Association”as a Search Principle

Limsoon Wong

Invited keynote at SIGIR2008 2 Copyright 2008 © Limsoon Wong

A slight detour …

Page 2: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

2

Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong

Guilt by Association

• But do we know which ones are causal genes, which ones are surrogates, and which are noise?

Diagnostic ALL BM samples (n=327)

3σ-3σ -2σ -1σ 0 1σ 2σσ = std deviation from mean

Gen

es fo

r cla

ss

dist

inct

ion

(n=2

71)

TEL-AML1BCR-ABL

Hyperdiploid >50E2A-PBX1

MLL T-ALL Novel

Invited keynote at SIGIR2008 4 Copyright 2008 © Limsoon Wong

• Complete genomes are now available

• Knowing the genes is not enough to understand how biology functions

• Proteins, not genes, are responsible for many cellular activities

• Proteins function by interacting w/ other proteins and biomolecules

GENOME PROTEOME

“INTERACTOME”

Biology 101

Slide credit: See-Kiong Ng

Page 3: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

3

Invited keynote at SIGIR2008 5 Copyright 2008 © Limsoon Wong

Gene Regulatory Circuits

• Extract functional annotations

• Extract relationships between genes, proteins, processes, diseases, & drugs

Source: Miltenyi Biotec

• Predict functional annotations

• Predict relationships between genes, proteins, processes, diseases, & drugs

Invited keynote at SIGIR2008 6 Copyright 2008 © Limsoon Wong

Information Extraction & Retrieval:Challenges in Context of Biomedicine

Page 4: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

4

Invited keynote at SIGIR2008 7 Copyright 2008 © Limsoon Wong

A Typical Architecture

NLP/IE System Manual Curation/QA System

Grid of db in life sciences

User Dashboard

Invited keynote at SIGIR2008 8 Copyright 2008 © Limsoon Wong

Understandability:Is our databases or

search results understandable?

• Take a search on p53. You get >300k hits or some # like that on MEDLINE

• It is not feasible for anyone to go thru all of that to find what he wants! And this problem is growing bigger as MEDLINE doubles every 1-2 year.

• Need to organize the db and/or the search results to make it easier for users to understand or to browse the results

Page 5: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

5

Invited keynote at SIGIR2008 9 Copyright 2008 © Limsoon Wong

Completeness:Is the structure of

our databases expressive enough to capture critical

information explicitly?

• Take a key paper such as the Kohn paper that summarises current knowledge on p53

• Is there a (semi)structureddb that is able to capture all info in that paper explicitly?

• How well does this (semi-) structured database generalize to other similar type of papers?

Invited keynote at SIGIR2008 10 Copyright 2008 © Limsoon Wong

Extract Entities & Annotations• Nomenclature loosely

followed

• Freq use of conjunction and disjunction in bio names with multiple bio-entity names sharing one head noun

• Long descriptive names

• Names of genes & proteins used interchangeably

Zhou et al., BioCreAtIvE, 2004

Page 6: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

6

Invited keynote at SIGIR2008 11 Copyright 2008 © Limsoon Wong

Extract Relationships

• Sentences describing relationships tend to be complicated

• Domain knowledge is often needed for interaction template filling

Invited keynote at SIGIR2008 12 Copyright 2008 © Limsoon Wong

Organizing Retrieval Results

Credit: Molecular Connections

Page 7: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

7

Invited keynote at SIGIR2008 13 Copyright 2008 © Limsoon Wong

Organizing Retrieval Results

Credit: Molecular Connections

Invited keynote at SIGIR2008 14 Copyright 2008 © Limsoon Wong

Beyond IE & Retrieval:Predictions in Context of Biomedicine

Page 8: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

8

Invited keynote at SIGIR2008 15 Copyright 2008 © Limsoon Wong

Plan

• Abductive Foundation of “Guilt by Association”

• Issue of Chance Association

• Novel Forms of Association

• Fusion of Multiple Evidence of Association

• Dichotomy of knowing two entities are in some relationship and yet not knowing what that relationship is

Abductive Foundation of “Guilt by Association”:

A Protein Function Prediction Perspective

Page 9: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

9

Invited keynote at SIGIR2008 17 Copyright 2008 © Limsoon Wong

Function Assignment to Protein Sequence

• How do we attempt to assign a function to a new protein sequence?

SPSTNRKYPPLPVDKLEEEINRRMADDNKLFREEFNALPACPIQATCEAASKEENKEKNRYVNILPYDHSRVHLTPVEGVPDSDYINASFINGYQEKNKFIAAQGPKEETVNDFWRMIWEQNTATIVMVTNLKERKECKCAQYWPDQGCWTYGNVRVSVEDVTVLVDYTVRKFCIQQVGDVTNRKPQRLITQFHFTSWPDFGVPFTPIGMLKFLKKVKACNPQYAGAIVVHCSAGVGRTGTFVVIDAMLDMMHSERKVDVYGFVSRIRAQRCQMVQTDMQYVFIYQALLEHYLYGDTELEVT

Invited keynote at SIGIR2008 18 Copyright 2008 © Limsoon Wong

Parallels• Given a protein, determine

its functions

• Given a protein, find other proteins that share a common function with it

• Given a function, find all proteins having that function

• Given a document, determine what are the “things” it describes

• Given a document, find other documents that describe a common “thing” with it

• Given a “thing”, find all documents that describe that thing

Page 10: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

10

Invited keynote at SIGIR2008 19 Copyright 2008 © Limsoon Wong

A protein is a ...

• A protein is a large complex molecule made up of one or more chains of amino acids

• Protein performs a wide variety of activities in the cell

Invited keynote at SIGIR2008 20 Copyright 2008 © Limsoon Wong

Invariant and Abductive Reasoning• Function is determined

by 3D struct of protein & environment protein is in

• Constraints imposed by 3D struct & environment give rise to “invariant”properties observed in proteins having the ancestor with that function

⇒ Abductive reasoning– If those invariant

properties are seen in a protein, then the protein is homolog of this protein

⇒ “Guilt by association”

Hypothesis/Fact A

Entailment A B

Observation/Conclusion B

What is the parallel of the above in IR?

Page 11: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

11

Invited keynote at SIGIR2008 21 Copyright 2008 © Limsoon Wong

“Guilt by Association”

• Compare the target sequence T with sequences S1, …, Sn of known function in a database

• Determine which ones amongst S1, …, Sn are the mostly likely homologs of T

• Then assign to T the same function as these homologs

• Finally, confirm with suitable wet experiments

Invited keynote at SIGIR2008 22 Copyright 2008 © Limsoon Wong

Sequence Alignment: Poor Example

• Poor seq alignment shows few matched positions⇒ The two proteins are not likely to be homologous

No obvious match between Amicyanin and Ascorbate Oxidase

Page 12: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

12

Invited keynote at SIGIR2008 23 Copyright 2008 © Limsoon Wong

Sequence Alignment: Good Example

• Good alignment usually has clusters of extensive matched positions

⇒ The two proteins are likely to be homologous

good match between Amicyanin and unknown M. loti protein

Invited keynote at SIGIR2008 24 Copyright 2008 © Limsoon Wong

Guilt by Association of Seq SimilarityCompare T with seqs of known function in a db

Assign to T same function as homologs

Confirm with suitable wet experiments

Discard this functionas a candidate

Page 13: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

13

Invited keynote at SIGIR2008 25 Copyright 2008 © Limsoon Wong

Homologs obtained by BLAST

• Thus our example sequence could be a protein tyrosine phosphatase α (PTPα)

Issue of Chance Association:

A Twist in the Tale

Image credit: Shanti Christensen, http:// static.flickr.com/46/148437681_7f2dfa977e_m.jpg

Page 14: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

14

Invited keynote at SIGIR2008 27 Copyright 2008 © Limsoon Wong

Law of Large Numbers

• Suppose you are in a room with 365 other people

• Q: What is the prob that a specific person in the room has the same birthday as you?

• A: 1/365 = 0.3%

• Q: What is the prob that there is a person in the room having the same birthday as you?

• A: 1 – (364/365)365 = 63%

• Q: What is the prob that there are two persons in the room having the same birthday?

• A: 100%

Invited keynote at SIGIR2008 28 Copyright 2008 © Limsoon Wong

Interpretation of P-value

• Seq. comparison progs, e.g. BLAST, often associate a P-value to each hit

• P-value is interpreted as prob that a random seq has an equally good alignment

• Suppose the P-value of an alignment is 10−6

• If database has 107

seqs, then you expect 107 * 10−6 = 10 seqs in it that give an equally good alignment

⇒ Need to correct for database size if your seq comparison progdoes not do that!Note: P = 1 – e −E

Page 15: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

15

Invited keynote at SIGIR2008 29 Copyright 2008 © Limsoon Wong

Lightning Does Strike Twice!

• Roy Sullivan, a former park ranger from Virgina, was struck by lightning 7 times– 1942 (lost big-toe nail)– 1969 (lost eyebrows)– 1970 (left shoulder seared)– 1972 (hair set on fire)– 1973 (hair set on fire & legs seared)– 1976 (ankle injured)– 1977 (chest & stomach burned)

• September 1983, he committed suicideCartoon: Ron Hipschman

Data: David Hand

Invited keynote at SIGIR2008 30 Copyright 2008 © Limsoon Wong

Effect of Seq Compositional Bias

• One fourth of all residues in protein seqs occur in regions with biased amino acid composition

• Alignments of two such regions achieves high score purely due to segment composition

⇒While it is worth noting that two proteins contain similar low complexity regions, they are best excluded when constructing alignments

Source: NCBI

Page 16: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

16

Invited keynote at SIGIR2008 31 Copyright 2008 © Limsoon Wong

Effect of Sequence Length

Abagyan RA, Batalov S. Do aligned sequences share the same fold? J Mol Biol. 1997

Oct 17;273(1):355-68

Invited keynote at SIGIR2008 32 Copyright 2008 © Limsoon Wong

Parallels• P-value and E-value

• Compositionally biased regions

• Length, conserved site, transitive assignment, and other caveats

• Ranking measures? – Different concepts– Not necessarily same

effect

• Stop words?

• ???

Page 17: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

17

Novel Forms of Associations

Invited keynote at SIGIR2008 34 Copyright 2008 © Limsoon Wong

Important Unsolved Challenge

• What if there is no useful seq homolog?• Guilt by other types of association!

– Domain modeling (e.g., HMMPFAM)Similarity of phylogenetic profilesSimilarity of dissimilarities (e.g., SVM-PAIRWISE)

– Similarity of subcellular co-localization & other physico-chemico properties(e.g., PROTFUN)

– Similarity of gene expression profiles– Similarity of protein-protein interaction partners– …

Fusion of multiple types of info

Page 18: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

18

Invited keynote at SIGIR2008 35 Copyright 2008 © Limsoon Wong

Guilt by Association of Phylogenetic Profile

• A protein is not alone when performing its biological function

⇒Gene (and hence proteins) with identical patterns of occurrence across phyla tend to function together

Invited keynote at SIGIR2008 36 Copyright 2008 © Limsoon Wong

KEGGCOG

fra c

ti on

o f g

e ne

p ai r

sh a

v in g

ha m

min

g di

sta n

ce D

a nd

s ha r

e a

c om

mo n

pa t

h way

in K

EG

G/C

OG

hamming distance (D)

hamming distance X,Y= #lineages X occurs +

#lineages Y occurs –2 * #lineages X, Y occur

Phylogenetic Profiling: EvidenceWu et al., Bioinformatics, 19:1524--1530, 2003

• Proteins having low hamming distance (thus highly similar phylogenetic profiles) tend to share common pathways Exercise: Why do proteins having high

hamming distance also have this behaviour?

Page 19: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

19

Invited keynote at SIGIR2008 37 Copyright 2008 © Limsoon Wong

Guilt by Association of Dissimilarities

…Unknown1

…………

…Color = orange vs yellowSkin = rough vs smoothSize = small vs smallShape = round vs oblong

Color = orange vs orangeSkin = rough vs roughSize = small vs smallShape = round vs round

Orange2

…Color = red vs yellowSkin = smooth vs smoothSize = small vs smallShape = round vs oblong

Color = red vs orangeSkin = smooth vs roughSize = small vs smallShape = round vs round

Apple1

…Banana1Orange1

Differences of “unknown”to other fruits are same as

“apple” to other fruits

Color = red vs orangeSkin = smooth vs roughSize = small vs smallShape = round vs round

Color = red vs yellowSkin = smooth vs smoothSize = small vs smallShape = round vs oblong

“unknown” is an “apple”!

Invited keynote at SIGIR2008 38 Copyright 2008 © Limsoon Wong

SVM-Pairwise Framework

Training Data

S1

S2

S3

Testing Data

T1

T2

T3

Training Features

S1 S2 S3 …

S1 f11 f12 f13 …

S2 f21 f22 f23 …

S3 f31 f32 f33 …

… … … … …

Feature Generation

Trained SVM Model(Feature Weights)

Training

Testing Features

S1 S2 S3 …

T1 f11 f12 f13 …

T2 f21 f22 f23 …

T3 f31 f32 f33 …

… … … … …

Feature Generation

Support Vectors Machine

(Radial Basis Function Kernel)

Classification

DiscriminantScores

RBF Kernel

f31 is the local alignment score between S3 and S1

f31 is the local alignment score between T3 and S1

Image credit: Kenny Chua

Page 20: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

20

Invited keynote at SIGIR2008 39 Copyright 2008 © Limsoon Wong

Performance of SVM-Pairwise

• ROC: The area under the curve derived from plotting true positives as a function of false positives for various thresholds

Invited keynote at SIGIR2008 40 Copyright 2008 © Limsoon Wong

Parallels• Guilt by association of

genome phylogenetic profile

• Guilt by association of dissimilarities

• Two “things” are “equivalent”– If they occur in same

documents– If they are mutually

exclusive in documents they occur in?

• Is this a new association concept in IR?

Page 21: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

21

Fusion of Multiple Evidence of

Association

Invited keynote at SIGIR2008 42 Copyright 2008 © Limsoon Wong

Coverage of Data Sources

• > 90% of known protein annotations are suggested from at least one data source

• A large percentage (80%) of known protein annotations are suggested by 3 or more data sources

Data Sources Coverage(Biological Process)

0

5

10

15

20

25

30

0 1 2 3 4 5 6 7

Number of Data Sources

Perc

enta

ge o

f An

nota

tions

% of known annotations suggested by different number of data sources

Page 22: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

22

Invited keynote at SIGIR2008 43 Copyright 2008 © Limsoon Wong

Confidence of Overlapping Evidence

• Protein annotations suggested by more data sources are more likely to be correct

• Protein annotations suggested by 4 or more data sources are correct > 60% of the time

Precision of Annotations(Biological Process)

0

0.2

0.4

0.6

0.8

1

0 1 2 3 4 5 6

Number of Data Sources

Prec

isio

n of

Sug

gest

ed

Anno

tatio

ns

Fraction of annotations suggested from different number of datasources that are 

correct

Invited keynote at SIGIR2008 44 Copyright 2008 © Limsoon Wong

Difficulties w/ Information Fusion

• Differences in nature– E.g., sequence homology vs PPI are very different

relationships

• Differences in reliability– E.g., noisy datasets such as Y2H PPI and gene

expression

• Differences in scoring metrices– E.g., E-Score from BLAST vs Pearson correlation

between expression profiles

Page 23: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

23

Invited keynote at SIGIR2008 45 Copyright 2008 © Limsoon Wong

Integrated Weighted Averaging – Step 1

• Model a data source as undirected graph G = ⟨V,E⟩

– V is a set of vertices; each vertex reps a protein

– E is a set of edges; each edge (u , v) reps a relationship (e.g. seqsimilarity, interaction) betw proteins u and v

CDC34

CDC4

CDC53

CLN2

MET30

Invited keynote at SIGIR2008 46 Copyright 2008 © Limsoon Wong

Integrated Weighted Averaging – Step 2

• Combine graphs from different data sources to form a larger graph

Page 24: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

24

Invited keynote at SIGIR2008 47 Copyright 2008 © Limsoon Wong

Integrated Weighted Averaging – Step 3

• Estimate edge confidence from contributing data sources

• Predict function by observing which functions occur frequently in high-confidence neighbours

{FA, FB}{FB, FC}

{FA, FD}

?

( )( )∏∈

−−=vuDk

fvu fkpr,

,11,, ( )( )

∑∑

+

×=

u

u

Nvfvu

Nvfvuf

f r

rveuS

,,

,,

1)(

Invited keynote at SIGIR2008 48 Copyright 2008 © Limsoon Wong

Associations from Multiple Data Sources

52

PUBMED

BLAST

PFAMBIND

87524 252

144015,727 3,112

58,835 94

11,6601310,819

231,919

(12,967)

(19,808)

(61,786)

(15,220)

Page 25: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

25

Invited keynote at SIGIR2008 49 Copyright 2008 © Limsoon Wong

Precision vs Recall

00.10.20.30.40.50.60.70.80.9

1

0 0.2 0.4 0.6 0.8 1

Recall

Prec

isio

n

BINDPFAMPUBM EDBLAST_ALLBLAST_SGDALL SOURCES

Precision vs Recall

00.10.20.30.40.50.60.70.80.9

1

0 0.2 0.4 0.6 0.8 1

Recall

Prec

isio

n

BINDPFAMPUBM EDBLAST_ALLBLAST_SGDALL SOURCES

Precision vs Recall

00.10.20.30.40.50.60.70.80.9

1

0 0.2 0.4 0.6 0.8 1

Recall

Prec

isio

nBINDPFAMPUBM EDBLAST_ALLBLAST_SGDALL SOURCES

Molecular Function

Biological Process Cellular Component

Combining all data sources outperforms any individual data

source

Invited keynote at SIGIR2008 50 Copyright 2008 © Limsoon Wong

Parallels• Given a protein, determine

its function via “guilt by association of multiple evidence types”

– Sensitivity is improved– Precision is improved

• Given a document, determine the “thing” it talks about by multimodal IR

– The more media types talking about the same “thing” that this document talks about, the higher chance to determine accurately what this “thing” is

• Is this a new problem concept for multimodal IR?

Page 26: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

26

Dichotomy of knowing two entities are in some relationship and yet not knowing what

that relationship is

Invited keynote at SIGIR2008 52 Copyright 2008 © Limsoon Wong

Good Source for Evidence of “Guilt”

• IWA and other methods perform better when– Graph has fewer nodes with no annotated

neighbours – Unannotated nodes in graph are connected to a

greater number of annotated nodes

⇒A data source that is able to contribute a large # of association edges connecting to annotated proteins should provide the greatest gain in prediction accuracy

Page 27: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

27

Invited keynote at SIGIR2008 53 Copyright 2008 © Limsoon Wong

So Medline abstracts seem

like a good source of

association info

Gabow et al, BMC Bioinformatics, 9:198, 2008

But only 2% of nodes derived from Medline

are unannotated

And more disturbingly…

Invited keynote at SIGIR2008 54 Copyright 2008 © Limsoon Wong

A Dichotomy

• 80% of protein pairs that co-occurred in “enough” Medline abstracts have 1 function in common

• Yet precision-recall curve is far below that expectation!

Page 28: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

28

Invited keynote at SIGIR2008 55 Copyright 2008 © Limsoon Wong

Co-occurrence at too coarse a level?

0

200

400

600

800

1000

1200

1400

1600

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Abst

ract

Num

ber

Ratio of correct edges

Abstract Number vs Ratio of Correct Edges

>10 proteins9-10 proteins7-8 proteins5-6 proteins2-4 proteins

Half of the abstracts has less than 50% of their co-mentioned protein pairs sharing some function

0100002000030000400005000060000700008000090000

100000

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Sent

ence

Num

ber

Ratio of correct edges

Sentence Number vs Ratio of Correct edges

>10 proteins9-10 proteins7-8 proteins5-6 proteins2-4 proteins

Nearly all sentences have all their co-mentioned protein pairs sharing some function

Invited keynote at SIGIR2008 56 Copyright 2008 © Limsoon Wong

Dichotomy Remains

• Nearly all sentences have all their co-mentioned protein pairs sharing some function

• Yet the precision-recall curve is far below expectation! 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Pre

cisi

on

Recall

Precision vs Recall (Biological Process)

Basal Reference

Genus filtered

Genus sentence level

Segmented

Page 29: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

29

Invited keynote at SIGIR2008 57 Copyright 2008 © Limsoon Wong

Dichotomy Explained?

• An edge in the graph simply means an association between two proteins

⇒The two proteins have a function in common

• But each protein may have several functions• Don’t know which one is the function they have in

common

Parallel in IR: We know two documents are related, but we don’t know what that relationship is

Closing Remarks

Page 30: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

30

Invited keynote at SIGIR2008 59 Copyright 2008 © Limsoon Wong

Shades of Meanings

• GO Functional Annotation– Hierarchical– 3 Namespaces (molecular function, biological

process, cellular component)

Invited keynote at SIGIR2008 60 Copyright 2008 © Limsoon Wong

“Guilt by Association” as a Search Principle

• Founded on abduction based on invariants of biology & physics

• Many forms of associations– Sequence– Phylogenetic profile– Dissimilarities

• Fusion of Multiple Evidence of Association

• Issue of Chance Association

• Dichotomy of knowing two entities are in some relationship and yet not knowing what that relationship is

• Shades of meanings

Page 31: “Guilt by Association” as a Search Principle€¦ · 2 Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong Guilt by Association • But do we know which ones are causal

31

Invited keynote at SIGIR2008 61 Copyright 2008 © Limsoon Wong

Acknowledgements

• See-Kiong Ng• Molecular Connections Pvt Ltd

• Hon Nian CHUA• Zhihui LI