“guilt by association” as a search principle€¦ · 2 invited keynote at sigir2008 3 copyright...
TRANSCRIPT
1
“Guilt by Association”as a Search Principle
Limsoon Wong
Invited keynote at SIGIR2008 2 Copyright 2008 © Limsoon Wong
A slight detour …
2
Invited keynote at SIGIR2008 3 Copyright 2008 © Limsoon Wong
Guilt by Association
• But do we know which ones are causal genes, which ones are surrogates, and which are noise?
Diagnostic ALL BM samples (n=327)
3σ-3σ -2σ -1σ 0 1σ 2σσ = std deviation from mean
Gen
es fo
r cla
ss
dist
inct
ion
(n=2
71)
TEL-AML1BCR-ABL
Hyperdiploid >50E2A-PBX1
MLL T-ALL Novel
Invited keynote at SIGIR2008 4 Copyright 2008 © Limsoon Wong
• Complete genomes are now available
• Knowing the genes is not enough to understand how biology functions
• Proteins, not genes, are responsible for many cellular activities
• Proteins function by interacting w/ other proteins and biomolecules
GENOME PROTEOME
“INTERACTOME”
Biology 101
Slide credit: See-Kiong Ng
3
Invited keynote at SIGIR2008 5 Copyright 2008 © Limsoon Wong
Gene Regulatory Circuits
• Extract functional annotations
• Extract relationships between genes, proteins, processes, diseases, & drugs
Source: Miltenyi Biotec
• Predict functional annotations
• Predict relationships between genes, proteins, processes, diseases, & drugs
Invited keynote at SIGIR2008 6 Copyright 2008 © Limsoon Wong
Information Extraction & Retrieval:Challenges in Context of Biomedicine
4
Invited keynote at SIGIR2008 7 Copyright 2008 © Limsoon Wong
A Typical Architecture
NLP/IE System Manual Curation/QA System
Grid of db in life sciences
User Dashboard
Invited keynote at SIGIR2008 8 Copyright 2008 © Limsoon Wong
Understandability:Is our databases or
search results understandable?
• Take a search on p53. You get >300k hits or some # like that on MEDLINE
• It is not feasible for anyone to go thru all of that to find what he wants! And this problem is growing bigger as MEDLINE doubles every 1-2 year.
• Need to organize the db and/or the search results to make it easier for users to understand or to browse the results
5
Invited keynote at SIGIR2008 9 Copyright 2008 © Limsoon Wong
Completeness:Is the structure of
our databases expressive enough to capture critical
information explicitly?
• Take a key paper such as the Kohn paper that summarises current knowledge on p53
• Is there a (semi)structureddb that is able to capture all info in that paper explicitly?
• How well does this (semi-) structured database generalize to other similar type of papers?
Invited keynote at SIGIR2008 10 Copyright 2008 © Limsoon Wong
Extract Entities & Annotations• Nomenclature loosely
followed
• Freq use of conjunction and disjunction in bio names with multiple bio-entity names sharing one head noun
• Long descriptive names
• Names of genes & proteins used interchangeably
Zhou et al., BioCreAtIvE, 2004
6
Invited keynote at SIGIR2008 11 Copyright 2008 © Limsoon Wong
Extract Relationships
• Sentences describing relationships tend to be complicated
• Domain knowledge is often needed for interaction template filling
Invited keynote at SIGIR2008 12 Copyright 2008 © Limsoon Wong
Organizing Retrieval Results
Credit: Molecular Connections
7
Invited keynote at SIGIR2008 13 Copyright 2008 © Limsoon Wong
Organizing Retrieval Results
Credit: Molecular Connections
Invited keynote at SIGIR2008 14 Copyright 2008 © Limsoon Wong
Beyond IE & Retrieval:Predictions in Context of Biomedicine
8
Invited keynote at SIGIR2008 15 Copyright 2008 © Limsoon Wong
Plan
• Abductive Foundation of “Guilt by Association”
• Issue of Chance Association
• Novel Forms of Association
• Fusion of Multiple Evidence of Association
• Dichotomy of knowing two entities are in some relationship and yet not knowing what that relationship is
Abductive Foundation of “Guilt by Association”:
A Protein Function Prediction Perspective
9
Invited keynote at SIGIR2008 17 Copyright 2008 © Limsoon Wong
Function Assignment to Protein Sequence
• How do we attempt to assign a function to a new protein sequence?
SPSTNRKYPPLPVDKLEEEINRRMADDNKLFREEFNALPACPIQATCEAASKEENKEKNRYVNILPYDHSRVHLTPVEGVPDSDYINASFINGYQEKNKFIAAQGPKEETVNDFWRMIWEQNTATIVMVTNLKERKECKCAQYWPDQGCWTYGNVRVSVEDVTVLVDYTVRKFCIQQVGDVTNRKPQRLITQFHFTSWPDFGVPFTPIGMLKFLKKVKACNPQYAGAIVVHCSAGVGRTGTFVVIDAMLDMMHSERKVDVYGFVSRIRAQRCQMVQTDMQYVFIYQALLEHYLYGDTELEVT
Invited keynote at SIGIR2008 18 Copyright 2008 © Limsoon Wong
Parallels• Given a protein, determine
its functions
• Given a protein, find other proteins that share a common function with it
• Given a function, find all proteins having that function
• Given a document, determine what are the “things” it describes
• Given a document, find other documents that describe a common “thing” with it
• Given a “thing”, find all documents that describe that thing
10
Invited keynote at SIGIR2008 19 Copyright 2008 © Limsoon Wong
A protein is a ...
• A protein is a large complex molecule made up of one or more chains of amino acids
• Protein performs a wide variety of activities in the cell
Invited keynote at SIGIR2008 20 Copyright 2008 © Limsoon Wong
Invariant and Abductive Reasoning• Function is determined
by 3D struct of protein & environment protein is in
• Constraints imposed by 3D struct & environment give rise to “invariant”properties observed in proteins having the ancestor with that function
⇒ Abductive reasoning– If those invariant
properties are seen in a protein, then the protein is homolog of this protein
⇒ “Guilt by association”
Hypothesis/Fact A
Entailment A B
Observation/Conclusion B
What is the parallel of the above in IR?
11
Invited keynote at SIGIR2008 21 Copyright 2008 © Limsoon Wong
“Guilt by Association”
• Compare the target sequence T with sequences S1, …, Sn of known function in a database
• Determine which ones amongst S1, …, Sn are the mostly likely homologs of T
• Then assign to T the same function as these homologs
• Finally, confirm with suitable wet experiments
Invited keynote at SIGIR2008 22 Copyright 2008 © Limsoon Wong
Sequence Alignment: Poor Example
• Poor seq alignment shows few matched positions⇒ The two proteins are not likely to be homologous
No obvious match between Amicyanin and Ascorbate Oxidase
12
Invited keynote at SIGIR2008 23 Copyright 2008 © Limsoon Wong
Sequence Alignment: Good Example
• Good alignment usually has clusters of extensive matched positions
⇒ The two proteins are likely to be homologous
good match between Amicyanin and unknown M. loti protein
Invited keynote at SIGIR2008 24 Copyright 2008 © Limsoon Wong
Guilt by Association of Seq SimilarityCompare T with seqs of known function in a db
Assign to T same function as homologs
Confirm with suitable wet experiments
Discard this functionas a candidate
13
Invited keynote at SIGIR2008 25 Copyright 2008 © Limsoon Wong
Homologs obtained by BLAST
• Thus our example sequence could be a protein tyrosine phosphatase α (PTPα)
Issue of Chance Association:
A Twist in the Tale
Image credit: Shanti Christensen, http:// static.flickr.com/46/148437681_7f2dfa977e_m.jpg
14
Invited keynote at SIGIR2008 27 Copyright 2008 © Limsoon Wong
Law of Large Numbers
• Suppose you are in a room with 365 other people
• Q: What is the prob that a specific person in the room has the same birthday as you?
• A: 1/365 = 0.3%
• Q: What is the prob that there is a person in the room having the same birthday as you?
• A: 1 – (364/365)365 = 63%
• Q: What is the prob that there are two persons in the room having the same birthday?
• A: 100%
Invited keynote at SIGIR2008 28 Copyright 2008 © Limsoon Wong
Interpretation of P-value
• Seq. comparison progs, e.g. BLAST, often associate a P-value to each hit
• P-value is interpreted as prob that a random seq has an equally good alignment
• Suppose the P-value of an alignment is 10−6
• If database has 107
seqs, then you expect 107 * 10−6 = 10 seqs in it that give an equally good alignment
⇒ Need to correct for database size if your seq comparison progdoes not do that!Note: P = 1 – e −E
15
Invited keynote at SIGIR2008 29 Copyright 2008 © Limsoon Wong
Lightning Does Strike Twice!
• Roy Sullivan, a former park ranger from Virgina, was struck by lightning 7 times– 1942 (lost big-toe nail)– 1969 (lost eyebrows)– 1970 (left shoulder seared)– 1972 (hair set on fire)– 1973 (hair set on fire & legs seared)– 1976 (ankle injured)– 1977 (chest & stomach burned)
• September 1983, he committed suicideCartoon: Ron Hipschman
Data: David Hand
Invited keynote at SIGIR2008 30 Copyright 2008 © Limsoon Wong
Effect of Seq Compositional Bias
• One fourth of all residues in protein seqs occur in regions with biased amino acid composition
• Alignments of two such regions achieves high score purely due to segment composition
⇒While it is worth noting that two proteins contain similar low complexity regions, they are best excluded when constructing alignments
Source: NCBI
16
Invited keynote at SIGIR2008 31 Copyright 2008 © Limsoon Wong
Effect of Sequence Length
Abagyan RA, Batalov S. Do aligned sequences share the same fold? J Mol Biol. 1997
Oct 17;273(1):355-68
Invited keynote at SIGIR2008 32 Copyright 2008 © Limsoon Wong
Parallels• P-value and E-value
• Compositionally biased regions
• Length, conserved site, transitive assignment, and other caveats
• Ranking measures? – Different concepts– Not necessarily same
effect
• Stop words?
• ???
17
Novel Forms of Associations
Invited keynote at SIGIR2008 34 Copyright 2008 © Limsoon Wong
Important Unsolved Challenge
• What if there is no useful seq homolog?• Guilt by other types of association!
– Domain modeling (e.g., HMMPFAM)Similarity of phylogenetic profilesSimilarity of dissimilarities (e.g., SVM-PAIRWISE)
– Similarity of subcellular co-localization & other physico-chemico properties(e.g., PROTFUN)
– Similarity of gene expression profiles– Similarity of protein-protein interaction partners– …
Fusion of multiple types of info
18
Invited keynote at SIGIR2008 35 Copyright 2008 © Limsoon Wong
Guilt by Association of Phylogenetic Profile
• A protein is not alone when performing its biological function
⇒Gene (and hence proteins) with identical patterns of occurrence across phyla tend to function together
Invited keynote at SIGIR2008 36 Copyright 2008 © Limsoon Wong
KEGGCOG
fra c
ti on
o f g
e ne
p ai r
sh a
v in g
ha m
min
g di
sta n
ce D
a nd
s ha r
e a
c om
mo n
pa t
h way
in K
EG
G/C
OG
hamming distance (D)
hamming distance X,Y= #lineages X occurs +
#lineages Y occurs –2 * #lineages X, Y occur
Phylogenetic Profiling: EvidenceWu et al., Bioinformatics, 19:1524--1530, 2003
• Proteins having low hamming distance (thus highly similar phylogenetic profiles) tend to share common pathways Exercise: Why do proteins having high
hamming distance also have this behaviour?
19
Invited keynote at SIGIR2008 37 Copyright 2008 © Limsoon Wong
Guilt by Association of Dissimilarities
…Unknown1
…………
…Color = orange vs yellowSkin = rough vs smoothSize = small vs smallShape = round vs oblong
Color = orange vs orangeSkin = rough vs roughSize = small vs smallShape = round vs round
Orange2
…Color = red vs yellowSkin = smooth vs smoothSize = small vs smallShape = round vs oblong
Color = red vs orangeSkin = smooth vs roughSize = small vs smallShape = round vs round
Apple1
…Banana1Orange1
Differences of “unknown”to other fruits are same as
“apple” to other fruits
Color = red vs orangeSkin = smooth vs roughSize = small vs smallShape = round vs round
Color = red vs yellowSkin = smooth vs smoothSize = small vs smallShape = round vs oblong
“unknown” is an “apple”!
Invited keynote at SIGIR2008 38 Copyright 2008 © Limsoon Wong
SVM-Pairwise Framework
Training Data
S1
S2
S3
…
Testing Data
T1
T2
T3
…
Training Features
S1 S2 S3 …
S1 f11 f12 f13 …
S2 f21 f22 f23 …
S3 f31 f32 f33 …
… … … … …
Feature Generation
Trained SVM Model(Feature Weights)
Training
Testing Features
S1 S2 S3 …
T1 f11 f12 f13 …
T2 f21 f22 f23 …
T3 f31 f32 f33 …
… … … … …
Feature Generation
Support Vectors Machine
(Radial Basis Function Kernel)
Classification
DiscriminantScores
RBF Kernel
f31 is the local alignment score between S3 and S1
f31 is the local alignment score between T3 and S1
Image credit: Kenny Chua
20
Invited keynote at SIGIR2008 39 Copyright 2008 © Limsoon Wong
Performance of SVM-Pairwise
• ROC: The area under the curve derived from plotting true positives as a function of false positives for various thresholds
Invited keynote at SIGIR2008 40 Copyright 2008 © Limsoon Wong
Parallels• Guilt by association of
genome phylogenetic profile
• Guilt by association of dissimilarities
• Two “things” are “equivalent”– If they occur in same
documents– If they are mutually
exclusive in documents they occur in?
• Is this a new association concept in IR?
21
Fusion of Multiple Evidence of
Association
Invited keynote at SIGIR2008 42 Copyright 2008 © Limsoon Wong
Coverage of Data Sources
• > 90% of known protein annotations are suggested from at least one data source
• A large percentage (80%) of known protein annotations are suggested by 3 or more data sources
Data Sources Coverage(Biological Process)
0
5
10
15
20
25
30
0 1 2 3 4 5 6 7
Number of Data Sources
Perc
enta
ge o
f An
nota
tions
% of known annotations suggested by different number of data sources
22
Invited keynote at SIGIR2008 43 Copyright 2008 © Limsoon Wong
Confidence of Overlapping Evidence
• Protein annotations suggested by more data sources are more likely to be correct
• Protein annotations suggested by 4 or more data sources are correct > 60% of the time
Precision of Annotations(Biological Process)
0
0.2
0.4
0.6
0.8
1
0 1 2 3 4 5 6
Number of Data Sources
Prec
isio
n of
Sug
gest
ed
Anno
tatio
ns
Fraction of annotations suggested from different number of datasources that are
correct
Invited keynote at SIGIR2008 44 Copyright 2008 © Limsoon Wong
Difficulties w/ Information Fusion
• Differences in nature– E.g., sequence homology vs PPI are very different
relationships
• Differences in reliability– E.g., noisy datasets such as Y2H PPI and gene
expression
• Differences in scoring metrices– E.g., E-Score from BLAST vs Pearson correlation
between expression profiles
23
Invited keynote at SIGIR2008 45 Copyright 2008 © Limsoon Wong
Integrated Weighted Averaging – Step 1
• Model a data source as undirected graph G = ⟨V,E⟩
– V is a set of vertices; each vertex reps a protein
– E is a set of edges; each edge (u , v) reps a relationship (e.g. seqsimilarity, interaction) betw proteins u and v
CDC34
CDC4
CDC53
CLN2
MET30
Invited keynote at SIGIR2008 46 Copyright 2008 © Limsoon Wong
Integrated Weighted Averaging – Step 2
• Combine graphs from different data sources to form a larger graph
24
Invited keynote at SIGIR2008 47 Copyright 2008 © Limsoon Wong
Integrated Weighted Averaging – Step 3
• Estimate edge confidence from contributing data sources
• Predict function by observing which functions occur frequently in high-confidence neighbours
{FA, FB}{FB, FC}
{FA, FD}
?
( )( )∏∈
−−=vuDk
fvu fkpr,
,11,, ( )( )
∑∑
∈
∈
+
×=
u
u
Nvfvu
Nvfvuf
f r
rveuS
,,
,,
1)(
Invited keynote at SIGIR2008 48 Copyright 2008 © Limsoon Wong
Associations from Multiple Data Sources
52
PUBMED
BLAST
PFAMBIND
87524 252
144015,727 3,112
58,835 94
11,6601310,819
231,919
(12,967)
(19,808)
(61,786)
(15,220)
25
Invited keynote at SIGIR2008 49 Copyright 2008 © Limsoon Wong
Precision vs Recall
00.10.20.30.40.50.60.70.80.9
1
0 0.2 0.4 0.6 0.8 1
Recall
Prec
isio
n
BINDPFAMPUBM EDBLAST_ALLBLAST_SGDALL SOURCES
Precision vs Recall
00.10.20.30.40.50.60.70.80.9
1
0 0.2 0.4 0.6 0.8 1
Recall
Prec
isio
n
BINDPFAMPUBM EDBLAST_ALLBLAST_SGDALL SOURCES
Precision vs Recall
00.10.20.30.40.50.60.70.80.9
1
0 0.2 0.4 0.6 0.8 1
Recall
Prec
isio
nBINDPFAMPUBM EDBLAST_ALLBLAST_SGDALL SOURCES
Molecular Function
Biological Process Cellular Component
Combining all data sources outperforms any individual data
source
Invited keynote at SIGIR2008 50 Copyright 2008 © Limsoon Wong
Parallels• Given a protein, determine
its function via “guilt by association of multiple evidence types”
– Sensitivity is improved– Precision is improved
• Given a document, determine the “thing” it talks about by multimodal IR
– The more media types talking about the same “thing” that this document talks about, the higher chance to determine accurately what this “thing” is
• Is this a new problem concept for multimodal IR?
26
Dichotomy of knowing two entities are in some relationship and yet not knowing what
that relationship is
Invited keynote at SIGIR2008 52 Copyright 2008 © Limsoon Wong
Good Source for Evidence of “Guilt”
• IWA and other methods perform better when– Graph has fewer nodes with no annotated
neighbours – Unannotated nodes in graph are connected to a
greater number of annotated nodes
⇒A data source that is able to contribute a large # of association edges connecting to annotated proteins should provide the greatest gain in prediction accuracy
27
Invited keynote at SIGIR2008 53 Copyright 2008 © Limsoon Wong
So Medline abstracts seem
like a good source of
association info
Gabow et al, BMC Bioinformatics, 9:198, 2008
But only 2% of nodes derived from Medline
are unannotated
And more disturbingly…
Invited keynote at SIGIR2008 54 Copyright 2008 © Limsoon Wong
A Dichotomy
• 80% of protein pairs that co-occurred in “enough” Medline abstracts have 1 function in common
• Yet precision-recall curve is far below that expectation!
28
Invited keynote at SIGIR2008 55 Copyright 2008 © Limsoon Wong
Co-occurrence at too coarse a level?
0
200
400
600
800
1000
1200
1400
1600
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Abst
ract
Num
ber
Ratio of correct edges
Abstract Number vs Ratio of Correct Edges
>10 proteins9-10 proteins7-8 proteins5-6 proteins2-4 proteins
Half of the abstracts has less than 50% of their co-mentioned protein pairs sharing some function
0100002000030000400005000060000700008000090000
100000
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Sent
ence
Num
ber
Ratio of correct edges
Sentence Number vs Ratio of Correct edges
>10 proteins9-10 proteins7-8 proteins5-6 proteins2-4 proteins
Nearly all sentences have all their co-mentioned protein pairs sharing some function
Invited keynote at SIGIR2008 56 Copyright 2008 © Limsoon Wong
Dichotomy Remains
• Nearly all sentences have all their co-mentioned protein pairs sharing some function
• Yet the precision-recall curve is far below expectation! 0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Pre
cisi
on
Recall
Precision vs Recall (Biological Process)
Basal Reference
Genus filtered
Genus sentence level
Segmented
29
Invited keynote at SIGIR2008 57 Copyright 2008 © Limsoon Wong
Dichotomy Explained?
• An edge in the graph simply means an association between two proteins
⇒The two proteins have a function in common
• But each protein may have several functions• Don’t know which one is the function they have in
common
Parallel in IR: We know two documents are related, but we don’t know what that relationship is
Closing Remarks
30
Invited keynote at SIGIR2008 59 Copyright 2008 © Limsoon Wong
Shades of Meanings
• GO Functional Annotation– Hierarchical– 3 Namespaces (molecular function, biological
process, cellular component)
Invited keynote at SIGIR2008 60 Copyright 2008 © Limsoon Wong
“Guilt by Association” as a Search Principle
• Founded on abduction based on invariants of biology & physics
• Many forms of associations– Sequence– Phylogenetic profile– Dissimilarities
• Fusion of Multiple Evidence of Association
• Issue of Chance Association
• Dichotomy of knowing two entities are in some relationship and yet not knowing what that relationship is
• Shades of meanings
31
Invited keynote at SIGIR2008 61 Copyright 2008 © Limsoon Wong
Acknowledgements
• See-Kiong Ng• Molecular Connections Pvt Ltd
• Hon Nian CHUA• Zhihui LI