Weighing Evidence in the Absence of a Gold Standard
Phil Long
Genome Institute of Singapore
(joint work with K.R.K. “Krish” Murthy, Vinsensius Vega, Nir Friedman and Edison Liu)
Problem: Ortholog mapping
• Pair genes in one organism with their equivalent counterparts in another
• Useful for supporting medical research using animal models
A little molecular biology
• DNA has nucleotides (A, C, T and G) arranged linearly along chromosomes
• Regions of DNA, called genes, encode proteins
• Proteins are the biochemical workhorses
• Proteins are made up of amino acids
  • also strung together linearly
  • fold up to form 3D structure
Mutations and evolution
• Speciation often roughly as follows:
  • one species separated into two populations
  • separate populations’ genomes drift apart through mutation
  • important parts (e.g. genes) drift less
• Orthologs have common evolutionary ancestor
• Genes sometimes copied
  • original retains function
  • copy drifts or dies out
• Both fine-grained and coarse-grained mutations
Evidence of orthology
• (protein) sequence similarity
• comparison with third organism
• conservation of synteny
• ...
Conserved synteny
• Neighbor relationships often preserved
• Consequently, similarity among their neighbors is evidence that a pair of genes are orthologs
Plan
• Identify numerical features corresponding to
  • sequence similarity
  • common similarity to third organism
  • conservation of synteny
• “Learn” mapping from feature values to prediction
Problem: no “gold standard”
• for mouse-human orthology, Jackson database reasonable
• for human-zebrafish? human-pombe?
Another “no gold standard” problem: protein-protein interactions
• Sources of evidence:
  • Yeast two-hybrid
  • Rosetta Stone
  • Phage display
  • ...
• All yield errors
Related Theoretical Work [MV95] - Problem
• Goal:
  • given m training examples generated as below
  • output accurate classifier h
• Training example generation:
  • All variables {0,1}-valued
  • Y chosen randomly, fixed
  • X1,...,Xn chosen independently with Pr(Xi = Y) = pi, where pi is
    • unknown,
    • same when Y is 0 or 1 (crucial for analysis)
  • only X1,...,Xn given to training algorithm
Related Theoretical Work [MV95] - Results
• If n ≥ 3, can approach Bayes error (best possible for source) as m gets large
• Idea:
  • variable “good” if often agrees with others
  • can e.g. solve for Pr(X1 = Y) as function of Pr(X1 = X2), Pr(X1 = X3), and Pr(X2 = X3)
  • can estimate Pr(X1 = X2), Pr(X1 = X3), and Pr(X2 = X3) from the training data
  • can plug in to get estimates of Pr(X1 = Y),...,Pr(Xn = Y)
  • can use resulting estimates of Pr(X1 = Y),...,Pr(Xn = Y) to approximate optimal classifier for source
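The "solve for Pr(X1 = Y)" step can be made concrete with a small sketch. It assumes the [MV95] setting stated on the problem slide (binary variables, independent given Y, accuracy pi the same for Y = 0 and Y = 1) and additionally assumes every pi > 1/2 so that the positive square root is the right one. Under those assumptions Pr(Xi = Xj) = pi pj + (1 - pi)(1 - pj), so 2 Pr(Xi = Xj) - 1 = (2pi - 1)(2pj - 1). The function names are illustrative, not taken from [MV95].

```python
import math


def agreement(p_i, p_j):
    """Pr(Xi = Xj) when Xi, Xj are independent given Y with accuracies p_i, p_j."""
    return p_i * p_j + (1.0 - p_i) * (1.0 - p_j)


def accuracy_from_agreement(q12, q13, q23):
    """Recover p1 = Pr(X1 = Y) from the three pairwise agreement probabilities.

    Assumes all accuracies exceed 1/2 and that q23 != 1/2 (X2, X3 informative).
    """
    c12, c13, c23 = (2.0 * q - 1.0 for q in (q12, q13, q23))
    a1 = math.sqrt(c12 * c13 / c23)  # a1 = 2 * p1 - 1
    return (1.0 + a1) / 2.0


# Example: true accuracies 0.9, 0.8, 0.7 are recovered from agreements alone.
p1_hat = accuracy_from_agreement(
    agreement(0.9, 0.8), agreement(0.9, 0.7), agreement(0.8, 0.7)
)
```

In practice the agreement probabilities would be replaced by their empirical frequencies, which is where estimation noise enters.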
In our problem(s)...
• Pr(Y = 1) small
• X1,...,Xn continuous-valued
• Reasonable to assume X1,...,Xn conditionally independent given Y
• Reasonable to assume Pr(Y = 1 | Xi = x) increasing in x, for all i
• Sufficient to sort training examples in order of associated conditional probabilities that Y = 1
Key Idea
• Suppose Pr(Y = 1) known
• For variable i,
  • Set threshold on Xi, with Ui = 1 when Xi is at or above it and Ui = 0 otherwise, so that Pr(Ui = 1) = Pr(Y = 1)
  • Then Pr(Y = 1 and Ui = 0) = Pr(Y = 0 and Ui = 1)
  • Can solve for these error probabilities for all i in terms of probabilities Ui’s agree,...
[Figure: a row of examples labeled “-” and “+”, with “+” concentrated toward one end, split by the threshold into Ui = 0 and Ui = 1.]
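A minimal sketch of the thresholding step, assuming Pr(Y = 1) is supplied as `prior` and that the threshold is taken as an empirical quantile of the observed feature values (the helper names are hypothetical). It also illustrates why the two error types balance: if Ui and Y have the same number of ones, then the count with Y = 1, Ui = 0 must equal the count with Y = 0, Ui = 1.

```python
def quantile_threshold(values, prior):
    """Threshold such that about a `prior` fraction of values are >= it."""
    ranked = sorted(values, reverse=True)
    k = max(1, round(prior * len(values)))  # number of examples to mark as 1
    return ranked[k - 1]


def binarize(values, prior):
    """U = 1 for values at or above the quantile threshold, else 0."""
    theta = quantile_threshold(values, prior)
    return [1 if x >= theta else 0 for x in values]


# Toy data: 100 distinct feature values, 10 true positives that only
# partly overlap with the top 10 feature values.
xs = list(range(100))
ys = [1 if 85 <= x <= 94 else 0 for x in xs]
us = binarize(xs, prior=0.1)

missed = sum(1 for y, u in zip(ys, us) if y == 1 and u == 0)
false_alarms = sum(1 for y, u in zip(ys, us) if y == 0 and u == 1)
```

With tied feature values the fraction marked 1 can exceed `prior`; the sketch does not handle that case specially.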
Final Plan (informal)
• Assume various values of Pr(Y = 1); predict orthologs given each
• For pairs of genes predicted to be orthologs even when Pr(Y = 1) assumed small, confidently predict orthology
• For pairs of genes predicted to be orthologs only when Pr(Y = 1) assumed pretty big, predict orthology more tentatively
Final Plan - Probabilistic Viewpoint
• Consider hidden variable Z:
  • takes values uniformly distributed in [0,1]
  • interpretation: “obviously orthologous”
• Assumptions
  • Pr(Y = 1 | Z = z) increasing in z
  • For all z, Pr(Z ≥ z | Xi = x) increasing in x
• For various z
  • Let Vz = 1 if Z ≥ z, Vz = 0 otherwise
  • Let Uz,i = 1 if Xi ≥ θz,i, Uz,i = 0 otherwise, where θz,i chosen so that Pr(Uz,i = 1) = Pr(Vz = 1)
• Interpretations:
  • Vz is “In the top 100(1-z)% most likely to have Y = 1 overall”
  • Uz,i is “In the top 100(1-z)% most likely to have Y = 1 given Xi”
Final Plan - Algorithm
• For various z, estimate conditional probability that Vz = 1, i.e. that Z ≥ z, given each training example, using estimated probabilities that pairs of Uz,i’s agree
• Add up these estimates over the values of z to estimate each example’s Z; sort examples by the estimates.
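One way the per-example probability could be computed is sketched below. This is an illustrative reading, not necessarily the authors' implementation. It assumes the Uz,i are conditionally independent given Vz, that Pr(Uz,i = 1) = Pr(Vz = 1) = `prior`, and that an error rate `err[i]` = Pr(Uz,i ≠ Vz) has already been estimated for each feature. Because the two kinds of error are equally likely under the threshold choice, each has probability `err[i] / 2`. The function name and arguments are hypothetical.

```python
def posterior_v(u, err, prior):
    """Pr(V = 1 | U_1..U_n), assuming the U_i are independent given V.

    u:     binarized features (0 or 1) for one example
    err:   estimated Pr(U_i != V) for each feature; needs err[i] / 2 < prior
    prior: Pr(V = 1), which by construction also equals Pr(U_i = 1)
    """
    like1 = prior        # running value of Pr(V = 1, U_1..U_i)
    like0 = 1.0 - prior  # running value of Pr(V = 0, U_1..U_i)
    for u_i, e_i in zip(u, err):
        p1 = 1.0 - (e_i / 2.0) / prior    # Pr(U_i = 1 | V = 1)
        p0 = (e_i / 2.0) / (1.0 - prior)  # Pr(U_i = 1 | V = 0)
        like1 *= p1 if u_i else 1.0 - p1
        like0 *= p0 if u_i else 1.0 - p0
    return like1 / (like1 + like0)


# Three features, each disagreeing with V 2% of the time, prior 0.1.
all_agree_positive = posterior_v((1, 1, 1), [0.02] * 3, 0.1)
all_agree_negative = posterior_v((0, 0, 0), [0.02] * 3, 0.1)
```

Repeating this for a grid of z values and summing the results per example would give the score used for sorting.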
Practical problem
• Small errors in estimates of Pr(Uz,i = Uz,j)’s can lead to large errors in estimates of Pr(Uz,i = Vz)’s (in fact, program crashes).
• Solution:
  • the important case is when Pr(Vz = 1) is small (confident predictions)
  • in that case, can approximate: Pr(Uz,i ≠ Vz) ≈ ½ (Pr(Uz,i ≠ Uz,j) + Pr(Uz,i ≠ Uz,k) - Pr(Uz,j ≠ Uz,k)).
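The approximation rests on the observation that when errors are rare, two features seldom err on the same example, so Pr(Ui ≠ Uj) is roughly the sum of the two error rates. A minimal sketch, assuming the binarized features are available as equal-length lists of 0/1 values (function names are hypothetical):

```python
def disagreement(u_a, u_b):
    """Empirical Pr(U_a != U_b) over paired 0/1 observations."""
    return sum(a != b for a, b in zip(u_a, u_b)) / len(u_a)


def approx_error(u_i, u_j, u_k):
    """Approximate Pr(U_i != V) from pairwise disagreement rates only."""
    return 0.5 * (
        disagreement(u_i, u_j) + disagreement(u_i, u_k) - disagreement(u_j, u_k)
    )


# Toy check: 20 examples with V = 0 throughout, and three features whose
# errors fall on disjoint examples, so the approximation is exact here.
v = [0] * 20
u0 = [1 if t in (0, 1) else 0 for t in range(20)]     # error rate 0.10
u1 = [1 if t in (2,) else 0 for t in range(20)]       # error rate 0.05
u2 = [1 if t in (3, 4, 5) else 0 for t in range(20)]  # error rate 0.15

e0 = approx_error(u0, u1, u2)
```

Unlike the exact solution, this expression involves no division or square root, so noisy estimates cannot produce an undefined result, though they can still yield a slightly negative value.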
Evaluation: Artificial Source
• Examples generated using randomly chosen probability distribution:
  • Pr(Y = 1) = 0.1, n = 5
  • For each i,
    • choose μi uniformly from [min,max]
    • set distributions for ith variable:
      • Pr(Xi | Y=0) = N(-μi,1),
      • Pr(Xi | Y=1) = N(μi,1).
• Evaluate using area under the ROC curve
• Repeat 100 times, average
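The artificial source described above can be sketched as follows. The function and argument names are illustrative; the distributional choices (prior 0.1, unit-variance Gaussians centred at ±μi, μi drawn uniformly from a range) follow the slide.

```python
import random


def sample_source(m, n, mu_min, mu_max, prior=0.1, rng=None):
    """Draw m examples with n features from one randomly chosen source.

    Returns (mus, rows, labels): the per-feature means, the feature vectors,
    and the hidden labels Y (kept only for evaluation).
    """
    rng = rng or random.Random()
    mus = [rng.uniform(mu_min, mu_max) for _ in range(n)]
    rows, labels = [], []
    for _ in range(m):
        y = 1 if rng.random() < prior else 0
        sign = 1.0 if y == 1 else -1.0
        # Feature i is N(+mu_i, 1) when Y = 1 and N(-mu_i, 1) when Y = 0.
        rows.append([rng.gauss(sign * mu, 1.0) for mu in mus])
        labels.append(y)
    return mus, rows, labels


mus, rows, labels = sample_source(1000, 5, 0.2, 1.0, rng=random.Random(0))
```

A learning algorithm under test would see only `rows`; `labels` would be used afterwards to score its ranking.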
ROC curve
[Figure: ROC curve plotting true positives against false positives, each axis running to 1, with the area under the ROC curve marked.]
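The area under the ROC curve equals the probability that a randomly chosen positive example is ranked above a randomly chosen negative one, with ties counted as one half. A minimal sketch using that pairwise definition (quadratic in the number of examples, which is fine for small evaluations; the function name is illustrative):

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count a full win when the positive outranks the negative, half for a tie.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

An AUC of 1 means every positive outranks every negative; 0.5 corresponds to a ranking no better than chance.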
Results: Artificial Source

| m | min μ | max μ | peer AUC | opt (w/ Y’s) |
|------|------|------|------|------|
| 1000 | 0.2 | 1.0 | .940 | .985 |
| 1000 | 0.1 | 0.5 | .811 | .881 |
| 1000 | 0.05 | 0.25 | .635 | .818 |
| 1000 | 0.02 | 0.1 | .611 | .753 |
Evaluation: mouse-human ortholog mapping
• Use Jackson mouse-human ortholog database as “gold standard”
• Apply algorithm, post-processing to map each gene to unique ortholog
• Compare with analogous BLAST-only algorithm
• Plot ROC curve
• Treat anything not in database as non-ortholog
  • some “false positives” in fact correct
  • error rate overestimated
Results: mouse-human ortholog mapping
Open problems
• Given our assumptions, is there an algorithm for learning using random examples that always approaches the optimal AUC given knowledge of the source?
• Is discretizing the independent variables necessary?
• How does our method compare with other natural algorithms? (E.g. what about algorithms based on clustering?)