Approximate Dynamic Programming and aCGH Denoising
Charalampos (Babis) E. Tsourakakis (ctsourak@math.cmu.edu)
Machine Learning Lunch Seminar, 10th January '11


Page 1

Approximate Dynamic Programming and aCGH Denoising

Charalampos (Babis) E. Tsourakakis
ctsourak@math.cmu.edu
Machine Learning Seminar, 10th January '11

Page 2

Joint work

Richard Peng, SCS, CMU
Gary L. Miller, SCS, CMU
Russell Schwartz, SCS & BioScience, CMU
David Tolliver, SCS, CMU
Maria Tsiarli, CNUP, CNBC, UPitt
and Stanley Shackney, Oncologist

Page 3

Outline

Motivation
Related Work
Our contributions
  Halfspaces and DP
  Multiscale Monge optimization
Experimental Results
Conclusions

Page 4

Dynamic Programming

Richard Bellman

• "Lazy" recursion!
• Overlapping subproblems
• Optimal substructure

Why Dynamic Programming?

"The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I'm not using the term lightly; I'm using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence. You can imagine how he felt, then, about the term, mathematical. ... What title, what name, could I choose? ... Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to."
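As a generic illustration of these three ingredients (not an example from the talk), here is a minimal memoized recursion in Python; `fib` is just a stand-in for any recurrence with overlapping subproblems:

```python
from functools import lru_cache

@lru_cache(maxsize=None)          # memoization: each subproblem is solved once
def fib(n: int) -> int:
    # Overlapping subproblems: fib(n-1) and fib(n-2) share almost all of their
    # own subproblems.  Optimal substructure: fib(n) is assembled from the
    # answers to smaller instances.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(90))                    # linear time thanks to the cache
```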

Page 5

*Few* applications…

DNA sequence alignment

“Pretty” printing

Histogram construction in DB systems

HMMs

and many more…

Page 6

… and a few books

Page 7

Motivation

Tumorigenesis is strongly associated with abnormalities in DNA copy numbers.

Goal: find the true DNA copy number per probe by denoising the measurements.

Array-based comparative genomic hybridization (aCGH)

Page 8

Typical Assumptions

Nearby probes tend to have the same DNA copy number.

Treat the data as a 1-d time series and fit piecewise constant segments.

[Figure: probe measurements log(T/R) plotted along the genome; for humans the reference copy number is R = 2.]

Page 9

Problem Formulation

Input: noisy sequence (p1, ..., pn). Output: (F1, ..., Fn) which minimizes the objective below.
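The objective itself was rendered as an image and did not survive the transcript. A reconstruction consistent with the piecewise-constant fitting described above and with the weight decomposition w(j, i) = w'(j, i)/(i − j) + C used later (the exact penalty convention on the slide may differ slightly) is:

```latex
\min_{F\ \text{piecewise constant}}\ \sum_{i=1}^{n} \bigl(p_i - F_i\bigr)^2 \;+\; C \cdot \#\{\text{segments of } F\}
```

In recurrence form, with DP_0 = 0 and μ_{j+1,i} the mean of p_{j+1}, ..., p_i:

```latex
\mathrm{DP}_i \;=\; \min_{0 \le j < i}\Bigl[\mathrm{DP}_j \;+\; \sum_{k=j+1}^{i} \bigl(p_k - \mu_{j+1,i}\bigr)^2 \;+\; C\Bigr].
```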

Digression: the constant C is determined by training on data with ground truth.

Page 10

Outline

Motivation
Related Work
Our contributions
  Halfspaces and DP
  Multiscale Monge optimization
Experimental Results
Conclusions

Page 11

Related Work

Don Knuth: optimal BSTs in O(n²) time.

Frances Yao: if the recurrence's weight function satisfies the quadrangle inequality (Monge), then the naïve O(n³) dynamic programming algorithm can be turned into an O(n²) one.

Page 12

Related Work

Gaspard Monge

Quadrangle inequality and inverse quadrangle inequality (stated below).
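The two conditions appeared as formulas on the slide; the standard statements, for a weight function w defined on index pairs, are:

```latex
\text{Quadrangle (Monge) inequality:}\qquad w(a,c) + w(b,d) \;\le\; w(a,d) + w(b,c) \quad \text{for all } a \le b \le c \le d;
\text{Inverse quadrangle inequality:}\qquad w(a,c) + w(b,d) \;\ge\; w(a,d) + w(b,c) \quad \text{for all } a \le b \le c \le d.
```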

Page 13

Related Work

For the one-dimensional dynamic programming recurrence

$$E(j) \;=\; \min_{1 \le i < j}\bigl[E(i) + w(i,j)\bigr], \qquad w \text{ Monge},$$

Eppstein, Galil, and Giancarlo gave an O(n log n)-time algorithm, and Larmore and Schieber an O(n)-time algorithm.

Page 14

Related Work

SMAWK algorithm: finds all row minima of a totally monotone N × N matrix in O(N) time!

Bein, Golin, Larmore, Zhang showed that the Knuth-Yao technique is implied by the SMAWK algorithm.
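As a companion to this slide, a compact recursive SMAWK sketch in Python (the standard reduce/interpolate scheme; `rows`, `cols`, and `lookup` are illustrative names, not code from the talk):

```python
def smawk(rows, cols, lookup):
    """Leftmost minimum of every row of an (implicit) totally monotone matrix.

    rows, cols   -- lists of row / column identifiers
    lookup(r, c) -- returns the matrix entry in row r, column c
    Returns a dict mapping each row to the column holding its leftmost minimum.
    """
    answer = {}

    def solve(rows, cols):
        if not rows:
            return
        # REDUCE: discard columns that cannot hold any row minimum,
        # keeping at most len(rows) candidates (in increasing order).
        stack = []
        for c in cols:
            while stack:
                r = rows[len(stack) - 1]
                if lookup(r, stack[-1]) <= lookup(r, c):
                    break
                stack.pop()
            if len(stack) < len(rows):
                stack.append(c)
        # Recurse on the odd-indexed rows with the surviving columns.
        solve(rows[1::2], stack)
        # INTERPOLATE: by total monotonicity the minimum of an even-indexed
        # row lies between the minima found for its odd-indexed neighbours.
        k = 0
        for idx in range(0, len(rows), 2):
            row = rows[idx]
            right = answer[rows[idx + 1]] if idx + 1 < len(rows) else stack[-1]
            best = stack[k]
            while stack[k] != right:
                k += 1
                if lookup(row, stack[k]) < lookup(row, best):
                    best = stack[k]
            answer[row] = best

    solve(list(rows), list(cols))
    return answer


# Tiny check: M[r][c] = (c - f[r])**2 with f non-decreasing is Monge,
# hence totally monotone; the leftmost row minimum sits at column f[r].
f = [0, 1, 3, 6, 6, 9]
mins = smawk(list(range(len(f))), list(range(12)), lambda r, c: (c - f[r]) ** 2)
assert all(mins[r] == f[r] for r in range(len(f)))
```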

Page 15

Related Work

HMMs
Bayesian HMMs
Kalman filters
Wavelet decompositions
Quantile regression
EM and edge filtering
Lasso
Circular Binary Segmentation (CBS)
Likelihood-based methods using a Gaussian profile for the data (CGHSEG)
…

Page 16

Outline

Motivation
Related Work
Our contributions
  Halfspaces and DP
  Multiscale Monge optimization
Experimental Results
Conclusions

Page 17

Our contributions

Technique 1: using halfspace queries, we get a fast, high-quality approximation algorithm: ε additive error, running in O(n^{4/3+δ} log(U/ε)) time.

Technique 2: carefully break the original problem into a "small" number of Monge optimization problems; this approximates the optimal answer within a factor of (1 + ε), in O(n log n / ε) time.

Page 18

Analysis of our Recurrence

Recurrence for our optimization problem:

$$\mathrm{DP}_i \;=\; \min_{0 \le j < i}\bigl[\mathrm{DP}_j + w(j,i)\bigr],$$

together with an equivalent reformulation (holding for all i > 0) that the analysis below works with.

Page 19

Analysis of our Recurrence

Let w(j, i) denote the cost of covering positions j+1, ..., i with a single segment.

Claim: the recurrence DP_i = min_{0 ≤ j < i} [DP_j + w(j, i)] computes the optimum, and (using prefix sums) each w(j, i) can be evaluated in constant time.

This immediately implies an O(n²) algorithm for this problem.

However, the 1/(i − j) term inside w kills the Monge property!
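A direct O(n²) implementation of this, sketched in Python under the squared-error segment cost assumed above; prefix sums give the constant-time evaluation of w from the claim (names are illustrative):

```python
def denoise(p, C):
    """Exact O(n^2) DP: fit a piecewise-constant signal to p with a
    per-segment penalty C.  Returns (optimal cost, list of segments)."""
    n = len(p)
    S = [0.0] * (n + 1)                       # prefix sums of p
    Q = [0.0] * (n + 1)                       # prefix sums of p^2
    for k, x in enumerate(p, 1):
        S[k] = S[k - 1] + x
        Q[k] = Q[k - 1] + x * x

    def w(j, i):                              # cost of one segment over p[j..i-1]
        s, q, m = S[i] - S[j], Q[i] - Q[j], i - j
        return q - s * s / m + C              # best-constant-fit error, plus C

    DP = [0.0] + [float("inf")] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            cand = DP[j] + w(j, i)
            if cand < DP[i]:
                DP[i], back[i] = cand, j

    segments, i = [], n                       # recover the segmentation
    while i > 0:
        segments.append((back[i], i))
        i = back[i]
    return DP[n], segments[::-1]

cost, segs = denoise([0.1, 0.0, -0.1, 1.9, 2.1, 2.0, 0.0], C=0.5)
print(cost, segs)
```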

Page 20

Why is it not Monge?

Essentially, because the search for the optimal breakpoint cannot be restricted to a narrow range. E.g., for C = 1:

(0, 2, 0, 2, ..., 0, 2): the optimum fits a segment per point;
(0, 2, 0, 2, ..., 0, 2, 1): the optimum fits one segment for all points.

Page 21

Halfspaces and DP

Do binary searches to approximate DP_i for every i = 1, ..., n.

We perform enough iterations of each binary search: O(log n · log(U/ε)) iterations suffice, where U is an upper bound on the values involved. By induction we can show that the resulting estimates are within an additive ε of the exact values, i.e., an additive-ε error approximation.

Page 22

Halfspaces and DP

For fixed i, each step of the binary search is a halfspace query: the query is determined by the vector (x, i, S_i, −1), and the data structure holds, for every j < i, the lifted point (j, D̃P_j, 2S_j, S_j² + j·D̃P_j), where S denotes prefix sums and the tilde marks the approximate DP values of the equivalent formulation.
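To see how such a four-dimensional lifting can arise, here is an illustrative derivation under the squared-error cost assumed earlier (S and Q are the prefix sums of the p_k and of the p_k²); the paper's exact coordinates and normalization may differ:

```latex
\mathrm{DP}_j + w(j,i) \le x, \qquad w(j,i) = (Q_i - Q_j) - \frac{(S_i - S_j)^2}{i-j} + C .
```

Multiplying by i − j > 0 and expanding gives a condition that is linear in quantities depending only on j:

```latex
i\,\widetilde{\mathrm{DP}}_j \;-\; (Q_i + C - x)\,j \;+\; 2 S_i\,S_j
\;-\; \bigl(j\,\widetilde{\mathrm{DP}}_j + S_j^{\,2}\bigr)
\;+\; i\,(Q_i + C - x) - S_i^{\,2} \;\le\; 0,
\qquad \widetilde{\mathrm{DP}}_j := \mathrm{DP}_j - Q_j .
```

Up to reordering coordinates and moving constant factors between the point and the query vector, this is the pairing shown above: points (j, D̃P_j, 2S_j, S_j² + j·D̃P_j) queried against (x, i, S_i, −1), i.e., a halfspace emptiness query in R⁴.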

Page 23

Dynamic Halfspace Reporting

Agarwal, Eppstein, Matoušek

Halfspace emptiness query: given a set of points S in R⁴, the halfspace range reporting problem can be solved with sublinear query and update times after subquadratic preprocessing, which yields the overall O(n^{4/3+δ} log(U/ε)) running time below.

Page 24

Halfspaces and DP

Hence the algorithm iterates through the indices i = 1, ..., n, maintaining the halfspace-reporting data structure of Agarwal, Eppstein, and Matoušek with one point for every j < i.

It performs a binary search on the value of DP_i, which reduces to halfspace emptiness queries.

It provides an answer within an additive ε of the optimal one.

Running time: O(n^{4/3+δ} log(U/ε)).

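The overall structure of Technique 1, sketched in Python. The geometric data structure is replaced here by a brute-force `exists_j` test, so this sketch runs in O(n² log(U/ε)) time rather than O(n^{4/3+δ} log(U/ε)); in the real algorithm that test is a single dynamic halfspace emptiness query, and the error accumulation across indices is handled more carefully in the paper. The names, the bound U, and the tolerance handling are illustrative:

```python
def approx_dp(p, C, eps, U):
    """Binary-search-on-value skeleton of Technique 1; the halfspace
    emptiness structure is replaced by a brute-force feasibility test."""
    n = len(p)
    S = [0.0] * (n + 1)                       # prefix sums of p
    Q = [0.0] * (n + 1)                       # prefix sums of p^2
    for k, x in enumerate(p, 1):
        S[k], Q[k] = S[k - 1] + x, Q[k - 1] + x * x

    def w(j, i):                              # single-segment cost of p[j..i-1]
        s, q, m = S[i] - S[j], Q[i] - Q[j], i - j
        return q - s * s / m + C

    DP = [0.0] * (n + 1)                      # approximate DP values
    for i in range(1, n + 1):
        def exists_j(x):                      # stand-in for one emptiness query
            return any(DP[j] + w(j, i) <= x for j in range(i))

        lo, hi = 0.0, U                       # U: a priori upper bound on DP values
        while hi - lo > eps:                  # O(log(U/eps)) iterations
            mid = (lo + hi) / 2
            if exists_j(mid):
                hi = mid
            else:
                lo = mid
        DP[i] = hi                            # within eps of min_j (DP[j] + w(j, i))
    return DP[n]

print(approx_dp([0.1, 0.0, 2.0, 2.1], C=0.5, eps=1e-3, U=100.0))
```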

Page 25

Multimonge Decomposition and DP

By simple algebra we can write our weight function w(j, i) as w'(j, i)/(i − j) + C.
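Under the squared-error segment cost assumed earlier (with S and Q the prefix sums of the p_k and of the p_k²), one concrete instantiation of this decomposition, together with a rewriting that makes the Monge property of w' transparent, is:

```latex
w(j,i) \;=\; (Q_i - Q_j) - \frac{(S_i - S_j)^2}{i-j} + C \;=\; \frac{w'(j,i)}{i-j} + C,
\qquad
w'(j,i) \;=\; (i-j)(Q_i - Q_j) - (S_i - S_j)^2 \;=\; \sum_{j < k < l \le i} (p_k - p_l)^2 .
```

In the pairwise form, the quadrangle inequality is immediate: for a ≤ b ≤ c ≤ d, the difference w'(a,d) + w'(b,c) − w'(a,c) − w'(b,d) counts exactly the nonnegative terms (p_k − p_l)² with k ∈ (a, b] and l ∈ (c, d].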

The weight function w' is Monge!

Key idea: approximate i − j by a constant! But how?

Page 26

Multimonge Decomposition and DP

For each i, we break the choices of j into intervals [l_k, r_k] such that i − l_k and i − r_k differ by at most a (1 + ε) factor.

O(log n / ε) such intervals suffice to get a (1 + ε)-approximation.
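A sketch of one such decomposition in Python (the function name and the exact choice of interval endpoints are illustrative, not taken from the paper):

```python
import math

def breakpoint_intervals(i, eps):
    """Split the candidate breakpoints j = 0, ..., i-1 into intervals [l, r]
    such that the segment lengths i - l and i - r differ by at most a
    (1 + eps) factor.  Produces O(log(i)/eps) intervals."""
    intervals, length = [], 1
    while length <= i:
        hi_len = min(i, max(length, math.floor(length * (1 + eps))))
        # lengths in [length, hi_len]  <->  breakpoints j in [i - hi_len, i - length]
        intervals.append((i - hi_len, i - length))
        length = hi_len + 1
    return intervals

print(breakpoint_intervals(1000, 0.1))   # a few dozen short intervals covering 0..999
```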

However, we need to make sure that when we solve a specific subproblem, the optimum lies in the desired interval.

How?

Page 27

Multimonge Decomposition and DP

Each subproblem uses a modified Monge weight in which M is a sufficiently large positive constant; it penalizes breakpoints outside the subproblem's interval, so the optimum stays inside it.

Running time: O(n log n / ε), using the O(n)-time-per-subproblem algorithm of Larmore and Schieber.

Page 28

Outline

Motivation
Related Work
Our contributions
  Halfspaces and DP
  Multiscale Monge optimization
Experimental Results
Conclusions

Page 29

Experimental Setup

All methods implemented in MATLAB:
  CGHseg (Picard et al.)
  CBS (MATLAB Bioinformatics Toolbox)

Train on data with ground truth to "learn" C.

Datasets:
  "Hard" synthetic data (Lai et al.)
  Coriell cell lines
  Breast cancer cell lines

Page 30

Synthetic Data

[Figure: denoising results of CGHTRIMMER, CBS, and CGHSEG on the synthetic data; one panel is annotated "Misses a small aberrant region!"]

Running times: CGHTRIMMER 0.007 sec, CBS 60 sec, CGHSEG 1.23 sec.

Page 31

Coriell Cell Lines

[Figure: results on the Coriell cell lines for CGHTRIMMER, CBS, and CGHSEG.]

CGHTRIMMER: 5.78 sec; 2 mistakes (1 FP, 1 FN), both also made by the competitors.
CBS: 47.7 min; 8 mistakes (7 FP, 1 FN).
CGHSEG: 8.15 min; 8 mistakes (1 FP, 3 FN).

Page 32

Breast Cancer (BT474)

[Figure: results on the breast cancer cell line BT474 for CGHTRIMMER, CGHSEG, and CBS; the genes NEK7 and KIF14 are highlighted.]

Note: in the paper we have an extensive biological analysis with respect to the findings of the three methods.

Page 33

Outline

Motivation
Related Work
Our contributions
  Halfspaces and DP
  Multiscale Monge optimization
Experimental Results
Conclusions

Page 34

Summary

New, simple formulation of aCGH denoising, with numerous other applications (e.g., histogram construction).

Two new techniques for approximate DP:
  Halfspace queries
  Multiscale Monge decomposition

Validation of our model using synthetic and real data.

Page 35

Problems

Other problems where our techniques are directly or almost directly applicable?

O(n²) is unlikely to be tight. E.g., if two points p_i and p_j are sufficiently far apart, then they must belong to different segments. Can we find a faster exact algorithm?

Page 36

Thanks a lot!