an empirical bayesian analysis of simultaneous ... · pdf fileempirical bayesian analysis of...

35
AN EMPIRICAL BAYESIAN ANALYSIS OF SIMULTANEOUS CHANGEPOINTS IN MULTIPLE DATA SEQUENCES By Zhou Fan Lester Mackey Technical Report No. 2015-17 August 2015 Department of Statistics STANFORD UNIVERSITY Stanford, California 94305-4065

Upload: nguyencong

Post on 15-Mar-2018

244 views

Category:

Documents


2 download

TRANSCRIPT

  • AN EMPIRICAL BAYESIAN ANALYSIS OF SIMULTANEOUS

    CHANGEPOINTS IN MULTIPLE DATA SEQUENCES

    By

    Zhou Fan Lester Mackey

    Technical Report No. 2015-17 August 2015

    Department of Statistics STANFORD UNIVERSITY

    Stanford, California 94305-4065

  • AN EMPIRICAL BAYESIAN ANALYSIS OF SIMULTANEOUS

    CHANGEPOINTS IN MULTIPLE DATA SEQUENCES

    By

    Zhou Fan Lester Mackey

    Stanford University

    Technical Report No. 2015-17 August 2015

    Department of Statistics STANFORD UNIVERSITY

    Stanford, California 94305-4065

    http://statistics.stanford.edu

  • AN EMPIRICAL BAYESIAN ANALYSIS OF SIMULTANEOUS

    CHANGEPOINTS IN MULTIPLE DATA SEQUENCES

    ZHOU FAN AND LESTER MACKEY

    Abstract. Motivated by applications in genomics, finance, and biomolecular simulation, we in-troduce a Bayesian framework for modeling changepoints that tend to co-occur across multiplerelated data sequences. We infer the locations and sequence memberships of changepoints in ourhierarchical model by developing efficient Markov chain Monte Carlo sampling and posterior modefinding algorithms based on dynamic programming recursions. We further propose an empiricalBayesian Monte Carlo expectation-maximization procedure for estimating unknown prior param-eters from data. The resulting framework accommodates a broad range of data and changepointtypes, including real-valued sequences with changing mean or variance and sequences of counts orbinary observations. We demonstrate on simulated data that our changepoint estimation accuracyis competitive with the best methods in the literature, and we apply our methodology to the discov-ery of DNA copy number variations in cancer cell lines and the analysis of historical price volatilityin U.S. stocks.

    1. Introduction

    Figure 1 displays three modern examples of aligned sequence data. Panel (a) presents DNAcopy number measurements at (sorted) genome locations in four human cancer cell lines [38].Panel (b) shows the daily stock returns of four U.S. stocks over a period of ten years. Panel (c)traces the interatomic distances between four pairs of atoms in a protein molecule over the courseof a computer simulation [20]. Each sequence in each panel is reasonably modeled as having anumber of discrete changepoints, such that the characteristics of the data change abruptly ateach changepoint but remain homogeneous between changepoints. In panel (a), these changepointsdemarcate the boundaries of DNA stretches with abnormal copy number. In panel (b), changepointsindicate historical events that abruptly impacted the volatility of stock returns. In panel (c),changepoints indicate structural changes in the 3-D conformation of the protein molecule. For eachof these examples, it is important to understand when and in which sequences changepoints occur.However, the number and locations of these changepoints are typically not known a priori and mustbe estimated from the data.

    The problem of identifying changepoints in sequential data was first studied by W. A. Shewhart[31] and E. S. Page [27], in the context of designing inspection schemes that detect abrupt changes inthe quality of output of industrial manufacturing processes. For the applications depicted in Figure1, changepoint detection methodology was first proposed as a tool to discover DNA copy numbervariations in [26] and remains one of the most frequently used methods for this task. Changepointanalysis of the volatility of historical returns of the Dow Jones Industrial Index was performed in[14, 1]. Conformational changes in computer simulations of protein dynamics were analyzed usingchangepoint detection methods in [11]. We refer the reader to [4, 5] for more detailed reviews ofthe literature and further applications.

    Department of Statistics, Stanford UniversityE-mail addresses: [email protected], [email protected] is supported by a Hertz Foundation Fellowship and an NDSEG Fellowship (DoD, Air Force Office of Scientific

    Research, 32 CFR 168a). LM is supported by a Terman Fellowship.

    1

  • 2 EMPIRICAL BAYESIAN ANALYSIS OF SIMULTANEOUS CHANGEPOINTS

    (a) (b) (c)

    Figure 1. (a) DNA copy numbers in four cancer cell lines, indicated by fluorescence intensity log-ratiosfrom array-CGH experiments. (b) Daily returns of four U.S. stocks. (c) Distances between four pairs ofatoms in a computer simulation of a protein molecule.

    In many modern applications, we have available not just a single data sequence but rathermultiple related sequences measured at the same locations or time points. These sequences oftenexhibit changepoints occurring at the same sequential locations, which we will refer to as si-multaneous changepoints in accordance with the terminology introduced in [45]. For instance,copy number variations frequently occur at common genomic locations in cancer cells [29] and inbiologically-related individuals [45], economic and political events can impact the volatility of manystock returns in tandem, and a conformational change in a region of a protein molecule can affectdistances between multiple atomic pairs [11].

    In such applications, an analysis of multiple sequences jointly may yield greater statistical powerin detecting their changepoints than analyses of the sequences individually. In addition, a jointanalysis may more precisely identify the times or locations at which changepoints occur and betterhighlight the locations where changepoints most frequently recur across sequences. For these rea-sons, several statistical methods have recently been developed to simultaneously locate changepointsacross multiple data sequences. These methods include greedy recursive segmentation procedures[45, 32, 17], penalized maximum likelihood procedures [44, 11], and total-variation denoising proce-dures that solve a related problem of estimating multiple piecewise-constant signals [24, 46]. Mostof these procedures were developed for the specific problem of detecting changes in the mean valuesof the data sequences, implicitly or explicitly assuming that the data have constant unchangingvariance around these mean values. Moreover, many of these procedures produce a point estimatewithout an associated measure of uncertainty.

    In contrast, we introduce a Bayesian modeling framework, BASIC, for carrying out a BayesianAnalysis of SImultaneous Changepoints. In single-sequence applications, Bayesian changepointdetectors have been shown to exhibit favorable performance in comparison with other availablemethods and have enjoyed widespread use [6, 42, 3, 34, 7, 12, 1]. We extend Bayesian changepointdetection to the multi-sequence setting by defining a hierarchical prior over latent changepoints,which first specifies the sequential locations at which changepoints may occur and then specifiesthe sequences that contain a changepoint at each such location. In this way, our model capturesthe simultaneity of changepoints across multiple data sequences. Compared with many existingmethods for simultaneous changepoint detection, our framework more easily accommodates a broadrange of data types and likelihood models for the observed data. It can also easily estimate posteriorprobabilities of changepoint events, in addition to returning a point estimate of the changepointlocations. We note that Bayesian approaches to changepoint detection in multiple sequences havebeen explored previously in [10, 13], but these methods have computational complexity that grow

  • EMPIRICAL BAYESIAN ANALYSIS OF SIMULTANEOUS CHANGEPOINTS 3

    exponentially in the number of sequences and hence are inapplicable to problems of the scale weconsider.

    To perform inference under our model, we propose an efficient Markov chain Monte Carlo(MCMC) procedure based on dynamic programming recursions to sample from the posterior dis-tribution of changepoints. These samples may be used to give Monte Carlo estimates of posteriorprobabilities of various changepoint events. We also provide an algorithm to locally maximize theposterior distribution under our model, to obtain a maximum a posteriori point estimate of thechangepoint locations. In applications where parameters of the prior distributions in our model arenot known in advance, we propose a Monte Carlo expectation-maximization (MCEM) procedureto estimate the prior parameters from the data in an empirical Bayesian fashion.

    Even though our model is specified hierarchically, our inference algorithms directly sample fromand maximize over the posterior distribution of only the latent changepoint variables, analyti-cally marginalizing over all other unobserved variables in the model. The primary routines of ourMCMC procedure are Gibbs sampling steps which use dynamic programming algorithms to ex-actly sample from the joint conditional distribution of all changepoints in a single sequence or ofall changepoints at a single location across all sequences. We observe in simulation that this allowsfor fast convergence of changepoint detection accuracy, and in particular, that this approach ismuch more effective than the naive Gibbs sampling approach which samples each latent change-point variable individually. Similarly, our maximization algorithm for estimating the changepointlocations with highest posterior probability employs subroutines that exactly maximize over allpossible changepoint locations in a single sequence or all possible combinations of sequences thatcontain a changepoint at a single sequential location. We observe that this procedure yields highchangepoint detection accuracy, competitive with the best existing methods in the literature, inour simulated examples.

    We apply our model and inference algorithms to two real-data examples. In the first example, weanalyze a set of array comparative genomic hybridization (aCGH) copy number measurements