Continuous Representations of Time Gene Expression Data. Ziv Bar-Joseph, Georg Gerber, David K. Gifford. MIT Laboratory for Computer Science. J. Comput. Biol., 10, 341-356, 2003

Page 1: Continuous Representations of Time Gene Expression Data

Continuous Representations of Time Gene Expression Data

Ziv Bar-Joseph, Georg Gerber, David K. Gifford

MIT Laboratory for Computer Science

J. Comput. Biol., 10, 341-356, 2003

Page 2: Continuous Representations of Time Gene Expression Data

Outline

• Splines
• Estimating Unobserved Expression Values and Time Points
• Model Based Clustering Algorithm for Temporal Data
• Aligning Temporal Data
• Results

Page 3: Continuous Representations of Time Gene Expression Data

Splines

• The word “spline” comes from the shipbuilding industry

Page 4: Continuous Representations of Time Gene Expression Data

Splines

• Splines are piecewise polynomials with boundary continuity and smoothness constraints.

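• The typical way to represent a piecewise cubic curve:

$$y(t) = \sum_{i=1}^{n} C_i S_i(t), \quad t_{\min} \le t \le t_{\max}$$

– The $S_i(t)$ are polynomials and the $C_i$ are the coefficients.
– Indices: $l = 1 \ldots 4$, $j = 0 \ldots n/4 - 1$; the $x_j$'s denote the break-points of the piecewise polynomials.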

Page 5: Continuous Representations of Time Gene Expression Data

Splines

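– We have $n/4$ cubic polynomials:

$$p_j(t) = \sum_{l=1}^{4} C_{4j+l}\, t^{l-1}$$

– $n$ equations are required.
– Interpolating splines:

$$\begin{aligned}
p_j(x_j) &= p_{j-1}(x_j) \\
p'_j(x_j) &= p'_{j-1}(x_j) \\
p''_j(x_j) &= p''_{j-1}(x_j) \\
p_j(x_j) &= D_j
\end{aligned}$$

$$p_0(x_0) = D_0 \quad \text{and} \quad p_{n/4-1}(x_{n/4}) = D_{n/4}$$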
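The piecewise representation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `eval_piecewise_cubic`, the 0-based coefficient layout, and the two-piece example are assumptions made for this sketch. Piece j is taken to be p_j(t) = C_{4j+1} + C_{4j+2} t + C_{4j+3} t^2 + C_{4j+4} t^3 on x_j <= t < x_{j+1}.

```python
from bisect import bisect_right

def eval_piecewise_cubic(coeffs, breaks, t):
    """Evaluate a piecewise cubic; piece j uses coeffs[4j : 4j + 4] (constant term first)."""
    n_pieces = len(breaks) - 1
    if len(coeffs) != 4 * n_pieces:
        raise ValueError("need exactly 4 coefficients per piece")
    if not breaks[0] <= t <= breaks[-1]:
        raise ValueError("t lies outside the break-point range")
    # Piece j covers x_j <= t < x_{j+1}; the last piece also takes the right end point.
    j = min(bisect_right(breaks, t) - 1, n_pieces - 1)
    c = coeffs[4 * j: 4 * j + 4]
    # Horner evaluation of c[0] + c[1] t + c[2] t^2 + c[3] t^3.
    return ((c[3] * t + c[2]) * t + c[1]) * t + c[0]

# Hypothetical example: t^3 on [0, 1] joined to 2t^3 - 3t^2 + 3t - 1 on [1, 2].
# Both pieces have value 1, first derivative 3 and second derivative 6 at t = 1,
# so the smoothness constraints hold at the interior break-point.
breaks = [0.0, 1.0, 2.0]
coeffs = [0.0, 0.0, 0.0, 1.0, -1.0, 3.0, -3.0, 2.0]
left_value = eval_piecewise_cubic(coeffs, breaks, 0.5)
joint_value = eval_piecewise_cubic(coeffs, breaks, 1.0)
right_value = eval_piecewise_cubic(coeffs, breaks, 2.0)
```

With n/4 = 2 pieces there are n = 8 coefficients, which is why n equations are needed to pin the curve down.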

Page 6: Continuous Representations of Time Gene Expression Data

Splines

• B-spline
– Expressed in terms of a set of normalized basis functions
• Application to fitting curves to gene expression time-series data
– The B-spline basis makes it convenient to obtain approximating or smoothing splines
– Fewer basis coefficients than there are observed data points
– Avoids overfitting

Page 7: Continuous Representations of Time Gene Expression Data

Splines

• The basis coefficients $C_i$:
– Interpreted geometrically as control points
– The vertices of a polygon that controls the shape of the spline but are not interpolated by the curve
– The curve lies entirely within the convex hull of this controlling polygon.
– Each vertex exerts only a local influence on the curve.

Page 8: Continuous Representations of Time Gene Expression Data

Splines

Page 9: Continuous Representations of Time Gene Expression Data

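Splines

– Within any knot interval $x_i$, S(t) is a polynomial of degree k-1
– S(t) has continuous derivatives of order 1, 2, …, k-2
– For a given value of k: $\sum_{i=1}^{n} b_{i,k}(t) = 1$
– Over the valid range of t, $b_{i,k} \ge 0$, and each $b_{i,k}$ has exactly one maximum; except for k = 1, 2, every $b_{i,k}$ is a continuous smooth curve.

[Figure: basis functions $b_{i,1}$, $b_{i,2}$, $b_{i,3}$ plotted as y against t over the knots $x_i$, $x_{i+1}$, $x_{i+2}$, $x_{i+3}$; the vertical axis is marked at 1]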
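The basis-function properties listed on this slide can be checked numerically. The sketch below uses the standard Cox-de Boor recursion, which the slides do not spell out; the function name `bspline_basis`, the 0-based index i, and the integer knot vector are assumptions made for illustration.

```python
def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion for b_{i,k}(t); i is 0-based, k is the order (degree k - 1)."""
    if k == 1:
        # Order 1: indicator of the half-open knot interval [x_i, x_{i+1}).
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    span_left = knots[i + k - 1] - knots[i]
    if span_left > 0:
        left = (t - knots[i]) / span_left * bspline_basis(i, k - 1, t, knots)
    span_right = knots[i + k] - knots[i + 1]
    if span_right > 0:
        right = (knots[i + k] - t) / span_right * bspline_basis(i + 1, k - 1, t, knots)
    return left + right

knots = [0, 1, 2, 3, 4, 5, 6, 7]
k = 4                      # cubic B-splines
n = len(knots) - k         # number of basis functions (4 here)
ts = [3.0 + 0.1 * m for m in range(10)]   # sample points in the valid range [3, 4)
basis_values = [[bspline_basis(i, k, t, knots) for i in range(n)] for t in ts]
row_sums = [sum(row) for row in basis_values]   # each should be 1 up to rounding
```

Every entry of `basis_values` is non-negative and every row sums to one inside the valid range, which is what makes the curve a weighted average of its control points.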

Page 10: Continuous Representations of Time Gene Expression Data

Splines

• A uniform knot vector is one in which the entries are evenly spaced
– i.e. $x = (0, 1, 2, 3, 4, 5, 6, 7)^T$
– The basis functions will be translates of each other, i.e. $b_{i,k}(t) = b_{i+1,k}(t+1) = b_{i-1,k}(t-1)$
– For a periodic cubic B-spline (k=4), the equation specifying the curve:

$$y(t) = \sum_{i=1}^{n} C_i\, b_{i,4}(t) \quad \text{for } x_4 \le t \le x_{n+1}$$

Page 11: Continuous Representations of Time Gene Expression Data

B-splines

– The B-spline will only be defined in the shaded region $3 \le t \le 4$
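A small sketch of why the curve is only defined for 3 <= t <= 4 with the knot vector (0, 1, ..., 7): only there do all four cubic basis functions overlap and sum to one. The helper `basis`, the control points `C`, and the function `curve` are hypothetical names and values chosen for this example.

```python
def basis(i, k, t, x):
    """Cox-de Boor recursion for b_{i,k}(t) with 0-based index i and knot list x."""
    if k == 1:
        return 1.0 if x[i] <= t < x[i + 1] else 0.0
    left = right = 0.0
    if x[i + k - 1] > x[i]:
        left = (t - x[i]) / (x[i + k - 1] - x[i]) * basis(i, k - 1, t, x)
    if x[i + k] > x[i + 1]:
        right = (x[i + k] - t) / (x[i + k] - x[i + 1]) * basis(i + 1, k - 1, t, x)
    return left + right

x = list(range(8))            # uniform knot vector (0, 1, ..., 7)
C = [1.0, 3.0, 2.0, 0.5]      # hypothetical control points

def curve(t):
    """y(t) = sum_i C_i b_{i,4}(t)."""
    return sum(C[i] * basis(i, 4, t, x) for i in range(len(C)))

# Translation property on a uniform knot vector: b_{i,k}(t) = b_{i+1,k}(t + 1).
shift_gap = abs(basis(0, 4, 2.3, x) - basis(1, 4, 3.3, x))

# Inside [3, 4) the basis functions sum to one, so the curve stays between the
# smallest and largest control point (a one-dimensional convex hull).
inside = [curve(3.0 + 0.05 * m) for m in range(20)]

# Outside that interval fewer than four basis functions are active and the sum drops below one.
partial_sum = sum(basis(i, 4, 0.5, x) for i in range(len(C)))
```

At t = 3 the active weights are 1/6, 2/3 and 1/6, so `curve(3.0)` evaluates to 2.5 for these control points.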

Page 12: Continuous Representations of Time Gene Expression Data

Estimating Unobserved Expression Values and Time Points

• To obtain a continuous time formulation, use cubic B-splines
– Get the value of the splines at a set of control points in the time series.
• Re-sample the curve to estimate expression values at any time point.
• Spline functions are not fit for each gene individually
– because of noise and missing values
– this would lead to over-fitting
• Instead, constrain the spline coefficients of co-expressed genes to have the same covariance matrix
– Use other genes in the same class to estimate the missing values of a specific gene.

Page 13: Continuous Representations of Time Gene Expression Data

Estimating Unobserved Expression Values and Time Points

• A probabilistic model of time series expression data
– Assume a set of genes are grouped together
  • Using prior biological knowledge
  • a clustering algorithm
– For a gene $i$ in class $j$, $Y_i(t)$ is the observed value for $i$ at time $t$:

$$Y_i(t) = s(t)\,(\mu_j + \gamma_i) + \epsilon_i$$

Page 14: Continuous Representations of Time Gene Expression Data

Estimating Unobserved Expression Values and Time Points

– $\mu_j$: the average value of the spline coefficients for genes in class $j$
– $\gamma_i$: the gene-specific variation coefficients; a normally distributed vector with mean zero and the class spline control points covariance matrix $\Gamma_j$
– $\epsilon_i$: random noise term that is normally distributed with mean 0 and variance $\sigma^2$
– $q$: the number of spline control points used
– $s(t)$: the vector of spline basis functions evaluated at time $t$, of dimension 1 by $q$
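To make the generative model Y_i(t) = s(t)(mu_j + gamma_i) + eps_i concrete, here is a minimal simulation sketch. It assumes a cubic B-spline basis on integer knots and, purely for simplicity, a diagonal class covariance Gamma_j given as per-coefficient standard deviations; `basis`, `spline_row`, and `simulate_gene` are hypothetical names.

```python
import random

def basis(i, k, t, x):
    """Cox-de Boor recursion for the B-spline basis function b_{i,k}(t) (0-based i)."""
    if k == 1:
        return 1.0 if x[i] <= t < x[i + 1] else 0.0
    left = right = 0.0
    if x[i + k - 1] > x[i]:
        left = (t - x[i]) / (x[i + k - 1] - x[i]) * basis(i, k - 1, t, x)
    if x[i + k] > x[i + 1]:
        right = (x[i + k] - t) / (x[i + k] - x[i + 1]) * basis(i + 1, k - 1, t, x)
    return left + right

def spline_row(t, q, knots):
    """s(t): the q cubic basis functions evaluated at time t."""
    return [basis(i, 4, t, knots) for i in range(q)]

def simulate_gene(mu, gamma_sd, sigma, times, knots, rng):
    """Draw one gene's observations from Y_i(t) = s(t)(mu_j + gamma_i) + eps_i."""
    # Assumption: Gamma_j is diagonal, so each gene-specific offset is drawn independently.
    gamma = [rng.gauss(0.0, sd) for sd in gamma_sd]
    coef = [m + g for m, g in zip(mu, gamma)]
    q = len(mu)
    return [sum(b * c for b, c in zip(spline_row(t, q, knots), coef)) + rng.gauss(0.0, sigma)
            for t in times]

knots = list(range(9))                 # q + 4 knots for q = 5 control points
mu_j = [0.0, 1.0, 2.0, 1.0, 0.0]       # hypothetical class mean coefficients
times = [3.0, 3.4, 3.9, 4.3, 4.8]      # sample times inside the valid range [3, 5)
y_i = simulate_gene(mu_j, [0.3] * 5, 0.1, times, knots, random.Random(0))

# With no gene-specific variation and no noise, a constant mean vector gives a flat profile.
flat = simulate_gene([2.0] * 5, [0.0] * 5, 0.0, times, knots, random.Random(0))
```

Genes in the same class share `mu_j` and the covariance, which is what lets other genes in the class inform the estimate for a sparsely sampled gene.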

Page 15: Continuous Representations of Time Gene Expression Data

Estimating Unobserved Expression Values and Time Points

• To learn the parameters of this model ($\mu$, $\Gamma$, and $\sigma^2$)
– Use the observed values, and maximize the likelihood of the input data
– In vector form: $Y_i = S_i(\mu_j + \gamma_i) + \epsilon_i$
– $Y_i$: the vector of observed values for gene $i$, with a total of $m_i$ expression values for $i$ in our dataset
– $S_i$: the spline basis functions evaluated at the times at which values for gene $i$ were observed, of dimension $m_i$ by $q$
– We can resample gene $i$ at any time $t'$:

$$\hat{Y}_i(t') = s(t')\,(\mu_j + \gamma_i)$$

Page 16: Continuous Representations of Time Gene Expression Data

Estimating Unobserved Expression Values and Time Points

– Decompose the probability:
– The combined covariance matrix for a gene in class $j$: $\Sigma_j = \sigma^2 I + S\,\Gamma_j S^T$
  • If the $\gamma$ values were observed, decompose the probability:
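The combined covariance sigma^2 I + S Gamma_j S^T can be computed directly from its definition. The sketch below is illustrative only: `combined_covariance` is a hypothetical helper, matrices are plain nested lists, and the small S and Gamma values are made up.

```python
def combined_covariance(S, Gamma, sigma2):
    """Return sigma2 * I + S Gamma S^T for an m-by-q matrix S and a q-by-q matrix Gamma."""
    m, q = len(S), len(Gamma)
    # First form S Gamma (m by q), then multiply by S^T (q by m).
    SG = [[sum(S[r][a] * Gamma[a][b] for a in range(q)) for b in range(q)] for r in range(m)]
    return [[sum(SG[r][b] * S[c][b] for b in range(q)) + (sigma2 if r == c else 0.0)
             for c in range(m)] for r in range(m)]

# Hypothetical gene observed at m = 2 time points with q = 2 control points.
S = [[1.0, 0.0],
     [0.5, 0.5]]
Gamma = [[2.0, 0.0],
         [0.0, 4.0]]
Sigma = combined_covariance(S, Gamma, 0.1)
# Sigma is approximately [[2.1, 1.0], [1.0, 1.6]].
```

The off-diagonal entries come entirely from the shared spline coefficients; the noise term only adds to the diagonal.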

Page 17: Continuous Representations of Time Gene Expression Data

Estimating Unobserved Expression Values and Time Points

– Use EM
  • E step: find the best estimate for $\gamma$ using the values we have for $\sigma^2$, $\Gamma$, and $\mu$.
  • M step: maximize $p(Y, \gamma \mid \mu, \Gamma, \sigma^2)$.
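One way to read the E step is as the posterior mean of the gene-specific coefficients given the current parameters. For a linear Gaussian model of this form, standard Gaussian conditioning gives (sigma^2 Gamma^{-1} + S^T S)^{-1} S^T (Y - S mu). The slides do not show this expression, so treat the sketch as an assumption about how the step could be computed; `solve` and `e_step_gamma` are hypothetical helpers, and a full E step would also need second moments.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for col in range(c, n + 1):
                M[r][col] -= f * M[c][col]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][col] * x[col] for col in range(r + 1, n))) / M[r][r]
    return x

def e_step_gamma(Y, S, mu, Gamma_inv, sigma2):
    """Posterior mean of gamma_i: (sigma2 * Gamma^-1 + S^T S)^-1 S^T (Y - S mu)."""
    m, q = len(S), len(mu)
    resid = [Y[r] - sum(S[r][a] * mu[a] for a in range(q)) for r in range(m)]
    A = [[sigma2 * Gamma_inv[a][b] + sum(S[r][a] * S[r][b] for r in range(m))
          for b in range(q)] for a in range(q)]
    rhs = [sum(S[r][a] * resid[r] for r in range(m)) for a in range(q)]
    return solve(A, rhs)

# Hypothetical check: identity basis, unit covariance, unit noise variance.
identity = [[1.0, 0.0], [0.0, 1.0]]
gamma_hat = e_step_gamma([3.0, 5.0], identity, [1.0, 1.0], identity, 1.0)
# The residuals (2, 4) are shrunk halfway toward zero.
```

The shrinkage toward zero is the mechanism by which a noisy gene is pulled toward its class mean curve.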

Page 18: Continuous Representations of Time Gene Expression Data

Model Based Clustering Algorithm for Temporal Data

• A new clustering algorithm that simultaneously solves the parameter estimation and class assignment problems
– Treat the class assignments as the missing variables
– First, we select a class $j$ for gene $i$ uniformly at random.
– EM algorithm
  • E step: estimate, for each gene $i$ and class $j$, the probability $P(j \mid i)$ that $i$ belongs to class $j$
  • M step: maximize the parameters for each class $j$ with respect to the class probability $P(j \mid i)$ as computed in the E step

Page 19: Continuous Representations of Time Gene Expression Data

Model Based Clustering Algorithm for Temporal Data

– When the algorithm converges, assign gene $i$ to the class $j$ such that

$$p(j \mid i) = \max_{1 \le k \le c} P(k \mid i)$$
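The class-probability computation and the final assignment rule can be sketched as follows. The per-class log-likelihoods are taken as given inputs here, and the mixing proportions `priors` are an assumption (the slides do not mention them); `class_posteriors` and `assign_class` are hypothetical names.

```python
import math

def class_posteriors(log_lik, priors):
    """P(j | i) for one gene from log p(Y_i | class j) and assumed class priors."""
    scores = [ll + math.log(p) for ll, p in zip(log_lik, priors)]
    top = max(scores)                      # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def assign_class(posteriors):
    """Index of the class with the largest P(k | i), used once EM has converged."""
    return max(range(len(posteriors)), key=posteriors.__getitem__)

# Hypothetical gene whose data are three times as likely under class 1 as under classes 0 or 2.
log_lik = [math.log(0.2), math.log(0.6), math.log(0.2)]
post = class_posteriors(log_lik, [1.0 / 3.0] * 3)
best = assign_class(post)
```

During EM the soft probabilities in `post` weight each gene's contribution to every class; only after convergence is the hard assignment in `best` made.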

Page 20: Continuous Representations of Time Gene Expression Data

Aligning Temporal Data

• Assume we have two sets of time-series gene expression profiles
– Splines for reference: $g_i^1(s)$, where $s_{\min} \le s \le s_{\max}$
– Splines in the set to be warped: $g_i^2(t)$, where $t_{\min} \le t \le t_{\max}$
• A mapping $T(s) = t$
– Linear transformation: $T(s) = (s - b)/a$

Page 21: Continuous Representations of Time Gene Expression Data

Aligning Temporal Data

• The error of the alignment: $e_i^2$
– The averaged squared distance between the two curves:

$$e_i^2 = \frac{\int_{\alpha}^{\beta} \left[ g_i^2(T(s)) - g_i^1(s) \right]^2 ds}{\beta - \alpha}$$

where $\alpha = \max\{s_{\min}, T^{-1}(t_{\min})\}$ and $\beta = \min\{s_{\max}, T^{-1}(t_{\max})\}$.

– Takes into account the degree of overlap between the curves.
• Find parameters $a$ and $b$ that minimize $e_i^2$
• The error for a set of genes $S$ of size $n$:

$$E_S = \sum_{i=1}^{n} w_i e_i^2$$
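The alignment error for one gene can be approximated numerically. This sketch assumes a > 0, uses the trapezoid rule in place of an exact integral, and uses sine curves as stand-ins for the fitted splines. The overlap interval runs from alpha = max(s_min, T^{-1}(t_min)) to beta = min(s_max, T^{-1}(t_max)), with T(s) = (s - b)/a; `alignment_error` is a hypothetical name.

```python
import math

def alignment_error(g1, g2, a, b, s_min, s_max, t_min, t_max, steps=2000):
    """Averaged squared distance between g1(s) and g2(T(s)) over the overlap interval."""
    warp = lambda s: (s - b) / a          # T(s)
    unwarp = lambda t: a * t + b          # T^{-1}(t)
    alpha = max(s_min, unwarp(t_min))
    beta = min(s_max, unwarp(t_max))
    if beta <= alpha:
        return math.inf                   # the two curves do not overlap
    h = (beta - alpha) / steps
    total = 0.0
    for m in range(steps + 1):
        s = alpha + m * h
        d = g2(warp(s)) - g1(s)
        total += (0.5 if m in (0, steps) else 1.0) * d * d   # trapezoid weights
    return total * h / (beta - alpha)

# Hypothetical curves: the second series is the reference squeezed by 2 and shifted.
g1 = lambda s: math.sin(s)
g2 = lambda t: math.sin(2.0 * t + 1.0)
err_true = alignment_error(g1, g2, 2.0, 1.0, 0.0, 10.0, 0.0, 5.0)   # a = 2, b = 1 undoes the warp
err_none = alignment_error(g1, g2, 1.0, 0.0, 0.0, 10.0, 0.0, 5.0)   # no warping
```

A search over a and b, for example a simple grid or any general-purpose optimizer, would then look for the pair with the smallest error.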

Page 22: Continuous Representations of Time Gene Expression Data

Aligning Temporal Data

– The $w_i$'s are weighting coefficients that sum to one
– One way of formulating this is to require that the product $w_i e_i^2$ be the same for all genes
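If the product w_i e_i^2 must be the same for every gene and the weights must sum to one, the weights are proportional to 1/e_i^2. The sketch below works this out for made-up error values; `equal_product_weights` is a hypothetical name and every e_i^2 is assumed to be strictly positive.

```python
def equal_product_weights(errors_sq):
    """Weights w_i proportional to 1 / e_i^2, normalized to sum to one."""
    inverse = [1.0 / e for e in errors_sq]
    total = sum(inverse)
    return [v / total for v in inverse]

errors_sq = [1.0, 2.0, 4.0]                         # hypothetical per-gene errors e_i^2
weights = equal_product_weights(errors_sq)          # 4/7, 2/7, 1/7
products = [w * e for w, e in zip(weights, errors_sq)]
set_error = sum(products)                           # E_S = sum_i w_i e_i^2
```

Genes that align poorly receive small weights, so a few badly fitting genes do not dominate the set error.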

Page 23: Continuous Representations of Time Gene Expression Data

Results

• 800 genes in Saccharomyces cerevisiae with five groups
• Unobserved data estimation

Page 24: Continuous Representations of Time Gene Expression Data

Results

• Clustering
– Explore the effect of non-uniform sampling
  • Two synthetic curves:

Page 25: Continuous Representations of Time Gene Expression Data

Results

Page 26: Continuous Representations of Time Gene Expression Data

Results

Page 27: Continuous Representations of Time Gene Expression Data

Results