
  • Gaussian process models and covariance function design

    Nicolas Durrande (Ecole des Mines de St Etienne)

    [email protected]

    UJM – APSSE Seminar, 28th of November 2014

  • Introduction

    Context

    We consider a function f whose evaluation is costly. Examples:

    a physical experiment
    the output of a large computer code
    ...

  • Introduction

    Using a limited number of observations, we want to answer questions
    such as:

    what is the minimum of f?
    what is the mean value of f?
    what is the probability of being above a given threshold?
    are there any non-influential variables?
    ...

  • Introduction

    Outline:

    1 Introduction to GP models
    2 A few words on RKHS
    3 What is a kernel?
    4 Usual operations on kernels
    5 Effect of a linear operator
    6 Two examples of application:
      sensitivity analysis
      periodicity detection

  • Introduction

    We assume we have observed f for a limited number of time points
    x1, . . . , xn:

    [Figure: the observations (xi, fi) plotted against x]

    The observations are denoted by fi = f(xi) (or F = f(X)).

  • Introduction

    Since f is unknown, we make the general assumption that it is a
    sample path of a Gaussian process Y:

    [Figure: sample paths of Y]

    Y is characterised by its mean and covariance function.

  • Introduction

    We can look at the sample paths of Y that interpolate the data points:

    [Figure: conditional sample paths of Y interpolating the observations]

  • Introduction

    The conditional distribution is still Gaussian. It has mean and
    variance

    m(x) = E(Y(x) | Y(X) = F) = k(x, X) k(X, X)^{-1} F
    v(x) = var(Y(x) | Y(X) = F) = k(x, x) − k(x, X) k(X, X)^{-1} k(X, x)

    where k is the kernel: k(x, y) = cov(Y(x), Y(y)).

    It can be represented as a mean function with confidence intervals.

    [Figure: conditional mean m with confidence intervals]
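    As an aside, a minimal numpy sketch of these two formulas (the
    kernel, the observation points and the observed values below are
    illustrative assumptions, not the data behind the figure):

        import numpy as np

        def k_gauss(x, y, sigma2=1.0, theta=2.0):
            # Gaussian kernel k(x, y) = sigma^2 exp(-(x - y)^2 / (2 theta^2))
            return sigma2 * np.exp(-(x[:, None] - y[None, :]) ** 2 / (2 * theta**2))

        X = np.array([1.0, 4.0, 6.0, 10.0, 14.0])  # observation points (assumed)
        F = np.sin(X)                              # observed values F = f(X) (assumed)
        x = np.linspace(0, 15, 200)                # prediction grid

        K_inv = np.linalg.inv(k_gauss(X, X))       # k(X, X)^{-1} (a Cholesky solve in practice)
        kxX = k_gauss(x, X)                        # k(x, X)
        m = kxX @ K_inv @ F                        # conditional mean m(x)
        v = k_gauss(x, x).diagonal() - np.einsum('ij,jk,ik->i', kxX, K_inv, kxX)  # v(x)

    Plotting m together with m ± 1.96 √v gives the kind of figure
    sketched above.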

  • Introduction

    Changing the kernel has a huge impact on the model:

    Gaussian kernel: k(x, y) = σ^2 exp(−(x − y)^2 / (2θ^2))
    [Figure: model obtained with the Gaussian kernel]

    exponential kernel: k(x, y) = σ^2 exp(−|x − y| / θ)
    [Figure: model obtained with the exponential kernel]

  • Introduction

    This is because changing the kernel means changing the prior on f:

    [Figure: prior sample paths for the Gaussian kernel vs. the
    exponential kernel]

    Given the observations, the model is entirely defined by the kernel.
    We will now focus on this object.

  • Introduction

    Theorem (Loève)
    k corresponds to the covariance of a GP
    ⇔
    k is a symmetric positive semi-definite function

    Definition
    A function k is positive semi-definite if it satisfies

    ∑_{i=1}^{n} ∑_{j=1}^{n} a_i a_j k(x_i, x_j) ≥ 0

    for all n ∈ N, for all x_i ∈ D, for all a_i ∈ R.
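    A quick numerical illustration of this definition (a sanity check,
    not a proof): for any finite set of points, the Gram matrix of a
    positive semi-definite kernel has non-negative eigenvalues. The
    kernel and points below are arbitrary choices.

        import numpy as np

        def k_exp(x, y, sigma2=1.0, theta=1.0):
            # exponential kernel k(x, y) = sigma^2 exp(-|x - y| / theta)
            return sigma2 * np.exp(-np.abs(x[:, None] - y[None, :]) / theta)

        x = np.random.uniform(0, 15, size=50)         # arbitrary points in D
        G = k_exp(x, x)                               # symmetric Gram matrix
        print(np.linalg.eigvalsh(G).min() >= -1e-10)  # True, up to round-off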

  • Introduction

    A symmetric positive semi-definite function is also the reproducing
    kernel of a RKHS:

    Definition
    H is a RKHS with kernel k if it is a Hilbert space such that:

    for all x, k(x, .) ∈ H
    for all f ∈ H, ⟨f(.), k(x, .)⟩_H = f(x)

    Given a kernel k, the associated RKHS is the completion of

    { ∑_{i=1}^{n} a_i k(x_i, .) ; n ∈ N, a_i ∈ R, x_i ∈ D }

    for the inner product

    ⟨ ∑_{i=1}^{n} a_i k(x_i, .), ∑_{j=1}^{m} b_j k(y_j, .) ⟩ = ∑_{i=1}^{n} ∑_{j=1}^{m} a_i b_j k(x_i, y_j)

  • Introduction

    Given some observations, the best predictor is defined as the
    interpolator with minimal norm:

    m = argmin_{h ∈ H, h(x_i) = f(x_i)} ||h||_H = · · · = k(x, X) k(X, X)^{-1} F

    The expression is the same as the conditional expectation of the GP!

    [Diagram: the kernel k, the RKHS H, the GP Z and the predictor m]

  • Introduction

    In order to build m we can use any off-the-shelf kernel:

    white noise: k(x, y) = δ_{x,y}
    bias: k(x, y) = 1
    linear: k(x, y) = xy
    exponential: k(x, y) = exp(−|x − y|)
    Brownian: k(x, y) = min(x, y)
    Gaussian: k(x, y) = exp(−(x − y)^2)
    Matérn 3/2: k(x, y) = (1 + |x − y|) exp(−|x − y|)
    sinc: k(x, y) = sin(|x − y|) / |x − y|
    ...
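    The same list written as a small Python library (a sketch; the
    inputs are assumed to be numpy arrays of equal shape):

        import numpy as np

        kernels = {
            "white noise": lambda x, y: (x == y).astype(float),
            "bias":        lambda x, y: np.ones_like(x * y, dtype=float),
            "linear":      lambda x, y: x * y,
            "exponential": lambda x, y: np.exp(-np.abs(x - y)),
            "Brownian":    lambda x, y: np.minimum(x, y),
            "Gaussian":    lambda x, y: np.exp(-(x - y) ** 2),
            "Matern 3/2":  lambda x, y: (1 + np.abs(x - y)) * np.exp(-np.abs(x - y)),
            "sinc":        lambda x, y: np.sinc((x - y) / np.pi),  # sin(x - y) / (x - y)
        }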


  • Introduction

    The kernel has to be chosen according to the prior belief on the
    function to approximate:

    What is the regularity of the phenomenon?
    Is it stationary?
    ...

    If we have some knowledge about the behaviour of f, can we design a
    kernel accordingly? We will discuss 3 options:

    making new from old
    linear operators
    extracting RKHS subspaces

  • Introduction

    Making new from old:

    Kernels can be:

    summed together
      on the same space: k(x, y) = k1(x, y) + k2(x, y)
      on the tensor space: k(x, y) = k1(x1, y1) + k2(x2, y2)
    multiplied together
      on the same space: k(x, y) = k1(x, y) × k2(x, y)
      on the tensor space: k(x, y) = k1(x1, y1) × k2(x2, y2)
    composed with a function
      k(x, y) = k1(f(x), f(y))

    All these operations preserve positive definiteness.

    How can this be useful?
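    A toy sketch of these three constructions (k1, k2 and the map f
    below are arbitrary choices for illustration):

        import numpy as np

        k1 = lambda x, y: np.exp(-(x - y) ** 2)     # Gaussian kernel
        k2 = lambda x, y: np.minimum(x, y)          # Brownian kernel
        f  = lambda x: np.log(1.0 + x)              # some deterministic map

        k_sum  = lambda x, y: k1(x, y) + k2(x, y)   # sum on the same space
        k_prod = lambda x, y: k1(x, y) * k2(x, y)   # product on the same space
        k_comp = lambda x, y: k1(f(x), f(y))        # composition with f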

  • Introduction

    Sum of kernels over the same space

    Example (The Mauna Loa observatory dataset)
    This famous dataset compiles the monthly CO2 concentration in Hawaii
    since 1958.

    [Figure: monthly CO2 concentration against time]

    Let’s try to predict the concentration for the next 20 years.

  • Introduction

    Sum of kernels over the same space

    We first consider a squared-exponential kernel:

    [Figure: prediction with a single squared-exponential kernel]

    The results are terrible!

  • Introduction

    Sum of kernels over the same space

    What happens if we sum two squared-exponential kernels?

    k(x, y) = σ_1^2 k_rbf1(x, y) + σ_2^2 k_rbf2(x, y)

    [Figure: prediction with the sum of two squared-exponential kernels]

    The model is drastically improved!


  • Introduction

    Sum of kernels over the same space

    We can try the following kernel:

    k(x, y) = σ_0^2 x^2 y^2 + σ_1^2 k_rbf1(x, y) + σ_2^2 k_rbf2(x, y) + σ_3^2 k_per(x, y)

    [Figure: prediction with the four-component kernel]

    Once again, the model is significantly improved.
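    A sketch of this composite kernel in Python. The exp-sin^2 form of
    the periodic term and all hyper-parameter values are assumptions for
    illustration; in practice they would be estimated by maximum
    likelihood.

        import numpy as np

        def k_co2(x, y, s0=1.0, s1=50.0, s2=5.0, s3=1.0,
                  t1=50.0, t2=5.0, tp=1.0, period=1.0):
            d = x - y
            k_lin  = s0**2 * x**2 * y**2                  # quadratic trend
            k_rbf1 = s1**2 * np.exp(-d**2 / (2 * t1**2))  # long-term variations
            k_rbf2 = s2**2 * np.exp(-d**2 / (2 * t2**2))  # medium-term variations
            k_per  = s3**2 * np.exp(-2 * np.sin(np.pi * d / period)**2 / tp**2)  # seasonality
            return k_lin + k_rbf1 + k_rbf2 + k_per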


  • Introduction

    Sum of kernels over tensor space

    Property

    K(x, y) = K1(x1, y1) + K2(x2, y2)    (1)

    is a valid covariance structure.

    [Figure: K1 + K2 = K, plotted as surfaces over [0, 1]^2]

    Remark:
    From a GP point of view, K is the kernel of Z(x) = Z1(x1) + Z2(x2).
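    A sketch of such a kernel over [0, 1]^2, with two (assumed)
    one-dimensional Gaussian kernels:

        import numpy as np

        def k1(u, v): return np.exp(-(u - v) ** 2 / 0.1)   # kernel for x1
        def k2(u, v): return np.exp(-(u - v) ** 2 / 0.1)   # kernel for x2

        def K_add(x, y):
            # x, y: points of [0, 1]^2, given as length-2 arrays
            return k1(x[0], y[0]) + k2(x[1], y[1])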

  • Introduction

    Sum of kernels over tensor space

    We can have a look at a few sample paths from Z:

    [Figure: three sample paths of Z over [0, 1]^2]

    ⇒ They are additive (up to a modification).

    Additive kernels are very useful for:
    approximating additive functions
    building models over high-dimensional input spaces

  • Introduction

    Effect of a linear operator

    Property
    Let L be a linear operator that commutes with the covariance; then
    k(x, y) = L_x(L_y(k1(x, y))) is a kernel.

    Example
    We want to approximate a function [0, 1] → R that is symmetric with
    respect to 0.5. We will consider 2 linear operators:

    L1 : f(x) → f(x) if x < 0.5, f(1 − x) if x ≥ 0.5

    L2 : f(x) → (f(x) + f(1 − x)) / 2
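    A sketch of the kernel obtained by applying L2 to both arguments of
    a base kernel k (the base kernel is an assumed Gaussian): sample
    paths of the associated GP are symmetric with respect to 0.5 by
    construction.

        import numpy as np

        def k(x, y):
            return np.exp(-(x - y) ** 2 / 0.05)   # base kernel (assumed)

        def k_sym(x, y):
            # k2 = L2(L2(k)): average over {x, 1 - x} and {y, 1 - y}
            return 0.25 * (k(x, y) + k(x, 1 - y) + k(1 - x, y) + k(1 - x, 1 - y))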

  • Introduction

    Effect of a linear operator: example (Ginsbourger, AFST 2013)

    Examples of associated sample paths are:

    [Figure: sample paths for k1 = L1(L1(k)) and k2 = L2(L2(k))]

    The differentiability is not always respected!

  • Introduction

    Effect of a linear operator

    Ideally, we want to extract the subspace Hsym of symmetric functions
    in H

    [Diagram: H, its subspace Hsym, a function f and its images L1 f, L2 f]

    and to define L as the orthogonal projection onto Hsym.

    ⇒ This can be difficult... but it raises interesting questions!

  • 2 practical cases

    We will now see how to construct the optimal linear operator on two
    real problems:

    kernels for sensitivity analysis
    kernels for periodicity detection

  • 2 practical cases

    Sensitivity analysis

    The analysis of the influence of the various variables of a
    d-dimensional function f is often based on the HDMR:

    f(x) = f_0 + ∑_{i=1}^{d} f_i(x_i) + ∑_{i<j} f_{ij}(x_i, x_j) + · · · + f_{1,...,d}(x)

  • 2 practical cases

    A first idea is to consider ANOVA kernels [Stitson 97]:

    k(x, y) = ∏_{i=1}^{d} (1 + k(x_i, y_i))
            = 1 + ∑_{i=1}^{d} k(x_i, y_i)  ← additive part
              + ∑_{i<j} k(x_i, y_i) k(x_j, y_j) + · · · + ∏_{i=1}^{d} k(x_i, y_i)  ← interactions
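    A sketch of this ANOVA kernel, with the same (assumed) Gaussian
    kernel in every direction:

        import numpy as np

        def k1d(u, v, theta=0.2):
            return np.exp(-(u - v) ** 2 / (2 * theta**2))   # 1-D kernel (assumed)

        def k_anova(x, y):
            # x, y: d-dimensional points as 1-D numpy arrays
            return np.prod([1.0 + k1d(xi, yi) for xi, yi in zip(x, y)])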

  • 2 practical cases

    A decomposition of the best predictor is naturally associated to
    those kernels.

    Example: in 2D we have K = 1 + K1 + K2 + K1K2, so the best predictor
    can be written as

    m(x) = (1 + k1(x1) + k2(x2) + k1(x1)k2(x2))^t K^{-1} F
         = m0 + m1(x1) + m2(x2) + m12(x)

    This decomposition looks like the ANOVA representation of m, but the
    m_I do not satisfy

    ∫_{D_i} m_I(x_I) dx_i = 0

    We need to build a kernel k0 such that ∫ k0(x, y) dx = 0 for all y.

  • 2 practical cases

    The RKHS framework will be very useful here.

    Can we extract the subspace H0 of zero-mean functions in H?

    h ∈ H0 ⇔ ∫ h(x) dx = 0

    [Diagram: H and its subspace H0]

    The integral operator h → ∫ h(x) dx is linear, and it is bounded if
    ∫ k(x, x) dx < ∞; the Riesz representation theorem then gives an
    R ∈ H such that ∫ h(x) dx = ⟨h, R⟩_H.

  • 2 practical cases

    Calculations give directly:

    R(x) = ⟨R, k(x, .)⟩_H = ∫_D k(x, s) ds

    L(h) = h − (⟨h, R⟩_H / ||R||_H^2) R

    k0(x, y) = k(x, y) − (∫ k(x, s) ds ∫ k(y, s) ds) / (∫∫ k(s, t) ds dt)
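    A numerical sketch of k0 on D = [0, 1]: the integrals in the
    closed-form expression are approximated by Riemann sums on a grid,
    and the base kernel is an assumed Gaussian.

        import numpy as np

        s = np.linspace(0, 1, 500)                 # quadrature grid on D = [0, 1]

        def k(x, y, theta=0.2):
            return np.exp(-(x - y) ** 2 / (2 * theta**2))   # base kernel (assumed)

        def R(x):
            return k(x, s).mean()                  # R(x) = int_D k(x, t) dt

        RR = np.mean([R(t) for t in s])            # int_D int_D k(t, u) dt du

        def k0(x, y):
            return k(x, y) - R(x) * R(y) / RR      # zero-mean kernel

        # sanity check: int_D k0(x, y) dx = 0 for any y (up to quadrature error)
        print(np.mean([k0(t, 0.3) for t in s]))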

  • 2 practical cases

    Let us consider the random test function f : [0, 1]^10 → R:

    x → 10 sin(π x1 x2) + 20 (x3 − 0.5)^2 + 10 x4 + 5 x5 + N(0, 1)

    The steps for approximating f with GPR are:
    1 learn f on a DoE (here a maximin LHS with 180 points)
    2 get the optimal values for the kernel parameters using MLE
    3 build the kriging predictor m based on ∏(1 + k0)

    As f̂ is a function of 10 variables, the model cannot easily be
    represented: it is usually considered as a “blackbox”. However, the
    structure of the kernel allows us to split m into submodels.

  • 2 practical cases

    The univariate sub-models are:

    [Figure: the univariate sub-models m_i(x_i); recall that
    f(x) = 10 sin(π x1 x2) + 20 (x3 − 0.5)^2 + 10 x4 + 5 x5 + N(0, 1)]

  • 2 practical cases Periodicity detection

    General problem
    Given a few observations, can we extract the periodic part of a
    signal?

    [Figure: noisy observations of a signal over [0, 15]]

  • 2 practical cases Periodicity detection

    As previously, we will build an orthogonal decomposition of the RKHS:

    H = Hp + Ha

    where Hp is the subspace of H spanned by the Fourier basis
    B(t) = (sin(t), cos(t), . . . , sin(nt), cos(nt))^t.

    Property
    The reproducing kernel of Hp is

    kp(x, y) = B(x)^t G^{-1} B(y)

    where G is the Gram matrix associated to B.

  • 2 practical cases Periodicity detection

    We can deduce the following decomposition of the kernel:

    k(x, y) = kp(x, y) + (k(x, y) − kp(x, y)) = kp(x, y) + ka(x, y)

    Property: Decomposition of the model
    The decomposition of the kernel gives directly

    m(t) = (kp(t) + ka(t))^t (Kp + Ka)^{-1} F
         = kp(t)^t (Kp + Ka)^{-1} F  ← periodic sub-model mp
           + ka(t)^t (Kp + Ka)^{-1} F  ← aperiodic sub-model ma

    and we can associate a prediction variance to the sub-models:

    vp(t) = kp(t, t) − kp(t)^t (Kp + Ka)^{-1} kp(t)
    va(t) = ka(t, t) − ka(t)^t (Kp + Ka)^{-1} ka(t)
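    A sketch of kp in Python. The Gram matrix below is computed with an
    L^2 inner product approximated on a grid, which is a simplifying
    assumption (the slides use the inner product of H); n and the grid
    are arbitrary choices.

        import numpy as np

        n = 3                                      # number of harmonics (assumed)
        t = np.linspace(0, 15, 1000)               # quadrature grid

        def B(x):
            # Fourier basis (sin(x), cos(x), ..., sin(nx), cos(nx))
            return np.concatenate([[np.sin(j * x), np.cos(j * x)] for j in range(1, n + 1)])

        Bt = np.array([B(ti) for ti in t])         # basis evaluated on the grid
        G_inv = np.linalg.inv(Bt.T @ Bt / len(t))  # inverse Gram matrix (L^2 approx.)

        def kp(x, y):
            return B(x) @ G_inv @ B(y)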

  • 2 practical cases Periodicity detection

    Example
    For the observations shown previously we obtain:

    [Figure: the full model, its periodic sub-model and its aperiodic
    sub-model]

    Can we do better?

  • 2 practical cases Periodicity detection

    Previously, the kernels were parameterized by 2 variables:

    k(x, y, σ^2, θ)

    but writing k as a sum allows us to tune the parameters of the
    sub-kernels independently. Let k* be defined as

    k*(x, y, σp^2, σa^2, θp, θa) = kp(x, y, σp^2, θp) + ka(x, y, σa^2, θa)

    Furthermore, we include a 5th parameter in k*, accounting for the
    period, by changing the Fourier basis:

    Bω(t) = (sin(ωt), cos(ωt), . . . , sin(nωt), cos(nωt))^t

  • 2 practical cases Periodicity detection

    If we optimize the 5 parameters of k* by maximum likelihood
    estimation we obtain:

    [Figure: the full model, its periodic sub-model and its aperiodic
    sub-model, after MLE of the 5 parameters]

  • 2 practical cases Periodicity detection

    The 24-hour cycle of days can be observed in the oscillations of many
    physiological processes of living beings.

    Examples
    Body temperature, jet lag, sleep, ... but this is also observed for
    plants, micro-organisms, etc.

    This phenomenon is called the circadian rhythm, and the mechanism
    driving this cycle is the circadian clock.

    To understand how the circadian clock operates at the gene level,
    biologists look at the temporal evolution of gene expression.

  • 2 practical cases Periodicity detection

    The aim of gene expression experiments is to measure the activity of
    the various genes.

  • 2 practical cases Periodicity detection

    The mRNA concentration is measured with microarray experiments.

    The chip is then scanned to determine the occupation of each cell and
    reveal the concentration of mRNA.

  • 2 practical cases Periodicity detection

    Experiments to study the circadian clock typically proceed as follows:
    1 expose the organism to a 12h light / 12h dark cycle
    2 at t = 0, transfer it to constant light
    3 perform a microarray experiment every 4 hours to measure gene
      expression

    Regulators of the circadian clock are often rhythmically regulated.
    ⇒ Identifying periodically expressed genes gives an insight into the
    overall mechanism.

  • 2 practical cases Periodicity detection

    We used data from Edward 2006, based on Arabidopsis.

    The dimensions of the data are:
    22810 genes
    13 time points

    Edward 2006 gives a list of the 3504 most periodically expressed
    genes. The comparison with our approach gives:

    21767 genes with the same label (2461 periodic and 19306 non-periodic)
    1043 genes with different labels

  • 2 practical cases Periodicity detection

    Let’s look at genes with different labels:

    [Figure: expression profiles of eight genes with conflicting labels
    (At1g60810, At4g10040, At1g06290, At5g48900, At5g41480, At3g08000,
    At3g03900, At2g36400); legend: periodic for Edward, periodic for our
    approach]


  • Conclusion

    Conclusion

    We have seen that:
    Gaussian process regression is a great tool for modeling.
    Kernels can (and should) be tailored to the problem at hand.

    What cannot be done with GPs... It is rather difficult to:
    impose non-linear constraints
    deal with a (very) large number of observations
    ...

  • Conclusion

    Conclusion

    Bibliography

    N. Aronszajn. Theory of reproducing kernels. Transactions of the AMS, 68, 1950.

    A. Berlinet and C. Thomas-Agnan. RKHS in probability and statistics. Kluwer Academic Publishers, 2004.
