Building an Automatic Statistician, DS 2014 (ds2014.ijs.si/invited/altds14.pdf)


Page 1

Building an Automatic Statistician

Zoubin Ghahramani

Department of Engineering, University of Cambridge

[email protected]

http://learning.eng.cam.ac.uk/zoubin/

ALT-Discovery Science Conference, October 2014

James Robert Lloyd (Cambridge), David Duvenaud (Cambridge → Harvard), Roger Grosse (MIT → Toronto), Josh Tenenbaum (MIT)

Page 2

THERE IS A GROWING NEED FOR DATA ANALYSIS

I We live in an era of abundant data

I The McKinsey Global Institute claims:

“The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.”

I Diverse fields increasingly rely on expert statisticians, machine learning researchers and data scientists, e.g.

I Computational sciences (e.g. biology, astronomy, ...)
I Online advertising
I Quantitative finance
I ...

James Robert Lloyd and Zoubin Ghahramani 2 / 44

Page 3

GOALS OF THE AUTOMATIC STATISTICIAN PROJECT

I Provide a set of tools for understanding data that require minimal expert input

I Uncover challenging research problems in e.g.
I Automated inference
I Model construction and comparison
I Data visualisation and interpretation

I Advance the field of machine learning in general

James Robert Lloyd and Zoubin Ghahramani 4 / 44

Page 4

Background

• Probabilistic modelling

• Model selection and marginal likelihoods

• Bayesian nonparametrics

• Gaussian processes

Page 5

Probabilistic Modelling

• A model describes data that one could observe from a system

• If we use the mathematics of probability theory to express all forms of uncertainty and noise associated with our model...

• ...then inverse probability (i.e. Bayes rule) allows us to infer unknown quantities, adapt our models, make predictions and learn from data.

Page 6

Bayesian Machine Learning

Everything follows from two simple rules:

Sum rule: P(x) = Σ_y P(x, y)

Product rule: P(x, y) = P(x) P(y|x)

Bayes rule: P(θ|D, m) = P(D|θ, m) P(θ|m) / P(D|m)

P(D|θ, m)   likelihood of parameters θ in model m
P(θ|m)      prior probability of θ
P(θ|D, m)   posterior of θ given data D

Prediction:

P(x|D, m) = ∫ P(x|θ, D, m) P(θ|D, m) dθ

Model Comparison:

P(m|D) = P(D|m) P(m) / P(D)

P(D|m) = ∫ P(D|θ, m) P(θ|m) dθ

Page 7

Model Comparison

[Figure: the same dataset fit with polynomials of order M = 0 through M = 7, one panel per order.]

Page 8

Bayesian Occam’s Razor and Model Comparison

Compare model classes, e.g. m and m′, using posterior probabilities given D:

p(m|D) = p(D|m) p(m) / p(D),   p(D|m) = ∫ p(D|θ, m) p(θ|m) dθ

Interpretations of the Marginal Likelihood (“model evidence”):

• The probability that randomly selected parameters from the prior would generate D.

• Probability of the data under the model, averaging over all possible parameter values.

• log₂(1 / p(D|m)) is the number of bits of surprise at observing data D under model m.

Model classes that are too simple are unlikely to generate the data set.

Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.

[Figure: P(D|m) plotted over all possible data sets of size n, with curves for a "too simple", a "too complex", and a "just right" model class; the observed data set D falls where the "just right" model assigns the most mass.]

Page 9

Bayesian Model Comparison: Occam’s Razor at Work

[Figure: the polynomial fits of order M = 0 through M = 7 again, together with a bar plot of the model evidence P(Y|M) for each M.]

For example, for quadratic polynomials (m = 2): y = a₀ + a₁x + a₂x² + ε, where ε ~ N(0, σ²) and parameters θ = (a₀, a₁, a₂, σ)

demo: polybayes

Page 10

Parametric vs Nonparametric Models

• Parametric models assume some finite set of parameters θ. Given the parameters, future predictions, x, are independent of the observed data, D:

P(x|θ, D) = P(x|θ)

therefore θ captures everything there is to know about the data.

• So the complexity of the model is bounded even if the amount of data is unbounded. This makes them not very flexible.

• Non-parametric models assume that the data distribution cannot be defined in terms of such a finite set of parameters. But they can often be defined by assuming an infinite-dimensional θ. Usually we think of θ as a function.

• The amount of information that θ can capture about the data D can grow as the amount of data grows. This makes them more flexible.

Page 11

Bayesian nonparametrics

A simple framework for modelling complex data.

Nonparametric models can be viewed as having infinitely many parameters

Examples of non-parametric models:

Parametric                     Non-parametric                  Application
polynomial regression          Gaussian processes              function approx.
logistic regression            Gaussian process classifiers    classification
mixture models, k-means        Dirichlet process mixtures      clustering
hidden Markov models           infinite HMMs                   time series
factor analysis / pPCA / PMF   infinite latent factor models   feature discovery
...

Page 12

Nonlinear regression and Gaussian processes

Consider the problem of nonlinear regression: you want to learn a function f with error bars from data D = {X, y}

[Figure: noisy data points, with x on the horizontal axis and y on the vertical axis]

A Gaussian process defines a distribution over functions p(f) which can be used for Bayesian regression:

p(f|D) = p(f) p(D|f) / p(D)

Let f = (f(x₁), f(x₂), ..., f(xₙ)) be an n-dimensional vector of function values evaluated at n points xᵢ ∈ X. Note, f is a random variable.

Definition: p(f) is a Gaussian process if for any finite subset {x₁, ..., xₙ} ⊂ X, the marginal distribution over that subset p(f) is multivariate Gaussian.

Page 13

A picture

[Diagram: a map of related methods, arranged along Bayesian / Kernel / Classification dimensions: Linear Regression, Logistic Regression, Bayesian Linear Regression, Bayesian Logistic Regression, Kernel Regression, Kernel Classification, GP Regression, GP Classification.]

Page 14

Gaussian process covariance functions (kernels)

p(f) is a Gaussian process if for any finite subset {x₁, ..., xₙ} ⊂ X, the marginal distribution over that finite subset p(f) has a multivariate Gaussian distribution.

Gaussian processes (GPs) are parameterized by a mean function, µ(x), and a covariance function, or kernel, K(x, x′).

p(f(x), f(x′)) = N(µ, Σ)

where

µ = [ µ(x), µ(x′) ]ᵀ    Σ = [ K(x, x)    K(x, x′)
                              K(x′, x)   K(x′, x′) ]

and similarly for p(f(x₁), ..., f(xₙ)), where now µ is an n × 1 vector and Σ is an n × n matrix.

Page 15

Gaussian process covariance functions

Gaussian processes (GPs) are parameterized by a mean function, µ(x), and a covariance function, K(x, x′), where µ(x) = E(f(x)) and K(x, x′) = Cov(f(x), f(x′)).

An example covariance function:

K(x, x′) = v₀ exp{ −(|x − x′| / r)^α } + v₁ + v₂ δᵢⱼ

with parameters (v₀, v₁, v₂, r, α).

These kernel parameters are interpretable and can be learned from data:

v₀   signal variance
v₁   variance of bias
v₂   noise variance
r    lengthscale
α    roughness

Once the mean and covariance functions are defined, everything else about GPs follows from the basic rules of probability applied to multivariate Gaussians.
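A minimal sketch of this example covariance function in code; the parameter values below are my illustrative choices, not from the talk.

```python
import math

def k(x, xp, v0=1.0, v1=0.1, v2=0.01, r=2.0, alpha=2.0):
    # v0 * exp{-(|x - x'| / r)^alpha} + v1 + v2 * delta(x, x')
    same = 1.0 if x == xp else 0.0  # Kronecker delta term
    return v0 * math.exp(-((abs(x - xp) / r) ** alpha)) + v1 + v2 * same

xs = [0.0, 1.0, 2.0, 3.0]
K = [[k(a, b) for b in xs] for a in xs]

# The matrix is symmetric, the diagonal equals v0 + v1 + v2, and
# off-diagonal entries decay as the inputs move apart.
print(K[0][0])
```

The decay of the off-diagonal entries with |x − x′| is exactly what the lengthscale r and roughness α control.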

Page 16

Samples from GPs with different K(x, x′)

[Figure: a grid of panels, each plotting sample functions f(x) drawn from a GP prior with a different covariance function, over x from 0 to 100.]

gpdemogen
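A hedged sketch of how such sample paths can be drawn (the SE kernel, grid, and jitter level are my illustrative choices): factor K = LLᵀ with a Cholesky decomposition and set f = Lz for standard normal z.

```python
import math
import random

def se_kernel(x, y, ell=10.0):
    # Squared-exponential covariance with lengthscale ell.
    return math.exp(-0.5 * ((x - y) / ell) ** 2)

def cholesky(K):
    # Lower-triangular Cholesky factor of a positive-definite matrix.
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(K[i][i] - s)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return L

random.seed(0)
xs = [float(x) for x in range(0, 100, 5)]
# Small diagonal jitter keeps the factorisation numerically stable.
K = [[se_kernel(a, b) + (1e-9 if a == b else 0.0) for b in xs] for a in xs]
L = cholesky(K)
z = [random.gauss(0.0, 1.0) for _ in xs]
f = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(xs))]
print(len(f))  # one sample path evaluated on the grid
```

Changing the kernel (or its lengthscale) and redrawing z reproduces the variety of sample paths shown on this slide.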

Page 17

Prediction using GPs with different K(x, x′)

A sample from the prior for each covariance function:

[Figure: one sample from the prior for each of three covariance functions.]

Corresponding predictions, mean with two standard deviations:

[Figure: the corresponding posterior predictions, showing the mean with a two-standard-deviation band around it.]

gpdemo
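The predictive means in these plots follow from the multivariate-Gaussian rules: mean(x*) = k*ᵀ (K + σ²I)⁻¹ y. A minimal sketch with a small hand-rolled linear solver; the SE kernel, sine data, and noise level are my illustrative choices.

```python
import math

def se(x, y, ell=1.0):
    return math.exp(-0.5 * ((x - y) / ell) ** 2)

def solve(A, b):
    # Gaussian elimination with partial pivoting on an augmented copy.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

X = [0.0, 1.0, 2.0, 3.0]
y = [math.sin(v) for v in X]
noise = 1e-6
K = [[se(X[i], X[j]) + (noise if i == j else 0.0) for j in range(len(X))]
     for i in range(len(X))]
alpha = solve(K, y)  # alpha = (K + noise*I)^-1 y

def predict_mean(xstar):
    # Posterior mean: weighted sum of kernel evaluations at the data.
    return sum(se(xstar, X[j]) * alpha[j] for j in range(len(X)))

print(predict_mean(1.0))  # close to sin(1.0)
```

The predictive variance (not shown here) comes from the same matrix algebra, K(x*, x*) − k*ᵀ(K + σ²I)⁻¹k*, and gives the shaded bands on the slide.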

Page 18

The Automatic Statistician

Page 19

WHAT WOULD AN AUTOMATIC STATISTICIAN DO?

[Flowchart: Data → Search → Model → Report. The search draws on a language of models and an evaluation procedure; the model feeds prediction, model checking, and translation into the report.]

James Robert Lloyd and Zoubin Ghahramani 3 / 44

Page 20

GOALS OF THE AUTOMATIC STATISTICIAN PROJECT

I Provide a set of tools for understanding data that require minimal expert input

I Uncover challenging research problems in e.g.
I Automated inference
I Model construction and comparison
I Data visualisation and interpretation

I Advance the field of machine learning in general

James Robert Lloyd and Zoubin Ghahramani 4 / 44

Page 21

INGREDIENTS OF AN AUTOMATIC STATISTICIAN


I An open-ended language of models
I Expressive enough to capture real-world phenomena...
I ...and the techniques used by human statisticians

I A search procedure
I To efficiently explore the language of models

I A principled method of evaluating models
I Trading off complexity and fit to data

I A procedure to automatically explain the models
I Making the assumptions of the models explicit...
I ...in a way that is intelligible to non-experts

James Robert Lloyd and Zoubin Ghahramani 5 / 44

Page 22

PREVIEW: AN ENTIRELY AUTOMATIC ANALYSIS

Raw data

[Figure: the raw data (1950-1962) and the full model posterior with extrapolations.]

Four additive components have been identified in the data

I A linearly increasing function.

I An approximately periodic function with a period of 1.0 years and with linearly increasing amplitude.

I A smooth function.

I Uncorrelated noise with linearly increasing standard deviation.

James Robert Lloyd and Zoubin Ghahramani 6 / 44

Page 23

DEFINING A LANGUAGE OF MODELS


James Robert Lloyd and Zoubin Ghahramani 7 / 44

Page 24

DEFINING A LANGUAGE OF REGRESSION MODELS

I Regression consists of learning a function f : X → Y from inputs to outputs, from example input/output pairs

I Language should include simple parametric forms...
I e.g. linear functions, polynomials, exponential functions

I . . . as well as functions specified by high level properties

I e.g. Smoothness, Periodicity

I Inference should be tractable for all models in language

James Robert Lloyd and Zoubin Ghahramani 8 / 44

Page 25

WE CAN BUILD REGRESSION MODELS WITH GAUSSIAN PROCESSES

I GPs are distributions over functions such that any finite subset of function evaluations, (f(x₁), f(x₂), ..., f(x_N)), has a joint Gaussian distribution

I A GP is completely specified by
I Mean function, µ(x) = E(f(x))
I Covariance / kernel function, k(x, x′) = Cov(f(x), f(x′))
I Denoted f ~ GP(µ, k)

[Figure: four panels plotting f(x) against x, showing the GP posterior mean and posterior uncertainty, conditioned on successively more data points.]

James Robert Lloyd and Zoubin Ghahramani 9 / 44

Page 26

A LANGUAGE OF GAUSSIAN PROCESS KERNELS

I It is common practice to use a zero mean function since the mean can be marginalised out

I Suppose f(x) | a ~ GP(a × µ(x), k(x, x′)) where a ~ N(0, 1)

I Then equivalently, f(x) ~ GP(0, µ(x)µ(x′) + k(x, x′))

I We therefore define a language of GP regression models by specifying a language of kernels

James Robert Lloyd and Zoubin Ghahramani 10 / 44

Page 27

THE ATOMS OF OUR LANGUAGE

Five base kernels

Squared exp. (SE)   Periodic (PER)   Linear (LIN)   Constant (C)   White noise (WN)

Encoding for the following types of functions

Smooth functions   Periodic functions   Linear functions   Constant functions   Gaussian noise

James Robert Lloyd and Zoubin Ghahramani 11 / 44

Page 28

THE COMPOSITION RULES OF OUR LANGUAGE

I Two main operations: addition, multiplication

LIN × LIN: quadratic functions        SE × PER: locally periodic
LIN + PER: periodic plus linear trend   SE + PER: periodic plus smooth trend

James Robert Lloyd and Zoubin Ghahramani 12 / 44
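These composition rules are easy to sketch in code: a kernel is just a function of (x, x′), and + and × act pointwise. The kernel formulas and parameter values below are standard illustrative forms, not taken verbatim from the talk.

```python
import math

def SE(ell=1.0):
    return lambda x, y: math.exp(-0.5 * ((x - y) / ell) ** 2)

def LIN():
    return lambda x, y: x * y

def PER(period=1.0, ell=1.0):
    return lambda x, y: math.exp(-2.0 * math.sin(math.pi * abs(x - y) / period) ** 2 / ell ** 2)

def add(k1, k2):
    # A sum of kernels: k(x, y) = k1(x, y) + k2(x, y)
    return lambda x, y: k1(x, y) + k2(x, y)

def mul(k1, k2):
    # A product of kernels: k(x, y) = k1(x, y) * k2(x, y)
    return lambda x, y: k1(x, y) * k2(x, y)

# LIN x LIN gives the covariance of quadratic functions;
# SE x PER gives locally periodic structure.
quadratic = mul(LIN(), LIN())
locally_periodic = mul(SE(ell=5.0), PER())
print(quadratic(2.0, 3.0))  # 36.0
```

Because sums and products of valid kernels are again valid kernels, arbitrarily nested expressions built this way remain legitimate GP covariance functions.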

Page 29

MODELING CHANGEPOINTS

Assume f₁(x) ~ GP(0, k₁) and f₂(x) ~ GP(0, k₂). Define:

f(x) = (1 − σ(x)) f₁(x) + σ(x) f₂(x)

where σ is a sigmoid function between 0 and 1.

Then f ~ GP(0, k), where

k(x, x′) = (1 − σ(x)) k₁(x, x′) (1 − σ(x′)) + σ(x) k₂(x, x′) σ(x′)

We define the changepoint operator k = CP(k1, k2).

James Robert Lloyd and Zoubin Ghahramani 13 / 44
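A minimal sketch of the CP operator; the constant component kernels and changepoint location below are my illustrative choices.

```python
import math

def sigmoid(x, x0=0.0, scale=1.0):
    return 1.0 / (1.0 + math.exp(-(x - x0) / scale))

def CP(k1, k2, x0=0.0, scale=1.0):
    # k(x, x') = (1 - sigma(x)) k1(x, x') (1 - sigma(x')) + sigma(x) k2(x, x') sigma(x')
    def k(x, y):
        a, b = sigmoid(x, x0, scale), sigmoid(y, x0, scale)
        return (1.0 - a) * k1(x, y) * (1.0 - b) + a * k2(x, y) * b
    return k

before = lambda x, y: 1.0   # variance-1 constant kernel
after = lambda x, y: 4.0    # variance-4 constant kernel
k = CP(before, after, x0=0.0)

# Far to the left of the changepoint the variance is ~1; far to the right, ~4.
print(k(-10.0, -10.0), k(10.0, 10.0))
```

The sigmoid blends the two component kernels, so behaviour switches smoothly rather than abruptly at x0.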

Page 30

AN EXPRESSIVE LANGUAGE OF MODELS

Regression model             Kernel
GP smoothing                 SE + WN
Linear regression            C + LIN + WN
Multiple kernel learning     Σ SE + WN
Trend, cyclical, irregular   Σ SE + Σ PER + WN
Fourier decomposition        C + Σ cos + WN
Sparse spectrum GPs          Σ cos + WN
Spectral mixture             Σ SE × cos + WN
Changepoints                 e.g. CP(SE, SE) + WN
Heteroscedasticity           e.g. SE + LIN × WN

Note: cos is a special case of our version of PER

James Robert Lloyd and Zoubin Ghahramani 14 / 44

Page 31

DISCOVERING A GOOD MODEL VIA SEARCH


James Robert Lloyd and Zoubin Ghahramani 15 / 44

Page 32

DISCOVERING A GOOD MODEL VIA SEARCH

I Language defined as the arbitrary composition of five base kernels (WN, C, LIN, SE, PER) via three operators (+, ×, CP).

I The space spanned by this language is open-ended and can have a high branching factor, requiring a judicious search

I We propose a greedy search for its simplicity and similarity to human model-building

James Robert Lloyd and Zoubin Ghahramani 16 / 44

Page 33

EXAMPLE: MAUNA LOA KEELING CURVE

[Figure: model fit at this stage of the search, using the RQ kernel]

Start
SE    RQ    LIN    PER
SE + RQ    ...    PER + RQ    ...    PER × RQ
SE + PER + RQ    ...    SE × (PER + RQ)
...    ...    ...

James Robert Lloyd and Zoubin Ghahramani 17 / 44

Page 34

EXAMPLE: MAUNA LOA KEELING CURVE

[Figure: model fit using kernel PER + RQ]

(Search tree repeated from the previous slide.)

James Robert Lloyd and Zoubin Ghahramani 17 / 44

Page 35

EXAMPLE: MAUNA LOA KEELING CURVE

[Figure: model fit using kernel SE × (PER + RQ)]

(Search tree repeated from the previous slide.)

James Robert Lloyd and Zoubin Ghahramani 17 / 44

Page 36

EXAMPLE: MAUNA LOA KEELING CURVE

[Figure: model fit using kernel SE + SE × (PER + RQ)]

(Search tree repeated from the previous slide.)

James Robert Lloyd and Zoubin Ghahramani 17 / 44

Page 37

MODEL EVALUATION


James Robert Lloyd and Zoubin Ghahramani 18 / 44

Page 38

MODEL EVALUATION

I After proposing a new model its kernel parameters are optimised by conjugate gradients

I We evaluate each optimised model, M, using the model evidence (marginal likelihood) which can be computed analytically for GPs

I We penalise the marginal likelihood for the optimised kernel parameters using the Bayesian Information Criterion (BIC):

−0.5 × BIC(M) = log p(D|M) − (p/2) log n

where p is the number of kernel parameters, D represents the data, and n is the number of data points.

James Robert Lloyd and Zoubin Ghahramani 19 / 44
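The penalty is simple arithmetic, so a short sketch makes the trade-off concrete (the evidence values below are made-up numbers for illustration):

```python
import math

def half_neg_bic(log_evidence, p, n):
    # -0.5 * BIC(M) = log p(D|M) - (p/2) * log n; higher is better.
    return log_evidence - 0.5 * p * math.log(n)

# A more complex kernel must raise the log evidence by at least
# (extra parameters / 2) * log n to be preferred.
simple = half_neg_bic(log_evidence=-100.0, p=2, n=100)
complex_ = half_neg_bic(log_evidence=-98.0, p=6, n=100)
print(simple > complex_)  # True
```

Here the complex kernel gains 2 nats of evidence but pays (4/2)·log 100 ≈ 9.2 nats of penalty, so the simpler model wins.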

Page 39

AUTOMATIC TRANSLATION OF MODELS


James Robert Lloyd and Zoubin Ghahramani 20 / 44

Page 40

AUTOMATIC TRANSLATION OF MODELS

I Search can produce arbitrarily complicated models from the open-ended language, but two main properties allow description to be automated

I Kernels can be decomposed into a sum of products

I A sum of kernels corresponds to a sum of functions
I Therefore, we can describe each product of kernels separately

I Each kernel in a product modifies a model in a consistent way
I Each kernel roughly corresponds to an adjective

James Robert Lloyd and Zoubin Ghahramani 21 / 44

Page 41

SUM OF PRODUCTS NORMAL FORM

Suppose the search finds the following kernel

SE × (WN × LIN + CP(C, PER))

The changepoint can be converted into a sum of products

SE × (WN × LIN + C × σ + PER × σ̄)

Multiplication can be distributed over addition

SE × WN × LIN + SE × C × σ + SE × PER × σ̄

Simplification rules are applied

WN × LIN + SE × σ + SE × PER × σ̄

James Robert Lloyd and Zoubin Ghahramani 22 / 44
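The distribution step can be sketched as a small recursive routine. The nested-tuple representation and function name are my own, and the example input is a simplified variant of the slide's kernel with the sigmoid factors already dropped.

```python
def distribute(expr):
    # expr is a base-kernel name, or ('+', ...) / ('*', ...) over sub-expressions.
    # Returns a list of products, each product a list of base-kernel names.
    if isinstance(expr, str):
        return [[expr]]                      # a single product of one kernel
    op, args = expr[0], [distribute(a) for a in expr[1:]]
    if op == '+':
        # A sum: union of the operands' products.
        return [p for terms in args for p in terms]
    # A product: cross-product of the operands' products.
    products = [[]]
    for terms in args:
        products = [p + q for p in products for q in terms]
    return products

expr = ('*', 'SE', ('+', ('*', 'WN', 'LIN'), 'PER'))
print(distribute(expr))
# [['SE', 'WN', 'LIN'], ['SE', 'PER']]
```

Simplification rules (e.g. SE × WN = WN up to a scale factor) would then be applied to each product in the resulting list.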

Page 42: Building an automatic statistician - DS 2014ds2014.ijs.si/invited/altds14.pdf · • The probability that randomly selected parameters from the prior would generate D. • Probability

SUM OF PRODUCTS NORMAL FORM

Suppose the search finds the following kernel

SE ⇥ (WN ⇥ LIN + CP(C, PER))

The changepoint can be converted into a sum of products

SE ⇥ (WN ⇥ LIN + C ⇥ � + PER ⇥ �̄)

Multiplication can be distributed over addition

SE ⇥ WN ⇥ LIN + SE ⇥ C ⇥ � + SE ⇥ PER ⇥ �̄

Simplification rules are applied

WN ⇥ LIN + SE ⇥ � + SE ⇥ PER ⇥ �̄

James Robert Lloyd and Zoubin Ghahramani 22 / 44

Page 43: Building an automatic statistician - DS 2014ds2014.ijs.si/invited/altds14.pdf · • The probability that randomly selected parameters from the prior would generate D. • Probability

SUM OF PRODUCTS NORMAL FORM

Suppose the search finds the following kernel

SE ⇥ (WN ⇥ LIN + CP(C, PER))

The changepoint can be converted into a sum of products

SE ⇥ (WN ⇥ LIN + C ⇥ � + PER ⇥ �̄)

Multiplication can be distributed over addition

SE ⇥ WN ⇥ LIN + SE ⇥ C ⇥ � + SE ⇥ PER ⇥ �̄

Simplification rules are applied

WN ⇥ LIN + SE ⇥ � + SE ⇥ PER ⇥ �̄

James Robert Lloyd and Zoubin Ghahramani 22 / 44

Page 44: Building an automatic statistician - DS 2014ds2014.ijs.si/invited/altds14.pdf · • The probability that randomly selected parameters from the prior would generate D. • Probability

SUM OF PRODUCTS NORMAL FORM

Suppose the search finds the following kernel

SE ⇥ (WN ⇥ LIN + CP(C, PER))

The changepoint can be converted into a sum of products

SE ⇥ (WN ⇥ LIN + C ⇥ � + PER ⇥ �̄)

Multiplication can be distributed over addition

SE ⇥ WN ⇥ LIN + SE ⇥ C ⇥ � + SE ⇥ PER ⇥ �̄

Simplification rules are applied

WN ⇥ LIN + SE ⇥ � + SE ⇥ PER ⇥ �̄

James Robert Lloyd and Zoubin Ghahramani 22 / 44

Page 45

SUMS OF KERNELS ARE SUMS OF FUNCTIONS

If f₁ ~ GP(0, k₁) and independently f₂ ~ GP(0, k₂) then

f₁ + f₂ ~ GP(0, k₁ + k₂)

e.g.

[Figure: two examples in which the full model posterior with extrapolations is shown as the sum of the posteriors of components 1, 2 and 3.]

We can therefore describe each component separately

James Robert Lloyd and Zoubin Ghahramani 23 / 44

Page 46

PRODUCTS OF KERNELS

PER
(“periodic function”)

On their own, each kernel is described by a standard noun phrase

James Robert Lloyd and Zoubin Ghahramani 24 / 44

Page 47

PRODUCTS OF KERNELS - SE

SE × PER
(“approximately” × “periodic function”)

Multiplication by SE removes long range correlations from a model since SE(x, x′) decreases monotonically to 0 as |x − x′| increases.

James Robert Lloyd and Zoubin Ghahramani 25 / 44

Page 48

PRODUCTS OF KERNELS - LIN

SE × PER × LIN
(“approximately” × “periodic function” × “with linearly growing amplitude”)

Multiplication by LIN is equivalent to multiplying the function being modeled by a linear function. If f(x) ~ GP(0, k), then x f(x) ~ GP(0, k × LIN). This causes the standard deviation of the model to vary linearly without affecting the correlation.

James Robert Lloyd and Zoubin Ghahramani 26 / 44

Page 49

PRODUCTS OF KERNELS - CHANGEPOINTS

SE × PER × LIN × σ
(“approximately” × “periodic function” × “with linearly growing amplitude” × “until 1700”)

Multiplication by σ is equivalent to multiplying the function being modeled by a sigmoid.

James Robert Lloyd and Zoubin Ghahramani 27 / 44

Page 50

AUTOMATICALLY GENERATED REPORTS


James Robert Lloyd and Zoubin Ghahramani 28 / 44

Page 51

EXAMPLE: AIRLINE PASSENGER VOLUME

Raw data

[Figure: the raw data (1950-1962) and the full model posterior with extrapolations.]

Four additive components have been identified in the data

I A linearly increasing function.

I An approximately periodic function with a period of 1.0 years and with linearly increasing amplitude.

I A smooth function.

I Uncorrelated noise with linearly increasing standard deviation.

James Robert Lloyd and Zoubin Ghahramani 29 / 44

Page 52

EXAMPLE: AIRLINE PASSENGER VOLUME

This component is linearly increasing.

[Figure: posterior of component 1, and the sum of components up to component 1.]

James Robert Lloyd and Zoubin Ghahramani 30 / 44

Page 53

EXAMPLE: AIRLINE PASSENGER VOLUME

This component is approximately periodic with a period of 1.0 years and varying amplitude. Across periods the shape of this function varies very smoothly. The amplitude of the function increases linearly. The shape of this function within each period has a typical lengthscale of 6.0 weeks.

[Figure: posterior of component 2; sum of components up to component 2.]


EXAMPLE: AIRLINE PASSENGER VOLUME

This component is a smooth function with a typical lengthscale of 8.1 months.

[Figure: posterior of component 3; sum of components up to component 3.]


EXAMPLE: AIRLINE PASSENGER VOLUME

This component models uncorrelated noise. The standard deviation of the noise increases linearly.

[Figure: posterior of component 4; sum of components up to component 4.]


EXAMPLE: SOLAR IRRADIANCE

This component is constant.

[Figure: posterior of component 1; sum of components up to component 1, 1650–2000.]


EXAMPLE: SOLAR IRRADIANCE

This component is constant. This component applies from 1643 until 1716.

[Figure: posterior of component 2; sum of components up to component 2.]


EXAMPLE: SOLAR IRRADIANCE

This component is a smooth function with a typical lengthscale of 23.1 years. This component applies until 1643 and from 1716 onwards.

[Figure: posterior of component 3; sum of components up to component 3.]


EXAMPLE: SOLAR IRRADIANCE

This component is approximately periodic with a period of 10.8 years. Across periods the shape of this function varies smoothly with a typical lengthscale of 36.9 years. The shape of this function within each period is very smooth and resembles a sinusoid. This component applies until 1643 and from 1716 onwards.

[Figure: posterior of component 4; sum of components up to component 4.]


OTHER EXAMPLES

See http://www.automaticstatistician.com


GOOD PREDICTIVE PERFORMANCE AS WELL

Standardised RMSE over 13 data sets:

[Figure: bar chart of standardised RMSE (roughly 1.0–3.5) for ABCD accuracy, ABCD interpretability, spectral kernels, trend/cyclical/irregular decomposition, Bayesian MKL, Eureqa, changepoints, squared exponential, and linear regression.]

• Tweaks can be made to the algorithm to improve the accuracy or interpretability of the models produced . . .
• . . . but both methods are highly competitive at extrapolation (shown above) and interpolation


MODEL CHECKING AND CRITICISM

• Good statistical modelling should include model criticism:
  • Does the data match the assumptions of the model?
  • For example, if the model assumed Gaussian noise, does a Q-Q plot reveal non-Gaussian residuals?
• Our automatic statistician does posterior predictive checks, dependence tests and residual tests
• We have also been developing more systematic nonparametric approaches to model criticism using kernel two-sample testing with MMD.

Lloyd, J. R. and Ghahramani, Z. (2014) Statistical Model Criticism using Kernel Two Sample Tests. http://mlg.eng.cam.ac.uk/Lloyd/papers/kernel-model-checking.pdf
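The kernel two-sample idea can be sketched directly in numpy: the (biased) squared MMD statistic compares two samples, for instance observed data against replicates drawn from the fitted model, and is near zero when both come from the same distribution. The RBF bandwidth below is an arbitrary illustrative choice:

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """RBF kernel matrix between 1-D samples a and b."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def mmd2(x, y, gamma=0.5):
    """Biased estimate of the squared maximum mean discrepancy."""
    return (rbf(x, x, gamma).mean()
            + rbf(y, y, gamma).mean()
            - 2.0 * rbf(x, y, gamma).mean())

rng = np.random.default_rng(0)
same = mmd2(rng.normal(size=200), rng.normal(size=200))      # small
diff = mmd2(rng.normal(size=200), rng.normal(2.0, size=200)) # clearly positive
```

In a full test the statistic would be calibrated, e.g. by a permutation procedure, to obtain a p-value rather than a raw discrepancy.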


CHALLENGES

• Interpretability / accuracy
• Increasing the expressivity of the language
  • e.g. monotonicity, positive functions, symmetries
• Computational complexity of searching through a huge space of models
• Extending the automatic reports to multidimensional datasets
  • Search and descriptions naturally extend to multiple dimensions, but automatically generating relevant visual summaries is harder
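The search problem can be made concrete with a small sketch: start from the best-scoring base kernel, then repeatedly try adding or multiplying in another base kernel, keeping a candidate only if it improves the GP log marginal likelihood. This is a simplified numpy illustration with fixed hyperparameters, not the authors' implementation (which also optimises hyperparameters at every step):

```python
import numpy as np

# Illustrative base kernels from a compositional grammar
def se(x1, x2, ls=1.0):
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / ls) ** 2)

def lin(x1, x2):
    return x1[:, None] * x2[None, :]

def per(x1, x2, p=1.0):
    d = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / p) ** 2)

BASE = {"SE": se, "LIN": lin, "PER": per}

def log_evidence(kfun, x, y, noise=0.1):
    """GP log marginal likelihood (up to a constant), used to score kernels."""
    K = kfun(x, x) + noise * np.eye(len(x))
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * y @ np.linalg.solve(K, y) - 0.5 * logdet

def greedy_search(x, y, depth=2):
    """Greedily grow a kernel expression by + and * with base kernels."""
    score = lambda nf: log_evidence(nf[1], x, y)
    best = max(BASE.items(), key=score)
    for _ in range(depth - 1):
        name, g = best
        cands = []
        for n, f in BASE.items():
            cands.append((f"({name}+{n})", lambda a, b, f=f, g=g: g(a, b) + f(a, b)))
            cands.append((f"({name}*{n})", lambda a, b, f=f, g=g: g(a, b) * f(a, b)))
        cand = max(cands, key=score)
        if score(cand) <= score(best):
            break
        best = cand
    return best
```

Even in this toy form the branching factor is visible: each expansion multiplies the number of candidates by twice the number of base kernels, which is why the search space grows so quickly.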


CURRENT AND FUTURE DIRECTIONS

• Automatic statistician for:
  * One-dimensional time series
  * Linear regression (classical)
  • Multivariate nonlinear regression (cf. Duvenaud, Lloyd et al., ICML 2013)
  • Multivariate classification (w/ Nikola Mrksic)
  • Completing and interpreting tables and databases (w/ Kee Chong Tan)
• Probabilistic programming
  • Probabilistic models are expressed in a general (Turing-complete) programming language
  • A universal inference engine can then be used to infer unobserved variables given observed data
  • This can be used to implement search over the model space in an automatic statistician
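As a toy illustration of the idea (not any particular probabilistic programming system), a model written as an ordinary generative program can be paired with a generic inference routine, here self-normalised importance sampling from the prior. The example model and all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def program():
    """A tiny probabilistic program: draw a latent slope from its prior."""
    return rng.normal(0.0, 1.0)

def log_likelihood(slope, xs, ys, noise=0.5):
    """Gaussian observation model for data generated as ys ~ slope * xs."""
    return -0.5 * np.sum(((ys - slope * xs) / noise) ** 2)

def posterior_mean(xs, ys, n=5000):
    """Generic inference engine: run the program from the prior n times,
    then weight each run by the likelihood of the observed data."""
    samples = np.array([program() for _ in range(n)])
    logw = np.array([log_likelihood(s, xs, ys) for s in samples])
    w = np.exp(logw - logw.max())  # stabilised importance weights
    return np.sum(w * samples) / np.sum(w)
```

The key point is the separation of concerns: the model is just a program, and the inference engine knows nothing about its internals beyond being able to run it and score its output.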


SUMMARY

• We have presented the beginnings of an automatic statistician
• Our system
  • Defines an open-ended language of models
  • Searches greedily through this space
  • Produces detailed reports describing patterns in data
  • Performs automatic model criticism
• Extrapolation and interpolation performance is highly competitive
• We believe this line of research has the potential to make powerful statistical model-building techniques accessible to non-experts


REFERENCES

Website: http://www.automaticstatistician.com

Duvenaud, D., Lloyd, J. R., Grosse, R., Tenenbaum, J. B. and Ghahramani, Z. (2013) Structure Discovery in Nonparametric Regression through Compositional Kernel Search. ICML 2013.

Lloyd, J. R., Duvenaud, D., Grosse, R., Tenenbaum, J. B. and Ghahramani, Z. (2014) Automatic Construction and Natural-Language Description of Nonparametric Regression Models. AAAI 2014. http://arxiv.org/pdf/1402.4304v2.pdf

Lloyd, J. R. and Ghahramani, Z. (2014) Statistical Model Criticism using Kernel Two Sample Tests. http://mlg.eng.cam.ac.uk/Lloyd/papers/kernel-model-checking.pdf

Ghahramani, Z. (2013) Bayesian Nonparametrics and the Probabilistic Approach to Modelling. Philosophical Transactions of the Royal Society A 371: 20110553.

Looking for postdocs!
