Building an Automatic Statistician, DS 2014 (ds2014.ijs.si/invited/altds14.pdf)


Page 1

Building an Automatic Statistician

Zoubin Ghahramani

Department of Engineering, University of Cambridge

[email protected]

http://learning.eng.cam.ac.uk/zoubin/

ALT-Discovery Science Conference, October 2014

James Robert Lloyd (Cambridge), David Duvenaud (Cambridge → Harvard), Roger Grosse (MIT → Toronto), Josh Tenenbaum (MIT)

Page 2

THERE IS A GROWING NEED FOR DATA ANALYSIS

I We live in an era of abundant data

I The McKinsey Global Institute claims:

“The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.”

I Diverse fields increasingly rely on expert statisticians, machine learning researchers and data scientists, e.g.

I Computational sciences (e.g. biology, astronomy, ...)
I Online advertising
I Quantitative finance
I ...

James Robert Lloyd and Zoubin Ghahramani 2 / 44

Page 3

GOALS OF THE AUTOMATIC STATISTICIAN PROJECT

I Provide a set of tools for understanding data that require minimal expert input

I Uncover challenging research problems in e.g.
I Automated inference
I Model construction and comparison
I Data visualisation and interpretation

I Advance the field of machine learning in general

James Robert Lloyd and Zoubin Ghahramani 4 / 44

Page 4

Background

• Probabilistic modelling

• Model selection and marginal likelihoods

• Bayesian nonparametrics

• Gaussian processes

Page 5

Probabilistic Modelling

• A model describes data that one could observe from a system

• If we use the mathematics of probability theory to express all forms of uncertainty and noise associated with our model...

• ...then inverse probability (i.e. Bayes rule) allows us to infer unknown quantities, adapt our models, make predictions and learn from data.

Page 6

Bayesian Machine Learning

Everything follows from two simple rules:

Sum rule: P(x) = Σ_y P(x, y)

Product rule: P(x, y) = P(x) P(y|x)

Bayes rule: P(θ|D, m) = P(D|θ, m) P(θ|m) / P(D|m)

P(D|θ, m)   likelihood of parameters θ in model m
P(θ|m)      prior probability of θ
P(θ|D, m)   posterior of θ given data D

Prediction:

P(x|D, m) = ∫ P(x|θ, D, m) P(θ|D, m) dθ

Model Comparison:

P(m|D) = P(D|m) P(m) / P(D)

P(D|m) = ∫ P(D|θ, m) P(θ|m) dθ

Page 7

Model Comparison

[Figure: the same dataset fit with polynomials of order M = 0 through M = 7, one panel per order.]

Page 8

Bayesian Occam’s Razor and Model Comparison

Compare model classes, e.g. m and m′, using posterior probabilities given D:

p(m|D) = p(D|m) p(m) / p(D),   p(D|m) = ∫ p(D|θ, m) p(θ|m) dθ

Interpretations of the Marginal Likelihood (“model evidence”):

• The probability that randomly selected parameters from the prior would generate D.

• Probability of the data under the model, averaging over all possible parameter values.

• log₂(1 / p(D|m)) is the number of bits of surprise at observing data D under model m.

Model classes that are too simple are unlikely to generate the data set.

Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.

[Figure: P(D|m) plotted over all possible data sets of size n, with curves for a "too simple", a "too complex", and a "just right" model class; the observed data set D falls where the "just right" model assigns the most mass.]

Page 9

Bayesian Model Comparison: Occam’s Razor at Work

[Figure: the polynomial fits of order M = 0 through M = 7 again, together with a bar plot of the model evidence P(Y|M) for each M.]

For example, for quadratic polynomials (m = 2): y = a₀ + a₁x + a₂x² + ε, where ε ~ N(0, σ²) and parameters θ = (a₀, a₁, a₂, σ)

demo: polybayes

Page 10

Parametric vs Nonparametric Models

• Parametric models assume some finite set of parameters θ. Given the parameters, future predictions, x, are independent of the observed data, D:

P(x|θ, D) = P(x|θ)

therefore θ captures everything there is to know about the data.

• So the complexity of the model is bounded even if the amount of data is unbounded. This makes them not very flexible.

• Non-parametric models assume that the data distribution cannot be defined in terms of such a finite set of parameters. But they can often be defined by assuming an infinite-dimensional θ. Usually we think of θ as a function.

• The amount of information that θ can capture about the data D can grow as the amount of data grows. This makes them more flexible.

Page 11

Bayesian nonparametrics

A simple framework for modelling complex data.

Nonparametric models can be viewed as having infinitely many parameters

Examples of non-parametric models:

Parametric                     Non-parametric                  Application
polynomial regression          Gaussian processes              function approx.
logistic regression            Gaussian process classifiers    classification
mixture models, k-means        Dirichlet process mixtures      clustering
hidden Markov models           infinite HMMs                   time series
factor analysis / pPCA / PMF   infinite latent factor models   feature discovery
...

Page 12

Nonlinear regression and Gaussian processes

Consider the problem of nonlinear regression: you want to learn a function f with error bars from data D = {X, y}

[Figure: noisy data points, with x on the horizontal axis and y on the vertical axis]

A Gaussian process defines a distribution over functions p(f) which can be used for Bayesian regression:

p(f|D) = p(f) p(D|f) / p(D)

Let f = (f(x₁), f(x₂), ..., f(xₙ)) be an n-dimensional vector of function values evaluated at n points xᵢ ∈ X. Note, f is a random variable.

Definition: p(f) is a Gaussian process if for any finite subset {x₁, ..., xₙ} ⊂ X, the marginal distribution over that subset p(f) is multivariate Gaussian.

Page 13

A picture

[Diagram: a map of related methods, arranged along Bayesian / Kernel / Classification dimensions: Linear Regression, Logistic Regression, Bayesian Linear Regression, Bayesian Logistic Regression, Kernel Regression, Kernel Classification, GP Regression, GP Classification.]

Page 14

Gaussian process covariance functions (kernels)

p(f) is a Gaussian process if for any finite subset {x₁, ..., xₙ} ⊂ X, the marginal distribution over that finite subset p(f) has a multivariate Gaussian distribution.

Gaussian processes (GPs) are parameterized by a mean function, µ(x), and a covariance function, or kernel, K(x, x′).

p(f(x), f(x′)) = N(µ, Σ)

where

µ = [ µ(x), µ(x′) ]ᵀ    Σ = [ K(x, x)    K(x, x′)
                              K(x′, x)   K(x′, x′) ]

and similarly for p(f(x₁), ..., f(xₙ)), where now µ is an n × 1 vector and Σ is an n × n matrix.

Page 15

Gaussian process covariance functions

Gaussian processes (GPs) are parameterized by a mean function, µ(x), and a covariance function, K(x, x′), where µ(x) = E(f(x)) and K(x, x′) = Cov(f(x), f(x′)).

An example covariance function:

K(x, x′) = v₀ exp{ −(|x − x′| / r)^α } + v₁ + v₂ δᵢⱼ

with parameters (v₀, v₁, v₂, r, α).

These kernel parameters are interpretable and can be learned from data:

v₀   signal variance
v₁   variance of bias
v₂   noise variance
r    lengthscale
α    roughness

Once the mean and covariance functions are defined, everything else about GPs follows from the basic rules of probability applied to multivariate Gaussians.
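A minimal sketch of this example covariance function in code; the parameter values below are my illustrative choices, not from the talk.

```python
import math

def k(x, xp, v0=1.0, v1=0.1, v2=0.01, r=2.0, alpha=2.0):
    # v0 * exp{-(|x - x'| / r)^alpha} + v1 + v2 * delta(x, x')
    same = 1.0 if x == xp else 0.0  # Kronecker delta term
    return v0 * math.exp(-((abs(x - xp) / r) ** alpha)) + v1 + v2 * same

xs = [0.0, 1.0, 2.0, 3.0]
K = [[k(a, b) for b in xs] for a in xs]

# The matrix is symmetric, the diagonal equals v0 + v1 + v2, and
# off-diagonal entries decay as the inputs move apart.
print(K[0][0])
```

The decay of the off-diagonal entries with |x − x′| is exactly what the lengthscale r and roughness α control.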

Page 16

Samples from GPs with different K(x, x′)

[Figure: a grid of panels, each plotting sample functions f(x) drawn from a GP prior with a different covariance function, over x from 0 to 100.]

gpdemogen
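A hedged sketch of how such sample paths can be drawn (the SE kernel, grid, and jitter level are my illustrative choices): factor K = LLᵀ with a Cholesky decomposition and set f = Lz for standard normal z.

```python
import math
import random

def se_kernel(x, y, ell=10.0):
    # Squared-exponential covariance with lengthscale ell.
    return math.exp(-0.5 * ((x - y) / ell) ** 2)

def cholesky(K):
    # Lower-triangular Cholesky factor of a positive-definite matrix.
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(K[i][i] - s)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return L

random.seed(0)
xs = [float(x) for x in range(0, 100, 5)]
# Small diagonal jitter keeps the factorisation numerically stable.
K = [[se_kernel(a, b) + (1e-9 if a == b else 0.0) for b in xs] for a in xs]
L = cholesky(K)
z = [random.gauss(0.0, 1.0) for _ in xs]
f = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(xs))]
print(len(f))  # one sample path evaluated on the grid
```

Changing the kernel (or its lengthscale) and redrawing z reproduces the variety of sample paths shown on this slide.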

Page 17

Prediction using GPs with different K(x, x′)

A sample from the prior for each covariance function:

[Figure: one sample from the prior for each of three covariance functions.]

Corresponding predictions, mean with two standard deviations:

[Figure: the corresponding posterior predictions, showing the mean with a two-standard-deviation band around it.]

gpdemo
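The predictive means in these plots follow from the multivariate-Gaussian rules: mean(x*) = k*ᵀ (K + σ²I)⁻¹ y. A minimal sketch with a small hand-rolled linear solver; the SE kernel, sine data, and noise level are my illustrative choices.

```python
import math

def se(x, y, ell=1.0):
    return math.exp(-0.5 * ((x - y) / ell) ** 2)

def solve(A, b):
    # Gaussian elimination with partial pivoting on an augmented copy.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

X = [0.0, 1.0, 2.0, 3.0]
y = [math.sin(v) for v in X]
noise = 1e-6
K = [[se(X[i], X[j]) + (noise if i == j else 0.0) for j in range(len(X))]
     for i in range(len(X))]
alpha = solve(K, y)  # alpha = (K + noise*I)^-1 y

def predict_mean(xstar):
    # Posterior mean: weighted sum of kernel evaluations at the data.
    return sum(se(xstar, X[j]) * alpha[j] for j in range(len(X)))

print(predict_mean(1.0))  # close to sin(1.0)
```

The predictive variance (not shown here) comes from the same matrix algebra, K(x*, x*) − k*ᵀ(K + σ²I)⁻¹k*, and gives the shaded bands on the slide.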

Page 18

The Automatic Statistician

Page 19

WHAT WOULD AN AUTOMATIC STATISTICIAN DO?

[Flowchart: Data → Search → Model → Report. The search draws on a language of models and an evaluation procedure; the model feeds prediction, model checking, and translation into the report.]

James Robert Lloyd and Zoubin Ghahramani 3 / 44

Page 20

GOALS OF THE AUTOMATIC STATISTICIAN PROJECT

I Provide a set of tools for understanding data that require minimal expert input

I Uncover challenging research problems in e.g.
I Automated inference
I Model construction and comparison
I Data visualisation and interpretation

I Advance the field of machine learning in general

James Robert Lloyd and Zoubin Ghahramani 4 / 44

Page 21

INGREDIENTS OF AN AUTOMATIC STATISTICIAN


I An open-ended language of models
I Expressive enough to capture real-world phenomena...
I ...and the techniques used by human statisticians

I A search procedure
I To efficiently explore the language of models

I A principled method of evaluating models
I Trading off complexity and fit to data

I A procedure to automatically explain the models
I Making the assumptions of the models explicit...
I ...in a way that is intelligible to non-experts

James Robert Lloyd and Zoubin Ghahramani 5 / 44

Page 22

PREVIEW: AN ENTIRELY AUTOMATIC ANALYSIS

Raw data

[Figure: the raw data (1950-1962) and the full model posterior with extrapolations.]

Four additive components have been identified in the data

I A linearly increasing function.

I An approximately periodic function with a period of 1.0 years and with linearly increasing amplitude.

I A smooth function.

I Uncorrelated noise with linearly increasing standard deviation.

James Robert Lloyd and Zoubin Ghahramani 6 / 44

Page 23

DEFINING A LANGUAGE OF MODELS


James Robert Lloyd and Zoubin Ghahramani 7 / 44

Page 24

DEFINING A LANGUAGE OF REGRESSION MODELS

I Regression consists of learning a function f : X → Y from inputs to outputs, from example input/output pairs

I Language should include simple parametric forms...
I e.g. linear functions, polynomials, exponential functions

I . . . as well as functions specified by high level properties

I e.g. Smoothness, Periodicity

I Inference should be tractable for all models in language

James Robert Lloyd and Zoubin Ghahramani 8 / 44

Page 25

WE CAN BUILD REGRESSION MODELS WITH GAUSSIAN PROCESSES

I GPs are distributions over functions such that any finite subset of function evaluations, (f(x₁), f(x₂), ..., f(x_N)), has a joint Gaussian distribution

I A GP is completely specified by
I Mean function, µ(x) = E(f(x))
I Covariance / kernel function, k(x, x′) = Cov(f(x), f(x′))
I Denoted f ~ GP(µ, k)

[Figure: four panels plotting f(x) against x, showing the GP posterior mean and posterior uncertainty, conditioned on successively more data points.]

James Robert Lloyd and Zoubin Ghahramani 9 / 44

Page 26

A LANGUAGE OF GAUSSIAN PROCESS KERNELS

I It is common practice to use a zero mean function since the mean can be marginalised out

I Suppose f(x) | a ~ GP(a × µ(x), k(x, x′)) where a ~ N(0, 1)

I Then equivalently, f(x) ~ GP(0, µ(x)µ(x′) + k(x, x′))

I We therefore define a language of GP regression models by specifying a language of kernels

James Robert Lloyd and Zoubin Ghahramani 10 / 44

Page 27

THE ATOMS OF OUR LANGUAGE

Five base kernels

Squared exp. (SE)   Periodic (PER)   Linear (LIN)   Constant (C)   White noise (WN)

Encoding for the following types of functions

Smooth functions   Periodic functions   Linear functions   Constant functions   Gaussian noise

James Robert Lloyd and Zoubin Ghahramani 11 / 44

Page 28

THE COMPOSITION RULES OF OUR LANGUAGE

I Two main operations: addition, multiplication

LIN × LIN: quadratic functions        SE × PER: locally periodic
LIN + PER: periodic plus linear trend   SE + PER: periodic plus smooth trend

James Robert Lloyd and Zoubin Ghahramani 12 / 44
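These composition rules are easy to sketch in code: a kernel is just a function of (x, x′), and + and × act pointwise. The kernel formulas and parameter values below are standard illustrative forms, not taken verbatim from the talk.

```python
import math

def SE(ell=1.0):
    return lambda x, y: math.exp(-0.5 * ((x - y) / ell) ** 2)

def LIN():
    return lambda x, y: x * y

def PER(period=1.0, ell=1.0):
    return lambda x, y: math.exp(-2.0 * math.sin(math.pi * abs(x - y) / period) ** 2 / ell ** 2)

def add(k1, k2):
    # A sum of kernels: k(x, y) = k1(x, y) + k2(x, y)
    return lambda x, y: k1(x, y) + k2(x, y)

def mul(k1, k2):
    # A product of kernels: k(x, y) = k1(x, y) * k2(x, y)
    return lambda x, y: k1(x, y) * k2(x, y)

# LIN x LIN gives the covariance of quadratic functions;
# SE x PER gives locally periodic structure.
quadratic = mul(LIN(), LIN())
locally_periodic = mul(SE(ell=5.0), PER())
print(quadratic(2.0, 3.0))  # 36.0
```

Because sums and products of valid kernels are again valid kernels, arbitrarily nested expressions built this way remain legitimate GP covariance functions.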

Page 29

MODELING CHANGEPOINTS

Assume f₁(x) ~ GP(0, k₁) and f₂(x) ~ GP(0, k₂). Define:

f(x) = (1 − σ(x)) f₁(x) + σ(x) f₂(x)

where σ is a sigmoid function between 0 and 1.

Then f ~ GP(0, k), where

k(x, x′) = (1 − σ(x)) k₁(x, x′) (1 − σ(x′)) + σ(x) k₂(x, x′) σ(x′)

We define the changepoint operator k = CP(k1, k2).

James Robert Lloyd and Zoubin Ghahramani 13 / 44
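A minimal sketch of the CP operator; the constant component kernels and changepoint location below are my illustrative choices.

```python
import math

def sigmoid(x, x0=0.0, scale=1.0):
    return 1.0 / (1.0 + math.exp(-(x - x0) / scale))

def CP(k1, k2, x0=0.0, scale=1.0):
    # k(x, x') = (1 - sigma(x)) k1(x, x') (1 - sigma(x')) + sigma(x) k2(x, x') sigma(x')
    def k(x, y):
        a, b = sigmoid(x, x0, scale), sigmoid(y, x0, scale)
        return (1.0 - a) * k1(x, y) * (1.0 - b) + a * k2(x, y) * b
    return k

before = lambda x, y: 1.0   # variance-1 constant kernel
after = lambda x, y: 4.0    # variance-4 constant kernel
k = CP(before, after, x0=0.0)

# Far to the left of the changepoint the variance is ~1; far to the right, ~4.
print(k(-10.0, -10.0), k(10.0, 10.0))
```

The sigmoid blends the two component kernels, so behaviour switches smoothly rather than abruptly at x0.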

Page 30

AN EXPRESSIVE LANGUAGE OF MODELS

Regression model             Kernel
GP smoothing                 SE + WN
Linear regression            C + LIN + WN
Multiple kernel learning     Σ SE + WN
Trend, cyclical, irregular   Σ SE + Σ PER + WN
Fourier decomposition        C + Σ cos + WN
Sparse spectrum GPs          Σ cos + WN
Spectral mixture             Σ SE × cos + WN
Changepoints                 e.g. CP(SE, SE) + WN
Heteroscedasticity           e.g. SE + LIN × WN

Note: cos is a special case of our version of PER

James Robert Lloyd and Zoubin Ghahramani 14 / 44

Page 31

DISCOVERING A GOOD MODEL VIA SEARCH


James Robert Lloyd and Zoubin Ghahramani 15 / 44

Page 32

DISCOVERING A GOOD MODEL VIA SEARCH

I Language defined as the arbitrary composition of five base kernels (WN, C, LIN, SE, PER) via three operators (+, ×, CP).

I The space spanned by this language is open-ended and can have a high branching factor, requiring a judicious search

I We propose a greedy search for its simplicity and similarity to human model-building

James Robert Lloyd and Zoubin Ghahramani 16 / 44

Page 33

EXAMPLE: MAUNA LOA KEELING CURVE

[Figure: model fit at this stage of the search, using the RQ kernel]

Start
SE    RQ    LIN    PER
SE + RQ    ...    PER + RQ    ...    PER × RQ
SE + PER + RQ    ...    SE × (PER + RQ)
...    ...    ...

James Robert Lloyd and Zoubin Ghahramani 17 / 44

Page 34

EXAMPLE: MAUNA LOA KEELING CURVE

[Figure: model fit using kernel PER + RQ]

(Search tree repeated from the previous slide.)

James Robert Lloyd and Zoubin Ghahramani 17 / 44

Page 35

EXAMPLE: MAUNA LOA KEELING CURVE

[Figure: model fit using kernel SE × (PER + RQ)]

(Search tree repeated from the previous slide.)

James Robert Lloyd and Zoubin Ghahramani 17 / 44

Page 36

EXAMPLE: MAUNA LOA KEELING CURVE

[Figure: model fit using kernel SE + SE × (PER + RQ)]

(Search tree repeated from the previous slide.)

James Robert Lloyd and Zoubin Ghahramani 17 / 44

Page 37

MODEL EVALUATION


James Robert Lloyd and Zoubin Ghahramani 18 / 44

Page 38

MODEL EVALUATION

I After proposing a new model its kernel parameters are optimised by conjugate gradients

I We evaluate each optimised model, M, using the model evidence (marginal likelihood) which can be computed analytically for GPs

I We penalise the marginal likelihood for the optimised kernel parameters using the Bayesian Information Criterion (BIC):

−0.5 × BIC(M) = log p(D|M) − (p/2) log n

where p is the number of kernel parameters, D represents the data, and n is the number of data points.

James Robert Lloyd and Zoubin Ghahramani 19 / 44
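The penalty is simple arithmetic, so a short sketch makes the trade-off concrete (the evidence values below are made-up numbers for illustration):

```python
import math

def half_neg_bic(log_evidence, p, n):
    # -0.5 * BIC(M) = log p(D|M) - (p/2) * log n; higher is better.
    return log_evidence - 0.5 * p * math.log(n)

# A more complex kernel must raise the log evidence by at least
# (extra parameters / 2) * log n to be preferred.
simple = half_neg_bic(log_evidence=-100.0, p=2, n=100)
complex_ = half_neg_bic(log_evidence=-98.0, p=6, n=100)
print(simple > complex_)  # True
```

Here the complex kernel gains 2 nats of evidence but pays (4/2)·log 100 ≈ 9.2 nats of penalty, so the simpler model wins.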

Page 39

AUTOMATIC TRANSLATION OF MODELS


James Robert Lloyd and Zoubin Ghahramani 20 / 44

Page 40

AUTOMATIC TRANSLATION OF MODELS

I Search can produce arbitrarily complicated models from the open-ended language, but two main properties allow description to be automated

I Kernels can be decomposed into a sum of products

I A sum of kernels corresponds to a sum of functions
I Therefore, we can describe each product of kernels separately

I Each kernel in a product modifies a model in a consistent way
I Each kernel roughly corresponds to an adjective

James Robert Lloyd and Zoubin Ghahramani 21 / 44

Page 41

SUM OF PRODUCTS NORMAL FORM

Suppose the search finds the following kernel

SE × (WN × LIN + CP(C, PER))

The changepoint can be converted into a sum of products

SE × (WN × LIN + C × σ + PER × σ̄)

Multiplication can be distributed over addition

SE × WN × LIN + SE × C × σ + SE × PER × σ̄

Simplification rules are applied

WN × LIN + SE × σ + SE × PER × σ̄

James Robert Lloyd and Zoubin Ghahramani 22 / 44
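The distribution step can be sketched as a small recursive routine. The nested-tuple representation and function name are my own, and the example input is a simplified variant of the slide's kernel with the sigmoid factors already dropped.

```python
def distribute(expr):
    # expr is a base-kernel name, or ('+', ...) / ('*', ...) over sub-expressions.
    # Returns a list of products, each product a list of base-kernel names.
    if isinstance(expr, str):
        return [[expr]]                      # a single product of one kernel
    op, args = expr[0], [distribute(a) for a in expr[1:]]
    if op == '+':
        # A sum: union of the operands' products.
        return [p for terms in args for p in terms]
    # A product: cross-product of the operands' products.
    products = [[]]
    for terms in args:
        products = [p + q for p in products for q in terms]
    return products

expr = ('*', 'SE', ('+', ('*', 'WN', 'LIN'), 'PER'))
print(distribute(expr))
# [['SE', 'WN', 'LIN'], ['SE', 'PER']]
```

Simplification rules (e.g. SE × WN = WN up to a scale factor) would then be applied to each product in the resulting list.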

Page 42: Building an automatic statistician - DS 2014ds2014.ijs.si/invited/altds14.pdf · • The probability that randomly selected parameters from the prior would generate D. • Probability

SUM OF PRODUCTS NORMAL FORM

Suppose the search finds the following kernel

SE ⇥ (WN ⇥ LIN + CP(C, PER))

The changepoint can be converted into a sum of products

SE ⇥ (WN ⇥ LIN + C ⇥ � + PER ⇥ �̄)

Multiplication can be distributed over addition

SE ⇥ WN ⇥ LIN + SE ⇥ C ⇥ � + SE ⇥ PER ⇥ �̄

Simplification rules are applied

WN ⇥ LIN + SE ⇥ � + SE ⇥ PER ⇥ �̄

James Robert Lloyd and Zoubin Ghahramani 22 / 44

Page 43: Building an automatic statistician - DS 2014ds2014.ijs.si/invited/altds14.pdf · • The probability that randomly selected parameters from the prior would generate D. • Probability

SUM OF PRODUCTS NORMAL FORM

Suppose the search finds the following kernel

SE ⇥ (WN ⇥ LIN + CP(C, PER))

The changepoint can be converted into a sum of products

SE ⇥ (WN ⇥ LIN + C ⇥ � + PER ⇥ �̄)

Multiplication can be distributed over addition

SE ⇥ WN ⇥ LIN + SE ⇥ C ⇥ � + SE ⇥ PER ⇥ �̄

Simplification rules are applied

WN ⇥ LIN + SE ⇥ � + SE ⇥ PER ⇥ �̄

James Robert Lloyd and Zoubin Ghahramani 22 / 44

Page 44: Building an automatic statistician - DS 2014ds2014.ijs.si/invited/altds14.pdf · • The probability that randomly selected parameters from the prior would generate D. • Probability

SUM OF PRODUCTS NORMAL FORM

Suppose the search finds the following kernel

SE ⇥ (WN ⇥ LIN + CP(C, PER))

The changepoint can be converted into a sum of products

SE ⇥ (WN ⇥ LIN + C ⇥ � + PER ⇥ �̄)

Multiplication can be distributed over addition

SE ⇥ WN ⇥ LIN + SE ⇥ C ⇥ � + SE ⇥ PER ⇥ �̄

Simplification rules are applied

WN ⇥ LIN + SE ⇥ � + SE ⇥ PER ⇥ �̄

James Robert Lloyd and Zoubin Ghahramani 22 / 44

Page 45

SUMS OF KERNELS ARE SUMS OF FUNCTIONS

If f₁ ~ GP(0, k₁) and independently f₂ ~ GP(0, k₂) then

f₁ + f₂ ~ GP(0, k₁ + k₂)

e.g.

[Figure: two examples in which the full model posterior with extrapolations is shown as the sum of the posteriors of components 1, 2 and 3.]

We can therefore describe each component separately

James Robert Lloyd and Zoubin Ghahramani 23 / 44

Page 46

PRODUCTS OF KERNELS

PER
(“periodic function”)

On their own, each kernel is described by a standard noun phrase

James Robert Lloyd and Zoubin Ghahramani 24 / 44

Page 47

PRODUCTS OF KERNELS - SE

SE × PER
(“approximately” × “periodic function”)

Multiplication by SE removes long range correlations from a model since SE(x, x′) decreases monotonically to 0 as |x − x′| increases.

James Robert Lloyd and Zoubin Ghahramani 25 / 44

Page 48

PRODUCTS OF KERNELS - LIN

SE × PER × LIN
(“approximately” × “periodic function” × “with linearly growing amplitude”)

Multiplication by LIN is equivalent to multiplying the function being modeled by a linear function. If f(x) ~ GP(0, k), then x f(x) ~ GP(0, k × LIN). This causes the standard deviation of the model to vary linearly without affecting the correlation.

James Robert Lloyd and Zoubin Ghahramani 26 / 44

Page 49

PRODUCTS OF KERNELS - CHANGEPOINTS

SE × PER × LIN × σ
(“approximately” × “periodic function” × “with linearly growing amplitude” × “until 1700”)

Multiplication by σ is equivalent to multiplying the function being modeled by a sigmoid.

James Robert Lloyd and Zoubin Ghahramani 27 / 44

Page 50

AUTOMATICALLY GENERATED REPORTS


James Robert Lloyd and Zoubin Ghahramani 28 / 44

Page 51

EXAMPLE: AIRLINE PASSENGER VOLUME

Raw data

[Figure: the raw data (1950-1962) and the full model posterior with extrapolations.]

Four additive components have been identified in the data

I A linearly increasing function.

I An approximately periodic function with a period of 1.0 years and with linearly increasing amplitude.

I A smooth function.

I Uncorrelated noise with linearly increasing standard deviation.

James Robert Lloyd and Zoubin Ghahramani 29 / 44

Page 52

EXAMPLE: AIRLINE PASSENGER VOLUME

This component is linearly increasing.

[Figure: posterior of component 1, and the sum of components up to component 1.]

James Robert Lloyd and Zoubin Ghahramani 30 / 44

Page 53

EXAMPLE: AIRLINE PASSENGER VOLUME

This component is approximately periodic with a period of 1.0 years and varying amplitude. Across periods the shape of this function varies very smoothly. The amplitude of the function increases linearly. The shape of this function within each period has a typical lengthscale of 6.0 weeks.

[Figure: posterior of component 2; sum of components up to component 2.]


EXAMPLE: AIRLINE PASSENGER VOLUME

This component is a smooth function with a typical lengthscale of 8.1 months.

[Figure: posterior of component 3; sum of components up to component 3.]


EXAMPLE: AIRLINE PASSENGER VOLUME

This component models uncorrelated noise. The standard deviation of the noise increases linearly.

[Figure: posterior of component 4; sum of components up to component 4.]


EXAMPLE: SOLAR IRRADIANCE

This component is constant.

[Figure: posterior of component 1; sum of components up to component 1, 1650–2000.]


EXAMPLE: SOLAR IRRADIANCE

This component is constant. This component applies from 1643 until 1716.

[Figure: posterior of component 2; sum of components up to component 2.]


EXAMPLE: SOLAR IRRADIANCE

This component is a smooth function with a typical lengthscale of 23.1 years. This component applies until 1643 and from 1716 onwards.

[Figure: posterior of component 3; sum of components up to component 3.]


EXAMPLE: SOLAR IRRADIANCE

This component is approximately periodic with a period of 10.8 years. Across periods the shape of this function varies smoothly with a typical lengthscale of 36.9 years. The shape of this function within each period is very smooth and resembles a sinusoid. This component applies until 1643 and from 1716 onwards.

[Figure: posterior of component 4; sum of components up to component 4.]


OTHER EXAMPLES

See http://www.automaticstatistician.com


GOOD PREDICTIVE PERFORMANCE AS WELL

Standardised RMSE over 13 data sets:

[Figure: bar chart of standardised RMSE (roughly 1.0–3.5) for ABCD accuracy, ABCD interpretability, spectral kernels, trend/cyclical/irregular decomposition, Bayesian MKL, Eureqa, changepoints, squared exponential, and linear regression.]

• Tweaks can be made to the algorithm to improve the accuracy or interpretability of the models produced . . .
• . . . but both methods are highly competitive at extrapolation (shown above) and interpolation


MODEL CHECKING AND CRITICISM

• Good statistical modelling should include model criticism:
  • Does the data match the assumptions of the model?
  • For example, if the model assumed Gaussian noise, does a Q-Q plot reveal non-Gaussian residuals?
• Our automatic statistician does posterior predictive checks, dependence tests and residual tests
• We have also been developing more systematic nonparametric approaches to model criticism using kernel two-sample testing with MMD.

Lloyd, J. R. and Ghahramani, Z. (2014) Statistical Model Criticism using Kernel Two Sample Tests. http://mlg.eng.cam.ac.uk/Lloyd/papers/kernel-model-checking.pdf
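The kernel two-sample idea can be sketched directly in numpy: the (biased) squared MMD statistic compares two samples, for instance observed data against replicates drawn from the fitted model, and is near zero when both come from the same distribution. The RBF bandwidth below is an arbitrary illustrative choice:

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """RBF kernel matrix between 1-D samples a and b."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def mmd2(x, y, gamma=0.5):
    """Biased estimate of the squared maximum mean discrepancy."""
    return (rbf(x, x, gamma).mean()
            + rbf(y, y, gamma).mean()
            - 2.0 * rbf(x, y, gamma).mean())

rng = np.random.default_rng(0)
same = mmd2(rng.normal(size=200), rng.normal(size=200))      # small
diff = mmd2(rng.normal(size=200), rng.normal(2.0, size=200)) # clearly positive
```

In a full test the statistic would be calibrated, e.g. by a permutation procedure, to obtain a p-value rather than a raw discrepancy.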


CHALLENGES

• Interpretability / accuracy
• Increasing the expressivity of the language
  • e.g. monotonicity, positive functions, symmetries
• Computational complexity of searching through a huge space of models
• Extending the automatic reports to multidimensional datasets
  • Search and descriptions naturally extend to multiple dimensions, but automatically generating relevant visual summaries is harder
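The search problem can be made concrete with a small sketch: start from the best-scoring base kernel, then repeatedly try adding or multiplying in another base kernel, keeping a candidate only if it improves the GP log marginal likelihood. This is a simplified numpy illustration with fixed hyperparameters, not the authors' implementation (which also optimises hyperparameters at every step):

```python
import numpy as np

# Illustrative base kernels from a compositional grammar
def se(x1, x2, ls=1.0):
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / ls) ** 2)

def lin(x1, x2):
    return x1[:, None] * x2[None, :]

def per(x1, x2, p=1.0):
    d = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / p) ** 2)

BASE = {"SE": se, "LIN": lin, "PER": per}

def log_evidence(kfun, x, y, noise=0.1):
    """GP log marginal likelihood (up to a constant), used to score kernels."""
    K = kfun(x, x) + noise * np.eye(len(x))
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * y @ np.linalg.solve(K, y) - 0.5 * logdet

def greedy_search(x, y, depth=2):
    """Greedily grow a kernel expression by + and * with base kernels."""
    score = lambda nf: log_evidence(nf[1], x, y)
    best = max(BASE.items(), key=score)
    for _ in range(depth - 1):
        name, g = best
        cands = []
        for n, f in BASE.items():
            cands.append((f"({name}+{n})", lambda a, b, f=f, g=g: g(a, b) + f(a, b)))
            cands.append((f"({name}*{n})", lambda a, b, f=f, g=g: g(a, b) * f(a, b)))
        cand = max(cands, key=score)
        if score(cand) <= score(best):
            break
        best = cand
    return best
```

Even in this toy form the branching factor is visible: each expansion multiplies the number of candidates by twice the number of base kernels, which is why the search space grows so quickly.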


CURRENT AND FUTURE DIRECTIONS

• Automatic statistician for:
  * One-dimensional time series
  * Linear regression (classical)
  • Multivariate nonlinear regression (cf. Duvenaud, Lloyd et al., ICML 2013)
  • Multivariate classification (w/ Nikola Mrksic)
  • Completing and interpreting tables and databases (w/ Kee Chong Tan)
• Probabilistic programming
  • Probabilistic models are expressed in a general (Turing-complete) programming language
  • A universal inference engine can then be used to infer unobserved variables given observed data
  • This can be used to implement search over the model space in an automatic statistician
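As a toy illustration of the idea (not any particular probabilistic programming system), a model written as an ordinary generative program can be paired with a generic inference routine, here self-normalised importance sampling from the prior. The example model and all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def program():
    """A tiny probabilistic program: draw a latent slope from its prior."""
    return rng.normal(0.0, 1.0)

def log_likelihood(slope, xs, ys, noise=0.5):
    """Gaussian observation model for data generated as ys ~ slope * xs."""
    return -0.5 * np.sum(((ys - slope * xs) / noise) ** 2)

def posterior_mean(xs, ys, n=5000):
    """Generic inference engine: run the program from the prior n times,
    then weight each run by the likelihood of the observed data."""
    samples = np.array([program() for _ in range(n)])
    logw = np.array([log_likelihood(s, xs, ys) for s in samples])
    w = np.exp(logw - logw.max())  # stabilised importance weights
    return np.sum(w * samples) / np.sum(w)
```

The key point is the separation of concerns: the model is just a program, and the inference engine knows nothing about its internals beyond being able to run it and score its output.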


SUMMARY

• We have presented the beginnings of an automatic statistician
• Our system
  • Defines an open-ended language of models
  • Searches greedily through this space
  • Produces detailed reports describing patterns in data
  • Performs automatic model criticism
• Extrapolation and interpolation performance is highly competitive
• We believe this line of research has the potential to make powerful statistical model-building techniques accessible to non-experts


REFERENCES

Website: http://www.automaticstatistician.com

Duvenaud, D., Lloyd, J. R., Grosse, R., Tenenbaum, J. B. and Ghahramani, Z. (2013) Structure Discovery in Nonparametric Regression through Compositional Kernel Search. ICML 2013.

Lloyd, J. R., Duvenaud, D., Grosse, R., Tenenbaum, J. B. and Ghahramani, Z. (2014) Automatic Construction and Natural-Language Description of Nonparametric Regression Models. AAAI 2014. http://arxiv.org/pdf/1402.4304v2.pdf

Lloyd, J. R. and Ghahramani, Z. (2014) Statistical Model Criticism using Kernel Two Sample Tests. http://mlg.eng.cam.ac.uk/Lloyd/papers/kernel-model-checking.pdf

Ghahramani, Z. (2013) Bayesian Nonparametrics and the Probabilistic Approach to Modelling. Philosophical Transactions of the Royal Society A 371: 20110553.

Looking for postdocs!
