When Training and Test Distributions are Different: Characterising Learning Transfer
Amos Storkey, School of Informatics


Page 1: when training and test distributions are different: characterising learning transfer

Amos Storkey, School of Informatics

Page 2: acknowledgements

Joint work with Masashi Sugiyama, Jon Clayden and Mark Bastin

Page 3: characterising learning transfer

Learning transfer:
• Covers many current cases of dataset shift
• Will benefit from an inclusive framework that characterises the general problem
• Can be formalised
• Is practical

Page 4: dataset shift

[Diagram: predictive and generative model structures; the training distribution is given, the test distribution is the unknown.]

Page 5: real life

Characterising the change:
• Simple covariate shift
• Prior probability shift
• Sample selection bias
• Imbalanced data
• Domain shift
• Source component shift

Focus on the prediction problem: given X, predict Y.

Page 6: simple covariate shift

Learnt conditional predictive model

Change: the distribution of X changes; P(Y|X) does not.

Modelling implication: none (given a suitable modelling class).

[Graphical model: X → Y. Plot: y against x.]

Page 7: no modelling implication?

[Plot: y against x.]
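The question mark is warranted: with a misspecified model class, covariate shift does carry a modelling implication, because the fit concentrates on the training region of X. A standard remedy, studied in depth by Sugiyama and colleagues, is importance-weighted training: weight each training point by p_test(x)/p_train(x) before fitting. A minimal sketch of that idea, assuming Gaussian density estimates for the weights and a deliberately misspecified linear model (illustrative choices, not code from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Covariate shift: p(x) differs between training and test; p(y|x) is fixed.
x_train = rng.normal(-1.0, 1.0, 500)
x_test = rng.normal(1.5, 0.5, 500)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=500)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Importance weights w(x) = p_test(x) / p_train(x), from fitted Gaussians.
w = (gauss_pdf(x_train, x_test.mean(), x_test.std())
     / gauss_pdf(x_train, x_train.mean(), x_train.std()))

# Weighted least squares for the misspecified model y ~ a*x + b: scaling
# rows by sqrt(w) makes lstsq minimise sum_i w_i * residual_i**2.
X = np.column_stack([x_train, np.ones_like(x_train)])
sw = np.sqrt(w)
a, b = np.linalg.lstsq(X * sw[:, None], y_train * sw, rcond=None)[0]
```

An unweighted fit would match sin(x) around the training mass at x = -1; the weights pull the line toward the test region near x = 1.5, where it will actually be used.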

Page 8: prior probability shift

Learnt generative model

Change: the distribution of Y changes; P(X|Y) does not.

Modelling implication: use a different P(Y) in Bayes' rule.

[Graphical model: Y → X. Plots: x2 against x1, and y against x.]
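Using a different P(Y) in Bayes' rule needs nothing beyond a rescaling: P(y|x) is proportional to P(x|y)P(y), and P(x|y) is unchanged, so posteriors learnt under the old prior can be divided by that prior, multiplied by the new one, and renormalised. A minimal sketch (the function name and numbers are illustrative, not from the talk):

```python
import numpy as np

def shift_prior(posteriors, prior_old, prior_new):
    """Rescale class posteriors P(y|x) for a new class prior.

    Since P(y|x) is proportional to P(x|y) * P(y) and P(x|y) does not
    change, dividing out the old prior and multiplying in the new one
    recovers the posteriors under the new prior.
    """
    scaled = posteriors * (np.asarray(prior_new) / np.asarray(prior_old))
    return scaled / scaled.sum(axis=1, keepdims=True)

# A model trained where classes were 50/50, deployed where class 1 is rare:
p = np.array([[0.7, 0.3], [0.4, 0.6]])
print(shift_prior(p, prior_old=[0.5, 0.5], prior_new=[0.9, 0.1]))
```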

Page 9: sample selection bias

Learnt conditional predictive model

Change: a sample selection rule V determines which samples occur in the data.

Modelling implication: sample selection estimation.

[Graphical models: X → Y with selection variable V; in the second diagram, labelled "= covariate shift", V depends on X alone. Plot: y against x.]
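One concrete route to "sample selection estimation": if the selection probability P(V=1 | x, y) is known or can be estimated, weighting each retained sample by its inverse recovers population expectations from the biased sample. A minimal sketch with a known selection rule that depends on y, so the bias is real and the correction visible (an illustration, not the method from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)

# Population relationship; selection keeps a point with probability s(y),
# so the observed sample over-represents small y.
x = rng.normal(0.0, 1.0, 5000)
y = 2.0 * x + rng.normal(0.0, 0.5, 5000)
s = 1.0 / (1.0 + np.exp(y))
kept = rng.uniform(size=5000) < s

# Inverse-probability-of-selection weights (s is known here; in practice
# it would itself have to be estimated).
w = 1.0 / s[kept]

# Weighted least squares undoes the bias a plain fit would inherit.
X = np.column_stack([x[kept], np.ones(kept.sum())])
sw = np.sqrt(w)
coef, *_ = np.linalg.lstsq(X * sw[:, None], y[kept] * sw, rcond=None)
print(coef[0])  # close to the population slope of 2.0
```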

Page 10: imbalanced data

Learn a conditional classification model on balanced data.

Change: in the training data, V rejects many samples from the common class; testing is on the full imbalanced data (a special case of sample selection bias).

Modelling implication: adapt the classification probability thresholds to account for the change.

[Graphical model: X → Y with selection variable V. Plot: X2 against X1.]
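The threshold adjustment follows from the prior probability shift correction on page 8: a classifier trained on balanced data reports P(y=1|x) under an implicit 50/50 prior, and reweighting by the true priors turns "predict class 1 when p > 0.5" into "predict class 1 when p > pi_0", where pi_0 is the test-time frequency of the majority class. A minimal sketch (illustrative, not from the talk):

```python
import numpy as np

def rebalanced_decision(p_balanced, prior_majority):
    """Decide class 1 from a balanced-trained classifier's P(y=1|x).

    Reweighting the balanced posterior by the true priors pi_1 and
    pi_0 = prior_majority gives:
        predict 1  iff  p * pi_1 > (1 - p) * pi_0  iff  p > pi_0,
    so the usual 0.5 threshold moves to the majority-class prior.
    """
    return np.asarray(p_balanced) > prior_majority

# A score of 0.9 from the balanced model is no longer convincing when
# class 1 occurs only 5% of the time (pi_0 = 0.95):
print(rebalanced_decision([0.9, 0.97], prior_majority=0.95))  # [False  True]
```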

Page 11: domain shift

Learn a conditional classification model on balanced data.

Change: dynamic X: X_new = f(X_old), and Y(X_new) = Y(f(X_old)).

Modelling implication: need to learn the functional map f.

[Graphical model: X → Y, with F mapping X_o to X.]

Page 12: source component shift

Various sources for the data.

Change: the proportions of the different source components vary between datasets; the within-source conditional models are the same.

Modelling implication: estimate the sources and the proportion changes; learn a mixture-of-experts model.

[Graphical model: X and Y with source indicator R. Plot: y against x.]

Page 13: sample selection v source component

Sample selection bias as source component shift: let R index rejection-equiprobable regions. P(X,Y|R) gives the distributions for those regions, consistent for both training and test; P(R) varies to account for rejection in training.

[Graphical models: X → Y with selection variable V, redrawn as X and Y with source indicator R.]

Page 14: modelling source component shift

[Diagram: conditional models P1(y|x) and P2(y|x) over input components P11(x), P12(x), P13(x) and P21(x), P22(x), P23(x), with per-point indicators D1i, D2i and T1i.]

Page 15: EM for source component shift

Effectively a Gaussian mixture model with shared components, and different priors.

Can use the EM algorithm:
• Compute responsibilities for components
• Learn parameters of the Gaussians
• Learn parameters of the regressors
All subject to constraints on which data points can be generated from which model (see the sketch below).
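A minimal sketch of one iteration of that EM loop: shared 1D Gaussian input components, per-dataset mixing proportions, and a linear regressor per component standing in for the experts. It omits the generation constraints just mentioned; an illustration of the structure, not the implementation behind the experiments below.

```python
import numpy as np

def em_step(x, y, dataset, mu, var, pis, coef):
    """One EM step: K shared Gaussian components over x, mixing
    proportions pis[d] per dataset d, and a linear regressor
    y ~ a*x + b per component (noise variance fixed at 1)."""
    K = len(mu)
    # E-step: responsibilities from the input density, the regression
    # fit, and the mixing proportions of each point's own dataset.
    log_r = np.zeros((len(x), K))
    for k in range(K):
        a, b = coef[k]
        log_r[:, k] = (np.log(pis[dataset, k])
                       - 0.5 * (x - mu[k]) ** 2 / var[k]
                       - 0.5 * np.log(var[k])
                       - 0.5 * (y - (a * x + b)) ** 2)
    r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: shared Gaussians and regressors, per-dataset proportions.
    X = np.column_stack([x, np.ones_like(x)])
    for k in range(K):
        w = r[:, k]
        mu[k] = np.average(x, weights=w)
        var[k] = np.average((x - mu[k]) ** 2, weights=w) + 1e-9
        sw = np.sqrt(w)
        coef[k] = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    for d in range(pis.shape[0]):
        in_d = dataset == d
        pis[d] = r[in_d].sum(axis=0) / in_d.sum()
    return mu, var, pis, coef
```

Run to convergence, the per-dataset proportions pis absorb the source component shift while the components and regressors stay shared across datasets.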

Page 16: [figure only]

Page 17: tests

1D linear, sample from prior form, BIC model selection, 100 tests.

Page 18: tests

4D nonlinear, auto-mpg data, Gaussian process regressors, BIC.

Trained on cars of one origin; tested on the two other origins.

Page 19: issues

Single training dataset

No targets for the new domain. Semi-supervised: a few target values might help to distinguish between different potential shift models.

Dataset shift → Transfer Learning

Page 20: from here...

Transference: dealing with the more general problem of multiple datasets, multiple domains
• Topic modelling and multilevel topic modelling
• What is a domain or dataset anyway? Structured data.
• More general than regression: varying fields, missing data, semi-supervised learning.
• Characterising the general case.

Mixtures and mixing
Dataset production
Non-parametric methods and local minima reduction

Page 21: interim

Transference is really structure modelling. Dataset shift implies unsupervised learning! Using conditional models implies a particular full generative model under dataset shift scenarios.

But in unsupervised learning people have been dealing with dataset shift for a long time… by modelling for it.

E.g. intra- versus inter-subject variability. In real life, modelling for the variability is the most common approach. Never simple.

Page 22: Diffusion Tensor Imaging

An MRI technique looking at the anisotropy of water diffusion in the brain.

Page 23: the white matter

Page 24: diffusion tensor

The diffusion of water at each voxel is commonly modelled as a three-dimensional second-order tensor, D.

Think of it as an ellipsoid with some principal direction.

Page 25: The problem

“White matter integrity” matters in studies of ageing.

But to study white matter integrity, we have to compare across subjects, and within subjects.

But subjects' brains are different anyway.

Need to account for shifts between brains in mapping results.

Use diffusion tensor imaging. Currently: use fractional anisotropy (FA).

Page 26: Tractography

Would like to combine local direction components into consistent “tracts”.

But the measurements are noisy…

Set up a Markov Random Field

Page 27: Behrens et al.

Then sample streamlines from the random field. Can either work with streamline samples, or compute marginals: P(tract goes through X | same tract goes through SEED).

Page 28: Seed points as hypotheses

Single seed point is more specific than a seeding region

But tract reconstruction is highly sensitive to seed placement

Neighbourhood tractography (NT) treats a group of “candidate” seed points as hypotheses

Uses tract shape and length to find best resulting match to a reference tract

Clayden et al., NeuroImage, 2006

Page 29: Bayesian model comparison

Given some reference tract from one brain.

Is this tract in a second brain the same tract as the reference tract?

Compare P(tract) with P(tract|reference tract)

But: want consistency! The reference tract is just any other tract. Need a model with

P(tract) = ∫ P(tract | reftract) P(reftract) d(reftract)

Page 30: Model choice

Model Comparison or Model choice?

In fact we have a number of candidate matches.

Presume at most one is right. Could be that none match.

Compute P(this is the right match); a sketch of the computation follows.
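One way to make that concrete: with candidate tracts t_1, ..., t_N, at most one of which is the match, hypothesis i differs from the "none match" hypothesis only in candidate i's term, so its likelihood ratio is the Bayes factor P(t_i | reference) / P(t_i). With a uniform prior over the N+1 hypotheses, the posterior just normalises these ratios. A minimal sketch (my framing of the computation, not code from the talk):

```python
import numpy as np

def match_posteriors(log_p_tract, log_p_given_ref):
    """Posterior over 'candidate i is the match', plus 'none match'.

    Under 'i is the match' the data likelihood is
    P(t_i | ref) * prod_{j != i} P(t_j); under 'none match' it is
    prod_j P(t_j). Each hypothesis therefore reduces to the Bayes
    factor P(t_i | ref) / P(t_i), with a factor of 1 for 'none match'.
    """
    log_bf = np.asarray(log_p_given_ref) - np.asarray(log_p_tract)
    ratios = np.exp(np.append(log_bf, 0.0))  # last entry: 'none match'
    return ratios / ratios.sum()

# Three candidates; the second is strongly favoured by the reference.
print(match_posteriors(log_p_tract=[-10.0, -12.0, -9.0],
                       log_p_given_ref=[-9.5, -7.0, -9.2]))
```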

Page 31: median tract spline fit

Work with streamlines; reduce to the median tract.

Fit a B-spline to the 3D median tract.

Adjust knot point positions to constrain error on the reference tract.

[Figure: spline fit to the median tract, with the seed point marked.]

Page 32:

Two models: P(cos[·]) and P(cos[·], cos[·r] | cos[·]) = P(cos[·]) P(cos[·r] | cos[·], cos[·]).

Derive the second from the assumption that v1* is symmetric about v1.

[Diagram: direction vectors v0 and v1.]

Page 33: model

cos(θ) is uniform if the direction is uniform on the unit sphere. Use a Beta distribution plus a uniform component to model the probabilities; compute using hand-labelled training data.

Model the whole tract as a product of individual step probabilities.

Two cases: unmatched, matched.
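A minimal sketch of that per-step model: cos of each step angle mapped from [-1, 1] to [0, 1], scored under a Beta density (fitted, per the slide, from hand-labelled data) mixed with a uniform component, and summed in log over the steps of a tract. Parameter values here are illustrative, not those from the talk.

```python
import numpy as np
from scipy.stats import beta

def tract_log_likelihood(cos_angles, a=5.0, b=1.5, eps=0.1):
    """Log-probability of a tract as a product over its steps.

    Each step's cos(angle) is mapped from [-1, 1] to [0, 1] and scored
    under a Beta(a, b) density mixed with a uniform component of weight
    eps; the uniform matches cos(angle) for uniformly random directions.
    """
    u = (np.asarray(cos_angles) + 1.0) / 2.0
    step_p = (1.0 - eps) * beta.pdf(u, a, b) + eps * 1.0
    return np.log(step_p).sum()

# A smooth tract (cosines near 1) scores far above a jagged one, which
# separates the matched from the unmatched case.
print(tract_log_likelihood([0.95, 0.90, 0.98]))
print(tract_log_likelihood([0.10, -0.40, 0.30]))
```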

Page 34: results

Page 35: match quality

Posterior probabilities for the second and third subjects:

1: 0.332   2: 0.344   3: 0.822   4: 0.588   5: 0.877

For the first subject, the best match (top): 0.464; next best (middle): 0.116.

Three tracts >0.1, five >0.05 (all plausible matches), out of 220 candidate seeds. The posterior for the "central seed" (bottom) was 5.28 × 10^-6.

Page 36: Use match

Now we can compare like with like across brains: compute tract integrity measures.

Major improvement in comparative results.

Clayden, J.D., A.J. Storkey, S. Muñoz Maniega and M.E. Bastin (2009). Reproducibility of tract segmentation between sessions using an unsupervised modelling-based approach. NeuroImage 45:377-385.

Bastin, M.E., J.P. Piatowski, A.J. Storkey, L.J. Brown, A.M. MacLullich and J.D. Clayden (2008). Tract shape modelling provides evidence of topological change in corpus callosum genu during normal ageing. NeuroImage 43:20-28.

Bastin, M.E., S. Muñoz Maniega, K.J. Ferguson, L.J. Brown, J.M. Wardlaw, A.M. MacLullich and J.D. Clayden (2010). Quantifying the effects of normal ageing on white matter structure using unsupervised tract shape modelling. NeuroImage 51(1):1-10.

Penke, L., S. Muñoz Maniega, L.M. Houlihan, C. Murray, A.J. Gow, J.D. Clayden, M.E. Bastin, J.M. Wardlaw and I.J. Deary (2010). White matter integrity in the splenium of the corpus callosum is related to successful cognitive aging and partly mediates the protective effect of an ancestral polymorphism in ADRB2. Behavior Genetics 40(2):146-156.

Page 37: Conclusions

Dataset shift happens all the time

There are some common generic causes

Modelling involves a full generative understanding.

In many realistic scenarios accommodating shifts is non-trivial.

Model for likely changes.