understanding multivariate pattern analysis for...

Understanding multivariate pattern analysisfor neuroimaging applications

Maria Giulia PretiDimitri Van De VilleMedical Image Processing Lab, Institute of Bioengineering, Ecole Polytechnique Fédérale de LausanneDepartment of Radiology & Medical Informatics, Université de Genève

http://miplab.epfl.ch/

April 17, 2014

Overviewn Brain decoding: motivation and applications

n Limitations of “regular SPM” (GLM)n “Hall of fame” of brain decoding

n Pattern recognition principles for brain decodingn Basicsn Data processing pipeline for fMRI decoding

n Preprocessingn Feature extractionn Feature selection

n Cross-validationn Training phasen Evaluation phasen Testing phase

n Classifier selection

2

3

GLM

Stats

whole-brain scan2x2x2mm3, 20-30 slicesevery 2-4 sec

100’000 intracranial voxels1’000 timepoints

t-val

ue

time

Limitations of the GLM approachn Massively univariate approach

n Hypothesis testing is applied voxel-wisen All voxels are treated independently,

but are obviously not independentn Textbook multivariate approaches are not practical

n E.g., estimating spatial covariance matrix (100’000x100’000)by 1’000 timepoints will be rank-deficient

n fMRI data are complicated➔ Data-driven: let the data speak

n Group statistics are great...n ... but the ultimate goal is often

to know how well the resultsgeneralizes (prediction)

n Learn from the group, apply to the (unseen) individual4

condition 1 condition 2

voxel 1voxel 2

voxel 1voxel 2

Forward versus backwards inference

5

stimuliconditionsdiseasedisorder...

generative model

stimuliconditionsdiseasedisorder...

discriminative model

encoding

decoding

data

data

phrenology

blind reading

Exploiting “subtle” spatial patterns

[Cox et al., 2003; Haynes and Rees, 2006]

Brain decoding of visual processing

7[Haxby et al., 2001]

distributed patterns subtle patterns in V1

[Kamitani and Tong, 2005, 2006][Haynes, Rees., 2006]

advanced models forreceptive fields

[Miyawaki et al., 2008]

8

Decoding video sequences

reverse statistical inference(“decode” stimulus from data)

[Nishimoto et al, 2011]

Emotions in voice-sensitive corticesn How is emotional information treated in aud. cortex?

n 22 healthy right-handed subjectsn 10 actors pronounced “ne kalibam sout molem”

(anger, sadness, neutrality, relief, joy)n normalization for mean sound intensity;

rated by 24 (other) subjectsn Classifier approach on aud. cortex

n localizer: human voices > animal soundsn 20 stimuli for each category (5); 5 classifiers (one-versus-others)

9[Ethofer, VDV, Scherer, Vuilleumier, Current Biology, 2009; collaboration with LABNIC/CISA, UniGE]

p<0.05 (FWE)

10

Anger Sadness Neutral Relief Joy

z=0

z=+3

z=+6

[Ethofer, VDV, Scherrer, Vuilleumier, Current Biology, 2009]“Ne kalibam sout molem”

Decoding from the emotional brain

11[Ethofer, VDV, Scherer, Vuilleumier, Current Biology, 2009]

Results for bilateralvoice-sensitive cortices

solid: smoothed (10mm FWHM)dashed: non-smootheddotted: spatial average

Chance level: 20%Classifier: linear SVM

Evidence that auditory cues for emotional processing are spatially encoded; bilaterial representation

Comprehension of emotional prosody is crucial for social functioning psychiatric disorders (schizophrenia, bipolar affective disorder,...)

Brain decodingn Automatic classification of structural images

n SVM applied to similarity measures (inner product)

12[Klöppel et al., 2008]

Basics of pattern recognition

n Feature vector is typically embedded in a vector space, where each feature is a dimension

13

featurevector xi

D featuresmodel

functionmachineclassifier

...

label yi(e.g., 0/1)


14

featurematrix X

D featuresmodel

labels y

N instances

Classification as a regression problem

Linear model:

y

T = X

Tw1 + w0 + noise

where the weight vector w = (w1, w0) characterizes the model

Learning the model by least-squares fitting: w = argminw��yT �X

Tw1 � w0

��2

(notice problem when D > N )

Classify new data by comparing x

0Tw against 0.5?


15

n Linear model projects instance on line parallel to wn Decision boundary is a hyperplane in the feature space for a linear

modeln If feature dimensions correspond to voxels, then w corresponds to a map

[adapted from Jaakkola]

Classification and decision theory

Assume class-conditional densities p(x|y) and class probabilities P (y) known

Minimizing the overall probability of error corresponds to

y0 = argmax

yP (y|x0

)

= argmax

y

p(x0|y)P (y)

p(x0)

(Bayes’ rule)

= argmax

yp(x0|y)P (y)


16

featurematrix X

D featuresmodel

labels y

N instances

posterior

likelihood

P (y|x)p(x|y)

Wonderful link between discriminative model (posterior) and gener-

ative model (likelihood)

P (y|x) / p(x|y)P (y)

Discriminative model; e.g., logistic regression

log

✓P (y = 1|x)P (y = 0|x)

◆= w

T1 x+ w0 leads to P (y = 1|x) = g(w

T1 x+ w0)

Generative model; e.g., na¨ıve Bayes

p(x|y) =DY

i=1

p(xi|y)

where the features are modeled as independent random variables


17

posterior

likelihood

Model order selection

�2 log p(x|✓K , y)| {z }log-likelihood

+penalty(✓K)

Model ✓K with K parameters

Occam’s razor: make model as simple as possible

Akaike information criterion: penalty = 2K

Bayesian information criterion: penalty = K logD


18

FMRI processing pipelinen From fMRI raw data to decisions

19

acquire signal

pre-process signal

extract features

selectfeatures

trainclassifier

[trai

n]

[test]classify

decision �

classifier �

Motion correctionCoregistration

NormalisationParcellation

activity

connectivity

mean per voxel

mean per region

Univariate

Multivariate

D'

D < D'

Preprocessing n Due to large inter-subject variability (anatomical and

functional) and artifacts, (f)MRI data must be preprocessed before feature extraction

n Possible preprocessing stepsn Slice-timing correctionn Susceptibility correctionn Motion correctionn Detrending (scanner drift)n Physiological artifacts removaln Normalisation to common template (e.g., MNI)

n Exploit structural MRI data (T1)

20

Feature extractionn Three key principles

n Use engineering know-how (e.g., try to increase SNR)n Use domain knowledge (e.g., look at the right regions)n Aim at interpretability, depending on the task

n Intra- vs intersubject decoding

n Spatial dimensionsn Gray-matter segmentation (10’000 voxels)n Atlasing (100-1’000 regions) n Interhemispherical differencesn Searchlight approach [Kriegeskorte et al., 2006]n Transform domain (e.g., PCA/ICA, wavelets)

n Time dimensionn Exploit task paradigm n Incorporate knowledge about hemodynamic responsen Temporal compression

n Within and between timecourses21

Feature extraction (temporal compression)

n local features: one vector per scan (e.g., intensity or derivative)n segmental features: computed over a number of time points (e.g.,

sliding-window measures)n global features: computed over whole timecourse (e.g., mean,

standard deviation)n statistical dependencies: between timecourses (e.g., functional

connectivity based on Pearson correlation)22

Feature selectionn Establish a subset of most discriminative features

n Reduce dimensionality of the feature space to D’n Feature selection is a combinatorial search

problem for the best feature subset based on a search strategy and an objective functionn Search strategies can be univariate (select one voxel at

a time) or multivariaten optimal (exhaustive), suboptimal; best-N, genetic algorithms,

backward/forward recursive selection...n Objective functions are of three kinds:

n filter: the objective function is not a discriminant functionn wrapper: the objective function is the error rate of the final

classifiern embedded: the objective function is directly based on the

objective function of the final classifier

23

Simple method: point-biserial correlationn For each of the D features, measure point-biserial

correlation:

n Then generate a ranking according to |rpb| and keep only top D’ features

2483 27 49 16 42 39 80 22 46 48 64 58 8 26 45 2 53 17 85 21 24 79 70 69 63 61 19 6 9 54 76 10 71 15 55 32 38 14 3 74 25 12 28 90 770

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Feature number (AAL brain region)

|r pb|

D’

rpb =X!1 �X!0

�X

rN0N1

N2

Model learningn Training set with known labels (supervised learning)

n Model and its parameters depends on type of classifiern Discriminative models

n Logistic regressionn Fisher discriminant analysisn Support vector machinesn Random forestn Neural networks

n Generative modelsn Naïve Bayes, can assume different types of densitiesn Gaussian mixture modeln Hidden Markov model

25

n Classifier prediction can be correct or wrongn Accuracy statistics can be shown in a confusion

matrix (summary table):n Class 0 accuracy=A/(A+B)n Class 1 accuracy=D/(C+D)n Accuracy= (A+D)/(A+B+C+D)n Biostats (ω0 =healthy, ω1 =disease):

n sensitivity (“Disease is PRESENT and I report disease”): D/(C+D) = TP/(FN+TP)

n specificity: (“Disease is ABSENT and I report no disease”: A/(A+B) = TN/(FP+TN)

n Perfect: B=C=0. Be suspicious if this happens!n Random: A=B=C=D. Same as flipping a coin

Testing classifiers: error analysis

26

ω0 ω1

ω0 A Bω1 C Dtru

th

estimated

Testing classifiers: the ROC curven Increasing sensitivity can

never increase specificity at the same time, and vice-versa

n Receiver Operating Characteristic (ROC) curven Allow to see compromise over

the operating range of a classifier

n Can be used forclassifier comparisone.g., compute the Area Under Curve (AUC) as a summary measure of performance

27

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1−specificityse

nsiti

vity

classifier 1 − AUC 0.79classifier 2 − AUC 0.74

Proper tuning and error estimation

28

n Maintain strict separation throughoutn Remember our goal: classify unseen datan This applies at all stages of the processing chainn Always ask yourself: what would I do if I acquired data

for one more subject after my classifier is trained? n If not, main cause of over-optimistic results (and even

useless in some cases)

training evaluation testdatalabels

TR EV TE

Train Evaluate TestErrorOK?

TREV

TE

Final Performance

MeasureSplitFull

Dataset

Tune

Classifier selectionn How to choose a classifier?

n No Free Lunch Theorem n You cannot get learning for free, every classifier has its biases

n Problem characteristics (data structure, dimensions, ...)n Prior experience and personal preferencen Cross-validation + statistical testing

n Model complexity and optimal parameters ?n Occam’s razor

n As simple as possiblen Information-theoretic

criterian AIC, BIC/MDL, ...

n Empirical approachesn Cross-validationn Surrogate data 29[Hasbie et al., 2009]

high reproducibilitylow predictability

low reproducibilityhigh predictability

Challenges and opportunitiesn High-dimensional learning problem

n Large number of features compared to instancesn Kernel trick

n Basically, look at XTX instead of XXT

n “Distance” between instances depends on definition of kernel functionn Interpretation

n How to get “brain maps” (e.g., showing where information is processed)n Integration versus segregation

n Estimate where you are in the bias-variance trade-offn Multimodal approaches

n Combine different classifiers (“ensemble learning”)n Intra-subject decoding

n Fine-grained organization of brain activation patterns (“hyperacuity”)n Inter-subject decoding

n Multicentric datasetsn Diagnosis and prognosis

n Prediction at the individual patient leveln Beyond patient-or-not, predict clinical measures ~ surrogate functional markers

Recommended readingn Pereira et al., Machine learning classifiers and fMRI:

A tutorial overview, NeuroImage 45, 2009n Richiardi et al., Machine learning with brain graphs,

IEEE Signal Processing Magazine 30, 2013n Haufe et al., On the interpretation of weight vectors

of linear models in multivariate neuroimaging, NeuroImage, in press

31

Classifier selectionn Is classifier A better than classifier B?n Hypothesis testing framework

n “Do the two sets of decisions of classifiers A and B represent two different populations?”

n Careful: Samples (sets of decisions) are dependent (same evaluation/testing dataset), and measurement type is categorical binary (for a two-class problem)

⇒Use McNemar testH0: “there is no difference between the accuracies of the two classifiers”

32

test sample

AB

1 2 3 4 5 6 7 8 9 10✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✘ ✘

✔ ✔ ✘ ✔ ✔ ✔ ✘ ✔ ✔ ✘

n Compute relationships between classifier decisions

n If H0 is true, we should have N01 = N10 = 0.5(N01+N10)n We can measure the following test-static based on

the observed counts:

n Then, compare x2 to critical value of χ2 (1 DOF) at a level of significance α

n (requires N01+N10>25)

Selection with the McNemar test

33

B ✔ B ✘

A ✔ N11 N10

A ✘ N01 N00

After [Dietterich, 1998]

x

2 =(|N01 �N10|� 1)2

N01 +N10

understanding multivariate pattern analysis for...

Documents