
Page 1:

CPSC 540: Machine Learning

Conditional Random Fields, Latent Dynamics

Winter 2016

Page 2:

Admin

• Assignment 5:

– Due in 1 week.

• Project:

– Due date moved again to April 26 (so that undergrads can graduate).

– Graduate students graduating in May must submit by April 21.

• No tutorial Friday (or in subsequent weeks).

• Final help session Monday.

• Thursday class may go long.

Page 3:

Motivation: Automatic Brain Tumor Segmentation

• Task: segmenting tumors and normal tissue in multi-modal MRI data.

• Applications:

– Radiation therapy target planning, quantifying treatment responses.

– Mining growth patterns, image-guided surgery.

• Challenges:

– Variety of tumor appearances, similarity to normal tissue.

– “You are never going to solve this problem.”

[Figure: example input MRI modalities and output segmentation]

Page 4:

Naïve Approach: Voxel-Level Classifier

• We could treat classifying a voxel as supervised learning:

• “Learn” a model that predicts yi given xi: the model can then classify new voxels.

• Advantage: we can apply machine learning, and ML is cool.

• Disadvantage: it doesn’t work at all.
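To make the voxel-level setup above concrete, here is a minimal sketch using scikit-learn's logistic regression on randomly generated stand-in data (the array sizes, feature count, and class set are hypothetical, not those of the actual MRI pipeline):

```python
# Minimal sketch of the naive voxel-level classifier: every voxel is an
# independent training example with its own feature vector and label.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_voxels, n_features = 10_000, 4                  # hypothetical sizes (e.g., 4 MRI modalities)
X = rng.normal(size=(n_voxels, n_features))       # stand-in for per-voxel intensities
y = rng.integers(0, 3, size=n_voxels)             # stand-in labels (e.g., background/normal/tumor)

model = LogisticRegression(max_iter=1000).fit(X, y)        # "learn" p(y_i | x_i)
y_new = model.predict(rng.normal(size=(5, n_features)))    # classify new voxels independently
```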

Page 5:

Naïve Approach: Voxel-Level Classifier

• Even in “nice” cases, significant overlap between tissues:

– Mixture of Gaussians and “outlier” class:

• Problems with naïve approach:

– Intensities not standardized.

– Location and texture matter.

Page 6:

Improvement 1: Intensity Standardization

• Want xi = <98,187,246> to mean the same thing in different places.

• Pre-processing to normalize intensities:

[Figures: intensity normalization within images, between slices, and between people]
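As a hedged illustration, one simple standardization scheme is to z-score each modality within the brain mask of each scan, so an intensity like <98,187,246> is on a comparable scale everywhere; the actual system may have used more sophisticated histogram-based normalization.

```python
# Illustrative standardization (not necessarily the method used in the course's system):
# z-score each modality's intensities within the brain mask of each scan.
import numpy as np

def standardize(image, mask):
    """image: (D, H, W, n_modalities) array; mask: boolean (D, H, W) brain mask."""
    out = image.astype(float).copy()
    for m in range(image.shape[-1]):
        vals = image[..., m][mask]
        out[..., m] = (image[..., m] - vals.mean()) / (vals.std() + 1e-8)
    return out
```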

Page 7:

Improvement 2: Template Alignment

• Location matters:

– Seeing xi = <98,187,246> in one area of the head is different than seeing it in other areas.

• Alignment to a standard coordinate system:

Page 8:

Improvement 2: Template Alignment

• Add spatial features that take into account location information:

[Figures: aligned input images; template images; priors for normal tissue locations; bilateral symmetry based on known axis]
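For illustration only, spatial features can be as simple as appending each voxel's coordinates in the aligned template space and the template's tissue priors at that location to the intensity features (a sketch with hypothetical array names, not the exact features used):

```python
# Sketch: augment per-voxel intensity features with spatial information
# obtained from the template alignment (coordinates and tissue priors).
import numpy as np

def add_spatial_features(intensities, coords, tissue_priors):
    """intensities: (n_voxels, n_modalities); coords: (n_voxels, 3) aligned coordinates
    scaled to [0, 1]; tissue_priors: (n_voxels, n_tissues) template probabilities."""
    return np.hstack([intensities, coords, tissue_priors])
```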

Page 9:

Improvement 3: Convolutions

• Use convolutions to incorporate neighborhood information.

– We used fixed convolutions; now you would try to learn them.
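A sketch of the fixed-convolution idea (the particular filters here, Gaussian smoothing at a few scales, are illustrative rather than the exact ones used):

```python
# Sketch: neighbourhood features from fixed convolutions, e.g. Gaussian
# smoothing of each modality at several scales, read back at each voxel.
import numpy as np
from scipy.ndimage import gaussian_filter

def convolution_features(image, sigmas=(1.0, 2.0, 4.0)):
    """image: (D, H, W, n_modalities); returns (D, H, W, n_modalities * len(sigmas))."""
    feats = [gaussian_filter(image[..., m], sigma=s)
             for m in range(image.shape[-1]) for s in sigmas]
    return np.stack(feats, axis=-1)
```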

Page 10:
Page 11:

Performance of Final System

Page 12:

Challenges

• The final system used a linear classifier, and typically worked well.

• But several ML challenges arose:

1. Time: 14 hours to train logistic regression on 10 images.

• Led to quasi-Newton, stochastic gradient, and SAG work.

2. Overfitting: using all features hurt, so we used manual feature selection.

• Led to regularization, L1-regularization, and structured sparsity work.

3. Relaxation: post-processing by filtering and hole-filling of labels.

• Led to conditional random fields, shape priors, and structure learning work.

Page 13:

Outline

• Motivation

• Conditional Random Fields Clean Up

• Latent/Deep Graphical Models

Page 14:

Multi-Class Logistic Regression: View 1

• Recall that multi-class logistic regression makes decisions using:

• Here, f(x) are features and we have a vector wy for each class ‘y’.

• Normally fit wy using regularized maximum likelihood assuming:

• This softmax function yields a differentiable and convex NLL.
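The two equations referred to above did not survive extraction; in the slide's notation the standard forms are:

$$\hat{y} = \operatorname*{argmax}_{y} \; w_y^\top f(x), \qquad p(y \mid x, w) = \frac{\exp(w_y^\top f(x))}{\sum_{y'} \exp(w_{y'}^\top f(x))}.$$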

Page 15:

Multi-Class Logistic Regression: View 2

• Recall that multi-class logistic regression makes decisions using:

• Claim: can be written using a single ‘w’ and features of ‘x’ and ‘y’.

• To do this, we can use the construction:
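A reconstruction of the standard construction (the slide's equation was lost): stack the per-class weight vectors into one long w, and place f(x) in the block selected by y,

$$w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_k \end{bmatrix}, \qquad F(x, y) = \begin{bmatrix} I(y = 1)\, f(x) \\ I(y = 2)\, f(x) \\ \vdots \\ I(y = k)\, f(x) \end{bmatrix}, \qquad \text{so that } w^\top F(x, y) = w_y^\top f(x).$$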

Page 16:

Multi-Class Logistic Regression: View 2

• So multi-class logistic regression with new notation uses:

• And the softmax probabilities are given by:

• View 2 gives extra flexibility in defining features:

– For example, we might have different features for class 1 than 2:

Page 17:

Multi-Class Logistic Regression for Segmentation

• In the brain tumor example, each xi contains the features for one voxel:

• But we want to label the whole image:

• Probability of segmentation Y given image X with independent model:
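Written out (reconstructing the formula that did not survive extraction), the independent model is:

$$p(Y \mid X, w) = \prod_{i=1}^{n} p(y_i \mid x_i, w) = \prod_{i=1}^{n} \frac{\exp(w^\top F(x_i, y_i))}{\sum_{y'} \exp(w^\top F(x_i, y'))}.$$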

Page 18:

Conditional Random Fields

• Unfortunately, independent model gives silly results:

• This model of p(Y|X,w) misses the “guilt by association”:

– Neighbouring voxels are likely to receive the same labels.

• The key ingredients of conditional random fields (CRFs):

– Define features of the entire image and labelling, F(X,Y):

– We can model dependencies using features that depend on multiple yi.
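Filling in the standard log-linear form (the slide's own equation was lost in extraction):

$$p(Y \mid X, w) = \frac{\exp(w^\top F(X, Y))}{Z(X, w)}, \qquad Z(X, w) = \sum_{Y'} \exp(w^\top F(X, Y')).$$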

Page 19:

Conditional Random Fields

• Interpretation of independent model as CRF:
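A reconstruction of this interpretation: if the features decompose over voxels as a sum of per-voxel features, the normalizer factorizes and we recover the independent model,

$$p(Y \mid X, w) = \frac{\exp\!\big(\sum_{i=1}^{n} w^\top F(X, y_i)\big)}{\sum_{Y'} \exp\!\big(\sum_{i=1}^{n} w^\top F(X, y_i')\big)} = \prod_{i=1}^{n} \frac{\exp(w^\top F(X, y_i))}{\sum_{y'} \exp(w^\top F(X, y'))}.$$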

Page 20:

Conditional Random Fields

• Example of modeling dependencies between neighbours as a CRF:

Page 21:

Conditional Random Fields for Segmentation

• Recall the performance with the independent classifier:

– Features of the form f(X,yi).

• Consider a CRF that also has pairwise features:

– Features f(X,yi,yj) for all (i,j) corresponding to adjacent voxels.

– Model “guilt by association”:
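One standard way to write such a pairwise CRF (a reconstruction; the exact parameterization on the slide may differ), with E the set of adjacent voxel pairs:

$$p(Y \mid X, w, v) \propto \exp\!\Big(\sum_{i} w^\top f(X, y_i) + \sum_{(i,j) \in E} v^\top f(X, y_i, y_j)\Big).$$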

Page 22:

Conditional Random Fields as Graphical Models

• Seems great: we can now model dependencies in the labels.

– Why not model three-way interactions with F(X,yi,yj,yk)?

– How about adding things like shape priors F(X,Yr) for some region ‘r’?

• Challenge is that inference and decoding can become hard.

• We can view CRFs as undirected graphical models:

• If the graph is too complicated (and we don’t have special ‘F’):

– Intractable since we need inference (computing Z/marginals) for training.

Page 23:

Overview of Exact Methods for Graphical Models

• We can do exact decoding/inference/sampling for:

– Small number of variables (enumeration).

– Chains (Viterbi, forward-backward, forward-filter backward-sample); see the sketch after this list.

– Trees (belief propagation).

– Low treewidth graphs (junction tree).

• Other cases where exact computation is possible:

– Semi-Markov chains (allow dependence on time you spend in a state).

– Context-free grammars (allow potentials on recursively-nested parts of the sequence).

– Binary labels (k = 2) and “attractive” potentials (exact decoding via graph cuts).

– Sum-product networks (restrict potentials to allow exact computation).
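For the chain case mentioned above, here is a minimal sketch of Viterbi decoding (an illustrative re-implementation, not the course's UGM code; log_phi and log_psi are assumed log node and edge potentials):

```python
# Viterbi decoding for a chain UGM with K states, node potentials phi[i, s]
# and a shared edge potential psi[s, t], all given in log space.
import numpy as np

def viterbi_decode(log_phi, log_psi):
    """log_phi: (n, K) log node potentials; log_psi: (K, K) log edge potentials.
    Returns the labelling y maximizing sum_i log_phi[i, y_i] + sum_i log_psi[y_{i-1}, y_i]."""
    n, K = log_phi.shape
    best = np.zeros((n, K))              # best[i, s]: best score of any prefix ending in state s
    back = np.zeros((n, K), dtype=int)   # backpointers
    best[0] = log_phi[0]
    for i in range(1, n):
        scores = best[i - 1][:, None] + log_psi + log_phi[i][None, :]  # (prev state, current state)
        back[i] = np.argmax(scores, axis=0)
        best[i] = np.max(scores, axis=0)
    # Backtrack from the best final state.
    y = np.zeros(n, dtype=int)
    y[-1] = np.argmax(best[-1])
    for i in range(n - 1, 0, -1):
        y[i - 1] = back[i, y[i]]
    return y

# Tiny usage example: 4 positions, 2 states, "attractive" edge potentials.
rng = np.random.default_rng(0)
log_phi = rng.normal(size=(4, 2))
log_psi = np.log(np.array([[2.0, 1.0], [1.0, 2.0]]))
print(viterbi_decode(log_phi, log_psi))
```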

Page 24:

Overview of Approximate Methods for Graphical Models

• Approximate decoding with local search:

– Coordinate descent is called iterated conditional mode (ICM); a sketch follows this list.

• Approximate sampling with MCMC:

– We saw Gibbs sampling last week.

• Approximate inference with variational methods:

– Mean field, loopy belief propagation, tree-reweighted belief propagation.

• Approximate decoding with convex relaxations:

– Linear programming approximation.

• Block versions of all of the above:

– A variant is alpha-expansions: block moves involving entire classes.
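A minimal sketch of ICM, the coordinate-descent decoder mentioned above (illustrative; it assumes a pairwise model with a symmetric shared edge potential and a hypothetical adjacency list):

```python
# Iterated conditional mode (ICM) for a pairwise UGM: repeatedly pick, for
# each node in turn, the label maximizing its local score with neighbours fixed.
import numpy as np

def icm(log_phi, log_psi, neighbors, n_sweeps=10):
    """log_phi: (n, K) log node potentials; log_psi: (K, K) symmetric log edge potential;
    neighbors: list of lists, neighbors[i] = indices adjacent to node i."""
    n, K = log_phi.shape
    y = np.argmax(log_phi, axis=1)           # start from the independent decoding
    for _ in range(n_sweeps):
        changed = False
        for i in range(n):
            scores = log_phi[i].copy()        # local score of each state, neighbours fixed
            for j in neighbors[i]:
                scores += log_psi[:, y[j]]
            s = int(np.argmax(scores))
            if s != y[i]:
                y[i], changed = s, True
        if not changed:                       # local optimum reached
            break
    return y
```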

Page 25:

Overview of Methods for Fitting Graphical Models

• If inference is intractable, there are some alternatives for learning:

– Variational inference to approximate Z and marginals.

– Pseudo-likelihood: a fast and cheap convex approximation for learning (written out below).

– Structured SVMs: generalization of SVMs that only requires decoding.

– Younes’ algorithm: alternate between Gibbs sampling and parameter updates.

• Also known as “persistent contrastive divergence”.
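The pseudo-likelihood referred to above replaces the conditional likelihood with a product of full conditionals, each needing only a local normalization rather than the global Z:

$$\tilde{p}(Y \mid X, w) = \prod_{i=1}^{n} p(y_i \mid Y_{-i}, X, w).$$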

• For more details on graphical models, see:

– UGM software: http://www.cs.ubc.ca/~schmidtm/Software/UGM.html

– MLRG PGM crash course: http://www.cs.ubc.ca/labs/lci/mlrg

Page 26:

Independent Logistic Regression

[Diagram: nodes x1–x5, h3, y3]

Page 27:

Independent Logistic Regression

[Diagram: nodes xi, hi, yi for i = 1,…,5]

Page 28:

Conditional Random Field (CRF)

[Diagram: nodes xi, hi, yi for i = 1,…,5]

Page 29:

Conditional Random Field (CRF)

[Diagram: nodes xi, hi, yi for i = 1,…,5]

Page 30:

Neural Network

[Diagram: nodes xi, hi, yi for i = 1,…,5]

Page 31:

Deep Learning

[Diagram: nodes xi, hi, gi for i = 1,…,5, and labels y1–y5]

Page 32:

Conditional Neural Field (CNF)

[Diagram: nodes xi, hi, gi for i = 1,…,5, and labels y1–y5]

Page 33:

Outline

• Motivation

• Conditional Random Fields Clean Up

• Latent/Deep Graphical Models

Page 34:

Motivation: Gesture Recognition

• Want to recognize gestures from video:

• A gesture is composed of a sequence of parts:

– Some parts appear in different gestures.

• We have gesture (sequence) labels:

– But no part labels.

– We don’t know what the parts should be.

http://groups.csail.mit.edu/vision/vip/papers/wang06cvpr.pdf

Page 35:

Hidden Markov Model (HMM)

[Diagram: nodes xi, hi for i = 1,…,5]

Page 36:

Generative HMM Classifier

[Diagram: nodes xi, hi for i = 1,…,5, and a single label y]

Page 37:

Conditional Random Field (CRF)

[Diagram: nodes xi, hi for i = 1,…,5]

Page 38:

Hidden Conditional Random Field (HCRF)

[Diagram: nodes xi, hi for i = 1,…,5, and a single label y]

Page 39:

Graphical Models with Hidden Variables

• As before we deal with hidden variables by marginalizing:

• If we assume a UGM over {Y,H} given X we get:

• Consider the usual choice of log-linear potentials φ:

– NLL = -log(Z(Y)) + log(Z).
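Writing out the marginalization under this log-linear choice:

$$p(Y \mid X, w) = \sum_{H} p(Y, H \mid X, w) = \frac{\sum_{H} \exp(w^\top F(X, Y, H))}{\sum_{Y', H'} \exp(w^\top F(X, Y', H'))} = \frac{Z(Y)}{Z},$$

so the NLL is -log(Z(Y)) + log(Z), a difference of two log-normalizers, each computable with the same inference methods as before.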

Page 40:

Motivation: Gesture Recognition

• What if we want to label video with multiple potential gestures?

– Assume we have labeled video sequences.

http://groups.csail.mit.edu/vision/vip/papers/morency_cvpr07.pdf

Page 41:

Latent-Dynamic Conditional Random Field (LDCRF)

[Diagram: nodes xi, hi for i = 1,…,5, and labels y1–y5]

Page 42:

Latent-Dynamic Conditional Neural Field (LDCNF)

[Diagram: nodes xi, hi, gi for i = 1,…,5, and labels y1–y5]

Page 43:

Summary

• Conditional random fields generalize logistic regression:

– Allow dependencies between labels.

– Requires inference in graphical model.

• Conditional neural fields combine CRFs with deep learning.

– Could also replace CRF with conditional density estimators (e.g., DAGs).

• UGMs with hidden variables have a nice form: a ratio of normalizers.

– Can do inference with the same methods.

• Latent-dynamic conditional random/neural fields:

– Allow dependencies between hidden variables.

• Next time: Boltzmann machines, LSTMs, and beyond CPSC 540.