cpsc 540: machine learningschmidtm/courses/540-w16/l23.pdf · overfitting: using all features hurt,...
TRANSCRIPT
CPSC 540: Machine Learning
Conditional Random Fields, Latent Dynamics
Winter 2016
Admin
• Assignment 5:
– Due in 1 week.
• Project:
– Due date moved again to April 26 (so that undergrads can graduate).
– Graduate students graduating in May must submit by April 21.
• No tutorial Friday (or in subsequent weeks).
• Final help session Monday.
• Thursday class may go long.
Motivation: Automatic Brain Tumor Segmentation
• Task: segmentation tumors and normal tissue in multi-modal MRI data.
• Applications:– Radiation therapy target planning, quantifying treatment responses.– Mining growth patterns, image-guided surgery.
• Challenges:– Variety of tumor appearances, similarity to normal tissue.– “You are never going to solve this problem.”
Input: Output:
Naïve Approach: Voxel-Level Classifier
• We could treat classifying a voxel as supervised learning:
• “Learn” model that predicts yi given xi: model can classify new voxels.
• Advantage: we can apply machine learning, and ML is cool.
• Disadvantage: it doesn’t work at all.
Naïve Approach: Voxel-Level Classifier
• Even in “nice” cases, significant overlap between tissues:
– Mixture of Gaussians and “outlier” class:
• Problems with naïve approach:
– Intensities not standardized.
– Location and texture matter.
Improvement 1: Intensity Standardization
• Want xi = <98,187,246> to mean same thing in different places.
• Pre-processing to normalize intensities:
Between slices:Within Images: Between people:
Improvement 2: Template Alignment
• Location matters:
– Seeing xi =<98,187,246> in one area of head is different than in other areas.
• Alignment to standard coordinates system:
Improvement 2: Template Alignment
• Add spatial features that take into account location information:
Aligned input images:
Template images:Priors for normal tissue locations:
Bilateral symmetry based on known axis:
Improvement 3: Convolutions
• Use convolutions to incorporate neighborhood information.
– We used fixed convolutions, now you would try to learn them.
Performance of Final System
Challenges
• Final system used linear classifer, and typically worked well.
• But several ML challenges arose:
1. Time: 14 hours to train logistic regression on 10 images.
• Lead to quasi-Newton, stochastic gradient, and SAG work.
2. Overfitting: using all features hurt, so we used manual feature selection.
• Lead to regularization, L1-regularization, and structured sparsity work.
3. Relaxation: post-processing by filtering and `hole-filling of labels.
• Lead to conditional random fields, shape priors, and structure learning work.
Outline
• Motivation
• Conditional Random Fields Clean Up
• Latent/Deep Graphical Models
Multi-Class Logistic Regression: View 1
• Recall that multi-class logistic regression makes decisions using:
• Here, f(x) are features and we have a vector wy for each class ‘y’.
• Normally fit wy using regularized maximum likelihood assuming:
• This softmax function yields a differentiable and convex NLL.
Multi-Class Logistic Regression: View 2
• Recall that multi-class logistic regression makes decisions using:
• Claim: can be written using a single ‘w’ and features of ‘x’ and ‘y’.
• To do this, we can use the construction:
Multi-Class Logistic Regression: View 2
• So multi-class logistic regression with new notation uses:
• And softmax probabilities gives:
• View 2 gives extra flexibility in defining features:
– For example, we might have different features for class 1 than 2:
Multi-Class Logistic Regression for Segmentation
• In brain tumor example, each xi is the features for one voxel:
• But we want to label the whole image:
• Probability of segmentation Y given image X with independent model:
Conditional Random Fields
• Unfortunately, independent model gives silly results:
• This model of p(Y|X,w) misses the “guilt by association”:
– Neighbouring voxels are likely to receive the same values.
• The key ingredients of conditional random fields (CRFs):
– Define features of entire image and labelling F(X,Y):
– We can model dependencies using features that depend on multiple yi.
Conditional Random Fields
• Interpretation of independent model as CRF:
Conditional Random Fields
• Example of modeling dependencies between neighbours as a CRF:
Conditional Random Fields for Segmentation
• Recall the performance with the independent classifier:
– Features of the form f(X,yi).
• Consider a CRF that also has pairwise features:
– Features f(X,yi,yj) for all (i,j) corresponding to adjacent voxels.
– Model “guilt by association”:
Conditional Random Fields as Graphical Models
• Seems great: we can now model dependencies in the labels.
– Why not model threeway interactions with F(X,yi,yj,yk)?
– How about adding things like shape priors F(X,Yr) for some region ‘r’?
• Challenge is that inference and decoding can become hard.
• We can view CRFs as undirected graphical models:
• If the graph is too complicated (and we don’t have special ‘F’):
– Intractable since we need inference (computing Z/marginals) for training.
Overview of Exact Methods for Graphical Models
• We can do exact decoding/inference/sampling for:– Small number of variables (enumeration).
– Chains (Viterbi, forward-backward, forward-filter backward-sample).
– Trees (belief propagation).
– Low treewidth graphs (junction tree).
• Other cases where exact computation is possible:– Semi-Markov chains (allow dependence on time you spend in a state).
– Context-free grammars (allows potentials on recursively-nested parts of sequence).
– Binary ‘k’ and “attractive” potentials (exact decoding via graph cuts).
– Sum-product networks (restrict potentials to allow exact computation).
Overview of Approximate Methods for Graphical Models
• Approximate decoding with local search:
– Coordinate descent is called iterated conditional mode (ICM).
• Approximate sampling with MCMC:
– We saw Gibbs sampling last week.
• Approximate inference with variational methods:
– Mean field, loopy belief propagation, tree-reweighted belief propagation.
• Approximate decoding with convex relaxations:
– Linear programming approximation.
• Block versions of all of the above:
– Variant is alpha-expansions: block moves involving classes.
Overview of Methods for Fitting Graphical Models
• If inference is intractable, there are some alternatives for learning:
– Variational inference to approximate Z and marginals.
– Pseudo-likelihood: fast and cheap convex approximation for learning.
– Structured SVMs: generalization of SVMs that only requires decoding.
– Younes: alternate between Gibbs sampling and parameter update.
• Also known as “persistent contrastive divergence”.
• For more details on graphical models, see:
– UGM software: http://www.cs.ubc.ca/~schmidtm/Software/UGM.html
– MLRG PGM crash course: http://www.cs.ubc.ca/labs/lci/mlrg
Independent Logistic Regression
x1 x2 x3
h3
y3
x4 x5
Independent Logistic Regression
x1
h1
y1
x2
h2
y2
x3
h3
y3
x4
h4
y4
x5
h5
y5
Conditional Random Field (CRF)
x1
h1
y1
x2
h2
y2
x3
h3
y3
x4
h4
y4
x5
h5
y5
Conditional Random Field (CRF)
x1
h1
y1
x2
h2
y2
x3
h3
y3
x4
h4
y4
x5
h5
y5
Neural Network
x1
h1
y1
x2
h2
y2
x3
h3
y3
x4
h4
y4
x5
h5
y5
Deep Learning
x1
h1
g1
x2
h2
g2
x3
h3
g3
x4
h4
g4
x5
h5
g5
y1 y2 y3 y4 y5
Conditional Neural Field (CNF)
x1
h1
g1
x2
h2
g2
x3
h3
g3
x4
h4
g4
x5
h5
g5
y1 y2 y3 y4 y5
Outline
• Motivation
• Conditional Random Fields Clean Up
• Latent/Deep Graphical Models
Motivation: Gesture Recognition
• Want to recognize gestures from video:
• A gesture is composed of a sequence of parts:
– Some parts appear in different gestures.
• We have gesture (sequence) labels:
– But no part labels.
– We don’t know what the parts should be.
http://groups.csail.mit.edu/vision/vip/papers/wang06cvpr.pdf
Hidden Markov Model (HMM)
x1
h1
x2
h2
x3
h3
x4
h4
x5
h5
Generative HMM Classifier
x1
h1
x2
h2
x3
h3
x4
h4
x5
h5
y
Conditional Random Field (CRF)
x1
h1
x2
h2
x3
h3
x4
h4
x5
h5
Hidden Conditional Random Field (HCRF)
x1
h1
x2
h2
x3
h3
x4
h4
x5
h5
y
Graphical Models with Hidden Variables
• As before we deal with hidden variables by marginalizing:
• If we assume a UGM over {Y,H} given X we get:
• Consider usual choice of log-linear phi:
– NLL = -log(Z(Y)) + log(Z).
Motivation: Gesture Recognition
• What if we want to label video with multiple potential gestures?
– Assume we have labeled video sequences.
http://groups.csail.mit.edu/vision/vip/papers/morency_cvpr07.pdf
Latent-Dynamic Conditional Random Field (LDCRF)
x1
h1
x2
h2
x3
h3
x4
h4
x5
h5
y1 y2 y3 y4 y5
y1 y2 y3 y4 y5
Latent-Dynamic Conditional Neural Field (LDCNF)
x1
h1
g1
x2
h2
g2
x3
h3
g3
x4
h4
g4
x5
h5
g5
Summary
• Conditional random fields generalize logistic regression:– Allows dependencies between labels.
– Requires inference in graphical model.
• Conditional neural fields combine CRFs with deep learning.– Could also replace CRF with conditional density estimators (e.g., DAGs).
• UGMs with hidden variables have nice form: ratio of normalizers.– Can do inference with same methods.
• Latent dynamic conditional random/neural fields:– Allow dependencies between hidden variables.
• Next time: Boltzmann machines, LSTMs, and beyond CPSC 540.