
Page 1:

CPSC 540: Machine Learning

Conditional Random Fields, Latent Dynamics

Winter 2016

Page 2:

Admin

• Assignment 5:

– Due in 1 week.

• Project:

– Due date moved again to April 26 (so that undergrads can graduate).

– Graduate students graduating in May must submit by April 21.

• No tutorial Friday (or in subsequent weeks).

• Final help session Monday.

• Thursday class may go long.

Page 3:

Motivation: Automatic Brain Tumor Segmentation

• Task: segmenting tumors and normal tissue in multi-modal MRI data.

• Applications:

– Radiation therapy target planning, quantifying treatment responses.

– Mining growth patterns, image-guided surgery.

• Challenges:

– Variety of tumor appearances, similarity to normal tissue.

– “You are never going to solve this problem.”

[Figure: example input MRI modalities and output segmentation]

Page 4:

Naïve Approach: Voxel-Level Classifier

• We could treat classifying a voxel as supervised learning:

• “Learn” a model that predicts yi given xi: the model can then classify new voxels.

• Advantage: we can apply machine learning, and ML is cool.

• Disadvantage: it doesn’t work at all.
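To make the voxel-level setup above concrete, here is a minimal sketch using scikit-learn's logistic regression on randomly generated stand-in data (the array sizes, feature count, and class set are hypothetical, not those of the actual MRI pipeline):

```python
# Minimal sketch of the naive voxel-level classifier: every voxel is an
# independent training example with its own feature vector and label.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_voxels, n_features = 10_000, 4                  # hypothetical sizes (e.g., 4 MRI modalities)
X = rng.normal(size=(n_voxels, n_features))       # stand-in for per-voxel intensities
y = rng.integers(0, 3, size=n_voxels)             # stand-in labels (e.g., background/normal/tumor)

model = LogisticRegression(max_iter=1000).fit(X, y)        # "learn" p(y_i | x_i)
y_new = model.predict(rng.normal(size=(5, n_features)))    # classify new voxels independently
```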

Page 5:

Naïve Approach: Voxel-Level Classifier

• Even in “nice” cases, significant overlap between tissues:

– Mixture of Gaussians and “outlier” class:

• Problems with naïve approach:

– Intensities not standardized.

– Location and texture matter.

Page 6:

Improvement 1: Intensity Standardization

• Want xi = <98,187,246> to mean the same thing in different places.

• Pre-processing to normalize intensities:

[Figures: intensity normalization within images, between slices, and between people]
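As a hedged illustration, one simple standardization scheme is to z-score each modality within the brain mask of each scan, so an intensity like <98,187,246> is on a comparable scale everywhere; the actual system may have used more sophisticated histogram-based normalization.

```python
# Illustrative standardization (not necessarily the method used in the course's system):
# z-score each modality's intensities within the brain mask of each scan.
import numpy as np

def standardize(image, mask):
    """image: (D, H, W, n_modalities) array; mask: boolean (D, H, W) brain mask."""
    out = image.astype(float).copy()
    for m in range(image.shape[-1]):
        vals = image[..., m][mask]
        out[..., m] = (image[..., m] - vals.mean()) / (vals.std() + 1e-8)
    return out
```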

Page 7:

Improvement 2: Template Alignment

• Location matters:

– Seeing xi = <98,187,246> in one area of the head is different than seeing it in other areas.

• Alignment to a standard coordinate system:

Page 8:

Improvement 2: Template Alignment

• Add spatial features that take into account location information:

[Figures: aligned input images; template images; priors for normal tissue locations; bilateral symmetry based on known axis]
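For illustration only, spatial features can be as simple as appending each voxel's coordinates in the aligned template space and the template's tissue priors at that location to the intensity features (a sketch with hypothetical array names, not the exact features used):

```python
# Sketch: augment per-voxel intensity features with spatial information
# obtained from the template alignment (coordinates and tissue priors).
import numpy as np

def add_spatial_features(intensities, coords, tissue_priors):
    """intensities: (n_voxels, n_modalities); coords: (n_voxels, 3) aligned coordinates
    scaled to [0, 1]; tissue_priors: (n_voxels, n_tissues) template probabilities."""
    return np.hstack([intensities, coords, tissue_priors])
```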

Page 9:

Improvement 3: Convolutions

• Use convolutions to incorporate neighborhood information.

– We used fixed convolutions; now you would try to learn them.
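A sketch of the fixed-convolution idea (the particular filters here, Gaussian smoothing at a few scales, are illustrative rather than the exact ones used):

```python
# Sketch: neighbourhood features from fixed convolutions, e.g. Gaussian
# smoothing of each modality at several scales, read back at each voxel.
import numpy as np
from scipy.ndimage import gaussian_filter

def convolution_features(image, sigmas=(1.0, 2.0, 4.0)):
    """image: (D, H, W, n_modalities); returns (D, H, W, n_modalities * len(sigmas))."""
    feats = [gaussian_filter(image[..., m], sigma=s)
             for m in range(image.shape[-1]) for s in sigmas]
    return np.stack(feats, axis=-1)
```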

Page 10:
Page 11:

Performance of Final System

Page 12:

Challenges

• The final system used a linear classifier, and typically worked well.

• But several ML challenges arose:

1. Time: 14 hours to train logistic regression on 10 images.

• Led to quasi-Newton, stochastic gradient, and SAG work.

2. Overfitting: using all features hurt, so we used manual feature selection.

• Led to regularization, L1-regularization, and structured sparsity work.

3. Relaxation: post-processing by filtering and hole-filling of labels.

• Led to conditional random fields, shape priors, and structure learning work.

Page 13:

Outline

• Motivation

• Conditional Random Fields Clean Up

• Latent/Deep Graphical Models

Page 14:

Multi-Class Logistic Regression: View 1

• Recall that multi-class logistic regression makes decisions using:

• Here, f(x) are features and we have a vector wy for each class ‘y’.

• Normally fit wy using regularized maximum likelihood assuming:

• This softmax function yields a differentiable and convex NLL.
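The two equations referred to above did not survive extraction; in the slide's notation the standard forms are:

$$\hat{y} = \operatorname*{argmax}_{y} \; w_y^\top f(x), \qquad p(y \mid x, w) = \frac{\exp(w_y^\top f(x))}{\sum_{y'} \exp(w_{y'}^\top f(x))}.$$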

Page 15:

Multi-Class Logistic Regression: View 2

• Recall that multi-class logistic regression makes decisions using:

• Claim: can be written using a single ‘w’ and features of ‘x’ and ‘y’.

• To do this, we can use the construction:
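A reconstruction of the standard construction (the slide's equation was lost): stack the per-class weight vectors into one long w, and place f(x) in the block selected by y,

$$w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_k \end{bmatrix}, \qquad F(x, y) = \begin{bmatrix} I(y = 1)\, f(x) \\ I(y = 2)\, f(x) \\ \vdots \\ I(y = k)\, f(x) \end{bmatrix}, \qquad \text{so that } w^\top F(x, y) = w_y^\top f(x).$$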

Page 16:

Multi-Class Logistic Regression: View 2

• So multi-class logistic regression with new notation uses:

• And the softmax probabilities are given by:

• View 2 gives extra flexibility in defining features:

– For example, we might have different features for class 1 than 2:

Page 17:

Multi-Class Logistic Regression for Segmentation

• In the brain tumor example, each xi contains the features for one voxel:

• But we want to label the whole image:

• Probability of segmentation Y given image X with independent model:
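Written out (reconstructing the formula that did not survive extraction), the independent model is:

$$p(Y \mid X, w) = \prod_{i=1}^{n} p(y_i \mid x_i, w) = \prod_{i=1}^{n} \frac{\exp(w^\top F(x_i, y_i))}{\sum_{y'} \exp(w^\top F(x_i, y'))}.$$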

Page 18:

Conditional Random Fields

• Unfortunately, independent model gives silly results:

• This model of p(Y|X,w) misses the “guilt by association”:

– Neighbouring voxels are likely to receive the same labels.

• The key ingredients of conditional random fields (CRFs):

– Define features of the entire image and labelling, F(X,Y):

– We can model dependencies using features that depend on multiple yi.
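Filling in the standard log-linear form (the slide's own equation was lost in extraction):

$$p(Y \mid X, w) = \frac{\exp(w^\top F(X, Y))}{Z(X, w)}, \qquad Z(X, w) = \sum_{Y'} \exp(w^\top F(X, Y')).$$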

Page 19:

Conditional Random Fields

• Interpretation of independent model as CRF:
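A reconstruction of this interpretation: if the features decompose over voxels as a sum of per-voxel features, the normalizer factorizes and we recover the independent model,

$$p(Y \mid X, w) = \frac{\exp\!\big(\sum_{i=1}^{n} w^\top F(X, y_i)\big)}{\sum_{Y'} \exp\!\big(\sum_{i=1}^{n} w^\top F(X, y_i')\big)} = \prod_{i=1}^{n} \frac{\exp(w^\top F(X, y_i))}{\sum_{y'} \exp(w^\top F(X, y'))}.$$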

Page 20:

Conditional Random Fields

• Example of modeling dependencies between neighbours as a CRF:

Page 21:

Conditional Random Fields for Segmentation

• Recall the performance with the independent classifier:

– Features of the form f(X,yi).

• Consider a CRF that also has pairwise features:

– Features f(X,yi,yj) for all (i,j) corresponding to adjacent voxels.

– Model “guilt by association”:
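One standard way to write such a pairwise CRF (a reconstruction; the exact parameterization on the slide may differ), with E the set of adjacent voxel pairs:

$$p(Y \mid X, w, v) \propto \exp\!\Big(\sum_{i} w^\top f(X, y_i) + \sum_{(i,j) \in E} v^\top f(X, y_i, y_j)\Big).$$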

Page 22:

Conditional Random Fields as Graphical Models

• Seems great: we can now model dependencies in the labels.

– Why not model three-way interactions with F(X,yi,yj,yk)?

– How about adding things like shape priors F(X,Yr) for some region ‘r’?

• Challenge is that inference and decoding can become hard.

• We can view CRFs as undirected graphical models:

• If the graph is too complicated (and we don’t have special ‘F’):

– Intractable since we need inference (computing Z/marginals) for training.

Page 23:

Overview of Exact Methods for Graphical Models

• We can do exact decoding/inference/sampling for:

– Small number of variables (enumeration).

– Chains (Viterbi, forward-backward, forward-filter backward-sample); see the sketch after this list.

– Trees (belief propagation).

– Low treewidth graphs (junction tree).

• Other cases where exact computation is possible:

– Semi-Markov chains (allow dependence on time you spend in a state).

– Context-free grammars (allow potentials on recursively-nested parts of the sequence).

– Binary labels (k = 2) and “attractive” potentials (exact decoding via graph cuts).

– Sum-product networks (restrict potentials to allow exact computation).
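For the chain case mentioned above, here is a minimal sketch of Viterbi decoding (an illustrative re-implementation, not the course's UGM code; log_phi and log_psi are assumed log node and edge potentials):

```python
# Viterbi decoding for a chain UGM with K states, node potentials phi[i, s]
# and a shared edge potential psi[s, t], all given in log space.
import numpy as np

def viterbi_decode(log_phi, log_psi):
    """log_phi: (n, K) log node potentials; log_psi: (K, K) log edge potentials.
    Returns the labelling y maximizing sum_i log_phi[i, y_i] + sum_i log_psi[y_{i-1}, y_i]."""
    n, K = log_phi.shape
    best = np.zeros((n, K))              # best[i, s]: best score of any prefix ending in state s
    back = np.zeros((n, K), dtype=int)   # backpointers
    best[0] = log_phi[0]
    for i in range(1, n):
        scores = best[i - 1][:, None] + log_psi + log_phi[i][None, :]  # (prev state, current state)
        back[i] = np.argmax(scores, axis=0)
        best[i] = np.max(scores, axis=0)
    # Backtrack from the best final state.
    y = np.zeros(n, dtype=int)
    y[-1] = np.argmax(best[-1])
    for i in range(n - 1, 0, -1):
        y[i - 1] = back[i, y[i]]
    return y

# Tiny usage example: 4 positions, 2 states, "attractive" edge potentials.
rng = np.random.default_rng(0)
log_phi = rng.normal(size=(4, 2))
log_psi = np.log(np.array([[2.0, 1.0], [1.0, 2.0]]))
print(viterbi_decode(log_phi, log_psi))
```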

Page 24:

Overview of Approximate Methods for Graphical Models

• Approximate decoding with local search:

– Coordinate descent is called iterated conditional mode (ICM); a sketch follows this list.

• Approximate sampling with MCMC:

– We saw Gibbs sampling last week.

• Approximate inference with variational methods:

– Mean field, loopy belief propagation, tree-reweighted belief propagation.

• Approximate decoding with convex relaxations:

– Linear programming approximation.

• Block versions of all of the above:

– A variant is alpha-expansions: block moves involving entire classes.
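A minimal sketch of ICM, the coordinate-descent decoder mentioned above (illustrative; it assumes a pairwise model with a symmetric shared edge potential and a hypothetical adjacency list):

```python
# Iterated conditional mode (ICM) for a pairwise UGM: repeatedly pick, for
# each node in turn, the label maximizing its local score with neighbours fixed.
import numpy as np

def icm(log_phi, log_psi, neighbors, n_sweeps=10):
    """log_phi: (n, K) log node potentials; log_psi: (K, K) symmetric log edge potential;
    neighbors: list of lists, neighbors[i] = indices adjacent to node i."""
    n, K = log_phi.shape
    y = np.argmax(log_phi, axis=1)           # start from the independent decoding
    for _ in range(n_sweeps):
        changed = False
        for i in range(n):
            scores = log_phi[i].copy()        # local score of each state, neighbours fixed
            for j in neighbors[i]:
                scores += log_psi[:, y[j]]
            s = int(np.argmax(scores))
            if s != y[i]:
                y[i], changed = s, True
        if not changed:                       # local optimum reached
            break
    return y
```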

Page 25:

Overview of Methods for Fitting Graphical Models

• If inference is intractable, there are some alternatives for learning:

– Variational inference to approximate Z and marginals.

– Pseudo-likelihood: a fast and cheap convex approximation for learning (written out below).

– Structured SVMs: generalization of SVMs that only requires decoding.

– Younes’ algorithm: alternate between Gibbs sampling and parameter updates.

• Also known as “persistent contrastive divergence”.
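The pseudo-likelihood referred to above replaces the conditional likelihood with a product of full conditionals, each needing only a local normalization rather than the global Z:

$$\tilde{p}(Y \mid X, w) = \prod_{i=1}^{n} p(y_i \mid Y_{-i}, X, w).$$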

• For more details on graphical models, see:

– UGM software: http://www.cs.ubc.ca/~schmidtm/Software/UGM.html

– MLRG PGM crash course: http://www.cs.ubc.ca/labs/lci/mlrg

Page 26:

Independent Logistic Regression

[Diagram: nodes x1–x5, h3, y3]

Page 27:

Independent Logistic Regression

[Diagram: nodes xi, hi, yi for i = 1,…,5]

Page 28:

Conditional Random Field (CRF)

[Diagram: nodes xi, hi, yi for i = 1,…,5]

Page 29:

Conditional Random Field (CRF)

[Diagram: nodes xi, hi, yi for i = 1,…,5]

Page 30:

Neural Network

[Diagram: nodes xi, hi, yi for i = 1,…,5]

Page 31:

Deep Learning

[Diagram: nodes xi, hi, gi for i = 1,…,5, and labels y1–y5]

Page 32:

Conditional Neural Field (CNF)

[Diagram: nodes xi, hi, gi for i = 1,…,5, and labels y1–y5]

Page 33:

Outline

• Motivation

• Conditional Random Fields Clean Up

• Latent/Deep Graphical Models

Page 34:

Motivation: Gesture Recognition

• Want to recognize gestures from video:

• A gesture is composed of a sequence of parts:

– Some parts appear in different gestures.

• We have gesture (sequence) labels:

– But no part labels.

– We don’t know what the parts should be.

http://groups.csail.mit.edu/vision/vip/papers/wang06cvpr.pdf

Page 35:

Hidden Markov Model (HMM)

[Diagram: nodes xi, hi for i = 1,…,5]

Page 36:

Generative HMM Classifier

[Diagram: nodes xi, hi for i = 1,…,5, and a single label y]

Page 37:

Conditional Random Field (CRF)

[Diagram: nodes xi, hi for i = 1,…,5]

Page 38:

Hidden Conditional Random Field (HCRF)

[Diagram: nodes xi, hi for i = 1,…,5, and a single label y]

Page 39:

Graphical Models with Hidden Variables

• As before we deal with hidden variables by marginalizing:

• If we assume a UGM over {Y,H} given X we get:

• Consider the usual choice of log-linear potentials φ:

– NLL = -log(Z(Y)) + log(Z).
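Writing out the marginalization under this log-linear choice:

$$p(Y \mid X, w) = \sum_{H} p(Y, H \mid X, w) = \frac{\sum_{H} \exp(w^\top F(X, Y, H))}{\sum_{Y', H'} \exp(w^\top F(X, Y', H'))} = \frac{Z(Y)}{Z},$$

so the NLL is -log(Z(Y)) + log(Z), a difference of two log-normalizers, each computable with the same inference methods as before.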

Page 40:

Motivation: Gesture Recognition

• What if we want to label video with multiple potential gestures?

– Assume we have labeled video sequences.

http://groups.csail.mit.edu/vision/vip/papers/morency_cvpr07.pdf

Page 41:

Latent-Dynamic Conditional Random Field (LDCRF)

[Diagram: nodes xi, hi for i = 1,…,5, and labels y1–y5]

Page 42:

Latent-Dynamic Conditional Neural Field (LDCNF)

[Diagram: nodes xi, hi, gi for i = 1,…,5, and labels y1–y5]

Page 43:

Summary

• Conditional random fields generalize logistic regression:

– Allow dependencies between labels.

– Requires inference in graphical model.

• Conditional neural fields combine CRFs with deep learning.

– Could also replace CRF with conditional density estimators (e.g., DAGs).

• UGMs with hidden variables have a nice form: a ratio of normalizers.

– Can do inference with the same methods.

• Latent-dynamic conditional random/neural fields:

– Allow dependencies between hidden variables.

• Next time: Boltzmann machines, LSTMs, and beyond CPSC 540.