
Page 1: An introduction to machine learning and probabilistic


An introduction to machine learning and probabilistic

graphical models

Kevin Murphy

MIT AI Lab

Presented at Intel’s workshop on “Machine learning for the life sciences”, Berkeley, CA, 3 November 2003

Page 2: An introduction to machine learning and probabilistic

2

Overview

• Supervised learning
• Unsupervised learning
• Graphical models
• Learning relational models

Thanks to Nir Friedman, Stuart Russell, Leslie Kaelbling and various web sources for letting me use many of their slides

Page 3: An introduction to machine learning and probabilistic

3

Supervised learning

[Figure: example objects labeled “yes” and “no”]

Color  Shape   Size   Output
Blue   Torus   Big    Y
Blue   Square  Small  Y
Blue   Star    Small  Y
Red    Arrow   Small  N

Learn to approximate the function F(x1, x2, x3) -> t from a training set of (x, t) pairs.

Page 4: An introduction to machine learning and probabilistic

4

Supervised learning

Training data:

X1  X2  X3  | T
B   T   B   | Y
B   S   S   | Y
B   S   S   | Y
R   A   S   | N

Testing data:

X1  X2  X3  | T
B   A   S   | ?
Y   C   S   | ?

Training data -> Learner -> Hypothesis
Hypothesis + Testing data -> Prediction: T = Y, N

Page 5: An introduction to machine learning and probabilistic

5

Key issue: generalization

[Figure: new objects to classify as “yes” or “no”]

Can’t just memorize the training set (overfitting)

Page 6: An introduction to machine learning and probabilistic

6

Hypothesis spaces

• Decision trees
• Neural networks
• K-nearest neighbors
• Naïve Bayes classifier
• Support vector machines (SVMs)
• Boosted decision stumps
• …

Page 7: An introduction to machine learning and probabilistic

7

Perceptron (neural net with no hidden layers)

Linearly separable data

Page 8: An introduction to machine learning and probabilistic

8

Which separating hyperplane?

Page 9: An introduction to machine learning and probabilistic

9

The linear separator with the largest margin is the best one to pick

margin

Page 10: An introduction to machine learning and probabilistic

10

What if the data is not linearly separable?

Page 11: An introduction to machine learning and probabilistic

11

Kernel trick

The kernel corresponds to the feature map (x, y) -> (z1, z2, z3) = (x², √2·xy, y²).

The kernel implicitly maps from 2D to 3D, making the problem linearly separable.
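To make the implicit map concrete, here is a minimal NumPy sketch (the specific map and test points are illustrative assumptions): for the quadratic kernel, the value computed directly in 2D matches the ordinary dot product after the explicit 2D-to-3D map.

```python
import numpy as np

def phi(p):
    """Explicit 2D -> 3D feature map (x, y) -> (x^2, sqrt(2)*x*y, y^2)."""
    x, y = p
    return np.array([x**2, np.sqrt(2) * x * y, y**2])

def quad_kernel(p, q):
    """Quadratic kernel evaluated directly in the original 2D space."""
    return np.dot(p, q) ** 2

# The kernel equals the dot product in the implicit 3D feature space
p, q = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(np.dot(phi(p), phi(q)), quad_kernel(p, q))
```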

Page 12: An introduction to machine learning and probabilistic

12

Support Vector Machines (SVMs)

Two key ideas:
• Large margins
• Kernel trick

Page 13: An introduction to machine learning and probabilistic

13

Boosting

Simple classifiers (weak learners) can have their performance boosted by taking weighted combinations

Boosting maximizes the margin
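As a rough illustration of “weighted combinations of weak learners”, here is a minimal from-scratch AdaBoost-style sketch with decision stumps (the exhaustive stump search, reweighting scheme, and toy data are standard AdaBoost choices of my own, not something specified on the slide).

```python
import numpy as np

def train_adaboost(X, y, n_rounds=20):
    """Minimal AdaBoost with decision stumps; labels y must be in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                  # weight on each training example
    ensemble = []                            # list of (feature, threshold, sign, alpha)
    for _ in range(n_rounds):
        best = None
        # Weak learner: exhaustively pick the stump with lowest weighted error
        for j in range(d):
            for t in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = s * np.where(X[:, j] <= t, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, s)
        err, j, t, s = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)        # vote weight of this weak learner
        pred = s * np.where(X[:, j] <= t, 1, -1)
        w *= np.exp(-alpha * y * pred)               # up-weight misclassified examples
        w /= w.sum()
        ensemble.append((j, t, s, alpha))
    return ensemble

def predict_adaboost(ensemble, X):
    """Sign of the weighted combination of stump predictions."""
    score = sum(a * s * np.where(X[:, j] <= t, 1, -1) for j, t, s, a in ensemble)
    return np.sign(score)

# Toy usage: one informative feature, one noise feature
X = np.random.randn(200, 2)
y = np.where(X[:, 0] > 0, 1, -1)
model = train_adaboost(X, y, n_rounds=10)
```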

Page 14: An introduction to machine learning and probabilistic

14

Supervised learning success stories

• Face detection
• Steering an autonomous car across the US
• Detecting credit card fraud
• Medical diagnosis
• …

Page 15: An introduction to machine learning and probabilistic

15

Unsupervised learning

What if there are no output labels?

Page 16: An introduction to machine learning and probabilistic

16

K-means clustering

1. Guess the number of clusters, K
2. Guess initial cluster centers, μ1, μ2
3. Assign data points xi to the nearest cluster center
4. Re-compute cluster centers based on the assignments

Reiterate steps 3-4 until the assignments stop changing.
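A minimal NumPy sketch of steps 1-4 above (the initialization from random data points and the convergence test are illustrative choices; it also assumes no cluster goes empty).

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: assign points to the nearest center, then recompute centers."""
    rng = np.random.default_rng(seed)
    # Step 2: guess initial cluster centers by picking K random data points
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point x_i to the nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):
            break  # assignments have stabilized
        centers = new_centers
    return centers, labels

# Toy usage: two well-separated blobs, K guessed to be 2
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, K=2)
```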

Page 17: An introduction to machine learning and probabilistic

17

AutoClass (Cheeseman et al, 1986)

• EM algorithm for mixtures of Gaussians
• “Soft” version of K-means
• Uses a Bayesian criterion to select K
• Discovered new types of stars from spectral data
• Discovered new classes of proteins and introns from DNA/protein sequence databases

Page 18: An introduction to machine learning and probabilistic

18

Hierarchical clustering

Page 19: An introduction to machine learning and probabilistic

19

Principal Component Analysis (PCA)

PCA seeks a projection that best represents the data in a least-squares sense.

PCA reduces the dimensionality of feature space by restricting attention to those directions along which the scatter of the cloud is greatest.
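A minimal NumPy sketch of that idea: center the cloud, take the top-k right singular vectors (the directions of greatest scatter), and project onto them. The function name and the SVD route are illustrative choices.

```python
import numpy as np

def pca(X, k):
    """Project data onto the k directions of greatest variance (least-squares optimal)."""
    Xc = X - X.mean(axis=0)                      # center the cloud
    # Right singular vectors = principal directions, ordered by singular value
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    return Xc @ components.T, components         # low-dimensional coordinates, directions

# Toy usage: project 5-D data down to 2-D
X = np.random.randn(100, 5)
Z, dirs = pca(X, k=2)
```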

Page 20: An introduction to machine learning and probabilistic

20

Discovering nonlinear manifolds

Page 21: An introduction to machine learning and probabilistic

21

Combining supervised and unsupervised learning

Page 22: An introduction to machine learning and probabilistic

22

Discovering rules (data mining)

Occup.   Income  Educ.  Sex  Married  Age
Student  $10k    MA     M    S        22
Student  $20k    PhD    F    S        24
Doctor   $80k    MD     M    M        30
Retired  $30k    HS     F    M        60

Find the most frequent patterns (association rules)

Num in household = 1 ^ num children = 0 => language = English

Language = English ^ Income < $40k ^ Married = false ^ num children = 0 => education ∈ {college, grad school}

Page 23: An introduction to machine learning and probabilistic

23

Unsupervised learning: summary

• Clustering
• Hierarchical clustering
• Linear dimensionality reduction (PCA)
• Non-linear dimensionality reduction
• Learning rules

Page 24: An introduction to machine learning and probabilistic

24

Discovering networks


From data visualization to causal discovery

Page 25: An introduction to machine learning and probabilistic

25

Networks in biology

Most processes in the cell are controlled by networks of interacting molecules:

• Metabolic networks
• Signal transduction networks
• Regulatory networks

Networks can be modeled at multiple levels of detail/realism (in decreasing detail):

• Molecular level
• Concentration level
• Qualitative level

Page 26: An introduction to machine learning and probabilistic

26

Molecular level: Lysis-Lysogeny circuit in Lambda phage

Arkin et al. (1998), Genetics 149(4):1633-48

5 genes, 67 parameters based on 50 years of research. Stochastic simulation required a supercomputer.

Page 27: An introduction to machine learning and probabilistic

27

Concentration level: metabolic pathways

Usually modeled with differential equations

[Figure: a small gene network g1-g5 with interaction weights such as w12, w23, w55]

Page 28: An introduction to machine learning and probabilistic

28

Qualitative level: Boolean Networks

Page 29: An introduction to machine learning and probabilistic

29

Probabilistic graphical models

• Supports graph-based modeling at various levels of detail
• Models can be learned from noisy, partial data
• Can model “inherently” stochastic phenomena, e.g., molecular-level fluctuations…
• But can also model deterministic, causal processes

"The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful. Therefore the true logic for this world is the calculus of probabilities." -- James Clerk Maxwell

"Probability theory is nothing but common sense reduced to calculation." -- Pierre Simon Laplace

Page 30: An introduction to machine learning and probabilistic

30

Graphical models: outline

• What are graphical models?
• Inference
• Structure learning

Page 31: An introduction to machine learning and probabilistic

31

Simple probabilistic model: linear regression

Y = α + βX + noise

Deterministic (functional) relationship

Page 32: An introduction to machine learning and probabilistic

32

Simple probabilistic model: linear regression

Y = α + βX + noise

“Learning” = estimating the parameters α, β, σ from (x, y) pairs.

• α, β can be estimated by least squares (the fitted line passes through the empirical mean)
• σ² is the residual variance
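A minimal NumPy sketch of this estimation step, using the standard least-squares formulas (the exact estimators on the original slide are not fully legible, so these are the usual ones):

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares estimates of alpha, beta, and the residual variance sigma^2
    for the model Y = alpha + beta*X + noise."""
    beta = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # slope
    alpha = y.mean() - beta * x.mean()                  # intercept
    resid = y - (alpha + beta * x)
    return alpha, beta, np.var(resid)                   # residual variance estimates sigma^2

# Toy usage
x = np.linspace(0, 1, 100)
y = 2.0 + 3.0 * x + 0.1 * np.random.randn(100)
print(fit_linear(x, y))
```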

Page 33: An introduction to machine learning and probabilistic

33

Piecewise linear regression

Latent “switch” variable – hidden process at work

Page 34: An introduction to machine learning and probabilistic

34

Probabilistic graphical model for piecewise linear regression

[Graphical model: input X and hidden switch Q are parents of output Y; Q depends on X]

• Hidden variable Q chooses which set of parameters to use for predicting Y.
• The value of Q depends on the value of the input X.
• This is an example of a “mixture of experts”.

Learning is harder because Q is hidden, so we don’t know which data points to assign to each line; this can be solved with EM (c.f. K-means).

Page 35: An introduction to machine learning and probabilistic

35

Classes of graphical models

Probabilistic models ⊃ graphical models
• Directed: Bayes nets, DBNs
• Undirected: MRFs

Page 36: An introduction to machine learning and probabilistic

36

Bayesian Networks

Qualitative part: a directed acyclic graph (DAG)
• Nodes - random variables
• Edges - direct influence

Quantitative part: a set of conditional probability distributions

Example CPT, P(A | E, B), for the family of the Alarm node:

E   B   | P(a)  P(¬a)
e   b   | 0.9   0.1
e   ¬b  | 0.2   0.8
¬e  b   | 0.9   0.1
¬e  ¬b  | 0.01  0.99

Network nodes: Earthquake, Burglary, Alarm, Radio, Call (Earthquake and Burglary cause Alarm, Earthquake causes the Radio report, Alarm causes the Call)

Compact representation of probability distributions via conditional independence.

Together they define a unique distribution in factored form:

P(B, E, A, C, R) = P(B) P(E) P(A | B, E) P(R | E) P(C | A)
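A minimal sketch of that factored form as code. The P(A | B, E) entries follow the CPT above; the remaining numbers are made-up placeholders, and the variable ordering is just for illustration.

```python
from itertools import product

# Each CPT maps parent values to P(variable = True).
P_B = 0.01                                   # P(Burglary) -- placeholder value
P_E = 0.02                                   # P(Earthquake) -- placeholder value
P_A = {(True, True): 0.9, (True, False): 0.9,
       (False, True): 0.2, (False, False): 0.01}   # P(Alarm | B, E), from the CPT above
P_R = {True: 0.9, False: 0.01}               # P(Radio | E) -- placeholder values
P_C = {True: 0.8, False: 0.05}               # P(Call | A) -- placeholder values

def bernoulli(p, val):
    return p if val else 1 - p

def joint(b, e, a, c, r):
    """P(B,E,A,C,R) = P(B) P(E) P(A|B,E) P(R|E) P(C|A)."""
    return (bernoulli(P_B, b) * bernoulli(P_E, e) *
            bernoulli(P_A[(b, e)], a) *
            bernoulli(P_R[e], r) * bernoulli(P_C[a], c))

# Sanity check: the factored joint sums to 1 over all 2^5 assignments
assert abs(sum(joint(*v) for v in product([True, False], repeat=5)) - 1) < 1e-9
```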

Page 37: An introduction to machine learning and probabilistic

37

Example: “ICU Alarm” network

Domain: monitoring intensive-care patients
• 37 variables
• 509 parameters …instead of 2^54

[Figure: the 37-node ICU Alarm network; nodes include MINVOLSET, VENTMACH, DISCONNECT, INTUBATION, PULMEMBOLUS, SAO2, CATECHOL, LVFAILURE, HYPOVOLEMIA, CVP, BP, and others]

Page 38: An introduction to machine learning and probabilistic

38

Success stories for graphical models

• Multiple sequence alignment
• Forensic analysis
• Medical and fault diagnosis
• Speech recognition
• Visual tracking
• Channel coding at the Shannon limit
• Genetic pedigree analysis
• …

Page 39: An introduction to machine learning and probabilistic

39

Graphical models: outline

• What are graphical models? ✓
• Inference
• Structure learning

Page 40: An introduction to machine learning and probabilistic

40

Probabilistic Inference

Posterior probabilities: the probability of any event given any evidence, P(X | E)

[Figure: the Earthquake/Burglary/Alarm network, with Radio and Call as the observed evidence]

Page 41: An introduction to machine learning and probabilistic

41

Viterbi decoding

[Figure: a Hidden Markov Model (HMM) with a chain of hidden states and corresponding observations, e.g., for the spoken word “Tomato”]

Compute the most probable explanation (MPE) of the observed data.
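A minimal NumPy sketch of Viterbi decoding for a discrete HMM (the array shapes and parameter names are my own conventions; probabilities are multiplied directly, so a real implementation would work in log space):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most probable hidden state sequence (MPE) for an HMM.
    pi[i]: initial prob, A[i, j]: transition i->j, B[i, o]: prob of emitting o in state i."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))               # best score of any path ending in state j at time t
    back = np.zeros((T, N), dtype=int)     # argmax predecessor for traceback
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A          # scores[i, j] = delta[t-1, i] * A[i, j]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Trace back the best path from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

# Toy usage: 2 states, 2 observation symbols
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 1, 1], pi, A, B))
```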

Page 42: An introduction to machine learning and probabilistic

42

Inference: computational issues

[Figure: the ICU Alarm network]

Easy: chains, trees
Hard: grids; dense, loopy graphs

Page 43: An introduction to machine learning and probabilistic

43

Inference: computational issues

[The ICU Alarm figure and the easy/hard comparison are repeated from the previous slide.]

Many different inference algorithms exist, both exact and approximate.

Page 44: An introduction to machine learning and probabilistic

44

Bayesian inference

Bayesian probability treats parameters as random variables

Learning / parameter estimation is replaced by probabilistic inference of P(θ | D)

Example: Bayesian linear regression; the parameters are θ = (α, β, σ)

[Figure: θ is a shared parent of every (Xi, Yi) pair, i = 1, …, n]

Parameters are tied (shared) across repetitions of the data.

Page 45: An introduction to machine learning and probabilistic

45

Bayesian inference

+ Elegant – no distinction between parameters and other hidden variables

+ Can use priors to learn from small data sets (c.f., one-shot learning by humans)

- Math can get hairy
- Often computationally intractable

Page 46: An introduction to machine learning and probabilistic

46

Graphical models: outline

• What are graphical models? ✓
• Inference ✓
• Structure learning

Page 47: An introduction to machine learning and probabilistic

47

Why Struggle for Accurate Structure?

Adding an arc:
• Increases the number of parameters to be estimated
• Wrong assumptions about domain structure

Missing an arc:
• Cannot be compensated for by fitting parameters
• Wrong assumptions about domain structure

[Figure: the Earthquake / Alarm Set / Burglary / Sound network, shown with an added arc and with a missing arc]

Page 48: An introduction to machine learning and probabilistic

48

Score based Learning

Data over E, B, A: <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y>

[Figure: several candidate structures over E, B, A]

• Define a scoring function that evaluates how well a structure matches the data
• Search for a structure that maximizes the score

Page 49: An introduction to machine learning and probabilistic

49

Learning Trees

Can find optimal tree structure in O(n² log n) time: just find the max-weight spanning tree

If some of the variables are hidden, problem becomes hard again, but can use EM to fit mixtures of trees
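This is the Chow-Liu construction: weight each pair of variables by their empirical mutual information and take a maximum-weight spanning tree. A minimal sketch, assuming discrete data in a NumPy array and using networkx for the spanning tree:

```python
import numpy as np
import networkx as nx

def mutual_info(x, y):
    """Empirical mutual information between two discrete columns."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            p_ab = np.mean((x == a) & (y == b))
            p_a, p_b = np.mean(x == a), np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(data):
    """Max-weight spanning tree with edges weighted by pairwise mutual information."""
    d = data.shape[1]
    G = nx.Graph()
    for i in range(d):
        for j in range(i + 1, d):
            G.add_edge(i, j, weight=mutual_info(data[:, i], data[:, j]))
    return nx.maximum_spanning_tree(G)

# Toy usage: 4 binary variables, 500 samples
data = np.random.randint(0, 2, size=(500, 4))
print(sorted(chow_liu_tree(data).edges()))
```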

Page 50: An introduction to machine learning and probabilistic

50

Heuristic Search

Learning arbitrary graph structure is NP-hard, so it is common to resort to heuristic search.

Define a search space:
• search states are possible structures
• operators make small changes to structure

Traverse the space looking for high-scoring structures. Search techniques:
• Greedy hill-climbing
• Best-first search
• Simulated annealing
• ...
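A minimal sketch of greedy hill-climbing with add/delete/reverse edge operators (the `score` argument is a placeholder for whatever scoring function is chosen, e.g., BIC; the acyclicity check and stopping rule are my own simple choices, and at least two nodes are assumed).

```python
import itertools

def is_acyclic(nodes, edges):
    """Kahn-style check that the directed graph has no cycles."""
    indeg = {v: 0 for v in nodes}
    for _, v in edges:
        indeg[v] += 1
    frontier = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while frontier:
        u = frontier.pop()
        seen += 1
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    frontier.append(b)
    return seen == len(nodes)

def hill_climb(nodes, score, max_steps=100):
    """Greedy hill-climbing over DAG structures, starting from the empty graph."""
    edges = set()
    for _ in range(max_steps):
        best = score(edges)
        candidates = []
        for u, v in itertools.permutations(nodes, 2):
            if (u, v) in edges:
                candidates.append(edges - {(u, v)})                  # delete an arc
                candidates.append(edges - {(u, v)} | {(v, u)})       # reverse an arc
            else:
                candidates.append(edges | {(u, v)})                  # add an arc
        candidates = [c for c in candidates if is_acyclic(nodes, c)]
        winner = max(candidates, key=score)
        if score(winner) <= best:
            break                                                    # local optimum reached
        edges = winner
    return edges

# Toy usage with a dummy score that prefers fewer edges (a real score would use data)
print(hill_climb(["E", "B", "A"], score=lambda edges: -len(edges)))
```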

Page 51: An introduction to machine learning and probabilistic

51

Local Search Operations

Typical operations on a network over S, C, E, D:
• Add an arc (e.g., add C → D)
• Delete an arc (e.g., delete C → E)
• Reverse an arc (e.g., reverse C → E)

[Figure: the network S, C, E, D and the result of each operation]

Change in score from adding C → D: Δscore = S({C,E} → D) - S({E} → D)

Page 52: An introduction to machine learning and probabilistic

52

Problems with local search

[Figure: the score landscape S(G | D); a greedy search (“you”) can stop at a local optimum far from the “truth”]

Easy to get stuck in local optima.

Page 53: An introduction to machine learning and probabilistic

53

Problems with local search II

[Figure: the posterior P(G | D) spread over several candidate networks on E, R, B, A, C]

Picking a single best model can be misleading.

Page 54: An introduction to machine learning and probabilistic

54

Problems with local search II

• Small sample size ⇒ many high-scoring models
• An answer based on one model is often useless
• Want features common to many models

[Figure: several high-scoring networks over E, R, B, A, C under the posterior P(G | D)]

Picking a single best model can be misleading.

Page 55: An introduction to machine learning and probabilistic

55

Bayesian Approach to Structure Learning

• Posterior distribution over structures
• Estimate the probability of features: edge X → Y, path X → … → Y, …

P(f | D) = Σ_G f(G) P(G | D)

where f(G) is the indicator function for the feature f (e.g., X → Y) and P(G | D) is the Bayesian score for G.

Page 56: An introduction to machine learning and probabilistic

56

Bayesian approach: computational issues

Posterior distribution over structures:

P(f | D) = Σ_G f(G) P(G | D)

How can we compute a sum over a super-exponential number of graphs?
• MCMC over networks
• MCMC over node-orderings (Rao-Blackwellisation)

Page 57: An introduction to machine learning and probabilistic

57

Structure learning: other issues

• Discovering latent variables
• Learning causal models
• Learning from interventional data
• Active learning

Page 58: An introduction to machine learning and probabilistic

58

Discovering latent variables

[Figure: (a) a network with a latent variable, 17 parameters; (b) the same domain with the latent variable marginalized out, 59 parameters]

There are some techniques for automatically detecting the possible presence of latent variables.

Page 59: An introduction to machine learning and probabilistic

59

Learning causal models

So far, we have only assumed that X -> Y -> Z means that Z is independent of X given Y.

However, we often want to interpret directed arrows causally.

This is uncontroversial for the arrow of time. But can we infer causality from static observational data?

Page 60: An introduction to machine learning and probabilistic

60

Learning causal models

We can infer causality from static observational data if we have at least four measured variables and certain “tetrad” conditions hold.

See books by Pearl and Spirtes et al. However, we can only learn up to Markov equivalence, no matter how much data we have.

[Figure: X → Y → Z, X ← Y ← Z, and X ← Y → Z are Markov equivalent; the v-structure X → Y ← Z is not]

Page 61: An introduction to machine learning and probabilistic

61

Learning from interventional data

The only way to distinguish between Markov equivalent networks is to perform interventions, e.g., gene knockouts.

We need to (slightly) modify our learning algorithms.

[Figure: smoking → yellow fingers]

P(smoker | observe(yellow)) >> prior
P(smoker | do(paint yellow)) = prior

Cut arcs coming into nodes which were set by intervention.

Page 62: An introduction to machine learning and probabilistic

62

Active learning

Which experiments (interventions) should we perform to learn structure as efficiently as possible?

This problem can be modeled using decision theory.

Exact solutions are wildly computationally intractable.

Can we come up with good approximate decision making techniques?

Can we implement hardware to automatically perform the experiments?

“AB: Automated Biologist”

Page 63: An introduction to machine learning and probabilistic

63

Learning from relational data

Can we learn concepts from a set of relations between objects, instead of / in addition to just their attributes?

Page 64: An introduction to machine learning and probabilistic

64

Learning from relational data: approaches

Probabilistic relational models (PRMs)
• Reify a relationship (arc) between nodes (objects) by making it into a node (hypergraph)

Inductive Logic Programming (ILP)
• Top-down, e.g., FOIL (a generalization of C4.5)
• Bottom-up, e.g., PROGOL (inverse deduction)

Page 65: An introduction to machine learning and probabilistic

65

ILP for learning protein folding: input

[Figure: positive (“yes”) and negative (“no”) example protein structures]

TotalLength(D2mhr, 118) ^ NumberHelices(D2mhr, 6) ^ …

100 conjuncts describing the structure of each positive/negative example

Page 66: An introduction to machine learning and probabilistic

66

ILP for learning protein folding: results

PROGOL learned the following rule to predict if a protein will form a “four-helical up-and-down bundle”:

In English: “The protein P folds if it contains a long helix h1 at a secondary structure position between 1 and 3 and h1 is next to a second helix”

Page 67: An introduction to machine learning and probabilistic

67

ILP: Pros and Cons

+ Can discover new predicates (concepts) automatically

+ Can learn relational models from relational (or flat) data

- Computationally intractable
- Poor handling of noise

Page 68: An introduction to machine learning and probabilistic

68

The future of machine learning for bioinformatics?

Oracle

Page 69: An introduction to machine learning and probabilistic

69

[Figure: a learner that combines prior knowledge, replicated experiments, and the biological literature to produce hypotheses and experiment designs, tested against the real world]

The future of machine learning for bioinformatics

•“Computer assisted pathway refinement”

Page 70: An introduction to machine learning and probabilistic

70

The end

Page 71: An introduction to machine learning and probabilistic

71

Decision trees

[Figure: a decision tree that tests blue?, big?, and oval?, with yes/no leaves]

Page 72: An introduction to machine learning and probabilistic

72

Decision trees

[Figure: the same decision tree]

+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
+ Easy to understand
- Predictive power

Page 73: An introduction to machine learning and probabilistic

73

Feedforward neural network

Each unit computes f(Σ_i w_i x_i), with the sigmoid f(x) = 1 / (1 + e^(-c x)).

[Figure: input layer → hidden layer → output]

• Weights on each arc
• Sigmoid function at each node
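A minimal NumPy sketch of the forward pass just described: weighted sums along the arcs and a sigmoid at each node, with one hidden layer (the layer sizes and random weights are illustrative).

```python
import numpy as np

def sigmoid(x, c=1.0):
    """Squashing function f(x) = 1 / (1 + e^{-c x}) applied at each node."""
    return 1.0 / (1.0 + np.exp(-c * x))

def forward(x, W1, b1, W2, b2):
    """One hidden layer: weighted sums along the arcs, sigmoid at each node."""
    h = sigmoid(W1 @ x + b1)        # hidden layer activations
    return sigmoid(W2 @ h + b2)     # network output

# Toy usage with random weights: 3 inputs, 4 hidden units, 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
y = forward(np.array([0.5, -1.0, 2.0]), W1, b1, W2, b2)
```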

Page 74: An introduction to machine learning and probabilistic

74

Feedforward neural network

[Figure: input layer → hidden layer → output]

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power

Page 75: An introduction to machine learning and probabilistic

75

Nearest Neighbor

• Remember all your data
• When someone asks a question, find the nearest old data point and return the answer associated with it
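A minimal NumPy sketch of exactly that procedure, assuming Euclidean distance (any distance measure would do):

```python
import numpy as np

def nearest_neighbor_predict(X_train, t_train, x_query):
    """Return the stored label of the training point closest to the query."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # distance to every remembered point
    return t_train[dists.argmin()]

# Toy usage: two remembered points with labels "yes" / "no"
X_train = np.array([[0.0, 0.0], [5.0, 5.0]])
t_train = np.array(["yes", "no"])
print(nearest_neighbor_predict(X_train, t_train, np.array([1.0, 0.5])))  # -> "yes"
```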

Page 76: An introduction to machine learning and probabilistic

76

Nearest Neighbor

[Figure: a query point “?” among the stored labeled points]

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power

Page 77: An introduction to machine learning and probabilistic

77

Support Vector Machines (SVMs)

Two key ideas:
• Large margins are good
• Kernel trick

Page 78: An introduction to machine learning and probabilistic

78

SVM: mathematical details

Training data: {x_i, y_i}, where each x_i ∈ R^l is an l-dimensional vector and y_i ∈ {-1, 1} is its true/false flag

Separating hyperplane: w · x + b = 0
Inequalities: y_i (x_i · w + b) - 1 ≥ 0 for all i
Margin: d = 2 / ||w||
Support vectors: the training points that lie on the margin (where the inequality is tight)
Support vector expansion: w = Σ_i α_i y_i x_i
Decision: sign(w · x + b)

Page 79: An introduction to machine learning and probabilistic

79

Replace all inner products with kernels

Kernel function

Page 80: An introduction to machine learning and probabilistic

80

SVMs: summary

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power

General lessons from SVM success:
• The kernel trick can be used to make many linear methods non-linear, e.g., kernel PCA, kernelized mutual information
• Large margin classifiers are good

Page 81: An introduction to machine learning and probabilistic

81

Boosting: summary

• Can boost any weak learner
• Most commonly: boosted decision “stumps”

+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
- Easy to understand
+ Predictive power

Page 82: An introduction to machine learning and probabilistic

82

Supervised learning: summary

Learn mapping F from inputs to outputs using a training set of (x,t) pairs

F can be drawn from different hypothesis spaces, e.g., decision trees, linear separators, linear in high dimensions, mixtures of linear

Algorithms offer a variety of tradeoffs. Many good books, e.g.,

“The Elements of Statistical Learning”, Hastie, Tibshirani, Friedman, 2001

“Pattern classification”, Duda, Hart, Stork, 2001

Page 83: An introduction to machine learning and probabilistic

83

Inference

• Posterior probabilities: probability of any event given any evidence
• Most likely explanation: scenario that explains the evidence
• Rational decision making: maximize expected utility
• Value of information
• Effect of intervention

[Figure: the Earthquake/Burglary/Alarm network with Radio and Call as evidence]

Page 84: An introduction to machine learning and probabilistic

84

Assumption needed to make learning work

We need to assume “Future futures will resemble past futures” (B. Russell)

Unlearnable hypothesis: “All emeralds are grue”, where “grue” means: green if observed before time t, blue afterwards.

Page 85: An introduction to machine learning and probabilistic

85

Structure learning success stories: gene regulation network (Friedman et al.)

Yeast data [Hughes et al 2000]

600 genes, 300 experiments

Page 86: An introduction to machine learning and probabilistic

86

Structure learning success stories II: Phylogenetic Tree Reconstruction (Friedman et al.)

Input: Biological sequences

Human CGTTGC…

Chimp CCTAGG…

Orang CGAACG…….

Output: a phylogeny

[Figure: a phylogenetic tree (spanning “10 billion years”) with the observed sequences at the leaves]

Uses structural EM, with max-spanning-tree in the inner loop.

Page 87: An introduction to machine learning and probabilistic

87

Instances of graphical models

Probabilistic models ⊃ graphical models
• Directed: Bayes nets (Naïve Bayes classifier, mixtures of experts), DBNs (Hidden Markov Model (HMM), Kalman filter model)
• Undirected: MRFs (Ising model)

Page 88: An introduction to machine learning and probabilistic

88

ML enabling technologies

• Faster computers
• More data
  - The web
  - Parallel corpora (machine translation)
  - Multiple sequenced genomes
  - Gene expression arrays
• New ideas
  - Kernel trick
  - Large margins
  - Boosting
  - Graphical models
  - …