

Probabilistic Learning and Graphical Models

Davide Bacciu

Dipartimento di Informatica, Università di Pisa
[email protected]

Machine Learning: Neural Networks and Advanced Models (AA2)


Protein Secondary Structure Prediction


Document Understanding

D. M. Blei, Probabilistic Topic Models, Communications of the ACM, Vol. 55, No. 4, pp. 77-84, 2012


Image Understanding


Probabilistic Learning Models

Learning models that represent knowledge inferred from data in the form of probabilities
- Supervised, unsupervised, and weakly supervised learning tasks
- Describe how data is generated (interpretation)
- Allow incorporating prior knowledge about the data and the task

Most modern tasks comprise large numbers of variables
- Modeling the joint distribution of all variables can become impractical
- Exponential size of the parameter space
- Computationally impractical to train and predict with


The Graphical Models Framework

Representation
- Graphical models are a compact way to represent exponentially large probability distributions
- Encode conditional independence assumptions
- Different classes of graph structures imply different assumptions/capabilities

Inference
- How to query (predict with) a graphical model?
- Probability of an unknown X given observations d, P(X | d)
- Most likely hypothesis

Learning
- Find the right model parameters (Parameter Learning)
- Find the right model structure (Structure Learning)
- An inference problem, after all


Graphical Model Representation

A graph whose nodes (vertices) are random variables and whose edges (links) represent probabilistic relationships between the variables

Different classes of graphs:

Directed Models
- Directed edges express causal relationships

Undirected Models
- Undirected edges express soft constraints

Dynamic Models
- Structure changes to reflect dynamic processes


Graphical Models in this Module

Representation
- Directed graphical models: Bayesian Networks
- Undirected graphical models: Markov Random Fields
- Dynamic graphical models: Hidden Markov Models

Inference
- Exact inference: message passing, junction tree algorithms
- Approximate inference: loopy belief propagation, sampling, variational inference

Learning
- Parameter learning: Expectation-Maximization algorithm
- Structure learning: PC algorithm, search-and-score


Generative Models Module: Plan of the Lectures

Lesson 1 Introduction to Probabilistic and Graphical Models
Lesson 2 Directed and Undirected Graphical Models
Lesson 3 Inference in Graphical Models
Lesson 4 Dynamic Approaches: The Hidden Markov Model
Lesson 5 Graphical Models for Structured Data
Lesson 6 Exact and Approximate Learning: Latent Variable Models
Lesson 7 Bridging Probabilistic and Neural: Deep Learning


Lecture Outline

Introduction

A probabilistic refresher
- Probability theory
- Conditional independence
- Learning in probabilistic models

Directed graphical models: Bayesian Networks
- Representation
- Conditional independence
- Inference and learning

Applications and conclusions


Random Variables

- A Random Variable (RV) is a function describing the outcome of a random process by assigning unique values to all possible outcomes of the experiment
- A RV models an attribute of our data (e.g. age, speech sample, ...)
- Use uppercase to denote a RV, e.g. X, and lowercase to denote a value (observation), e.g. x
- A discrete (categorical) RV is defined on a finite or countable list of values Ω
- A continuous RV can take infinitely many values


Probability Functions

Discrete Random Variables
- A probability function P(X = x) ∈ [0, 1] measures the probability of a RV X attaining the value x
- Subject to the sum rule

  ∑_{x∈Ω} P(X = x) = 1

Continuous Random Variables
- A density function p(t) describes the relative likelihood of a RV to take on the value t
- Subject to the sum rule

  ∫_Ω p(t) dt = 1

- Defines a probability distribution, e.g.

  P(X ≤ x) = ∫_{−∞}^{x} p(t) dt

Shorthand: P(x) for P(X = x) or P(X ≤ x)


Joint and Conditional Probabilities

If a discrete random process is described by a set of RVs X_1, ..., X_N, then the joint probability writes

  P(X_1 = x_1, ..., X_N = x_N) = P(x_1 ∧ ... ∧ x_N)

The joint conditional probability of x_1, ..., x_N given y

  P(x_1, ..., x_N | y)

measures the effect of the realization of an event y on the occurrence of x_1, ..., x_N

A conditional distribution P(x | y) is actually a family of distributions
- For each y, there is a distribution P(x | y)


Chain Rule

Definition (Product Rule a.k.a. Chain Rule)

  P(x_1, ..., x_i, ..., x_N | y) = ∏_{i=1}^{N} P(x_i | x_1, ..., x_{i−1}, y)

Definition (Marginalization)
Using the sum and product rules together yields the complete probability

  P(X_1 = x_1) = ∑_{x_2} P(X_1 = x_1 | X_2 = x_2) P(X_2 = x_2)
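As a small illustration, the sketch below (hypothetical numbers, NumPy) checks the sum rule, the product rule, and marginalization on a two-variable joint table:

# A minimal sketch illustrating the product rule and marginalization on two
# binary random variables X1 and X2; the joint table values are made up.
import numpy as np

# Joint distribution P(X1, X2) as a 2x2 table: rows index x1, columns index x2.
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])
assert np.isclose(joint.sum(), 1.0)          # sum rule

# Marginalization: P(X1 = x1) = sum over x2 of P(X1 = x1, X2 = x2)
p_x1 = joint.sum(axis=1)

# Conditioning: P(X2 | X1) = P(X1, X2) / P(X1)
p_x2_given_x1 = joint / p_x1[:, None]

# Product (chain) rule: P(X1, X2) = P(X2 | X1) P(X1)
reconstructed = p_x2_given_x1 * p_x1[:, None]
assert np.allclose(reconstructed, joint)

print("P(X1):", p_x1)
print("P(X2 | X1):", p_x2_given_x1)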


Bayes Rule

Given hypotheses h_i ∈ H and observations d

  P(h_i | d) = P(d | h_i) P(h_i) / P(d) = P(d | h_i) P(h_i) / ∑_j P(d | h_j) P(h_j)

- P(h_i) is the prior probability of h_i
- P(d | h_i) is the conditional probability of observing d given that hypothesis h_i is true (likelihood)
- P(d) is the marginal probability of d
- P(h_i | d) is the posterior probability that the hypothesis is true given the data and the previous belief about the hypothesis
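A minimal sketch of Bayes rule over a discrete hypothesis space; the prior and likelihood values below are assumed for illustration:

# Posterior P(h_i | d) is proportional to P(d | h_i) P(h_i), normalized over all h_j.
import numpy as np

prior = np.array([0.5, 0.3, 0.2])          # P(h_i), assumed values
likelihood = np.array([0.10, 0.40, 0.70])  # P(d | h_i), assumed values

unnormalized = likelihood * prior
evidence = unnormalized.sum()              # P(d) = sum_j P(d | h_j) P(h_j)
posterior = unnormalized / evidence        # P(h_i | d)

print("P(d) =", evidence)
print("posterior:", posterior)             # sums to 1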


Independence and Conditional Independence

Two RVs X and Y are independent if knowledge about X does not change the uncertainty about Y, and vice versa

  I(X, Y) ⇔ P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X) = P(X) P(Y)

Two RVs X and Y are conditionally independent given Z if, once Z is known, knowledge about X does not change the uncertainty about Y, and vice versa

  I(X, Y | Z) ⇔ P(X, Y | Z) = P(X | Y, Z) P(Y | Z) = P(Y | X, Z) P(X | Z) = P(X | Z) P(Y | Z)

Shorthand: X ⊥ Y for I(X, Y) and X ⊥ Y | Z for I(X, Y | Z)
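A minimal numerical check of conditional independence, assuming a small made-up joint table over binary X, Y, Z constructed so that X ⊥ Y | Z holds:

# Check X ⊥ Y | Z numerically: P(X, Y | Z) should factor as P(X | Z) P(Y | Z).
import numpy as np

# P(X, Y, Z) for binary variables, indexed as joint[x, y, z].
p_z = np.array([0.4, 0.6])
p_x_given_z = np.array([[0.2, 0.7], [0.8, 0.3]])   # indexed [x, z]
p_y_given_z = np.array([[0.5, 0.1], [0.5, 0.9]])   # indexed [y, z]
joint = np.einsum('xz,yz,z->xyz', p_x_given_z, p_y_given_z, p_z)

# Recover P(X, Y | Z) and the product P(X | Z) P(Y | Z) from the joint alone.
p_xy_given_z = joint / joint.sum(axis=(0, 1), keepdims=True)
p_z_marg = joint.sum(axis=(0, 1))
product = np.einsum('xz,yz->xyz',
                    joint.sum(axis=1) / p_z_marg,
                    joint.sum(axis=0) / p_z_marg)

print("X ⊥ Y | Z holds:", np.allclose(p_xy_given_z, product))   # True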


Inference and Learning in Probabilistic Models

Inference - How can one determine the distribution of the values of one or several RVs, given the observed values of others?

  P(graduate | exam_1, ..., exam_n)

Machine Learning view - Given a set of observations (data) d and a set of hypotheses {h_i}_{i=1}^K, how can I use them to predict the distribution of a RV X?

Learning - A very specific inference problem!
- Given a set of observations d and a probabilistic model of a given structure, how do I find the parameters θ of its distribution?
- Amounts to determining the best hypothesis h_θ regulated by a (set of) parameters θ


3 Approaches to Inference

Bayesian - Consider all hypotheses weighted by their probabilities

  P(X | d) = ∑_i P(X | h_i) P(h_i | d)

MAP - Infer X from P(X | h_MAP), where h_MAP is the Maximum a-Posteriori hypothesis given d

  h_MAP = argmax_{h∈H} P(h | d) = argmax_{h∈H} P(d | h) P(h)

ML - Assuming uniform priors P(h_i) = P(h_j), yields the Maximum Likelihood (ML) estimate P(X | h_ML)

  h_ML = argmax_{h∈H} P(d | h)
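A minimal sketch contrasting the three approaches on an assumed toy setup (hypotheses are candidate coin biases, the data is a sequence of flips):

# Bayesian, MAP, and ML predictions for the next coin flip.
import numpy as np

biases = np.array([0.2, 0.5, 0.8])     # hypotheses h_i: P(heads | h_i), assumed
prior = np.array([0.1, 0.8, 0.1])      # P(h_i), assumed values
heads, tails = 7, 3                    # observed data d

likelihood = biases**heads * (1 - biases)**tails        # P(d | h_i)
posterior = likelihood * prior / np.sum(likelihood * prior)

# Bayesian: average the predictions of all hypotheses, weighted by the posterior
p_bayes = np.sum(biases * posterior)
# MAP: predict with the single most probable hypothesis a posteriori
p_map = biases[np.argmax(posterior)]
# ML: predict with the hypothesis that maximizes the likelihood alone
p_ml = biases[np.argmax(likelihood)]

print(f"Bayesian {p_bayes:.3f}  MAP {p_map:.3f}  ML {p_ml:.3f}")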


Considerations About Bayesian Inference

The Bayesian approach is optimal but poses computational and analytical tractability issues

  P(X | d) = ∫_H P(X | h) P(h | d) dh

- ML and MAP are point estimates of the Bayesian approach, since they infer based only on the single most likely hypothesis
- MAP and Bayesian predictions become closer as more data becomes available
- MAP is a regularization of the ML estimation
  - The hypothesis prior P(h) embodies a trade-off between complexity and degree of fit
  - Well-suited to working with small datasets and/or large parameter spaces


Maximum-Likelihood (ML) Learning

Find the model θ that is most likely to have generated the data d

  θ_ML = argmax_{θ∈Θ} P(d | θ)

from a family of parameterized distributions P(x | θ).

An optimization problem that considers the likelihood function

  L(θ | x) = P(x | θ)

to be a function of θ. It can be addressed by solving

  ∂L(θ | x) / ∂θ = 0
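A minimal sketch of ML learning for a univariate Gaussian on synthetic data; setting the derivative of the log-likelihood to zero gives the familiar closed-form estimates:

# ML estimates for a univariate Gaussian, assuming i.i.d. data: the sample mean
# and the (biased, 1/N) sample variance maximize the likelihood.
import numpy as np

rng = np.random.default_rng(0)
d = rng.normal(loc=2.0, scale=1.5, size=1000)    # synthetic observations

mu_ml = d.mean()                    # argmax over mu of L(mu, sigma | d)
var_ml = ((d - mu_ml)**2).mean()    # argmax over sigma^2, note 1/N not 1/(N-1)

print(f"mu_ML = {mu_ml:.3f}, sigma_ML = {np.sqrt(var_ml):.3f}")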


ML Learning with Hidden Variables

What if my probabilistic model contains both
- Observed random variables X (i.e. for which we have training data)
- Unobserved (hidden/latent) variables Z (e.g. data clusters)

ML learning can still be used to estimate the model parameters
- The Expectation-Maximization algorithm optimizes the complete likelihood

  L_c(θ | X, Z) = P(X, Z | θ) = P(Z | X, θ) P(X | θ)

- A 2-step iterative process

  θ^(k+1) = argmax_θ ∑_z P(Z = z | X, θ^(k)) log L_c(θ | X, Z = z)
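A minimal sketch (not the course's reference implementation) of EM for a two-component 1-D Gaussian mixture, with synthetic data and assumed initial values:

# E-step: compute P(Z | X, theta_k); M-step: maximize the expected complete
# log-likelihood, which has closed-form updates for a Gaussian mixture.
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

# Initial guesses for mixing weights, means, and variances (assumed values)
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibilities r[n, c] = P(Z_n = c | x_n, theta)
    dens = np.exp(-(x[:, None] - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = pi * dens
    r /= r.sum(axis=1, keepdims=True)

    # M-step: closed-form maximization of the expected complete log-likelihood
    n_c = r.sum(axis=0)
    pi = n_c / len(x)
    mu = (r * x[:, None]).sum(axis=0) / n_c
    var = (r * (x[:, None] - mu)**2).sum(axis=0) / n_c

print("weights", pi.round(2), "means", mu.round(2), "stds", np.sqrt(var).round(2))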


Joint Probabilities and Exponential Complexity

Discrete Joint Probability Distribution as a Table

  X_1    ...  X_i    ...  X_n    |  P(X_1, ..., X_n)
  x'_1   ...  x'_i   ...  x'_n   |  P(x'_1, ..., x'_n)
  ...                            |  ...
  x^l_1  ...  x^l_i  ...  x^l_n  |  P(x^l_1, ..., x^l_n)

- Describes P(X_1, ..., X_n) for all the RV instantiations
- For n binary RVs X_i the table has 2^n entries!

Any probability can be obtained from the joint probability distribution P(X_1, ..., X_n) by marginalization, but again at an exponential cost (e.g. 2^(n−1) additions for a marginal distribution over binary RVs).
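A minimal sketch of the exponential blow-up, assuming n = 20 binary variables and random, purely illustrative probabilities:

# A table over n binary variables has 2**n entries, and marginalizing a single
# variable still touches all of them.
import numpy as np

n = 20
joint = np.random.default_rng(2).random(2**n)
joint /= joint.sum()                                 # normalize into a distribution
print("entries in the joint table:", joint.size)     # 1,048,576 for n = 20

# Marginal P(X_1): reshape into n binary axes and sum out the other n-1 axes,
# i.e. 2**(n-1) additions per value of X_1.
table = joint.reshape((2,) * n)
p_x1 = table.sum(axis=tuple(range(1, n)))
print("P(X1):", p_x1)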


Directed Graphical Models

- Compact graphical representation for exponentially large joint distributions
- Simplifies marginalization and inference algorithms
- Allows incorporating prior knowledge concerning causal relationships between RVs

Directed graphical models, a.k.a. Bayesian Networks, describe conditional independence between subsets of RVs by a graphical model, e.g.

  P(H | P, S, E) = P(H | P, S)


Bayesian Network

Directed Acyclic Graph (DAG) G = (V, E)
- Nodes v ∈ V represent random variables
  - Shaded ⇒ observed
  - Empty ⇒ unobserved
- Edges e ∈ E describe the conditional independence relationships

Conditional Probability Tables (CPTs) local to each node describe the probability distribution given its parents

  P(Y_1, ..., Y_N) = ∏_{i=1}^{N} P(Y_i | pa(Y_i))
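A minimal sketch of this factorization on an assumed 3-node network A → B, A → C with made-up CPTs:

# The joint of the network factorizes as P(A) P(B | A) P(C | A).
import numpy as np

p_a = np.array([0.6, 0.4])                 # P(A)
p_b_given_a = np.array([[0.7, 0.2],        # P(B | A), rows index b, columns index a
                        [0.3, 0.8]])
p_c_given_a = np.array([[0.9, 0.5],        # P(C | A)
                        [0.1, 0.5]])

def joint(a, b, c):
    """P(A=a, B=b, C=c) via the Bayesian network factorization."""
    return p_a[a] * p_b_given_a[b, a] * p_c_given_a[c, a]

# The factorized joint is a proper distribution: all 2^3 entries sum to 1.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print("sum over all instantiations:", total)   # 1.0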


A Simple Example

Assume N discrete RVs Y_i, each taking k distinct values
- How many parameters in the joint probability distribution? k^N − 1 independent parameters
- How many independent parameters if all N variables are independent? N × (k − 1)

  P(Y_1, ..., Y_N) = ∏_{i=1}^{N} P(Y_i)

- What if only part of the variables are (conditionally) independent?
- If each of the N nodes has at most L parents ⇒ at most (k − 1) × k^L × N independent parameters
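A minimal sketch of these parameter counts, with assumed values N = 10, k = 3, L = 2:

# Full joint vs. fully independent vs. bounded fan-in (at most L parents per node).
N, k, L = 10, 3, 2

full_joint = k**N - 1                  # one free parameter less than k^N entries
independent = N * (k - 1)              # N separate distributions over k values
bounded_fan_in = N * (k - 1) * k**L    # each CPT: k^L rows of (k - 1) free entries

print(f"full joint: {full_joint}")         # 59048
print(f"independent: {independent}")       # 20
print(f"fan-in <= {L}: {bounded_fan_in}")  # 180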


A Compact Representation of Replication

If the same causal relationship is replicated for a number of variables, we can compactly represent it by plate notation

The Naive Bayes Classifier
- Replication over L attributes
- Replication over N data samples
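A minimal sketch of the generative process the Naive Bayes plates encode, with made-up class and attribute probabilities:

# Outer plate: N data samples; inner plate: L attributes drawn independently
# given the class. All numbers below are assumed for illustration.
import numpy as np

rng = np.random.default_rng(3)
N, L = 5, 4
p_class = np.array([0.3, 0.7])                   # P(C)
p_attr_given_class = np.array([[0.2, 0.8],       # P(A_l = 1 | C), shape (L, classes)
                               [0.9, 0.4],
                               [0.5, 0.5],
                               [0.1, 0.6]])

for n in range(N):                               # plate over data samples
    c = rng.choice(2, p=p_class)
    attrs = rng.random(L) < p_attr_given_class[:, c]   # plate over attributes
    print(f"sample {n}: class={c}, attributes={attrs.astype(int)}")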


Full Plate Notation

Gaussian Mixture Model
- Boxes denote replication, for a number of times given by the letter in the corner
- Shaded nodes are observed variables
- Empty nodes denote unobserved latent variables
- Black seeds (optional) identify model parameters

π → multinomial prior distribution
μ → means of the C Gaussians
σ → standard deviations of the C Gaussians
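A minimal sketch of the generative process in the GMM plate, with assumed values for π, μ and σ:

# For each of N samples: draw a latent cluster from the multinomial prior pi,
# then draw the observation from the corresponding Gaussian.
import numpy as np

rng = np.random.default_rng(4)
N, C = 8, 3
pi = np.array([0.2, 0.5, 0.3])      # multinomial prior over the C components
mu = np.array([-3.0, 0.0, 4.0])     # means of the C Gaussians
sigma = np.array([0.5, 1.0, 1.5])   # standard deviations of the C Gaussians

z = rng.choice(C, size=N, p=pi)             # latent (unobserved) assignments
x = rng.normal(loc=mu[z], scale=sigma[z])   # observed variables
print("z:", z)
print("x:", x.round(2))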


Local Markov Property

Definition (Local Markov property)
Each node / random variable is conditionally independent of all its non-descendants given a joint state of its parents

  Y_v ⊥ Y_{nd(v)} | Y_{pa(v)}  for all v ∈ V

where nd(v) denotes the non-descendants of v. For the example network:

  P(Y_1, Y_3, Y_5, Y_6 | Y_2, Y_4) = P(Y_6 | Y_2, Y_4) × P(Y_1, Y_3, Y_5 | Y_2, Y_4)


Markov Blanket

The Markov Blanket Mb(A) of a node A is the minimal set of vertices that shields the node from the rest of the Bayesian Network
- The behavior of a node can be completely determined and predicted from the knowledge of its Markov blanket

  P(A | Mb(A), Z) = P(A | Mb(A))  ∀ Z ∉ Mb(A)

The Markov blanket of A contains
- Its parents pa(A)
- Its children ch(A)
- Its children's parents pa(ch(A))
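A minimal sketch that computes the Markov blanket of a node from a DAG given as a child → parents mapping; the example graph is made up:

# Markov blanket = parents + children + children's other parents.
def markov_blanket(node, parents):
    children = {c for c, ps in parents.items() if node in ps}
    blanket = set(parents.get(node, ())) | children
    for c in children:
        blanket |= set(parents[c])
    blanket.discard(node)
    return blanket

# Hypothetical graph: A -> C, B -> C, C -> D, E -> D
parents = {"C": {"A", "B"}, "D": {"C", "E"}, "A": set(), "B": set(), "E": set()}
print(markov_blanket("C", parents))   # {'A', 'B', 'D', 'E'}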


Why Use Bayesian Networks?

- Compacting the parameter space
- Reducing inference costs
  - Variable elimination: e.g. compute P(Y_4 | Y_3)
  - Inference algorithms exploiting the sparse graph structure
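A minimal sketch of variable elimination on an assumed chain Y1 → Y2 → Y3 → Y4 with made-up CPTs, computing the marginal P(Y4) by pushing the sums inside the product:

# Eliminating Y1, Y2, Y3 one at a time reduces each step to a small
# matrix-vector product instead of summing over the full joint table.
import numpy as np

p_y1 = np.array([0.3, 0.7])
p_y2_y1 = np.array([[0.9, 0.2], [0.1, 0.8]])   # P(Y2 | Y1), rows y2, columns y1
p_y3_y2 = np.array([[0.6, 0.3], [0.4, 0.7]])   # P(Y3 | Y2)
p_y4_y3 = np.array([[0.5, 0.1], [0.5, 0.9]])   # P(Y4 | Y3)

p_y2 = p_y2_y1 @ p_y1          # eliminate Y1
p_y3 = p_y3_y2 @ p_y2          # eliminate Y2
p_y4 = p_y4_y3 @ p_y3          # eliminate Y3
print("P(Y4):", p_y4)

# Same answer from the explicit (exponentially large, here 2^4-entry) joint.
joint = np.einsum('a,ba,cb,dc->abcd', p_y1, p_y2_y1, p_y3_y2, p_y4_y3)
print("check:", joint.sum(axis=(0, 1, 2)))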


Learning in Bayesian Networks

Parameter learning
- Infer the θ parameters of each conditional probability from data
- Includes hidden random variables (e.g. Y_2)

Learning the network structure
- Determine edge presence and orientation from data
- Infer causal relationships

Example training data:

  Y1  Y2  Y3  Y4  Y5
   1   2   1   0   3
   4   0   0   0   1
  ...
   0   0   1   3   2
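A minimal sketch of fully observed parameter learning: on synthetic data for a single edge A → B, the ML estimate of each CPT entry is a normalized count:

# Estimate P(B | A) by counting co-occurrences in the data and normalizing.
import numpy as np

rng = np.random.default_rng(5)
a = rng.choice(2, size=1000, p=[0.4, 0.6])
b = np.where(a == 1, rng.random(1000) < 0.8, rng.random(1000) < 0.3).astype(int)

counts = np.zeros((2, 2))                            # counts[b_value, a_value]
np.add.at(counts, (b, a), 1)
cpt = counts / counts.sum(axis=0, keepdims=True)     # P(B = b | A = a)
print(cpt.round(3))   # approximately [[0.7, 0.2], [0.3, 0.8]]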


Learning to Segment Image Parts

Latent Topics Network

Yuan et al., A novel topic-level random walk framework for scene image co-segmentation, ECCV 2014


Discovering Gene Interaction Networks

Gene Expression Data


Assessing Influenza Virulence

Inferred Gene Bayesian Graphical Model for H5N1 Virulence

http://hiv.bio.ed.ac.uk/samLycett/research.html


Take Home Messages

Graphical models provide a compact representation forprobabilistic models with a large number of variables

- Use conditional independence to simplify the joint probability
- Efficient inference methods exploit the sparse graph structure
- Learning is a special case of inference performed on the parameters of the probability functions in the graphical model

Directed graphical models (Bayesian Networks)
- Directed edges provide a representation of the causal relationships between variables
- Parameter learning: estimate the conditional probabilities associated with the nodes (visible or unobserved)
- Structure learning: use data to determine edge presence and orientation


Next Lecture

Directed graphical models: Bayesian Networks
- Determining conditional independence (d-separation)
- Structure learning

Undirected graphical models: Markov Random Fields
- Joint probability factorization
- Ising model