
1

Parameter Estimation

Shyh-Kang Jeng
Department of Electrical Engineering / Graduate Institute of Communication / Graduate Institute of Networking and Multimedia, National Taiwan University

2

Typical Classification Problem

We rarely know the complete probabilistic structure of the problem.
We have vague, general knowledge.
We have a number of design samples or training data as representatives of the patterns to be classified.
Find some way to use this information to design or train the classifier.

3

Estimating Probabilities

It is not difficult to estimate the prior probabilities.
It is hard to estimate the class-conditional densities:
– The number of available samples always seems too small
– The problem is serious when the dimensionality is large

4

Estimating Parameters

Many problems permit us to parameterize the conditional densities.
This simplifies the problem from estimating an unknown function to estimating a set of parameters:
– e.g., the mean vector and covariance matrix of a multivariate normal distribution

5

Maximum-Likelihood Estimation

View the parameters as quantities whose values are fixed but unknown.
The best estimate is the one that maximizes the probability of obtaining the samples actually observed.
Nearly always has good convergence properties as the number of samples increases.
Often simpler than alternative methods.

6

I. I. D. Random Variables

Separate the data into D_1, ..., D_c.
Samples in D_j are drawn independently according to p(x|ω_j).
Such samples are independent and identically distributed (i.i.d.) random variables.
Assume p(x|ω_j) has a known parametric form and is determined uniquely by a parameter vector θ_j, i.e., p(x|ω_j) = p(x|ω_j, θ_j).

7

Simplification Assumptions

Samples in D_i give no information about θ_j if i ≠ j.
We can work with each class separately.
This gives c separate problems of the same form:
– Use a set D of i.i.d. samples from p(x|θ) to estimate the unknown parameter vector θ

8

Maximum-likelihood Estimate

Let $D$ contain $n$ i.i.d. samples $\mathbf{x}_1, \ldots, \mathbf{x}_n$. Then
$$p(D|\boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k|\boldsymbol{\theta})$$
(the likelihood of $\boldsymbol{\theta}$ with respect to $D$).
The maximum-likelihood estimate $\hat{\boldsymbol{\theta}}$ maximizes $p(D|\boldsymbol{\theta})$.

9

Maximum-likelihood Estimation

10

A Note

The likelihood p(D|θ) as a function of θ is not a probability density function of θ.
Its area over the θ-domain has no significance.
The likelihood p(D|θ) can be regarded as the probability of D for a given θ.

11

Analytical Approach

Let $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p)^t$ and define the log-likelihood function
$$l(\boldsymbol{\theta}) = \ln p(D|\boldsymbol{\theta}) = \sum_{k=1}^{n} \ln p(\mathbf{x}_k|\boldsymbol{\theta}), \qquad \hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} l(\boldsymbol{\theta})$$
$$\nabla_{\boldsymbol{\theta}}\, l = \sum_{k=1}^{n} \nabla_{\boldsymbol{\theta}} \ln p(\mathbf{x}_k|\boldsymbol{\theta})$$
Necessary condition for $\hat{\boldsymbol{\theta}}$: $\nabla_{\boldsymbol{\theta}}\, l = 0$.
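As a side illustration (not part of the original slides), the sketch below maximizes the same log-likelihood numerically for a univariate Gaussian with θ = (μ, σ²); the sample values and the starting point are made up.

```python
# Minimal sketch: numerical maximization of the log-likelihood l(theta)
# for a univariate Gaussian, theta = (mu, sigma^2). Data are made up.
import numpy as np
from scipy.optimize import minimize

x = np.array([2.1, 1.7, 2.9, 2.4, 1.5, 2.2])   # hypothetical samples

def neg_log_likelihood(theta):
    mu, var = theta
    if var <= 0:                                 # keep the variance positive
        return np.inf
    # l(theta) = sum_k ln p(x_k | theta) for a Gaussian density
    return -np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 1.0]), method="Nelder-Mead")
print("numerical ML estimate:", res.x)           # close to (sample mean, biased variance)
print("closed form:", x.mean(), x.var())
```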

12

MAP Estimators

A maximum a posteriori (MAP) estimator finds the $\boldsymbol{\theta}$ that maximizes $l(\boldsymbol{\theta}) + \ln p(\boldsymbol{\theta})$, where $p(\boldsymbol{\theta})$ is the prior probability of different parameter values.
The maximum-likelihood (ML) estimator is a MAP estimator for the uniform prior.

13

Gaussian Case: Unknown μ

$$\ln p(\mathbf{x}_k|\boldsymbol{\mu}) = -\frac{1}{2}\ln\left[(2\pi)^d|\boldsymbol{\Sigma}|\right] - \frac{1}{2}(\mathbf{x}_k-\boldsymbol{\mu})^t\boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\boldsymbol{\mu})$$
$$\nabla_{\boldsymbol{\mu}} \ln p(\mathbf{x}_k|\boldsymbol{\mu}) = \boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\boldsymbol{\mu})$$
Setting $\sum_{k=1}^{n}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\hat{\boldsymbol{\mu}}) = 0$ gives
$$\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$$

14

Univariate Gaussian Case: Unknown μ and σ²

Let $\boldsymbol{\theta} = (\theta_1, \theta_2) = (\mu, \sigma^2)$.
$$\ln p(x_k|\boldsymbol{\theta}) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k-\theta_1)^2$$
$$\nabla_{\boldsymbol{\theta}}\, l = \sum_{k=1}^{n}\nabla_{\boldsymbol{\theta}}\ln p(x_k|\boldsymbol{\theta}) = \sum_{k=1}^{n}\begin{pmatrix}\frac{1}{\theta_2}(x_k-\theta_1)\\[1mm] -\frac{1}{2\theta_2}+\frac{(x_k-\theta_1)^2}{2\theta_2^2}\end{pmatrix} = 0$$
Solving gives
$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}x_k, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat{\mu})^2$$
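A quick numerical check of these closed-form estimates (my own illustration, not from the slides; the sample values are arbitrary), also showing the unbiased sample variance discussed on slide 16.

```python
# Closed-form ML estimates for a univariate Gaussian (slide 14),
# plus the unbiased sample variance for comparison (slide 16).
import numpy as np

x = np.array([4.2, 3.9, 5.1, 4.7, 4.0, 4.4, 5.3])   # hypothetical samples
n = len(x)

mu_hat = x.sum() / n                          # hat{mu} = (1/n) sum_k x_k
var_ml = ((x - mu_hat) ** 2).sum() / n        # hat{sigma}^2, biased ML estimate
var_unbiased = ((x - mu_hat) ** 2).sum() / (n - 1)   # unbiased sample variance

print(mu_hat, var_ml, var_unbiased)
```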

15

Multivariate Gaussian Case: Unknown μ and Σ

$$\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k, \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{k=1}^{n}(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^t$$

16

Bias, Absolutely Unbiased, and Asymptotically Unbiased

$$E\left[\frac{1}{n}\sum_{k=1}^{n}(x_k-\bar{x})^2\right] = \frac{n-1}{n}\sigma^2 \neq \sigma^2$$
so the ML estimator for $\sigma^2$ is a biased estimator.
An (absolutely) unbiased estimator for the covariance matrix is
$$\mathbf{C} = \frac{1}{n-1}\sum_{k=1}^{n}(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^t$$
The ML estimator of $\boldsymbol{\Sigma}$ is asymptotically unbiased.

17

Model Error

For a reliable model, the ML classifier can give excellent results.
If the model is wrong, the ML classifier cannot give the best results, even within the assumed set of models.

18

Bayesian Estimation (Bayesian Learning)

The answers obtained are in general nearly identical to those given by maximum likelihood.
Basic conceptual difference:
– The parameter vector θ is a random variable
– The training data are used to convert a prior distribution on this variable into a posterior probability density

19

Central Problem

Each class can be treated independently.
Given the sample $D$,
$$P(\omega_i|\mathbf{x}, D) = \frac{p(\mathbf{x}|\omega_i, D)\,P(\omega_i|D)}{\sum_{j=1}^{c} p(\mathbf{x}|\omega_j, D)\,P(\omega_j|D)}$$
Assume the prior probabilities are known or easy to find: $P(\omega_i|D) = P(\omega_i)$.
Let $D$ be separated into $D_1, \ldots, D_c$. Samples in $D_i$ do not affect $p(\mathbf{x}|\omega_j, D)$ if $i \neq j$, so
$$P(\omega_i|\mathbf{x}, D) = \frac{p(\mathbf{x}|\omega_i, D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}|\omega_j, D_j)\,P(\omega_j)}$$
Central problem of Bayesian learning: use a set of samples $D$ drawn independently according to the fixed but unknown $p(\mathbf{x})$ to determine $p(\mathbf{x}|D)$.

20

Parameter Distribution

Assume p(x) has a known parametric form with parameter vector θ of unknown value.
Thus p(x|θ) is completely known.
Information about θ prior to observing the samples is contained in a known prior density p(θ).
Observations convert p(θ) to p(θ|D):
– it should be sharply peaked about the true value of θ

21

Parameter Distribution

$$p(\mathbf{x}|D) = \int p(\mathbf{x}, \boldsymbol{\theta}|D)\, d\boldsymbol{\theta}$$
$$p(\mathbf{x}, \boldsymbol{\theta}|D) = p(\mathbf{x}|\boldsymbol{\theta}, D)\,p(\boldsymbol{\theta}|D) = p(\mathbf{x}|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|D)$$
$$p(\mathbf{x}|D) = \int p(\mathbf{x}|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|D)\, d\boldsymbol{\theta}$$
If $p(\boldsymbol{\theta}|D)$ peaks very sharply about some $\hat{\boldsymbol{\theta}}$, then $p(\mathbf{x}|D) \approx p(\mathbf{x}|\hat{\boldsymbol{\theta}})$.

22

Univariate Gaussian Case: p(μ|D)

$$p(x|\mu) \sim N(\mu, \sigma^2), \quad \mu \text{ is the only unknown}$$
Assume $p(\mu) \sim N(\mu_0, \sigma_0^2)$, with $\mu_0$ and $\sigma_0^2$ known ($\mu_0$: best guess of $\mu$; $\sigma_0^2$: uncertainty about this guess).
Given $D = \{x_1, \ldots, x_n\}$,
$$p(\mu|D) = \frac{p(D|\mu)\,p(\mu)}{\int p(D|\mu)\,p(\mu)\,d\mu} = \alpha \prod_{k=1}^{n} p(x_k|\mu)\,p(\mu)$$
$$= \alpha' \exp\left[-\frac{1}{2}\left(\sum_{k=1}^{n}\frac{(\mu - x_k)^2}{\sigma^2} + \frac{(\mu-\mu_0)^2}{\sigma_0^2}\right)\right]$$
$$= \alpha'' \exp\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n}x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$$

23

Reproducing Density

$$p(\mu|D) \sim N(\mu_n, \sigma_n^2) \quad \text{[reproducing density]}$$
[c.f. $p(\mu)$: conjugate prior]
$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}, \qquad \mu_n = \sigma_n^2\left(\frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}\right)$$
$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\,\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2+\sigma^2}\,\mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2+\sigma^2}, \qquad \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n}x_k$$
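A small sketch of the update above (not from the slides; the prior parameters, known variance, and samples are made-up values), including the predictive density of slide 26.

```python
# Sketch of the Bayesian update mu_0, sigma_0^2 -> mu_n, sigma_n^2 (slide 23).
import numpy as np

x = np.array([0.8, 1.3, 0.5, 1.1])    # hypothetical samples
sigma2 = 1.0                           # known variance of p(x|mu)
mu0, sigma0_2 = 0.0, 2.0               # prior p(mu) ~ N(mu0, sigma0^2)

n = len(x)
mu_hat_n = x.mean()

sigma_n2 = (sigma0_2 * sigma2) / (n * sigma0_2 + sigma2)
mu_n = (n * sigma0_2 / (n * sigma0_2 + sigma2)) * mu_hat_n \
     + (sigma2 / (n * sigma0_2 + sigma2)) * mu0

print("posterior  p(mu|D) ~ N(%.3f, %.3f)" % (mu_n, sigma_n2))
print("predictive p(x|D)  ~ N(%.3f, %.3f)" % (mu_n, sigma2 + sigma_n2))
```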

24

Bayesian Learning

25

Dogmatism

μ_n is a linear combination of μ̂_n and μ_0, and always lies somewhere between them.
The relative balance between prior knowledge and empirical data is set by the ratio of σ² to σ_0² (dogmatism).
When the dogmatism is finite, μ_n will converge to μ̂_n no matter what μ_0 and σ_0² are.

26

Univariate Gaussian Case: p(x|D)

$$p(x|D) = \int p(x|\mu)\,p(\mu|D)\,d\mu$$
$$= \frac{1}{2\pi\sigma\sigma_n}\exp\left[-\frac{1}{2}\frac{(x-\mu_n)^2}{\sigma^2+\sigma_n^2}\right]f(\sigma,\sigma_n)$$
$$f(\sigma,\sigma_n) = \int \exp\left[-\frac{1}{2}\frac{\sigma^2+\sigma_n^2}{\sigma^2\sigma_n^2}\left(\mu - \frac{\sigma_n^2 x + \sigma^2\mu_n}{\sigma^2+\sigma_n^2}\right)^2\right]d\mu$$
$$p(x|D) \sim N(\mu_n, \sigma^2+\sigma_n^2)$$

27

Multivariate Gaussian Case

$$p(\mathbf{x}|\boldsymbol{\mu}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \qquad p(\boldsymbol{\mu}) \sim N(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$$
$$p(\boldsymbol{\mu}|D) = \alpha \prod_{k=1}^{n} p(\mathbf{x}_k|\boldsymbol{\mu})\,p(\boldsymbol{\mu})$$
$$= \alpha' \exp\left[-\frac{1}{2}\left(\boldsymbol{\mu}^t\left(n\boldsymbol{\Sigma}^{-1}+\boldsymbol{\Sigma}_0^{-1}\right)\boldsymbol{\mu} - 2\boldsymbol{\mu}^t\left(\boldsymbol{\Sigma}^{-1}\sum_{k=1}^{n}\mathbf{x}_k + \boldsymbol{\Sigma}_0^{-1}\boldsymbol{\mu}_0\right)\right)\right]$$
$$\hat{\boldsymbol{\mu}}_n = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$$

28

Multivariate Gaussian Case

$$p(\boldsymbol{\mu}|D) \sim N(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)$$
$$\boldsymbol{\mu}_n = \boldsymbol{\Sigma}_0\left(\boldsymbol{\Sigma}_0+\tfrac{1}{n}\boldsymbol{\Sigma}\right)^{-1}\hat{\boldsymbol{\mu}}_n + \tfrac{1}{n}\boldsymbol{\Sigma}\left(\boldsymbol{\Sigma}_0+\tfrac{1}{n}\boldsymbol{\Sigma}\right)^{-1}\boldsymbol{\mu}_0, \qquad \boldsymbol{\Sigma}_n = \boldsymbol{\Sigma}_0\left(\boldsymbol{\Sigma}_0+\tfrac{1}{n}\boldsymbol{\Sigma}\right)^{-1}\tfrac{1}{n}\boldsymbol{\Sigma}$$
[using $(\mathbf{A}^{-1}+\mathbf{B}^{-1})^{-1} = \mathbf{A}(\mathbf{A}+\mathbf{B})^{-1}\mathbf{B} = \mathbf{B}(\mathbf{A}+\mathbf{B})^{-1}\mathbf{A}$]
$$p(\mathbf{x}|D) = \int p(\mathbf{x}|\boldsymbol{\mu})\,p(\boldsymbol{\mu}|D)\,d\boldsymbol{\mu}$$
or, by letting $\mathbf{y} = \mathbf{x}-\boldsymbol{\mu}$ with $p(\mathbf{y}) \sim N(\mathbf{0}, \boldsymbol{\Sigma})$ and $p(\boldsymbol{\mu}|D) \sim N(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)$,
$$p(\mathbf{x}|D) \sim N(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}+\boldsymbol{\Sigma}_n)$$

29

Multivariate Bayesian Learning

30

General Bayesian Estimation

$$p(\mathbf{x}|D) = \int p(\mathbf{x}|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|D)\,d\boldsymbol{\theta}$$
$$p(\boldsymbol{\theta}|D) = \frac{p(D|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{\int p(D|\boldsymbol{\theta})\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta}}$$
$$p(D|\boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k|\boldsymbol{\theta})$$

31

Recursive Bayesian Learning

Let $D^n = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, so that $p(D^n|\boldsymbol{\theta}) = p(\mathbf{x}_n|\boldsymbol{\theta})\,p(D^{n-1}|\boldsymbol{\theta})$.
$$p(\boldsymbol{\theta}|D^n) = \frac{p(D^n|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{\int p(D^n|\boldsymbol{\theta})\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta}} = \frac{p(\mathbf{x}_n|\boldsymbol{\theta})\,p(D^{n-1}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{\int p(\mathbf{x}_n|\boldsymbol{\theta})\,p(D^{n-1}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta}}$$
$$= \frac{p(\mathbf{x}_n|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|D^{n-1})}{\int p(\mathbf{x}_n|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|D^{n-1})\,d\boldsymbol{\theta}}, \qquad p(\boldsymbol{\theta}|D^0) = p(\boldsymbol{\theta})$$

32

Example 1: Recursive Bayes Learning

$$p(x|\theta) \sim U(0, \theta): \quad p(x|\theta) = \begin{cases}1/\theta & 0 \le x \le \theta\\ 0 & \text{otherwise}\end{cases}$$
$$p(\theta) \sim U(0, 10), \qquad D = \{4, 7, 2, 8\}$$
$$p(\theta|D^0) = p(\theta) \sim U(0, 10)$$
$$p(\theta|D^1) \propto p(x_1|\theta)\,p(\theta|D^0) = \begin{cases}1/\theta & 4 \le \theta \le 10\\ 0 & \text{otherwise}\end{cases}$$
$$p(\theta|D^2) \propto p(x_2|\theta)\,p(\theta|D^1) = \begin{cases}1/\theta^2 & 7 \le \theta \le 10\\ 0 & \text{otherwise}\end{cases}$$
$$p(\theta|D^n) \propto 1/\theta^n \quad \text{for } \max_k x_k \le \theta \le 10$$
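A minimal sketch of Example 1 on a discrete grid of θ (my own illustration of the recursive update; the grid resolution is arbitrary).

```python
# Recursive Bayes update of Example 1 on a grid:
# p(x|theta) = 1/theta on [0, theta], prior p(theta) ~ U(0, 10), D = {4, 7, 2, 8}.
import numpy as np

theta = np.linspace(0.01, 10.0, 1000)        # grid over theta
posterior = np.ones_like(theta)              # p(theta|D^0) ~ U(0, 10), unnormalized

for x in [4, 7, 2, 8]:
    likelihood = np.where(theta >= x, 1.0 / theta, 0.0)   # p(x|theta)
    posterior = likelihood * posterior                    # p(x_n|theta) p(theta|D^{n-1})
    posterior /= posterior.sum()                          # normalize on the grid

# the posterior now peaks just above max_k x_k = 8 and falls off like 1/theta^4
print(theta[np.argmax(posterior)])
```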

33

Example 1: Recursive Bayes Learning

34

Example 1: Bayes vs. ML

35

Identifiability

p(x|θ) is identifiable:
– The sequence of posterior densities p(θ|D^n) converges to a delta function
– Only one θ causes p(x|θ) to fit the data
In some cases, more than one value of θ may yield the same p(x|θ):
– p(θ|D^n) will peak near all θ that explain the data
– The ambiguity is erased in the integration for p(x|D^n), which converges to p(x) whether or not p(x|θ) is identifiable

36

ML vs. Bayes Methods

Computational complexity
Interpretability
Confidence in prior information
– Form of the underlying distribution p(x|θ)
Results differ when p(θ|D) is broad, or asymmetric around the estimated θ:
– Bayes methods would exploit such information whereas ML would not

37

Classification Errors

Bayes or indistinguishability error
Model error
Estimation error
– Parameters are estimated from a finite sample
– Vanishes in the limit of infinite training data (ML and Bayes would then have the same total classification error)

38

Invariance and Non-informative Priors

Guidance in creating priors.
Invariance:
– Translation invariance
– Scale invariance
Non-informative with respect to an invariance:
– Much better than accommodating an arbitrary transformation in a MAP estimator
– Of great use in Bayesian estimation

39

Gibbs Algorithm

$$p(\mathbf{x}|D) = \int p(\mathbf{x}|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|D)\,d\boldsymbol{\theta}$$
Pick a $\boldsymbol{\theta}_0$ according to $p(\boldsymbol{\theta}|D)$ and let $p(\mathbf{x}|D) \approx p(\mathbf{x}|\boldsymbol{\theta}_0)$ [Gibbs algorithm].
Given weak assumptions, the misclassification error is at most twice the expected error of the Bayes optimal classifier.

40

Sufficient Statistics

Statistic:
– Any function of the samples
Sufficient statistic s of samples D:
– s contains all information relevant to estimating some parameter θ
– Definition: p(D|s, θ) is independent of θ
– If θ can be regarded as a random variable,
$$p(\boldsymbol{\theta}|\mathbf{s}, D) = \frac{p(D|\mathbf{s},\boldsymbol{\theta})\,p(\boldsymbol{\theta}|\mathbf{s})}{p(D|\mathbf{s})} = p(\boldsymbol{\theta}|\mathbf{s})$$

41

Factorization Theorem

A statistic s is sufficient for θ if and only if P(D|θ) can be written as the product
P(D|θ) = g(s, θ) h(D) for some functions g(.,.) and h(.)

42

Example: Multivariate Gaussian

$$p(\mathbf{x}|\boldsymbol{\mu}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$
$$p(D|\boldsymbol{\mu}) = \prod_{k=1}^{n}\frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{x}_k-\boldsymbol{\mu})^t\boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\boldsymbol{\mu})\right]$$
$$= \exp\left[-\frac{1}{2}\left(n\boldsymbol{\mu}^t\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} - 2\boldsymbol{\mu}^t\boldsymbol{\Sigma}^{-1}\sum_{k=1}^{n}\mathbf{x}_k\right)\right]\cdot\frac{1}{(2\pi)^{nd/2}|\boldsymbol{\Sigma}|^{n/2}}\exp\left[-\frac{1}{2}\sum_{k=1}^{n}\mathbf{x}_k^t\boldsymbol{\Sigma}^{-1}\mathbf{x}_k\right]$$
Thus $\mathbf{s} = \sum_{k=1}^{n}\mathbf{x}_k$, and hence $\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$, are sufficient for $\boldsymbol{\mu}$.

43

Proof of Factorization Theorem: The "Only if" Part

Suppose s is sufficient for θ, so that P(D|s, θ) is independent of θ. Since s is determined by D,
$$P(D|\boldsymbol{\theta}) = P(D, \mathbf{s}|\boldsymbol{\theta}) = P(D|\mathbf{s}, \boldsymbol{\theta})\,P(\mathbf{s}|\boldsymbol{\theta}) = P(D|\mathbf{s})\,P(\mathbf{s}|\boldsymbol{\theta}) = h(D)\,g(\mathbf{s}, \boldsymbol{\theta})$$

44

Proof of Factorization Theorem: The "if" Part

Let $\bar{D}(\mathbf{s})$ be the set of samples $D$ having the statistic value $\mathbf{s}$.
$$P(D|\mathbf{s}, \boldsymbol{\theta}) = \frac{P(D, \mathbf{s}|\boldsymbol{\theta})}{P(\mathbf{s}|\boldsymbol{\theta})} = \frac{P(D|\boldsymbol{\theta})}{\sum_{D'\in\bar{D}(\mathbf{s})} P(D'|\boldsymbol{\theta})} = \frac{g(\mathbf{s},\boldsymbol{\theta})\,h(D)}{g(\mathbf{s},\boldsymbol{\theta})\sum_{D'\in\bar{D}(\mathbf{s})} h(D')} = \frac{h(D)}{\sum_{D'\in\bar{D}(\mathbf{s})} h(D')}$$
This is independent of θ, and hence s is sufficient for θ.

45

Kernel Density

Factoring of P(D|θ) into g(s,θ)h(D) is not unique:
– If f(s) is any function, g'(s,θ) = f(s)g(s,θ) and h'(D) = h(D)/f(s) are equivalent factors
The ambiguity is removed by defining the kernel density, which is invariant to such scaling:
$$\bar{g}(\mathbf{s},\boldsymbol{\theta}) = \frac{g(\mathbf{s},\boldsymbol{\theta})}{\int g(\mathbf{s},\boldsymbol{\theta}')\,d\boldsymbol{\theta}'}$$

46

Example: Multivariate Gaussian

$$p(\mathbf{x}|\boldsymbol{\mu}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$
$$p(D|\boldsymbol{\mu}) = \exp\left[-\frac{1}{2}\left(n\boldsymbol{\mu}^t\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} - 2\boldsymbol{\mu}^t\boldsymbol{\Sigma}^{-1}\sum_{k=1}^{n}\mathbf{x}_k\right)\right]\cdot\frac{1}{(2\pi)^{nd/2}|\boldsymbol{\Sigma}|^{n/2}}\exp\left[-\frac{1}{2}\sum_{k=1}^{n}\mathbf{x}_k^t\boldsymbol{\Sigma}^{-1}\mathbf{x}_k\right]$$
$$= g(\hat{\boldsymbol{\mu}}_n, \boldsymbol{\mu})\,h(D), \qquad \hat{\boldsymbol{\mu}}_n = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$$
$$g(\hat{\boldsymbol{\mu}}_n, \boldsymbol{\mu}) = \exp\left[-\frac{n}{2}\left(\boldsymbol{\mu}^t\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} - 2\boldsymbol{\mu}^t\boldsymbol{\Sigma}^{-1}\hat{\boldsymbol{\mu}}_n\right)\right]$$
$$\bar{g}(\hat{\boldsymbol{\mu}}_n, \boldsymbol{\mu}) = \frac{n^{d/2}}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\left[-\frac{n}{2}(\boldsymbol{\mu}-\hat{\boldsymbol{\mu}}_n)^t\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}-\hat{\boldsymbol{\mu}}_n)\right]$$

47

Kernel Density and Parameter Estimation

Maximum-likelihood:
– maximization of g(s,θ)
Bayesian:
$$p(\boldsymbol{\theta}|D) = \frac{p(D|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{\int p(D|\boldsymbol{\theta})\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta}} = \frac{g(\mathbf{s},\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{\int g(\mathbf{s},\boldsymbol{\theta})\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta}}$$
– If prior knowledge of θ is vague, p(θ) tends to be uniform, and p(θ|D) is approximately the same as the kernel density
– If p(x|θ) is identifiable, g(s,θ) peaks sharply at some value; if p(θ) is continuous as well as non-zero there, p(θ|D) approaches the kernel density

48

Sufficient Statistics for Exponential Family

$$p(\mathbf{x}|\boldsymbol{\theta}) = \alpha(\mathbf{x})\exp\left[a(\boldsymbol{\theta}) + \mathbf{b}(\boldsymbol{\theta})^t\mathbf{c}(\mathbf{x})\right]$$
$$p(D|\boldsymbol{\theta}) = \left[\prod_{k=1}^{n}\alpha(\mathbf{x}_k)\right]\exp\left[n\,a(\boldsymbol{\theta}) + \mathbf{b}(\boldsymbol{\theta})^t\sum_{k=1}^{n}\mathbf{c}(\mathbf{x}_k)\right] = g(\mathbf{s},\boldsymbol{\theta})\,h(D)$$
$$\mathbf{s} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{c}(\mathbf{x}_k), \qquad g(\mathbf{s},\boldsymbol{\theta}) = \exp\left[n\left(a(\boldsymbol{\theta}) + \mathbf{b}(\boldsymbol{\theta})^t\mathbf{s}\right)\right], \qquad h(D) = \prod_{k=1}^{n}\alpha(\mathbf{x}_k)$$

49

Error Rate and Dimensionality

Suppose the features are statistically independent.
Consider the two-class multivariate normal case $p(\mathbf{x}|\omega_j) \sim N(\boldsymbol{\mu}_j, \boldsymbol{\Sigma})$, $j = 1, 2$.
With equal prior probabilities, the Bayes error rate is
$$P(e) = \frac{1}{\sqrt{2\pi}}\int_{r/2}^{\infty} e^{-u^2/2}\,du, \qquad r^2 = (\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)^t\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)$$
In the conditionally independent case, $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$ and
$$r^2 = \sum_{i=1}^{d}\left(\frac{\mu_{i1}-\mu_{i2}}{\sigma_i}\right)^2$$
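A small sketch of the error-rate formula above (not from the slides; the class means and variances are made-up values for the diagonal case).

```python
# Bayes error P(e) as a function of the Mahalanobis distance r (slide 49).
# In the conditionally independent case, adding a feature with mu_i1 != mu_i2
# increases r and therefore decreases P(e).
import numpy as np
from scipy.stats import norm

def bayes_error(r):
    # P(e) = (1/sqrt(2*pi)) * integral_{r/2}^{inf} exp(-u^2/2) du = 1 - Phi(r/2)
    return norm.sf(r / 2.0)

mu1 = np.array([0.0, 0.0, 0.0])
mu2 = np.array([1.0, 0.5, 0.3])
sigma = np.array([1.0, 1.0, 1.0])     # diagonal case, made-up values

r = np.sqrt(np.sum(((mu1 - mu2) / sigma) ** 2))
print(r, bayes_error(r))
```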

50

Accuracy and Dimensionality

51

Effects of Additional Features

In practice, beyond a certain point, the inclusion of additional features leads to worse rather than better performance.
Sources of difficulty:
– Wrong models
– The number of design or training samples is finite, and thus the distributions are not estimated accurately

52

Computational Complexity for Maximum-Likelihood Estimation

$$\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k : \; O(nd)$$
$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{k=1}^{n}(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^t : \; O(nd^2)$$
Inverse of a $d \times d$ matrix: $O(d^3)$; determinant of a $d \times d$ matrix: $O(d^3)$
$$g(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\hat{\boldsymbol{\mu}})^t\hat{\boldsymbol{\Sigma}}^{-1}(\mathbf{x}-\hat{\boldsymbol{\mu}}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\boldsymbol{\Sigma}}| + \ln P(\omega)$$
$$O(nd) + O(nd^2) + O(d^3) + O(n) + O(1) \;\Rightarrow\; O(nd^2) \text{ overall, since } n > d$$

53

Computational Complexity for Classification

Given a test point x:
– Compute (x − μ̂): O(d)
– Multiply the inverse covariance matrix by the separation vector: O(d²)
– Compute max_i g_i(x) over the c classes and decide: O(c)
Total for classification: O(d²), simpler than learning.

54

Approaches for Inadequate Samples

Reduce the dimensionality:
– Redesign the feature extractor
– Select an appropriate subset of features
– Combine the existing features
– Pool the available data by assuming all classes share the same covariance matrix
Look for a better estimate for Σ:
– Use a Bayesian estimate with a diagonal Σ_0
– Threshold the sample covariance matrix
– Assume statistical independence

55

Shrinkage (Regularized Discriminant Analysis)

Here i is an index on the categories in question, and Σ is estimated by assuming the same covariance matrix for all categories.
"Shrink" the individual covariance matrices to the common one:
$$\boldsymbol{\Sigma}_i(\alpha) = \frac{(1-\alpha)\,n_i\boldsymbol{\Sigma}_i + \alpha\, n\boldsymbol{\Sigma}}{(1-\alpha)\,n_i + \alpha\, n}, \qquad 0 < \alpha < 1$$
or, "shrink" the common covariance matrix toward the identity matrix:
$$\boldsymbol{\Sigma}(\beta) = (1-\beta)\boldsymbol{\Sigma} + \beta\mathbf{I}, \qquad 0 < \beta < 1$$
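A minimal sketch of the two shrinkage estimators above (not from the slides; the shrinkage weights, sample sizes, and data are illustrative placeholders).

```python
# The two shrinkage estimators of slide 55.
import numpy as np

def shrink_to_common(Sigma_i, n_i, Sigma, n, alpha):
    # Sigma_i(alpha) = ((1-alpha) n_i Sigma_i + alpha n Sigma) / ((1-alpha) n_i + alpha n)
    return ((1 - alpha) * n_i * Sigma_i + alpha * n * Sigma) / \
           ((1 - alpha) * n_i + alpha * n)

def shrink_to_identity(Sigma, beta):
    # Sigma(beta) = (1-beta) Sigma + beta I
    return (1 - beta) * Sigma + beta * np.eye(Sigma.shape[0])

rng = np.random.default_rng(0)
X_i = rng.normal(size=(20, 3))            # small sample for one class
Sigma_i = np.cov(X_i, rowvar=False)
Sigma = np.eye(3)                         # pooled estimate (placeholder)

print(shrink_to_common(Sigma_i, 20, Sigma, 100, alpha=0.3))
print(shrink_to_identity(Sigma_i, beta=0.1))
```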

56

Concept of Overfitting

57

Best Representative Point

Given $\mathbf{x}_1, \ldots, \mathbf{x}_n$, find the point $\mathbf{x}_0$ such that
$$J_0(\mathbf{x}_0) = \sum_{k=1}^{n}\|\mathbf{x}_0 - \mathbf{x}_k\|^2$$
is minimized. With the sample mean $\mathbf{m} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$,
$$J_0(\mathbf{x}_0) = \sum_{k=1}^{n}\|(\mathbf{x}_0-\mathbf{m}) - (\mathbf{x}_k-\mathbf{m})\|^2 = n\|\mathbf{x}_0-\mathbf{m}\|^2 + \sum_{k=1}^{n}\|\mathbf{x}_k-\mathbf{m}\|^2$$
so the sample mean m minimizes $J_0(\mathbf{x}_0)$.

58

Projection Along a Line

59

Best Projection to a Line Through the Sample Mean

Line: $\mathbf{x} = \mathbf{m} + a\mathbf{e}$, with $\|\mathbf{e}\| = 1$.
Represent $\mathbf{x}_k$ by $\mathbf{m} + a_k\mathbf{e}$ with some error. To minimize
$$J_1(a_1, \ldots, a_n, \mathbf{e}) = \sum_{k=1}^{n}\|(\mathbf{m} + a_k\mathbf{e}) - \mathbf{x}_k\|^2 = \sum_{k=1}^{n}a_k^2\|\mathbf{e}\|^2 - 2\sum_{k=1}^{n}a_k\mathbf{e}^t(\mathbf{x}_k-\mathbf{m}) + \sum_{k=1}^{n}\|\mathbf{x}_k-\mathbf{m}\|^2$$
setting $\partial J_1/\partial a_k = 0$ gives
$$a_k = \mathbf{e}^t(\mathbf{x}_k-\mathbf{m})$$

60

Best Representative Direction

Find e to minimize
$$J_1(\mathbf{e}) = \sum_{k=1}^{n}a_k^2 - 2\sum_{k=1}^{n}a_k^2 + \sum_{k=1}^{n}\|\mathbf{x}_k-\mathbf{m}\|^2 = -\sum_{k=1}^{n}\left[\mathbf{e}^t(\mathbf{x}_k-\mathbf{m})\right]^2 + \sum_{k=1}^{n}\|\mathbf{x}_k-\mathbf{m}\|^2 = -\mathbf{e}^t\mathbf{S}\mathbf{e} + \sum_{k=1}^{n}\|\mathbf{x}_k-\mathbf{m}\|^2$$
where the scatter matrix is $\mathbf{S} = \sum_{k=1}^{n}(\mathbf{x}_k-\mathbf{m})(\mathbf{x}_k-\mathbf{m})^t$.
Maximize $\mathbf{e}^t\mathbf{S}\mathbf{e}$ subject to $\|\mathbf{e}\| = 1$.
Lagrange method: maximize $u = \mathbf{e}^t\mathbf{S}\mathbf{e} - \lambda(\mathbf{e}^t\mathbf{e} - 1)$, which gives $\mathbf{S}\mathbf{e} = \lambda\mathbf{e}$.

61

Principal Component Analysis (PCA)

Projection space: $\mathbf{x} = \mathbf{m} + \sum_{i=1}^{d'} a_i\mathbf{e}_i$
Find $\mathbf{e}_i$ and $a_{ki}$, $i = 1, \ldots, d'$, to minimize
$$J_{d'} = \sum_{k=1}^{n}\left\|\left(\mathbf{m} + \sum_{i=1}^{d'}a_{ki}\mathbf{e}_i\right) - \mathbf{x}_k\right\|^2$$
$\mathbf{e}_1, \ldots, \mathbf{e}_{d'}$ are the eigenvectors of S having the largest eigenvalues.
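A minimal PCA sketch following slides 60-61 (my own illustration; the data matrix and d' are made up).

```python
# PCA via the scatter matrix: the eigenvectors of S with the largest
# eigenvalues span the best d'-dimensional representation subspace.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))            # n x d data matrix, made-up values
d_prime = 2

m = X.mean(axis=0)                       # sample mean
Xc = X - m
S = Xc.T @ Xc                            # scatter matrix

eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order
E = eigvecs[:, ::-1][:, :d_prime]        # top d' eigenvectors e_1..e_d'

A = Xc @ E                               # coefficients a_ki = e_i^t (x_k - m)
X_approx = m + A @ E.T                   # reconstruction m + sum_i a_ki e_i
print(np.mean(np.sum((X - X_approx) ** 2, axis=1)))   # average squared error
```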

62

Concept of Fisher Linear Discriminant

63

Fisher Linear Discriminant Analysis

Find w to get maximal separation of the projections $y = \mathbf{w}^t\mathbf{x}$.
$$\mathbf{m}_i = \frac{1}{n_i}\sum_{\mathbf{x}\in D_i}\mathbf{x}, \quad i = 1, 2, \qquad \tilde{m}_i = \frac{1}{n_i}\sum_{\mathbf{x}\in D_i}\mathbf{w}^t\mathbf{x} = \mathbf{w}^t\mathbf{m}_i$$
Within-class scatter:
$$\tilde{s}_i^2 = \sum_{\mathbf{x}\in D_i}(\mathbf{w}^t\mathbf{x} - \tilde{m}_i)^2$$
To maximize
$$J(\mathbf{w}) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$

64

Fisher Linear Discriminant Analysis

$$\tilde{s}_i^2 = \sum_{\mathbf{x}\in D_i}(\mathbf{w}^t\mathbf{x} - \mathbf{w}^t\mathbf{m}_i)^2 = \mathbf{w}^t\mathbf{S}_i\mathbf{w}, \qquad \mathbf{S}_i = \sum_{\mathbf{x}\in D_i}(\mathbf{x}-\mathbf{m}_i)(\mathbf{x}-\mathbf{m}_i)^t$$
$$\tilde{s}_1^2 + \tilde{s}_2^2 = \mathbf{w}^t\mathbf{S}_W\mathbf{w}, \qquad \mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$$
$$(\tilde{m}_1 - \tilde{m}_2)^2 = (\mathbf{w}^t\mathbf{m}_1 - \mathbf{w}^t\mathbf{m}_2)^2 = \mathbf{w}^t\mathbf{S}_B\mathbf{w}, \qquad \mathbf{S}_B = (\mathbf{m}_1-\mathbf{m}_2)(\mathbf{m}_1-\mathbf{m}_2)^t$$

65

Fisher Linear Discriminant Analysis

$$J(\mathbf{w}) = \frac{\mathbf{w}^t\mathbf{S}_B\mathbf{w}}{\mathbf{w}^t\mathbf{S}_W\mathbf{w}} \quad \text{(generalized Rayleigh quotient)}$$
$J$ is maximized when $\mathbf{S}_B\mathbf{w} = \lambda\mathbf{S}_W\mathbf{w}$ (generalized eigenvalue problem).
$\mathbf{S}_B\mathbf{w} = (\mathbf{m}_1-\mathbf{m}_2)(\mathbf{m}_1-\mathbf{m}_2)^t\mathbf{w}$ is always in the direction of $(\mathbf{m}_1-\mathbf{m}_2)$, so [ignoring scales]
$$\mathbf{w} = \mathbf{S}_W^{-1}(\mathbf{m}_1-\mathbf{m}_2)$$

66

Fisher Linear Discriminant Analysis for Multivariate Normal

Assume the same covariance matrix Σ; the optimal decision boundary is
$$\mathbf{w}^t\mathbf{x} + w_0 = 0, \qquad \mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$
With estimates for $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, and $\boldsymbol{\Sigma}$,
$$\mathbf{w} = \mathbf{S}_W^{-1}(\mathbf{m}_1-\mathbf{m}_2) \quad \text{[solution to Fisher linear discriminant analysis]}$$
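A short sketch of the two-class Fisher direction from slides 63-66 (my own illustration; the two synthetic classes are made up).

```python
# Two-class Fisher discriminant: w = S_W^{-1} (m_1 - m_2), up to scale.
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))
X2 = rng.normal(loc=[2, 1], scale=1.0, size=(50, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)
S2 = (X2 - m2).T @ (X2 - m2)
S_W = S1 + S2                            # within-class scatter

w = np.linalg.solve(S_W, m1 - m2)        # Fisher direction

y1, y2 = X1 @ w, X2 @ w                  # projections of the two classes
print(w, y1.mean(), y2.mean())
```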

67

Concept of Multidimensional Discriminant Analysis

68

Multiple Discriminant Analysis

Consider the c-class problem.
Projection from the d-dimensional space to a (c−1)-dimensional subspace:
$$y_i = \mathbf{w}_i^t\mathbf{x}, \quad i = 1, \ldots, c-1, \qquad \mathbf{y} = \mathbf{W}^t\mathbf{x}$$
$$\mathbf{m}_i = \frac{1}{n_i}\sum_{\mathbf{x}\in D_i}\mathbf{x}, \qquad \mathbf{m} = \frac{1}{n}\sum_{\mathbf{x}}\mathbf{x} = \frac{1}{n}\sum_{i=1}^{c}n_i\mathbf{m}_i$$
$$\mathbf{S}_W = \sum_{i=1}^{c}\mathbf{S}_i, \qquad \mathbf{S}_i = \sum_{\mathbf{x}\in D_i}(\mathbf{x}-\mathbf{m}_i)(\mathbf{x}-\mathbf{m}_i)^t$$
$$\tilde{\mathbf{m}}_i = \frac{1}{n_i}\sum_{\mathbf{y}\in Y_i}\mathbf{y}, \qquad \tilde{\mathbf{m}} = \frac{1}{n}\sum_{i=1}^{c}n_i\tilde{\mathbf{m}}_i, \qquad \tilde{\mathbf{S}}_W = \sum_{i=1}^{c}\sum_{\mathbf{y}\in Y_i}(\mathbf{y}-\tilde{\mathbf{m}}_i)(\mathbf{y}-\tilde{\mathbf{m}}_i)^t = \mathbf{W}^t\mathbf{S}_W\mathbf{W}$$

69

Multiple Discriminant Analysis

Total scatter:
$$\mathbf{S}_T = \sum_{\mathbf{x}}(\mathbf{x}-\mathbf{m})(\mathbf{x}-\mathbf{m})^t = \sum_{i=1}^{c}\sum_{\mathbf{x}\in D_i}(\mathbf{x}-\mathbf{m}_i+\mathbf{m}_i-\mathbf{m})(\mathbf{x}-\mathbf{m}_i+\mathbf{m}_i-\mathbf{m})^t$$
$$= \sum_{i=1}^{c}\sum_{\mathbf{x}\in D_i}(\mathbf{x}-\mathbf{m}_i)(\mathbf{x}-\mathbf{m}_i)^t + \sum_{i=1}^{c}n_i(\mathbf{m}_i-\mathbf{m})(\mathbf{m}_i-\mathbf{m})^t = \mathbf{S}_W + \mathbf{S}_B$$
$$\mathbf{S}_B = \sum_{i=1}^{c}n_i(\mathbf{m}_i-\mathbf{m})(\mathbf{m}_i-\mathbf{m})^t$$

70

Multiple Discriminant Analysis

$$\tilde{\mathbf{S}}_B = \sum_{i=1}^{c}n_i(\tilde{\mathbf{m}}_i-\tilde{\mathbf{m}})(\tilde{\mathbf{m}}_i-\tilde{\mathbf{m}})^t = \mathbf{W}^t\mathbf{S}_B\mathbf{W}$$
Seek a transformation W to maximize the ratio of the between-class scatter to the within-class scatter.
A simple scalar measure of scatter is the determinant of the scatter matrix (equivalent to the product of the variances in the principal directions). Let
$$J(\mathbf{W}) = \frac{|\tilde{\mathbf{S}}_B|}{|\tilde{\mathbf{S}}_W|} = \frac{|\mathbf{W}^t\mathbf{S}_B\mathbf{W}|}{|\mathbf{W}^t\mathbf{S}_W\mathbf{W}|}$$

71

Multiple Discriminant Analysis

Columns of the optimal W satisfy
$$\mathbf{S}_B\mathbf{w}_i = \lambda_i\mathbf{S}_W\mathbf{w}_i$$
and are the generalized eigenvectors related to the largest eigenvalues.
The optimal W is not unique, since it can be multiplied by rotation or scaling matrices, etc.

72

Expectation-Maximization (EM)

Finding the maximum-likelihood estimate of the parameters of an underlying distribution:
– from a given data set when the data is incomplete or has missing values
Two main applications:
– When the data indeed has missing values
– When optimizing the likelihood function is analytically intractable, but the likelihood function can be simplified by assuming the existence of and values for additional but missing (or hidden) parameters

73

Expectation-Maximization (EM)

Full sample $D = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, with $\mathbf{x}_k = \{\mathbf{x}_{kg}, \mathbf{x}_{kb}\}$.
Separate the individual features into good values $D_g$ and bad (missing) values $D_b$:
– D is the union of $D_g$ and $D_b$
Form the function
$$Q(\boldsymbol{\theta}; \boldsymbol{\theta}^i) = E_{D_b}\left[\ln p(D_g, D_b; \boldsymbol{\theta}) \mid D_g; \boldsymbol{\theta}^i\right]$$

74

Expectation-Maximization (EM)

begin initialize θ⁰, T, i ← 0
  do i ← i + 1
    E step: compute Q(θ; θ^i)
    M step: θ^(i+1) ← arg max_θ Q(θ; θ^i)
  until Q(θ^(i+1); θ^i) − Q(θ^i; θ^(i−1)) ≤ T
  return θ̂ ← θ^(i+1)
end
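A generic Python skeleton of the loop above (my own sketch; `Q_fn` and `argmax_Q` are hypothetical placeholders that a concrete model would supply).

```python
# Generic EM loop mirroring slide 74. Q_fn(theta, theta_i) evaluates
# Q(theta; theta_i); argmax_Q(theta_i) returns arg max_theta Q(theta; theta_i).
def em(theta0, Q_fn, argmax_Q, T=1e-6, max_iter=100):
    theta_prev, theta = None, theta0
    for _ in range(max_iter):
        theta_next = argmax_Q(theta)          # E step + M step for theta_i = theta
        if theta_prev is not None:
            # stopping rule: Q(theta^{i+1}; theta^i) - Q(theta^i; theta^{i-1}) <= T
            if Q_fn(theta_next, theta) - Q_fn(theta, theta_prev) <= T:
                return theta_next
        theta_prev, theta = theta, theta_next
    return theta
```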

75

Expectation-Maximization (EM)

76

Example: 2D Model

Assume a 2D Gaussian model with diagonal covariance matrix:
$$\boldsymbol{\theta} = (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2)^t$$
$$D = \{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4\} = \left\{\begin{pmatrix}0\\2\end{pmatrix}, \begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}2\\2\end{pmatrix}, \begin{pmatrix}*\\4\end{pmatrix}\right\}, \qquad D_b = \{x_{41}\}$$
where * denotes the missing feature value $x_{41}$, and the starting point is
$$\boldsymbol{\theta}^0 = (0, 0, 1, 1)^t$$

77

Example: 2D Model

$$Q(\boldsymbol{\theta}; \boldsymbol{\theta}^0) = E_{x_{41}}\left[\ln p(\mathbf{x}_g, \mathbf{x}_b; \boldsymbol{\theta}) \mid \boldsymbol{\theta}^0, D_g\right]$$
$$= \int_{-\infty}^{\infty}\left[\sum_{k=1}^{3}\ln p(\mathbf{x}_k|\boldsymbol{\theta}) + \ln p(\mathbf{x}_4|\boldsymbol{\theta})\right]p(x_{41}\,|\,\boldsymbol{\theta}^0; x_{42}=4)\,dx_{41}$$
$$= \sum_{k=1}^{3}\ln p(\mathbf{x}_k|\boldsymbol{\theta}) + \frac{1}{K}\int_{-\infty}^{\infty}\ln p\!\left(\begin{pmatrix}x_{41}\\4\end{pmatrix}\Big|\boldsymbol{\theta}\right)p\!\left(\begin{pmatrix}x_{41}\\4\end{pmatrix}\Big|\boldsymbol{\theta}^0\right)dx_{41}$$
$$K = \int_{-\infty}^{\infty} p\!\left(\begin{pmatrix}x_{41}\\4\end{pmatrix}\Big|\boldsymbol{\theta}^0\right)dx_{41}$$

78

Example: 2D Model

$$Q(\boldsymbol{\theta}; \boldsymbol{\theta}^0) = \sum_{k=1}^{3}\ln p(\mathbf{x}_k|\boldsymbol{\theta}) + \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{x_{41}^2}{2}\right]\ln p\!\left(\begin{pmatrix}x_{41}\\4\end{pmatrix}\Big|\boldsymbol{\theta}\right)dx_{41}$$
$$\ln p\!\left(\begin{pmatrix}x_{41}\\4\end{pmatrix}\Big|\boldsymbol{\theta}\right) = -\ln(2\pi\sigma_1\sigma_2) - \frac{(x_{41}-\mu_1)^2}{2\sigma_1^2} - \frac{(4-\mu_2)^2}{2\sigma_2^2}$$
Maximizing $Q(\boldsymbol{\theta}; \boldsymbol{\theta}^0)$ gives
$$\boldsymbol{\theta}^1 = (0.75,\ 2.0,\ 0.938,\ 2.0)^t$$

79

Example: 2D Model

After 3 iterations, the algorithm converges at
$$\boldsymbol{\mu} = \begin{pmatrix}1.0\\2.0\end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix}0.667 & 0\\ 0 & 2.0\end{pmatrix}$$

80

Generalized Expectation-Maximization (GEM)

Instead of maximizing Q(θ; θ^i), we find some θ^(i+1) such that
$$Q(\boldsymbol{\theta}^{i+1}; \boldsymbol{\theta}^i) > Q(\boldsymbol{\theta}^{i}; \boldsymbol{\theta}^i)$$
and convergence is still guaranteed.
Convergence will not be as rapid.
Offers great freedom to choose computationally simpler steps:
– e.g., using the maximum-likelihood value of the unknown values, if it leads to a greater likelihood

81

Hidden Markov Model (HMM)

Used for problems of making a series of decisions:
– e.g., speech or gesture recognition
Problem states at time t are influenced directly by the state at t−1.
More reference:
– L. R. Rabiner and B. W. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993, Chapter 6.

82

First Order Markov Models

Sequence of states ω(1), ω(2), ..., ω(T), denoted ω^T.
e.g., ω^6 = {ω_1, ω_3, ω_2, ω_2, ω_1, ω_3}
$$P(\boldsymbol{\omega}^6|\boldsymbol{\theta}) = a_{13}\,a_{32}\,a_{22}\,a_{21}\,a_{13}$$

83

First Order Hidden Markov Models

Sequence of visible states v(1), v(2), ..., v(T), denoted V^T.
e.g., V^6 = {v_4, v_1, v_1, v_4, v_2, v_3}
$$P(v_k(t)\,|\,\omega_j(t)) = b_{jk}$$

84

Hidden Markov Model Probabilities

Final or absorbing state ω_0: a_00 = 1
Transition probability: $a_{ij} = P(\omega_j(t+1)\,|\,\omega_i(t))$
Probability of emission of a visible state: $b_{jk} = P(v_k(t)\,|\,\omega_j(t))$
$$\sum_{j} a_{ij} = 1, \qquad \sum_{k} b_{jk} = 1$$

85

Hidden Markov Model Computation

Evaluation problem:
– Given a_ij and b_jk, determine P(V^T|θ)
Decoding problem:
– Given V^T, determine the most likely sequence of hidden states that leads to V^T
Learning problem:
– Given training observations of visible symbols and the coarse structure but not the probabilities, determine a_ij and b_jk

86

Evaluation

$$P(\mathbf{V}^T) = \sum_{r=1}^{r_{\max}} P(\mathbf{V}^T|\boldsymbol{\omega}_r^T)\,P(\boldsymbol{\omega}_r^T)$$
where $\boldsymbol{\omega}_r^T$ is a particular sequence of T hidden states and $r_{\max}$ is the number of possible sequences.
$$P(\boldsymbol{\omega}_r^T) = \prod_{t=1}^{T} P(\omega(t)\,|\,\omega(t-1)), \qquad P(\mathbf{V}^T|\boldsymbol{\omega}_r^T) = \prod_{t=1}^{T} P(v(t)\,|\,\omega(t))$$
$$P(\mathbf{V}^T) = \sum_{r=1}^{r_{\max}} \prod_{t=1}^{T} P(v(t)\,|\,\omega(t))\,P(\omega(t)\,|\,\omega(t-1))$$

87

HMM Forward

$$P(\mathbf{V}^T) = \sum_{r=1}^{r_{\max}} \prod_{t=1}^{T} P(v(t)\,|\,\omega(t))\,P(\omega(t)\,|\,\omega(t-1))$$
Define $\alpha_j(t)$ as the probability that the model is in hidden state $\omega_j$ at step $t$, having generated the first $t$ symbols of $\mathbf{V}^T$:
$$\alpha_j(0) = \begin{cases}0 & j \neq \text{initial state}\\ 1 & j = \text{initial state}\end{cases}$$
$$\alpha_j(t) = \left[\sum_{i=1}^{c}\alpha_i(t-1)\,a_{ij}\right] b_{jk\,v(t)}$$
where $b_{jk\,v(t)}$ is the probability that $\omega_j$ emits the visible symbol actually observed at step $t$.
$$P(\mathbf{V}^T) = \alpha_0(T) \quad \text{for the final (absorbing) state } \omega_0$$

88

HMM Forward and Trellis

89

HMM Forward

initialize t ← 0, a_ij, b_jk, visible sequence V^T, α_j(0)
for t ← t + 1
  α_j(t) ← b_jk v(t) Σ_{i=1}^{c} α_i(t−1) a_ij
until t = T
return P(V^T) ← α_0(T) for the final state
end
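A minimal sketch of the forward recursion above, written without the absorbing-state bookkeeping (my own illustration; the transition matrix, emission matrix, and observation sequence are made up).

```python
# HMM forward algorithm: alpha_j(t) = b_{j, v(t)} * sum_i alpha_i(t-1) a_ij.
import numpy as np

a = np.array([[0.7, 0.3],
              [0.4, 0.6]])          # a_ij, rows sum to 1
b = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])     # b_jk, rows sum to 1
V = [0, 2, 1, 2]                    # observed visible symbols v(1)..v(T)

alpha = np.array([1.0, 0.0])        # alpha_j(0): start in state 0
for v in V:
    alpha = b[:, v] * (alpha @ a)   # forward update for one time step

print(alpha.sum())                  # P(V^T), summed over final hidden states
```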

90

HMM Backward

$$P(\mathbf{V}^T) = \sum_{r=1}^{r_{\max}} \prod_{t=1}^{T} P(v(t)\,|\,\omega(t))\,P(\omega(t)\,|\,\omega(t-1))$$
Define $\beta_i(t)$ as the probability that the model in state $\omega_i$ at step $t$ will generate the remainder of the target sequence, $v(t+1), \ldots, v(T)$:
$$\beta_i(T) = \begin{cases}0 & \omega_i(T) \neq \omega_0\\ 1 & \omega_i(T) = \omega_0\end{cases}$$
$$\beta_i(t) = \sum_{j=1}^{c}\beta_j(t+1)\,a_{ij}\,b_{jk\,v(t+1)}$$
$$P(\mathbf{V}^T) = \beta_{init}(0) \quad \text{for the known initial state}$$

91

HMM Backward

initialize ω(T), t ← T, a_ij, b_jk, visible sequence V^T, β_j(T)
for t ← t − 1
  β_i(t) ← Σ_{j=1}^{c} β_j(t+1) a_ij b_jk v(t+1)
until t = 0
return P(V^T) ← β_i(0) for the initial state
end

92

Example 3: Hidden Markov Model

93

Example 3: Hidden Markov Model

$$a_{ij} = \begin{pmatrix}1 & 0 & 0 & 0\\ 0.2 & 0.3 & 0.1 & 0.4\\ 0.2 & 0.5 & 0.2 & 0.1\\ 0.8 & 0.1 & 0.0 & 0.1\end{pmatrix}, \qquad b_{jk} = \begin{pmatrix}1 & 0 & 0 & 0 & 0\\ 0 & 0.3 & 0.4 & 0.1 & 0.2\\ 0 & 0.1 & 0.1 & 0.7 & 0.1\\ 0 & 0.5 & 0.2 & 0.1 & 0.2\end{pmatrix}$$

94

Example 3: Hidden Markov Model

95

Left-to-Right Models for Speech

$$P(\boldsymbol{\omega}^T|\mathbf{V}^T) = \frac{P(\mathbf{V}^T|\boldsymbol{\omega}^T)\,P(\boldsymbol{\omega}^T)}{P(\mathbf{V}^T)}$$

96

HMM Decoding

97

Problem of Local Optimization

This decoding algorithm depends only on the single previous time step, not on the full sequence.
It does not guarantee that the resulting path is indeed allowable.

98

HMM Decoding

begin initialize Path ← {}, t ← 0
  for t ← t + 1
    j ← 0; for j ← j + 1
      α_j(t) ← b_jk v(t) Σ_{i=1}^{c} α_i(t−1) a_ij
    until j = c
    j' ← arg max_j α_j(t)
    append ω_{j'} to Path
  until t = T
return Path
end
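A sketch of the decoding step above (my own illustration, reusing the made-up model from the forward sketch). Note that this is the slides' locally optimal decoder, which picks the best state at each step, not the full Viterbi algorithm.

```python
# Local HMM decoding (slide 98): run the forward recursion and record
# arg max_j alpha_j(t) at each step.
import numpy as np

a = np.array([[0.7, 0.3],
              [0.4, 0.6]])
b = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
V = [0, 2, 1, 2]

alpha = np.array([1.0, 0.0])
path = []
for v in V:
    alpha = b[:, v] * (alpha @ a)
    path.append(int(np.argmax(alpha)))   # omega_{j'} with j' = arg max_j alpha_j(t)

print(path)
```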

99

Example 4: HMM Decoding

100

Forward-Backward Algorithm

Determines the model parameters, a_ij and b_jk, from an ensemble of training samples.
An instance of a generalized expectation-maximization algorithm.
There is no known method for finding the optimal or most likely set of parameters from the data.

101

Probability of Transition

$$\gamma_{ij}(t) = P(\omega_i(t-1), \omega_j(t)\,|\,\mathbf{V}^T, \boldsymbol{\theta}) = \frac{\alpha_i(t-1)\,a_{ij}\,b_{jk\,v(t)}\,\beta_j(t)}{P(\mathbf{V}^T\,|\,\boldsymbol{\theta})}$$

102

Improved Estimate for a_ij

Expected number of transitions between state ω_i(t−1) and ω_j(t) at any time in the sequence:
$$\sum_{t=1}^{T}\gamma_{ij}(t)$$
Total expected number of any transitions from ω_i:
$$\sum_{t=1}^{T}\sum_{k}\gamma_{ik}(t)$$
Estimate of a_ij:
$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T}\gamma_{ij}(t)}{\sum_{t=1}^{T}\sum_{k}\gamma_{ik}(t)}$$

103

Improved Estimate for b_jk

$$\hat{b}_{jk} = \frac{\displaystyle\sum_{\substack{t=1\\ v(t)=v_k}}^{T}\sum_{l}\gamma_{jl}(t)}{\displaystyle\sum_{t=1}^{T}\sum_{l}\gamma_{jl}(t)}$$

104

Forward-Backward Algorithm (Baum-Welch Algorithm)

begin initialize a_ij, b_jk, training sequence V^T, convergence threshold θ, z ← 0
  do z ← z + 1
    compute all â(z) from a(z−1) and b(z−1)
    compute all b̂(z) from a(z−1) and b(z−1)
    a_ij(z) ← â_ij(z)
    b_jk(z) ← b̂_jk(z)
  until max_{i,j,k} [ |a_ij(z) − a_ij(z−1)|, |b_jk(z) − b_jk(z−1)| ] < θ
  return a_ij ← a_ij(z); b_jk ← b_jk(z)
end
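A sketch of one re-estimation pass of slides 101-104 (my own illustration; it uses an explicit start distribution instead of the slides' absorbing-state convention, and the model matrices and observation sequence are made up).

```python
# One Baum-Welch re-estimation pass: forward, backward, gamma, then a_hat, b_hat.
import numpy as np

a = np.array([[0.6, 0.4],
              [0.3, 0.7]])            # a_ij
b = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.3, 0.5]])       # b_jk
pi = np.array([0.8, 0.2])             # start distribution (assumption)
V = [0, 2, 1, 1, 2]                   # observed symbols
T, c = len(V), a.shape[0]

# forward and backward passes
alpha = np.zeros((T, c)); beta = np.ones((T, c))
alpha[0] = pi * b[:, V[0]]
for t in range(1, T):
    alpha[t] = b[:, V[t]] * (alpha[t - 1] @ a)
for t in range(T - 2, -1, -1):
    beta[t] = a @ (b[:, V[t + 1]] * beta[t + 1])
P = alpha[-1].sum()                    # P(V^T | theta)

# gamma_ij(t): expected transition i -> j at step t (slide 101)
gamma = np.zeros((T - 1, c, c))
for t in range(T - 1):
    gamma[t] = alpha[t][:, None] * a * b[:, V[t + 1]][None, :] * beta[t + 1][None, :] / P

a_hat = gamma.sum(axis=0) / gamma.sum(axis=(0, 2))[:, None]     # slide 102
state_post = alpha * beta / P                                    # P(state j at t | V^T)
b_hat = np.zeros_like(b)
for k in range(b.shape[1]):                                      # slide 103
    mask = np.array([v == k for v in V])
    b_hat[:, k] = state_post[mask].sum(axis=0) / state_post.sum(axis=0)

print(a_hat, b_hat, sep="\n")
```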
