Bayesian Learning for Latent Semantic Analysis
Jen-Tzung Chien, Meng-Sun Wu and Chia-Sheng Wu
Presenter: Hsuan-Sheng Chiu
Speech Lab. NTNU
Reference
Chia-Sheng Wu, "Bayesian Latent Semantic Analysis for Text Categorization and Information Retrieval", 2005
Q. Huo and C.-H. Lee, "On-line adaptive learning of the continuous density hidden Markov model based on approximate recursive Bayes estimate", 1997
Outline
Introduction
PLSA
ML (Maximum Likelihood)
MAP (Maximum A Posteriori)
QB (Quasi-Bayes)
Experiments
Conclusions
Introduction
LSA vs. PLSA:
Linear algebra vs. probability
Semantic space vs. latent topics
Batch learning vs. incremental learning
PLSA
PLSA is a general machine learning technique which adopts the aspect model to represent co-occurrence data.
Topics (hidden variables): $Z = \{z_1, \ldots, z_K\}$
Corpus (document-word pairs): $Y = \{(d_i, w_j)\}$ with documents $d_1, \ldots, d_N$ and words $w_1, \ldots, w_M$
PLSA
Assume that $d_i$ and $w_j$ are conditionally independent given the associated topic $z_k$:
$$P(d_i, w_j \mid z_k) = P(d_i \mid z_k)\, P(w_j \mid z_k)$$
Joint probability:
$$\begin{aligned}
P(d_i, w_j) &= P(d_i)\, P(w_j \mid d_i) = P(d_i) \sum_{k=1}^{K} P(z_k, w_j \mid d_i) \\
&= P(d_i) \sum_{k=1}^{K} \frac{P(z_k)\, P(w_j, d_i \mid z_k)}{P(d_i)}
= P(d_i) \sum_{k=1}^{K} \frac{P(z_k)\, P(w_j \mid z_k)\, P(d_i \mid z_k)}{P(d_i)} \\
&= P(d_i) \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid d_i)
\end{aligned}$$
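As a quick numerical sanity check of the decomposition $P(d_i, w_j) = P(d_i) \sum_k P(w_j \mid z_k)\, P(z_k \mid d_i)$ — a minimal sketch, not the paper's code; the dimensions and all parameter values below are toy assumptions:

```python
import numpy as np

# Hypothetical toy dimensions: N documents, M words, K topics.
N, M, K = 4, 6, 2
rng = np.random.default_rng(0)

def normalize(a, axis):
    return a / a.sum(axis=axis, keepdims=True)

P_w_given_z = normalize(rng.random((K, M)), axis=1)  # P(w_j | z_k)
P_z_given_d = normalize(rng.random((N, K)), axis=1)  # P(z_k | d_i)
P_d = np.full(N, 1.0 / N)                            # P(d_i), uniform here

# P(d_i, w_j) = P(d_i) * sum_k P(w_j | z_k) P(z_k | d_i)
P_dw = P_d[:, None] * (P_z_given_d @ P_w_given_z)

# Each row sums to P(d_i); the whole table sums to 1.
assert np.allclose(P_dw.sum(axis=1), P_d)
assert np.isclose(P_dw.sum(), 1.0)
```

Every row of `P_dw` sums to $P(d_i)$ and the full table sums to one, confirming the decomposition yields a valid joint distribution.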
ML PLSA
Log likelihood of $Y$:
$$\log P(Y \mid \theta) = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(d_i, w_j)$$
ML estimation:
$$\theta_{\mathrm{ML}} = \arg\max_{\theta}\, \log P(Y \mid \theta), \qquad \theta = \left\{ P(w_j \mid z_k),\; P(z_k \mid d_i) \right\}$$
ML PLSA
Maximization:
$$\begin{aligned}
\max_\theta \log P(Y \mid \theta) &= \max_\theta \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(d_i, w_j) \\
&= \max_\theta \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \left[ \log P(d_i) + \log P(w_j \mid d_i) \right] \\
&= \max_\theta \left[ \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(d_i) + \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(w_j \mid d_i) \right] \\
&\equiv \max_\theta \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(w_j \mid d_i)
\end{aligned}$$
since the $\log P(d_i)$ term does not depend on $\theta$.
ML PLSA
EM (Expectation-Maximization) algorithm with E-step and M-step.
Complete data: $P(w_j, z_k \mid d_i)$
Incomplete data: $P(w_j \mid d_i)$
$$P(w_j, z_k \mid d_i) = P(w_j \mid d_i)\, P(z_k \mid d_i, w_j)$$
ML PLSA
E-step — take the expectation of the log likelihood over the hidden topics:
$$\begin{aligned}
\sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j)\, E_{z \mid d_i, w_j}\!\left[ \log P(w_j \mid d_i) \right]
&= \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j)\, E_{z \mid d_i, w_j}\!\left[ \log P(w_j, z_k \mid d_i) - \log P(z_k \mid d_i, w_j) \right] \\
&= \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} n(d_i, w_j)\, P(z_k \mid d_i, w_j) \log P(w_j, z_k \mid d_i) \\
&\quad - \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} n(d_i, w_j)\, P(z_k \mid d_i, w_j) \log P(z_k \mid d_i, w_j)
\end{aligned}$$
with the new parameter set $\hat\theta = \{ \hat P(w_j \mid z_k),\, \hat P(z_k \mid d_i) \}$.
ML PLSA
Auxiliary function:
$$Q(\hat\theta \mid \theta) = E\!\left[ \log P(Z, Y \mid \hat\theta) \,\middle|\, Y, \theta \right] = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} n(d_i, w_j)\, P(z_k \mid d_i, w_j) \log\!\left[ \hat P(w_j \mid z_k)\, \hat P(z_k \mid d_i) \right]$$
and
$$P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l)\, P(z_l \mid d_i)}$$
ML PLSA
M-step — introduce Lagrange multipliers $\rho_k$ and $\tau_i$ for the sum-to-one constraints:
$$Q^{\mathrm{ML}}_{P(w_j \mid z_k)} = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} n(d_i, w_j)\, P(z_k \mid d_i, w_j) \log \hat P(w_j \mid z_k) + \sum_{k=1}^{K} \rho_k \left( 1 - \sum_{j=1}^{M} \hat P(w_j \mid z_k) \right)$$
$$Q^{\mathrm{ML}}_{P(z_k \mid d_i)} = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} n(d_i, w_j)\, P(z_k \mid d_i, w_j) \log \hat P(z_k \mid d_i) + \sum_{i=1}^{N} \tau_i \left( 1 - \sum_{k=1}^{K} \hat P(z_k \mid d_i) \right)$$
ML PLSA
Differentiation — for an objective of the form
$$F(\mathbf{w}) = \sum_{j=1}^{N} y_j \log w_j + \lambda \left( 1 - \sum_{j=1}^{N} w_j \right), \qquad \frac{\partial F}{\partial w_j} = 0 \;\Rightarrow\; w_j = \frac{y_j}{\sum_{j'=1}^{N} y_{j'}}$$
New parameter estimates:
$$\hat P_{\mathrm{ML}}(w_j \mid z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j)\, P(z_k \mid d_i, w_j)}{\sum_{m=1}^{M} \sum_{i=1}^{N} n(d_i, w_m)\, P(z_k \mid d_i, w_m)}$$
$$\hat P_{\mathrm{ML}}(z_k \mid d_i) = \frac{\sum_{j=1}^{M} n(d_i, w_j)\, P(z_k \mid d_i, w_j)}{\sum_{l=1}^{K} \sum_{j=1}^{M} n(d_i, w_j)\, P(z_l \mid d_i, w_j)}$$
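The E-step posterior and the M-step re-estimates can be run end-to-end. A minimal NumPy sketch of ML PLSA on toy counts (dimensions, counts, and initialization are all illustrative assumptions, not from the paper):

```python
import numpy as np

# Toy term-document counts n(d_i, w_j): N=4 docs, M=6 words, K=2 topics.
rng = np.random.default_rng(1)
n = rng.integers(0, 5, size=(4, 6)).astype(float)

def normalize(a, axis):
    return a / a.sum(axis=axis, keepdims=True)

P_w_z = normalize(rng.random((2, 6)), axis=1)  # P(w_j | z_k)
P_z_d = normalize(rng.random((4, 2)), axis=1)  # P(z_k | d_i)

def loglik():
    # Incomplete-data objective: sum_ij n(d_i,w_j) log P(w_j | d_i)
    return float((n * np.log(P_z_d @ P_w_z)).sum())

ll_history = []
for _ in range(50):
    ll_history.append(loglik())
    # E-step: P(z_k | d_i, w_j) proportional to P(w_j | z_k) P(z_k | d_i)
    post = P_z_d[:, :, None] * P_w_z[None, :, :]       # shape (N, K, M)
    post = post / post.sum(axis=1, keepdims=True)
    # M-step: re-estimate from expected counts n(d,w) P(z|d,w)
    expected = n[:, None, :] * post                    # shape (N, K, M)
    P_w_z = normalize(expected.sum(axis=0), axis=1)    # new P(w_j | z_k)
    P_z_d = normalize(expected.sum(axis=2), axis=1)    # new P(z_k | d_i)
```

EM monotonically increases the incomplete-data log likelihood, which `ll_history` makes visible.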
MAP PLSA
Estimation by maximizing the posterior probability:
$$\theta_{\mathrm{MAP}} = \arg\max_\theta P(\theta \mid X) = \arg\max_\theta \left[ \log P(X \mid \theta) + \log g(\theta) \right]$$
Definition of the prior distribution — Dirichlet density:
$$f(x_1, \ldots, x_K; \alpha_1, \ldots, \alpha_K) \propto \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \qquad x_i \geq 0, \quad \sum_{i=1}^{K} x_i = 1$$
Prior density, assuming $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$ are independent:
$$g(\theta) \propto \prod_{k=1}^{K} \left[ \prod_{j=1}^{M} P(w_j \mid z_k)^{\alpha_{j,k} - 1} \right] \left[ \prod_{i=1}^{N} P(z_k \mid d_i)^{\beta_{i,k} - 1} \right]$$
(Kronecker delta: $\delta_{ij} = 1$ if $i = j$, $0$ otherwise.)
MAP PLSA
Consider the prior density:
$$\log g(\theta) \propto \sum_{k=1}^{K} \sum_{j=1}^{M} (\alpha_{j,k} - 1) \log P(w_j \mid z_k) + \sum_{i=1}^{N} \sum_{k=1}^{K} (\beta_{i,k} - 1) \log P(z_k \mid d_i)$$
Maximum a posteriori:
$$\max_\theta \left[ \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(w_j \mid d_i) + \log g(\theta) \right]$$
MAP PLSA
E-step (expectation):
$$\tilde R(\hat\theta \mid \theta) = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j)\, E_{z \mid d_i, w_j}\!\left[ \log P(w_j \mid d_i) \right] + \log g(\hat\theta)$$
Auxiliary function:
$$\begin{aligned}
\tilde R(\hat\theta \mid \theta) &= \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} n(d_i, w_j)\, P(z_k \mid d_i, w_j) \log\!\left[ \hat P(w_j \mid z_k)\, \hat P(z_k \mid d_i) \right] \\
&\quad + \sum_{j=1}^{M} \sum_{k=1}^{K} (\alpha_{j,k} - 1) \log \hat P(w_j \mid z_k) + \sum_{i=1}^{N} \sum_{k=1}^{K} (\beta_{i,k} - 1) \log \hat P(z_k \mid d_i)
\end{aligned}$$
MAP PLSA
M-step — add Lagrange multipliers for the sum-to-one constraints:
$$\begin{aligned}
\tilde R(\hat\theta \mid \theta) &= \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} n(d_i, w_j)\, P(z_k \mid d_i, w_j) \log\!\left[ \hat P(w_j \mid z_k)\, \hat P(z_k \mid d_i) \right] \\
&\quad + \sum_{j=1}^{M} \sum_{k=1}^{K} (\alpha_{j,k} - 1) \log \hat P(w_j \mid z_k) + \sum_{i=1}^{N} \sum_{k=1}^{K} (\beta_{i,k} - 1) \log \hat P(z_k \mid d_i) \\
&\quad + \sum_{k=1}^{K} \rho_k \left( 1 - \sum_{j=1}^{M} \hat P(w_j \mid z_k) \right) + \sum_{i=1}^{N} \tau_i \left( 1 - \sum_{k=1}^{K} \hat P(z_k \mid d_i) \right)
\end{aligned}$$
MAP PLSA
Auxiliary functions for the two parameter sets:
$$Q^{\mathrm{MAP}}_{P(w_j \mid z_k)} = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} n(d_i, w_j)\, P(z_k \mid d_i, w_j) \log \hat P(w_j \mid z_k) + \sum_{j=1}^{M} \sum_{k=1}^{K} (\alpha_{j,k} - 1) \log \hat P(w_j \mid z_k) + \sum_{k=1}^{K} \rho_k \left( 1 - \sum_{j=1}^{M} \hat P(w_j \mid z_k) \right)$$
$$Q^{\mathrm{MAP}}_{P(z_k \mid d_i)} = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} n(d_i, w_j)\, P(z_k \mid d_i, w_j) \log \hat P(z_k \mid d_i) + \sum_{i=1}^{N} \sum_{k=1}^{K} (\beta_{i,k} - 1) \log \hat P(z_k \mid d_i) + \sum_{i=1}^{N} \tau_i \left( 1 - \sum_{k=1}^{K} \hat P(z_k \mid d_i) \right)$$
MAP PLSA
Differentiation — new parameter estimates:
$$\hat P_{\mathrm{MAP}}(w_j \mid z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j)\, P(z_k \mid d_i, w_j) + \alpha_{j,k} - 1}{\sum_{m=1}^{M} \left[ \sum_{i=1}^{N} n(d_i, w_m)\, P(z_k \mid d_i, w_m) + \alpha_{m,k} - 1 \right]}$$
$$\hat P_{\mathrm{MAP}}(z_k \mid d_i) = \frac{\sum_{j=1}^{M} n(d_i, w_j)\, P(z_k \mid d_i, w_j) + \beta_{i,k} - 1}{n(d_i) + \sum_{l=1}^{K} (\beta_{i,l} - 1)}$$
where
$$n(d_i) = \sum_{k=1}^{K} \sum_{j=1}^{M} n(d_i, w_j)\, P(z_k \mid d_i, w_j) = \sum_{j=1}^{M} n(d_i, w_j)$$
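The MAP re-estimates differ from the ML ones only by the Dirichlet pseudo-counts $\alpha_{j,k}-1$ and $\beta_{i,k}-1$. A sketch of one MAP M-step under assumed hyperparameters (all values are toy; $\alpha, \beta > 1$ is assumed so the numerators stay positive):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 4, 6, 2
n = rng.integers(1, 5, size=(N, M)).astype(float)   # n(d_i, w_j)
post = rng.random((N, K, M))                        # stand-in for P(z_k | d_i, w_j)
post /= post.sum(axis=1, keepdims=True)
alpha = np.full((M, K), 2.0)                        # Dirichlet hyperparams for P(w|z)
beta = np.full((N, K), 2.0)                         # Dirichlet hyperparams for P(z|d)

expected = n[:, None, :] * post                     # n(d,w) P(z|d,w), shape (N, K, M)

# P_MAP(w_j | z_k): expected counts plus (alpha_{j,k} - 1), normalized over words
num_wz = expected.sum(axis=0).T + (alpha - 1.0)     # shape (M, K)
P_w_z_map = (num_wz / num_wz.sum(axis=0, keepdims=True)).T

# P_MAP(z_k | d_i): expected counts plus (beta_{i,k} - 1), normalized over topics
num_zd = expected.sum(axis=2) + (beta - 1.0)        # shape (N, K)
P_z_d_map = num_zd / num_zd.sum(axis=1, keepdims=True)
```

With $\alpha = \beta = 1$ (a flat prior) this collapses exactly to the ML updates above.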
QB PLSA
An online information system needs to update its model continuously. Estimation by maximizing the posterior probability:
$$\theta^{(n)}_{\mathrm{QB}} = \arg\max_\theta P\!\left(\theta \mid X^{n}\right) = \arg\max_\theta P(X_n \mid \theta)\, P\!\left(\theta \mid X^{n-1}\right) \approx \arg\max_\theta P(X_n \mid \theta)\, g\!\left(\theta \mid \varphi^{(n-1)}\right)$$
The posterior density is approximated by the closest tractable prior density with hyperparameters $\varphi^{(n-1)} = \left\{ \alpha^{(n-1)}_{j,k},\, \beta^{(n-1)}_{i,k} \right\}$.
Compared with MAP PLSA, the key difference in QB PLSA is the updating of the hyperparameters:
$$\theta^{(n)}_{\mathrm{QB}} = \left\{ \hat P^{(n)}_{\mathrm{QB}}(w_j \mid z_k),\; \hat P^{(n)}_{\mathrm{QB}}(z_k \mid d_i) \right\}$$
QB PLSA
Conjugate prior: in Bayesian probability theory, a conjugate prior is a prior distribution with the property that the posterior distribution is the same type of distribution.
A closed-form solution
A reproducible prior/posterior pair for incremental learning
QB PLSA
Hyperparameter $\alpha$ — differentiating the log prior under the sum-to-one constraint,
$$\frac{\partial}{\partial \hat P(w_j \mid z_k)} \left[ \sum_{j=1}^{M} \sum_{k=1}^{K} (\alpha_{j,k} - 1) \log \hat P(w_j \mid z_k) + \sum_{k=1}^{K} \rho_k \left( 1 - \sum_{j=1}^{M} \hat P(w_j \mid z_k) \right) \right] = 0 \;\Rightarrow\; \hat P(w_j \mid z_k) = \frac{\alpha_{j,k} - 1}{\sum_{m=1}^{M} (\alpha_{m,k} - 1)}$$
Matching this Dirichlet mode against the estimate from the new data,
$$\hat P(w_j \mid z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j)\, P(z_k \mid d_i, w_j) + \alpha_{j,k} - 1}{\sum_{m=1}^{M} \left[ \sum_{i=1}^{N} n(d_i, w_m)\, P(z_k \mid d_i, w_m) + \alpha_{m,k} - 1 \right]}$$
yields the hyperparameter update
$$\alpha^{(n)}_{j,k} = \sum_{i=1}^{N} n\!\left(d^{(n)}_i, w^{(n)}_j\right) P\!\left(z_k \mid d^{(n)}_i, w^{(n)}_j\right) + \alpha^{(n-1)}_{j,k}$$
QB PLSA
After careful rearrangement, the exponential of the posterior expectation function can be expressed as
$$\exp\!\left[ \tilde R\!\left( \hat\theta \mid \theta^{(n)} \right) \right] \propto \prod_{k=1}^{K} \left[ \prod_{j=1}^{M} \hat P(w_j \mid z_k)^{\alpha^{(n)}_{j,k} - 1} \right] \left[ \prod_{i=1}^{N} \hat P(z_k \mid d_i)^{\beta^{(n)}_{i,k} - 1} \right]$$
A reproducible prior/posterior pair is generated to build the updating mechanism of the hyperparameters:
$$\alpha^{(n)}_{j,k} = \sum_{i=1}^{N} n\!\left(d^{(n)}_i, w^{(n)}_j\right) P\!\left(z_k \mid d^{(n)}_i, w^{(n)}_j\right) + \alpha^{(n-1)}_{j,k}$$
$$\beta^{(n)}_{i,k} = \sum_{j=1}^{M} n\!\left(d^{(n)}_i, w^{(n)}_j\right) P\!\left(z_k \mid d^{(n)}_i, w^{(n)}_j\right) + \beta^{(n-1)}_{i,k}$$
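The hyperparameter updates amount to "expected counts from the new batch plus the old hyperparameters", which is what makes the prior/posterior pair reproducible. A toy sketch of one QB update (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, K = 3, 5, 2
alpha = np.full((M, K), 2.0)     # alpha^{(n-1)}_{j,k}
beta = np.full((N, K), 2.0)      # beta^{(n-1)}_{i,k}

n_batch = rng.integers(0, 4, size=(N, M)).astype(float)  # counts in batch n
post = rng.random((N, K, M))                             # stand-in for P(z_k | d_i, w_j)
post /= post.sum(axis=1, keepdims=True)

expected = n_batch[:, None, :] * post        # n(d,w) P(z|d,w), shape (N, K, M)
alpha_new = expected.sum(axis=0).T + alpha   # alpha^{(n)} = E-counts + alpha^{(n-1)}
beta_new = expected.sum(axis=2) + beta       # beta^{(n)}  = E-counts + beta^{(n-1)}
```

Feeding `alpha_new` and `beta_new` in as the prior for batch $n+1$ repeats the cycle, so the prior never leaves the Dirichlet family.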
Initial Hyperparameters
An open issue in Bayesian learning.
If the initial prior knowledge is too strong, or after a large amount of adaptation data has been processed incrementally, new adaptation data usually have only a small impact on parameter updating in incremental training.
$$\alpha^{(0)}_{j,k} = 1 + \sum_{i=1}^{N} P(z_k \mid d_i, w_j), \qquad \beta^{(0)}_{i,k} = 1 + \sum_{j=1}^{M} P(z_k \mid d_i, w_j)$$
Experiments
MED Corpus:
1033 medical abstracts with 30 queries
7014 unique terms
433 abstracts for ML training
600 abstracts for MAP or QB training
Query subset for testing
K = 8
Reuters-21578:
4270 documents for training
2925 for QB learning
2790 documents for testing
13353 unique words
10 categories
Experiments
(Results presented as figures in the original slides.)
Conclusions
This paper presented an adaptive text modeling and classification approach for PLSA-based information systems.
Future work:
Extension of PLSA to bigram or trigram models
Application to spoken document classification and retrieval
Discriminative Maximum Entropy Language Model for Speech Recognition
Chuang-Hua Chueh, To-Chang Chien and Jen-Tzung Chien
Presenter: Hsuan-Sheng Chiu
Reference
R. Rosenfeld, S. F. Chen and X. Zhu, "Whole-sentence exponential language models: a vehicle for linguistic-statistical integration", 2001
W.-H. Tsai, "An Initial Study on Language Model Estimation and Adaptation Techniques for Mandarin Large Vocabulary Continuous Speech Recognition", 2005
Outline
Introduction
Whole-sentence exponential model
Discriminative ME language model
Experiment
Conclusions
Introduction
Language models: statistical n-gram model, latent semantic language model, structured language model.
Based on the maximum entropy principle, different features can be integrated to establish the optimal probability distribution.
Whole-Sentence Exponential Model
Traditional method:
$$p(s) = p(w_1 w_2 \cdots w_n) = \prod_{i=1}^{n} p(w_i \mid w_1 \cdots w_{i-1})$$
Exponential form:
$$p(s) = \frac{1}{Z}\, p_0(s) \exp\!\left[ \sum_i \lambda_i f_i(s) \right]$$
Usage: when used for speech recognition, the model is not suitable for the first pass of the recognizer and should be used to re-score N-best lists.
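A tiny illustration of the exponential form over an enumerable sentence set — the sentences, features, weights, and baseline $p_0$ are all hypothetical toy choices:

```python
import math

# Toy whole-sentence exponential model over three enumerable sentences.
sentences = ["a b", "a b c", "b a"]
features = [lambda s: float(len(s.split())),        # f_1: sentence length
            lambda s: float(s.startswith("a"))]     # f_2: starts with "a"
lam = [0.5, 1.0]                                    # feature weights (assumed)
p0 = {s: 1.0 / len(sentences) for s in sentences}   # baseline p_0(s), uniform

def score(s):
    return p0[s] * math.exp(sum(l * f(s) for l, f in zip(lam, features)))

Z = sum(score(s) for s in sentences)                # partition function
p = {s: score(s) / Z for s in sentences}
assert abs(sum(p.values()) - 1.0) < 1e-12
```

In practice the sentence space is not enumerable, which is exactly why the model is used for N-best re-scoring rather than first-pass decoding.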
Whole-Sentence ME Language Model
Expectation of a feature function:
Empirical:
$$\tilde p\!\left(f^{L}_i\right) = \sum_s \tilde p(s)\, f^{L}_i(s) \approx \frac{1}{R} \sum_{r=1}^{R} f^{L}_i(s_r)$$
Actual:
$$p\!\left(f^{L}_i\right) = \sum_s p(s)\, f^{L}_i(s)$$
Constraint:
$$p\!\left(f^{L}_i\right) = \tilde p\!\left(f^{L}_i\right), \qquad i = 1, \ldots, F$$
Whole-Sentence ME Language Model
To solve the constrained optimization problem, maximize the entropy $H(p)$ subject to the feature constraints and normalization:
$$\Lambda(p, \lambda) = H(p) + \sum_{i=1}^{F} \lambda_i \left[ p\!\left(f^{L}_i\right) - \tilde p\!\left(f^{L}_i\right) \right] + \mu \left[ \sum_s p(s) - 1 \right]$$
Setting $\partial \Lambda / \partial p(s) = -\log p(s) - 1 + \sum_{i=1}^{F} \lambda_i f^{L}_i(s) + \mu = 0$ gives
$$p(s) = \exp\!\left[ \sum_{i=1}^{F} \lambda_i f^{L}_i(s) \right] \exp(\mu - 1)$$
and imposing normalization yields
$$p_{\mathrm{ME}}(s) = \frac{\exp\!\left[ \sum_{i=1}^{F} \lambda_i f^{L}_i(s) \right]}{\sum_{s'} \exp\!\left[ \sum_{i=1}^{F} \lambda_i f^{L}_i(s') \right]}$$
GIS Algorithm
Input: feature functions $f^{L}_1, \ldots, f^{L}_F$ and empirical distribution $\tilde p$
Output: optimal Lagrange multipliers $\hat\lambda$
1. Initialize $\lambda^{(0)}_i = 0$ for all $i = 1, \ldots, F$.
2. For each $i = 1, \ldots, F$, update
$$\lambda^{(t+1)}_i = \lambda^{(t)}_i + \frac{1}{C} \log \frac{\tilde p\!\left(f^{L}_i\right)}{p^{(t)}\!\left(f^{L}_i\right)}, \qquad p^{(t)}\!\left(f^{L}_i\right) = \sum_s p^{(t)}(s)\, f^{L}_i(s)$$
where $C$ is the GIS constant bounding the per-sentence feature sum.
3. Go to step 2 if $\lambda$ has not converged.
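A minimal GIS sketch on a four-event toy space. GIS assumes the feature values of each event sum to a constant $C$, so a slack feature is added as padding; the features, $C$, and the empirical distribution are illustrative assumptions, not from the slides:

```python
import math

events = [0, 1, 2, 3]
p_emp = [0.1, 0.2, 0.3, 0.4]                 # empirical distribution p~
feats = [lambda s: 1.0 if s >= 2 else 0.0,   # two binary toy features
         lambda s: 1.0 if s % 2 == 1 else 0.0]
C = 2.0
# Slack feature so that sum_i f_i(s) = C for every event, as GIS requires.
feats = feats + [lambda s: C - sum(f(s) for f in feats[:2])]
lam = [0.0] * len(feats)                     # step 1: initialize lambda_i = 0

def model():
    w = [math.exp(sum(l * f(s) for l, f in zip(lam, feats))) for s in events]
    Z = sum(w)
    return [x / Z for x in w]

for _ in range(200):                         # steps 2-3: iterate until converged
    p = model()
    for i, f in enumerate(feats):
        emp = sum(pe * f(s) for pe, s in zip(p_emp, events))
        mod = sum(pm * f(s) for pm, s in zip(p, events))
        lam[i] += (1.0 / C) * math.log(emp / mod)
```

After the updates, the model's feature expectations match the empirical ones, which is the fixed point the constraints define.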
Discriminative ME Language Model
In general, ME can be considered a maximum likelihood model using a log-linear distribution.
Proposal: a discriminative language model based on the whole-sentence ME model (DME).
Discriminative ME Language Model
Acoustic feature for ME estimation — sentence-level log-likelihood ratio of competing and target sentences:
$$f^{X}_{A}(s) = \begin{cases} \log \dfrac{p(X \mid s)}{p(X \mid s_X)}, & s \neq s_X \\ 0, & \text{otherwise} \end{cases}$$
where $s_X$ is the target sentence and $s$ a competing sentence of utterance $X$.
Feature weight parameter:
$$\lambda^{X}_{A} = \begin{cases} 1, & X \in A \\ 0, & \text{otherwise} \end{cases}$$
Namely, the feature parameter is activated (set to one) for those speech signals $X$ observed in the training database $A$.
Discriminative ME Language Model
New estimation:
$$p_{\mathrm{LA}}(s) = \frac{\exp\!\left[ \sum_{i=1}^{F} \lambda^{L}_i f^{L}_i(s) + \lambda^{X}_{A} f^{X}_{A}(s) \right]}{\sum_{s'} \exp\!\left[ \sum_{i=1}^{F} \lambda^{L}_i f^{L}_i(s') + \lambda^{X}_{A} f^{X}_{A}(s') \right]}$$
Upgrading to discriminative linguistic parameters:
$$p_{\mathrm{DME}}(s) = \frac{\exp\!\left[ \sum_{i=1}^{F} \lambda^{DL}_i f^{L}_i(s) \right]}{\sum_{s'} \exp\!\left[ \sum_{i=1}^{F} \lambda^{DL}_i f^{L}_i(s') \right]}$$
Discriminative ME Language Model
(Content presented as a figure in the original slide.)
Experiment
Corpus: TCC300
32 mixtures
12 Mel-frequency cepstral coefficients
1 log-energy, plus first derivatives
4200 sentences for training, 450 for testing
Corpus: Academia Sinica CKIP balanced corpus
Five million words
Vocabulary of 32909 words
Experiment
(Results presented as figures in the original slides.)
Conclusions
A new ME language model integrating linguistic and acoustic features for speech recognition.
The derived ME language model is inherently discriminative.
The DME model involves a constrained optimization procedure and is powerful for knowledge integration.
Relation between DME and MMI
MMI criterion:
$$\lambda_{\mathrm{MMI}} = \arg\max_\lambda \log P(S \mid X) = \arg\max_\lambda \log \frac{p(X \mid S)\, p(S)}{\sum_{S'} p(X \mid S')\, p(S')}$$
Modified MMI criterion (expressing the ME model as an ML model over the $R$ training utterances):
$$\tilde F_{\mathrm{MMI}} = \sum_{r=1}^{R} \log \frac{p(X_r \mid s_r)\, p(s_r)}{\sum_{s} p(X_r \mid s)\, p(s)}$$
Relation between DME and MMI
The optimal parameter:
$$\hat\lambda_{\mathrm{DME}} = \arg\max_\lambda \log \prod_{r=1}^{R} p_{\mathrm{LA}}(s_r) = \arg\max_\lambda \log \prod_{r=1}^{R} \frac{\exp\!\left[ \sum_{i=1}^{F} \lambda^{LA}_i f^{L}_i(s_r) + \lambda^{X_r}_{A} f^{X_r}_{A}(s_r) \right]}{\sum_{s'} \exp\!\left[ \sum_{i=1}^{F} \lambda^{LA}_i f^{L}_i(s') + \lambda^{X_r}_{A} f^{X_r}_{A}(s') \right]}$$
Relation between DME and MMI
$$\begin{aligned}
\log \prod_{r=1}^{R} p_{\mathrm{LA}}(s_r)
&= \sum_{r=1}^{R} \log \frac{\exp\!\left[ \sum_{i=1}^{F} \lambda^{L}_i f^{L}_i(s_r) \right] p(X_r \mid s_r)}{\sum_{s'} \exp\!\left[ \sum_{i=1}^{F} \lambda^{L}_i f^{L}_i(s') \right] p(X_r \mid s')} \\
&= \sum_{r=1}^{R} \log \frac{p(X_r \mid s_r)\, p(s_r)}{\sum_{s} p(X_r \mid s)\, p(s)} = \tilde F_{\mathrm{MMI}}
\end{aligned}$$
with $p(s) \propto \exp\!\left[ \sum_{i=1}^{F} \lambda^{L}_i f^{L}_i(s) \right]$ playing the role of the language model. Hence DME estimation is equivalent to the modified MMI criterion.