
LSA – Algorithms
Jan Larsen
Ph.D. student Lasse Lohilahti Mølgaard
Informatics and Mathematical Modelling / Intelligent Signal Processing


Agenda

- Basics of LSA
  - Mathematical model
- PLSI
  - Probabilistic interpretation
  - Results in retrieval
- Demos
  - Castsearch
  - Wikipedia


LSA – algorithm

SVD factorises the term-document matrix (terms × docs):

$\hat{X} = U \Sigma V^T$

where $\hat{X}$ is $t \times d$, $U$ is $t \times k$, $\Sigma$ is a diagonal $k \times k$ matrix, and $V^T$ is $k \times d$.
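A minimal runnable sketch of this rank-$k$ factorisation (the 5 × 4 toy count matrix and $k = 2$ are hypothetical, not from the slides):

```python
# Minimal sketch of the LSA factorisation; the toy counts and k = 2
# are hypothetical values.
import numpy as np

X = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 3., 0., 1.],
              [0., 0., 2., 2.],
              [1., 0., 0., 3.]])          # t x d term-document counts

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                     # latent dimensions kept by LSA
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
X_hat = U_k @ S_k @ Vt_k                  # rank-k approximation of X

print(U_k.shape, S_k.shape, Vt_k.shape)   # (5, 2) (2, 2) (2, 4)
```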


SVD properties

SVD solves the eigenvalue problem for the matrices $\hat{X}\hat{X}^T$ and $\hat{X}^T\hat{X}$.

The eigenvectors of these matrices are orthogonal, i.e. $U^T U = I$ and $V^T V = I$.

LSA uses the vectors corresponding to the largest eigenvalues.
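These properties can be checked numerically; a small sketch on a random matrix (an illustration, not part of the slides):

```python
# Numerical check of the stated SVD/eigenvalue properties on a
# random 5 x 4 matrix.
import numpy as np

X = np.random.default_rng(0).random((5, 4))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# eigenvalues of X X^T (largest first) equal the squared singular values
eigvals = np.linalg.eigvalsh(X @ X.T)[::-1][:4]
assert np.allclose(eigvals, s**2)

assert np.allclose(U.T @ U, np.eye(4))    # left singular vectors orthonormal
assert np.allclose(Vt @ Vt.T, np.eye(4))  # right singular vectors orthonormal
```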


SVD properties

The extracted vectors correspond to the directions containing most of the variance.

SVD can be solved very efficiently because of these properties and the sparsity of the term-document matrix.
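A sketch of how this is exploited in practice, assuming SciPy's sparse solver (the matrix size and density are made up for illustration):

```python
# svds computes only the k largest singular triplets of a sparse
# matrix with an iterative Lanczos-type solver, never forming
# X X^T densely.
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

X = sparse_random(20000, 5000, density=0.001, format='csr', random_state=0)

U_k, s_k, Vt_k = svds(X, k=100)
order = np.argsort(s_k)[::-1]             # svds returns ascending order
U_k, s_k, Vt_k = U_k[:, order], s_k[order], Vt_k[order, :]
```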


Text corpus example


LSA - shortcomings

- LSA components are orthogonal
- Components are not directly interpretable; subsequent clustering is necessary
- No principled way to choose the number of components
- May return negative entries in $\hat{X}$ (see the sketch below)

LSA components span the subspace containing most of the variance. Is this the most useful model?
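A small sketch of the negative-entries shortcoming (the toy counts are hypothetical):

```python
# X is nonnegative (word counts), but its rank-2 reconstruction
# generally contains negative entries.
import numpy as np

X = np.array([[3., 0., 1.],
              [0., 2., 0.],
              [1., 0., 2.]])
U, s, Vt = np.linalg.svd(X)
X_hat = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]
print(X_hat.round(2))   # zero counts typically come back slightly negative
```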


Cepstral features in music [Hansen, Feng]


Cepstral Features of Phonemes [Hansen, Feng]


Related algorithms

- ICA – Independent Component Analysis [Kolenda et al.]
  - Components are independent
- PLSA – Probabilistic LSA [Hofmann]
  - Similar to ICA
  - Feasible algorithms for optimisation
- LDA – Latent Dirichlet Allocation [Blei, Ng, Jordan]
  - An extension of the PLSA model


ICA example


PLSI

Probabilistic latent variable model:
- Generative model with an unobserved class variable k (a generative sketch follows below)
- Joint probability model over terms and documents
- Components are found using an iterative EM algorithm
- Analogous structure to LSA, but with the benefits of a probabilistic framework: similarities as probabilities vs. cosines
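A minimal sketch of the generative story with hypothetical toy parameters (two topics, three terms):

```python
# For each word: draw an unobserved topic k given the document,
# then a term t given the topic.
import numpy as np

rng = np.random.default_rng(0)

P_k_given_d = np.array([0.8, 0.2])         # P(k|d) for one document, 2 topics
P_t_given_k = np.array([[0.7, 0.2, 0.1],   # P(t|k=0) over a 3-term vocabulary
                        [0.1, 0.3, 0.6]])  # P(t|k=1)

def sample_term():
    k = rng.choice(2, p=P_k_given_d)        # latent class variable k
    return rng.choice(3, p=P_t_given_k[k])  # observed term index t

print([sample_term() for _ in range(10)])
```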


PLSI: $\hat{P}(t,d) = \sum_k P(t|k)\,P(k)\,P(k|d)$, a $t \times d$ matrix over terms and docs factored as $(t \times k) \times (k \times k) \times (k \times d)$.

SVD: $\hat{X} = U \Sigma V^T$, likewise $t \times d$ factored as $(t \times k) \times (k \times k) \times (k \times d)$.

Same structure but different interpretation.


PLSI shortcomings

The EM algorithm requires full matrix reconstruction of $\hat{P}(t,d)$:

- E-step: compute responsibilities $P(k \mid t,d) \propto P(t|k)\,P(k)\,P(k|d)$ for every term-document pair
- M-step: re-estimate $P(t|k)$, $P(k)$ and $P(k|d)$ from the responsibility-weighted counts

PLSI: $\hat{P}(t,d) = \sum_k P(t|k)\,P(k)\,P(k|d)$, a $t \times d$ matrix (terms × docs) factored as $(t \times k) \times (k \times k) \times (k \times d)$.
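A compact EM sketch for this model, using Hofmann's asymmetric parameterisation; the toy counts, K = 2 topics, and 50 iterations are assumptions, not from the slides. Note how the E-step materialises a full t × d × k responsibility array; this is exactly the scaling problem raised on the next slide.

```python
# Compact EM sketch for PLSI with hypothetical toy counts.
import numpy as np

rng = np.random.default_rng(0)
N = rng.integers(1, 5, size=(6, 4)).astype(float)    # n(t,d) term-doc counts
T, D, K = 6, 4, 2

Pt_k = rng.random((T, K)); Pt_k /= Pt_k.sum(axis=0)  # P(t|k)
Pk_d = rng.random((K, D)); Pk_d /= Pk_d.sum(axis=0)  # P(k|d)

for _ in range(50):
    # E-step: responsibilities P(k|t,d), a full t x d x k array
    R = Pt_k[:, None, :] * Pk_d.T[None, :, :]
    R /= R.sum(axis=2, keepdims=True)
    # M-step: re-estimate the factors from responsibility-weighted counts
    Pt_k = (N[:, :, None] * R).sum(axis=1)           # expected n(t,k)
    Pt_k /= Pt_k.sum(axis=0, keepdims=True)
    Pk_d = (N[:, :, None] * R).sum(axis=0).T         # expected n(k,d)
    Pk_d /= N.sum(axis=0, keepdims=True)
```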


Infeasible for large term-document matrices:
- 100,000 documents × 20,000 terms gives 2×10⁹ entries, roughly 15 GB as a dense double-precision matrix

The model can be approximated using Non-negative Matrix Factorisation (NMF); a sketch follows below.
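A sketch of the NMF route, assuming scikit-learn's NMF; its KL-divergence objective is closely related to the PLSI likelihood, and the sparse input is factorised without ever forming the dense t × d reconstruction. The matrix size, density, and component count are made up.

```python
# NMF approximation of the PLSI-style factorisation on sparse counts.
from scipy.sparse import random as sparse_random
from sklearn.decomposition import NMF

X = sparse_random(5000, 2000, density=0.002, format='csr', random_state=0)

nmf = NMF(n_components=20, beta_loss='kullback-leibler',
          solver='mu', init='nndsvda', max_iter=200)
W = nmf.fit_transform(X)    # t x k nonnegative term loadings
H = nmf.components_         # k x d nonnegative document mixtures
```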


Query relevance

Query: a new document d* with equal probability assigned to each of its terms
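A hedged sketch of folding such a query in: $P(t|k)$ from training stays fixed and EM runs only for $P(k|d^*)$. The fold_in helper is hypothetical, and it assumes Pt_k (t × k) is smoothed so every query term has nonzero probability under some topic.

```python
# Fold a query in as a new document d* and estimate its topic mixture.
import numpy as np

def fold_in(query_terms, Pt_k, n_iter=20):
    K = Pt_k.shape[1]
    Pq = Pt_k[query_terms]               # rows for the observed query terms
    Pk_d = np.full(K, 1.0 / K)           # P(k|d*), uniform initialisation
    for _ in range(n_iter):
        R = Pq * Pk_d                    # E-step: unnormalised P(k|t,d*)
        R /= R.sum(axis=1, keepdims=True)
        Pk_d = R.mean(axis=0)            # M-step: equal weight per query term
    return Pk_d                          # compare with P(k|d) of stored docs
```

Documents can then be ranked by the similarity between the returned $P(k|d^*)$ and the stored $P(k|d)$ mixtures.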


Castsearch

castsearch.imm.dtu.dk


Context evaluations

Data:
- CNN hourly podcasts continuously retrieved in the period April 4th 2006 – September 9th 2006
- 2099 CNN News podcasts processed

Vector space representation:
- 30977 speaker documents
- Vocabulary of 37791 words (excluding stop words)


Text segmentation

Resegment text according to topics.

NMF model: assign a topic to each pseudo-document d*, treated as a query (a sketch follows below).

Tests performed on a manually segmented subset of podcasts.

Example, with || marking a true segment boundary between two pseudo-documents d*:
"last month viacom demanded you to remove more than a hundred thousand unauthorized clips || attorney general alberto gonzalez"
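A hypothetical sketch of this segmentation loop, reusing the fold_in helper from the query sketch above; the window and step sizes are arbitrary choices:

```python
# Slide a window over the word stream, fold each window in as a
# pseudo-document d*, and place a boundary where the dominant
# topic changes.
import numpy as np

def segment(word_ids, Pt_k, window=20, step=10):
    boundaries, prev_topic = [], None
    for start in range(0, len(word_ids) - window + 1, step):
        d_star = word_ids[start:start + window]       # pseudo-document d*
        topic = int(np.argmax(fold_in(d_star, Pt_k)))
        if prev_topic is not None and topic != prev_topic:
            boundaries.append(start)                  # topic change boundary
        prev_topic = topic
    return boundaries
```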


Text segmentation example


Castsearch Demo

Web demo:
- Automatic retrieval of CNN podcasts and preprocessing
- 70 contexts annotated manually
- castsearch.imm.dtu.dk


References

[Hansen, Feng]: Hansen, L. K., Feng, L., "Cogito Componentiter Ergo Sum", ICA 2006 – 6th International Conference on Independent Component Analysis and Blind Source Separation, pp. 446–453, 2006.

[Kolenda et al.]: Kolenda, T., Hansen, L. K., Sigurdsson, S., "Independent Components in Text", Advances in Independent Component Analysis, pp. 229–250, 2000.

[Hofmann]: Hofmann, T., "Probabilistic Latent Semantic Indexing", Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), 1999.

[Blei, Ng, Jordan]: Blei, D., Ng, A., Jordan, M., "Latent Dirichlet Allocation", Journal of Machine Learning Research 3 (2003), pp. 993–1022.

[Castsearch]: Mølgaard, L. L., Jørgensen, K. W., Hansen, L. K., "Castsearch – Context Based Spoken Document Retrieval", ICASSP, 2007.