Latent Semantic Indexing – Illinois (sifaka.cs.uiuc.edu/~wang296/course/ir_fall/docs/pdfs/latent...)
TRANSCRIPT
Latent Semantic Analysis
Hongning Wang
CS@UVa
VS model in practice
• Documents and queries are represented by term vectors
– Terms are not necessarily orthogonal to each other
• Synonymy: car vs. automobile
• Polysemy: fly (action vs. insect)
CS@UVa CS6501: Information Retrieval 2
Choosing basis for VS model
• A concept space is preferred
– It bridges the semantic gap between terms and meanings
(figure: documents D1–D5 and a query plotted in a concept space whose axes are Sports, Education, and Finance)
How to build such a space
• Automatic term expansion
– Construction of thesaurus
• WordNet
– Clustering of words
• Word sense disambiguation
– Dictionary-based
• The relation between a pair of words observed in text should match the relation given in the dictionary's description
– Explore word usage context
How to build such a space
• Latent Semantic Analysis
– Assumption: there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice with respect to retrieval
– In other words: the observed term-document association data is contaminated by random noise
How to build such a space
• Solution
– Low rank matrix approximation
Imagine this is our observed term-document matrix
Imagine this is *true* concept-document matrix
Random noise over the word selection in each document
Latent Semantic Analysis (LSA)
• Low-rank approximation of the document-term matrix $C_{M\times N}$
– Goal: remove noise in the observed term-document association data
– Solution: find the rank-$k$ matrix that is closest to the original matrix in terms of the Frobenius norm

$$Z = \operatorname*{arg\,min}_{Z \,\mid\, \operatorname{rank}(Z)=k} \|C - Z\|_F = \operatorname*{arg\,min}_{Z \,\mid\, \operatorname{rank}(Z)=k} \sqrt{\sum_{i=1}^{M}\sum_{j=1}^{N} \left(C_{ij} - Z_{ij}\right)^2}$$
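The objective above can be checked numerically: by the Eckart–Young theorem, truncating the SVD to the $k$ largest singular values attains this minimum, and the Frobenius error equals the square root of the sum of the discarded squared singular values. A minimal numpy sketch, not from the slides, with an arbitrary toy matrix:

```python
import numpy as np

# Toy observed matrix (values chosen arbitrarily for illustration)
C = np.array([[3., 1., 0., 0.],
              [2., 2., 0., 1.],
              [0., 0., 4., 2.],
              [0., 1., 3., 3.]])

k = 2
U, s, Vt = np.linalg.svd(C, full_matrices=False)  # s is sorted descending

# Rank-k approximation: keep only the k largest singular values
Z = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius error of the truncation equals sqrt of the discarded sigma_i^2
err = np.linalg.norm(C - Z, 'fro')
print(err, np.sqrt(np.sum(s[k:] ** 2)))  # the two numbers agree
```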
Basic concepts in linear algebra
• Symmetric matrix
– 𝐶 = 𝐶𝑇
• Rank of a matrix
– Number of linearly independent rows (columns) in a matrix 𝐶𝑀×𝑁
– 𝑟𝑎𝑛𝑘 𝐶𝑀×𝑁 ≤ min(𝑀, 𝑁)
Basic concepts in linear algebra
• Eigen system
– For a square matrix 𝐶𝑀×𝑀
– If 𝐶𝑥 = 𝜆𝑥, then 𝑥 is called a right eigenvector of 𝐶 and 𝜆 is the corresponding eigenvalue
• For a symmetric full-rank matrix 𝐶𝑀×𝑀
– We have its eigen-decomposition as
• 𝐶 = 𝑄Λ𝑄𝑇
• where the columns of 𝑄 are the orthogonal and normalized eigenvectors of 𝐶 and Λ is a diagonal matrix whose entries are the eigenvalues of 𝐶
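A quick numerical illustration of this decomposition (not from the slides; a toy symmetric matrix):

```python
import numpy as np

# A small symmetric matrix (arbitrary illustrative values)
C = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])

# eigh is specialized for symmetric matrices: it returns real eigenvalues
# lam and orthonormal eigenvectors Q with C = Q diag(lam) Q^T
lam, Q = np.linalg.eigh(C)

reconstructed = Q @ np.diag(lam) @ Q.T
print(np.allclose(C, reconstructed))    # True
print(np.allclose(Q.T @ Q, np.eye(3)))  # True: columns of Q are orthonormal
```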
Basic concepts in linear algebra
• Singular value decomposition (SVD)
– For a matrix $C_{M\times N}$ with rank $r$, we have
• $C = U\Sigma V^T$
• where $U_{M\times r}$ and $V_{N\times r}$ have orthonormal columns, and $\Sigma$ is an $r\times r$ diagonal matrix with $\Sigma_{ii} = \sigma_i = \sqrt{\lambda_i}$, where $\lambda_1 \ge \dots \ge \lambda_r$ are the eigenvalues of $CC^T$
– We define $C^k_{M\times N} = U_{M\times k}\,\Sigma_{k\times k}\,V_{N\times k}^T$
• where the $\Sigma_{ii}$ are placed in descending order and we keep $\Sigma_{ii} = \sigma_i$ for $i \le k$ and set $\Sigma_{ii} = 0$ for $i > k$
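The link between singular values and the eigenvalues of $CC^T$ can be verified directly: $\sigma_i^2$ should match $\lambda_i$. A sketch with an arbitrary toy matrix, not from the slides:

```python
import numpy as np

# Illustrative 4 x 3 matrix of rank 3
C = np.array([[1., 0., 2.],
              [0., 3., 1.],
              [1., 1., 0.],
              [2., 0., 1.]])

U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Eigenvalues of C C^T, largest first; keep the nonzero part
lam = np.sort(np.linalg.eigvalsh(C @ C.T))[::-1][:len(s)]

print(np.allclose(s ** 2, lam))  # True: sigma_i = sqrt(lambda_i)
```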
Latent Semantic Analysis (LSA)
• Solve LSA by SVD
– Procedure of LSA
1. Perform SVD on document-term adjacency matrix
2. Construct $C^k_{M\times N}$ by keeping only the largest $k$ singular values in $\Sigma$ non-zero

$$Z = \operatorname*{arg\,min}_{Z \,\mid\, \operatorname{rank}(Z)=k} \|C - Z\|_F = \operatorname*{arg\,min}_{Z \,\mid\, \operatorname{rank}(Z)=k} \sqrt{\sum_{i=1}^{M}\sum_{j=1}^{N} \left(C_{ij} - Z_{ij}\right)^2} = C^k_{M\times N}$$
Map to a lower dimensional space
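The two-step procedure above can be sketched end to end (numpy; the matrix values are illustrative only, and `lsa_lowrank` is a hypothetical helper name):

```python
import numpy as np

def lsa_lowrank(C, k):
    """Rank-k LSA approximation of a document-term matrix C."""
    # Step 1: SVD of the adjacency matrix
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    # Step 2: zero out all but the k largest singular values
    s_k = np.where(np.arange(len(s)) < k, s, 0.0)
    return U @ np.diag(s_k) @ Vt

# Toy document-term matrix (illustrative values)
C = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 0., 3., 1.],
              [0., 1., 2., 2.]])

Ck = lsa_lowrank(C, k=2)
print(np.linalg.matrix_rank(Ck))  # 2: the approximation has rank k
```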
Latent Semantic Analysis (LSA)
• Another interpretation
– $C_{M\times N}$ is the document-term adjacency matrix
– $D_{M\times M} = C_{M\times N} \, C_{M\times N}^T$
• $D_{ij}$: document-document similarity, obtained by counting how many terms co-occur in $d_i$ and $d_j$
• $D = (U\Sigma V^T)(U\Sigma V^T)^T = U\Sigma^2 U^T$
– This is the eigen-decomposition of the document-document similarity matrix
– $d_i$'s new representation in this space is then $(U\Sigma^{1/2})_i$
– In the lower-dimensional space, we use only the first $k$ elements of $(U\Sigma^{1/2})_i$ to represent $d_i$
– The same analysis applies to the term-term similarity matrix $T_{N\times N} = C_{M\times N}^T \, C_{M\times N}$
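This equivalence is easy to verify numerically; a sketch (not from the slides) with an arbitrary toy matrix:

```python
import numpy as np

# Toy document-term matrix (illustrative values)
C = np.array([[1., 2., 0.],
              [0., 1., 1.],
              [2., 0., 1.]])

U, s, Vt = np.linalg.svd(C)

# Document-document similarity from term co-occurrence counts
D = C @ C.T

# D = U Sigma^2 U^T: the SVD of C yields the eigen-decomposition of D
print(np.allclose(D, U @ np.diag(s ** 2) @ U.T))  # True
```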
Geometric interpretation of LSA
• $C^k_{M\times N}(i,j)$ measures the relatedness between $d_i$ and $w_j$ in the $k$-dimensional space
• Therefore
– As $C^k_{M\times N} = U_{M\times k}\,\Sigma_{k\times k}\,V_{N\times k}^T$
– $d_i$ is represented as $(U_{M\times k}\,\Sigma_{k\times k}^{1/2})_i$
– $w_j$ is represented as $(V_{N\times k}\,\Sigma_{k\times k}^{1/2})_j$
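Under this interpretation, document and term coordinates split $\Sigma$ evenly, so inner products between document and term vectors recover the entries of $C^k$. A sketch with illustrative values, not from the slides:

```python
import numpy as np

# Toy document-term matrix (illustrative values)
C = np.array([[3., 0., 1., 0.],
              [1., 2., 0., 0.],
              [0., 0., 2., 3.],
              [0., 1., 1., 2.]])

k = 2
U, s, Vt = np.linalg.svd(C, full_matrices=False)
S_half = np.diag(np.sqrt(s[:k]))

docs = U[:, :k] @ S_half        # row i: representation of d_i
terms = Vt[:k, :].T @ S_half    # row j: representation of w_j

# Inner products between doc and term vectors reproduce C^k
Ck = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.allclose(docs @ terms.T, Ck))  # True
```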
Latent Semantic Analysis (LSA)
• Visualization
(figure: terms and documents plotted in a 2-D latent space, with clusters labeled HCI and Graph theory)
What are those dimensions in LSA
• Principal component analysis
Latent Semantic Analysis (LSA)
• What we have achieved via LSA
– Terms/documents that are closely associated are placed near one another in this new space
– Terms that do not occur in a document may still be close to it, if that is consistent with the major patterns of association in the data
– A good choice of concept space for VS model!
LSA for retrieval
• Project queries into the new document space
– $\hat{q} = q \, V_{N\times k} \, \Sigma_{k\times k}^{-1}$
• Treat query as a pseudo document of term vector
• Cosine similarity between query and documents in this lower-dimensional space
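Query projection and cosine scoring can be sketched as below (numpy; the matrix and query values are illustrative only):

```python
import numpy as np

# Toy document-term matrix (illustrative values)
C = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 1.],
              [0., 0., 3., 1.],
              [0., 1., 2., 2.]])

k = 2
U, s, Vt = np.linalg.svd(C, full_matrices=False)
V_k, S_k = Vt[:k, :].T, np.diag(s[:k])

docs = U[:, :k] @ S_k                  # documents in the k-dim space

q = np.array([1., 1., 0., 0.])         # query as a pseudo-document term vector
q_hat = q @ V_k @ np.linalg.inv(S_k)   # fold the query into the same space

# Rank documents by cosine similarity in the lower-dimensional space
scores = docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
print(np.argsort(-scores))             # document indices, best match first
```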
LSA for retrieval
q: “human computer interaction”
(figure: the query projected into the same 2-D latent space as the documents, with clusters labeled HCI and Graph theory)
Discussions
• Computationally expensive
– Time complexity: $O(MN^2)$
• Empirically helpful for recall but not for precision
– Recall increases as 𝑘 decreases
• Optimal choice of 𝑘
• Difficult to handle dynamic corpus
• Difficult to interpret the decomposition results
We will come back to this later!
LSA beyond text
• Collaborative filtering
LSA beyond text
• Eigen face
LSA beyond text
• Cat from a deep neural network
One of the neurons in an artificial neural network, trained on still frames from unlabeled YouTube videos, learned to detect cats.
What you should know
• Assumption in LSA
• Interpretation of LSA
– Low rank matrix approximation
– Eigen-decomposition of co-occurrence matrix for documents and terms
• LSA for IR