1
Information Retrieval through Various Approximate Matrix
Decompositions
Kathryn Linehan
Advisor: Dr. Dianne O’Leary
2
Querying a Document Database
We want to return documents that are relevant to entered search terms
Given data:
• Term-Document Matrix, A
  • Entry (i, j): importance of term i in document j
• Query Vector, q
  • Entry (i): importance of term i in the query
3
Solutions
Literal Term Matching
• Compute score vector: s = qᵀA
• Return the highest scoring documents
• May not return relevant documents that do not contain the exact query terms

Latent Semantic Indexing (LSI)
• Same process as above, but use an approximation to A
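A minimal sketch of the literal term matching step above; the 4-term by 3-document matrix A and the query q are made-up toy data:

```python
import numpy as np

# Toy 4-term x 3-document matrix; entry (i, j) weights term i in document j.
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0],
              [3.0, 0.0, 0.0],
              [0.0, 2.0, 1.0]])

# Query containing terms 0 and 2.
q = np.array([1.0, 0.0, 1.0, 0.0])

s = q @ A                # score vector s = q^T A, one score per document
ranked = np.argsort(-s)  # document indices ordered by descending score
```

LSI would run the same two lines with an approximation of A in place of A.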
4
Term-Document Matrix Approximation
Standard approximation used in LSI: rank-k SVD
Project Goal: evaluate use of term-document matrix approximations other than rank-k SVD in LSI
• Nonnegative Matrix Factorization (NMF)
• CUR Decomposition
5
Matrix Approximation Validation
Let Ã be an approximation to A. As the rank of Ã increases, we expect the relative error, ‖A - Ã‖_F / ‖A‖_F, to go to zero.

Matrix approximation can be applied to any matrix A
• Preliminary test matrix A: 50 x 30 random sparse matrix
• Future test matrices: three large sparse term-document matrices
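The relative-error check above can be sketched with the standard rank-k SVD approximation; the random sparse test matrix here is an illustrative stand-in for the project's 50 x 30 matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
# 50 x 30 random matrix with roughly 20% nonzeros, a sparse-ish test case.
A = rng.random((50, 30)) * (rng.random((50, 30)) < 0.2)

def rank_k_svd(A, k):
    """Best rank-k approximation of A in the Frobenius norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Relative error ||A - A_k||_F / ||A||_F for increasing rank k.
errs = [np.linalg.norm(A - rank_k_svd(A, k), 'fro') / np.linalg.norm(A, 'fro')
        for k in (5, 10, 20, 30)]
```

The error is non-increasing in k and reaches (numerical) zero at full rank.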
6
Nonnegative Matrix Factorization (NMF)
Term-document matrix is nonnegative
A ≈ WH, where A is m x n, W is m x k, and H is k x n
• W and H are nonnegative
• rank(WH) ≤ k
7
NMF
Multiplicative update algorithm of Lee and Seung found in [1]
• Find W, H to minimize ½‖A - WH‖²_F
• Random initialization for W, H
• Convergence is not guaranteed, but in practice it is very common
• Slow due to matrix multiplications in each iteration
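A minimal sketch of the multiplicative updates, assuming the standard Lee-Seung rules for the Frobenius objective; the `eps` guard against division by zero and the fixed iteration count are illustrative choices, not from the slides:

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates minimizing ||A - WH||_F^2."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))          # random nonnegative initialization
    H = rng.random((k, n))
    for _ in range(iters):
        # Elementwise multiplicative updates; nonnegativity is preserved.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

A = np.random.default_rng(1).random((50, 30))   # toy nonnegative matrix
W, H = nmf(A, 10)
err = np.linalg.norm(A - W @ H, 'fro') / np.linalg.norm(A, 'fro')
```

Each iteration costs a handful of matrix multiplications, which is the slowness the slide refers to.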
8
NMF Validation
A: 50 x 30 random sparse matrix. Average over 5 runs.
[Figures: "NMF Validation: Relative Error" and "NMF Validation: Run Time" — relative error and run time plotted against k]
9
CUR Decomposition
Term-document matrix is sparse
A ≈ CUR, where A is m x n, C is m x c, U is c x r, and R is r x n
• C (R) holds c (r) sampled and rescaled columns (rows) of A
• U is computed using C and R
• rank(CUR) ≤ k, where k is a rank parameter
10
CUR Implementations
CUR algorithm in [2] by Drineas, Kannan, and Mahoney
• Linear time algorithm
• Modification: use ideas in [3] by Drineas, Mahoney, and Muthukrishnan (no longer linear time)
• Improvement: Compact Matrix Decomposition (CMD) in [5] by Sun, Xie, Zhang, and Faloutsos
• Other modifications: our ideas

Deterministic CUR code by G. W. Stewart
11
Sampling

Column (row) norm sampling [2]
• Prob(col j) = ‖A(:, j)‖² / ‖A‖²_F (similar for row i)

Subspace sampling [3]
• Uses rank-k SVD of A for column probabilities
• Prob(col j) = ‖V_{A,k}(j, :)‖² / k
• Uses “economy size” SVD of C for row probabilities
• Prob(row i) = ‖U_C(i, :)‖² / c

Sampling without replacement
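The column-norm sampling scheme might be sketched as follows; the with-replacement draw and the 1/sqrt(c·p_j) rescaling of sampled columns follow the usual construction in [2], and the matrix here is toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((50, 30))   # toy matrix standing in for a term-document matrix

# Column-norm sampling probabilities: p_j = ||A(:, j)||^2 / ||A||_F^2.
col_p = (A ** 2).sum(axis=0) / (A ** 2).sum()

c = 10
cols = rng.choice(A.shape[1], size=c, p=col_p)   # sample c columns w/ replacement
C = A[:, cols] / np.sqrt(c * col_p[cols])        # rescale each sampled column
```

The row matrix R is built the same way from row norms of A.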
12
Sampling Comparison
A: 50 x 30 random sparse matrix. Average over 5 runs. Legend: Sampling, U, Scaling (scaling only for without-replacement sampling)

[Figures: "CUR Validation: Relative Error" and "CUR Validation: Run Time" — relative error and run time plotted against k for series CN,L; S,L; w/o R,L,w/o Sc; w/o R,L,Sc]
13
Computation of U
Linear algorithm U: approximately solves min_Û ‖A - CÛ‖_F, where Û = UR [2]

Optimal U: solves min_U ‖A - CUR‖²_F
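The optimal U can be sketched via pseudoinverses, since the Frobenius-norm minimizer of ‖A - CUR‖_F over U is U = C⁺AR⁺; the uniform column/row selection below is purely illustrative and not one of the sampling schemes above:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((50, 30))                        # toy matrix
C = A[:, rng.choice(30, 10, replace=False)]     # 10 illustrative columns
R = A[rng.choice(50, 10, replace=False), :]     # 10 illustrative rows

# Optimal U in the Frobenius sense: U = C^+ A R^+ (Moore-Penrose pseudoinverses).
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)

err = np.linalg.norm(A - C @ U @ R, 'fro') / np.linalg.norm(A, 'fro')
```

The two pseudoinverses are what make the optimal U more expensive than the linear-time construction.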
14
U Comparison
A: 50 x 30 random sparse matrix. Average over 5 runs. Legend: Sampling, U

[Figures: "CUR Validation: Relative Error" and "CUR Validation: Run Time" — relative error and run time plotted against k for series CN,L; CN,O; S,L; S,O]
15
Compact Matrix Decomposition (CMD) Improvement
Remove repeated columns (rows) in C (R). Decreases storage while still achieving the same relative error [5].

                  [2]        [2] with CMD
Run time          0.008060   0.007153
Storage           880.5      550.5
Relative error    0.820035   0.820035

A: 50 x 30 random sparse matrix, k = 15. Average over 10 runs.
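One way to sketch the CMD step, assuming the usual sqrt-of-multiplicity rescaling so that the compressed C preserves CCᵀ; `cmd_compress` is a hypothetical helper name:

```python
import numpy as np

def cmd_compress(C):
    """Keep unique sampled columns, rescaling each by sqrt(multiplicity)
    so that the product C'C'^T matches CC^T (a sketch of the CMD idea in [5])."""
    uniq, counts = np.unique(C, axis=1, return_counts=True)
    return uniq * np.sqrt(counts)

rng = np.random.default_rng(0)
A = rng.random((6, 4))
cols = np.array([0, 2, 2, 2, 3])   # with-replacement sampling picked col 2 thrice
C = A[:, cols]
Cc = cmd_compress(C)               # stores 3 columns instead of 5
```

Storage drops with every repeated sample, while C'C'^T (and hence the approximation quality) is unchanged.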
16
Deterministic CUR
Code by G. W. Stewart

Uses an RRQR algorithm that does not store Q
• We only need the permutation vector
• Gives us the columns (rows) for C (R)

Uses optimal U
17
CUR Comparison
A: 50 x 30 random sparse matrix. Average over 5 runs. Legend: Sampling, U, Scaling (scaling only for without-replacement sampling)

[Figures: "CUR Validation: Relative Error" and "CUR Validation: Run Time" — relative error and run time plotted against k for series CN,L; CN,O; S,L; S,O; w/o R,L,w/o Sc; w/o R,L,Sc; D]
18
Future Project Goals
Finish investigation of CUR improvements
Validate NMF and CUR using term-document matrices
Investigate storage, computation time, and relative error of NMF and CUR
Test performance of NMF and CUR in LSI
• Use average precision and recall, where the average is taken over all queries in the data set
19
Precision and Recall
Measurements of performance for document retrieval

Let Retrieved = number of documents retrieved, Relevant = total number of documents relevant to the query, RetRel = number of retrieved documents that are relevant.

Precision: P(Retrieved) = RetRel / Retrieved

Recall: R(Retrieved) = RetRel / Relevant
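The two measures follow directly from their definitions; `precision_recall` and the toy document ids below are illustrative:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given iterables of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    retrel = len(retrieved & relevant)        # retrieved AND relevant
    precision = retrel / len(retrieved) if retrieved else 0.0
    recall = retrel / len(relevant) if relevant else 0.0
    return precision, recall

# Retrieved docs 1-4; docs 2, 4, 7 are actually relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 7])
```

Averaging p and r over all queries in the data set gives the evaluation measure proposed on the previous slide.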
20
Further Topics
Time permitting investigations
• Parallel implementations of matrix approximations
• Testing performance of matrix approximations in forming a multidocument summary
21
References

[1] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, 52(1):155-173, September 2007.
[2] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1):184-206, 2006.
[3] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844-881, 2008.
[4] Tamara G. Kolda and Dianne P. O'Leary. A semidiscrete matrix decomposition for latent semantic indexing in information retrieval. ACM Transactions on Information Systems, 16(4):322-346, October 1998.
[5] Jimeng Sun, Yinglian Xie, Hui Zhang, and Christos Faloutsos. Less is more: Sparse graph mining with compact matrix decomposition. Statistical Analysis and Data Mining, 1(1):6-22, February 2008.