efficient linear algebra methods in data miningsaad/pdf/m2a_07.pdfefficient linear algebra methods...

36
Efficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and Engineering M2A07 – Luminy, France 1

Upload: others

Post on 17-Oct-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Efficient Linear Algebra Methods in DataMining

Yousef SaadUniversity of Minnesota

Dept. of Computer Science andEngineering

M2A07 – Luminy, France

1

Page 2: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Introduction and Background:

ä Information sciences : Data Mining, Data Analysis, Machine Learn-

ing, Classification, .... are a huge source of interesting matrix prob-

lems

ä Effective linear algebra methods are just starting to be deployed

ä In this talk 3 sample problems:

1. Information retrieval

2. Face recognotion

3. Clustering

M2A–07, Oct-2007 2

2

Page 3: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Information Retrieval: Vector Space Model

Given: 1) set of docu-

ments (columns of a matrix

A); 2) a query vector q.

Entry aij of A = frequency

of term i in document j +

weighting.Te

rms

Documents

ä Queries (‘pseudo-documents’) q represented similarly to columns

Problem: find columns of A that best match q

M2A–07, Oct-2007 3

3

Page 4: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Vector Space Model and the Truncated SVD

ä Similarity metric: angle between column Aj,: and query q

Use Cosines:|qTA:,j|

‖A:,j‖2‖q‖2

ä To rank all documents compute the similarity vector:

s = ATq

ä ‘Litteral’ matching – not very effective. Problems : polysemy,

synonymy, ...

ä LSI: replace matrix A by low rank approximation

A = UΣV T → Ak = UkΣkVT

k → sk = ATk q

ä Uk : term space, Vk: document space.

ä Called TSVD – Expensive, hard to update, ..

M2A–07, Oct-2007 4

4

Page 5: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

IR: Use of approximation theory

ä Use of polynomial filters * Joint work with E. Kokiopoulou

Idea: Replace Ak by Aφ(ATA) where φ = a filter function

ä Consider the step-function:

φ(x) =

0, 0 ≤ x ≤ σ2k

1, σ2k ≤ x ≤ σ2

1σ σ

φ

k 1

2 2

(t)

ä This would yield the same result as with TSVD but...

ä ... Not easy to use this function directly

ä Solution : use a polynomial approximation to φ

ä Note: sT = qTAφ(ATA) , requires only Mat-Vec’s

M2A–07, Oct-2007 5

5

Page 6: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

How to get the polynomial filter?

Idea: First select an “ideal fiter”

ä e.g. a piecewise

polynomial function

ba

φ

ä For example φ = Hermite interpolating pol. in [0,a], and φ = 1 in

[a, b]

M2A–07, Oct-2007 6

6

Page 7: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

ä Then approximate this filter by an ‘optimal’ (least-squares) poly-

nomial

ba

φ

Main advantage: Extremely flexible.

Method: Build a sequence of polynomials φk which approximate

the ideal PP filter φ, in the L2 sense.

M2A–07, Oct-2007 7

7

Page 8: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

ä If {Pj} is a basis of polynomials that are orthogomal w.r.t. some

L2 inner-product, then

φk(t) =k∑

j=1

〈φ, Pj〉Pj(t),

ä Can use Stieljes procedure to compute orthogonal polynomials

[Erhel, Guyomarch, YS’99]

ä Or can use a Conjugate residual-type algorithm in polynomial

space [YS’05, Bekas-Kokiopoulou-YS’05]

ä Accuracy close to that of TSVD – But no SVD required

ä Experiments and details skipped.

M2A–07, Oct-2007 8

8

Page 9: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

IR: Use of the Lanczos algorithm

* Joint work with Jie Chen – in progress

ä Lanczos is good at catching large (and small) eigenvalues: can

compute singular vectors with Lanczos, & use them in LSI

ä Can do better: Use the Lanczos vectors directly for the projec-

tion..

ä First advocated by: K. Blom and A. Ruhe [SIMAX, vol. 26, 2005].

Use Lanczos bidiagonalization.

ä Use a similar approach – But directly with AAT or ATA.

M2A–07, Oct-2007 9

9

Page 10: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

IR: Use of the Lanczos algorithm (1)

ä Let A ∈ Rm×n. Apply the Lanczos procedure to M = AAT .

Result:

QTk AATQk = Tk

with Qk orthogonal, Tk tridiagonal.

ä Define si ≡ orth. projection of Ab on subspace span{Qi}

si := QiQTi Ab.

ä si can be easily updated from si−1:

si = si−1 + qiqTi Ab.

M2A–07, Oct-2007 10

10

Page 11: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

IR: Use of the Lanczos algorithm (2)

ä If n < m it may be more economial to apply Lanczos to M =

ATA which is n × n. Result:

QTk ATAQk = Tk

ä Define:

ti := AQiQTi b,

ä Project b first before applying A to result.

M2A–07, Oct-2007 11

11

Page 12: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Why does this work?

ä First, recall a result on Lanczos algorithm [YS 83]

Let {λj, uj} = j-th eigen-pair of M (label ↓)

‖(I − QkQTk )uj‖

‖QkQTk uj‖

≤Kj

Tk−j(γj)

‖(I − Q1QT1 )uj‖

‖Q1QT1 uj‖

,

where

γj = 1 + 2λj − λj+1

λj+1 − λn

, Kj =

1 j = 1∏j−1i=1

λi−λn

λi−λjj 6= 1

,

and Tl(x) = Chebyshev polynomial of 1st kind of degree l.

This has the form

‖(I − QkQTk )uj‖ ≤ cj/Tk−j(γj),

where cj = constant independent of k

M2A–07, Oct-2007 12

12

Page 13: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

ä Result: Distance between unit eigenvector uj and Krylov sub-

space span(Qk) decays fast (for small j)

ä Consider component of difference between Ab − sk along left

singular directions of A. If A = UΣV T , then uj’s (columns of U )

are eigenvectors of M = AAT . So:

|〈Ab − sk, uj〉| =∣∣⟨(I − QkQ

Tk )Ab, uj

⟩∣∣=

∣∣⟨(I − QkQTk )uj, Ab

⟩∣∣≤ ‖(I − QkQ

Tk )uj‖‖Ab‖

≤ cj‖Ab‖T −1k−j(γj)

ä {si} converges rapidly to Ab in directions of the major left sin-

gular vectors of A.

M2A–07, Oct-2007 13

13

Page 14: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

ä Similar result for left projection sequence tj

ä Here is a typical distribution of eigenvalues of M : [Matrix of size

1398 × 1398]

x0

y

λ1λ2

λ3

λ4

λ5

· · ·

ä Convergence toward first few singular vectors very fast –

M2A–07, Oct-2007 14

14

Page 15: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Advantages of Lanczos over polynomial filters:

(1) No need for eigenvalue estimates

(2) Mat-vecs performed only in preprocessing

Disadvantages:

(1) Need to store Lanczos vectors;

(2) Preprocessing must be redone when A changes.

(3) Need for reorthogonalization – expensive for large k.

M2A–07, Oct-2007 15

15

Page 16: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Tests: IR

Information

retrieval

datasets

# Terms # Docs # queries sparsity

MED 7,014 1,033 30 0.735

CRAN 3,763 1,398 225 1.412

Pre

proc

essi

ngtim

es

Med dataset.

0 50 100 150 200 250 3000

10

20

30

40

50

60

70

80

90Med

iterations

prep

roce

ssin

g tim

e

lanczostsvd

Cran dataset.

0 50 100 150 200 250 3000

10

20

30

40

50

60Cran

iterations

prep

roce

ssin

g tim

e

lanczostsvd

M2A–07, Oct-2007 16

16

Page 17: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Average query times

Med dataset

0 50 100 150 200 250 3000

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5x 10

−3 Med

iterations

aver

age

quer

y tim

e

lanczostsvdlanczos−rftsvd−rf

Cran dataset.

0 50 100 150 200 250 3000

0.5

1

1.5

2

2.5

3

3.5

4x 10

−3 Cran

iterationsav

erag

e qu

ery

time

lanczostsvdlanczos−rftsvd−rf

M2A–07, Oct-2007 17

17

Page 18: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Average retrieval precision

Med dataset

0 50 100 150 200 250 3000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9Med

iterations

aver

age

prec

isio

n

lanczostsvdlanczos−rftsvd−rf

Cran dataset

0 100 200 300 400 500 6000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9Cran

iterationsav

erag

e pr

ecis

ion

lanczostsvdlanczos−rftsvd−rf

Retrieval precision comparisons

M2A–07, Oct-2007 18

18

Page 19: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Problem 2: Face Recognition – background

Problem: We are given a database of images: [arrays of pixel val-

ues]. And a test (new) image.

ÿ ÿ ÿ ÿ ÿ ÿ

↖ ↑ ↗

Question: Does this new image correspond to one of those in the

database?

M2A–07, Oct-2007 19

19

Page 20: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Difficulty

ä Different positions, expressions, lighting, ..., situations :

Common approach: eigenfaces – Principal Component Analysis tech-

nique

M2A–07, Oct-2007 20

20

Page 21: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Example: Occlusion.

See recent paper by

John Wright et al.

Top test image:

deliberate disguise.

Bottom: 50% pixels

randomly changed

Source: http://perception.csl.uiuc.edu/ ...

... recognition/Robust face.html

ä See also: Recent real-life example – international man-hunt

M2A–07, Oct-2007 21

21

Page 22: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Eigenfaces

– Consider each picture as a one-dimensional colum of all pixels

– Put together into an array A of size # pixels × # images.

. . . =⇒ . . .

︸ ︷︷ ︸A

– Do an SVD of A and perform comparison with any test image in

low-dim. space

– Similar to LSI in spirit – but data is not sparse.

Idea: replace SVD by Lanczos vectors (same as for IR)

M2A–07, Oct-2007 22

22

Page 23: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Tests: Face Recognition

Tests with 2 well-known data sets:

ORL 40 subjects, 10 sample images each – example:

# of pixels : 112 × 92 TOT. # images : 400

AR set 126 subjects – 4 facial expressions selected for each [natu-

ral, smiling, angry, screaming] – example:

# of pixels : 112 × 92 # TOT. # images : 504

M2A–07, Oct-2007 23

23

Page 24: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Tests: Face Recognition

Recognition accuracy of Lanczos approximation vs SVD

ORL dataset

0 10 20 30 40 50 60 70 80 90 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1ORL

iterations

aver

age

erro

r ra

te

lanczostsvd

AR dataset

0 10 20 30 40 50 60 70 80 90 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1AR

iterations

aver

age

erro

r ra

te

lanczostsvd

Vertical axis shows average error rate. Horizontal = Subspace di-

mension

M2A–07, Oct-2007 24

24

Page 25: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Problem 3: Clustering

* Joint work with Haw-Ren Fang – in progress

Problem: A set X of n objects in some space. Find subsets of

X that each contain objects that are most ’alike’

ä ‘Bread-and-butter problem’ – arises in *many* applications

ä Variation of the problem: Graph partitioning [need closeness +

few edge cuts]

ä Supervised clustering: Subsets are known – problem is to opti-

mally ‘classify’ a new item into one of the subsets

Questions: ‘alike’ in what sense? How many subsets?

M2A–07, Oct-2007 25

25

Page 26: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Clustering: using farthest centroids

ä Given X = [x1 x2 · · · xn] ∈ Rm×n

ä Centroid of a set Y = [y1, · · · , yp] is

cY = 1p

∑pj=1 yj = 1

pY e e = [1, 1, · · · , 1]T

ä Clustering into 2 even sets. Idea: find partition vector c:

Maximize ‖Xc‖2

subject to

ci = ±1, i = 1, · · · , n

cTe = 0

ä Subset X+ = set with ci = 1, Subset X− = set with ci = −1

ä cTe = 0 is a balance constraint between the 2 sets

M2A–07, Oct-2007 26

26

Page 27: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

ä Hard problem to solve [integer programming – NP-hard]

ä But: can be solved approximately [∼ graph partitioning]

ä Can also relax constraints.

Ê ’center’ X, i.e., use X = X − 1nXeT for X

Ë Replace ci = ±1 by cTc = n

Maximize ‖Xc‖2

subject to

‖c‖2 = 1,

cTe = 0

Solution = dominant singular vector.

ä Exploited by Boley ’97 in PDDP – [See also Juhasz ’81]

ä Similar idea exploited in graph partitioning

M2A–07, Oct-2007 27

27

Page 28: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Even-sets clustering by exchange

ä Go back to constraint ci = ±1 – i.e., use actual centroids

ä Need to improve a given partition

ä Similar to Kernigan and Lin in graph partitioning

ä Let Y = [y1, · · · , yn/2]. Z = [z1, · · · , zn/2]

ä Scaled squared distance between the centroids is

d = ‖Y e − Ze‖22 = (Y e − Ze)T (Y e − Ze)

ä What happens if we swap y∗ ∈ Y and z∗ ∈ Z ?

M2A–07, Oct-2007 28

28

Page 29: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

ä Call δ = y∗ − z∗

ä New distance:

dnew = ‖(Y e − y∗ + z∗) − (Ze − z∗ + y∗)‖22

= ‖(Y e − δ) − (Ze + δ)‖22

= ‖(Y e − Ze) − 2δ‖22

= d + 4‖δ‖22 − 4((Y e − Ze), δ)

ä Distance gains if :

−(Y e − Ze)Tδ + ‖δ‖22 > 0

M2A–07, Oct-2007 29

29

Page 30: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Idea:

ä Begin with the Lanczos algorithm for XTX to get ~s.v.v1

ä Get a marginal set among components of v1 for refining

ä Repeat: exchange marginal points (only) – until no further gains

are made

M2A–07, Oct-2007 30

30

Page 31: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Clustering: example

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

−4 −3 −2 −1 0 1 2 3−3

−2

−1

0

1

2

3

4

Initialization of two sets of n = 1, 000 random points on two-dimensional

plane. Green points are margin set (100). Left: uniform distribution;

right: normal distribution.

M2A–07, Oct-2007 31

31

Page 32: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Clustering : K-means + improvement

ALGORITHM : 1 K-means clustering algorithm

Given: K initial centroids p1, · · · , pK

Do:

Set Sj := ∅ for j = 1, . . . , K.

For i = 1, 2 . . . , n

Find k = argminj‖xi − pj‖

Set Sk := Sk ∪ {xi}.

EndFor

For j = 1, 2, . . . , K

Set pj == mean of points in Sj.

EndFor

While { p1, . . . , pK } have not converged.M2A–07, Oct-2007 32

32

Page 33: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

In words: Find closest centroid pk to each xi. Add this xi to Sk. Get

new centroids. Repeat.

ä Excellent algorithm – but very slow. Depends on initial set.

ä Common practice: start with something else – [cheaper]

Ideas:

Ê Start with PDDP [Lanczos] then refine with K-means

Ë Start with FCDP [Lanczos] then refine with K-means

M2A–07, Oct-2007 33

33

Page 34: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Clustering : test with ORL –get 40 clusters

2 3 4 5 6 7 8 9 10

0.4

0.5

0.6

0.7

0.8

0.9

140−clusters for face image clustering

# samples per subject

tota

l ent

ropy

PDDPFCDPPDDP+KmeansFCDP+KmeansKmeans

M2A–07, Oct-2007 34

34

Page 35: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

ä Result of clustering displayed on a 2-D plane:

Left: clustering by PCA. Right: clustering by FCDC.

M2A–07, Oct-2007 35

35

Page 36: Efficient Linear Algebra Methods in Data Miningsaad/PDF/M2A_07.pdfEfficient Linear Algebra Methods in Data Mining Yousef Saad University of Minnesota Dept. of Computer Science and

Conclusion

ä Many interesting linear algebra problems in data mining.

ä Current methods mix 1) statistics, 2) Linear algebra 3) Differential

geometry (manifold learning) 4) (Basic) graph theory

ä Have shown some simple techniques put to work..

ä Work on clustering still challenging..

ä Modern dimension reduction techniques (LLE, Eigenmaps, Isomap,

...) exploit nearest neighbor graph. Resulting methods quite power-

ful

M2A–07, Oct-2007 36

36