Dimensionality Reduction
Dimensionality Reduction
Slides by Jure Leskovec 2
Dimensionality Reduction
• High-dimensional == many features
• Find concepts/topics/genres:
  – Documents: features are thousands of words, millions of word pairs
  – Surveys – Netflix: 480k users x 177k movies
Dimensionality Reduction
• Compress / reduce dimensionality:
  – 10^6 rows; 10^3 columns; no updates
  – random access to any cell(s); small error: OK
Dimensionality Reduction
• Assumption: Data lies on or near a low d-dimensional subspace
• Axes of this subspace are effective representation of the data
Why Reduce Dimensions?
• Discover hidden correlations/topics
  – e.g., words that commonly occur together
• Remove redundant and noisy features
  – not all words are useful
• Interpretation and visualization
• Easier storage and processing of the data
SVD - Definition
A[m x n] = U[m x r] Σ[r x r] (V[n x r])^T

• A: input data matrix
  – m x n matrix (e.g., m documents, n terms)
• U: left singular vectors
  – m x r matrix (m documents, r concepts)
• Σ: singular values
  – r x r diagonal matrix (strength of each 'concept')
  – (r: rank of the matrix A)
• V: right singular vectors
  – n x r matrix (n terms, r concepts)
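The definition above can be tried out directly. Below is a small numpy sketch (not part of the original deck); the matrix A is the 7 x 5 users-to-movies example used on the later slides:

```python
import numpy as np

# The 7x5 users-to-movies matrix from the slides
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

# "Economy" SVD: U is m x min(m,n), s holds singular values, Vt is min(m,n) x n
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.round(s, 2))                        # singular values, decreasing order
print(np.allclose(A, U @ np.diag(s) @ Vt))   # A is reconstructed exactly
```

The two leading singular values come out as roughly 9.64 and 5.29, matching the example decomposition shown later in the deck.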
SVD
[Diagram: the m x n matrix A factored as A = U x Σ x V^T, with U of size m x r, Σ of size r x r, and V^T of size r x n.]
SVD
  A = σ1 u1 v1^T + σ2 u2 v2^T + …

σi … scalar (singular value)
ui … (left singular) vector
vi … (right singular) vector
SVD - Properties
It is always possible to decompose a real matrix A into A = U Σ V^T, where
• U, Σ, V: unique
• U, V: column orthonormal:
  – U^T U = I; V^T V = I (I: identity matrix)
  – (columns are orthogonal unit vectors)
• Σ: diagonal
  – entries (singular values) are positive,
    and sorted in decreasing order (σ1 ≥ σ2 ≥ … ≥ 0)
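These properties are easy to verify numerically. A numpy sketch (mine, not the deck's) on the example matrix:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Column orthonormality: U^T U = I and V^T V = I
print(np.allclose(U.T @ U, np.eye(5)))
print(np.allclose(Vt @ Vt.T, np.eye(5)))

# Singular values are nonnegative and sorted in decreasing order
print(np.all(s[:-1] >= s[1:]) and np.all(s >= 0))
```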
SVD – Example: Users-to-Movies
• A = U Σ V^T – example:

Rows of A: users; columns: Matrix, Alien, Serenity, Casablanca, Amelie.
Users 1–4 are SciFi fans; users 5–7 are Romance fans.

  A =  1 1 1 0 0        U =  0.18 0          Σ =  9.64 0
       2 2 2 0 0             0.36 0               0    5.29
       1 1 1 0 0             0.18 0
       5 5 5 0 0             0.90 0         V^T =  0.58 0.58 0.58 0    0
       0 0 0 2 2             0    0.53             0    0    0    0.71 0.71
       0 0 0 3 3             0    0.80
       0 0 0 1 1             0    0.27
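The factor values quoted on the slide can be recovered with numpy (a sketch of mine; note that SVD columns are only determined up to sign, hence the absolute values):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Up to sign, the leading factors match the numbers on the slide
print(np.round(np.abs(U[:, :2]), 2))   # columns ~ [0.18 0.36 0.18 0.90 0 0 0], [0 0 0 0 0.53 0.80 0.27]
print(np.round(s[:2], 2))              # ~ [9.64 5.29]
print(np.round(np.abs(Vt[:2]), 2))     # rows ~ [0.58 0.58 0.58 0 0], [0 0 0 0.71 0.71]
```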
SVD – Example: Users-to-Movies
• A = U Σ V^T – same decomposition as on the previous slide.
• The first column of U and the first row of V^T form the SciFi-concept; the second column/row form the Romance-concept.
SVD - Example
• A = U Σ V^T – example (as above).
• U is the "user-to-concept" similarity matrix: the SciFi fans (users 1–4) have large entries in U's first (SciFi-concept) column; the Romance fans (users 5–7) in its second column.
SVD - Example
• A = U Σ V^T – example (as above).
• The diagonal entries of Σ give the 'strength' of each concept: σ1 = 9.64 is the 'strength' of the SciFi-concept, σ2 = 5.29 of the Romance-concept.
SVD - Example
• A = U Σ V^T – example (as above).
• V is the "movie-to-concept" similarity matrix: Matrix, Alien and Serenity load on the SciFi-concept (first row of V^T: 0.58 0.58 0.58 0 0); Casablanca and Amelie on the Romance-concept.
SVD - Interpretation #1
'movies', 'users' and 'concepts':
• U: user-to-concept similarity matrix
• V: movie-to-concept similarity matrix
• Σ: its diagonal elements give the 'strength' of each concept
SVD - Interpretation #2
SVD gives the best axis to project on:
• 'best' = minimizes the sum of squared projection errors
• i.e., minimum reconstruction error
[Figure: scatter plot of Movie 1 rating vs. Movie 2 rating; v1, the first right singular vector, is the best projection axis.]
SVD - Interpretation #2
• A = U Σ V^T – example (the decomposition from the previous slides).
• The first right singular vector v1 (first row of V^T) is the best axis through the data.
[Figure: the Movie 1 rating vs. Movie 2 rating scatter plot with v1 drawn through the data.]
SVD - Interpretation #2
• A = U Σ V^T – example (as above).
• σ1 = 9.64 measures the variance ('spread') of the data along the v1 axis.
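The 'spread along v1' claim can be checked numerically (my numpy sketch; the deck uses 'variance' loosely here, since this is the uncentered sum of squares):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

v1 = Vt[0]                     # first right singular vector
proj = A @ v1                  # coordinate of each user along the v1 axis

# The root of the total squared spread along v1 equals sigma_1 = 9.64,
# the largest achievable over all unit-length axes
print(np.round(np.linalg.norm(proj), 2))
```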
SVD - Interpretation #2
More details
• Q: How exactly is dim. reduction done?
(Start from the full decomposition A = U Σ V^T of the example matrix.)
SVD - Interpretation #2
More details
• Q: How exactly is dim. reduction done?
• A: Set the smallest singular values to zero.
(Here the full Σ = diag(9.64, 5.29).)
SVD - Interpretation #2
More details
• Q: How exactly is dim. reduction done?
• A: Set the smallest singular values to zero:

  A ≈ U x Σ' x V^T  with  Σ' =  9.64 0
                                0    0
  (U and V^T as before)
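The zeroing step looks like this in code (a numpy sketch of mine, not from the deck). With σ2 zeroed, the last three rows of the approximation collapse to zero, and the Frobenius error equals the dropped singular value:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 1
s_trunc = s.copy()
s_trunc[k:] = 0.0               # set the smallest singular values to zero
B = U @ np.diag(s_trunc) @ Vt   # rank-k approximation of A

print(np.round(B, 2))                       # SciFi rows survive; Romance rows go to ~0
print(np.round(np.linalg.norm(A - B), 2))   # error = sigma_2 = 5.29
```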
SVD - Interpretation #2
More details
• Q: How exactly is dim. reduction done?
• A: Set the smallest singular values to zero. Once σ2 = 0, the second column of U and the second row of V^T drop out:

       0.18
       0.36
       0.18
  A ≈  0.90   x  9.64  x  0.58 0.58 0.58 0 0
       0
       0
       0
SVD - Interpretation #2
More details
• Q: How exactly is dim. reduction done?
• A: Set the smallest singular values to zero:

  A =  1 1 1 0 0        B =  1 1 1 0 0
       2 2 2 0 0             2 2 2 0 0
       1 1 1 0 0             1 1 1 0 0
       5 5 5 0 0             5 5 5 0 0
       0 0 0 2 2             0 0 0 0 0
       0 0 0 3 3             0 0 0 0 0
       0 0 0 1 1             0 0 0 0 0

Frobenius norm: ǁMǁF = √(Σij Mij²)
ǁA−BǁF = √(Σij (Aij − Bij)²) is "small"
[Diagram: A = U Σ V^T (the full SVD) shown next to B = U S V^T (with the smallest singular values in S set to zero); B is approx. A.]
SVD – Best Low Rank Approx.
• Theorem: Let A = U Σ V^T (σ1 ≥ σ2 ≥ …, rank(A) = r), and let B = U S V^T, where
  – S = diagonal matrix with si = σi (i = 1…k), else si = 0.
  Then B is a best rank-k approximation to A:
  – B is a solution to minB ǁA−BǁF where rank(B) = k
• We will need 2 facts:
  – ǁMǁF² = Σi qii², where M = P Q R is the SVD of M (Q = diag(q11, …, qrr))
  – U Σ V^T − U S V^T = U (Σ − S) V^T
SVD – Best Low Rank Approx.
• We will need 2 facts:
  – ǁMǁF² = Σi qii², where M = P Q R is the SVD of M
  – U Σ V^T − U S V^T = U (Σ − S) V^T
• We apply:
  – P column orthonormal
  – R row orthonormal
  – Q is diagonal
SVD – Best Low Rank Approx.
• A = U Σ V^T, B = U S V^T (σ1 ≥ σ2 ≥ … ≥ 0, rank(A) = r)
  – S = diagonal matrix with si = σi (i = 1…k), else si = 0
  then B is a solution to minB ǁA−BǁF with rank(B) = k. Why?
• Using the two facts:
  U Σ V^T − U S V^T = U (Σ − S) V^T, so
  minB, rank(B)=k ǁA−BǁF = minS ǁΣ − SǁF = minsi √(Σi=1..r (σi − si)²)
• We want to choose the si to minimize Σi=1..r (σi − si)²: set si = σi (i = 1…k), else si = 0, giving
  minsi Σi=1..r (σi − si)² = Σi=k+1..r σi²
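Both facts used in the proof can be verified on the running example (my numpy sketch): the squared Frobenius norm equals the sum of squared singular values, and the best rank-k error is exactly the sum of the dropped σi²:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Fact: ||M||_F^2 = sum of squared singular values
print(np.isclose(np.linalg.norm(A, 'fro')**2, np.sum(s**2)))

# Best rank-k error: ||A - A_k||_F^2 = sum of the dropped sigma_i^2
k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
print(np.isclose(np.linalg.norm(A - A_k, 'fro')**2, np.sum(s[k:]**2)))
```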
SVD - Interpretation #2
Equivalent: 'spectral decomposition' of the matrix:

  A = [u1 u2]  x  σ1 0   x  v1^T
                  0  σ2     v2^T

(A: the 7 x 5 users-to-movies example matrix; u1, u2 the columns of U; v1, v2 the columns of V.)
SVD - Interpretation #2
Equivalent: 'spectral decomposition' of the matrix:

  A = u1 σ1 v1^T + u2 σ2 v2^T + …

Each term σi ui vi^T is a rank-1 matrix: an m x 1 column vector times a 1 x n row vector. Keep only the first k terms.
Assume: σ1 ≥ σ2 ≥ σ3 ≥ … ≥ 0
Why is setting the small σs to zero the thing to do? The vectors ui and vi are unit length, so σi scales each term. Zeroing the small σs therefore introduces the least error.
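The sum-of-rank-1-terms view can be reproduced directly with outer products (a numpy sketch of mine). Since the example matrix has rank 2, two terms rebuild it exactly:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A as a sum of rank-1 terms sigma_i * u_i v_i^T; two terms suffice here
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(2))
print(np.allclose(A, A_rebuilt))   # True, since rank(A) = 2
```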
SVD - Interpretation #2
Q: How many σs to keep?
A: Rule of thumb: keep 80–90% of the 'energy' (= Σi σi²).

  A = u1 σ1 v1^T + u2 σ2 v2^T + …   (assume σ1 ≥ σ2 ≥ σ3 ≥ …)
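The energy rule of thumb can be applied mechanically (my numpy sketch; the 80–90% threshold is the slide's, the variable names are mine). On this example σ1² = 93 out of a total of 121, i.e. about 77%, so a 90% threshold keeps k = 2 concepts:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

s = np.linalg.svd(A, compute_uv=False)

energy = s**2
frac = np.cumsum(energy) / energy.sum()    # cumulative 'energy' fraction
print(np.round(frac, 3))                   # first entry ~0.769

# Keep the smallest k whose cumulative energy reaches the threshold
k = int(np.searchsorted(frac, 0.9) + 1)
print(k)                                   # 2 for a 90% threshold
```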
SVD - Complexity
• To compute the full SVD: O(nm²) or O(n²m) (whichever is less)
• But less work is needed if:
  – we just want the singular values
  – or we want only the first k singular vectors
  – or the matrix is sparse
• Implemented in linear algebra packages:
  – LINPACK, Matlab, S-Plus, Mathematica, …
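The "less work for the first k singular vectors / sparse matrices" point is what truncated solvers exploit. A sketch using scipy's `svds` (assuming scipy is available; it computes only k leading singular triplets and accepts sparse input):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

A = csr_matrix(np.array([[1, 1, 1, 0, 0],
                         [2, 2, 2, 0, 0],
                         [1, 1, 1, 0, 0],
                         [5, 5, 5, 0, 0],
                         [0, 0, 0, 2, 2],
                         [0, 0, 0, 3, 3],
                         [0, 0, 0, 1, 1]], dtype=float))

# Truncated SVD: only the k leading singular triplets are computed
U, s, Vt = svds(A, k=2)

# svds returns singular values in ascending order
print(np.round(np.sort(s)[::-1], 2))   # ~ [9.64 5.29]
```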
SVD - Conclusions so far
• SVD: A = U Σ V^T: unique
  – U: user-to-concept similarities
  – V: movie-to-concept similarities
  – Σ: strength of each concept
• Dimensionality reduction:
  – keep the few largest singular values (80–90% of the 'energy')
  – SVD picks up linear correlations
Case study: How to query?
Q: Find users that like 'Matrix' and 'Alien'
(Recall the decomposition A = U Σ V^T of the users-to-movies matrix, with columns Matrix, Alien, Serenity, Casablanca, Amelie.)
Case study: How to query?
Q: Find users that like 'Matrix'
A: Map the query into a 'concept space' – how?
Case study: How to query?
Q: Find users that like 'Matrix'
A: Map query vectors into 'concept space' – how?

  q = [5 0 0 0 0]   (ratings for Matrix, Alien, Serenity, Casablanca, Amelie)

Project into concept space: take the inner product with each 'concept' vector vi.
[Figure: q plotted in the Matrix–Alien plane, together with the concept axes v1 and v2.]
Case study: How to query?
Q: Find users that like 'Matrix'
A: Map the vector into 'concept space' – how?
[Figure: the projection q·v1 of q = [5 0 0 0 0] onto the first concept axis v1.]
Case study: How to query?
Compactly, we have: qconcept = q V

E.g., with q = [5 0 0 0 0] (Matrix, Alien, Serenity, Casablanca, Amelie):

      0.58 0
      0.58 0
  V = 0.58 0          (movie-to-concept similarities)
      0    0.71
      0    0.71

  qconcept = q V = [2.9 0]   →  q lies entirely on the SciFi-concept.
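The projection qconcept = q V is a single matrix-vector product. A numpy sketch (mine), with V hard-coded to the rounded values shown on the slide:

```python
import numpy as np

# V as given on the slide (columns = SciFi-concept, Romance-concept)
V = np.array([[0.58, 0.00],
              [0.58, 0.00],
              [0.58, 0.00],
              [0.00, 0.71],
              [0.00, 0.71]])

q = np.array([5, 0, 0, 0, 0], dtype=float)   # user who rated 'Matrix' 5

q_concept = q @ V
print(np.round(q_concept, 2))   # ~ [2.9 0], i.e. purely SciFi-concept
```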
Case study: How to query?
How would a user d who rated ('Alien', 'Serenity') be handled? The same way: dconcept = d V

E.g., with d = [0 4 5 0 0] (Matrix, Alien, Serenity, Casablanca, Amelie) and V the movie-to-concept similarity matrix as above:

  dconcept = d V = [5.22 0]   →  d also lies entirely on the SciFi-concept.
Case study: How to query?
Observation: user d, who rated ('Alien', 'Serenity'), is similar to the query "user" q, who rated only ('Matrix'), although d did not rate 'Matrix'!

  q = [5 0 0 0 0]  →  qconcept = [2.9 0]
  d = [0 4 5 0 0]  →  dconcept = [5.22 0]

In raw rating space: similarity = 0 (no movies in common). In concept space: similarity ≠ 0.
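The observation above is easy to confirm with cosine similarity (my numpy sketch, again with the slide's rounded V). The raw vectors are orthogonal, but in concept space the two users point in the same direction:

```python
import numpy as np

V = np.array([[0.58, 0.00],
              [0.58, 0.00],
              [0.58, 0.00],
              [0.00, 0.71],
              [0.00, 0.71]])

q = np.array([5, 0, 0, 0, 0], dtype=float)   # rated 'Matrix' only
d = np.array([0, 4, 5, 0, 0], dtype=float)   # rated 'Alien' and 'Serenity'

print(q @ d)   # 0.0 -- no overlap in raw rating space

qc, dc = q @ V, d @ V   # ~[2.9 0] and ~[5.22 0]
cos = qc @ dc / (np.linalg.norm(qc) * np.linalg.norm(dc))
print(round(cos, 2))    # identical direction in concept space
```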
SVD: Drawbacks
+ Optimal low-rank approximation:
  • in the Frobenius norm
− Interpretability problem:
  – a singular vector specifies a linear combination of all input columns or rows
− Lack of sparsity:
  – singular vectors are dense, even when the input matrix is sparse