(Review) Linear Algebra Statistical Inference Python & R CS57300 - Data Mining Spring 2016 Instructor: Bruno Ribeiro © 2016 Bruno Ribeiro


Post on 21-Mar-2020


(Review) Linear Algebra
Statistical Inference
Python & R

CS57300 - Data Mining
Spring 2016

Instructor: Bruno Ribeiro

© 2016 Bruno Ribeiro

Goals today

} Review some basic concepts
◦ Linear Algebra
◦ Statistical Inference

} Introduce useful tools in Python and R

} Introduce the Scholar cluster

But before…

} A woman is leading an environmental protest outside PMU today

} What is more likely?
a) That she is an investment banker?
b) That she is an investment banker studying Environmental Engineering at Purdue?

$$P[A] = \sum_{b \in B} P[A, b]$$
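The marginalization rule settles the question: if A is "she is an investment banker" and b ranges over what she might study, then P[A] is a sum of non-negative terms P[A, b], so it can never be smaller than any single term. A minimal Python sketch, assuming a made-up joint distribution (the category names and probabilities are hypothetical, not from the slides):

```python
# Hypothetical joint probabilities P[A, b], where A = "is an investment banker"
# and b ranges over what she studies. The numbers are made up for illustration.
joint = {
    "environmental engineering": 0.0002,
    "other field": 0.0018,
    "not a student": 0.0080,
}

# Marginalization: P[A] = sum over all b of P[A, b]
p_banker = sum(joint.values())

# Option (b) is a single term of that sum
p_banker_and_env_eng = joint["environmental engineering"]

print(p_banker, p_banker_and_env_eng)
print(p_banker >= p_banker_and_env_eng)  # True for any valid joint distribution
```

Whatever numbers are plugged in, option (a) is at least as likely as option (b).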


Linear Algebra Review


Why Linear Algebra?

} Why Algebra?
◦ Computing is all about algebra
◦ Way to mathematically describe data

} Why Linear?
◦ Fast tools
◦ Easy to understand
◦ Many non-linear problems can be transformed or approximated as linear systems

} Combination works well in practice

Example of Linear Algebra Application: Explain Relationships

} Marvel character appears on comic book
} Described as an adjacency matrix (representing a graph)

[Figure: graph with node labels "Marvel character" and "Comic book"]

Matrix Transpose & Symmetry

3. $(AB)^T = B^T A^T$
3.1 symmetric matrix: $A = A^T$
3.2 $(B^T)^T = B$
3.3 $(BB^T)^T = ?$

Cocitation networks: $C = A^T A$ (Newman’s notation: $C = AA^T$)

Bibliographic coupling: $B = AA^T$
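The two products can be checked on a toy graph. This is a sketch only: the 4-paper citation graph is hypothetical, and it assumes the convention A[i, j] = 1 when paper i cites paper j, which is the convention under which cocitation is $A^T A$.

```python
import numpy as np

# Hypothetical citation graph on 4 papers; A[i, j] = 1 means paper i cites paper j.
A = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
])

C = A.T @ A  # cocitation: C[j, k] = number of papers that cite both j and k
B = A @ A.T  # bibliographic coupling: B[i, l] = number of papers cited by both i and l

print(C[2, 3])  # 2: papers 0 and 1 each cite both paper 2 and paper 3
print(B[0, 1])  # 2: papers 0 and 1 share the references 2 and 3

# Both products equal their own transpose, e.g. (A A^T)^T = A A^T
print(np.array_equal(C, C.T), np.array_equal(B, B.T))
```

The same products applied to a character-by-comic matrix count shared comics per pair of characters and shared characters per pair of comics.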

Important Properties
Linear algebra

1. Matrix multiplication is not commutative

$$\begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \neq \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}$$

2. Trace: $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$

2. inner product $\langle x, y \rangle = x^T y = \sum_{\forall i} x_i y_i$

2.1 $\langle x, x \rangle \geq 0$
2.2 $\|x\|^2 \equiv \langle x, x \rangle$
2.3 $\langle x, y \rangle = 0$ iff $x \perp y$
2.4 $\langle x, y \rangle = \|x\| \|y\| \cos\theta$, thus $\langle x, u \rangle = \|x\| \cos\theta$, where $u = y/\|y\|$
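These properties are easy to verify numerically. A small NumPy sketch using the two matrices from item 1; the vectors x and y are arbitrary values picked for illustration:

```python
import numpy as np

A = np.array([[0, 1], [0, 0]])
B = np.array([[0, 0], [1, 0]])

AB = A @ B  # [[1, 0], [0, 0]]
BA = B @ A  # [[0, 0], [0, 1]]
print(np.array_equal(AB, BA))        # False: multiplication is not commutative
print(np.trace(AB) == np.trace(BA))  # True: Tr(AB) = Tr(BA)

# Inner product and angle (x and y are illustrative values)
x = np.array([3.0, 4.0])
y = np.array([2.0, 0.0])
inner = x @ y                                                # <x, y> = x^T y
cos_theta = inner / (np.linalg.norm(x) * np.linalg.norm(y))  # from property 2.4
u = y / np.linalg.norm(y)                                    # unit vector along y
scalar_proj = x @ u                                          # equals ||x|| cos(theta)
print(inner, cos_theta, scalar_proj)
```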

Common Matrix Representations

Adjacency matrix

Undirected graph: $A = A^T$

Bipartite graph (undirected):

$$A = \begin{bmatrix} 0 & B \\ B^T & 0 \end{bmatrix}$$

k connected components:

$$A = \begin{bmatrix} A_1 & \cdots & 0 \\ 0 & \ddots & 0 \\ 0 & \cdots & A_k \end{bmatrix}$$
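The bipartite block structure can be assembled directly with `np.block`. A minimal sketch, assuming a hypothetical 2-character by 3-comic incidence block B:

```python
import numpy as np

# Hypothetical incidence block: 2 characters (rows) x 3 comics (columns).
B = np.array([
    [1, 1, 0],
    [0, 1, 1],
])

# Adjacency matrix over all 5 nodes: A = [[0, B], [B^T, 0]]
A = np.block([
    [np.zeros((2, 2), dtype=int), B],
    [B.T, np.zeros((3, 3), dtype=int)],
])

print(A.shape)                 # (5, 5)
print(np.array_equal(A, A.T))  # True: the graph is undirected
```

The zero diagonal blocks say that characters never link to characters and comics never link to comics.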

Orthogonal vectors (linearly independent)

} Consider vectors $u_1, \ldots, u_k$

} Normalized if $u_i^T u_i = 1$, also defined as $\|u_i\| = 1$, $i = 1, \ldots, k$

} Orthogonal if $u_i^T u_j = 0$, also defined as $u_i \perp u_j$, $i \neq j$

} Orthonormal if both

If $U$ is orthonormal then $U U^{-1} = U U^T = I$, where $I$ is the identity matrix

Orthogonal Projections

Note that $I = UU^T = \sum_{i=1}^{k} u_i u_i^T$

Thus $a = Ia = U(U^T a)$, $a \in \mathbb{R}^k$

The vector $(U^T a)$ is the projection of $a$ onto $U$

Thus, $a = \sum_{i=1}^{k} (u_i^T a) u_i$

We will eventually use this to show a drawback of Principal Component Analysis (PCA)
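The decomposition can be sketched in NumPy. The orthonormal basis below is a 2-D rotation matrix with an arbitrarily chosen angle, and the vector a is likewise an illustrative value:

```python
import numpy as np

theta = np.pi / 6  # arbitrary angle, chosen only for illustration
U = np.array([
    [np.cos(theta), -np.sin(theta)],
    [np.sin(theta),  np.cos(theta)],
])  # columns u_1, u_2 form an orthonormal basis of R^2

# I = U U^T = sum_i u_i u_i^T (sum of outer products)
outer_sum = sum(np.outer(U[:, i], U[:, i]) for i in range(2))

a = np.array([2.0, -1.0])
coeffs = U.T @ a  # coordinates of a in the basis: (u_i^T a) for each i
a_rebuilt = sum(coeffs[i] * U[:, i] for i in range(2))

print(np.allclose(outer_sum, np.eye(2)))  # True
print(np.allclose(a_rebuilt, a))          # True
```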

Can we have a non-orthogonal basis?

} Yes

} For instance:

Let $U = [u_1, u_2]$ be an orthonormal basis and let

$v_1 = 2u_1 + u_2$

$v_2 = u_1 + 2u_2$

The vectors $v_1$ and $v_2$ are not orthogonal:

$v_1^T v_2 = 4$,

but still form a basis for $\mathbb{R}^2$
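A quick numerical check of this example, assuming the standard basis for $u_1, u_2$ (any orthonormal pair would do) and an arbitrary test vector a:

```python
import numpy as np

u1 = np.array([1.0, 0.0])
u2 = np.array([0.0, 1.0])
v1 = 2 * u1 + u2
v2 = u1 + 2 * u2

inner = v1 @ v2                # 4.0, so v1 and v2 are not orthogonal
V = np.column_stack([v1, v2])  # basis vectors as columns
det = np.linalg.det(V)         # nonzero, so v1 and v2 are linearly independent

# Any vector can still be written as c1*v1 + c2*v2; solve V c = a for c.
a = np.array([3.0, 0.0])
coeffs = np.linalg.solve(V, a)
print(inner, det, coeffs)
```

With a non-orthogonal basis the coordinates come from solving a linear system rather than from inner products with the basis vectors.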

Eigenvalues and Eigenvectors

$A > 0$ is $n \times n$ full rank, $\lambda$ eigenvalue

$$\det(A - \lambda I) = 0$$

and $x$ (right) eigenvector

$$Ax = \lambda x,$$

we assume $\|x\| = 1$.

$$A \begin{bmatrix} x_1, \cdots, x_n \end{bmatrix} = \begin{bmatrix} x_1, \cdots, x_n \end{bmatrix} \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{bmatrix},$$

where $x_i$ is the $i$-th eigenvector of $A$ ordered s.t. $\lambda_1 \geq \cdots \geq \lambda_n$

If $A$ has $n$ linearly independent eigenvectors

$$A = V \Lambda V^{-1}$$

If $A$ is symmetric positive semidefinite $V^{-1} = V^T$

Inverse is $A^{-1} = V \Lambda^{-1} V^{-1}$, where

$$\Lambda^{-1} = \begin{bmatrix} 1/\lambda_1 & & \\ & \ddots & \\ & & 1/\lambda_n \end{bmatrix}$$

Square is $A^2 = V \Lambda^2 V^{-1}$
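These identities can be checked with `np.linalg.eigh`, which handles symmetric matrices. The 2x2 matrix below is an illustrative choice. Note that `eigh` returns eigenvalues in ascending order, the reverse of the ordering used on the slide.

```python
import numpy as np

# Illustrative symmetric positive definite matrix (eigenvalues 1 and 3).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, V = np.linalg.eigh(A)  # columns of V are unit-norm eigenvectors

A_rebuilt = V @ np.diag(eigvals) @ V.T     # A = V Lambda V^T (V^{-1} = V^T here)
A_inv = V @ np.diag(1.0 / eigvals) @ V.T   # A^{-1} = V Lambda^{-1} V^T
A_sq = V @ np.diag(eigvals ** 2) @ V.T     # A^2 = V Lambda^2 V^T

print(eigvals)
print(np.allclose(A_inv, np.linalg.inv(A)), np.allclose(A_sq, A @ A))
```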

Singular Value Decomposition (SVD)

} X(i,j) = value of user i for property j

◦ X(Alice, cholesterol) = 10

◦ X(i,j) = number of times i buys j

◦ X(i,j) = how much i pays j

◦ X(i,j) = 1 if i and j are friends, 0 otherwise

◦ X(i,j) = temperature of sensor j at time i

[Figure: data matrix X with row index i and column index j]

© 2015 Bruno Ribeiro

SVD Dimensions

$$X = U \Sigma V^T$$

} X (data): $x_1, x_2, \ldots, x_m$
} U (left singular vectors): $u_1, u_2, \ldots, u_k$
} Σ (singular values): $\sigma_1, \sigma_2, \ldots, \sigma_k$ on the diagonal
} V^T (right singular vectors): $v_1, v_2, \ldots, v_k$

SVD Definition

$$X = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots$$

(each term $u_i v_i^T$ is an outer product)

} SVD gives best rank-k approximation of X in L2 and Frobenius norm

SVD Properties (I)

} “Almost unique” decomposition

} There are two sources of ambiguity
◦ Orientation of singular vectors
▪ Permute rows of left singular vector and corresponding rows of right singular vector
◦ If I is identity matrix: $I = U I U^T$, for all orthonormal U
▪ “Hypersphere ambiguity”
▪ Related to rotational ambiguity of PCA

$$X = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots$$

SVD Properties (II)

} Theorem (Eckart-Young, 1936)
◦ $U \Sigma_1 V^T$ is the best rank 1 approximation of X, that is
$\|X - U \Sigma_1 V^T\|_2 \leq \|X - Y\|_2$ for every rank 1 matrix Y
◦ $U \Sigma_1 V^T + U \Sigma_2 V^T$ is the best rank 2 approximation of X, that is
$\|X - U \Sigma_1 V^T - U \Sigma_2 V^T\|_2 \leq \|X - Y\|_2$ for every rank ≤ 2 matrix Y
◦ also for 3, 4, …, r
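The theorem can be illustrated numerically. This sketch reads $\Sigma_i$ as Σ with every singular value except $\sigma_i$ set to zero (an interpretation, since the slides do not define it), so the rank-k approximation keeps the first k terms $\sigma_i u_i v_i^T$. The data matrix and the competing rank 1 matrix Y are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # illustrative random data matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # s is sorted in descending order

def rank_k_approx(k):
    """Sum of the first k terms sigma_i * u_i * v_i^T."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

X1 = rank_k_approx(1)
err_svd = np.linalg.norm(X - X1, ord=2)  # spectral norm of the residual

# Any other rank 1 matrix does at least as badly.
Y = np.outer(rng.normal(size=6), rng.normal(size=4))
err_other = np.linalg.norm(X - Y, ord=2)

print(err_svd <= err_other)        # True
print(np.isclose(err_svd, s[1]))   # True: the residual norm is the next singular value
```

In the Frobenius norm the residual of the rank-k approximation is the square root of the sum of the squared discarded singular values.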

SVD Properties (III)

U and V are orthonormal (orthogonal & unit norm)

Singular Value Decomposition

• SVD factorization $A = U \Sigma V^*$ is more general than eigenvalue / eigenvector factorization $A = V \Lambda V^{-1}$.

$V^*$ is the transpose if $V$ is real-valued (always the case for us)

Columns of $U$ are orthonormal

Columns of $V$ are orthonormal

$$\Sigma = \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_n \end{bmatrix}$$

SVD is significantly more generic:
} Applies to matrices of any shape, not just square matrices
} Applies to any matrix, not just invertible matrices
} $AA^T = (U \Sigma V^T)(U \Sigma V^T)^T = (U \Sigma V^T)(V \Sigma^T U^T) = U \Sigma \Sigma^T U^T = U \,\mathrm{diag}(\Sigma)^2\, U^T$
} Moore–Penrose pseudoinverse is $A^+ = V \Sigma^+ U^T$
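A short NumPy sketch of the pseudoinverse built from the SVD, compared against `np.linalg.pinv`. The 3x2 matrix is an illustrative choice, and the `1e-12` cutoff for treating a singular value as zero is an assumption of this sketch.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])  # 3x2: neither square nor invertible

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # U is 3x2, s has 2 entries, Vt is 2x2

# Sigma^+ inverts the nonzero singular values and leaves zeros in place.
s_plus = np.array([1.0 / v if v > 1e-12 else 0.0 for v in s])
A_plus = Vt.T @ np.diag(s_plus) @ U.T  # A^+ = V Sigma^+ U^T

print(np.allclose(A_plus, np.linalg.pinv(A)))             # True
print(np.allclose(A @ A.T, U @ np.diag(s ** 2) @ U.T))    # True: A A^T = U diag(Sigma)^2 U^T
```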

Understanding SVD singular vectors

} As U and V have orthogonal rows

Now you explain: What do V and U represent?

My “Help”

} If X(i,j) = user i buys product j

} What is $X^T X$?
◦ Product-to-product similarity matrix
◦ What does V represent?

} What is $X X^T$?
◦ User-to-user similarity matrix
◦ What does U represent?

[Figure: data matrix X with row index i and column index j]
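One way to explore the question is to compute both products on a toy purchase matrix and compare them with the SVD factors. The 4-user by 3-product matrix is hypothetical.

```python
import numpy as np

# Hypothetical purchases: X[i, j] = 1 if user i buys product j (4 users, 3 products).
X = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
])

XtX = X.T @ X  # product-to-product: entry (j, k) counts users who buy both j and k
XXt = X @ X.T  # user-to-user: entry (i, l) counts products bought by both i and l

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# X^T X = V Sigma^2 V^T and X X^T = U Sigma^2 U^T
print(np.allclose(XtX, V @ np.diag(s ** 2) @ V.T))  # True
print(np.allclose(XXt, U @ np.diag(s ** 2) @ U.T))  # True
```

So the columns of V are eigenvectors of the product-to-product matrix and the columns of U are eigenvectors of the user-to-user matrix.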


Statistical Inference Review

Estimating characteristics from sampling

[Diagram labels: World, raw samples, sample summary, characteristic, summary, data processing inequality]

Data processing inequality: “No processing can increase the amount of statistical information already contained in the data”


} In data mining we often work with a sample of data from the population of interest

} Estimation techniques allow inferences about population properties from sample data

} If we had the population we could calculate the properties of interest


} Elementary units:

◦ Entities (e.g., persons, objects, events) that meet a set of specified criteria

◦ Example: All people who’ve purchased something at Walmart

} Population:

◦ Aggregate of elementary units (i.e., all items of interest)

} Sample:

◦ Sub-group of the population

◦ Serves as a reference group for estimating characteristics about the population and drawing conclusions


} Sampling is the main technique employed for data selection

◦ It is often used for both the preliminary investigation of the data and the final data analysis

} Reasons to sample

◦ Obtaining the entire set of data of interest is too expensive or time consuming

◦ Processing the entire set of data of interest is too expensive or time consuming

◦ Note: Even if you use an entire dataset for analysis, you should be aware of the sampling method that was used to gather the dataset

Tan, Steinbach, Kumar. Introduction to Data Mining, 2004.


} The key principle for effective sampling is the following:

◦ Using a sample will work almost as well as using the entire dataset, if the sample is representative

◦ A sample is representative if it has approximately the same property (of interest) as the original data
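A small sketch of what "representative" means for one property, the mean. The population (the integers 0 to 99,999), the sample size, and the seed are all made-up choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: the integers 0..99999 (mean 49999.5).
population = np.arange(100_000)
pop_mean = population.mean()

# Uniform random sample without replacement: representative for the mean.
sample = rng.choice(population, size=1_000, replace=False)
sample_mean = sample.mean()

# A convenience sample (the first 1000 units) is not representative.
biased_mean = population[:1_000].mean()

print(pop_mean, sample_mean, biased_mean)
```

The random sample lands near the population mean, while the convenience sample misses it badly, which is why the sampling method behind a dataset matters.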


} Infer properties of an unknown distribution with sample data generated from that distribution

} Parameter estimation
◦ Infer the value of a population parameter based on a sample statistic (e.g., estimate the mean)

} Hypothesis testing
◦ Infer the answer to a question about a population parameter based on a sample statistic (e.g., is the mean non-zero?)

} Maximum likelihood estimate (MLE)

$$\hat{\theta} = \arg\max_{\theta} P[\text{Data} \mid \theta] \quad \text{s.t. } f(\theta) = 0$$

} Maximum a posteriori probability estimate (MAP)

$$\hat{\theta} = \arg\max_{\theta} P[\text{Data} \mid \theta]\, P[\theta] \quad \text{s.t. } f(\theta) = 0$$
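A minimal sketch of both estimators for a coin-flip model, found by grid search. Everything specific here is an assumption made for illustration: the data (7 heads in 10 flips), the Beta(2, 2) prior, and the grid. The constraint $f(\theta) = 0$ is left out, so this is the unconstrained case.

```python
import numpy as np

heads, flips = 7, 10  # hypothetical data

# Candidate values of theta = P[heads], avoiding 0 and 1 so the logs stay finite.
theta = np.linspace(0.001, 0.999, 999)

# log P[Data | theta] for independent flips (up to an additive constant)
log_lik = heads * np.log(theta) + (flips - heads) * np.log(1 - theta)

# log P[theta] for a Beta(2, 2) prior (up to an additive constant)
log_prior = np.log(theta) + np.log(1 - theta)

theta_mle = theta[np.argmax(log_lik)]
theta_map = theta[np.argmax(log_lik + log_prior)]

print(theta_mle)  # close to 7/10
print(theta_map)  # close to (7 + 1) / (10 + 2), pulled toward 0.5 by the prior
```

Working with logs turns the product $P[\text{Data} \mid \theta] P[\theta]$ into a sum and avoids numerical underflow on larger datasets.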


import numpy as np
from matplotlib import use
# Avoid using the xterminal to create plots
# (needed if plotting on Scholar)
use("Agg")
import matplotlib.pyplot as plt
import scipy as sp
import math

p = 0.8
MAX_DEGREE = 101
ECCDF = 1.0
x = []
y = []
# range (not xrange) so the script runs on the Python 3 module loaded on Scholar
for d in range(MAX_DEGREE):
    # Be careful with machine precision
    x.append(d)
    y.append(ECCDF)
    ECCDF = ECCDF - (1-p)*p**d

plt.xlim([1,max(x)])
plt.xlabel("node degree", fontsize=18)
plt.ylabel("ECCDF", fontsize=18)
plt.loglog(x,y,"ro")
plt.savefig('ECCDF_plot.pdf')

# Read CSV into R
mydata = read.csv(file="input.csv", header=FALSE, sep=",")

# Generate a random matrix (elements exponentially distributed)
h = 100
k = 200
M = matrix(rexp(n=(h*k), rate=0.1), ncol=h, nrow=k)

print(max(M))
max_per_row = c()
for (i in 1:k) {
  max_per_row = c(max_per_row, max(M[i,]))
}
plot(max_per_row)
print(max(max_per_row)/min(max_per_row))

Page 33: (Review) Linear Algebra Statistical Inference Python & R · Review some basic concepts Linear Algebra Statistical Inference! Introduce useful tools in Python and R! Introduce the

33

Scholar Cluster

} High Performance Computing Cluster

} Jobs must be submitted with qsub (do not use main terminal to run tasks or you will be banished)

} But you can also use your own computer (but Scholar learning curve pays off later)

Qsub Example File

#!/bin/bash -l
# Example submission file (myjob.sub)
# choose queue to use (e.g. standby or scholar-b)
#PBS -q standby
# FILENAME: myjob.sub
module load devel
module load gcc
module load anaconda/4.4.1-py35
module load r
cd $PBS_O_WORKDIR
unset DISPLAY

python matlibplot_example.py

Scholar Cluster Job Submission

} qsub myjob.sub

} qstat -u <your Purdue username>

} After job finishes (submission directory has output files):
◦ myjob.sub.o<jobID> (std output)
◦ myjob.sub.e<jobID> (error)
◦ ECCDF_plot.pdf (your plot)

Not ideal for someone learning Python and R

} Submission takes a while to run (few minutes)
◦ How to iteratively debug code?
◦ qsub -I -q standby -l walltime=01:00:00
# asks for 1 hour interactive terminal to debug problems in your code
# remember to load the modules you need

} Where to get help
https://www.rcac.purdue.edu/compute/scholar/guide/