(Review) Linear Algebra, Statistical Inference, Python & R
CS57300 - Data Mining, Spring 2016
Instructor: Bruno Ribeiro
© 2016 Bruno Ribeiro
Goals today
} Review some basic concepts
◦ Linear Algebra
◦ Statistical Inference
} Introduce useful tools in Python and R
} Introduce the Scholar cluster
But before…
} A woman is leading an environmental protest outside PMU today
} What is more likely?
a) That she is an investment banker?
b) That she is an investment banker studying Environmental Engineering at Purdue?

Recall the marginalization rule:
P[A] = Σ_{b∈B} P[A, b]
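The marginalization rule can be sanity-checked on a tiny joint distribution; the numbers below are made up purely for illustration:

```python
import numpy as np

# Hypothetical joint distribution P[A, b]: rows index A (banker or not),
# columns index b (studying Environmental Engineering or not).
joint = np.array([[0.001, 0.019],   # A = investment banker
                  [0.040, 0.940]])  # A = not an investment banker

# Marginalize out b: P[A] = sum over b of P[A, b]
P_A = joint.sum(axis=1)

# The conjunction P[A, b] can never exceed the marginal P[A]
assert joint[0, 0] <= P_A[0]
```

This is exactly why option (b) above cannot be more likely than option (a): a conjunction is a single term of the sum that forms the marginal.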
Linear Algebra Review

Why Linear Algebra?
} Why Algebra?
◦ Computing is all about algebra
◦ A way to mathematically describe data
} Why Linear?
◦ Fast tools
◦ Easy to understand
◦ Many non-linear problems can be transformed or approximated as linear systems
} The combination works well in practice
Example of Linear Algebra Application: Explain Relationships
} Marvel character appears in comic book
} Described as an adjacency matrix A (representing a graph: rows are Marvel characters, columns are comic books)

Matrix Transpose & Symmetry
3. (AB)^T = B^T A^T
3.1 symmetric matrix: A = A^T
3.2 (B^T)^T = B
3.3 (BB^T)^T = ?

Cocitation networks: C = A^T A (Newman's notation: C = A A^T)
Bibliographic coupling: B = A A^T
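A quick NumPy sketch of the cocitation and bibliographic-coupling products on a made-up character-by-comic matrix (the entries are purely illustrative):

```python
import numpy as np

# Hypothetical character-by-comic adjacency matrix:
# rows = Marvel characters, columns = comic books,
# A[i, j] = 1 if character i appears in comic j.
A = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]])

# Cocitation (in the slide's convention): comic-to-comic counts
C = A.T @ A   # C[j1, j2] = number of characters in both comics j1 and j2
# Bibliographic coupling: character-to-character counts
B = A @ A.T   # B[i1, i2] = number of comics shared by characters i1 and i2

# Answer to 3.3: (B B^T)^T = B B^T, i.e., the product is symmetric
assert np.array_equal((A @ A.T).T, A @ A.T)
```

Both products are symmetric, which is why they define similarity matrices (and undirected similarity graphs).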
Important Properties (Linear Algebra)
1. Matrix multiplication is not commutative:
[0 1; 0 0] [0 0; 1 0] = [1 0; 0 0], but [0 0; 1 0] [0 1; 0 0] = [0 0; 0 1]
2. Trace: Tr(AB) = Tr(BA)
3. Inner product: ⟨x, y⟩ = x^T y = Σ_i x_i y_i
3.1 ⟨x, x⟩ ≥ 0
3.2 ∥x∥² ≡ ⟨x, x⟩
3.3 ⟨x, y⟩ = 0 iff x ⊥ y
3.4 ⟨x, y⟩ = ∥x∥ ∥y∥ cos θ; thus ⟨x, u⟩ = ∥x∥ cos θ, where u = y/∥y∥
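These properties are easy to verify numerically; a small NumPy check using the 2×2 matrices above:

```python
import numpy as np

A = np.array([[0, 1], [0, 0]])
B = np.array([[0, 0], [1, 0]])

# Matrix multiplication is not commutative: A B != B A
assert not np.array_equal(A @ B, B @ A)

# ...but the trace of a product is invariant under cyclic permutation
assert np.trace(A @ B) == np.trace(B @ A)

# Inner product and the cosine formula <x, y> = ||x|| ||y|| cos(theta)
x = np.array([3.0, 0.0])
y = np.array([1.0, 1.0])
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
assert np.isclose(x @ y, np.linalg.norm(x) * np.linalg.norm(y) * cos_theta)
```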
Common Matrix Representations: Adjacency Matrix
Undirected graph: A = A^T

Bipartite graph:
A = [ 0    B
      B^T  0 ]

k connected components (undirected): A is block diagonal,
A = diag(A_1, ..., A_k)
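The bipartite block structure can be assembled directly; a small sketch with a made-up biadjacency matrix B:

```python
import numpy as np

# B is the biadjacency matrix (e.g., 2 characters x 3 comics);
# the full adjacency stacks B and B^T around zero blocks.
B = np.array([[1, 0, 1],
              [0, 1, 1]])
n, m = B.shape
A = np.block([[np.zeros((n, n)), B],
              [B.T, np.zeros((m, m))]])

# The resulting adjacency matrix is symmetric (undirected graph)
assert np.array_equal(A, A.T)
print(A.shape)   # (5, 5)
```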
Orthogonal vectors (linearly independent)
} Consider vectors u_1, ..., u_k
} Normalized if u_i^T u_i = 1, also written ∥u_i∥ = 1, i = 1, ..., k
} Orthogonal if u_i^T u_j = 0, also written u_i ⊥ u_j, for i ≠ j
} Orthonormal if both
If U is orthonormal then U^{-1} = U^T, so U U^T = I,
where I is the identity matrix.
We will eventually use this to show a drawback of Principal Component Analysis (PCA)
Orthogonal Projections
Note that I = U U^T = Σ_{i=1}^k u_i u_i^T
Thus a = Ia = U (U^T a), for a ∈ R^k
The vector (U^T a) is the projection of a onto U
Thus, a = Σ_{i=1}^k (u_i^T a) u_i
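A minimal NumPy sketch of the projection identity a = U(U^T a), using an orthonormal basis of R^2 (the standard basis rotated by 45 degrees):

```python
import numpy as np

# An orthonormal basis of R^2 as columns of U
U = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2)

a = np.array([2.0, 3.0])

# U U^T = I, so expanding a in the basis and summing back loses nothing
assert np.allclose(U @ U.T, np.eye(2))
coords = U.T @ a              # projection coefficients u_i^T a
reconstructed = U @ coords    # sum of (u_i^T a) u_i
assert np.allclose(reconstructed, a)
```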
Can we have a non-orthogonal basis?
} Yes
} For instance:
Let U = [u_1, u_2] be an orthonormal basis and let
v_1 = 2u_1 + u_2
v_2 = u_1 + 2u_2
The vectors v_1 and v_2 are not orthogonal: v_1^T v_2 = 4, but they still form a basis for R^2
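A quick check of the example above, taking u_1, u_2 to be the standard basis of R^2:

```python
import numpy as np

v1 = np.array([2.0, 1.0])   # v1 = 2 u1 + u2
v2 = np.array([1.0, 2.0])   # v2 = u1 + 2 u2

# Not orthogonal...
assert v1 @ v2 == 4.0

# ...but linearly independent, hence a basis: any x in R^2 has
# unique coordinates c with [v1 v2] c = x
V = np.column_stack([v1, v2])
assert np.linalg.matrix_rank(V) == 2
x = np.array([5.0, 4.0])
c = np.linalg.solve(V, x)
assert np.allclose(V @ c, x)
```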
Eigenvalues and Eigenvectors
A > 0 is n×n and full rank; λ is an eigenvalue if
det(A − λI) = 0
and x is a (right) eigenvector if
Ax = λx,
where we assume ∥x∥ = 1.
Stacking the eigenvectors,
A [x_1, ..., x_n] = [x_1, ..., x_n] diag(λ_1, ..., λ_n),
where x_i is the i-th eigenvector of A, ordered s.t. λ_1 ≥ ... ≥ λ_n.
If A has n linearly independent eigenvectors,
A = V Λ V^{-1}
The inverse is A^{-1} = V Λ^{-1} V^{-1}, where Λ^{-1} = diag(1/λ_1, ..., 1/λ_n)
The square is A² = V Λ² V^{-1}
If A is symmetric positive semidefinite, then V^{-1} = V^T
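These identities can be verified with NumPy's eigendecomposition; a sketch on a small symmetric positive definite matrix:

```python
import numpy as np

# Symmetric positive definite, so V^{-1} = V^T
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

lam, V = np.linalg.eigh(A)   # eigh: for symmetric matrices, V is orthonormal

# A = V diag(lam) V^T
assert np.allclose(V @ np.diag(lam) @ V.T, A)
# A^{-1} = V diag(1/lam) V^T
assert np.allclose(V @ np.diag(1 / lam) @ V.T, np.linalg.inv(A))
# A^2 = V diag(lam^2) V^T
assert np.allclose(V @ np.diag(lam**2) @ V.T, A @ A)
```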
} X(i,j) = value of user i for property j
◦ X(Alice, cholesterol) = 10
◦ X(i,j) = number of times i buys j
◦ X(i,j) = how much i pays j
◦ X(i,j) = 1 if i and j are friends, 0 otherwise
◦ X(i,j) = temperature of sensor j at time i
Singular Value Decomposition (SVD)
© 2015 Bruno Ribeiro
[Figure: the data matrix X, highlighting entry X(i, j)]

SVD Dimensions
[Figure: X = U Σ V^T — the columns x_1, x_2, ..., x_m of the data matrix X; the left singular vectors u_1, u_2, ..., u_k (columns of U); the singular values σ_1, σ_2, ..., σ_k (diagonal of Σ); and the right singular vectors v_1, v_2, ..., v_k (rows of V^T)]
SVD Definition
} SVD gives the best rank-k approximation of X in the L2 and Frobenius norms
X = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + ...   (a sum of outer products)
} "Almost unique" decomposition
} There are two sources of ambiguity
◦ Orientation of singular vectors: flipping the sign of a left singular vector together with the corresponding right singular vector leaves X unchanged
◦ If I is the identity matrix: I = U I U^T for all orthonormal U
- "Hypersphere ambiguity"
- Related to the rotational ambiguity of PCA
SVD Properties (I)
X = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + ...
} Theorem (Eckart-Young, 1936)
◦ σ_1 u_1 v_1^T is the best rank-1 approximation of X, that is,
|X − σ_1 u_1 v_1^T|_2 ≤ |X − Y|_2 for every rank-1 matrix Y
◦ σ_1 u_1 v_1^T + σ_2 u_2 v_2^T is the best rank-2 approximation of X, that is,
|X − σ_1 u_1 v_1^T − σ_2 u_2 v_2^T|_2 ≤ |X − Y|_2 for every rank ≤ 2 matrix Y
◦ and so on for ranks 3, 4, ..., r

SVD Properties (II)

SVD Properties (III)
} U and V are orthonormal (orthogonal & unit norm)
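The Eckart-Young property can be checked numerically: for the truncated SVD, the spectral-norm error of the best rank-k approximation equals the (k+1)-th singular value. A sketch on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 5))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Best rank-2 approximation: keep the two largest singular values
k = 2
X2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
assert np.linalg.matrix_rank(X2) == k

# Eckart-Young: the spectral-norm error equals the (k+1)-th singular value
err = np.linalg.norm(X - X2, ord=2)
assert np.isclose(err, s[k])
```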
Singular Value Decomposition
} V* is the conjugate transpose of V; it is simply the transpose V^T if V is real-valued (always the case for us)
} SVD is significantly more generic:
◦ Applies to matrices of any shape, not just square matrices
◦ Applies to any matrix, not just invertible matrices
◦ The SVD factorization A = U Σ V^T is more general than the eigenvalue/eigenvector factorization A = V Λ V^{-1}, where Σ = diag(σ_1, ..., σ_n)
◦ A A^T = (U Σ V^T)(U Σ V^T)^T = (U Σ V^T)(V Σ^T U^T) = U Σ Σ^T U^T
} The Moore-Penrose pseudoinverse is A^+ = V Σ^+ U^T
} Columns of U are orthonormal
} Columns of V are orthonormal
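A small NumPy check of these identities on a rectangular (hence non-invertible) matrix:

```python
import numpy as np

# A rectangular matrix still has an SVD and a pseudoinverse
A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A A^T = U diag(s)^2 U^T
assert np.allclose(A @ A.T, U @ np.diag(s**2) @ U.T)

# Moore-Penrose pseudoinverse A+ = V diag(1/s) U^T (nonzero singular values)
A_pinv = Vt.T @ np.diag(1 / s) @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A))

# Columns of U and V are orthonormal: U^T U = V^T V = I
assert np.allclose(U.T @ U, np.eye(2))
assert np.allclose(Vt @ Vt.T, np.eye(2))
```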
Understanding SVD Singular Vectors
} As U and V have orthonormal columns
} Now you explain: what do V and U represent?
} If X(i,j) = user i buys product j
} What is X^T X ?
◦ A product-to-product similarity matrix
◦ What does V represent?
} What is X X^T ?
◦ A user-to-user similarity matrix
◦ What does U represent?
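A sketch with a made-up purchase matrix, also verifying that the singular vectors are eigenvectors of these similarity matrices:

```python
import numpy as np

# Hypothetical user-by-product matrix: X[i, j] = 1 if user i buys product j
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])

# X^T X: for each pair of products, how many users bought both
product_sim = X.T @ X
# X X^T: for each pair of users, how many products they bought in common
user_sim = X @ X.T

# The right singular vectors (rows of Vt) are eigenvectors of X^T X, and
# the left singular vectors (columns of U) are eigenvectors of X X^T,
# with eigenvalues equal to the squared singular values
U, s, Vt = np.linalg.svd(X)
assert np.allclose(X.T @ X, Vt.T @ np.diag(s**2) @ Vt)
assert np.allclose(X @ X.T, U @ np.diag(s**2) @ U.T)
```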
My "Help"
[Figure: the user-by-product data matrix X, highlighting entry X(i, j)]
Statistical Inference Review
Estimating characteristics from sampling
Data processing inequality: "No processing can increase the amount of statistical information already contained in the data"
[Figure: World → draw samples → sample → summary → estimated characteristic, with the data processing inequality constraining what the summary can recover]
} In data mining we often work with a sample of data from the population of interest
} Estimation techniques allow inferences about population properties from sample data
} If we had the population we could calculate the properties of interest
} Elementary units:
◦ Entities (e.g., persons, objects, events) that meet a set of specified criteria
◦ Example: All people who've purchased something at Walmart
} Population:
◦ Aggregate of elementary units (i.e., all items of interest)
} Sample:
◦ Sub-group of the population
◦ Serves as a reference group for estimating characteristics of the population and drawing conclusions
} Sampling is the main technique employed for data selection
◦ It is often used for both the preliminary investigation of the data and the final data analysis
} Reasons to sample
◦ Obtaining the entire set of data of interest is too expensive or time consuming
◦ Processing the entire set of data of interest is too expensive or time consuming
◦ Note: Even if you use an entire dataset for analysis, you should be aware of the sampling method that was used to gather the dataset
Tan, Steinbach, Kumar. Introduction to Data Mining, 2004.
} The key principle for effective sampling is the following:
◦ Using a sample will work almost as well as using the entire dataset, if the sample is representative
◦ A sample is representative if it has approximately the same property (of interest) as the original data
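A quick simulation of this principle (the population below is synthetic): a simple random sample tracks the population mean, while a biased sample does not:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic population: 100,000 skewed "purchase amounts"
population = rng.exponential(scale=50.0, size=100_000)

# A simple random sample is (usually) representative of the mean
sample = rng.choice(population, size=1_000, replace=False)
assert abs(sample.mean() - population.mean()) < 10.0

# A biased sample (here: only the 1,000 largest purchases) is not
biased = np.sort(population)[-1_000:]
assert biased.mean() > 2 * population.mean()
```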
} Infer properties of an unknown distribution with sample data generated from that distribution
} Parameter estimation
◦ Infer the value of a population parameter based on a sample statistic (e.g., estimate the mean)
} Hypothesis testing
◦ Infer the answer to a question about a population parameter based on a sample statistic (e.g., is the mean non-zero?)
} Maximum likelihood estimate (MLE)
} Maximum a posteriori probability estimate (MAP)
MLE: θ̂ = argmax_θ P[Data | θ], s.t. f(θ) = 0
MAP: θ̂ = argmax_θ P[Data | θ] P[θ], s.t. f(θ) = 0
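A minimal numerical sketch of both estimates for a Bernoulli parameter, using a grid search (the data and the Beta(2, 2) prior are made up for illustration):

```python
import numpy as np

# Made-up data: 10 coin flips, 7 heads
data = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
heads, n = data.sum(), len(data)

# Bernoulli log-likelihood over a grid of candidate parameters theta
theta = np.linspace(0.01, 0.99, 9801)
loglik = heads * np.log(theta) + (n - heads) * np.log(1 - theta)

# MLE: argmax of the likelihood; the closed form is the sample mean, 0.7
theta_mle = theta[np.argmax(loglik)]
assert abs(theta_mle - heads / n) < 1e-3

# MAP with a Beta(2, 2) prior: argmax of likelihood times prior
logprior = np.log(theta) + np.log(1 - theta)  # Beta(2, 2), up to a constant
theta_map = theta[np.argmax(loglik + logprior)]
assert abs(theta_map - (heads + 1) / (n + 2)) < 1e-3  # closed form: 8/12
```

Maximizing the log of the objective avoids floating-point underflow and leaves the argmax unchanged.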
import numpy as np
from matplotlib import use
# Avoid using the X terminal to create plots (needed if plotting on Scholar)
use("Agg")
import matplotlib.pyplot as plt
import scipy as sp
import math

p = 0.8
MAX_DEGREE = 101
ECCDF = 1.0
x = []
y = []
for d in range(MAX_DEGREE):  # Be careful with machine precision
    x.append(d)
    y.append(ECCDF)
    ECCDF = ECCDF - (1 - p) * p**d

plt.xlim([1, max(x)])
plt.xlabel("node degree", fontsize=18)
plt.ylabel("ECCDF", fontsize=18)
plt.loglog(x, y, "ro")
plt.savefig('ECCDF_plot.pdf')
# Read CSV into R
mydata = read.csv(file="input.csv", header=FALSE, sep=",")

# Generate a random matrix (elements exponentially distributed)
h = 100
k = 200
M = matrix(rexp(n=(h*k), rate=0.1), ncol=h, nrow=k)
print(max(M))
max_per_row = c()
for (i in 1:k) {
  max_per_row = c(max_per_row, max(M[i,]))
}
plot(max_per_row)
print(max(max_per_row)/min(max_per_row))
Scholar Cluster
} High Performance Computing Cluster
} Jobs must be submitted with qsub (do not use the main terminal to run tasks or you will be banished)
} But you can also use your own computer (though the Scholar learning curve pays off later)
Qsub Example File

#!/bin/bash -l
# FILENAME: myjob.sub
# Example submission file (myjob.sub)
# choose queue to use (e.g. standby or scholar-b)
#PBS -q standby
module load devel
module load gcc
module load anaconda/4.4.1-py35
module load r
cd $PBS_O_WORKDIR
unset DISPLAY
python matlibplot_example.py
Scholar Cluster Job Submission
} Submit the job: qsub myjob.sub
} Check its status: qstat -u <your Purdue username>
} After the job finishes (the submission directory has the output files):
◦ myjob.sub.o<jobID> (std output)
◦ myjob.sub.e<jobID> (error)
◦ ECCDF_plot.pdf (your plot)
Not ideal for someone learning Python and R
} Submission takes a while to run (a few minutes)
◦ How to iteratively debug code?
◦ qsub -I -q standby -l walltime=01:00:00
# asks for a 1-hour interactive terminal to debug problems in your code
# remember to load the modules you need
} Where to get help: https://www.rcac.purdue.edu/compute/scholar/guide/