TRANSCRIPT
Learning From Data: Lecture 27
Learning Aides
Input Preprocessing
Dimensionality Reduction and Feature Selection: Principal Components Analysis (PCA)
Hints, Data Cleaning, Validation, …
M. Magdon-Ismail, CSCI 4100/6100
Learning Aides
Additional tools that can be applied to all techniques
Preprocess data to account for arbitrary choices during data collection (input normalization)
Remove irrelevant dimensions that can mislead learning (PCA)
Incorporate known properties of the target function (hints and invariances)
Remove detrimental data (deterministic and stochastic noise)
Better ways to validate (estimate Eout) for model selection
Nearest Neighbor
Mr. Good and Mr. Bad were both given credit cards by the Bank of Learning (BoL).
(Age in years, Income in $ × 1,000): Mr. Good = (47, 35), Mr. Bad = (22, 40)
Mr. Unknown, who has “coordinates” (21 yrs, $36K), applies for credit. Should the BoL give him credit, according to the nearest neighbor algorithm?
What if income is measured in dollars instead of “K” (thousands of dollars)?
[Figure: the three applicants plotted as Income ($K) versus Age (yrs)]
Nearest Neighbor Uses Euclidean Distance
Mr. Good and Mr. Bad were both given credit cards by the Bank of Learning (BoL).
(Age in years, Income in $): Mr. Good = (47, 35000), Mr. Bad = (22, 40000)
Mr. Unknown, who has “coordinates” (21 yrs, $36000), applies for credit. Should the BoL give him credit, according to the nearest neighbor algorithm?
With income measured in dollars, the income axis dominates the Euclidean distance, and Mr. Unknown's nearest neighbor flips from Mr. Bad to Mr. Good.
[Figure: the three applicants plotted as Income ($) versus Age (yrs)]
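The scale dependence is easy to check numerically. A minimal sketch (assuming NumPy; the `nearest` helper is illustrative, not from the lecture):

```python
import numpy as np

def nearest(query, points, labels):
    """Label of the Euclidean nearest neighbor of `query`."""
    dists = np.linalg.norm(points - query, axis=1)
    return labels[np.argmin(dists)]

labels = ["Mr. Good", "Mr. Bad"]

# Income in thousands: age differences dominate the distance.
data_k = np.array([[47.0, 35.0], [22.0, 40.0]])
print(nearest(np.array([21.0, 36.0]), data_k, labels))       # -> Mr. Bad

# Income in dollars: income differences dominate instead.
data_usd = np.array([[47.0, 35000.0], [22.0, 40000.0]])
print(nearest(np.array([21.0, 36000.0]), data_usd, labels))  # -> Mr. Good
```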
Uniform Treatment of Dimensions
Most learning algorithms treat each dimension equally
Nearest neighbor: d(x, x′) = ‖x − x′‖
Weight decay: Ω(w) = λ wᵀw
SVM: margin defined using Euclidean distance
RBF: bump function decays with Euclidean distance
Input Preprocessing
Unless you want to emphasize certain dimensions, the data should be preprocessed to present each dimension on an equal footing.
Input Preprocessing is a Data Transform
X = [x_1ᵀ; x_2ᵀ; … ; x_Nᵀ],   x_n ↦ z_n = Φ(x_n),   g(x) = g̃(Φ(x)).
Raw xn have (for example) arbitrary scalings in each dimension, and zn will not.
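As a sketch of this pipeline (NumPy assumed; `fit_center` and the commented names are illustrative, not from the lecture), the transform Φ is determined from the data and then composed with the hypothesis learned in z-space:

```python
import numpy as np

def fit_center(X):
    """Determine Φ (here, centering) from the data; return the transform."""
    xbar = X.mean(axis=0)
    return lambda x: x - xbar

X = np.array([[47.0, 35.0], [22.0, 40.0], [30.0, 20.0]])
phi = fit_center(X)              # Φ determined from the data
Z = phi(X)                       # z_n = Φ(x_n); learning happens in z-space
# g_tilde = learn(Z, y)          # any learning algorithm (hypothetical)
# g = lambda x: g_tilde(phi(x))  # the final hypothesis g(x) = g̃(Φ(x))
```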
Centering
[Figure: raw data (x1, x2) and centered data (z1, z2)]
Centering: z_n = x_n − x̄, where x̄ = (1/N) ∑_{n=1}^{N} x_n. After centering, z̄ = 0.
Normalizing
[Figure: raw, centered, and normalized data]
Normalizing (after centering): z_n = D x_n, where D is diagonal with D_ii = 1/σ_i. After normalizing, each coordinate has σ_i = 1.
Whitening
[Figure: raw, centered, normalized, and whitened data]
Whitening (after centering): z_n = Σ^{−1/2} x_n, where Σ = (1/N) ∑_{n=1}^{N} x_n x_nᵀ = (1/N) XᵀX. After whitening, (1/N) ZᵀZ = I.
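A minimal NumPy sketch of the three transforms on synthetic data (not the lecture's code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])

# Centering: z_n = x_n - x̄, so the sample mean becomes 0.
Z = X - X.mean(axis=0)

# Normalizing: scale the centered data by D_ii = 1/σ_i, so each σ_i = 1.
Z_norm = Z / Z.std(axis=0)

# Whitening: multiply the centered data by Σ^{-1/2}, so (1/N) ZᵀZ = I.
Sigma = Z.T @ Z / len(Z)
evals, V = np.linalg.eigh(Sigma)                  # Σ = V Λ Vᵀ
Z_white = Z @ (V @ np.diag(evals ** -0.5) @ V.T)  # symmetric Σ^{-1/2}
print(np.round(Z_white.T @ Z_white / len(Z_white), 6))  # ≈ identity
```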
Only Use Training Data For Preprocessing
WARNING! Transforming data into a more convenient format has a hidden trap which leads to data snooping.
When using a test set, determine the input transformation from training data only.
Rule: lock away the test data until you have your final hypothesis.
[Diagram: D is split into Dtrain and Dtest. The input preprocessing z = Φ(x) is determined from Dtrain alone, giving g(x) = g̃(Φ(x)); Dtest is used only to compute Etest.]
[Figure: cumulative profit (%) versus day, comparing “no snooping” and “snooping”]
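A minimal sketch of the rule in code (NumPy assumed; the split and data are synthetic): the transform's parameters come from the training set only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_train, X_test = X[:150], X[150:]

# Parameters of Φ are estimated from the training set alone ...
mean, std = X_train.mean(axis=0), X_train.std(axis=0)

# ... and then applied, unchanged, to both sets.
Z_train = (X_train - mean) / std
Z_test = (X_test - mean) / std   # no test-set statistics are used
```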
Principal Components Analysis
[Figure: original data → rotated data]
Rotate the data so that it is easy to identify the dominant directions (information) and throw away the smaller dimensions (noise).
[x1, x2]ᵀ → [z1, z2]ᵀ → [z1]
Projecting the Data to Maximize Variance
(Always center the data first)
z_n = x_nᵀ v
[Figure: original data with the projection direction v]
Find v to maximize the variance of z
Maximizing the Variance
var[z] = (1/N) ∑_{n=1}^{N} z_n²
       = (1/N) ∑_{n=1}^{N} vᵀ x_n x_nᵀ v
       = vᵀ ( (1/N) ∑_{n=1}^{N} x_n x_nᵀ ) v
       = vᵀ Σ v.
Choose v as v1, the top eigenvector of Σ — the top principal component (PCA)
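A minimal sketch (NumPy assumed; synthetic data): since var[z] = vᵀΣv, the maximizing unit vector is the top eigenvector of Σ.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.3]])
X = X - X.mean(axis=0)               # always center the data first

Sigma = X.T @ X / len(X)
evals, evecs = np.linalg.eigh(Sigma) # eigenvalues in ascending order
v1 = evecs[:, -1]                    # top principal component

z = X @ v1                           # z_n = x_nᵀ v1
print(z.var(), evals[-1])            # var[z] matches λ1
```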
The Principal Components
z_1 = xᵀv_1,  z_2 = xᵀv_2,  z_3 = xᵀv_3,  …
where v_1, v_2, …, v_d are the eigenvectors of Σ with eigenvalues λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_d.
Theorem [Eckart–Young]. These directions give the best reconstruction of the data; they also capture the maximum variance.
PCA Features for Digits Data
[Figures: % reconstruction error versus the number of components k for the digits data; the top two PCA features (z1, z2), separating “1” from “not 1”]
Principal components are automated: they capture the dominant directions of the data.
They may not capture the dominant dimensions for f.
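A minimal sketch of the reconstruction-error curve (NumPy; random synthetic data stands in for the digits):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 20)) @ rng.normal(size=(20, 20))
X = X - X.mean(axis=0)

Sigma = X.T @ X / len(X)
evals, evecs = np.linalg.eigh(Sigma)
V = evecs[:, ::-1]              # columns ordered by decreasing eigenvalue

for k in (1, 5, 10, 20):
    Vk = V[:, :k]
    X_hat = X @ Vk @ Vk.T       # project onto top-k components, then back
    err = np.linalg.norm(X - X_hat) ** 2 / np.linalg.norm(X) ** 2
    print(f"k={k:2d}  reconstruction error = {100 * err:.1f}%")
```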
Other Learning Aides
1. Nonlinear dimension reduction:
[Figures: nonlinear dimension reduction illustrated in the (x1, x2) plane]
2. Hints (invariances and prior information): rotational invariance, monotonicity, symmetry, …
3. Removing noisy data:
[Figures: digits data plotted as symmetry versus intensity]
4. Advanced validation techniques: Rademacher and permutation penalties. More efficient than CV; more convenient and accurate than VC.
© AML Creator: Malik Magdon-Ismail