
Page 1:

Learning From Data: Lecture 27

Learning Aides

Input Preprocessing

Dimensionality Reduction and Feature Selection
Principal Components Analysis (PCA)

Hints, Data Cleaning, Validation, . . .

M. Magdon-Ismail, CSCI 4100/6100

Page 2:

Learning Aides

Additional tools that can be applied to all techniques

Preprocess data to account for arbitrary choices during data collection (input normalization)

Remove irrelevant dimensions that can mislead learning (PCA)

Incorporate known properties of the target function (hints and invariances)

Remove detrimental data (deterministic and stochastic noise)

Better ways to validate (estimate Eout) for model selection


Page 3:

Nearest Neighbor

Mr. Good and Mr. Bad were both given credit cards by the Bank of Learning (BoL).

(Age in years, Income in $×1,000):   Mr. Good = (47, 35)   Mr. Bad = (22, 40)

Mr. Unknown, who has “coordinates” (21 yrs, $36K), applies for credit. Should the BoL give him credit, according to the nearest neighbor algorithm?

What if income is measured in dollars instead of “K” (thousands of dollars)?

[Figure: Mr. Good, Mr. Bad, and the query plotted with Age (yrs) on the horizontal axis and Income ($K) on the vertical axis]


Page 4:

Nearest Neighbor Uses Euclidean Distance

Mr. Good and Mr. Bad were both given credit cards by the Bank of Learning (BoL).

(Age in years, Income in $):   Mr. Good = (47, 35000)   Mr. Bad = (22, 40000)

Mr. Unknown, who has “coordinates” (21 yrs, $36000), applies for credit. Should the BoL give him credit, according to the nearest neighbor algorithm?

Income is now measured in dollars instead of “K”: the income axis dominates the Euclidean distance, and the nearest neighbor changes.

[Figure: the same points plotted with Age (yrs) and Income ($); on this scale the age differences are negligible]
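To see the flip concretely, here is a minimal sketch (Python/numpy; the +1/−1 labels for good/bad are our own encoding, not from the slides) of nearest neighbor under the two scalings:

```python
import numpy as np

# Training points: (age in years, income); +1 = give credit (Mr. Good), -1 = deny (Mr. Bad)
X_thousands = np.array([[47.0, 35.0], [22.0, 40.0]])        # income in $1,000s
X_dollars   = np.array([[47.0, 35000.0], [22.0, 40000.0]])  # income in dollars
y = np.array([+1, -1])

def nearest_neighbor_label(X, y, query):
    """Return the label of the training point nearest in Euclidean distance."""
    distances = np.linalg.norm(X - query, axis=1)
    return y[np.argmin(distances)]

print(nearest_neighbor_label(X_thousands, y, np.array([21.0, 36.0])))     # -1: Mr. Bad is nearest
print(nearest_neighbor_label(X_dollars,   y, np.array([21.0, 36000.0])))  # +1: Mr. Good is nearest
```

Nothing about Mr. Unknown changed; only the units did, yet the decision flips because Euclidean distance is dominated by whichever dimension has the larger scale.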


Page 5:

Uniform Treatment of Dimensions

Most learning algorithms treat each dimension equally

Nearest neighbor: d(x, x′) = ||x − x′||

Weight decay: Ω(w) = λ wᵀw

SVM: margin defined using Euclidean distance

RBF: bump function decays with Euclidean distance

Input Preprocessing

Unless you want to emphasize certain dimensions, the data should be preprocessed to present each dimension on an equal footing


Page 6:

Input Preprocessing is a Data Transform

X = [x_1ᵀ ; x_2ᵀ ; … ; x_Nᵀ]   (the data matrix, one row per input)

x_n ↦ z_n = Φ(x_n),   and the final hypothesis is g(x) = g̃(Φ(x)).

Raw x_n have (for example) arbitrary scalings in each dimension, and the z_n will not.


Page 7:

Centering

[Figure: raw data in (x1, x2) and the centered data in (z1, z2)]

centering: z_n = x_n − x̄
normalizing: z_n = D x_n, where D_ii = 1/σ_i
whitening: z_n = Σ^{−1/2} x_n, where Σ = (1/N) ∑_{n=1}^N x_n x_nᵀ = (1/N) XᵀX

resulting in z̄ = 0, each σ_i = 1, and (1/N) ZᵀZ = I, respectively.


Page 8:

Normalizing

[Figure: raw data (x1, x2) → centered (z1, z2) → normalized (z1, z2)]

centering: z_n = x_n − x̄
normalizing: z_n = D x_n, where D_ii = 1/σ_i
whitening: z_n = Σ^{−1/2} x_n, where Σ = (1/N) ∑_{n=1}^N x_n x_nᵀ = (1/N) XᵀX

resulting in z̄ = 0, each σ_i = 1, and (1/N) ZᵀZ = I, respectively.


Page 9:

Whitening

[Figure: raw data (x1, x2) → centered → normalized → whitened (z1, z2)]

centering: z_n = x_n − x̄
normalizing: z_n = D x_n, where D_ii = 1/σ_i
whitening: z_n = Σ^{−1/2} x_n, where Σ = (1/N) ∑_{n=1}^N x_n x_nᵀ = (1/N) XᵀX

resulting in z̄ = 0, each σ_i = 1, and (1/N) ZᵀZ = I, respectively.
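A minimal numpy sketch of all three transforms, with a small synthetic data set standing in for the raw x_n (the inverse square root of Σ is computed by eigendecomposition):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])  # raw N x d data

# Centering: z_n = x_n - xbar, so that zbar = 0.
Z = X - X.mean(axis=0)

# Normalizing: multiply the centered data by D with D_ii = 1/sigma_i, so each sigma_i = 1.
Z_norm = Z / Z.std(axis=0)

# Whitening: multiply by Sigma^{-1/2}, so that (1/N) Z^T Z = I.
Sigma = Z.T @ Z / len(Z)
lam, V = np.linalg.eigh(Sigma)                    # Sigma = V diag(lam) V^T
Z_white = Z @ (V @ np.diag(lam ** -0.5) @ V.T)    # apply Sigma^{-1/2}

print(np.round(Z_white.T @ Z_white / len(Z_white), 8))  # identity matrix
```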


Page 10:

Only Use Training Data For Preprocessing

WARNING! Transforming data into a more convenient format has a hidden trap which leads to data snooping.

When using a test set, determine the input transformation from training data only.

Rule: lock away the test data until you have your final hypothesis.

D splits into Dtrain and Dtest:

Dtrain → input preprocessing z = Φ(x) → g(x) = g̃(Φ(x))
Dtest → (locked away until the end) → Etest

[Figure: cumulative profit (%) vs. day (0 to 500) for a trading experiment; the “snooping” curve climbs to roughly 30%, while the “no snooping” curve stays near zero]
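A sketch of the rule in code; the split and the whitening transform are illustrative assumptions, and the point is only that Φ's parameters are computed from Dtrain and then frozen:

```python
import numpy as np

def fit_whitener(X_train):
    """Learn centering + whitening parameters from the TRAINING set only."""
    mean = X_train.mean(axis=0)
    Z = X_train - mean
    Sigma = Z.T @ Z / len(Z)
    lam, V = np.linalg.eigh(Sigma)
    W = V @ np.diag(lam ** -0.5) @ V.T
    return lambda X: (X - mean) @ W   # Phi, frozen before the test set is touched

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X_train, X_test = X[:150], X[150:]    # hypothetical train/test split

Phi = fit_whitener(X_train)           # no peek at X_test -- no snooping
Z_train, Z_test = Phi(X_train), Phi(X_test)
```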


Page 11:

Principal Components Analysis

[Figure: original data → rotated data]

Rotate the data so that it is easy to identify the dominant directions (information)

Throw away the smaller dimensions (noise)

[x1, x2]ᵀ → [z1, z2]ᵀ → [z1]


Page 12:

Projecting the Data to Maximize Variance

(Always center the data first)

z_n = x_nᵀ v

[Figure: original data with the direction v and the projections z_n onto it]

Find v to maximize the variance of z


Page 13:

Maximizing the Variance

var[z] = (1/N) ∑_{n=1}^N z_n²
       = (1/N) ∑_{n=1}^N vᵀ x_n x_nᵀ v
       = vᵀ ( (1/N) ∑_{n=1}^N x_n x_nᵀ ) v
       = vᵀ Σ v.

[Figure: original data with the candidate direction v]

Choose v as v1, the top eigenvector of Σ — the top principal component (PCA)
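A minimal sketch on synthetic centered data: the top eigenvector of Σ maximizes vᵀΣv, and the variance of the projections equals λ_1:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.3]])
X = X - X.mean(axis=0)          # always center the data first

Sigma = X.T @ X / len(X)
lam, V = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
v1 = V[:, -1]                   # top eigenvector: the top principal component

z = X @ v1                      # z_n = x_n^T v1
print(z.var(), lam[-1])         # var[z] = v1^T Sigma v1 = lambda_1 (the maximum)
```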


Page 14:

The Principal Components

z_1 = xᵀv_1
z_2 = xᵀv_2
z_3 = xᵀv_3
⋮

v_1, v_2, …, v_d are the eigenvectors of Σ, with eigenvalues λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_d.

Theorem [Eckart-Young]. These directions give the best reconstruction of the data; they also capture the maximum variance.

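A sketch of the resulting features and the rank-k reconstruction x̂ = V_k V_kᵀ x that the Eckart-Young theorem says is optimal in squared error (synthetic data; assumes X is already centered):

```python
import numpy as np

def pca_features(X, k):
    """Top-k PCA features Z = X V_k and the reconstruction X V_k V_k^T (X centered)."""
    lam, V = np.linalg.eigh(X.T @ X / len(X))
    Vk = V[:, ::-1][:, :k]          # eigenvectors sorted so lambda_1 >= ... >= lambda_k
    Z = X @ Vk                      # z_i = x^T v_i for i = 1..k
    return Z, Z @ Vk.T

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)
Z, X_hat = pca_features(X, k=2)
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))  # relative reconstruction error
```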


Page 15:

PCA Features for Digits Data

[Figures: (left) % reconstruction error versus the number of components k (0 to 200) for the digits data; (right) the data in the top two PCA features (z1, z2), separating “1” from “not 1”]

Principal components are automated, and they capture the dominant directions of the data.

But they may not capture the dominant dimensions for f: the directions are chosen without looking at the target.
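The reconstruction-error curve can be reproduced for any centered data matrix, since the squared error of the best k-dimensional reconstruction is the sum of the discarded eigenvalues (a sketch on synthetic data standing in for the digits):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 20)) @ rng.normal(size=(20, 20))
X = X - X.mean(axis=0)

lam = np.linalg.eigvalsh(X.T @ X / len(X))[::-1]    # lambda_1 >= ... >= lambda_d
pct_error = 100 * (1 - np.cumsum(lam) / lam.sum())  # % reconstruction error vs. k
print(np.round(pct_error, 1))                       # falls to 0 as k reaches d
```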


Page 16:

Other Learning Aides

1. Nonlinear dimension reduction: [Figure: data in (x1, x2) with nonlinear one-dimensional structure, and its reduced representation]

2. Hints (invariances and prior information): rotational invariance, monotonicity, symmetry, . . . (see the virtual-examples sketch after this list).

3. Removing noisy data: [Figures: data plotted in (intensity, symmetry) before and after removing noisy examples]

4. Advanced validation techniques: Rademacher and permutation penalties. More efficient than CV, more convenient and accurate than VC.
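For item 2, a standard way to use an invariance hint is through virtual examples: transformed copies of the data that the target function is known to treat identically. A sketch under the assumption of a rotationally invariant target on 2-D inputs (the function name and parameters are illustrative):

```python
import numpy as np

def add_rotation_virtual_examples(X, y, n_copies=4, seed=0):
    """Augment (X, y) with randomly rotated copies -- valid only if the
    target f is rotationally invariant, so the labels carry over unchanged."""
    rng = np.random.default_rng(seed)
    Xs, ys = [X], [y]
    for _ in range(n_copies):
        theta = rng.uniform(0.0, 2.0 * np.pi, size=len(X))
        c, s = np.cos(theta), np.sin(theta)
        R = np.stack([np.stack([c, -s], axis=-1),
                      np.stack([s, c], axis=-1)], axis=-2)  # one 2x2 rotation per point
        Xs.append(np.einsum('nij,nj->ni', R, X))
        ys.append(y)
    return np.concatenate(Xs), np.concatenate(ys)
```

The enlarged data set steers any learning algorithm toward hypotheses that respect the invariance, without changing the algorithm itself.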
