
Page 1:

Learning From Data: Lecture 27

Learning Aides

Input Preprocessing

Dimensionality Reduction and Feature Selection
Principal Components Analysis (PCA)

Hints, Data Cleaning, Validation, . . .

M. Magdon-Ismail, CSCI 4100/6100

Page 2:

Learning Aides

Additional tools that can be applied to all techniques

Preprocess data to account for arbitrary choices during data collection (input normalization)

Remove irrelevant dimensions that can mislead learning (PCA)

Incorporate known properties of the target function (hints and invariances)

Remove detrimental data (deterministic and stochastic noise)

Better ways to validate (estimate Eout) for model selection


Page 3:

Nearest Neighbor

Mr. Good and Mr. Bad were both given credit cards by the Bank of Learning (BoL).

(Age in years, Income in $×1,000):   Mr. Good = (47, 35)   Mr. Bad = (22, 40)

Mr. Unknown, who has “coordinates” (21 yrs, $36K), applies for credit. Should the BoL give him credit, according to the nearest neighbor algorithm?

What if income is measured in dollars instead of “K” (thousands of dollars)?

[Figure: Mr. Good, Mr. Bad, and the query plotted with Age (yrs) on the horizontal axis and Income ($K) on the vertical axis]


Page 4:

Nearest Neighbor Uses Euclidean Distance

Mr. Good and Mr. Bad were both given credit cards by the Bank of Learning (BoL).

(Age in years, Income in $):   Mr. Good = (47, 35000)   Mr. Bad = (22, 40000)

Mr. Unknown, who has “coordinates” (21 yrs, $36000), applies for credit. Should the BoL give him credit, according to the nearest neighbor algorithm?

Income is now measured in dollars instead of “K”: the income axis dominates the Euclidean distance, and the nearest neighbor changes.

[Figure: the same points plotted with Age (yrs) and Income ($); on this scale the age differences are negligible]
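To see the flip concretely, here is a minimal sketch (Python/numpy; the +1/−1 labels for good/bad are our own encoding, not from the slides) of nearest neighbor under the two scalings:

```python
import numpy as np

# Training points: (age in years, income); +1 = give credit (Mr. Good), -1 = deny (Mr. Bad)
X_thousands = np.array([[47.0, 35.0], [22.0, 40.0]])        # income in $1,000s
X_dollars   = np.array([[47.0, 35000.0], [22.0, 40000.0]])  # income in dollars
y = np.array([+1, -1])

def nearest_neighbor_label(X, y, query):
    """Return the label of the training point nearest in Euclidean distance."""
    distances = np.linalg.norm(X - query, axis=1)
    return y[np.argmin(distances)]

print(nearest_neighbor_label(X_thousands, y, np.array([21.0, 36.0])))     # -1: Mr. Bad is nearest
print(nearest_neighbor_label(X_dollars,   y, np.array([21.0, 36000.0])))  # +1: Mr. Good is nearest
```

Nothing about Mr. Unknown changed; only the units did, yet the decision flips because Euclidean distance is dominated by whichever dimension has the larger scale.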


Page 5:

Uniform Treatment of Dimensions

Most learning algorithms treat each dimension equally

Nearest neighbor: d(x, x′) = ||x − x′||

Weight decay: Ω(w) = λ wᵀw

SVM: margin defined using Euclidean distance

RBF: bump function decays with Euclidean distance

Input Preprocessing

Unless you want to emphasize certain dimensions, the data should be preprocessed to present each dimension on an equal footing


Page 6:

Input Preprocessing is a Data Transform

X = [x_1ᵀ ; x_2ᵀ ; … ; x_Nᵀ]   (the data matrix, one row per input)

x_n ↦ z_n = Φ(x_n),   and the final hypothesis is g(x) = g̃(Φ(x)).

Raw x_n have (for example) arbitrary scalings in each dimension, and the z_n will not.


Page 7:

Centering

[Figure: raw data in (x1, x2) and the centered data in (z1, z2)]

centering: z_n = x_n − x̄
normalizing: z_n = D x_n, where D_ii = 1/σ_i
whitening: z_n = Σ^{−1/2} x_n, where Σ = (1/N) ∑_{n=1}^N x_n x_nᵀ = (1/N) XᵀX

resulting in z̄ = 0, each σ_i = 1, and (1/N) ZᵀZ = I, respectively.


Page 8:

Normalizing

[Figure: raw data (x1, x2) → centered (z1, z2) → normalized (z1, z2)]

centering: z_n = x_n − x̄
normalizing: z_n = D x_n, where D_ii = 1/σ_i
whitening: z_n = Σ^{−1/2} x_n, where Σ = (1/N) ∑_{n=1}^N x_n x_nᵀ = (1/N) XᵀX

resulting in z̄ = 0, each σ_i = 1, and (1/N) ZᵀZ = I, respectively.


Page 9:

Whitening

[Figure: raw data (x1, x2) → centered → normalized → whitened (z1, z2)]

centering: z_n = x_n − x̄
normalizing: z_n = D x_n, where D_ii = 1/σ_i
whitening: z_n = Σ^{−1/2} x_n, where Σ = (1/N) ∑_{n=1}^N x_n x_nᵀ = (1/N) XᵀX

resulting in z̄ = 0, each σ_i = 1, and (1/N) ZᵀZ = I, respectively.
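A minimal numpy sketch of all three transforms, with a small synthetic data set standing in for the raw x_n (the inverse square root of Σ is computed by eigendecomposition):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])  # raw N x d data

# Centering: z_n = x_n - xbar, so that zbar = 0.
Z = X - X.mean(axis=0)

# Normalizing: multiply the centered data by D with D_ii = 1/sigma_i, so each sigma_i = 1.
Z_norm = Z / Z.std(axis=0)

# Whitening: multiply by Sigma^{-1/2}, so that (1/N) Z^T Z = I.
Sigma = Z.T @ Z / len(Z)
lam, V = np.linalg.eigh(Sigma)                    # Sigma = V diag(lam) V^T
Z_white = Z @ (V @ np.diag(lam ** -0.5) @ V.T)    # apply Sigma^{-1/2}

print(np.round(Z_white.T @ Z_white / len(Z_white), 8))  # identity matrix
```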


Page 10:

Only Use Training Data For Preprocessing

WARNING! Transforming data into a more convenient format has a hidden trap which leads to data snooping.

When using a test set, determine the input transformation from training data only.

Rule: lock away the test data until you have your final hypothesis.

D splits into Dtrain and Dtest:

Dtrain → input preprocessing z = Φ(x) → g(x) = g̃(Φ(x))
Dtest → (locked away until the end) → Etest

[Figure: cumulative profit (%) vs. day (0 to 500) for a trading experiment; the “snooping” curve climbs to roughly 30%, while the “no snooping” curve stays near zero]
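A sketch of the rule in code; the split and the whitening transform are illustrative assumptions, and the point is only that Φ's parameters are computed from Dtrain and then frozen:

```python
import numpy as np

def fit_whitener(X_train):
    """Learn centering + whitening parameters from the TRAINING set only."""
    mean = X_train.mean(axis=0)
    Z = X_train - mean
    Sigma = Z.T @ Z / len(Z)
    lam, V = np.linalg.eigh(Sigma)
    W = V @ np.diag(lam ** -0.5) @ V.T
    return lambda X: (X - mean) @ W   # Phi, frozen before the test set is touched

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X_train, X_test = X[:150], X[150:]    # hypothetical train/test split

Phi = fit_whitener(X_train)           # no peek at X_test -- no snooping
Z_train, Z_test = Phi(X_train), Phi(X_test)
```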


Page 11:

Principal Components Analysis

[Figure: original data → rotated data]

Rotate the data so that it is easy to identify the dominant directions (information)

Throw away the smaller dimensions (noise)

[x1, x2]ᵀ → [z1, z2]ᵀ → [z1]


Page 12:

Projecting the Data to Maximize Variance

(Always center the data first)

z_n = x_nᵀ v

[Figure: original data with the direction v and the projections z_n onto it]

Find v to maximize the variance of z


Page 13:

Maximizing the Variance

var[z] = (1/N) ∑_{n=1}^N z_n²
       = (1/N) ∑_{n=1}^N vᵀ x_n x_nᵀ v
       = vᵀ ( (1/N) ∑_{n=1}^N x_n x_nᵀ ) v
       = vᵀ Σ v.

[Figure: original data with the candidate direction v]

Choose v as v1, the top eigenvector of Σ — the top principal component (PCA)
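A minimal sketch on synthetic centered data: the top eigenvector of Σ maximizes vᵀΣv, and the variance of the projections equals λ_1:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.3]])
X = X - X.mean(axis=0)          # always center the data first

Sigma = X.T @ X / len(X)
lam, V = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
v1 = V[:, -1]                   # top eigenvector: the top principal component

z = X @ v1                      # z_n = x_n^T v1
print(z.var(), lam[-1])         # var[z] = v1^T Sigma v1 = lambda_1 (the maximum)
```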


Page 14:

The Principal Components

z_1 = xᵀv_1
z_2 = xᵀv_2
z_3 = xᵀv_3
⋮

v_1, v_2, …, v_d are the eigenvectors of Σ, with eigenvalues λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_d.

Theorem [Eckart-Young]. These directions give the best reconstruction of the data; they also capture the maximum variance.

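A sketch of the resulting features and the rank-k reconstruction x̂ = V_k V_kᵀ x that the Eckart-Young theorem says is optimal in squared error (synthetic data; assumes X is already centered):

```python
import numpy as np

def pca_features(X, k):
    """Top-k PCA features Z = X V_k and the reconstruction X V_k V_k^T (X centered)."""
    lam, V = np.linalg.eigh(X.T @ X / len(X))
    Vk = V[:, ::-1][:, :k]          # eigenvectors sorted so lambda_1 >= ... >= lambda_k
    Z = X @ Vk                      # z_i = x^T v_i for i = 1..k
    return Z, Z @ Vk.T

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)
Z, X_hat = pca_features(X, k=2)
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))  # relative reconstruction error
```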


Page 15:

PCA Features for Digits Data

[Figures: (left) % reconstruction error versus the number of components k (0 to 200) for the digits data; (right) the data in the top two PCA features (z1, z2), separating “1” from “not 1”]

Principal components are automated, and they capture the dominant directions of the data.

But they may not capture the dominant dimensions for f: the directions are chosen without looking at the target.
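The reconstruction-error curve can be reproduced for any centered data matrix, since the squared error of the best k-dimensional reconstruction is the sum of the discarded eigenvalues (a sketch on synthetic data standing in for the digits):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 20)) @ rng.normal(size=(20, 20))
X = X - X.mean(axis=0)

lam = np.linalg.eigvalsh(X.T @ X / len(X))[::-1]    # lambda_1 >= ... >= lambda_d
pct_error = 100 * (1 - np.cumsum(lam) / lam.sum())  # % reconstruction error vs. k
print(np.round(pct_error, 1))                       # falls to 0 as k reaches d
```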


Page 16:

Other Learning Aides

1. Nonlinear dimension reduction: [Figure: data in (x1, x2) with nonlinear one-dimensional structure, and its reduced representation]

2. Hints (invariances and prior information): rotational invariance, monotonicity, symmetry, . . . (see the virtual-examples sketch after this list).

3. Removing noisy data: [Figures: data plotted in (intensity, symmetry) before and after removing noisy examples]

4. Advanced validation techniques: Rademacher and permutation penalties. More efficient than CV, more convenient and accurate than VC.
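For item 2, a standard way to use an invariance hint is through virtual examples: transformed copies of the data that the target function is known to treat identically. A sketch under the assumption of a rotationally invariant target on 2-D inputs (the function name and parameters are illustrative):

```python
import numpy as np

def add_rotation_virtual_examples(X, y, n_copies=4, seed=0):
    """Augment (X, y) with randomly rotated copies -- valid only if the
    target f is rotationally invariant, so the labels carry over unchanged."""
    rng = np.random.default_rng(seed)
    Xs, ys = [X], [y]
    for _ in range(n_copies):
        theta = rng.uniform(0.0, 2.0 * np.pi, size=len(X))
        c, s = np.cos(theta), np.sin(theta)
        R = np.stack([np.stack([c, -s], axis=-1),
                      np.stack([s, c], axis=-1)], axis=-2)  # one 2x2 rotation per point
        Xs.append(np.einsum('nij,nj->ni', R, X))
        ys.append(y)
    return np.concatenate(Xs), np.concatenate(ys)
```

The enlarged data set steers any learning algorithm toward hypotheses that respect the invariance, without changing the algorithm itself.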
