TRANSCRIPT
UVA CS 4501: Machine Learning
Lecture 10: K-nearest-neighbor Classifier / Bias-Variance Tradeoff
Dr. Yanjun Qi
University of Virginia, Department of Computer Science
10/6/18
Where are we? Five major sections of this course:
- Regression (supervised)
- Classification (supervised)
- Unsupervised models
- Learning theory
- Graphical models
Three major sections for classification

We can divide the large variety of classification approaches into roughly three major types:
1. Discriminative: directly estimate a decision rule/boundary, e.g., support vector machine, decision tree, logistic regression, neural networks (NN), deep NN
2. Generative: build a generative statistical model, e.g., Bayesian networks, Naive Bayes classifier
3. Instance-based classifiers: use the observations directly (no model), e.g., k-nearest neighbors
Today:
- K-nearest neighbor
- Model Selection / Bias-Variance Tradeoff
  - Bias-Variance tradeoff
  - High bias? High variance? How to respond?
Nearest neighbor classifiers

Require three inputs:
1. The set of stored training samples
2. A distance metric to compute the distance between samples
3. The value of k, i.e., the number of nearest neighbors to retrieve
Nearest neighbor classifiers

To classify an unknown sample:
1. Compute its distance to the training records
2. Identify the k nearest neighbors
3. Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
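The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation; the names `euclidean` and `knn_classify` are just illustrative.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Step 1: distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label) pairs."""
    # Step 2: identify the k nearest neighbors
    neighbors = sorted(train, key=lambda s: euclidean(s[0], query))[:k]
    # Step 3: majority vote over the neighbors' class labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.0), "B"), ((4.2, 3.9), "B")]
print(knn_classify(train, (1.1, 1.0), k=3))  # query near the "A" cluster -> "A"
```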
Definition of nearest neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]

The k nearest neighbors of a sample x are the data points that have the k smallest distances to x.
1-nearest neighbor for classification

Voronoi diagram: a partitioning of the plane into regions based on distance to the points in a specific subset of the plane.
Model Selection for Nearest neighbor classification

Choosing the value of k:
- If k is too small, the classifier is sensitive to noise points
- If k is too large, the neighborhood may include (many) points from other classes
Challenge in Nearest neighbor classification

Scaling issues: attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes. Example:
- height of a person may vary from 1.5 m to 1.8 m
- weight of a person may vary from 90 lb to 300 lb
- income of a person may vary from $10K to $1M
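One common fix for the scaling issue is min-max rescaling of each attribute to [0, 1] before computing distances. A small sketch (the function name and the toy numbers are illustrative, following the height/income example above):

```python
def min_max_scale(columns):
    """Rescale each attribute (column) to [0, 1] so that no single
    attribute dominates the distance computation."""
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        scaled.append([(v - lo) / (hi - lo) for v in col])
    return scaled

height = [1.5, 1.65, 1.8]                # metres: tiny numeric range
income = [10_000, 500_000, 1_000_000]    # dollars: huge numeric range
h_scaled, i_scaled = min_max_scale([height, income])
print(h_scaled)  # both attributes now live on the same [0, 1] scale
print(i_scaled)
```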
Nearest neighbor classification

- The k-nearest neighbor classifier is a lazy learner: it does not build a model explicitly, and classifying unknown samples is relatively expensive.
- The k-nearest neighbor classifier is a local model, vs. a global model like linear classifiers.
K-Nearest Neighbor for Regression?

- For KNN, the decision is delayed till a new instance arrives
- The target variable may be discrete or real-valued
- When the target is continuous, the naive prediction is the mean value of the k nearest training examples
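The naive KNN regression prediction described above (mean of the k nearest targets) can be sketched for 1D input as follows; `knn_regress` and the toy data are illustrative:

```python
def knn_regress(train, query_x, k=5):
    """train: list of (x, y) pairs with scalar x. Prediction is the
    mean y of the k nearest training examples (the naive estimate)."""
    neighbors = sorted(train, key=lambda s: abs(s[0] - query_x))[:k]
    return sum(y for _, y in neighbors) / len(neighbors)

# toy 1D data roughly following y = x, plus one far-away point
train = [(0.0, 0.1), (1.0, 1.2), (2.0, 1.9),
         (3.0, 3.1), (4.0, 3.8), (10.0, 10.0)]
print(knn_regress(train, 2.5, k=3))  # average of the 3 nearest targets
```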
[Figures: K=5-nearest neighbor (1D input) for regression; nearest neighbor (1D input) for regression; when K = 2]
Nearest neighbor Variations

- Compute the distance between two points, for instance the Euclidean distance:
  d(x, y) = sqrt( sum_i (x_i - y_i)^2 )
- Options for determining the class from the nearest-neighbor list:
  - Take a majority vote of the class labels among the k nearest neighbors
  - Weight the votes according to distance, e.g., weight factor w = 1 / d^2
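Distance-weighted voting with w = 1/d^2, as mentioned above, can be sketched like this (function name and toy data are illustrative; the small epsilon guarding against division by zero is my addition, not from the slides):

```python
def weighted_knn_classify(train, query, k=3):
    """Vote with weight w = 1 / d**2 so that closer neighbors count more."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = sorted(train, key=lambda s: dist(s[0], query))[:k]
    scores = {}
    for x, label in neighbors:
        d = dist(x, query)
        w = 1.0 / (d * d + 1e-12)  # epsilon guards against d == 0
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

# One very close "B" outweighs two distant "A" votes:
train = [((0.0, 0.0), "B"), ((3.0, 0.0), "A"), ((3.1, 0.0), "A")]
print(weighted_knn_classify(train, (0.1, 0.0), k=3))  # -> "B"
```

An unweighted majority vote over the same three neighbors would have returned "A"; the inverse-square weighting is what flips the decision.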
Variants: Distance-Weighted k-Nearest Neighbor Algorithm

- Assign weights to the neighbors based on their "distance" from the query point; the weight "may" be the inverse square of the distance
- Extreme option: all training points may influence a particular instance, e.g., Shepard's method / Modified Shepard (from geospatial analysis)
Instance-based Regression vs. Linear Regression

- Linear regression learning: an explicit description of the target function on the whole training set
- Instance-based learning: learning = storing all training instances; referred to as "lazy" learning

Computational time cost
K-Nearest Neighbor (summary):
- Task: regression / classification
- Representation: local smoothness
- Score function: NA
- Search/Optimization: NA
- Models, Parameters: the training samples
Decision boundaries in global vs. local models

- Linear classification: global, stable, can be inaccurate
- 15-nearest neighbor and 1-nearest neighbor: local, accurate, unstable
- What ultimately matters: GENERALIZATION
- K = 15 vs. K = 1: K acts as a smoother
e.g. Training Error from KNN, Lesson Learned

- When k = 1: no misclassifications on the training set, i.e., overfit
- Minimizing training error is not always good (e.g., 1-NN)

[Figure: 1-nearest-neighbor averaging over (X1, X2); the decision boundary Y-hat = 0.5 separates the regions Y-hat < 0.5 and Y-hat > 0.5]
Review: Mean and Variance of a Random Variable (RV)

(X: random variables are written in capital letters)

Mean (Expectation): mu = E(X)
- Discrete RVs: E(X) = sum_{v_i} v_i * P(X = v_i)
- Continuous RVs: E(X) = integral from -inf to +inf of x * p(x) dx

Adapted from Carlos' probability tutorial
For a function g of the RV:
- Discrete RVs: E(g(X)) = sum_{v_i} g(v_i) * P(X = v_i)
- Continuous RVs: E(g(X)) = integral from -inf to +inf of g(x) * p(x) dx
Variance: Var(X) = E((X - mu)^2)
- Discrete RVs: V(X) = sum_{v_i} (v_i - mu)^2 * P(X = v_i)
- Continuous RVs: V(X) = integral from -inf to +inf of (x - mu)^2 * p(x) dx
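As a quick numerical check of these discrete-RV formulas (a sketch I added, not from the slides), consider a fair six-sided die:

```python
# Discrete RV: a fair six-sided die, P(X = v) = 1/6 for v in 1..6
values = [1, 2, 3, 4, 5, 6]
p = [1 / 6] * 6

mean = sum(v * pv for v, pv in zip(values, p))                 # E(X)
e_g  = sum(v ** 2 * pv for v, pv in zip(values, p))            # E(g(X)), g(x) = x^2
var  = sum((v - mean) ** 2 * pv for v, pv in zip(values, p))   # V(X)

print(mean)             # 3.5
print(var)              # 35/12, about 2.9167
print(e_g - mean ** 2)  # same value: Var(X) = E(X^2) - E(X)^2
```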
BIAS AND VARIANCE TRADE-OFF

- Bias: measures the accuracy or quality of the estimator; low bias implies that, on average, we accurately estimate the true parameter from the training data
- Variance: measures the precision or specificity of the estimator; low variance implies that the estimator does not change much as the training set varies
BIAS AND VARIANCE TRADE-OFF for the Mean Squared Error of parameter estimation

MSE(theta-hat) = E[(theta-hat - theta)^2] = Bias(theta-hat)^2 + Var(theta-hat)

- Bias^2: error due to incorrect assumptions
- Variance: error due to the variance of the training samples
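The decomposition MSE = bias^2 + variance can be verified empirically by repeatedly estimating a known parameter from fresh training sets. A small simulation I added (the sample-mean estimator of a known true mean; all names and numbers are illustrative):

```python
import random

random.seed(0)
TRUE_MU, N_SETS, N_SAMPLES = 2.0, 5000, 10

estimates = []
for _ in range(N_SETS):
    # draw an independent training set and apply the sample-mean estimator
    sample = [random.gauss(TRUE_MU, 1.0) for _ in range(N_SAMPLES)]
    estimates.append(sum(sample) / N_SAMPLES)

avg = sum(estimates) / N_SETS
bias_sq  = (avg - TRUE_MU) ** 2
variance = sum((e - avg) ** 2 for e in estimates) / N_SETS
mse      = sum((e - TRUE_MU) ** 2 for e in estimates) / N_SETS

# The empirical MSE decomposes exactly into bias^2 + variance
print(bias_sq, variance, mse)
assert abs(mse - (bias_sq + variance)) < 1e-9
```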
Bias-Variance Tradeoff / Model Selection

[Figure: error vs. model complexity, showing an underfit region and an overfit region]
(1) Training vs. Test Error

[Figure: expected test error and expected training error vs. model complexity]

- Training error can always be reduced by increasing model complexity
- The expected test error and the CV error are good approximations of generalization (see the extra slides about EPE)
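The claim that training error can always be driven down by increasing model complexity can be demonstrated directly: on n points, a degree-(n-1) interpolating polynomial attains zero training error, while a low-complexity model (here a constant) does not. A sketch I added (names and toy data are illustrative):

```python
def lagrange_predict(xs, ys, x):
    """Degree-(n-1) interpolating polynomial through (xs, ys):
    maximum complexity for n points, hence zero training error."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# noisy samples of (roughly) y = x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1]

mean_model = sum(ys) / len(ys)  # lowest-complexity model: a constant
train_err_low  = sum((y - mean_model) ** 2 for y in ys) / len(ys)
train_err_high = sum((y - lagrange_predict(xs, ys, x)) ** 2
                     for x, y in zip(xs, ys)) / len(ys)
print(train_err_low, train_err_high)  # high complexity drives training error to 0
```

Zero training error says nothing about the test error, which is exactly the point of the figure above.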
Test Error to EPE: (Extra)

- Random input vector: X
- Random output variable: Y
- Joint distribution: Pr(X, Y)
- Loss function: L(Y, f(X))
- Expected prediction error (EPE):
  EPE(f) = E[L(Y, f(X))] = integral of L(y, f(x)) Pr(dx, dy)
  e.g. = integral of (y - f(x))^2 Pr(dx, dy) for the squared error loss (also called L2 loss)
- One way to consider generalization: by considering the joint population distribution
(2) Bias-Variance Trade-off

- Models with too few parameters are inaccurate because of a large bias (not enough flexibility).
- Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample randomness).

Slide credit: D. Hoiem
(2.1) Regression: Complexity versus Goodness of Fit

- Model complexity = low: highest bias, lowest variance (low variance / high bias)
- Model complexity = medium: medium bias, medium variance
- Model complexity = high: smallest bias, highest variance (low bias / high variance)
(2.2) Classification: Decision boundaries in global vs. local models

- Linear classification: global, stable, can be inaccurate (low variance / high bias)
- 15-nearest neighbor and 1-nearest neighbor (KNN): local, accurate, unstable (low bias / high variance)
- What ultimately matters: GENERALIZATION
Model "bias" & Model "variance"

- Middle RED: the TRUE function
- Error due to bias: how far off, in general, the estimates are from the middle red (true) function
- Error due to variance: how wildly the blue points spread
Randomness of the Training Set => Variance of the Models
Need to make assumptions that are able to generalize

Components:
- Bias: how much does the average model, over all training sets, differ from the true model? (Error due to inaccurate assumptions/simplifications made by the model.)
- Variance: how much do models estimated from different training sets differ from each other?

Slide credit: L. Lazebnik
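These two components can be estimated empirically by fitting the same model class to many independent training sets and looking at the predictions at one query point. A sketch I added: a straight line (too simple) fit to data from the true model y = x^2 shows large bias and small variance; all names and numbers are illustrative.

```python
import random
random.seed(1)

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def true_f(x):
    return x * x

x0 = 0.0                       # query point where we measure bias/variance
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
preds = []
for _ in range(2000):          # many independent noisy training sets
    ys = [true_f(x) + random.gauss(0, 0.1) for x in xs]
    a, b = fit_line(xs, ys)
    preds.append(a + b * x0)

avg_pred = sum(preds) / len(preds)
bias = avg_pred - true_f(x0)   # average model vs. true model at x0
variance = sum((p - avg_pred) ** 2 for p in preds) / len(preds)
print(bias, variance)          # large |bias|: a line cannot represent x^2
```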
Today Extra:
- K-nearest neighbor
- Model Selection / Bias-Variance Tradeoff
  - Bias-Variance tradeoff
  - High bias? High variance? How to respond? (EXTRA)
Credit: Stanford Machine Learning
References
- Tan, Steinbach, and Kumar, "Introduction to Data Mining" slides
- Prof. Andrew Moore's slides
- Prof. Eric Xing's slides
- Hastie, Trevor, et al. The Elements of Statistical Learning. Vol. 2. New York: Springer, 2009.
Today Extra:
- K-nearest neighbor
- Model Selection / Bias-Variance Tradeoff
  - Bias-Variance tradeoff
  - High bias? High variance? How to respond?
  - EPE
Need to make assumptions that are able to generalize

- Underfitting: the model is too "simple" to represent all the relevant class characteristics
  - High bias and low variance
  - High training error and high test error
- Overfitting: the model is too "complex" and fits irrelevant characteristics (noise) in the data
  - Low bias and high variance
  - Low training error and high test error

Slide credit: L. Lazebnik
(1) High variance

[Figure: training and test error vs. training set size m; could also use the cross-validation error]

- Test error is still decreasing as m increases: suggests a larger training set will help.
- Large gap between training and test error.
- Low training error and high test error.

Slide credit: A. Ng
How to reduce variance?

- Choose a simpler classifier (more bias)
- Regularize the parameters (more bias)
- Get more training data
- Try a smaller set of features (more bias)

Slide credit: D. Hoiem
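The "regularize the parameters" option can be illustrated in one dimension: ridge-style shrinkage of a slope estimate lowers its variance across training sets (at the cost of bias). A sketch I added, assuming a no-intercept model y = b*x; all names are illustrative:

```python
import random
random.seed(2)

def ridge_slope(xs, ys, lam):
    """Slope of y = b*x with an L2 penalty lam (no intercept); larger lam
    shrinks b toward 0: more bias, less variance."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0]

def estimates(lam, n_sets=2000):
    """Fit the slope on many independent noisy training sets (true slope 2)."""
    out = []
    for _ in range(n_sets):
        ys = [2.0 * x + random.gauss(0, 1.0) for x in xs]
        out.append(ridge_slope(xs, ys, lam))
    return out

def var(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

# variance of the slope estimate, without vs. with regularization
print(var(estimates(0.0)), var(estimates(10.0)))
```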
(2) High bias

[Figure: training and test error vs. training set size m; could also use the cross-validation error]

- Even the training error is unacceptably high.
- Small gap between training and test error.
- High training error and high test error.

Slide credit: A. Ng
How to reduce bias?

E.g.:
- Get additional features
- Try adding basis expansions, e.g., polynomial
- Try a more complex learner
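The "basis expansions" option above can be shown with a one-parameter fit: on data that is roughly quadratic, replacing the feature x by the expanded feature x^2 cuts the training error sharply. A sketch I added (the helper `fit_feature` and the toy data are illustrative):

```python
def fit_feature(feature, xs, ys):
    """Least squares for the one-parameter model y = w * feature(x)."""
    fs = [feature(x) for x in xs]
    w = sum(f * y for f, y in zip(fs, ys)) / sum(f * f for f in fs)
    return lambda x, w=w: w * feature(x)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 3.9, 9.2, 15.8]   # roughly y = x^2

def train_mse(model):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

linear    = fit_feature(lambda x: x, xs, ys)      # too simple: high bias
quadratic = fit_feature(lambda x: x * x, xs, ys)  # basis expansion x -> x^2
print(train_mse(linear), train_mse(quadratic))    # the expansion cuts the error
```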
(3) For instance, if trying to solve "spam detection" using L2-... (Extra)

If performance is not as desired: [Figure listing possible fixes. Slide credit: A. Ng]
(4) Model Selection and Assessment

- Model Selection: estimating the performances of different models in order to choose the best one
- Model Assessment: having chosen a model, estimating the prediction error on new data
Model Selection and Assessment (Extra)

- Data-rich scenario: split the dataset into Train / Validation / Test (validation for model selection, test for model assessment)
- Insufficient data to split into 3 parts:
  - Approximate the validation step analytically: AIC, BIC, MDL, SRM
  - Efficient reuse of samples: cross-validation, bootstrap
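The "efficient reuse of samples" idea behind k-fold cross-validation can be sketched as follows; the function names and the constant-model example are illustrative, not from the slides:

```python
import random

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k roughly equal validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_score(fit, predict, xs, ys, k=5):
    """Average validation MSE over k folds: every sample is used for
    training in some folds and for validation in exactly one."""
    errs = []
    for fold in kfold_indices(len(xs), k):
        held = set(fold)
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in held]
        model = fit(train)
        errs.append(sum((ys[i] - predict(model, xs[i])) ** 2
                        for i in fold) / len(fold))
    return sum(errs) / len(errs)

# Toy use: a constant (mean) model on constant data gives zero CV error
xs, ys = list(range(10)), [1.0] * 10
mean_fit = lambda tr: sum(y for _, y in tr) / len(tr)
mean_predict = lambda m, x: m
print(cross_val_score(mean_fit, mean_predict, xs, ys, k=5))  # 0.0
```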
Statistical Decision Theory (Extra)

- Random input vector: X
- Random output variable: Y
- Joint distribution: Pr(X, Y)
- Loss function: L(Y, f(X))
- Expected prediction error (EPE):
  EPE(f) = E[L(Y, f(X))] = integral of L(y, f(x)) Pr(dx, dy)
  e.g. = integral of (y - f(x))^2 Pr(dx, dy) for the squared error loss (also called L2 loss); consider the population distribution
Expected prediction error (EPE)

EPE(f) = E[L(Y, f(X))] = integral of L(y, f(x)) Pr(dx, dy)   (considers the joint distribution)

- For L2 loss: EPE(f) = integral of (y - f(x))^2 Pr(dx, dy). Under L2 loss, the best estimator for EPE (theoretically) is the conditional mean, f-hat(x) = E(Y | X = x); KNN methods are a direct implementation (approximation) of it.
- For 0-1 loss: L(k, l) = 1 - delta_kl. The best estimator is the Bayes classifier: f-hat(X) = C_k if Pr(C_k | X = x) = max over g in C of Pr(g | X = x).
EXPECTED PREDICTION ERROR for L2 Loss

- Expected prediction error (EPE) for L2 loss:
  EPE(f) = E(Y - f(X))^2 = integral of (y - f(x))^2 Pr(dx, dy)
- Since Pr(X, Y) = Pr(Y | X) Pr(X), EPE can also be written as
  EPE(f) = E_X E_{Y|X}( [Y - f(X)]^2 | X )
- Thus it suffices to minimize EPE pointwise:
  f(x) = argmin_c E_{Y|X}( [Y - c]^2 | X = x )
- Solution for regression: f(x) = E(Y | X = x), the conditional mean; the best estimator under L2 loss is the conditional expectation.
- Solution for kNN: approximate this conditional mean by averaging the y values of the k nearest neighbors.
KNN FOR MINIMIZING EPE

- We know that under L2 loss, the best estimator for EPE (theoretically) is the conditional mean, f(x) = E(Y | X = x).
- Nearest neighbors assumes that f(x) is well approximated by a locally constant function.
Review: WHEN EPE USES DIFFERENT LOSS

Loss function -> Estimator:
- L2: the conditional mean, f(x) = E(Y | X = x)
- L1: the conditional median, f(x) = median(Y | X = x)
- 0-1: the Bayes classifier / MAP, f-hat(x) = argmax over g of Pr(g | X = x)
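The loss-to-estimator pairings above can be checked numerically: over a sample, the constant minimizing the average squared loss is the sample mean, while the constant minimizing the average absolute loss is the median. A sketch I added (grid search over candidate constants; all names are illustrative):

```python
ys = [1.0, 2.0, 2.0, 3.0, 10.0]  # note the outlier at 10

def risk(c, loss):
    """Empirical risk of the constant prediction c under the given loss."""
    return sum(loss(y, c) for y in ys) / len(ys)

sq = lambda y, c: (y - c) ** 2   # L2 loss
ab = lambda y, c: abs(y - c)     # L1 loss

candidates = [i / 100 for i in range(0, 1101)]  # grid over [0, 11]
best_sq = min(candidates, key=lambda c: risk(c, sq))
best_ab = min(candidates, key=lambda c: risk(c, ab))
mean, median = sum(ys) / len(ys), sorted(ys)[len(ys) // 2]
print(best_sq, mean)    # L2 risk is minimized at the mean (3.6)
print(best_ab, median)  # L1 risk is minimized at the median (2.0)
```

The outlier pulls the mean (and hence the L2-optimal constant) upward, while the median is unaffected, which is why the L1 estimator is the more robust of the two.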
Decomposition of EPE

- Assume an additive error model: Y = f(X) + eps, with E(eps) = 0 and Var(eps) = sigma^2
- Notations: output random variable Y; prediction function f; prediction estimator f-hat
- EPE then decomposes into the irreducible / Bayes error (sigma^2) plus the MSE of f-hat in estimating f
Bias-Variance Trade-off for EPE:

EPE(x_0) = noise^2 + bias^2 + variance

- noise^2: unavoidable error
- bias^2: error due to incorrect assumptions
- variance: error due to the variance of the training samples

Slide credit: D. Hoiem
(See http://alex.smola.org/teaching/10-701-15/homework/hw3_sol.pdf)
MSE of a Model, aka Risk

(See http://www.stat.cmu.edu/~ryantibs/statml/review/modelbasics.pdf)
Cross Validation and Variance Estimation

(See http://www.stat.cmu.edu/~ryantibs/statml/review/modelbasics.pdf)