TRANSCRIPT
UVA CS 4501: Machine Learning
Lecture 10: K-nearest-neighbor Classifier / Bias-Variance Tradeoff
Dr. Yanjun Qi
University of Virginia, Department of Computer Science
10/6/18
Where are we? Five major sections of this course:
- Regression (supervised)
- Classification (supervised)
- Unsupervised models
- Learning theory
- Graphical models
Three major sections for classification

We can divide the large variety of classification approaches into roughly three major types:
1. Discriminative: directly estimate a decision rule/boundary, e.g., support vector machine, decision tree, logistic regression, neural networks (NN), deep NN
2. Generative: build a generative statistical model, e.g., Bayesian networks, Naive Bayes classifier
3. Instance-based classifiers: use the observations directly (no model), e.g., k-nearest neighbors
Today:
- K-nearest neighbor
- Model Selection / Bias-Variance Tradeoff
  - Bias-Variance tradeoff
  - High bias? High variance? How to respond?
Nearest neighbor classifiers

Require three inputs:
1. The set of stored training samples
2. A distance metric to compute the distance between samples
3. The value of k, i.e., the number of nearest neighbors to retrieve
Nearest neighbor classifiers

To classify an unknown sample:
1. Compute its distance to the training records
2. Identify the k nearest neighbors
3. Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
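The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation; the names `euclidean` and `knn_classify` are just illustrative.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Step 1: distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label) pairs."""
    # Step 2: identify the k nearest neighbors
    neighbors = sorted(train, key=lambda s: euclidean(s[0], query))[:k]
    # Step 3: majority vote over the neighbors' class labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.0), "B"), ((4.2, 3.9), "B")]
print(knn_classify(train, (1.1, 1.0), k=3))  # query near the "A" cluster -> "A"
```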
Definition of nearest neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]

The k nearest neighbors of a sample x are the data points that have the k smallest distances to x.
1-nearest neighbor for classification

Voronoi diagram: a partitioning of the plane into regions based on distance to the points in a specific subset of the plane.
Model Selection for Nearest neighbor classification

Choosing the value of k:
- If k is too small, the classifier is sensitive to noise points
- If k is too large, the neighborhood may include (many) points from other classes
Challenge in Nearest neighbor classification

Scaling issues: attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes. Example:
- height of a person may vary from 1.5 m to 1.8 m
- weight of a person may vary from 90 lb to 300 lb
- income of a person may vary from $10K to $1M
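One common fix for the scaling issue is min-max rescaling of each attribute to [0, 1] before computing distances. A small sketch (the function name and the toy numbers are illustrative, following the height/income example above):

```python
def min_max_scale(columns):
    """Rescale each attribute (column) to [0, 1] so that no single
    attribute dominates the distance computation."""
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        scaled.append([(v - lo) / (hi - lo) for v in col])
    return scaled

height = [1.5, 1.65, 1.8]                # metres: tiny numeric range
income = [10_000, 500_000, 1_000_000]    # dollars: huge numeric range
h_scaled, i_scaled = min_max_scale([height, income])
print(h_scaled)  # both attributes now live on the same [0, 1] scale
print(i_scaled)
```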
Nearest neighbor classification

- The k-nearest neighbor classifier is a lazy learner: it does not build a model explicitly, and classifying unknown samples is relatively expensive.
- The k-nearest neighbor classifier is a local model, vs. a global model like linear classifiers.
K-Nearest Neighbor for Regression?

- For KNN, the decision is delayed till a new instance arrives
- The target variable may be discrete or real-valued
- When the target is continuous, the naive prediction is the mean value of the k nearest training examples
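The naive KNN regression prediction described above (mean of the k nearest targets) can be sketched for 1D input as follows; `knn_regress` and the toy data are illustrative:

```python
def knn_regress(train, query_x, k=5):
    """train: list of (x, y) pairs with scalar x. Prediction is the
    mean y of the k nearest training examples (the naive estimate)."""
    neighbors = sorted(train, key=lambda s: abs(s[0] - query_x))[:k]
    return sum(y for _, y in neighbors) / len(neighbors)

# toy 1D data roughly following y = x, plus one far-away point
train = [(0.0, 0.1), (1.0, 1.2), (2.0, 1.9),
         (3.0, 3.1), (4.0, 3.8), (10.0, 10.0)]
print(knn_regress(train, 2.5, k=3))  # average of the 3 nearest targets
```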
[Figures: K=5-nearest neighbor (1D input) for regression; nearest neighbor (1D input) for regression; when K = 2]
Nearest neighbor Variations

- Compute the distance between two points, for instance the Euclidean distance:
  d(x, y) = sqrt( sum_i (x_i - y_i)^2 )
- Options for determining the class from the nearest-neighbor list:
  - Take a majority vote of the class labels among the k nearest neighbors
  - Weight the votes according to distance, e.g., weight factor w = 1 / d^2
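Distance-weighted voting with w = 1/d^2, as mentioned above, can be sketched like this (function name and toy data are illustrative; the small epsilon guarding against division by zero is my addition, not from the slides):

```python
def weighted_knn_classify(train, query, k=3):
    """Vote with weight w = 1 / d**2 so that closer neighbors count more."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = sorted(train, key=lambda s: dist(s[0], query))[:k]
    scores = {}
    for x, label in neighbors:
        d = dist(x, query)
        w = 1.0 / (d * d + 1e-12)  # epsilon guards against d == 0
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

# One very close "B" outweighs two distant "A" votes:
train = [((0.0, 0.0), "B"), ((3.0, 0.0), "A"), ((3.1, 0.0), "A")]
print(weighted_knn_classify(train, (0.1, 0.0), k=3))  # -> "B"
```

An unweighted majority vote over the same three neighbors would have returned "A"; the inverse-square weighting is what flips the decision.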
Variants: Distance-Weighted k-Nearest Neighbor Algorithm

- Assign weights to the neighbors based on their "distance" from the query point; the weight "may" be the inverse square of the distance
- Extreme option: all training points may influence a particular instance, e.g., Shepard's method / Modified Shepard (from geospatial analysis)
Instance-based Regression vs. Linear Regression

- Linear regression learning: an explicit description of the target function on the whole training set
- Instance-based learning: learning = storing all training instances; referred to as "lazy" learning

Computational time cost
K-Nearest Neighbor (summary):
- Task: regression / classification
- Representation: local smoothness
- Score function: NA
- Search/Optimization: NA
- Models, Parameters: the training samples
Decision boundaries in global vs. local models

- Linear classification: global, stable, can be inaccurate
- 15-nearest neighbor and 1-nearest neighbor: local, accurate, unstable
- What ultimately matters: GENERALIZATION
- K = 15 vs. K = 1: K acts as a smoother
e.g. Training Error from KNN, Lesson Learned

- When k = 1: no misclassifications on the training set, i.e., overfit
- Minimizing training error is not always good (e.g., 1-NN)

[Figure: 1-nearest-neighbor averaging over (X1, X2); the decision boundary Y-hat = 0.5 separates the regions Y-hat < 0.5 and Y-hat > 0.5]
Review: Mean and Variance of a Random Variable (RV)

(X: random variables are written in capital letters)

Mean (Expectation): mu = E(X)
- Discrete RVs: E(X) = sum_{v_i} v_i * P(X = v_i)
- Continuous RVs: E(X) = integral from -inf to +inf of x * p(x) dx

Adapted from Carlos' probability tutorial
For a function g of the RV:
- Discrete RVs: E(g(X)) = sum_{v_i} g(v_i) * P(X = v_i)
- Continuous RVs: E(g(X)) = integral from -inf to +inf of g(x) * p(x) dx
Variance: Var(X) = E((X - mu)^2)
- Discrete RVs: V(X) = sum_{v_i} (v_i - mu)^2 * P(X = v_i)
- Continuous RVs: V(X) = integral from -inf to +inf of (x - mu)^2 * p(x) dx
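As a quick numerical check of these discrete-RV formulas (a sketch I added, not from the slides), consider a fair six-sided die:

```python
# Discrete RV: a fair six-sided die, P(X = v) = 1/6 for v in 1..6
values = [1, 2, 3, 4, 5, 6]
p = [1 / 6] * 6

mean = sum(v * pv for v, pv in zip(values, p))                 # E(X)
e_g  = sum(v ** 2 * pv for v, pv in zip(values, p))            # E(g(X)), g(x) = x^2
var  = sum((v - mean) ** 2 * pv for v, pv in zip(values, p))   # V(X)

print(mean)             # 3.5
print(var)              # 35/12, about 2.9167
print(e_g - mean ** 2)  # same value: Var(X) = E(X^2) - E(X)^2
```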
BIAS AND VARIANCE TRADE-OFF

- Bias: measures the accuracy or quality of the estimator; low bias implies that, on average, we accurately estimate the true parameter from the training data
- Variance: measures the precision or specificity of the estimator; low variance implies that the estimator does not change much as the training set varies
BIAS AND VARIANCE TRADE-OFF for the Mean Squared Error of parameter estimation

MSE(theta-hat) = E[(theta-hat - theta)^2] = Bias(theta-hat)^2 + Var(theta-hat)

- Bias^2: error due to incorrect assumptions
- Variance: error due to the variance of the training samples
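The decomposition MSE = bias^2 + variance can be verified empirically by repeatedly estimating a known parameter from fresh training sets. A small simulation I added (the sample-mean estimator of a known true mean; all names and numbers are illustrative):

```python
import random

random.seed(0)
TRUE_MU, N_SETS, N_SAMPLES = 2.0, 5000, 10

estimates = []
for _ in range(N_SETS):
    # draw an independent training set and apply the sample-mean estimator
    sample = [random.gauss(TRUE_MU, 1.0) for _ in range(N_SAMPLES)]
    estimates.append(sum(sample) / N_SAMPLES)

avg = sum(estimates) / N_SETS
bias_sq  = (avg - TRUE_MU) ** 2
variance = sum((e - avg) ** 2 for e in estimates) / N_SETS
mse      = sum((e - TRUE_MU) ** 2 for e in estimates) / N_SETS

# The empirical MSE decomposes exactly into bias^2 + variance
print(bias_sq, variance, mse)
assert abs(mse - (bias_sq + variance)) < 1e-9
```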
Bias-Variance Tradeoff / Model Selection

[Figure: error vs. model complexity, showing an underfit region and an overfit region]
(1) Training vs. Test Error

[Figure: expected test error and expected training error vs. model complexity]

- Training error can always be reduced by increasing model complexity
- The expected test error and the CV error are good approximations of generalization (see the extra slides about EPE)
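The claim that training error can always be driven down by increasing model complexity can be demonstrated directly: on n points, a degree-(n-1) interpolating polynomial attains zero training error, while a low-complexity model (here a constant) does not. A sketch I added (names and toy data are illustrative):

```python
def lagrange_predict(xs, ys, x):
    """Degree-(n-1) interpolating polynomial through (xs, ys):
    maximum complexity for n points, hence zero training error."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# noisy samples of (roughly) y = x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1]

mean_model = sum(ys) / len(ys)  # lowest-complexity model: a constant
train_err_low  = sum((y - mean_model) ** 2 for y in ys) / len(ys)
train_err_high = sum((y - lagrange_predict(xs, ys, x)) ** 2
                     for x, y in zip(xs, ys)) / len(ys)
print(train_err_low, train_err_high)  # high complexity drives training error to 0
```

Zero training error says nothing about the test error, which is exactly the point of the figure above.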
Test Error to EPE: (Extra)

- Random input vector: X
- Random output variable: Y
- Joint distribution: Pr(X, Y)
- Loss function: L(Y, f(X))
- Expected prediction error (EPE):
  EPE(f) = E[L(Y, f(X))] = integral of L(y, f(x)) Pr(dx, dy)
  e.g. = integral of (y - f(x))^2 Pr(dx, dy) for the squared error loss (also called L2 loss)
- One way to consider generalization: by considering the joint population distribution
(2) Bias-Variance Trade-off

- Models with too few parameters are inaccurate because of a large bias (not enough flexibility).
- Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample randomness).

Slide credit: D. Hoiem
(2.1) Regression: Complexity versus Goodness of Fit

- Model complexity = low: highest bias, lowest variance (low variance / high bias)
- Model complexity = medium: medium bias, medium variance
- Model complexity = high: smallest bias, highest variance (low bias / high variance)
(2.2) Classification: Decision boundaries in global vs. local models

- Linear classification: global, stable, can be inaccurate (low variance / high bias)
- 15-nearest neighbor and 1-nearest neighbor (KNN): local, accurate, unstable (low bias / high variance)
- What ultimately matters: GENERALIZATION
Model "bias" & Model "variance"

- Middle RED: the TRUE function
- Error due to bias: how far off, in general, the estimates are from the middle red (true) function
- Error due to variance: how wildly the blue points spread
Randomness of the Training Set => Variance of the Models
Need to make assumptions that are able to generalize

Components:
- Bias: how much does the average model, over all training sets, differ from the true model? (Error due to inaccurate assumptions/simplifications made by the model.)
- Variance: how much do models estimated from different training sets differ from each other?

Slide credit: L. Lazebnik
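These two components can be estimated empirically by fitting the same model class to many independent training sets and looking at the predictions at one query point. A sketch I added: a straight line (too simple) fit to data from the true model y = x^2 shows large bias and small variance; all names and numbers are illustrative.

```python
import random
random.seed(1)

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def true_f(x):
    return x * x

x0 = 0.0                       # query point where we measure bias/variance
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
preds = []
for _ in range(2000):          # many independent noisy training sets
    ys = [true_f(x) + random.gauss(0, 0.1) for x in xs]
    a, b = fit_line(xs, ys)
    preds.append(a + b * x0)

avg_pred = sum(preds) / len(preds)
bias = avg_pred - true_f(x0)   # average model vs. true model at x0
variance = sum((p - avg_pred) ** 2 for p in preds) / len(preds)
print(bias, variance)          # large |bias|: a line cannot represent x^2
```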
Today Extra:
- K-nearest neighbor
- Model Selection / Bias-Variance Tradeoff
  - Bias-Variance tradeoff
  - High bias? High variance? How to respond? (EXTRA)
Credit: Stanford Machine Learning
References
- Tan, Steinbach, and Kumar, "Introduction to Data Mining" slides
- Prof. Andrew Moore's slides
- Prof. Eric Xing's slides
- Hastie, Trevor, et al. The Elements of Statistical Learning. Vol. 2. New York: Springer, 2009.
Today Extra:
- K-nearest neighbor
- Model Selection / Bias-Variance Tradeoff
  - Bias-Variance tradeoff
  - High bias? High variance? How to respond?
  - EPE
Need to make assumptions that are able to generalize

- Underfitting: the model is too "simple" to represent all the relevant class characteristics
  - High bias and low variance
  - High training error and high test error
- Overfitting: the model is too "complex" and fits irrelevant characteristics (noise) in the data
  - Low bias and high variance
  - Low training error and high test error

Slide credit: L. Lazebnik
(1) High variance

[Figure: training and test error vs. training set size m; could also use the cross-validation error]

- Test error is still decreasing as m increases: suggests a larger training set will help.
- Large gap between training and test error.
- Low training error and high test error.

Slide credit: A. Ng
How to reduce variance?

- Choose a simpler classifier (more bias)
- Regularize the parameters (more bias)
- Get more training data
- Try a smaller set of features (more bias)

Slide credit: D. Hoiem
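The "regularize the parameters" option can be illustrated in one dimension: ridge-style shrinkage of a slope estimate lowers its variance across training sets (at the cost of bias). A sketch I added, assuming a no-intercept model y = b*x; all names are illustrative:

```python
import random
random.seed(2)

def ridge_slope(xs, ys, lam):
    """Slope of y = b*x with an L2 penalty lam (no intercept); larger lam
    shrinks b toward 0: more bias, less variance."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0]

def estimates(lam, n_sets=2000):
    """Fit the slope on many independent noisy training sets (true slope 2)."""
    out = []
    for _ in range(n_sets):
        ys = [2.0 * x + random.gauss(0, 1.0) for x in xs]
        out.append(ridge_slope(xs, ys, lam))
    return out

def var(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

# variance of the slope estimate, without vs. with regularization
print(var(estimates(0.0)), var(estimates(10.0)))
```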
(2) High bias

[Figure: training and test error vs. training set size m; could also use the cross-validation error]

- Even the training error is unacceptably high.
- Small gap between training and test error.
- High training error and high test error.

Slide credit: A. Ng
How to reduce bias?

E.g.:
- Get additional features
- Try adding basis expansions, e.g., polynomial
- Try a more complex learner
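The "basis expansions" option above can be shown with a one-parameter fit: on data that is roughly quadratic, replacing the feature x by the expanded feature x^2 cuts the training error sharply. A sketch I added (the helper `fit_feature` and the toy data are illustrative):

```python
def fit_feature(feature, xs, ys):
    """Least squares for the one-parameter model y = w * feature(x)."""
    fs = [feature(x) for x in xs]
    w = sum(f * y for f, y in zip(fs, ys)) / sum(f * f for f in fs)
    return lambda x, w=w: w * feature(x)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 3.9, 9.2, 15.8]   # roughly y = x^2

def train_mse(model):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

linear    = fit_feature(lambda x: x, xs, ys)      # too simple: high bias
quadratic = fit_feature(lambda x: x * x, xs, ys)  # basis expansion x -> x^2
print(train_mse(linear), train_mse(quadratic))    # the expansion cuts the error
```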
(3) For instance, if trying to solve "spam detection" using L2-... (Extra)

If performance is not as desired: [Figure listing possible fixes. Slide credit: A. Ng]
(4) Model Selection and Assessment

- Model Selection: estimating the performances of different models in order to choose the best one
- Model Assessment: having chosen a model, estimating the prediction error on new data
Model Selection and Assessment (Extra)

- Data-rich scenario: split the dataset into Train / Validation / Test (validation for model selection, test for model assessment)
- Insufficient data to split into 3 parts:
  - Approximate the validation step analytically: AIC, BIC, MDL, SRM
  - Efficient reuse of samples: cross-validation, bootstrap
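The "efficient reuse of samples" idea behind k-fold cross-validation can be sketched as follows; the function names and the constant-model example are illustrative, not from the slides:

```python
import random

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k roughly equal validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_score(fit, predict, xs, ys, k=5):
    """Average validation MSE over k folds: every sample is used for
    training in some folds and for validation in exactly one."""
    errs = []
    for fold in kfold_indices(len(xs), k):
        held = set(fold)
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in held]
        model = fit(train)
        errs.append(sum((ys[i] - predict(model, xs[i])) ** 2
                        for i in fold) / len(fold))
    return sum(errs) / len(errs)

# Toy use: a constant (mean) model on constant data gives zero CV error
xs, ys = list(range(10)), [1.0] * 10
mean_fit = lambda tr: sum(y for _, y in tr) / len(tr)
mean_predict = lambda m, x: m
print(cross_val_score(mean_fit, mean_predict, xs, ys, k=5))  # 0.0
```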
Statistical Decision Theory (Extra)

- Random input vector: X
- Random output variable: Y
- Joint distribution: Pr(X, Y)
- Loss function: L(Y, f(X))
- Expected prediction error (EPE):
  EPE(f) = E[L(Y, f(X))] = integral of L(y, f(x)) Pr(dx, dy)
  e.g. = integral of (y - f(x))^2 Pr(dx, dy) for the squared error loss (also called L2 loss); consider the population distribution
Expected prediction error (EPE)

EPE(f) = E[L(Y, f(X))] = integral of L(y, f(x)) Pr(dx, dy)   (considers the joint distribution)

- For L2 loss: EPE(f) = integral of (y - f(x))^2 Pr(dx, dy). Under L2 loss, the best estimator for EPE (theoretically) is the conditional mean, f-hat(x) = E(Y | X = x); KNN methods are a direct implementation (approximation) of it.
- For 0-1 loss: L(k, l) = 1 - delta_kl. The best estimator is the Bayes classifier: f-hat(X) = C_k if Pr(C_k | X = x) = max over g in C of Pr(g | X = x).
EXPECTED PREDICTION ERROR for L2 Loss

- Expected prediction error (EPE) for L2 loss:
  EPE(f) = E(Y - f(X))^2 = integral of (y - f(x))^2 Pr(dx, dy)
- Since Pr(X, Y) = Pr(Y | X) Pr(X), EPE can also be written as
  EPE(f) = E_X E_{Y|X}( [Y - f(X)]^2 | X )
- Thus it suffices to minimize EPE pointwise:
  f(x) = argmin_c E_{Y|X}( [Y - c]^2 | X = x )
- Solution for regression: f(x) = E(Y | X = x), the conditional mean; the best estimator under L2 loss is the conditional expectation.
- Solution for kNN: approximate this conditional mean by averaging the y values of the k nearest neighbors.
KNN FOR MINIMIZING EPE

- We know that under L2 loss, the best estimator for EPE (theoretically) is the conditional mean, f(x) = E(Y | X = x).
- Nearest neighbors assumes that f(x) is well approximated by a locally constant function.
Review: WHEN EPE USES DIFFERENT LOSS

Loss function -> Estimator:
- L2: the conditional mean, f(x) = E(Y | X = x)
- L1: the conditional median, f(x) = median(Y | X = x)
- 0-1: the Bayes classifier / MAP, f-hat(x) = argmax over g of Pr(g | X = x)
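The loss-to-estimator pairings above can be checked numerically: over a sample, the constant minimizing the average squared loss is the sample mean, while the constant minimizing the average absolute loss is the median. A sketch I added (grid search over candidate constants; all names are illustrative):

```python
ys = [1.0, 2.0, 2.0, 3.0, 10.0]  # note the outlier at 10

def risk(c, loss):
    """Empirical risk of the constant prediction c under the given loss."""
    return sum(loss(y, c) for y in ys) / len(ys)

sq = lambda y, c: (y - c) ** 2   # L2 loss
ab = lambda y, c: abs(y - c)     # L1 loss

candidates = [i / 100 for i in range(0, 1101)]  # grid over [0, 11]
best_sq = min(candidates, key=lambda c: risk(c, sq))
best_ab = min(candidates, key=lambda c: risk(c, ab))
mean, median = sum(ys) / len(ys), sorted(ys)[len(ys) // 2]
print(best_sq, mean)    # L2 risk is minimized at the mean (3.6)
print(best_ab, median)  # L1 risk is minimized at the median (2.0)
```

The outlier pulls the mean (and hence the L2-optimal constant) upward, while the median is unaffected, which is why the L1 estimator is the more robust of the two.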
Decomposition of EPE

- Assume an additive error model: Y = f(X) + eps, with E(eps) = 0 and Var(eps) = sigma^2
- Notations: output random variable Y; prediction function f; prediction estimator f-hat
- EPE then decomposes into the irreducible / Bayes error (sigma^2) plus the MSE of f-hat in estimating f
Bias-Variance Trade-off for EPE:

EPE(x_0) = noise^2 + bias^2 + variance

- noise^2: unavoidable error
- bias^2: error due to incorrect assumptions
- variance: error due to the variance of the training samples

Slide credit: D. Hoiem
(See http://alex.smola.org/teaching/10-701-15/homework/hw3_sol.pdf)
MSE of a Model, aka Risk

(See http://www.stat.cmu.edu/~ryantibs/statml/review/modelbasics.pdf)
Cross Validation and Variance Estimation

(See http://www.stat.cmu.edu/~ryantibs/statml/review/modelbasics.pdf)