Modernizing k-Nearest Neighbor Software
Robin Elizabeth Yancey
Bochao Xin
Norm Matloff
Dept. of Computer Science, University of California, Davis
June 4, 2020; updated June 5
URL for these slides (repeated on final slide): http://heather.cs.ucdavis.edu/SDSSslidesKNN.pdf
Notation and Acronyms
• n: number of data points in our training data
• p: number of predictors/features
• ML: machine learning (= nonparametric regression)
• k-NN: k-nearest neighbor method
• RFs: random forests
• SVMs: Support Vector Machines
• NNs: neural networks
Overview of k-NN
• Like all ML methods, does smoothing: the estimate of E(Y | X = t) is the average Y among the k nearest data points to t.
• Earliest ML method, e.g. (Fix and Hodges, 1951).
• Later, largely displaced in popularity by RFs, SVMs, NNs.
• Still common in some apps., e.g. recommender systems, outlier detection.
• And has some real advantages:
Comparison of Various ML Methods
| method | tuning pars. (fewer better) | iterative? (no better) | unique sol’n.? (yes better) |
|---|---|---|---|
| k-NN | k | no | yes |
| RFs | depth, leaf size, split crit. etc. | yes | no |
| SVM | d, C | yes | yes |
| NNs | “∞” | yes | no |
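To make the k-NN row of the comparison concrete, here is a minimal base-R sketch of the estimate described in the overview: average the Y values of the k training points nearest to t. It is an illustration only, not the regtools kNN() function, and `knnPredict` is a hypothetical helper name. There is a single tuning parameter (k), no iterative fitting, and the answer is fully determined by the data (apart from how distance ties are broken).

```r
# Minimal k-NN regression sketch in base R (illustration only, not regtools::kNN).
# knnPredict() is a hypothetical helper name.
knnPredict <- function(x, y, t, k) {
  x <- as.matrix(x)
  # Euclidean distance from the query point t to every training row
  d <- sqrt(rowSums(sweep(x, 2, t)^2))
  # indices of the k closest training points
  nearest <- order(d)[1:k]
  # the estimate of E(Y | X = t) is the average Y among those neighbors
  mean(y[nearest])
}

# Toy data: y is exactly 2 * x, so the result is easy to check by hand
x <- matrix(1:10, ncol = 1)
y <- 2 * (1:10)
knnPredict(x, y, t = 5, k = 3)   # neighbors are x = 4, 5, 6, so mean(8, 10, 12) = 10
```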
Improved k-NN
• So, k-NN has the virtues of being simple, e.g. only one tuning parameter, and computationally attractive.
• We believe that, with improvements, k-NN can be quite competitive with other methods.
• Two innovations, one methodological and one diagnostic:
  • Assigning different distance weights to different predictors.
  • Exploring locally-determined values of k.
• This talk will focus on the first innovation.
Different Distance Weights for Different Predictors
• E.g. done in (Han et al, 2001) for cosine “distance” for text classification. Optimization is performed.
• Here we’ll use (weighted) Euclidean distance.
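One common way to realize a weighted Euclidean distance, assumed here for illustration, is d_w(u, v) = sqrt(sum_j w_j^2 (u_j - v_j)^2). This is the same as multiplying predictor j by w_j and then taking the ordinary Euclidean distance. The base-R sketch below shows the equivalence; `weightedDist` is a hypothetical helper and the two players are made-up values. In practice the predictors would typically be put on a common scale before any weights are applied.

```r
# Weighted Euclidean distance sketch (base R; weightedDist() is hypothetical).
# Multiplying predictor j by w[j] and then taking the ordinary Euclidean
# distance is the same as using sqrt(sum(w^2 * (u - v)^2)).
weightedDist <- function(u, v, w) sqrt(sum((w * (u - v))^2))

u <- c(74, 25)   # (height, age) for one player, made-up values
v <- c(72, 29)   # another made-up player
weightedDist(u, v, w = c(1, 1))     # unweighted: sqrt(2^2 + 4^2)
weightedDist(u, v, w = c(1.8, 1))   # height differences count 1.8 times as much

# Same result by expanding the height column first, then calling dist()
xy <- rbind(u, v) %*% diag(c(1.8, 1))
as.numeric(dist(xy))
```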
Empirical Examples
• Will use the regtools package (on CRAN, but latest at github.com/matloff).
• Over 50 tools for regression, classification and ML.
• Will use kNN() and fineTuning().
The fineTuning() Function
• Advanced grid search tool for tuning parameter selection.
• Motivation: The reported “best” parameter combination may not really be best. Avoid the p-hacking problem.
• The tool allows exploring various good parameter combinations, with Bonferroni CIs.
• Includes a plotting facility.
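The idea behind such a tool can be sketched in base R as follows. This is not the regtools implementation: `gridSearch`, `lossFtn` and `knnLoss` are hypothetical names, the resampling is assumed to be repeated random holdout, and the intervals are assumed to be two-sided normal intervals with the overall level split across all combinations (Bonferroni). The point is that every combination is reported with a standard error and an interval radius, rather than only a single "winner".

```r
# Sketch of a grid search with Bonferroni radii (base R only).
# This is NOT the regtools implementation; gridSearch() and lossFtn are
# hypothetical names used for illustration.
gridSearch <- function(data, pars, lossFtn, nTst, nXval, alpha = 0.05) {
  grid <- expand.grid(pars)            # every parameter combination
  nComb <- nrow(grid)
  meanAcc <- seAcc <- numeric(nComb)
  for (i in seq_len(nComb)) {
    losses <- replicate(nXval, {
      tst <- sample(nrow(data), nTst)  # random holdout split
      lossFtn(data[-tst, ], data[tst, ], grid[i, , drop = FALSE])
    })
    meanAcc[i] <- mean(losses)
    seAcc[i] <- sd(losses) / sqrt(nXval)
  }
  # Bonferroni: split alpha across all nComb simultaneous intervals
  bonfAcc <- qnorm(1 - alpha / (2 * nComb)) * seAcc
  out <- cbind(grid, meanAcc, seAcc, bonfAcc)
  out[order(out$meanAcc), ]            # smallest loss first
}

# Usage on simulated data: tune k for a one-predictor k-NN smoother
set.seed(1)
d <- data.frame(x = runif(200))
d$y <- sin(2 * pi * d$x) + rnorm(200, sd = 0.3)
knnLoss <- function(dtrn, dtst, cmbi) {
  pred <- sapply(dtst$x, function(t) {
    nearest <- order(abs(dtrn$x - t))[1:cmbi$k]
    mean(dtrn$y[nearest])
  })
  mean(abs(dtst$y - pred))             # mean absolute prediction error
}
res <- gridSearch(d, list(k = c(5, 20, 50)), knnLoss, nTst = 50, nXval = 20)
res
```

Combinations whose intervals (meanAcc plus or minus bonfAcc) overlap should be treated as roughly equivalent rather than ranked.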
Example: Major League Baseball Data
• For convenience, a very simple example: Predict weight from height, age.
• Dataset from regtools package.
• n = 1023, p = 2 (plus others not used here)
MLB, cont’d.
> data(mlb)  # in regtools pkg
> mlb <- mlb[, c(4,6,5)]
> mlb[1,]
  Height   Age Weight
1     74 22.99    180
> args(kNN)
function (x, y, newx=x, kmax, scaleX=TRUE,
    PCAcomps=0, expandVars=NULL, expandVals=NULL,
    smoothingFtn=mean, allK=FALSE, leave1out=FALSE,
    classif=FALSE)
MLB, cont’d
The fineTuning() function calls a user-defined function that does the work:
# fineTuning() forms current training, test sets,
# dtrn and dtst, and current parameter combination
# 'cmbi'
knnCall <- function(dtrn, dtst, cmbi) {
   knnOut <- kNN(dtrn[,1:2], dtrn[,3], dtst[,1:2],
      cmbi$k, expandVars=1, expandVals=cmbi$expandHt)
   mean(abs(dtst[,3] - knnOut$regests))
}
And the call:
ft <- fineTuning(mlb, pars=list(k=c(5,20,50,100),
   expandHt=c(1.8,1.5,1.2,1,0.8,0.5,0.2)),
   regCall=knnCall, nTst=500, nXval=100)
MLB Output
> ft$outdf
     k expandHt  meanAcc      seAcc    bonfAcc
1   50      1.8 13.81726 0.03721619 0.11625351
2   20      1.8 13.84013 0.03122950 0.09755266
3  100      1.8 13.87238 0.03471346 0.10843563
4   20      0.8 13.87528 0.03619783 0.11307242
5  100      1.2 13.89429 0.03805532 0.11887472
...
...
24   5      1.2 14.84733 0.03666898 0.11454417
25   5      1.5 14.89271 0.03242414 0.10128441
26   5      0.2 14.89479 0.03801763 0.11875700
27   5      0.5 14.90646 0.04020769 0.12559816
28 100      0.2 15.14842 0.03691466 0.11531160
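In the printed rows, bonfAcc is the same multiple of seAcc throughout. The slides do not state how the multiplier is computed. It is consistent with a two-sided normal Bonferroni adjustment at overall level 0.05 over the 4 × 7 = 28 combinations in the grid, which is assumed in this quick base-R check:

```r
# First three rows of ft$outdf as printed in the output above
seAcc   <- c(0.03721619, 0.03122950, 0.03471346)
bonfAcc <- c(0.11625351, 0.09755266, 0.10843563)

bonfAcc / seAcc                  # the same multiplier in every row, about 3.12

# Assumption: two-sided normal intervals, overall alpha = 0.05,
# split over the 4 * 7 = 28 parameter combinations in the grid
nComb <- 4 * 7
qnorm(1 - 0.05 / (2 * nComb))    # also about 3.12
```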
MLB Comments
• As expected, the largest expansion value for Height seems best; Height is more important than Age.
• Further investigation with even larger expansion seems warranted.
• But beware of p-hacking!
  • All results subject to sample variation.
  • Thus fineTuning() displays radii of Bonferroni CIs.
  • An earlier run with nXval (cross val. folds) at 25 had ambiguous results; 100 works well here.
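The p-hacking caution can be made concrete with the numbers in the MLB output. The base-R sketch below forms the intervals meanAcc ± bonfAcc for a few of the printed rows and checks which ones overlap the interval of the top-ranked combination. Interval overlap is used here only as a rough, conservative screen, not as a formal test.

```r
# Selected rows of ft$outdf as printed in the MLB Output slide
# (ranks 1 to 5, 24 and 28)
out <- data.frame(
  k        = c(50, 20, 100, 20, 100, 5, 100),
  expandHt = c(1.8, 1.8, 1.8, 0.8, 1.2, 1.2, 0.2),
  meanAcc  = c(13.81726, 13.84013, 13.87238, 13.87528, 13.89429,
               14.84733, 15.14842),
  bonfAcc  = c(0.11625351, 0.09755266, 0.10843563, 0.11307242, 0.11887472,
               0.11454417, 0.11531160)
)
out$lower <- out$meanAcc - out$bonfAcc
out$upper <- out$meanAcc + out$bonfAcc
# Does each interval overlap the interval of the top-ranked combination?
out$overlapsBest <- out$lower <= out$upper[1]
out
```

The five best rows all overlap the top interval, so their ordering should not be over-read, while the rank-24 and rank-28 rows are clearly separated from it.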
MLB Plot
• The fineTuning() function has an associated generic plot function.
• It uses the parallel coordinates graphical method (Inselberg, 1997).
• This views multidimensional data in 2-D.
• Implemented in the cdparcoord (“categorical and discrete parallel coordinates”) package.
• The latter uses Plotly, so one can drag columns to change their order, etc.
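For readers unfamiliar with parallel coordinates, here is a minimal static sketch in base R graphics. It is not cdparcoord and has no interactivity (no column dragging). The function names rescale01 and parCoordSketch and the toy data frame are made up for illustration.

```r
# Rescale a numeric vector to [0, 1] so all axes share one vertical scale
rescale01 <- function(v) {
   rng <- range(v)
   if (rng[1] == rng[2]) return(rep(0.5, length(v)))  # constant column
   (v - rng[1]) / (rng[2] - rng[1])
}

# Draw one polyline per row of df, with one vertical axis per column
parCoordSketch <- function(df) {
   scaled <- sapply(df, rescale01)   # matrix: rows = cases, cols = axes
   # matplot() draws one line per column of its argument, so transpose:
   # after t(), each column is one original row (one polyline)
   matplot(t(scaled), type = "l", lty = 1, col = "gray40",
           xaxt = "n", xlab = "", ylab = "rescaled value")
   axis(1, at = seq_len(ncol(scaled)), labels = colnames(scaled))
   invisible(scaled)
}

# Made-up tuning results, purely for illustration
toy <- data.frame(k = c(10, 10, 50, 50),
                  height = c(0.5, 2, 0.5, 2),
                  meanAcc = c(0.31, 0.33, 0.36, 0.38))
scaled <- parCoordSketch(toy)
```

Each parameter combination becomes one line crossing the axes, so combinations ending high on the accuracy axis can be traced back to their parameter values.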
Plot
> plot(ft)
Plot, Column Dragged
Can rotate columns by dragging.
Plot, Zoomed in
Can zoom in, isolating only the best combinations.
> plot(ft, -10)
Example: Prog/Engr Census Data
• Dataset from regtools package.
• Predict occupation, among 6 programmer/engineer job titles. X = age, MS indicator, PhD indicator, gender (M), wage income, weeks worked.
• n = 20070, p = 6
Census cont’d.
knnCall <- function(dtrn, dtst, cmbi) {
   dtrn <- as.matrix(dtrn)
   dtst <- as.matrix(dtst)
   # columns 4:9 serve as Y (one indicator column per occupation);
   # the remaining columns are the predictors
   knnOut <- kNN(
      dtrn[, -(4:9)], dtrn[, 4:9], dtst[, -(4:9)],
      cmbi$k,
      expandVars = c(1:6),
      expandVals = c(cmbi$age, cmbi$e14, cmbi$e16,
         cmbi$gend, cmbi$wks, cmbi$wage),
      classif = TRUE)
   # predicted and actual class = column with the largest value in each row
   preds <- apply(knnOut$regests, 1, which.max)
   newy <- apply(dtst[, 4:9], 1, which.max)
   mean(preds == newy)   # proportion of correct predictions
}
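The last three lines of knnCall turn per-class scores into an accuracy figure. The toy sketch below isolates that step. It assumes that with classif=TRUE the regests component is a matrix with one row per test case and one column per class, which is inferred from how knnCall uses it. All numbers are made up.

```r
# Rows are test cases, columns are classes
probs <- rbind(c(0.1, 0.7, 0.2),    # estimated per-class scores
               c(0.5, 0.3, 0.2),
               c(0.2, 0.2, 0.6),
               c(0.6, 0.3, 0.1))
dummies <- rbind(c(0, 1, 0),        # true classes as 0/1 indicator columns
                 c(1, 0, 0),
                 c(0, 1, 0),
                 c(1, 0, 0))
preds <- apply(probs, 1, which.max)    # predicted class = highest-scoring column
newy  <- apply(dummies, 1, which.max)  # true class = column holding the 1
acc <- mean(preds == newy)             # proportion classified correctly
print(acc)
```

Because the true classes are stored as indicator columns, which.max recovers the class index on both sides, and the mean of the comparison is the overall accuracy.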
Census cont’d.
ft <- fineTuning(ped,
   pars = list(k = c(10, 50), age = c(0.5, 2),
      e14 = c(0.5, 2), e16 = c(0.5, 2), gend = c(0.5, 2),
      wks = c(0.5, 2), wage = c(0.5, 2)),
   regCall = knnCall, nTst = 500, nXval = 100)
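To get a feel for the size of this search, the full grid over the candidate values in pars can be enumerated in base R. Whether fineTuning() builds its grid in exactly this way is an assumption; the sketch only counts the combinations that a full factorial search over these values would visit.

```r
# Candidate values, copied from the pars argument above
pars <- list(k = c(10, 50), age = c(0.5, 2), e14 = c(0.5, 2),
             e16 = c(0.5, 2), gend = c(0.5, 2), wks = c(0.5, 2),
             wage = c(0.5, 2))
grid <- expand.grid(pars)   # one row per combination of candidate values
nCombs <- nrow(grid)        # 7 parameters with 2 values each
print(nCombs)
print(head(grid, 3))
```

With two candidate values for each of seven parameters, the grid has 2^7 = 128 rows, so the number of combinations doubles with every parameter added.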
Census cont’d.
> ft$outdf
     k age e14 e16 gend wks wage meanAcc       seAcc     bonfAcc
1   10 0.5 2.0 0.5  2.0 0.5  0.5 0.33602 0.002248141 0.007972667
2   10 0.5 0.5 0.5  0.5 2.0  0.5 0.33792 0.002365906 0.008390302
3   10 0.5 2.0 2.0  2.0 0.5  0.5 0.33810 0.002216809 0.007861554
4   10 2.0 0.5 2.0  0.5 0.5  0.5 0.33812 0.002026455 0.007186495
5   10 0.5 2.0 2.0  0.5 2.0  0.5 0.33820 0.002267647 0.008041842
...
124 50 0.5 2.0 0.5  0.5 0.5  2.0 0.37990 0.002038493 0.007229186
125 50 2.0 0.5 2.0  2.0 0.5  0.5 0.38038 0.002260365 0.008016017
126 50 2.0 0.5 0.5  2.0 0.5  2.0 0.38042 0.002094205 0.007426757
127 50 0.5 0.5 0.5  0.5 0.5  2.0 0.38100 0.002340767 0.008301152
128 50 0.5 0.5 2.0  2.0 0.5  2.0 0.38248 0.002202867 0.007812110
Census cont’d.
Worth looking at for a specific value of k, chosen to be 10 here. Let’s consider the 10 combinations with the best accuracy:
> ft$outdf
    k age e14 e16 gend wks wage meanAcc       seAcc     bonfAcc
1  10 0.5 2.0 2.0  2.0 2.0  0.5 0.33692 0.002241288 0.007529278
2  10 2.0 2.0 0.5  2.0 2.0  2.0 0.33702 0.002071352 0.006958406
3  10 2.0 0.5 2.0  2.0 2.0  0.5 0.33780 0.002042676 0.006862071
4  10 0.5 2.0 2.0  0.5 0.5  2.0 0.33796 0.002126120 0.007142390
5  10 0.5 2.0 2.0  0.5 0.5  0.5 0.33798 0.002066861 0.006943318
6  10 2.0 2.0 2.0  2.0 0.5  2.0 0.33830 0.002033731 0.006832021
7  10 0.5 2.0 0.5  2.0 0.5  0.5 0.33882 0.001888850 0.006345315
8  10 0.5 2.0 2.0  2.0 0.5  2.0 0.33968 0.001868986 0.006278584
9  10 0.5 0.5 0.5  2.0 0.5  0.5 0.33972 0.002121562 0.007127078
10 10 2.0 2.0 2.0  0.5 0.5  2.0 0.33988 0.002134057 0.007169051
Remember, we are predicting occupation. It seems that the important predictors are MS, PhD and gender.
Further Comments
• Can be done for any value of p.
• Larger p means: (a) More potential for p-hacking. (b) More columns in the plot.
• Optimization is not easy in the k-NN case, due to lack of derivatives, though it could be done for kernel-based smoothing.
Connection to the Curse of Dimensionality
• The Curse of Dimensionality (CoD) says, roughly, that in the case of large p, all X data points are approximately equidistant from each other, rendering k-NN of lesser value.
• One way to see the equidistance is to consider a simple model in which the p components of an X vector are i.i.d. Then the squared distance between two data points, X1 and X2, is the sum of p i.i.d. random variables, and will have mean O(p) and variance O(p). This means the ratio of standard deviation to mean of the squared distance is O(1/√p), i.e. the squared distance is nearly constant relative to its size.
• (Matloff, 2016) has suggested that the CoD be countered with a weighted distance, which is what we are using here.
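The O(1/√p) claim can be checked with a small simulation. The sketch below assumes standard normal components, one convenient instance of the i.i.d. model; the function name sdToMean and the number of simulated pairs are arbitrary choices for illustration.

```r
set.seed(1)   # for reproducibility

# Ratio of standard deviation to mean of the squared Euclidean distance
# between pairs of points whose p components are i.i.d. N(0,1).
sdToMean <- function(p, nPairs = 2000) {
   x1 <- matrix(rnorm(nPairs * p), nrow = nPairs)
   x2 <- matrix(rnorm(nPairs * p), nrow = nPairs)
   d2 <- rowSums((x1 - x2)^2)   # one squared distance per pair
   sd(d2) / mean(d2)
}

ps <- c(10, 100, 1000)
ratios <- sapply(ps, sdToMean)
# If the ratio is O(1/sqrt(p)), ratio * sqrt(p) should stay roughly constant
print(data.frame(p = ps, ratio = ratios, ratioTimesSqrtP = ratios * sqrt(ps)))
```

In this normal case the squared distance is twice a chi-squared variable with p degrees of freedom, so the exact ratio is sqrt(2/p) and the last column should hover near 1.41 for every p.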
Locally-Adaptive Choice of k
• Classic relation:
$\mathrm{MSE} = \mathrm{variance} + \mathrm{bias}^2$ (1)
• If E(Y | X = t) has a large gradient at a point t, the bias may be large, especially on the fringes of X.
• It thus may be worth sacrificing on variance, i.e. worth using a smaller k.
• Thus a locally-adaptive choice of k.
Locally-Adaptive, cont’d.
• There have been a number of theoretical treatments, but they do not appear in common software packages.
• The regtools package has the function bestKperPoint().
• At each Xi, it asks, “Which k would have best predicted Yi?”
> args(regtools:::bestKperPoint)
function (kNNout, y)
where kNNout is an object returned by kNN() and y is the original Y vector.
> knnOut <- kNN(mlb[, 1:2], mlb[, 3], mlb[, 1:2], 50,
     expandVars = 1, expandVals = 1.8)
> ks <- bestKperPoint(knnOut, mlb[, 3])
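To make the question “Which k would have best predicted Yi?” concrete, here is a self-contained base-R sketch of one way to answer it. It is not the regtools implementation of bestKperPoint(): the function bestKSketch, its leave-one-out treatment of each point, its use of unweighted Euclidean distance and the simulated data are all assumptions made for illustration.

```r
# Hypothetical sketch. For each training point i, find the k in 1..kmax
# whose k-NN average of Y, computed from the other points, is closest to Y_i.
bestKSketch <- function(x, y, kmax) {
   x <- as.matrix(x)
   d <- as.matrix(dist(x))   # all pairwise Euclidean distances
   n <- nrow(x)
   sapply(seq_len(n), function(i) {
      nbrs <- order(d[i, ])          # indices sorted by distance from point i
      nbrs <- nbrs[nbrs != i]        # leave point i itself out
      nbrs <- nbrs[seq_len(kmax)]    # keep the kmax nearest
      runMeans <- cumsum(y[nbrs]) / seq_len(kmax)   # k-NN estimate for each k
      which.min(abs(runMeans - y[i]))               # k with the smallest error
   })
}

set.seed(2)
x <- matrix(runif(200), ncol = 2)    # 100 made-up points, 2 features
y <- x[, 1]^2 + rnorm(100, sd = 0.05)
ks <- bestKSketch(x, y, kmax = 20)
print(table(ks))
```

A table or plot of the resulting ks against the predictors is one way such output could hint at regions where a smaller k is preferable.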
Just started on this, plan to develop into a diagnostic tool.
Future Work
• Comparisons of “improved” k-NN and other ML methods, in accuracy and computation time.
• Development of locally-adaptive approach.