
Page 1:

Machine Learning: Chenhao Tan
University of Colorado Boulder
LECTURE 2

Slides adapted from Jordan Boyd-Graber, Thorsten Joachims, Kilian Weinberger

Page 2:

Logistics

• Piazza: https://piazza.com/colorado/fall2017/csci5622/
• Moodle: https://moodle.cs.colorado.edu/course/view.php?id=507
• Prerequisite quiz
• Final project
• iClicker

Page 3:

Outline

Supervised Learning

Data representation

K-nearest neighbors
  Overview
  Performance Guarantee
  Curse of Dimensionality


Page 5:

Supervised Learning

Supervised Learning

Data: X    Labels: Y

• Supervised methods find patterns in fully observed data and then try to predict something from partially observed data.
• For example, in sentiment analysis, after learning something from annotated reviews, we want to take new reviews and automatically identify their sentiment.

Page 6:

Supervised Learning

Formal Definitions

• Labels Y, e.g., binary labels y ∈ {+1, −1}
• Instance space X: all the possible instances (based on the data representation)
• Target function f : X → Y (f is unknown)
• Example/instance: (x, y)
• Training data Strain: the collection of examples observed by the algorithm

Page 8:

Supervised Learning

Formal Definitions

• Goal of a learning algorithm:

Find a function h : X → Y from training data Strain so that h approximates f.

Page 9:

Supervised Learning

Supervised learning in a nutshell

Strain = {(x, y)} → h
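Read as code, the nutshell says a learning algorithm is a function from labeled examples to a predictor. Below is a minimal sketch of that interface (an illustration added here, not from the slides; the baseline predictor inside is deliberately trivial).

# The supervised learning contract: training pairs in, hypothesis h out.
def learn(train):              # train: list of (x, y) pairs, i.e., Strain
    def h(x):                  # h: X -> Y; here a trivial majority-label baseline
        labels = [y for _, y in train]
        return max(set(labels), key=labels.count)
    return h

h = learn([("good", +1), ("bad", -1), ("great", +1)])
print(h("okay"))   # +1, the majority label in Strain

Every method in this course fills in a smarter body for learn; the signature stays the same.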

Page 10:

Supervised Learning

No Free Lunch Theorems

• No free lunch for supervised machine learning [Wolpert, 1996]: in a noise-free scenario where the loss function is the misclassification rate, if one is interested in off-training-set error, then there are no a priori distinctions between learning algorithms.

Corollary I: there is no single ML algorithm that works for everything.
Corollary II: every successful ML algorithm makes assumptions.

• No free lunch for search/optimization [Wolpert and Macready, 1997]: all algorithms that search for an extremum of a cost function perform exactly the same when averaged over all possible cost functions.


Page 14:

Data representation

Data representation

Republican nominee George Bush said he felt nervous as he voted today in his adopted home state of Texas, where he ended...

(From Chris Harrison's WikiViz)

Page 15:

Data representation

Data representation

Let us have an interactive example to think through data representation!

Auto insurance quotes

id  rent  income   urban  state  car value  car year
1   yes   50,000   no     CO     20,000     2010
2   yes   70,000   no     CO     30,000     2012
3   no    250,000  yes    CO     55,000     2017
4   yes   200,000  yes    NY     50,000     2016
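To make this concrete, here is a minimal sketch (an added illustration, not from the slides) of one way to turn these rows into numeric feature vectors: numeric columns pass through and categorical columns are one-hot encoded. The field names mirror the table above; the encoding choices themselves are assumptions, one option among many.

# Minimal sketch: encoding the auto-insurance rows as numeric feature vectors.
rows = [
    {"rent": "yes", "income": 50000,  "urban": "no",  "state": "CO", "car_value": 20000, "car_year": 2010},
    {"rent": "yes", "income": 70000,  "urban": "no",  "state": "CO", "car_value": 30000, "car_year": 2012},
    {"rent": "no",  "income": 250000, "urban": "yes", "state": "CO", "car_value": 55000, "car_year": 2017},
    {"rent": "yes", "income": 200000, "urban": "yes", "state": "NY", "car_value": 50000, "car_year": 2016},
]

def featurize(row, states=("CO", "NY")):
    """Map one record to a fixed-length numeric vector."""
    vec = [
        1.0 if row["rent"] == "yes" else 0.0,    # binary categorical -> {0, 1}
        row["income"],                            # numeric passes through
        1.0 if row["urban"] == "yes" else 0.0,
    ]
    vec += [1.0 if row["state"] == s else 0.0 for s in states]   # one-hot state
    vec += [row["car_value"], row["car_year"]]
    return vec

X = [featurize(r) for r in rows]
print(X[0])   # [1.0, 50000, 0.0, 1.0, 0.0, 20000, 2010]

Notice that the raw scales differ wildly (income in the tens of thousands, year around 2010). That choice is itself an assumption, and it will matter once we start measuring distances between instances.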

Page 17:

Data representation

Understanding assumptions in data representation

[The slide shows screenshots of four Wikipedia pages: "Akaike information criterion", "Cat", "Princeton University", and "Dog"; the images are not preserved in this transcript.]

• The methods we'll study make assumptions about the data on which they are applied. E.g.,
  ◦ documents can be analyzed as a sequence of words,
  ◦ or as a "bag" of words;
  ◦ as independent of each other,
  ◦ or as connected to each other.
  (The sketch below contrasts the sequence and bag-of-words views.)
• What are the assumptions behind the methods?
• When/why are they appropriate?
• Much of this is an art, and it is inherently dynamic.
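A minimal sketch of the first pair of assumptions (an added illustration, not from the slides): the same document kept as an ordered sequence of tokens versus reduced to an unordered bag of word counts.

# The same document under two different representational assumptions.
from collections import Counter

doc = "Apple makes great laptops and Apple makes great phones"

sequence = doc.lower().split()   # order preserved: a list of tokens
bag      = Counter(sequence)     # order discarded: word -> count

print(sequence)  # ['apple', 'makes', 'great', 'laptops', 'and', 'apple', 'makes', 'great', 'phones']
print(bag)       # Counter({'apple': 2, 'makes': 2, 'great': 2, 'laptops': 1, 'and': 1, 'phones': 1})

The bag-of-words view throws away word order entirely; whether that loss is acceptable depends on the task.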


Page 19:

K-nearest neighbors | Overview

K-nearest neighbors

Find the K nearest neighbors of x in the training data and predict the majority label of those K points:

h(x) = argmax_{y ∈ {+1, −1}} ∑_{(x′, y′) ∈ NN(x, Strain, k)} I(y = y′)

Assumption in the algorithm: nearby instances share similar labels.
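For concreteness, here is a minimal sketch of this rule (my own illustration, not course code): brute-force K-NN with Euclidean distance and a majority vote over the k closest training points.

# Minimal sketch of the K-NN rule above: brute force, Euclidean distance,
# majority vote over the k nearest training points.
import math
from collections import Counter

def knn_predict(x, train, k=3):
    """train is a list of (vector, label) pairs; x is a vector."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(train, key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]   # majority label (ties broken arbitrarily)

train = [([0.0, 0.0], -1), ([0.1, 0.2], -1), ([1.0, 1.0], +1), ([0.9, 1.1], +1)]
print(knn_predict([0.95, 1.0], train, k=3))   # prints 1, i.e., +1

There is no training step at all; every decision is deferred until a query arrives.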

Page 22:

K-nearest neighbors | Overview

A Simple Example

• Suppose you're a big company monitoring the web
• Someone says something about your product (x)
• You want to know whether they're positive (y = +1) or negative (y = −1)

Page 23:

K-nearest neighbors | Overview

k = 1

Train: Apple makes great laptops → (+1)

Test: Apple makes great laptops

Page 25:

K-nearest neighbors | Overview

k = 1

Train: Apple makes great laptops → (+1)

Test: Apple really makes great laptops

Page 26:

K-nearest neighbors | Overview

Parameters in the algorithm

• Distance function (the three options are sketched in code below)

  Discrete:

  d(x1, x2) = 1 − |x1 ∩ x2| / |x1 ∪ x2|   (1)

  Continuous:

  Euclidean distance: d(x1, x2) = ‖x1 − x2‖_2   (2)

  Manhattan distance: d(x1, x2) = ‖x1 − x2‖_1   (3)

• Number of nearest neighbors k
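A minimal sketch of these three distances (an added illustration; the set-based distance (1) treats each instance as a set of tokens, which is the Jaccard distance):

# Minimal sketch of the three distance functions above.
import math

def set_distance(a, b):
    """Eq. (1): 1 - |a ∩ b| / |a ∪ b| on sets (the Jaccard distance)."""
    return 1 - len(a & b) / len(a | b)

def euclidean(a, b):
    """Eq. (2): L2 norm of the difference vector."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    """Eq. (3): L1 norm of the difference vector."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

print(set_distance({"apple", "makes", "great", "laptops"},
                   {"apple", "really", "makes", "great", "laptops"}))  # 0.2
print(euclidean([0.0, 0.0], [3.0, 4.0]))   # 5.0
print(manhattan([0.0, 0.0], [3.0, 4.0]))   # 7.0

The set distance of 0.2 is exactly the k = 1 example from the previous slides: the train and test sentences share four of their five distinct words.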

Pages 30-41:

K-nearest neighbors | Overview

KNN Classification

[These slides show a scatter plot of labeled training points with four query points, numbered 1-4; the figure is not preserved in this transcript. The exercise repeats for K = 1, 2, and 3: for each query point y1, ..., y4, which are the closest points, and what is the prediction?]


Page 43:

K-nearest neighbors | Performance Guarantee

How good is K-NN in theory? (Performance guarantee)

Page 44:

K-nearest neighbors | Performance Guarantee

Performance Guarantee for 1-NN

If we have close to infinite training data (N → ∞), how well can 1-NN perform?

Bayes optimal classifier
Assuming that we know P(y | x), y ∈ {+1, −1}, the best prediction is y∗ = argmax_y P(y | x).
Example: if P(+1 | x) = 0.9 and P(−1 | x) = 0.1, what do you predict for x?

Err_BayesOpt = 1 − P(y∗ | x)

Page 48:

K-nearest neighbors | Performance Guarantee

Performance Guarantee for 1-NN

Theorem
As N = |Strain| → ∞, the 1-NN error is no more than twice the error of the Bayes optimal classifier. [Cover and Hart, 1967]

Proof.
Let xNN be the nearest neighbor of our test point x. As N → ∞, dist(xNN, x) → 0, and thus P(y∗ | xNN) → P(y∗ | x). The 1-NN classifier errs exactly when x and xNN draw different labels:

Err_1NN = P(y_x ≠ y_xNN)
        = P(y∗ | x)(1 − P(y∗ | xNN)) + P(y∗ | xNN)(1 − P(y∗ | x))
        ≤ (1 − P(y∗ | xNN)) + (1 − P(y∗ | x))
        = 2(1 − P(y∗ | x)) = 2 Err_BayesOpt

(The last line uses P(y∗ | xNN) → P(y∗ | x).)
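A quick empirical sanity check of the bound (my own sketch, not from the course): if labels are drawn with P(+1 | x) = 0.9 for every x, then Err_BayesOpt = 0.1, and the proof predicts the 1-NN error approaches 2 · 0.9 · 0.1 = 0.18, safely under 2 · Err_BayesOpt = 0.2.

# Toy simulation of the 1-NN guarantee in one dimension.
import random

random.seed(0)
draw = lambda: +1 if random.random() < 0.9 else -1   # P(+1 | x) = 0.9 everywhere

train = [(random.random(), draw()) for _ in range(2000)]   # (x, y) pairs on [0, 1]
trials, errors = 2000, 0
for _ in range(trials):
    x, y = random.random(), draw()
    _, y_nn = min(train, key=lambda pair: abs(pair[0] - x))   # 1-NN in 1-D
    errors += (y != y_nn)

print(errors / trials)   # close to 0.18, within twice the Bayes error of 0.1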

Page 50:

K-nearest neighbors | Curse of Dimensionality

How does the algorithm scale? (Curse of Dimensionality)

Page 51:

K-nearest neighbors | Curse of Dimensionality

Curse of Dimensionality

Given N points in [0, 1], what is the size of the smallest interval needed to contain the k nearest neighbors of x?

[Figure: a point x on the unit interval with an interval of length l around it.]

N · l ≈ k  ⇒  l ≈ k / N

Page 54:

K-nearest neighbors | Curse of Dimensionality

Curse of Dimensionality

In general, for d dimensions, what is the side length of the smallest hypercube needed to contain the k nearest neighbors of x?

N · l^d ≈ k  ⇒  l ≈ (k / N)^(1/d)

Page 56:

K-nearest neighbors | Curse of Dimensionality

Curse of Dimensionality

If N = 1000 and k = 10:

d     l
2     0.1
10    0.63
100   0.955
1000  0.9954

We need almost the entire space to find the 10 nearest neighbors.
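The table follows directly from l ≈ (k / N)^(1/d); a few lines reproduce it (an added illustration):

# Reproducing the table above from l ≈ (k / N)^(1 / d).
N, k = 1000, 10
for d in (2, 10, 100, 1000):
    print(d, round((k / N) ** (1 / d), 4))
# 2 0.1
# 10 0.631
# 100 0.955
# 1000 0.9954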

Page 57:

K-nearest neighbors | Curse of Dimensionality

How does the algorithm scale? (Memory and efficiency of the naive implementation)

Training: N/A

Testing (each query stores and scans all N training points of dimension d):
• memory: O(Nd)
• time: O(Nd)

Page 59:

K-nearest neighbors | Curse of Dimensionality

Summary

• Supervised learning: learn h from Strain
• Data representation can be tricky in many cases
• K-NN gives a simple classifier with a nice guarantee

Page 60:

K-nearest neighbors | Curse of Dimensionality

First Homework

• Implement k-nearest neighbors
• Cross-validation to search hyperparameters (see the sketch below)
• Acclimate you to the Python programming environment
• Introduce you to using Moodle to submit assignments
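As a pointer for the first two items, here is a minimal sketch of choosing k by k-fold cross-validation (an added illustration with toy data; the assignment will have its own scaffolding and dataset):

# Minimal sketch: pick k by 5-fold cross-validation on toy 2-D data.
import math, random
from collections import Counter

def knn_predict(x, train, k):
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(train, key=lambda pair: dist(pair[0], x))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

random.seed(0)
# Toy data: label is +1 iff the two coordinates sum to more than 1.
points = [[random.random(), random.random()] for _ in range(200)]
data = [(x, +1 if x[0] + x[1] > 1 else -1) for x in points]

def cv_accuracy(data, k, folds=5):
    n = len(data) // folds
    accs = []
    for f in range(folds):
        test = data[f * n:(f + 1) * n]                 # held-out fold
        train = data[:f * n] + data[(f + 1) * n:]      # remaining folds
        accs.append(sum(knn_predict(x, train, k) == y for x, y in test) / len(test))
    return sum(accs) / folds

best_k = max((1, 3, 5, 9), key=lambda k: cv_accuracy(data, k))
print(best_k)

The key design point: k is chosen on held-out folds, never on the data used to make the prediction, so the estimate of each k's accuracy is not biased by the training points themselves.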

Page 61:

References (1)

Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.

David H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390, 1996.

David H. Wolpert and William G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997.