Introduction to Machine Learning
COMPSCI 371D — Machine Learning
Outline
1 Classification, Regression, Unsupervised Learning
2 About Dimensionality
3 Drawings and Intuition in Higher Dimensions
4 Classification through Regression
5 Linear Separability
About Slides
• By popular demand, lecture slides will be made available online
• They will show up just before a lecture starts
• Slides are grouped by topic, not by lecture
• Slides are not for studying
• Class notes and homework assignments are the materials of record
Classification, Regression, Unsupervised Learning
Parenthesis: Supervised vs Unsupervised
• Supervised: Train with (x, y)
  • Classification: Hand-written digit recognition
  • Regression: Median age of YouTube viewers for each video
• Unsupervised: Train with x
  • Clustering: Color compression
  • Distances matter!
• We will not cover unsupervised learning
Machine Learning Terminology
• Predictor h : X → Y (the signature of h)
• X ⊆ R^d is the data space
• Y (categorical) is the label space for a classifier
• Y (⊆ R^e) is the value space for a regressor
• A target is either a label or a value
• H is the hypothesis space (all h we can choose from)
• A training set is a subset T of X × Y
• T def= 2^(X × Y) is the class of all possible training sets
• Learner λ : T → H, so that λ(T) = h
• ℓ(y, ŷ) is the loss incurred for estimating ŷ when the true prediction is y
• L_T(h) = (1/N) ∑_{n=1}^{N} ℓ(y_n, h(x_n)) is the empirical risk of h on T (sketched in code below)
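A minimal Python sketch of these definitions (an assumed illustration, not course code; the zero-one loss is chosen for concreteness):

```python
def zero_one_loss(y_true, y_est):
    """ℓ(y, ŷ): 1 if the estimate is wrong, 0 if it is right."""
    return 0 if y_true == y_est else 1

def empirical_risk(h, T, loss=zero_one_loss):
    """L_T(h) = (1/N) ∑ ℓ(y_n, h(x_n)) over the training set T."""
    return sum(loss(y, h(x)) for x, y in T) / len(T)

# Toy training set T ⊂ X × Y and a trivial threshold predictor h.
T = [(0.2, 'c'), (0.7, 's'), (0.9, 's'), (0.1, 'c')]
h = lambda x: 's' if x > 0.5 else 'c'
print(empirical_risk(h, T))  # 0.0: h predicts every target in T correctly
```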
About Dimensionality
H is Typically Parametric
• For polynomials, h ↔ c
• We write L_T(c) instead of L_T(h)
• “Searching H” means “find the parameters”:
  c ∈ arg min_{c ∈ R^m} ‖Ac − b‖²
• This is common in machine learning: h(x) = h_θ(x)
  • θ: a vector of parameters
• Abstract view: h ∈ arg min_{h ∈ H} L_T(h)
• Concrete view: θ ∈ arg min_{θ ∈ R^m} L_T(θ)
• Minimize a function of real variables, rather than of “functions” (see the sketch below)
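For instance (an assumed illustration, not course code), fitting a one-variable polynomial of degree k by least squares makes the concrete view explicit: searching H reduces to solving for c ∈ arg min ‖Ac − b‖².

```python
import numpy as np

# Noisy samples of an unknown function (toy data, for illustration only).
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
b = x**3 - x + 0.05 * rng.standard_normal(x.size)

# One column of A per parameter: searching H = finding c ∈ R^m.
k = 3
A = np.vander(x, k + 1)                    # columns x^3, x^2, x, 1
c, *_ = np.linalg.lstsq(A, b, rcond=None)  # c ∈ arg min ‖Ac − b‖²

# The learned predictor h ↔ c, evaluated at a new point.
h = lambda x_new: np.vander(np.atleast_1d(x_new), k + 1) @ c
print(c, h(0.5))
```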
Curb your Dimensions
• For polynomials, h_c(x) : X → Y, with x ∈ X ⊆ R^d and c ∈ R^m
• We saw that d > 1 and degree k > 1 ⇒ m ≫ d
• Specifically, m(d, k) = C(d + k, k)
• Things blow up when k and d grow (see the sketch below)
• More generally, h_θ(x) : X → Y, with x ∈ X ⊆ R^d and θ ∈ R^m
• Which dimension(s) do we want to curb? m? d?
• Both, for different but related reasons
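A quick sketch (assumed example) tabulating m(d, k) shows the blowup:

```python
from math import comb

def m(d, k):
    """Number of coefficients of a degree-k polynomial in d variables."""
    return comb(d + k, k)

for d in (2, 10, 100):
    for k in (2, 3, 5):
        print(f"d={d:>3}, k={k}: m = {m(d, k):,}")
# Already m(100, 5) = 96,560,646 parameters.
```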
Problem with m Large
• Even just for data fitting, we generally want N ≫ m, i.e., (possibly many) more samples than parameters to estimate
• For instance, in Ac = b, we want A to have more rows than columns
• Remember that annotating training data is costly
• So we want to curb m: We want a small H
Problems with d Large
• We do machine learning, not just data fitting!
• We want h to generalize to new data
• During training, we would like the learner to see a good sampling of all possible x (“fill X nicely”)
• With large d, this is impossible: the curse of dimensionality (see the sketch below)
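As a back-of-the-envelope sketch (assumed numbers): a grid with spacing 1/10 along each axis needs 10^d points to cover [0, 1]^d, so no realistic training set fills X once d is even moderately large.

```python
# Grid points needed to sample [0, 1]^d at spacing 1/n per axis: n^d.
n = 10
for d in (1, 2, 3, 10, 20):
    print(f"d={d:>2}: {n**d:.1e} grid points")
# d=20 already needs 1e+20 samples; no training set "fills X nicely".
```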
Drawings and Intuition in Higher Dimensions
Drawings Help Intuition
Intuition Often Fails in Many Dimensions
[Figure: a ball inscribed in a unit hypercube, side length 1 and radius (1 − ε)/2; the gray regions are the corners outside the ball]
• Gray parts dominate when d → ∞
• Distance from center to corners diverges when d → ∞ (both effects are checked numerically below)
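Both claims are easy to check numerically. Below is a small sketch (assumed example) using the volume of the d-ball, V_d(R) = π^(d/2) R^d / Γ(d/2 + 1), for the ball inscribed in the unit hypercube:

```python
from math import pi, gamma, sqrt

def ball_volume_fraction(d, radius=0.5):
    """Fraction of the unit hypercube occupied by the inscribed d-ball."""
    return pi**(d / 2) * radius**d / gamma(d / 2 + 1)

for d in (2, 3, 10, 50):
    corner = sqrt(d) / 2  # distance from the cube's center to a corner
    print(f"d={d:>2}: ball fills {ball_volume_fraction(d):.2e} "
          f"of the cube, corner distance = {corner:.2f}")
# The gray corner regions hold almost all the volume, and the corners
# recede from the center as √d / 2 → ∞.
```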
Classification through Regression
Classifiers as Partitions of X
• X_y def= h⁻¹(y) partitions X (not just T!)
• Classifier = partition
• S = h⁻¹(red square), C = h⁻¹(blue circle) (made explicit in the sketch below)
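As a tiny illustration (an assumed example, not from the slides), the cells X_y = h⁻¹(y) can be made explicit by grouping a sampling of X by predicted label:

```python
from collections import defaultdict

# Toy classifier h : X ⊆ R → Y = {'s', 'c'}.
h = lambda x: 's' if x > 0.5 else 'c'

# Group a sampling of X by label: each bucket approximates X_y = h⁻¹(y).
partition = defaultdict(list)
for x in [i / 10 for i in range(11)]:
    partition[h(x)].append(x)

print(dict(partition))
# {'c': [0.0, ..., 0.5], 's': [0.6, ..., 1.0]}: the cells cover X
# and do not overlap, i.e., they form a partition.
```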
Classification, Geometry, and Regression
• Classification partitions X ⊂ R^d into sets
• How do we represent sets ⊂ R^d? How do we work with them?
• We’ll see a couple of ways: nearest-neighbor classifiers, decision trees
• These methods have a strong geometric flavor
• Beware of our intuition!
• Another technique: score-based classifiers, i.e., classification through regression
Score-Based Classifiers
[Figure: a score function s over X, with the level set s = 0 and the regions s > 0 and s < 0 labeled]
[Figure adapted from Wei et al., Structural and Multidisciplinary Optimization, 58:831–849, 2018]
• s = 0 defines the decision boundaries
• s > 0 and s < 0 define the (two) decision regions
Score-Based Classifiers
• Threshold some score function s(x)
• Example: 's' (red squares) and 'c' (blue circles)
• Correspond to two sets S ⊆ X and C = X \ S
• If we can estimate something like s(x) = P[x ∈ S]:

  h(x) = 's' if s(x) > 1/2, and 'c' otherwise
Classification through Regression
• If you prefer 0 as a threshold, let s(x) = 2 P[x ∈ S] − 1 ∈ [−1, 1]

  h(x) = 's' if s(x) > 0, and 'c' otherwise

• Scores are convenient even without probabilities, because they are easy to work with
• We implement a classifier h by building a regressor s
• Example: logistic-regression classifiers (sketched below)
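Here is a minimal sketch using scikit-learn (an assumed toolchain, not prescribed by the slides): a logistic-regression model estimates P[x ∈ S], and thresholding that estimate at 1/2 yields the classifier h.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 2-D training set: 's' points near (1, 1), 'c' points near (-1, -1).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1, 0.5, (50, 2)), rng.normal(-1, 0.5, (50, 2))])
y = np.array(['s'] * 50 + ['c'] * 50)

# The regressor s(x) ≈ P[x ∈ S] is fit to the training set.
model = LogisticRegression().fit(X, y)

# h thresholds the estimated probability at 1/2.
s_col = list(model.classes_).index('s')
p = model.predict_proba([[0.8, 0.9]])[0, s_col]
print(p, 's' if p > 0.5 else 'c')
```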
Linear Separability
Linearly Separable Training Sets
• Some line (hyperplane in R^d) separates C, S
• Requires a much smaller H
• Simplest score: s(x) = b + wᵀx; the line is s(x) = 0 (see the sketch below)

  h(x) = 's' if s(x) > 0, and 'c' otherwise
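A minimal sketch (an assumed toy example, with w and b picked by hand rather than learned) of the linear score and the classifier it induces:

```python
import numpy as np

# A hand-picked hyperplane: s(x) = b + wᵀx, zero on the line x1 + x2 = 1.
w = np.array([1.0, 1.0])
b = -1.0

s = lambda x: b + w @ x                 # the score
h = lambda x: 's' if s(x) > 0 else 'c'  # the induced classifier

print(h(np.array([2.0, 1.0])))  # 's': this point lies above the line
print(h(np.array([0.0, 0.0])))  # 'c': this one lies below it
```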
Data Representation?
• Linear separability is a property of the data in a given representation
[Figure: points forming an annulus of radius r and width Δr around the origin]
• Xform 1: z = x₁² + x₂² implies x ∈ S ⇔ a ≤ z ≤ b
• Xform 2: z = |√(x₁² + x₂²) − r| implies linear separability: x ∈ S ⇔ z ≤ Δr (verified numerically below)
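Below is a small numerical check (an assumed example; r = 2 and Δr = 0.3 are made-up values) that both transformations behave as claimed:

```python
import numpy as np

r, dr = 2.0, 0.3
rng = np.random.default_rng(0)

# Sample points and label them by the annulus geometry:
# x ∈ S iff its distance to the circle of radius r is at most Δr.
X = rng.uniform(-3, 3, (1000, 2))
labels = np.abs(np.hypot(X[:, 0], X[:, 1]) - r) <= dr

# Xform 1: z = x1² + x2²  →  x ∈ S ⇔ a ≤ z ≤ b (an interval in z).
z1 = (X**2).sum(axis=1)
a, b = (r - dr)**2, (r + dr)**2
print((((z1 >= a) & (z1 <= b)) == labels).all())  # True

# Xform 2: z = |√(x1² + x2²) − r|  →  x ∈ S ⇔ z ≤ Δr, a single
# threshold, i.e., the data are linearly separable in z.
z2 = np.abs(np.sqrt(z1) - r)
print(((z2 <= dr) == labels).all())               # True
```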