Introduction to Machine Learning
COMPSCI 371D — Machine Learning
Outline
1 Classification, Regression, Unsupervised Learning
2 About Dimensionality
3 Drawings and Intuition in Higher Dimensions
4 Classification through Regression
5 Linear Separability
About Slides
• By popular demand, lecture slides will be made available online
• They will show up just before a lecture starts
• Slides are grouped by topic, not by lecture
• Slides are not for studying
• Class notes and homework assignments are the materials of record
Classification, Regression, Unsupervised Learning
Parenthesis: Supervised vs Unsupervised
• Supervised: Train with (x, y)
  • Classification: Hand-written digit recognition
  • Regression: Median age of YouTube viewers for each video
• Unsupervised: Train with x
  • Clustering: Color compression
  • Distances matter!
• We will not cover unsupervised learning
Machine Learning Terminology
• Predictor h : X → Y (the signature of h)
• X ⊆ R^d is the data space
• Y (categorical) is the label space for a classifier
• Y (⊆ R^e) is the value space for a regressor
• A target is either a label or a value
• H is the hypothesis space (all h we can choose from)
• A training set is a subset T of X × Y
• T def= 2^(X × Y) is the class of all possible training sets
• Learner λ : T → H, so that λ(T) = h
• ℓ(y, ŷ) is the loss incurred for estimating ŷ when the true prediction is y
• L_T(h) = (1/N) ∑_{n=1}^{N} ℓ(y_n, h(x_n)) is the empirical risk of h on T (sketched in code below)
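A minimal Python sketch of these definitions (an assumed illustration, not course code; the zero-one loss is chosen for concreteness):

```python
def zero_one_loss(y_true, y_est):
    """ℓ(y, ŷ): 1 if the estimate is wrong, 0 if it is right."""
    return 0 if y_true == y_est else 1

def empirical_risk(h, T, loss=zero_one_loss):
    """L_T(h) = (1/N) ∑ ℓ(y_n, h(x_n)) over the training set T."""
    return sum(loss(y, h(x)) for x, y in T) / len(T)

# Toy training set T ⊂ X × Y and a trivial threshold predictor h.
T = [(0.2, 'c'), (0.7, 's'), (0.9, 's'), (0.1, 'c')]
h = lambda x: 's' if x > 0.5 else 'c'
print(empirical_risk(h, T))  # 0.0: h predicts every target in T correctly
```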
About Dimensionality
H is Typically Parametric
• For polynomials, h ↔ c
• We write L_T(c) instead of L_T(h)
• “Searching H” means “find the parameters”:
  c ∈ arg min_{c ∈ R^m} ‖Ac − b‖²
• This is common in machine learning: h(x) = h_θ(x)
  • θ: a vector of parameters
• Abstract view: h ∈ arg min_{h ∈ H} L_T(h)
• Concrete view: θ ∈ arg min_{θ ∈ R^m} L_T(θ)
• Minimize a function of real variables, rather than of “functions” (see the sketch below)
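For instance (an assumed illustration, not course code), fitting a one-variable polynomial of degree k by least squares makes the concrete view explicit: searching H reduces to solving for c ∈ arg min ‖Ac − b‖².

```python
import numpy as np

# Noisy samples of an unknown function (toy data, for illustration only).
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
b = x**3 - x + 0.05 * rng.standard_normal(x.size)

# One column of A per parameter: searching H = finding c ∈ R^m.
k = 3
A = np.vander(x, k + 1)                    # columns x^3, x^2, x, 1
c, *_ = np.linalg.lstsq(A, b, rcond=None)  # c ∈ arg min ‖Ac − b‖²

# The learned predictor h ↔ c, evaluated at a new point.
h = lambda x_new: np.vander(np.atleast_1d(x_new), k + 1) @ c
print(c, h(0.5))
```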
Curb your Dimensions
• For polynomials, h_c(x) : X → Y, with x ∈ X ⊆ R^d and c ∈ R^m
• We saw that d > 1 and degree k > 1 ⇒ m ≫ d
• Specifically, m(d, k) = C(d + k, k)
• Things blow up when k and d grow (see the sketch below)
• More generally, h_θ(x) : X → Y, with x ∈ X ⊆ R^d and θ ∈ R^m
• Which dimension(s) do we want to curb? m? d?
• Both, for different but related reasons
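A quick sketch (assumed example) tabulating m(d, k) shows the blowup:

```python
from math import comb

def m(d, k):
    """Number of coefficients of a degree-k polynomial in d variables."""
    return comb(d + k, k)

for d in (2, 10, 100):
    for k in (2, 3, 5):
        print(f"d={d:>3}, k={k}: m = {m(d, k):,}")
# Already m(100, 5) = 96,560,646 parameters.
```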
Problem with m Large
• Even just for data fitting, we generally want N ≫ m, i.e., (possibly many) more samples than parameters to estimate
• For instance, in Ac = b, we want A to have more rows than columns
• Remember that annotating training data is costly
• So we want to curb m: We want a small H
Problems with d Large
• We do machine learning, not just data fitting!
• We want h to generalize to new data
• During training, we would like the learner to see a good sampling of all possible x (“fill X nicely”)
• With large d, this is impossible: the curse of dimensionality (see the sketch below)
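As a back-of-the-envelope sketch (assumed numbers): a grid with spacing 1/10 along each axis needs 10^d points to cover [0, 1]^d, so no realistic training set fills X once d is even moderately large.

```python
# Grid points needed to sample [0, 1]^d at spacing 1/n per axis: n^d.
n = 10
for d in (1, 2, 3, 10, 20):
    print(f"d={d:>2}: {n**d:.1e} grid points")
# d=20 already needs 1e+20 samples; no training set "fills X nicely".
```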
Drawings and Intuition in Higher Dimensions
Drawings Help Intuition
Intuition Often Fails in Many Dimensions
[Figure: a ball inscribed in a unit hypercube, side length 1 and radius (1 − ε)/2; the gray regions are the corners outside the ball]
• Gray parts dominate when d → ∞
• Distance from center to corners diverges when d → ∞ (both effects are checked numerically below)
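Both claims are easy to check numerically. Below is a small sketch (assumed example) using the volume of the d-ball, V_d(R) = π^(d/2) R^d / Γ(d/2 + 1), for the ball inscribed in the unit hypercube:

```python
from math import pi, gamma, sqrt

def ball_volume_fraction(d, radius=0.5):
    """Fraction of the unit hypercube occupied by the inscribed d-ball."""
    return pi**(d / 2) * radius**d / gamma(d / 2 + 1)

for d in (2, 3, 10, 50):
    corner = sqrt(d) / 2  # distance from the cube's center to a corner
    print(f"d={d:>2}: ball fills {ball_volume_fraction(d):.2e} "
          f"of the cube, corner distance = {corner:.2f}")
# The gray corner regions hold almost all the volume, and the corners
# recede from the center as √d / 2 → ∞.
```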
Classification through Regression
Classifiers as Partitions of X
• X_y def= h⁻¹(y) partitions X (not just T!)
• Classifier = partition
• S = h⁻¹(red square), C = h⁻¹(blue circle) (made explicit in the sketch below)
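As a tiny illustration (an assumed example, not from the slides), the cells X_y = h⁻¹(y) can be made explicit by grouping a sampling of X by predicted label:

```python
from collections import defaultdict

# Toy classifier h : X ⊆ R → Y = {'s', 'c'}.
h = lambda x: 's' if x > 0.5 else 'c'

# Group a sampling of X by label: each bucket approximates X_y = h⁻¹(y).
partition = defaultdict(list)
for x in [i / 10 for i in range(11)]:
    partition[h(x)].append(x)

print(dict(partition))
# {'c': [0.0, ..., 0.5], 's': [0.6, ..., 1.0]}: the cells cover X
# and do not overlap, i.e., they form a partition.
```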
Classification, Geometry, and Regression
• Classification partitions X ⊂ R^d into sets
• How do we represent sets ⊂ R^d? How do we work with them?
• We’ll see a couple of ways: nearest-neighbor classifiers, decision trees
• These methods have a strong geometric flavor
• Beware of our intuition!
• Another technique: score-based classifiers, i.e., classification through regression
Score-Based Classifiers
[Figure: a score function s over X, with the level set s = 0 and the regions s > 0 and s < 0 labeled]
[Figure adapted from Wei et al., Structural and Multidisciplinary Optimization, 58:831–849, 2018]
• s = 0 defines the decision boundaries
• s > 0 and s < 0 define the (two) decision regions
Score-Based Classifiers
• Threshold some score function s(x)
• Example: 's' (red squares) and 'c' (blue circles)
• Correspond to two sets S ⊆ X and C = X \ S
• If we can estimate something like s(x) = P[x ∈ S]:

  h(x) = 's' if s(x) > 1/2, and 'c' otherwise
Classification through Regression
• If you prefer 0 as a threshold, let s(x) = 2 P[x ∈ S] − 1 ∈ [−1, 1]

  h(x) = 's' if s(x) > 0, and 'c' otherwise

• Scores are convenient even without probabilities, because they are easy to work with
• We implement a classifier h by building a regressor s
• Example: logistic-regression classifiers (sketched below)
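Here is a minimal sketch using scikit-learn (an assumed toolchain, not prescribed by the slides): a logistic-regression model estimates P[x ∈ S], and thresholding that estimate at 1/2 yields the classifier h.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 2-D training set: 's' points near (1, 1), 'c' points near (-1, -1).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1, 0.5, (50, 2)), rng.normal(-1, 0.5, (50, 2))])
y = np.array(['s'] * 50 + ['c'] * 50)

# The regressor s(x) ≈ P[x ∈ S] is fit to the training set.
model = LogisticRegression().fit(X, y)

# h thresholds the estimated probability at 1/2.
s_col = list(model.classes_).index('s')
p = model.predict_proba([[0.8, 0.9]])[0, s_col]
print(p, 's' if p > 0.5 else 'c')
```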
Linear Separability
Linearly Separable Training Sets
• Some line (hyperplane in R^d) separates C, S
• Requires a much smaller H
• Simplest score: s(x) = b + wᵀx; the line is s(x) = 0 (see the sketch below)

  h(x) = 's' if s(x) > 0, and 'c' otherwise
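A minimal sketch (an assumed toy example, with w and b picked by hand rather than learned) of the linear score and the classifier it induces:

```python
import numpy as np

# A hand-picked hyperplane: s(x) = b + wᵀx, zero on the line x1 + x2 = 1.
w = np.array([1.0, 1.0])
b = -1.0

s = lambda x: b + w @ x                 # the score
h = lambda x: 's' if s(x) > 0 else 'c'  # the induced classifier

print(h(np.array([2.0, 1.0])))  # 's': this point lies above the line
print(h(np.array([0.0, 0.0])))  # 'c': this one lies below it
```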
Data Representation?
• Linear separability is a property of the data in a given representation
[Figure: points forming an annulus of radius r and width Δr around the origin]
• Xform 1: z = x₁² + x₂² implies x ∈ S ⇔ a ≤ z ≤ b
• Xform 2: z = |√(x₁² + x₂²) − r| implies linear separability: x ∈ S ⇔ z ≤ Δr (verified numerically below)
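Below is a small numerical check (an assumed example; r = 2 and Δr = 0.3 are made-up values) that both transformations behave as claimed:

```python
import numpy as np

r, dr = 2.0, 0.3
rng = np.random.default_rng(0)

# Sample points and label them by the annulus geometry:
# x ∈ S iff its distance to the circle of radius r is at most Δr.
X = rng.uniform(-3, 3, (1000, 2))
labels = np.abs(np.hypot(X[:, 0], X[:, 1]) - r) <= dr

# Xform 1: z = x1² + x2²  →  x ∈ S ⇔ a ≤ z ≤ b (an interval in z).
z1 = (X**2).sum(axis=1)
a, b = (r - dr)**2, (r + dr)**2
print((((z1 >= a) & (z1 <= b)) == labels).all())  # True

# Xform 2: z = |√(x1² + x2²) − r|  →  x ∈ S ⇔ z ≤ Δr, a single
# threshold, i.e., the data are linearly separable in z.
z2 = np.abs(np.sqrt(z1) - r)
print(((z2 <= dr) == labels).all())               # True
```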