
Slide 1

Support Vector Machines

Joachim Mathiesen, Niels Bohr Institute

Slide 2

(Over)simplified history

• 1960-1970s: Predominantly linear decision boundaries/classifiers
• 1980s: Boom in neural networks and decision trees
• 1990-2000s: Kernel machines/methods outperformed neural networks
• 2010s: Revival of neural networks, and boosted decision trees

Slide 3

Classification

Slide 4

Generalized Linear Model (Logistic Regression) 1st Order Terms

Slide 5

Generalized Linear Model (Logistic Regression) 4th Order Terms
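
Not from the slides, but a minimal sketch of the comparison between the two previous slides: logistic regression with 1st- and 4th-order feature expansions on a synthetic two-class set (the moons data and all parameter values here are illustrative assumptions).

```python
# Logistic regression with 1st- vs 4th-order polynomial feature terms
# on a synthetic two-class data set.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for degree in (1, 4):
    # PolynomialFeatures expands (x1, x2) into all monomials up to `degree`,
    # so degree=1 gives a linear boundary and degree=4 a curved one.
    model = make_pipeline(PolynomialFeatures(degree),
                          LogisticRegression(max_iter=1000))
    model.fit(X, y)
    print(f"degree {degree}: training accuracy = {model.score(X, y):.2f}")
```

With 4th-order terms the decision boundary can bend around the classes, mirroring the jump from slide 4 to slide 5.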

Slide 6

Random Forest

Slide 7

Support Vector Machines

Slide 8

Support Vector Machines

Efficient separation of non-linear regions based on kernel methods – we only have to know the dot product between data points.

No problems with convergence and no trapping in local minima – “simple” quadratic optimization
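
A minimal sketch of the dot-product point, assuming scikit-learn (the circles data and the gamma value are illustrative): SVC accepts a precomputed Gram matrix, so the classifier only ever sees pairwise kernel values, never the raw coordinates.

```python
# Kernel SVM fed only with dot-product information: a precomputed Gram matrix.
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

# Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) is all the SVM ever sees.
K = rbf_kernel(X, X, gamma=1.0)
clf = SVC(kernel="precomputed").fit(K, y)
print("training accuracy:", clf.score(K, y))
```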

Slide 9

Classification SVM

Slide 10

“A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated.”

— Cover, T. M., “Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition” (1965)

Cover’s theorem

Slide 11

[Figure: wikipedia.org]

By mapping to a simplex, it is apparent that “every partition of the samples into two sets is separable by a linear separator”.
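
To make the idea concrete, here is a small sketch (the XOR toy data and the particular product-feature lift are illustrative choices, not from the slides): four XOR-labeled points admit no separating line in the plane, but become linearly separable after a nonlinear map to three dimensions.

```python
# Cover's idea on the XOR problem: inseparable in 2D, separable after a lift.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # XOR labels

# Nonlinear map (x1, x2) -> (x1, x2, x1*x2) adds a single product feature.
X_lifted = np.column_stack([X, X[:, 0] * X[:, 1]])

lin2d = SVC(kernel="linear", C=1e6).fit(X, y)
lin3d = SVC(kernel="linear", C=1e6).fit(X_lifted, y)
print("accuracy in 2D:", lin2d.score(X, y))                  # below 1.0
print("accuracy after the lift:", lin3d.score(X_lifted, y))  # 1.0
```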

Slide 12

Basic SVM

The support vector machine finds an optimal separation of the points, whereas a generic linear classifier y = a + bx has infinitely many choices of the parameters a and b that give a working decision boundary.

The SVM maximizes the margin between the two classes.
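
A minimal sketch of margin maximization, assuming scikit-learn (the blob data and the large C, which approximates a hard margin, are illustrative): the fitted boundary is pinned down entirely by the support vectors.

```python
# Fit a linear SVM and inspect the support vectors defining the margin.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1000).fit(X, y)  # large C ~ hard margin

print("weights w:", clf.coef_[0], " bias b:", clf.intercept_[0])
print("support vectors:")
print(clf.support_vectors_)  # only these points determine the boundary
```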

Slide 13

Basic SVM Example: Radial Kernel

Beware of overfitting!
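
A sketch of what the warning means in practice (synthetic data; the gamma values are arbitrary illustrative choices): as gamma grows, the radial kernel fits the training set ever more tightly while the validation score drops.

```python
# Overfitting with the radial (RBF) kernel: train vs. validation accuracy.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=1)

for gamma in (0.5, 5, 500):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    print(f"gamma={gamma}: train {clf.score(X_tr, y_tr):.2f}, "
          f"validation {clf.score(X_va, y_va):.2f}")
```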

Slide 14

SVM

For a data set $\{x_1, x_2, \ldots, x_N\}$ with target values $\{y_1, y_2, \ldots, y_N\}$, we aim to minimize

$$\frac{1}{2}\lVert w \rVert^2$$

subject to

$$y_i\,(w \cdot x_i + b) \ge 1.$$

This is not optimal if you have overlapping points belonging to different classes near the decision boundary.

Slide 15

Slack variables in SVM

In order not to be too sensitive to fuzziness close to the separation boundary, we introduce slack variables $\xi_i$ that allow for some misclassification. For a data set $\{x_1, x_2, \ldots, x_N\}$ with target values $\{y_1, y_2, \ldots, y_N\}$, we now aim to minimize

$$\frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i$$

subject to

$$y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0.$$

The cost C is the penalty you pay for points that are not classified correctly.
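
A minimal sketch of the cost parameter in practice (synthetic overlapping blobs; the C values are illustrative): a small C tolerates more slack, which typically shows up as more support vectors.

```python
# Effect of the cost C on a soft-margin linear SVM.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {len(clf.support_vectors_)} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```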

Slide 16

ε-regression

Slide 17

ε-regression

Slide 18

ε-regression

The ε-insensitive loss function, where predictions have to be within a distance ε of the true value.

Slide 19

ε-regression

Regression then works similarly to classification with slack variables. For a data set $\{x_1, x_2, \ldots, x_N\}$ with target values $\{v_1, v_2, \ldots, v_N\}$, we now aim to minimize

$$\frac{1}{2}\lVert w \rVert^2 + C \sum_i (\xi_i + \xi_i^*)$$

subject to

$$v_i - w \cdot x_i - b \le \epsilon + \xi_i, \qquad -v_i + w \cdot x_i + b \le \epsilon + \xi_i^*, \qquad \xi_i \ge 0, \; \xi_i^* \ge 0.$$
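
A minimal sketch of ε-regression with scikit-learn's SVR (the sine toy data and the value of C are illustrative assumptions): points inside the ε-tube incur no loss, so widening the tube leaves fewer support vectors.

```python
# Epsilon-regression: widening the insensitive tube reduces support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
v = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

for eps in (0.01, 0.1, 0.5):
    reg = SVR(kernel="rbf", C=10, epsilon=eps).fit(X, v)
    print(f"epsilon={eps}: {len(reg.support_)} support vectors")
```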

Slide 20

SVMs are great general-purpose models … with limited possibility for model interpretation.

Slide 21

Main advantages of SVM:

Efficient separation of non-linear regions (based on basic dot-products).

The model follows from a quadratic optimization problem, i.e. there is no risk of ending up in a local minimum, in contrast to, for example, neural networks.

Slide 22

Slide 23

Data science competitions – a crowdsourcing platform where companies pose problems and offer prizes for the best predictive model on uploaded data.

Slide 24

https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Slide 25

SVM as a model for house prices in areas with 1000 < Zip Code < 2500.

Basic estimate of the square meter price as a function of coordinates.

Slide 26

Radial Kernel

Slide 27

Polynomial Kernel

Slide 28

Changing gamma in the radial kernel changes performance.
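
A standard way to pick gamma (and C) is cross-validated grid search; below is a minimal sketch on synthetic data, with the parameter grid as an illustrative assumption.

```python
# Choose gamma and C for the radial kernel by cross-validated grid search.
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"gamma": [0.01, 0.1, 1, 10, 100], "C": [0.1, 1, 10]},
    cv=5,
)
grid.fit(X, y)
print("best parameters:", grid.best_params_)
print("cross-validated accuracy:", round(grid.best_score_, 3))
```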

Slide 29

Slide 30

Lab exercise:

• Build a model for house/apartment prices at Østerbro.
• Filter the data:
  I. Zip code = 2100
  II. Remove entries with a sales price lower than the taxation value.
  III. Only use entries with a square meter price in the range 10 kkr to 100 kkr.
  IV. Remove entries without UTM coordinates.
• Split the data into training and validation sets.
• Estimate the error (a sketch of the full pipeline follows below).
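
A hypothetical sketch of the pipeline referenced above. The file name and column names ('zip_code', 'sales_price', 'taxation_value', 'sqm_price', 'utm_x', 'utm_y') are assumptions, as are the SVR hyperparameters; adapt them to the actual data set.

```python
# Lab pipeline sketch: filter, split, fit an SVR, estimate the error.
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

df = pd.read_csv("house_sales.csv")  # assumed file name

# Filter: Østerbro (zip 2100), sales price >= taxation value,
# sqm price in 10-100 kkr, UTM coordinates present.
df = df[(df["zip_code"] == 2100)
        & (df["sales_price"] >= df["taxation_value"])
        & df["sqm_price"].between(10_000, 100_000)
        & df["utm_x"].notna() & df["utm_y"].notna()]

X = df[["utm_x", "utm_y"]]
y = df["sqm_price"]
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

# Hyperparameters below are placeholders; tune them, e.g. by grid search.
model = SVR(kernel="rbf", C=100, epsilon=1000).fit(X_tr, y_tr)
print("validation MAE (kr/sqm):",
      mean_absolute_error(y_va, model.predict(X_va)))
```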