Statistics 202: Data Mining
Support Vector Machine, Random Forests, Boosting
© Jonathan Taylor
Based in part on slides from the textbook and slides of Susan Holmes
December 2, 2012
Neural networks
Neural network
Another classifier (or regression technique) for 2-class problems.
Like logistic regression, it tries to predict labels $y$ from $x$.
Tries to find
$$\hat{f} = \mathop{\mathrm{argmin}}_{f \in \mathcal{C}} \sum_{i=1}^{n} \left(Y_i - f(X_i)\right)^2$$
where $\mathcal{C} = \{\text{"neural net functions"}\}$.
Neural networks: single layer
[Figure from Tan, Steinbach & Kumar, Introduction to Data Mining: Artificial Neural Networks (ANN)]
Neural networks
Single layer neural nets
With $p$ inputs, there are $p + 1$ weights:
$$f(x_1, \ldots, x_p) = S(\alpha + x^T\beta).$$
Here, $S$ is a sigmoid function like the logistic curve,
$$S(\eta) = \frac{e^{\eta}}{1 + e^{\eta}}.$$
The algorithm must estimate $(\alpha, \beta)$ from data.
The classifier would assign $+1$ to anything with $S(\alpha + x^T\beta) > t$ for some threshold like $t = 0.5$.
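As a small illustration (not from the slides), here is a sketch of this scoring rule in R; the weights alpha and beta below are made-up values, not fitted ones:

```r
# Single-layer "neural net" score: S(alpha + x^T beta)
sigmoid <- function(eta) exp(eta) / (1 + exp(eta))

# Hypothetical weights for p = 2 inputs (illustrative only)
alpha <- -1.0
beta  <- c(0.8, -0.5)

classify <- function(x, threshold = 0.5) {
  score <- sigmoid(alpha + sum(x * beta))
  ifelse(score > threshold, +1, -1)
}

classify(c(1.2, 0.3))   # returns +1 or -1 depending on the score
```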
Neural networks: double layer
Neural networks
Single layer neural nets
In the previous figure, there are 15 weights and 3 intercept terms ($\alpha$'s) for the input → hidden layer.
There are 3 weights and one intercept term for the hidden → output layer.
The outputs from the hidden layer have the form
$$S\left(\alpha_j + \sum_{i=1}^{5} w_{ij} x_i\right), \qquad 1 \leq j \leq 3.$$
So, the final output is of the form
$$S\left(\alpha_0 + \sum_{j=1}^{3} v_j\, S\left(\alpha_j + \sum_{i=1}^{5} w_{ij} x_i\right)\right).$$
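To make the composition concrete, here is a minimal R sketch of this two-stage forward pass (the weights below are arbitrary placeholders rather than fitted values):

```r
sigmoid <- function(eta) exp(eta) / (1 + exp(eta))

# Hypothetical weights for 5 inputs, 3 hidden units, 1 output (illustrative only)
W      <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)  # input -> hidden weights w_ij
alpha  <- rnorm(3)                                   # hidden-layer intercepts alpha_j
v      <- rnorm(3)                                   # hidden -> output weights v_j
alpha0 <- rnorm(1)                                   # output intercept alpha_0

forward <- function(x) {
  hidden <- sigmoid(alpha + as.vector(t(W) %*% x))  # S(alpha_j + sum_i w_ij x_i)
  sigmoid(alpha0 + sum(v * hidden))                 # final output
}

forward(rnorm(5))
```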
Neural networks
Single layer neural nets
Fitting this model is done via nonlinear least squares.
Computationally difficult due to the highly nonlinear form of the regression function.
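In practice one would use an existing fitting routine; a hedged sketch using R's nnet package (my package choice and settings, not the slides'), on a two-class version of the iris data:

```r
library(nnet)   # single-hidden-layer networks fit by (penalized) least squares

# Two-class problem: versicolor vs. the rest, using two sepal measurements
iris2 <- transform(iris, y = factor(Species == "versicolor"))

set.seed(1)
fit <- nnet(y ~ Sepal.Length + Sepal.Width, data = iris2,
            size = 3,      # 3 hidden units, as in the figure
            decay = 0.01,  # small weight penalty to stabilize the fit
            maxit = 500, trace = FALSE)

table(predicted = predict(fit, type = "class"), truth = iris2$y)
```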
Support vector machines
Support vector machine
Another classifier (or regression technique) for 2-class problems.
Like logistic regression, it tries to predict labels $y$ from $x$ using a linear function.
A linear function with an intercept $\alpha$ determines an affine function $f_{\alpha,\beta}(x) = x^T\beta + \alpha$ and a hyperplane
$$H_{\alpha,\beta} = \left\{x : x^T\beta + \alpha = 0\right\}.$$
A good classifier will "separate the classes" into the sides where $f_{\alpha,\beta}(x) > 0$ and $f_{\alpha,\beta}(x) < 0$.
It seeks the hyperplane that "separates the classes the most."
Support vector machine
[Figure from Tan, Steinbach & Kumar, Introduction to Data Mining]
Find a linear hyperplane (decision boundary) that will separate the data.
Support vector machine
[Figure from Tan, Steinbach & Kumar, Introduction to Data Mining]
One possible solution.
Support vector machine
[Figure from Tan, Steinbach & Kumar, Introduction to Data Mining]
Another possible solution.
Support vector machine
[Figure from Tan, Steinbach & Kumar, Introduction to Data Mining]
Other possible solutions.
Support vector machine
[Figure from Tan, Steinbach & Kumar, Introduction to Data Mining]
Which one is better, B1 or B2? How do you define "better"?
Support vector machine
[Figure from Tan, Steinbach & Kumar, Introduction to Data Mining]
Find the hyperplane that maximizes the margin: B1 is better than B2.
Support vector machines
Support vector machine
The reciprocal of the maximum margin can be written as
$$\inf_{\beta,\alpha} \|\beta\|_2 \quad \text{subject to } y_i(x_i^T\beta + \alpha) \geq 1.$$
Therefore, the support vector machine problem is
$$\mathop{\mathrm{minimize}}_{\beta,\alpha} \|\beta\|_2 \quad \text{subject to } y_i(x_i^T\beta + \alpha) \geq 1.$$
Support vector machine
[Figure from Tan, Steinbach & Kumar, Introduction to Data Mining]
What if the problem is not linearly separable?
Support vector machines
Non-separable problems
If we cannot completely separate the classes, it is common to modify the problem to
$$\mathop{\mathrm{minimize}}_{\beta,\alpha,\xi} \|\beta\|_2 \quad \text{subject to } y_i(x_i^T\beta + \alpha) \geq 1 - \xi_i,\ \xi_i \geq 0,\ \sum_{i=1}^{n}\xi_i \leq C,$$
or the Lagrange version
$$\mathop{\mathrm{minimize}}_{\beta,\alpha,\xi} \|\beta\|_2^2 + \gamma \sum_{i=1}^{n} \xi_i \quad \text{subject to } y_i(x_i^T\beta + \alpha) \geq 1 - \xi_i,\ \xi_i \geq 0.$$
Support vector machines
Non-separable problems
The $\xi_i$'s can be removed from this problem, yielding
$$\mathop{\mathrm{minimize}}_{\beta,\alpha} \|\beta\|_2^2 + \gamma \sum_{i=1}^{n} \left(1 - y_i f_{\alpha,\beta}(x_i)\right)_+$$
where $(z)_+ = \max(z, 0)$ is the positive part function. Or,
$$\mathop{\mathrm{minimize}}_{\beta,\alpha} \sum_{i=1}^{n} \left(1 - y_i f_{\alpha,\beta}(x_i)\right)_+ + \lambda \|\beta\|_2^2.$$
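As an aside (not on the slide), this penalized hinge-loss objective is straightforward to compute directly; a minimal R sketch with simulated data and arbitrary parameters:

```r
# Hinge-loss objective: sum_i (1 - y_i * (alpha + x_i' beta))_+ + lambda * ||beta||^2
hinge_objective <- function(alpha, beta, X, y, lambda) {
  margins <- y * (alpha + as.vector(X %*% beta))   # y_i * f_{alpha,beta}(x_i)
  sum(pmax(1 - margins, 0)) + lambda * sum(beta^2)
}

set.seed(1)
X <- matrix(rnorm(40), ncol = 2)          # 20 observations, 2 features
y <- ifelse(X[, 1] + X[, 2] > 0, 1, -1)   # labels in {+1, -1}
hinge_objective(alpha = 0, beta = c(1, 1), X, y, lambda = 0.1)
```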
Logistic vs. SVM
[Figure: comparison of the logistic and SVM (hinge) loss curves.]
Iris data: sepal.length, sepal.width
Iris data: sepal.length, sepal.width
Iris data: sepal.length, sepal.width
Support vector machines
Kernel trick
The term $\|\beta\|_2^2$ can be thought of as a complexity penalty on the function $f_{\alpha,\beta}$, so we could write
$$\mathop{\mathrm{minimize}}_{f_\beta,\alpha} \sum_{i=1}^{n} \left(1 - y_i\,(\alpha + f_\beta(x_i))\right)_+ + \lambda \|f_\beta\|_2^2.$$
We could also add such a complexity penalty to logistic regression:
$$\mathop{\mathrm{minimize}}_{f_\beta,\alpha} \sum_{i=1}^{n} \mathrm{DEV}(\alpha + f_\beta(x_i), y_i) + \lambda \|f_\beta\|_2^2.$$
So, in both SVM and penalized logistic regression, we are looking for the "best" linear function, where "best" is determined by either the logistic loss (deviance) or the hinge loss (SVM).
Support vector machines
Kernel trick
If the decision boundary is nonlinear, one might look for functions other than linear ones.
There are classes of functions called Reproducing Kernel Hilbert Spaces (RKHS) for which it is theoretically (and computationally) possible to solve
$$\mathop{\mathrm{minimize}}_{f \in \mathcal{H}} \sum_{i=1}^{n} \left(1 - y_i(\alpha + f(x_i))\right)_+ + \lambda \|f\|_{\mathcal{H}}^2$$
or
$$\mathop{\mathrm{minimize}}_{f \in \mathcal{H}} \sum_{i=1}^{n} \mathrm{DEV}(\alpha + f(x_i), y_i) + \lambda \|f\|_{\mathcal{H}}^2.$$
The svm function in R allows some such "kernels" for SVM.
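A hedged sketch of such a fit using the svm function from the e1071 package (package choice and tuning values are my assumptions, not the slides'):

```r
library(e1071)   # provides svm(), with linear, polynomial, and radial kernels

# Two-class problem: versicolor vs. the rest, using two sepal measurements
iris2 <- transform(iris, y = factor(Species == "versicolor"))

fit_linear <- svm(y ~ Sepal.Length + Sepal.Width, data = iris2,
                  kernel = "linear", cost = 1)
fit_radial <- svm(y ~ Sepal.Length + Sepal.Width, data = iris2,
                  kernel = "radial", cost = 1)   # RKHS-style nonlinear boundary

table(predicted = predict(fit_radial, iris2), truth = iris2$y)
```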
Iris data: sepal.length, sepal.width
Iris data: sepal.length, sepal.width
Iris data: sepal.length, sepal.width
Iris data: sepal.length, sepal.width
Iris data: sepal.length, sepal.width
Iris data: sepal.length, sepal.width
Ensemble methods
[Figure from Tan, Steinbach & Kumar, Introduction to Data Mining]
General idea.
Ensemble methods
Why use ensemble methods?
Basic idea: weak learners / classifiers are noisy learners.
Aggregating over a large number of weak learners should improve the output.
This assumes some "independence" across the weak learners; otherwise aggregation does nothing.
That is, if every learner is the same, then any sensible aggregation technique should return this classifier.
These are often some of the best classifiers in terms of performance.
Downside: they lose interpretability in the aggregation stage.
Ensemble methods
Bagging
In this method, one takes several bootstrap samples (samples with replacement) of the data.
For each bootstrap sample $S_b$, $1 \leq b \leq B$, fit a model, retaining the classifier $f^{*,b}$.
After all models have been fit, use the majority vote
$$f(x) = \text{majority vote of } (f^{*,b}(x))_{1 \leq b \leq B}.$$
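A minimal R sketch of bagging with classification trees from rpart as the base learner (my illustration, not code from the slides):

```r
library(rpart)   # classification trees as the weak learner

set.seed(1)
iris2 <- transform(iris, y = factor(Species == "versicolor"))
B <- 100

# Fit one tree per bootstrap sample, keeping each fitted classifier f^{*,b}
trees <- lapply(seq_len(B), function(b) {
  idx <- sample(nrow(iris2), replace = TRUE)            # bootstrap sample S_b
  rpart(y ~ Sepal.Length + Sepal.Width, data = iris2[idx, ])
})

# Bagged prediction: majority vote over the B trees
bagged_predict <- function(newdata) {
  votes <- sapply(trees, function(tr) as.character(predict(tr, newdata, type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}

table(predicted = bagged_predict(iris2), truth = iris2$y)
```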
Ensemble methods
Bagging
Approximately 1/3 of the data is not selected in each bootstrap sample (recall the 0.632 rule).
Using this observation, we can define the out-of-bag (OOB) estimate of accuracy based on the OOB prediction
$$f_{\mathrm{OOB}}(x_i) = \text{majority vote of } (f^{b}(x_i))_{b \in B(i)}$$
where $B(i) = \{\text{all trees not built with } x_i\}$. The total OOB error is
$$\mathrm{OOB} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left(f_{\mathrm{OOB}}(x_i) \neq y_i\right).$$
Ensemble methods
Random forests: bagging trees
If the weak learners are trees, this method is called a Random Forest.
Some variations on random forests inject more randomness than just bagging. For example:
Taking a random subset of features.
Random linear combinations of features.
Iris data: sepal.length, sepal.width using randomForest
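The figure on this slide did not survive extraction; a hedged sketch of the kind of randomForest fit involved (the settings are my guesses, not the slide's):

```r
library(randomForest)

set.seed(1)
iris2 <- transform(iris, y = factor(Species == "versicolor"))

rf <- randomForest(y ~ Sepal.Length + Sepal.Width, data = iris2,
                   ntree = 500)   # bagged trees plus random feature subsets at each split

print(rf)   # includes the OOB estimate of the error rate
```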
Ensemble methods
Boosting
Unlike bagging, boosting does not insert randomization into the algorithm.
Boosting works by iteratively reweighting each observation and refitting a given classifier.
To begin, suppose we have a classification algorithm that accepts case weights $w = (w_1, \ldots, w_n)$,
$$f_w = \mathop{\mathrm{argmin}}_{f} \sum_{i=1}^{n} w_i\, L(y_i, f(x_i)),$$
and returns labels in $\{+1, -1\}$.
Ensemble methods
Boosting algorithm (AdaBoost.M1)
Step 0: Initialize weights $w^{(0)}_i = \frac{1}{n}$, $1 \leq i \leq n$.
Step 1: For $1 \leq t \leq T$, fit a classifier with weights $w^{(t-1)}$, yielding
$$g^{(t)} = f_{w^{(t-1)}} = \mathop{\mathrm{argmin}}_{f} \sum_{i=1}^{n} w^{(t-1)}_i L(y_i, f(x_i)).$$
Step 2: Compute the total error rate
$$\mathrm{err}^{(t)} = \frac{\sum_{i=1}^{n} w^{(t-1)}_i \mathbb{1}(y_i \neq g^{(t)}(x_i))}{\sum_{i=1}^{n} w^{(t-1)}_i}, \qquad \alpha^{(t)} = \log\left(\frac{1 - \mathrm{err}^{(t)}}{\mathrm{err}^{(t)}}\right).$$
Ensemble methods
Boosting algorithm (AdaBoost.M1)
Step 3: Produce new weights
$$w^{(t)}_i = w^{(t-1)}_i \cdot \exp\left(\alpha^{(t)} \cdot \mathbb{1}(y_i \neq g^{(t)}(x_i))\right).$$
Step 4: Repeat steps 1-3 until termination.
Step 5: Output the classifier
$$f(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha^{(t)} g^{(t)}(x)\right).$$
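Putting Steps 0-5 together, here is a minimal R sketch of AdaBoost.M1 using depth-1 rpart trees (stumps) as the weighted weak learner; this is my illustration, not code from the slides, and it omits safeguards such as handling a zero error rate:

```r
library(rpart)

# AdaBoost.M1 with stumps as the weak learner; labels y must be coded +1 / -1.
adaboost_m1 <- function(X, y, n_rounds = 50) {
  n <- nrow(X)
  w <- rep(1 / n, n)                              # Step 0: w_i^{(0)} = 1/n
  dat <- data.frame(X, y = factor(y))
  trees <- vector("list", n_rounds)
  alpha <- numeric(n_rounds)
  for (t in seq_len(n_rounds)) {
    # Step 1: fit a classifier with the current case weights
    trees[[t]] <- rpart(y ~ ., data = dat, weights = w,
                        control = rpart.control(maxdepth = 1))
    g <- as.numeric(as.character(predict(trees[[t]], dat, type = "class")))
    # Step 2: weighted error rate and alpha^{(t)}
    miss <- as.numeric(g != y)
    err <- sum(w * miss) / sum(w)
    alpha[t] <- log((1 - err) / err)
    # Step 3: upweight the misclassified observations
    w <- w * exp(alpha[t] * miss)
  }
  # Step 5: classify with the sign of the alpha-weighted vote
  predict_fun <- function(newX) {
    votes <- sapply(seq_len(n_rounds), function(t) {
      alpha[t] * as.numeric(as.character(predict(trees[[t]], data.frame(newX), type = "class")))
    })
    sign(rowSums(votes))
  }
  list(predict = predict_fun, alpha = alpha)
}

# Usage on a two-class version of the iris data
X <- iris[, c("Sepal.Length", "Sepal.Width")]
y <- ifelse(iris$Species == "versicolor", 1, -1)
fit <- adaboost_m1(X, y, n_rounds = 50)
mean(fit$predict(X) == y)   # training accuracy
```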
Ensemble methods
[Figure from Tan, Steinbach & Kumar, Introduction to Data Mining]
Illustrating AdaBoost: data points for training; initial weights for each data point.
Ensemble methods
[Figure from Tan, Steinbach & Kumar, Introduction to Data Mining]
Illustrating AdaBoost.
Ensemble methods
Boosting as gradient descent
It turns out that boosting can be thought of as something like gradient descent.
Above, we assumed we were solving
$$\mathop{\mathrm{argmin}}_{f} \sum_{i=1}^{n} w^{(t-1)}_i L(y_i, f(x_i)),$$
but we didn't say what type of $f$ we considered.
Ensemble methods
Boosting as gradient descent
Call the space of all possible classifiers above $\mathcal{F}_0$, and all linear combinations of such classifiers
$$\mathcal{F} = \left\{\sum_j \alpha_j f_j : f_j \in \mathcal{F}_0\right\}.$$
In some sense, the boosting algorithm is a "steepest descent" algorithm to find
$$\mathop{\mathrm{argmin}}_{f \in \mathcal{F}} \sum_{i=1}^{n} L(y_i, f(x_i)).$$
For more details, see ESL, but we will discuss some of these ideas. In R, this idea is implemented in the gbm package.
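A hedged sketch of a gbm call in this spirit (the parameter choices are mine, not the slides'); gbm expects a 0/1 response for its "bernoulli" and "adaboost" distributions:

```r
library(gbm)

iris2 <- transform(iris, y = as.numeric(Species == "versicolor"))  # 0/1 response

set.seed(1)
boost <- gbm(y ~ Sepal.Length + Sepal.Width, data = iris2,
             distribution = "adaboost",   # exponential loss, as in AdaBoost.M1
             n.trees = 1000,              # number of gradient-descent steps T
             interaction.depth = 1,       # stumps as the weak learners
             shrinkage = 0.05)

# Predicted scores on the boosting (link) scale; the sign gives a class call
scores <- predict(boost, iris2, n.trees = 1000)
table(predicted = scores > 0, truth = iris2$y)
```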
Ensemble methods
Boosting as gradient descent
Consider the problem
$$\mathop{\mathrm{minimize}}_{f \in \mathcal{F}} \; \mathcal{L}(f) = \sum_{i=1}^{n} L(y_i, f(x_i)).$$
This can be solved by "gradient descent," where the "gradient" $\nabla \mathcal{L}(f)$ is in $\mathcal{F}_0$ and is the direction of steepest descent for some stepsize $\alpha$:
$$\mathop{\mathrm{argmin}}_{g \in \mathcal{F}_0} \mathcal{L}(f + \alpha \cdot g).$$
Ensemble methods
Boosting as gradient descent
We might also optimize the stepsize:
$$\mathop{\mathrm{argmin}}_{g \in \mathcal{F}_0,\, \alpha} \mathcal{L}(f + \alpha \cdot g).$$
Ensemble methods
Boosting as gradient descent
In this context, the forward stagewise algorithm for steepest descent is:
1. Initialize: $f^{(0)}(x) = 0$.
2. For $t = 1, \ldots, T$: solve
$$\alpha^{(t)}, g^{(t)} = \mathop{\mathrm{argmin}}_{\alpha,\, g \in \mathcal{F}_0} \mathcal{L}(f^{(t-1)} + \alpha \cdot g)$$
and update $f^{(t)} = f^{(t-1)} + \alpha^{(t)} \cdot g^{(t)}$.
3. Return the final classifier $f^{(T)}$.
When $L(y, f(x)) = e^{-y \cdot f(x)}$, this corresponds to AdaBoost.M1.
Changing the loss function yields new gradient boosting algorithms.
Iris data: sepal.length, sepal.width, 1000 trees
{Versicolor} vs. {Setosa, Virginica}
Iris data: sepal.length, sepal.width, 5000 trees
{Versicolor} vs. {Setosa, Virginica}
Iris data: sepal.length, sepal.width, 10000 trees
{Versicolor} vs. {Setosa, Virginica}
Boosting fits an additive model (by default)
Boosting fits an additive model (by default)
Out-of-bag error improvement