CSE 446: Machine Learning
Linear classifiers: Scaling up learning via SGD
Emily Fox, University of Washington
January 27, 2017
©2017 Emily Fox
Stochastic gradient descent: Learning, one data point at a time
Stochastic gradient ascent

[Figure: instead of using all the data, compute each gradient from only a small subset of data, updating the coefficients w(t) → w(t+1) → w(t+2) → w(t+3) → w(t+4) → … This gives many coefficient updates for each pass over the data.]
Stochastic gradient ascent for logistic regression

init w(1)=0, t=1
until converged
  for i=1,…,N                  ← each time, pick a different data point i
    for j=0,…,D
      partial[j] = hj(xi)(1[yi=+1] – P(y=+1 | xi, w(t)))
      wj(t+1) ← wj(t) + η partial[j]
    t ← t + 1
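To make this concrete, here is a minimal NumPy sketch of the loop above; the function name sgd_logistic, the fixed number of passes standing in for "until converged", and the per-pass shuffling are illustrative choices, not from the slides:

import numpy as np

def sgd_logistic(X, y, eta=0.1, n_passes=10, seed=0):
    # X: (N, D+1) features with h0(x) = 1 in column 0; y: (N,) labels in {+1, -1}.
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    w = np.zeros(X.shape[1])                      # init w(1) = 0
    for _ in range(n_passes):                     # stands in for "until converged"
        for i in rng.permutation(N):              # each time, pick a different data point i
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))   # P(y=+1 | xi, w(t))
            indicator = 1.0 if y[i] == +1 else 0.0
            partial = X[i] * (indicator - p)      # per-point gradient, all j at once
            w = w + eta * partial                 # wj(t+1) <- wj(t) + η partial[j]
    return w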
Why would stochastic gradient ever work???
Gradient is direction of steepest ascent

Gradient is the “best” direction, but any direction that goes “up” would be useful.
In ML, steepest direction is sum of “little directions” from each data point

The gradient is a sum over data points: ∇ℓ(w) = Σi ∇ℓi(w). For most data points, the contribution ∇ℓi(w) points “up” (increases the likelihood).
Stochastic gradient: Pick a data point and move in its direction

The full gradient Σi ∇ℓi(w) is approximated (≈) by the contribution of a single, randomly chosen data point i. Most of the time, the total likelihood will still increase.
Stochastic gradient ascent: Most iterations increase the likelihood, but some decrease it. On average, we make progress.

until converged
  for i=1,…,N
    for j=0,…,D
      wj(t+1) ← wj(t) + η ∂ℓi(w(t))/∂wj
    t ← t + 1
Convergence path
Convergence paths

[Figure: in coefficient space, gradient ascent follows a smooth path toward the optimum, while stochastic gradient follows a noisy, zig-zagging path.]
Stochastic gradient convergence is “noisy”

[Plot: avg. log likelihood (higher is better) vs. # passes over data.]
Gradient usually increases likelihood smoothly.
Stochastic gradient makes “noisy” progress.
Total time is proportional to # passes over data.
Stochastic gradient achieves higher likelihood sooner, but it’s noisier.
Eventually, gradient catches up

[Plot: avg. log likelihood (higher is better) vs. # passes over data, for gradient and stochastic gradient.]
Note: you should only trust the “average” quality of stochastic gradient.
The last coefficients may be really good or really bad!!

E.g., w(1000) was bad, but w(1005) was good. How do we minimize the risk of picking bad coefficients? Stochastic gradient will eventually oscillate around a solution. To minimize noise, don’t return the last learned coefficients; output the average instead:

ŵ = (1/T) Σt=1…T w(t)
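As a tiny sketch of this averaging step, assuming the iterates w(1), …, w(T) were saved in a Python list during training (the variable names are made up for illustration):

import numpy as np

ws = [np.array([0.1, -0.2]), np.array([0.3, -0.1]), np.array([0.2, -0.15])]  # w(1)…w(T)
w_hat = np.mean(ws, axis=0)   # ŵ = (1/T) Σ w(t): average, don't take the last iterate
print(w_hat)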
Summary of why stochastic gradient works
• Gradient finds direction of steepest ascent
• Gradient is sum of contributions from each data point
• Stochastic gradient uses direction from 1 data point
• On average increases likelihood, sometimes decreases
• Stochastic gradient has “noisy” convergence
Online learning: Fitting models from streaming data
Batch vs online learning

Batch learning
• All data is available at start of training time
[Diagram: Data → ML algorithm → ŵ]

Online learning
• Data arrives (streams in) over time
- Must train model as data arrives!
[Diagram: data at t=1, 2, 3, 4, … streams into the ML algorithm, which outputs updated coefficients ŵ(1), ŵ(2), ŵ(3), ŵ(4), …]
Online learning example: Ad targeting

[Diagram: a website produces input xt (user info, page text). The ML algorithm, with current coefficients ŵ(t), outputs ŷ = suggested ads (Ad1, Ad2, Ad3). The user clicks on Ad2, so yt = Ad2, and the algorithm updates to ŵ(t+1).]
Online learning problem
• Data arrives over each time step t:
- Observe input xt
  • Info of user, text of webpage
- Make a prediction ŷt
  • Which ad to show
- Observe true output yt
  • Which ad user clicked on

Need ML algorithm to update coefficients each time step!
Stochastic gradient ascent can be used for online learning!!!

• init w(1)=0, t=1
• Each time step t:
- Observe input xt
- Make a prediction ŷt
- Observe true output yt
- Update coefficients:
  for j=0,…,D
    wj(t+1) ← wj(t) + η ∂ℓt(w(t))/∂wj
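A hypothetical sketch of this online loop for a logistic-regression-style model; stream_of_examples, the feature encoding, and the fixed learning rate eta are stand-ins, not part of the slides:

import numpy as np

def online_sgd(stream_of_examples, D, eta=0.1):
    # stream_of_examples yields (x_t, y_t) over time; x_t is (D+1,), y_t in {+1, -1}.
    w = np.zeros(D + 1)                      # init w(1) = 0
    for x_t, y_t in stream_of_examples:      # each time step t
        p = 1.0 / (1.0 + np.exp(-x_t @ w))   # P(y=+1 | xt, w(t))
        y_hat = +1 if p >= 0.5 else -1       # make a prediction ŷt (shown to user)
        # ... after acting on y_hat, we observe the true output y_t ...
        indicator = 1.0 if y_t == +1 else 0.0
        w = w + eta * x_t * (indicator - p)  # update coefficients immediately
    return w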
Summary of online learning
• Data arrives over time
• Must make a prediction every time a new data point arrives
• Observe true class after prediction made
• Want to update parameters immediately
Summary of stochastic gradient descent
What you can do now…
• Significantly speed up learning algorithm using stochastic gradient
• Describe intuition behind why stochastic gradient works
• Apply stochastic gradient in practice
• Describe online learning problems
• Relate stochastic gradient to online learning
Decision Trees
Emily Fox, University of Washington, January 27, 2017
Predicting potential loan defaults
What makes a loan risky?

I want to buy a new house!
[Loan Application: Credit History ★★★★, Income ★★★, Term ★★★★★, Personal Info ★★★]
Credit history explained

Credit History ★★★★: Did I pay previous loans on time? Example: excellent, good, or fair.
Income

Income ★★★: What’s my income? Example: $80K per year.
Loan terms

Term ★★★★★: How soon do I need to pay the loan? Example: 3 years, 5 years, …
Personal information

Personal Info ★★★: Age, reason for the loan, marital status, … Example: Home loan for a married couple.
Intelligent application

[Diagram: Loan Applications → Intelligent loan application review system → Safe ✓ / Risky ✘]
Classifier review

[Diagram: Loan Application → Input xi → Classifier MODEL → Output ŷ, the predicted class: Safe (ŷi = +1) or Risky (ŷi = -1)]
This module ... decision trees

Start
└─ Credit?
   ├─ excellent → Safe
   ├─ fair → Term?
   │         ├─ 3 years → Risky
   │         └─ 5 years → Safe
   └─ poor → Income?
             ├─ high → Term?
             │         ├─ 3 years → Risky
             │         └─ 5 years → Safe
             └─ low  → Risky
Scoring a loan application

xi = (Credit = poor, Income = high, Term = 5 years)

Traverse the tree from the Start node: Credit = poor → Income = high → Term = 5 years → Safe.

ŷi = Safe
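A small sketch of this traversal; the nested-dict tree encoding below is invented here for illustration (the slides don’t prescribe a data structure):

# Internal nodes: {"split": feature, "children": {value: subtree}}; leaves: class labels.
tree = {
    "split": "Credit",
    "children": {
        "excellent": "Safe",
        "fair": {"split": "Term",
                 "children": {"3 years": "Risky", "5 years": "Safe"}},
        "poor": {"split": "Income",
                 "children": {"low": "Risky",
                              "high": {"split": "Term",
                                       "children": {"3 years": "Risky",
                                                    "5 years": "Safe"}}}},
    },
}

def classify(node, x):
    # Walk from the root to a leaf, following x's feature values.
    while isinstance(node, dict):
        node = node["children"][x[node["split"]]]
    return node

xi = {"Credit": "poor", "Income": "high", "Term": "5 years"}
print(classify(tree, xi))   # -> Safe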
Decision tree learning task
Decision tree learning problem

Optimize quality metric on training data.
Training data: N observations (xi, yi)

Credit     Term   Income  y
excellent  3 yrs  high    safe
fair       5 yrs  low     risky
fair       3 yrs  high    safe
poor       5 yrs  high    risky
excellent  3 yrs  low     risky
fair       5 yrs  low     safe
poor       3 yrs  high    risky
poor       5 yrs  low     safe
fair       3 yrs  high    safe

The learned model is a tree T(X).
Quality metric: Classification error
• Error measures fraction of mistakes
- Best possible value: 0.0
- Worst possible value: 1.0

Error = (# incorrect predictions) / (# examples)
How do we find the best tree?

The exponentially large number of possible trees (T1(X), T2(X), T3(X), …) makes decision tree learning hard! Learning the smallest decision tree is an NP-hard problem [Hyafil & Rivest ’76].
Greedy decision tree learning
Our training data table

Assume N = 40 and 3 features (the table shows a sample of rows):

Credit     Term   Income  y
excellent  3 yrs  high    safe
fair       5 yrs  low     risky
fair       3 yrs  high    safe
poor       5 yrs  high    risky
excellent  3 yrs  low     risky
fair       5 yrs  low     safe
poor       3 yrs  high    risky
poor       5 yrs  low     safe
fair       3 yrs  high    safe
Start with all the data

(all data) Loan status: Safe / Risky
N = 40 examples: 22 safe loans, 18 risky loans.
Compact visual notation: Root node

Root [22 18]   ← # of Safe loans, # of Risky loans (N = 40 examples)
Decision stump: Single level tree

Split on Credit (counts are [Safe Risky]):

Root [22 18]
└─ Credit?
   ├─ excellent [9 0]   ← subset of data with Credit = excellent
   ├─ fair [9 4]        ← subset of data with Credit = fair
   └─ poor [4 14]       ← subset of data with Credit = poor
Visual notation: Intermediate nodes

Root [22 18]
└─ Credit?
   ├─ excellent [9 0]
   ├─ fair [9 4]
   └─ poor [4 14]

The nodes excellent, fair, and poor are intermediate nodes.
Making predictions with a decision stump

For each intermediate node, set ŷ = majority value:
excellent [9 0] → Safe, fair [9 4] → Safe, poor [4 14] → Risky.
Selecting best feature to split on
How do we learn a decision stump?

Root [22 18] → Credit? → excellent [9 0], fair [9 4], poor [4 14]

Find the “best” feature to split on!
How do we select the best feature?

Choice 1: Split on Credit
Root [22 18] → Credit? → excellent [9 0], fair [9 4], poor [4 14]

Choice 2: Split on Term
Root [22 18] → Term? → 3 years [16 4], 5 years [6 14]
How do we measure effectiveness of a split?

Error = (# mistakes) / (# data points)

Idea: calculate the classification error of this decision stump.
Root [22 18] → Credit? → excellent [9 0], fair [9 4], poor [4 14]
Calculating classification error
• Step 1: ŷ = class of majority of data in node
• Step 2: Calculate classification error of predicting ŷ for this data

At the root [22 18], ŷ = majority class = Safe, giving 22 correct and 18 mistakes:
Error = 18/40 = 0.45

Tree     Classification error
(root)   0.45
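As a quick arithmetic check of these two steps (counts hard-coded from the root node above):

n_safe, n_risky = 22, 18                            # root node counts
y_hat = "Safe" if n_safe >= n_risky else "Risky"    # Step 1: majority class
error = min(n_safe, n_risky) / (n_safe + n_risky)   # Step 2: mistakes / # examples
print(y_hat, error)                                 # -> Safe 0.45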
Choice 1: Split on Credit history?

Does a split on Credit reduce classification error below 0.45?
Root [22 18] → Credit? → excellent [9 0], fair [9 4], poor [4 14]
Split on Credit: Classification error

Root [22 18] → Credit? → excellent [9 0] → Safe, fair [9 4] → Safe, poor [4 14] → Risky
Mistakes: 0 + 4 + 4 = 8

Error = 8/40 = 0.2

Tree             Classification error
(root)           0.45
Split on credit  0.2
Choice 2: Split on Term?

Root [22 18] → Term? → 3 years [16 4] → Safe, 5 years [6 14] → Risky
Evaluating the split on Term

Root [22 18] → Term? → 3 years [16 4] → Safe (4 mistakes), 5 years [6 14] → Risky (6 mistakes)

Error = (4 + 6)/40 = 0.25

Tree             Classification error
(root)           0.45
Split on credit  0.2
Split on term    0.25
Choice 1 vs Choice 2: Comparing split on Credit vs Term

Choice 1: Split on Credit — Root [22 18] → excellent [9 0], fair [9 4], poor [4 14]
Choice 2: Split on Term — Root [22 18] → 3 years [16 4], 5 years [6 14]

Tree                Classification error
(root)              0.45
split on credit     0.2
split on loan term  0.25

Winner: split on Credit, the lowest classification error.
Feature split selection algorithm
• Given a subset of data M (a node in a tree)
• For each feature hi(x):
1. Split data of M according to feature hi(x)
2. Compute classification error of the split
• Choose feature h*(x) with lowest classification error
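A minimal sketch of this selection step, assuming categorical features stored as dicts and labels in a parallel list; the helper names majority_class, split_error, and best_split_feature are illustrative, not from the slides:

from collections import Counter

def majority_class(labels):
    # ŷ for a node: the most common class among its data points.
    return Counter(labels).most_common(1)[0][0]

def split_error(data, labels, feature):
    # Classification error of a decision stump that splits on `feature`.
    mistakes = 0
    for v in set(x[feature] for x in data):
        subset = [y for x, y in zip(data, labels) if x[feature] == v]
        y_hat = majority_class(subset)               # majority vote in this child
        mistakes += sum(1 for y in subset if y != y_hat)
    return mistakes / len(labels)

def best_split_feature(data, labels, features):
    # Choose the feature h*(x) with lowest classification error.
    return min(features, key=lambda f: split_error(data, labels, f))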
Recursion & Stopping conditions
We’ve learned a decision stump, what next?

Root [22 18] → Credit? → excellent [9 0], fair [9 4], poor [4 14]

All data points in the excellent node are Safe, so there is nothing else to do with this subset of data: it becomes a leaf node predicting Safe.
Tree learning = Recursive stump learning

Root [22 18] → Credit? → excellent [9 0] → Safe, fair [9 4], poor [4 14]

Build a decision stump with the subset of data where Credit = fair, and another with the subset of data where Credit = poor.
Second level

Root [22 18]
└─ Credit?
   ├─ excellent [9 0] → Safe
   ├─ fair [9 4] → Term?
   │               ├─ 3 years [0 4] → Risky
   │               └─ 5 years [9 0] → Safe
   └─ poor [4 14] → Income?
                    ├─ high [4 5] → (build another stump for these data points)
                    └─ low [0 9] → Risky
Final decision tree

Root [22 18]
└─ Credit?
   ├─ excellent [9 0] → Safe
   ├─ fair [9 4] → Term?
   │               ├─ 3 years [0 4] → Risky
   │               └─ 5 years [9 0] → Safe
   └─ poor [4 14] → Income?
                    ├─ high [4 5] → Term?
                    │               ├─ 3 years [0 2] → Risky
                    │               └─ 5 years [4 3] → Safe
                    └─ low [0 9] → Risky
Simple greedy decision tree learning
• Pick best feature to split on
• Learn decision stump with this split
• For each leaf of decision stump, recurse

When do we stop???
Stopping condition 1: All data agrees on y

In the final tree, the nodes excellent [9 0], 5 years [9 0], 3 years [0 4], low [0 9], and 3 years [0 2] each contain data with the same y value, so there is nothing to do: they become leaves.
Stopping condition 2: Already split on all features

The node 5 years [4 3] (under Credit = poor, Income = high) has already split on all possible features, so there is nothing to do: it becomes a leaf predicting the majority class, Safe.
Greedy decision tree learning (with recursion & stopping conditions 1 & 2)
• Step 1: Start with an empty tree
• Step 2: Select a feature to split data (pick the feature split leading to lowest classification error)
• For each split of the tree:
• Step 3: If nothing more to do, make predictions
• Step 4: Otherwise, go to Step 2 & continue (recurse) on this split
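Combining the pieces, a hedged sketch of this recursion, reusing the majority_class and best_split_feature helpers sketched earlier and the same nested-dict tree format as the classify example:

def build_tree(data, labels, features):
    # Greedy recursive decision tree learning with stopping conditions 1 & 2.
    if len(set(labels)) == 1:          # stopping condition 1: all data agrees on y
        return labels[0]
    if not features:                   # stopping condition 2: split on all features
        return majority_class(labels)
    f = best_split_feature(data, labels, features)   # lowest-error split
    remaining = [g for g in features if g != f]
    children = {}
    for v in set(x[f] for x in data):
        pairs = [(x, y) for x, y in zip(data, labels) if x[f] == v]
        xs, ys = [x for x, _ in pairs], [y for _, y in pairs]
        children[v] = build_tree(xs, ys, remaining)  # recurse on this split
    return {"split": f, "children": children}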
Is this a good idea?

Proposed stopping condition 3: Stop if no split reduces the classification error.
Stopping condition 3: Don’t stop if error doesn’t decrease???

Consider y = x[1] xor x[2]:

x[1]   x[2]   y
False  False  False
False  True   True
True   False  True
True   True   False

Root [2 2] (y values: True False)
Error = 2/4 = 0.5

Tree     Classification error
(root)   0.5
Consider split on x[1]

Root [2 2] → x[1]? → True [1 1], False [1 1]
Error = (1 + 1)/(2 + 2) = 0.5

Tree           Classification error
(root)         0.5
Split on x[1]  0.5
Consider split on x[2]

Root [2 2] → x[2]? → True [1 1], False [1 1]
Error = (1 + 1)/(2 + 2) = 0.5

Tree           Classification error
(root)         0.5
Split on x[1]  0.5
Split on x[2]  0.5

Neither feature improves training error… Stop now???
Final tree with stopping condition 3

Root [2 2] → Predict True

Tree                       Classification error
with stopping condition 3  0.5
Without stopping condition 3

Root [2 2]
└─ x[1]?
   ├─ True [1 1] → x[2]?
   │               ├─ True [0 1] → False
   │               └─ False [1 0] → True
   └─ False [1 1] → x[2]?
                    ├─ True [1 0] → True
                    └─ False [0 1] → False

Tree                          Classification error
with stopping condition 3     0.5
without stopping condition 3  0.0

Condition 3 (stopping when training error doesn’t improve) is not recommended!
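To see the recursion succeed where proposed condition 3 would have quit, the build_tree and classify sketches from earlier can be run on this XOR table (the keys x1 and x2 stand for x[1] and x[2]; this demo is an illustration, not from the slides):

data = [{"x1": False, "x2": False}, {"x1": False, "x2": True},
        {"x1": True, "x2": False}, {"x1": True, "x2": True}]
labels = [False, True, True, False]              # y = x[1] xor x[2]

xor_tree = build_tree(data, labels, ["x1", "x2"])
preds = [classify(xor_tree, x) for x in data]
print(preds == labels)   # -> True: 0.0 training error without stopping condition 3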
Decision tree learning: Real valued features
How do we use real-valued inputs?

Income  Credit     Term   y
$105K   excellent  3 yrs  Safe
$112K   good       5 yrs  Risky
$73K    fair       3 yrs  Safe
$69K    excellent  5 yrs  Safe
$217K   excellent  3 yrs  Risky
$120K   good       5 yrs  Safe
$64K    fair       3 yrs  Risky
$340K   excellent  5 yrs  Safe
$60K    good       3 yrs  Risky
Threshold split

Split on the feature Income:
Root [22 18] → Income? → < $60K [8 13], >= $60K [14 5]
(The >= $60K node is the subset of data with Income >= $60K.)
Finding the best threshold split

[Figure: data points plotted on an Income axis from $10K to $120K, labeled Safe or Risky; a threshold Income = t* splits them into Income < t* and Income >= t*.]
There are infinitely many possible values of t.
Consider a threshold between points

[Figure: two adjacent data points on the Income axis at values vA and vB.]
The classification error is the same for any threshold split between vA and vB.
Only need to consider mid-points

[Figure: candidate thresholds at the mid-points between consecutive Income values.]
This leaves a finite number of splits to consider.
Threshold split selection algorithm
• Step 1: Sort the values of a feature hj(x):
Let {v1, v2, v3, …, vN} denote the sorted values
• Step 2:
- For i = 1 … N-1
  • Consider split ti = (vi + vi+1) / 2
  • Compute classification error for threshold split hj(x) >= ti
- Choose the t* with the lowest classification error
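A minimal sketch of this algorithm for one numeric feature, reusing the majority_class helper from earlier; feature values and labels are parallel lists, and best_threshold is an illustrative name:

def best_threshold(values, labels):
    # Pick the threshold t* minimizing classification error of the split value >= t.
    pairs = sorted(zip(values, labels))              # Step 1: sort the values
    vs = [v for v, _ in pairs]
    best_t, best_err = None, float("inf")
    for i in range(len(vs) - 1):                     # Step 2: for i = 1 … N-1
        if vs[i] == vs[i + 1]:
            continue                                 # no threshold between equal values
        t = (vs[i] + vs[i + 1]) / 2                  # mid-point candidate ti
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        mistakes = sum(1 for y in left if y != majority_class(left)) \
                 + sum(1 for y in right if y != majority_class(right))
        err = mistakes / len(labels)
        if err < best_err:                           # choose t* with lowest error
            best_t, best_err = t, err
    return best_t, best_err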
Visualizing the threshold split

[Figure: scatter plot of Income ($0K, $40K, $80K, …) vs. Age (0, 10, 20, 30, 40, …); the threshold split is the vertical line Age = 38.]
Split on Age >= 38

[Figure: the same Income vs. Age scatter plot split by the line Age = 38; predict Safe for age >= 38 and Risky for age < 38.]
Depth 2: Split on Income >= $60K

[Figure: within one Age region, the second threshold split is the horizontal line Income = $60K.]
Each split partitions the 2-D space

[Figure: the (Age, Income) plane divided into three regions: Age < 38; Age >= 38 with Income >= $60K; and Age >= 38 with Income < $60K.]
Summary of decision trees
What you can do now
• Define a decision tree classifier
• Interpret the output of a decision tree
• Learn a decision tree classifier using a greedy algorithm
• Traverse a decision tree to make predictions
- Majority class predictions
- Probability predictions
- Multiclass classification