CSE 446: Machine Learning
Linear classifiers: Scaling up learning via SGD
Emily Fox, University of Washington
January 27, 2017
©2017 Emily Fox
Stochastic gradient descent: Learning, one data point at a time
Stochastic gradient ascent

[Figure: instead of using all the data, compute each gradient from only a small subset of data, updating the coefficients w(t) → w(t+1) → w(t+2) → w(t+3) → w(t+4) → … This gives many coefficient updates for each pass over the data.]
Stochastic gradient ascent for logistic regression

init w(1)=0, t=1
until converged
  for i=1,…,N                  ← each time, pick a different data point i
    for j=0,…,D
      partial[j] = hj(xi)(1[yi=+1] – P(y=+1 | xi, w(t)))
      wj(t+1) ← wj(t) + η partial[j]
    t ← t + 1
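To make this concrete, here is a minimal NumPy sketch of the loop above; the function name sgd_logistic, the fixed number of passes standing in for "until converged", and the per-pass shuffling are illustrative choices, not from the slides:

import numpy as np

def sgd_logistic(X, y, eta=0.1, n_passes=10, seed=0):
    # X: (N, D+1) features with h0(x) = 1 in column 0; y: (N,) labels in {+1, -1}.
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    w = np.zeros(X.shape[1])                      # init w(1) = 0
    for _ in range(n_passes):                     # stands in for "until converged"
        for i in rng.permutation(N):              # each time, pick a different data point i
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))   # P(y=+1 | xi, w(t))
            indicator = 1.0 if y[i] == +1 else 0.0
            partial = X[i] * (indicator - p)      # per-point gradient, all j at once
            w = w + eta * partial                 # wj(t+1) <- wj(t) + η partial[j]
    return w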
Why would stochastic gradient ever work???
Gradient is direction of steepest ascent

Gradient is the “best” direction, but any direction that goes “up” would be useful.
In ML, steepest direction is sum of “little directions” from each data point

The gradient is a sum over data points: ∇ℓ(w) = Σi ∇ℓi(w). For most data points, the contribution ∇ℓi(w) points “up” (increases the likelihood).
Stochastic gradient: Pick a data point and move in its direction

The full gradient Σi ∇ℓi(w) is approximated (≈) by the contribution of a single, randomly chosen data point i. Most of the time, the total likelihood will still increase.
Stochastic gradient ascent: Most iterations increase the likelihood, but some decrease it. On average, we make progress.

until converged
  for i=1,…,N
    for j=0,…,D
      wj(t+1) ← wj(t) + η ∂ℓi(w(t))/∂wj
    t ← t + 1
Convergence path
Convergence paths

[Figure: in coefficient space, gradient ascent follows a smooth path toward the optimum, while stochastic gradient follows a noisy, zig-zagging path.]
Stochastic gradient convergence is “noisy”

[Plot: avg. log likelihood (higher is better) vs. # passes over data.]
Gradient usually increases likelihood smoothly.
Stochastic gradient makes “noisy” progress.
Total time is proportional to # passes over data.
Stochastic gradient achieves higher likelihood sooner, but it’s noisier.
Eventually, gradient catches up

[Plot: avg. log likelihood (higher is better) vs. # passes over data, for gradient and stochastic gradient.]
Note: you should only trust the “average” quality of stochastic gradient.
The last coefficients may be really good or really bad!!

E.g., w(1000) was bad, but w(1005) was good. How do we minimize the risk of picking bad coefficients? Stochastic gradient will eventually oscillate around a solution. To minimize noise, don’t return the last learned coefficients; output the average instead:

ŵ = (1/T) Σt=1…T w(t)
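As a tiny sketch of this averaging step, assuming the iterates w(1), …, w(T) were saved in a Python list during training (the variable names are made up for illustration):

import numpy as np

ws = [np.array([0.1, -0.2]), np.array([0.3, -0.1]), np.array([0.2, -0.15])]  # w(1)…w(T)
w_hat = np.mean(ws, axis=0)   # ŵ = (1/T) Σ w(t): average, don't take the last iterate
print(w_hat)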
Summary of why stochastic gradient works
• Gradient finds direction of steepest ascent
• Gradient is sum of contributions from each data point
• Stochastic gradient uses direction from 1 data point
• On average increases likelihood, sometimes decreases
• Stochastic gradient has “noisy” convergence
Online learning: Fitting models from streaming data
Batch vs online learning

Batch learning
• All data is available at start of training time
[Diagram: Data → ML algorithm → ŵ]

Online learning
• Data arrives (streams in) over time
- Must train model as data arrives!
[Diagram: data at t=1, 2, 3, 4, … streams into the ML algorithm, which outputs updated coefficients ŵ(1), ŵ(2), ŵ(3), ŵ(4), …]
Online learning example: Ad targeting

[Diagram: a website produces input xt (user info, page text). The ML algorithm, with current coefficients ŵ(t), outputs ŷ = suggested ads (Ad1, Ad2, Ad3). The user clicks on Ad2, so yt = Ad2, and the algorithm updates to ŵ(t+1).]
Online learning problem
• Data arrives over each time step t:
- Observe input xt
  • Info of user, text of webpage
- Make a prediction ŷt
  • Which ad to show
- Observe true output yt
  • Which ad user clicked on

Need ML algorithm to update coefficients each time step!
Stochastic gradient ascent can be used for online learning!!!

• init w(1)=0, t=1
• Each time step t:
- Observe input xt
- Make a prediction ŷt
- Observe true output yt
- Update coefficients:
  for j=0,…,D
    wj(t+1) ← wj(t) + η ∂ℓt(w(t))/∂wj
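A hypothetical sketch of this online loop for a logistic-regression-style model; stream_of_examples, the feature encoding, and the fixed learning rate eta are stand-ins, not part of the slides:

import numpy as np

def online_sgd(stream_of_examples, D, eta=0.1):
    # stream_of_examples yields (x_t, y_t) over time; x_t is (D+1,), y_t in {+1, -1}.
    w = np.zeros(D + 1)                      # init w(1) = 0
    for x_t, y_t in stream_of_examples:      # each time step t
        p = 1.0 / (1.0 + np.exp(-x_t @ w))   # P(y=+1 | xt, w(t))
        y_hat = +1 if p >= 0.5 else -1       # make a prediction ŷt (shown to user)
        # ... after acting on y_hat, we observe the true output y_t ...
        indicator = 1.0 if y_t == +1 else 0.0
        w = w + eta * x_t * (indicator - p)  # update coefficients immediately
    return w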
Summary of online learning
• Data arrives over time
• Must make a prediction every time a new data point arrives
• Observe true class after prediction made
• Want to update parameters immediately
Summary of stochastic gradient descent
What you can do now…
• Significantly speed up learning algorithm using stochastic gradient
• Describe intuition behind why stochastic gradient works
• Apply stochastic gradient in practice
• Describe online learning problems
• Relate stochastic gradient to online learning
Decision Trees
Emily Fox, University of Washington, January 27, 2017
Predicting potential loan defaults
What makes a loan risky?

I want to buy a new house!
[Loan Application: Credit History ★★★★, Income ★★★, Term ★★★★★, Personal Info ★★★]
Credit history explained

Credit History ★★★★: Did I pay previous loans on time? Example: excellent, good, or fair.
Income

Income ★★★: What’s my income? Example: $80K per year.
Loan terms

Term ★★★★★: How soon do I need to pay the loan? Example: 3 years, 5 years, …
Personal information

Personal Info ★★★: Age, reason for the loan, marital status, … Example: Home loan for a married couple.
Intelligent application

[Diagram: Loan Applications → Intelligent loan application review system → Safe ✓ / Risky ✘]
Classifier review

[Diagram: Loan Application → Input xi → Classifier MODEL → Output ŷ, the predicted class: Safe (ŷi = +1) or Risky (ŷi = -1)]
This module ... decision trees

Start
└─ Credit?
   ├─ excellent → Safe
   ├─ fair → Term?
   │         ├─ 3 years → Risky
   │         └─ 5 years → Safe
   └─ poor → Income?
             ├─ high → Term?
             │         ├─ 3 years → Risky
             │         └─ 5 years → Safe
             └─ low  → Risky
Scoring a loan application

xi = (Credit = poor, Income = high, Term = 5 years)

Traverse the tree from the Start node: Credit = poor → Income = high → Term = 5 years → Safe.

ŷi = Safe
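A small sketch of this traversal; the nested-dict tree encoding below is invented here for illustration (the slides don’t prescribe a data structure):

# Internal nodes: {"split": feature, "children": {value: subtree}}; leaves: class labels.
tree = {
    "split": "Credit",
    "children": {
        "excellent": "Safe",
        "fair": {"split": "Term",
                 "children": {"3 years": "Risky", "5 years": "Safe"}},
        "poor": {"split": "Income",
                 "children": {"low": "Risky",
                              "high": {"split": "Term",
                                       "children": {"3 years": "Risky",
                                                    "5 years": "Safe"}}}},
    },
}

def classify(node, x):
    # Walk from the root to a leaf, following x's feature values.
    while isinstance(node, dict):
        node = node["children"][x[node["split"]]]
    return node

xi = {"Credit": "poor", "Income": "high", "Term": "5 years"}
print(classify(tree, xi))   # -> Safe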
Decision tree learning task
Decision tree learning problem

Optimize quality metric on training data.
Training data: N observations (xi, yi)

Credit     Term   Income  y
excellent  3 yrs  high    safe
fair       5 yrs  low     risky
fair       3 yrs  high    safe
poor       5 yrs  high    risky
excellent  3 yrs  low     risky
fair       5 yrs  low     safe
poor       3 yrs  high    risky
poor       5 yrs  low     safe
fair       3 yrs  high    safe

The learned model is a tree T(X).
Quality metric: Classification error
• Error measures fraction of mistakes
- Best possible value: 0.0
- Worst possible value: 1.0

Error = (# incorrect predictions) / (# examples)
How do we find the best tree?

The exponentially large number of possible trees (T1(X), T2(X), T3(X), …) makes decision tree learning hard! Learning the smallest decision tree is an NP-hard problem [Hyafil & Rivest ’76].
Greedy decision tree learning
Our training data table

Assume N = 40 and 3 features (the table shows a sample of rows):

Credit     Term   Income  y
excellent  3 yrs  high    safe
fair       5 yrs  low     risky
fair       3 yrs  high    safe
poor       5 yrs  high    risky
excellent  3 yrs  low     risky
fair       5 yrs  low     safe
poor       3 yrs  high    risky
poor       5 yrs  low     safe
fair       3 yrs  high    safe
Start with all the data

(all data) Loan status: Safe / Risky
N = 40 examples: 22 safe loans, 18 risky loans.
Compact visual notation: Root node

Root [22 18]   ← # of Safe loans, # of Risky loans (N = 40 examples)
Decision stump: Single level tree

Split on Credit (counts are [Safe Risky]):

Root [22 18]
└─ Credit?
   ├─ excellent [9 0]   ← subset of data with Credit = excellent
   ├─ fair [9 4]        ← subset of data with Credit = fair
   └─ poor [4 14]       ← subset of data with Credit = poor
Visual notation: Intermediate nodes

Root [22 18]
└─ Credit?
   ├─ excellent [9 0]
   ├─ fair [9 4]
   └─ poor [4 14]

The nodes excellent, fair, and poor are intermediate nodes.
Making predictions with a decision stump

For each intermediate node, set ŷ = majority value:
excellent [9 0] → Safe, fair [9 4] → Safe, poor [4 14] → Risky.
Selecting best feature to split on
How do we learn a decision stump?

Root [22 18] → Credit? → excellent [9 0], fair [9 4], poor [4 14]

Find the “best” feature to split on!
How do we select the best feature?

Choice 1: Split on Credit
Root [22 18] → Credit? → excellent [9 0], fair [9 4], poor [4 14]

Choice 2: Split on Term
Root [22 18] → Term? → 3 years [16 4], 5 years [6 14]
How do we measure effectiveness of a split?

Error = (# mistakes) / (# data points)

Idea: calculate the classification error of this decision stump.
Root [22 18] → Credit? → excellent [9 0], fair [9 4], poor [4 14]
Calculating classification error
• Step 1: ŷ = class of majority of data in node
• Step 2: Calculate classification error of predicting ŷ for this data

At the root [22 18], ŷ = majority class = Safe, giving 22 correct and 18 mistakes:
Error = 18/40 = 0.45

Tree     Classification error
(root)   0.45
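As a quick arithmetic check of these two steps (counts hard-coded from the root node above):

n_safe, n_risky = 22, 18                            # root node counts
y_hat = "Safe" if n_safe >= n_risky else "Risky"    # Step 1: majority class
error = min(n_safe, n_risky) / (n_safe + n_risky)   # Step 2: mistakes / # examples
print(y_hat, error)                                 # -> Safe 0.45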
Choice 1: Split on Credit history?

Does a split on Credit reduce classification error below 0.45?
Root [22 18] → Credit? → excellent [9 0], fair [9 4], poor [4 14]
Split on Credit: Classification error

Root [22 18] → Credit? → excellent [9 0] → Safe, fair [9 4] → Safe, poor [4 14] → Risky
Mistakes: 0 + 4 + 4 = 8

Error = 8/40 = 0.2

Tree             Classification error
(root)           0.45
Split on credit  0.2
Choice 2: Split on Term?

Root [22 18] → Term? → 3 years [16 4] → Safe, 5 years [6 14] → Risky
Evaluating the split on Term

Root [22 18] → Term? → 3 years [16 4] → Safe (4 mistakes), 5 years [6 14] → Risky (6 mistakes)

Error = (4 + 6)/40 = 0.25

Tree             Classification error
(root)           0.45
Split on credit  0.2
Split on term    0.25
Choice 1 vs Choice 2: Comparing split on Credit vs Term

Choice 1: Split on Credit — Root [22 18] → excellent [9 0], fair [9 4], poor [4 14]
Choice 2: Split on Term — Root [22 18] → 3 years [16 4], 5 years [6 14]

Tree                Classification error
(root)              0.45
split on credit     0.2
split on loan term  0.25

Winner: split on Credit, the lowest classification error.
Feature split selection algorithm
• Given a subset of data M (a node in a tree)
• For each feature hi(x):
1. Split data of M according to feature hi(x)
2. Compute classification error of the split
• Choose feature h*(x) with lowest classification error
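A minimal sketch of this selection step, assuming categorical features stored as dicts and labels in a parallel list; the helper names majority_class, split_error, and best_split_feature are illustrative, not from the slides:

from collections import Counter

def majority_class(labels):
    # ŷ for a node: the most common class among its data points.
    return Counter(labels).most_common(1)[0][0]

def split_error(data, labels, feature):
    # Classification error of a decision stump that splits on `feature`.
    mistakes = 0
    for v in set(x[feature] for x in data):
        subset = [y for x, y in zip(data, labels) if x[feature] == v]
        y_hat = majority_class(subset)               # majority vote in this child
        mistakes += sum(1 for y in subset if y != y_hat)
    return mistakes / len(labels)

def best_split_feature(data, labels, features):
    # Choose the feature h*(x) with lowest classification error.
    return min(features, key=lambda f: split_error(data, labels, f))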
Recursion & Stopping conditions
We’ve learned a decision stump, what next?

Root [22 18] → Credit? → excellent [9 0], fair [9 4], poor [4 14]

All data points in the excellent node are Safe, so there is nothing else to do with this subset of data: it becomes a leaf node predicting Safe.
Tree learning = Recursive stump learning

Root [22 18] → Credit? → excellent [9 0] → Safe, fair [9 4], poor [4 14]

Build a decision stump with the subset of data where Credit = fair, and another with the subset of data where Credit = poor.
Second level

Root [22 18]
└─ Credit?
   ├─ excellent [9 0] → Safe
   ├─ fair [9 4] → Term?
   │               ├─ 3 years [0 4] → Risky
   │               └─ 5 years [9 0] → Safe
   └─ poor [4 14] → Income?
                    ├─ high [4 5] → (build another stump for these data points)
                    └─ low [0 9] → Risky
Final decision tree

Root [22 18]
└─ Credit?
   ├─ excellent [9 0] → Safe
   ├─ fair [9 4] → Term?
   │               ├─ 3 years [0 4] → Risky
   │               └─ 5 years [9 0] → Safe
   └─ poor [4 14] → Income?
                    ├─ high [4 5] → Term?
                    │               ├─ 3 years [0 2] → Risky
                    │               └─ 5 years [4 3] → Safe
                    └─ low [0 9] → Risky
Simple greedy decision tree learning
• Pick best feature to split on
• Learn decision stump with this split
• For each leaf of decision stump, recurse

When do we stop???
Stopping condition 1: All data agrees on y

In the final tree, the nodes excellent [9 0], 5 years [9 0], 3 years [0 4], low [0 9], and 3 years [0 2] each contain data with the same y value, so there is nothing to do: they become leaves.
Stopping condition 2: Already split on all features

The node 5 years [4 3] (under Credit = poor, Income = high) has already split on all possible features, so there is nothing to do: it becomes a leaf predicting the majority class, Safe.
Greedy decision tree learning (with recursion & stopping conditions 1 & 2)
• Step 1: Start with an empty tree
• Step 2: Select a feature to split data (pick the feature split leading to lowest classification error)
• For each split of the tree:
• Step 3: If nothing more to do, make predictions
• Step 4: Otherwise, go to Step 2 & continue (recurse) on this split
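Combining the pieces, a hedged sketch of this recursion, reusing the majority_class and best_split_feature helpers sketched earlier and the same nested-dict tree format as the classify example:

def build_tree(data, labels, features):
    # Greedy recursive decision tree learning with stopping conditions 1 & 2.
    if len(set(labels)) == 1:          # stopping condition 1: all data agrees on y
        return labels[0]
    if not features:                   # stopping condition 2: split on all features
        return majority_class(labels)
    f = best_split_feature(data, labels, features)   # lowest-error split
    remaining = [g for g in features if g != f]
    children = {}
    for v in set(x[f] for x in data):
        pairs = [(x, y) for x, y in zip(data, labels) if x[f] == v]
        xs, ys = [x for x, _ in pairs], [y for _, y in pairs]
        children[v] = build_tree(xs, ys, remaining)  # recurse on this split
    return {"split": f, "children": children}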
Is this a good idea?

Proposed stopping condition 3: Stop if no split reduces the classification error.
Stopping condition 3: Don’t stop if error doesn’t decrease???

Consider y = x[1] xor x[2]:

x[1]   x[2]   y
False  False  False
False  True   True
True   False  True
True   True   False

Root [2 2] (y values: True False)
Error = 2/4 = 0.5

Tree     Classification error
(root)   0.5
Consider split on x[1]

Root [2 2] → x[1]? → True [1 1], False [1 1]
Error = (1 + 1)/(2 + 2) = 0.5

Tree           Classification error
(root)         0.5
Split on x[1]  0.5
Consider split on x[2]

Root [2 2] → x[2]? → True [1 1], False [1 1]
Error = (1 + 1)/(2 + 2) = 0.5

Tree           Classification error
(root)         0.5
Split on x[1]  0.5
Split on x[2]  0.5

Neither feature improves training error… Stop now???
Final tree with stopping condition 3

Root [2 2] → Predict True

Tree                       Classification error
with stopping condition 3  0.5
Without stopping condition 3

Root [2 2]
└─ x[1]?
   ├─ True [1 1] → x[2]?
   │               ├─ True [0 1] → False
   │               └─ False [1 0] → True
   └─ False [1 1] → x[2]?
                    ├─ True [1 0] → True
                    └─ False [0 1] → False

Tree                          Classification error
with stopping condition 3     0.5
without stopping condition 3  0.0

Condition 3 (stopping when training error doesn’t improve) is not recommended!
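To see the recursion succeed where proposed condition 3 would have quit, the build_tree and classify sketches from earlier can be run on this XOR table (the keys x1 and x2 stand for x[1] and x[2]; this demo is an illustration, not from the slides):

data = [{"x1": False, "x2": False}, {"x1": False, "x2": True},
        {"x1": True, "x2": False}, {"x1": True, "x2": True}]
labels = [False, True, True, False]              # y = x[1] xor x[2]

xor_tree = build_tree(data, labels, ["x1", "x2"])
preds = [classify(xor_tree, x) for x in data]
print(preds == labels)   # -> True: 0.0 training error without stopping condition 3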
Decision tree learning: Real valued features
How do we use real-valued inputs?

Income  Credit     Term   y
$105K   excellent  3 yrs  Safe
$112K   good       5 yrs  Risky
$73K    fair       3 yrs  Safe
$69K    excellent  5 yrs  Safe
$217K   excellent  3 yrs  Risky
$120K   good       5 yrs  Safe
$64K    fair       3 yrs  Risky
$340K   excellent  5 yrs  Safe
$60K    good       3 yrs  Risky
Threshold split

Split on the feature Income:
Root [22 18] → Income? → < $60K [8 13], >= $60K [14 5]
(The >= $60K node is the subset of data with Income >= $60K.)
Finding the best threshold split

[Figure: data points plotted on an Income axis from $10K to $120K, labeled Safe or Risky; a threshold Income = t* splits them into Income < t* and Income >= t*.]
There are infinitely many possible values of t.
Consider a threshold between points

[Figure: two adjacent data points on the Income axis at values vA and vB.]
The classification error is the same for any threshold split between vA and vB.
Only need to consider mid-points

[Figure: candidate thresholds at the mid-points between consecutive Income values.]
This leaves a finite number of splits to consider.
Threshold split selection algorithm
• Step 1: Sort the values of a feature hj(x):
Let {v1, v2, v3, …, vN} denote the sorted values
• Step 2:
- For i = 1 … N-1
  • Consider split ti = (vi + vi+1) / 2
  • Compute classification error for threshold split hj(x) >= ti
- Choose the t* with the lowest classification error
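A minimal sketch of this algorithm for one numeric feature, reusing the majority_class helper from earlier; feature values and labels are parallel lists, and best_threshold is an illustrative name:

def best_threshold(values, labels):
    # Pick the threshold t* minimizing classification error of the split value >= t.
    pairs = sorted(zip(values, labels))              # Step 1: sort the values
    vs = [v for v, _ in pairs]
    best_t, best_err = None, float("inf")
    for i in range(len(vs) - 1):                     # Step 2: for i = 1 … N-1
        if vs[i] == vs[i + 1]:
            continue                                 # no threshold between equal values
        t = (vs[i] + vs[i + 1]) / 2                  # mid-point candidate ti
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        mistakes = sum(1 for y in left if y != majority_class(left)) \
                 + sum(1 for y in right if y != majority_class(right))
        err = mistakes / len(labels)
        if err < best_err:                           # choose t* with lowest error
            best_t, best_err = t, err
    return best_t, best_err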
Visualizing the threshold split

[Figure: scatter plot of Income ($0K, $40K, $80K, …) vs. Age (0, 10, 20, 30, 40, …); the threshold split is the vertical line Age = 38.]
Split on Age >= 38

[Figure: the same Income vs. Age scatter plot split by the line Age = 38; predict Safe for age >= 38 and Risky for age < 38.]
Depth 2: Split on Income >= $60K

[Figure: within one Age region, the second threshold split is the horizontal line Income = $60K.]
Each split partitions the 2-D space

[Figure: the (Age, Income) plane divided into three regions: Age < 38; Age >= 38 with Income >= $60K; and Age >= 38 with Income < $60K.]
Summary of decision trees
What you can do now
• Define a decision tree classifier
• Interpret the output of a decision tree
• Learn a decision tree classifier using a greedy algorithm
• Traverse a decision tree to make predictions
- Majority class predictions
- Probability predictions
- Multiclass classification