
Page 1:

Machine Learning (CSE 446): Perceptron

Noah Smith
© 2017 University of Washington
[email protected]

October 11, 2017

Page 2:

No office hours for Noah today.

John’s office hours will be 12–2 today.

Patrick’s office hours will run 12:30–2:30 on Thursday.

Page 3:

Happy Medium?

Decision trees (that aren’t too deep): use relatively few features to classify.

K-nearest neighbors: all features weighted equally.

Today: use all features, but weight them.

Page 4:

Happy Medium?

Decision trees (that aren’t too deep): use relatively few features to classify.

K-nearest neighbors: all features weighted equally.

Today: use all features, but weight them.

For today’s lecture, assume that y ∈ {−1, +1} instead of {0, 1}, and that x ∈ ℝ^d.

Page 5:

Happy Medium?

Decision trees (that aren’t too deep): use relatively few features to classify.

K-nearest neighbors: all features weighted equally.

Today: use all features, but weight them.

For today’s lecture, assume that y ∈ {−1, +1} instead of {0, 1}, and that x ∈ ℝ^d.

Remember that x[j] = φ_j(x). (Features have already been applied to the data.)

Page 6:

Inspiration from Neurons

Image from Wikimedia Commons.

Input signals come in through dendrites, output signal passes out through the axon.

Page 7:

Neuron-Inspired Classifier

[Diagram: each input x[1], . . . , x[d] is multiplied by its weight parameter w[1], . . . , w[d]; the products are summed together with the bias parameter b to form the “activation”, and the unit fires, or not, producing the output ŷ.]

Page 8:

Neuron-Inspired Classifier

f(x) = sign(w · x + b)

remembering that: w · x = ∑_{j=1}^{d} w[j] · x[j]

Page 9:

Neuron-Inspired Classifier

f(x) = sign(w · x + b)

remembering that: w · x = ∑_{j=1}^{d} w[j] · x[j]

Learning requires us to set the weights w and the bias b.
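Before turning to learning, here is a minimal Python sketch (mine, not from the slides) of the classifier itself; the function name predict and the choice to break a tie at activation exactly 0 toward −1 are illustrative assumptions:

```python
import numpy as np

def predict(w, b, x):
    """f(x) = sign(w . x + b), breaking the tie at activation 0 toward -1."""
    activation = np.dot(w, x) + b
    return 1 if activation > 0 else -1
```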

Page 10:

Perceptron Learning Algorithm

Data: D = ⟨(xn, yn)⟩_{n=1}^{N}, number of epochs E
Result: weights w and bias b
initialize: w = 0 and b = 0;
for e ∈ {1, . . . , E} do
    for n ∈ {1, . . . , N}, in random order do
        # predict
        ŷ = sign(w · xn + b);
        if ŷ ≠ yn then
            # update
            w ← w + yn · xn;
            b ← b + yn;
        end
    end
end
return w, b

Algorithm 1: PerceptronTrain
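As an illustration, here is a short NumPy sketch of Algorithm 1 (my own code, with illustrative names such as perceptron_train); it follows the pseudocode above directly:

```python
import numpy as np

def perceptron_train(X, y, epochs):
    """Perceptron training: X is an N x d array, y an array with entries in {-1, +1}."""
    N, d = X.shape
    w = np.zeros(d)   # weights, initialized to 0
    b = 0.0           # bias, initialized to 0
    for _ in range(epochs):
        for n in np.random.permutation(N):          # visit examples in random order
            y_hat = 1 if X[n] @ w + b > 0 else -1   # predict
            if y_hat != y[n]:                       # mistake: update
                w = w + y[n] * X[n]
                b = b + y[n]
    return w, b
```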

Page 11:

Linear Decision Boundary

[Figure: a linear decision boundary, the line w·x + b = 0, with the region where w·x + b > 0 on one side and the region where w·x + b < 0 on the other.]

Page 12:

Linear Decision Boundary

[Figure: the decision boundary w·x + b = 0; the activation of a point x is w·x + b.]

Page 13:

Linear Decision Boundary

w·x + b = 0

Pages 14-17:

Interpretation of Weight Values

What does it mean when . . .

▶ w[12] = 100?

▶ w[12] = −1?

▶ w[12] = 0?


Page 18:

Interpretation of Weight Values

What does it mean when . . .

▶ w[12] = 100?

▶ w[12] = −1?

▶ w[12] = 0?

What if we multiply w by 2?
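(A note of mine, not on the slide: if w and b are both scaled by a positive constant c, the predictions do not change, since sign(c·(w · x + b)) = sign(w · x + b); scaling w alone, however, changes how the activation trades off against the bias b.)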

Page 19:

Linear Separability

A dataset D = ⟨(xn, yn)⟩_{n=1}^{N} is linearly separable if there exists some linear classifier (defined by w, b) such that, for all n, yn = sign(w · xn + b).

margin(D, w, b) = { min_n yn · (w · xn + b)   if w and b separate D
                  { −∞                         otherwise

margin(D) = sup_{w, b} margin(D, w, b)
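As a small illustration (my own sketch; the name margin_of and the array layout are assumptions), the first definition can be computed directly for a given w and b:

```python
import numpy as np

def margin_of(X, y, w, b):
    """margin(D, w, b): min_n y_n (w . x_n + b) if (w, b) separates D, else -infinity."""
    activations = y * (X @ w + b)   # y_n * (w . x_n + b) for every example n
    if np.all(activations > 0):     # (w, b) classifies every example correctly
        return activations.min()
    return -np.inf
```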

Page 20:

Perceptron Convergence

Due to Rosenblatt (1958).

If D is linearly separable with margin γ > 0 and, for all n ∈ {1, . . . , N}, ‖xn‖₂ ≤ 1, then the perceptron algorithm will converge in at most 1/γ² updates.

Page 21:

Perceptron Convergence

Due to Rosenblatt (1958).

If D is linearly separable with margin γ > 0 and, for all n ∈ {1, . . . , N}, ‖xn‖₂ ≤ 1, then the perceptron algorithm will converge in at most 1/γ² updates.

▶ Proof can be found in Daumé (2017), pp. 50–51.

▶ The theorem does not guarantee that the perceptron’s classifier will achieve margin γ.
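For a concrete sense of the bound (my own example, not from the slides): if every example satisfies ‖xn‖₂ ≤ 1 and the data are separable with margin γ = 0.1, the perceptron makes at most 1/0.1² = 100 updates in total, no matter how large N is or how many epochs are run.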

Page 22:

What if D isn’t linearly separable?

Page 23:

What if D isn’t linearly separable?

No guarantees.

Page 24:

Voted Perceptron

Let w^{(e,n)} and b^{(e,n)} be the parameters after updating based on the nth example in epoch e.

ŷ = sign( ∑_{e=1}^{E} ∑_{n=1}^{N} sign(w^{(e,n)} · x + b^{(e,n)}) )

Page 25:

Voted Perceptron

Let w^{(e,n)} and b^{(e,n)} be the parameters after updating based on the nth example in epoch e.

ŷ = sign( ∑_{e=1}^{E} ∑_{n=1}^{N} sign(w^{(e,n)} · x + b^{(e,n)}) )

Do we need to store E ·N parameter settings?

Page 26:

Voted Perceptron

Let w^{(e,n)} and b^{(e,n)} be the parameters after updating based on the nth example in epoch e.

ŷ = sign( ∑_{e=1}^{E} ∑_{n=1}^{N} sign(w^{(e,n)} · x + b^{(e,n)}) )

Can do a little better by storing a “survival” time for each vector (the number of times an update didn’t need to be made because ŷ = yn).
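A small sketch (mine; the names snapshots and voted_predict are illustrative) of prediction with the survival-count compression just described: each stored (w, b) pair casts c votes, where c is its survival time:

```python
import numpy as np

def voted_predict(snapshots, x):
    """snapshots: list of (w, b, c) triples, where c is how long (w, b) survived."""
    total = 0.0
    for w, b, c in snapshots:
        vote = 1 if np.dot(w, x) + b > 0 else -1   # sign(w . x + b)
        total += c * vote                          # weight the vote by survival time
    return 1 if total > 0 else -1
```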

Pages 27-32:

Voted Perceptron

[A sequence of figure-only slides illustrating the voted perceptron.]

Page 33:

Averaged Perceptron

ŷ = sign( ∑_{e=1}^{E} ∑_{n=1}^{N} (w^{(e,n)} · x + b^{(e,n)}) )

Page 34:

Averaged Perceptron

ŷ = sign( ∑_{e=1}^{E} ∑_{n=1}^{N} (w^{(e,n)} · x + b^{(e,n)}) )

Do we need to store E ·N parameter settings?

Page 35:

Averaged Perceptron

ŷ = sign( ∑_{e=1}^{E} ∑_{n=1}^{N} (w^{(e,n)} · x + b^{(e,n)}) )

Do we need to store E ·N parameter settings?

What kind of decision boundary does the averaged perceptron have?

Page 36:

Averaged Perceptron

ŷ = sign( ∑_{e=1}^{E} ∑_{n=1}^{N} (w^{(e,n)} · x + b^{(e,n)}) )

  = sign( (∑_{e=1}^{E} ∑_{n=1}^{N} w^{(e,n)}) · x + ∑_{e=1}^{E} ∑_{n=1}^{N} b^{(e,n)} )

Do we need to store E ·N parameter settings?

What kind of decision boundary does the averaged perceptron have?
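The expansion above shows that the averaged perceptron still has a single linear decision boundary, defined by the summed weights and biases. Here is a simple (unoptimized) training sketch of my own, which keeps only running sums u and beta rather than E · N parameter settings:

```python
import numpy as np

def averaged_perceptron_train(X, y, epochs):
    """Averaged perceptron sketch: accumulate running sums of (w, b) after every example."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0       # current parameters
    u, beta = np.zeros(d), 0.0    # running sums over all (epoch, example) steps
    for _ in range(epochs):
        for n in np.random.permutation(N):
            y_hat = 1 if X[n] @ w + b > 0 else -1
            if y_hat != y[n]:                     # mistake: perceptron update
                w = w + y[n] * X[n]
                b = b + y[n]
            u += w                                # accumulate, whether or not we updated
            beta += b
    return u, beta                                # predict with sign(u . x + beta)
```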

Page 37:

References I

Hal Daumé. A Course in Machine Learning (v0.9). Self-published at http://ciml.info/, 2017.

Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.
