perceptron - kangwon


Perceptron

Some slides from CS546 …


Linear Functions

f(x) = 1 if w1 x1 + w2 x2 + … + wn xn ≥ θ, and 0 otherwise

•  Disjunctions: y = x1 ∨ x3 ∨ x5
   y = (1•x1 + 1•x3 + 1•x5 ≥ 1)
•  At least m of n: y = at least 2 of {x1, x3, x5}
   y = (1•x1 + 1•x3 + 1•x5 ≥ 2)
•  Exclusive-OR: y = (x1 Λ ¬x2) ∨ (¬x1 Λ x2)
•  Non-trivial DNF: y = (x1 Λ x2) ∨ (x3 Λ x4)

(The first two are linear threshold functions; the last two are not linearly separable, as the sketch below illustrates.)
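To make the contrast concrete, here is a minimal Python sketch (not from the slides; the weights and thresholds mirror the examples above) of fixed-weight linear threshold units for the first two functions; no such unit exists for XOR or the non-trivial DNF.

def ltu(weights, threshold, x):
    # Linear threshold unit: 1 if sum_i w_i * x_i >= threshold, else 0
    return int(sum(w * xi for w, xi in zip(weights, x)) >= threshold)

# y = x1 OR x3 OR x5, over inputs given as (x1, x3, x5): weights 1, 1, 1 and threshold 1
disjunction = lambda x: ltu([1, 1, 1], 1, x)

# y = "at least 2 of {x1, x3, x5}": same weights, threshold 2
at_least_2 = lambda x: ltu([1, 1, 1], 2, x)

print(disjunction([0, 1, 0]))   # 1
print(at_least_2([0, 1, 0]))    # 0
print(at_least_2([1, 1, 0]))    # 1
# XOR cannot be written this way: it is not linearly separable.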


Linear Functions

[Figure: a linear separator in the plane, showing the hyperplanes w·x = 0 and w·x = θ with negative examples on one side]

Some Biology

•  Very loose inspiration: human neurons

Perceptrons abstract from the details of real neurons

•  Conductivity delays are neglected
•  An output signal is either discrete (e.g., 0 or 1) or a real-valued number (e.g., between 0 and 1)
•  Net input is calculated as the weighted sum of the input signals
•  Net input is transformed into an output signal via a simple function (e.g., a threshold function)

Different Activation Functions

•  Threshold Activation Function (step)
•  Piecewise Linear Activation Function
•  Sigmoid Activation Function
•  Gaussian Activation Function
   –  Radial Basis Function
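A minimal Python sketch of the four activation types listed above (the particular slopes, centers, and widths are illustrative choices, not values from the slides):

import math

def step(z, theta=0.0):
    # Threshold (step) activation
    return 1.0 if z >= theta else 0.0

def piecewise_linear(z):
    # Piecewise linear activation: a ramp clipped to [0, 1]
    return min(1.0, max(0.0, 0.5 + z))

def sigmoid(z):
    # Sigmoid (logistic) activation
    return 1.0 / (1.0 + math.exp(-z))

def gaussian(z, mu=0.0, sigma=1.0):
    # Gaussian activation, the 1-D form of a radial basis function
    return math.exp(-((z - mu) ** 2) / (2 * sigma ** 2))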

Bias unit with x0 = 1

Types of Activation Functions

The Perceptron

[Figure: input features feeding into a linear threshold unit (LTU) or sigmoid unit]

The Binary Perceptron

•  Inputs are features
•  Each feature has a weight
•  The sum is the activation
•  If the activation is:
   –  Positive, output 1
   –  Negative, output 0

[Figure: features f1, f2, f3 with weights w1, w2, w3 feeding a summation node Σ and a threshold test > 0]
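Read as code, the decision rule above is a weighted sum followed by a sign check; a minimal sketch with three hypothetical feature values:

def perceptron_output(features, weights):
    # Binary Perceptron: output 1 if the activation sum_i w_i * f_i is positive, else 0
    activation = sum(w * f for w, f in zip(weights, features))
    return 1 if activation > 0 else 0

print(perceptron_output([1.0, 0.0, 2.0], [0.5, -1.0, 0.25]))   # activation = 1.0, so output 1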

Perceptron learning rule

•  On-line, mistake-driven algorithm.
•  Rosenblatt (1959) suggested that when a target output value is provided for a single neuron with fixed input, it can incrementally change weights and learn to produce the output using the Perceptron learning rule.
•  Perceptron == Linear Threshold Unit

[Figure: a linear threshold unit with inputs xi, weights wi, a summation/threshold node (Σ, T), and output y]

ŷ = Σi wi xi = wᵀx


Perceptron learning rule

•  We learn f: X → {-1, +1}, represented as f = sgn(w•x), where X = {0,1}ⁿ or X = Rⁿ and w ∈ Rⁿ
•  Given labeled examples: {(x1, y1), (x2, y2), …, (xm, ym)}

1.  Initialize w = 0 ∈ Rⁿ
2.  Cycle through all examples:
    a.  Predict the label of instance x to be y' = sgn(w•x)
    b.  If y' ≠ y, update the weight vector: w = w + r y x (r is a constant, the learning rate). Otherwise, if y' = y, leave the weights unchanged.
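A direct transcription of steps 1-2 into Python (a sketch; the toy data and the learning rate r = 1 are illustrative, not from the slides):

import numpy as np

def perceptron_train(X, y, r=1.0, epochs=10):
    # Perceptron learning rule: w <- w + r * y * x on every mistake
    w = np.zeros(X.shape[1])                           # 1. initialize w = 0
    for _ in range(epochs):                            # 2. cycle through all examples
        for xi, yi in zip(X, y):
            y_pred = 1 if np.dot(w, xi) >= 0 else -1   # a. predict sgn(w . x)
            if y_pred != yi:                           # b. update only on mistakes
                w = w + r * yi * xi
    return w

# Toy linearly separable data with labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron_train(X, y))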


Footnote About the Threshold

•  On the previous slide, the Perceptron has no threshold
•  But we don't lose generality: fold the threshold into the weight vector by adding a constant input x0 = 1, i.e., map x ↦ (x, 1) and w ↦ (w, -θ). Then, for all x,

   w • x = θ  ⇔  (w, -θ) • (x, 1) = 0

   so a unit with threshold θ is equivalent to a zero-threshold unit over the augmented input.
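In code, the reduction is just appending a constant feature x0 = 1 so the threshold is learned as one more weight (a sketch under that convention):

import numpy as np

def augment(X):
    # Append a constant input x0 = 1; the threshold theta becomes the extra weight -theta
    return np.hstack([X, np.ones((X.shape[0], 1))])

X = np.array([[2.0, 1.0], [-1.0, -2.0]])
print(augment(X))   # each row gains a trailing 1, so w.x >= theta  <=>  (w, -theta).(x, 1) >= 0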


Geometric View


Deriving the delta rule

•  Define the error as the squared residuals summed over all training cases:

•  Now differentiate to get error derivatives for weights

•  The batch delta rule changes the weights in proportion to their error derivatives summed over all training cases

E = Σn En = ½ Σn (yn - ŷn)²   (yn: target, ŷn = Σi wi xi,n: the unit's output)

∂E/∂wi = Σn (∂ŷn/∂wi)(dEn/dŷn) = -Σn xi,n (yn - ŷn)

Δwi = -ε ∂E/∂wi = ε Σn xi,n (yn - ŷn)
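A minimal batch delta-rule sketch for a linear neuron ŷ = w•x (the learning rate ε and the toy data are illustrative):

import numpy as np

def delta_rule_batch(X, y, epsilon=0.05, epochs=100):
    # Batch delta rule: w_i <- w_i + epsilon * sum_n x_{i,n} * (y_n - y_hat_n)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y_hat = X @ w                    # linear neuron outputs for all training cases
        grad = -X.T @ (y - y_hat)        # dE/dw summed over all training cases
        w = w - epsilon * grad           # move against the error derivative
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
print(delta_rule_batch(X, y))            # approaches w = [1, 2]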


Perceptron Learnability

•  Obviously, the Perceptron can't learn what it can't represent
   –  Only linearly separable functions
•  Minsky and Papert (1969) wrote an influential book demonstrating the Perceptron's representational limitations
   –  Parity functions can't be learned (XOR)
   –  In vision, if patterns are represented with local features, it can't represent symmetry or connectivity
•  Research on Neural Networks stopped for years
•  Rosenblatt himself (1959) asked, "What pattern recognition problems can be transformed so as to become linearly separable?"


Perceptron Convergence

•  Perceptron Convergence Theorem: If there exists a set of weights consistent with the data (i.e., the data is linearly separable), the Perceptron learning algorithm will converge.
   –  How long would it take to converge?
•  Perceptron Cycling Theorem: If the training data is not linearly separable, the Perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop.
   –  How can we provide robustness and more expressivity?


Perceptron: Mistake Bound Theorem

•  Maintains a weight vector w ∈ RN, w0 = (0,…,0).
•  Upon receiving an example x ∈ RN, predicts according to the linear threshold function w•x ≥ 0.

Theorem [Novikoff, 1963]: Let (x1, y1),…,(xt, yt) be a sequence of labeled examples with xi ∈ RN, ||xi|| ≤ R and yi ∈ {-1, 1} for all i. Let u ∈ RN, γ > 0 be such that ||u|| = 1 and yi u • xi ≥ γ for all i. Then the Perceptron makes at most R² / γ² mistakes on this example sequence. (See additional notes.)

Here γ is the margin and R² / γ² plays the role of the complexity parameter.
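As a concrete (hypothetical) instance of the bound: if every example satisfies ||xi|| ≤ R = 10 and the data is separable with margin γ = 0.5, the theorem guarantees at most R² / γ² = 100 / 0.25 = 400 mistakes, however long the example sequence is.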


Perceptron Mistake Bound Proof: Let vk be the hypothesis before the k-th mistake. Assume that the k-th mistake occurs on the input example (xi, yi), so the update is vk+1 = vk + yi xi.

Assumptions: v1 = 0, ||u|| = 1, yi u • xi ≥ γ, ||xi|| ≤ R.

•  Multiply by u: u • vk+1 = u • vk + yi (u • xi) ≥ u • vk + γ, by the definition of u. By induction, u • vk+1 ≥ kγ.
•  Bound the norm: ||vk+1||² = ||vk||² + 2 yi (vk • xi) + ||xi||² ≤ ||vk||² + R², since a mistake means yi (vk • xi) ≤ 0. By induction, ||vk+1||² ≤ kR².
•  Projection: kγ ≤ u • vk+1 ≤ ||u|| ||vk+1|| ≤ √k R, so the total number of mistakes K satisfies K ≤ R² / γ².


Dual Perceptron


-  We can replace xi · xj with K(xi, xj), which can be regarded as a dot product in some large (or infinite-dimensional) space

-  K(x, y) can often be computed efficiently without computing the mapping to this space
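A minimal sketch of the dual (kernelized) Perceptron, with a hypothetical degree-2 polynomial kernel standing in for K; the mistake counts ci play the role of the dual coefficients:

import numpy as np

def poly_kernel(x, z, degree=2):
    # Example kernel: K(x, z) = (1 + x.z)^degree, a dot product in a larger feature space
    return (1.0 + np.dot(x, z)) ** degree

def dual_perceptron_train(X, y, kernel=poly_kernel, epochs=10):
    # Dual Perceptron: store per-example mistake counts c_i instead of an explicit weight vector
    m = len(X)
    c = np.zeros(m)
    for _ in range(epochs):
        for j in range(m):
            # f(x_j) = sum_i c_i * y_i * K(x_i, x_j)
            score = sum(c[i] * y[i] * kernel(X[i], X[j]) for i in range(m))
            if y[j] * score <= 0:
                c[j] += 1
    return c

# XOR-like labels: not linearly separable in the original space, but separable under this kernel
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])
print(dual_perceptron_train(X, y))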


Efficiency

•  Computation is dominated by the size of the feature space
•  Most features are functions (e.g., conjunctions) of raw attributes:
   X(x1, x2, x3, …, xk) → Χ(χ1(x), χ2(x), χ3(x), …, χn(x)),  n >> k
•  Additive algorithms allow the use of kernels: no need to explicitly generate the complex features
   f(x) = Σi ci K(x, xi)
•  Kernels could be more efficient, since the work is done in the original feature space
•  In practice: explicit kernels (feature-space blow-up) are often more efficient

Which vi should we use?

Maybe the last one?

Here it’s never gotten any test cases right! (Experimentally, the classifiers move around a lot.)

Maybe the “best one”?

But we “improved” it with later mistakes…

Voted-Perceptron

Idea two: keep around the intermediate hypotheses, and have them "vote" [Freund and Schapire, 1998]

n = 1, w1 = 0, c1 = 0
for k = 1 to K:
    for i = 1 to m:
        if (xi, yi) is misclassified:
            wn+1 = wn + yi xi
            cn+1 = 1
            n = n + 1
        else:
            cn = cn + 1

At the end we have a collection of linear separators w0, w1, w2, …, along with survival times: cn = the amount of time that wn survived. This cn is a good measure of the reliability of wn. To classify a test point x, use a weighted majority vote: ŷ = sign(Σn cn sign(wn • x)).
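A sketch of the pseudocode above in Python (the toy data and the number of passes K are placeholders); prediction uses the weighted majority vote sign(Σn cn sign(wn • x)):

import numpy as np

def voted_perceptron_train(X, y, K=10):
    # Voted Perceptron: keep every intermediate w_n and its survival count c_n
    ws, cs = [np.zeros(X.shape[1])], [0]
    for _ in range(K):
        for xi, yi in zip(X, y):
            if yi * np.dot(ws[-1], xi) <= 0:     # misclassified
                ws.append(ws[-1] + yi * xi)      # w_{n+1} = w_n + y_i x_i
                cs.append(1)                     # c_{n+1} = 1
            else:
                cs[-1] += 1                      # c_n counts how long w_n survived
    return ws, cs

def voted_predict(ws, cs, x):
    # Weighted majority vote over all stored separators
    vote = sum(c * np.sign(np.dot(w, x)) for w, c in zip(ws, cs))
    return 1 if vote >= 0 else -1

X = np.array([[2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, -1])
ws, cs = voted_perceptron_train(X, y)
print(voted_predict(ws, cs, np.array([1.5, 0.5])))   # 1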

Voted-Perceptron – cont’d

Problem: we need to keep around a lot of wn vectors.

Solutions: (i) find "representatives"; (ii) use an alternative prediction rule based on a single averaged vector wavg (see the sketch below).
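One common form of rule (ii), matching the Averaged Perceptron discussed below, collapses the vote into the single vector wavg = Σn cn wn, so only one weight vector is needed at test time (a sketch reusing ws and cs from the previous code):

import numpy as np

def averaged_predict(ws, cs, x):
    # Average instead of vote: w_avg = sum_n c_n * w_n, then a single dot product
    w_avg = sum(c * w for w, c in zip(ws, cs))
    return 1 if np.dot(w_avg, x) >= 0 else -1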

Voted-Perceptron – cont’d

From Freund & Schapire, 1998: Classifying digits with VP


Extensions: Regularization

The two most important extensions of the Perceptron turn out to be the Averaged Perceptron and the Thick Separator.

•  In general, regularization is used to bias the learner toward a low-expressivity (low VC-dimension) separator
•  Averaged Perceptron:
   –  Returns a weighted average of a number of earlier hypotheses
   –  The weights are a function of the length of the no-mistake stretch


Regularization: Thick Separator

•  Thick Separator (Perceptron)

–  Promote if: w·x > θ + γ
–  Demote if: w·x < θ - γ

[Figure: the hyperplanes w·x = 0 and w·x = θ with negative examples on one side; the thick separator keeps a band between θ - γ and θ + γ around the threshold]
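A sketch of one common way to implement the thick-separator idea (a Perceptron with margin): update whenever the example's margin relative to the threshold is at most γ, not only on outright mistakes. Labels are taken in {-1, +1}; γ and the learning rate r are illustrative.

import numpy as np

def margin_perceptron_update(w, theta, x, y, gamma=1.0, r=1.0):
    # Update unless the example lies outside the thick band around the separator
    if y * (np.dot(w, x) - theta) <= gamma:
        w = w + r * y * x          # promote (y = +1) or demote (y = -1)
        theta = theta - r * y      # the threshold moves opposite to the label
    return w, theta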

Multiclass Classification

What if there are k classes?


Reduce to binary: all-vs-one
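A sketch of the one-vs-rest version of this reduction: train one binary Perceptron per class (class c versus the rest) and predict the class whose separator scores highest. The training loop is the standard mistake-driven update; the class count and number of epochs are placeholders.

import numpy as np

def train_one_vs_rest(X, labels, num_classes, epochs=10):
    # One weight vector per class; each is a binary Perceptron for "class c vs. the rest"
    W = np.zeros((num_classes, X.shape[1]))
    for _ in range(epochs):
        for x, label in zip(X, labels):
            for c in range(num_classes):
                y = 1 if label == c else -1
                if y * np.dot(W[c], x) <= 0:     # mistake for the class-c classifier
                    W[c] += y * x
    return W

def predict_multiclass(W, x):
    # Predict the class whose linear separator gives the highest score
    return int(np.argmax(W @ x))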

Winnow Algorithm
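Winnow uses the same mistake-driven loop as the Perceptron but with multiplicative weight updates; a minimal sketch of the standard rule for Boolean features and {0, 1} labels, with the usual choices α = 2 and threshold θ = n (the number of features):

def winnow_train(X, Y, alpha=2.0, epochs=10):
    # Standard Winnow: multiplicative updates, weights start at 1, threshold theta = n
    n = len(X[0])
    w = [1.0] * n
    theta = float(n)
    for _ in range(epochs):
        for x, y in zip(X, Y):
            y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
            if y == 1 and y_hat == 0:       # false negative: promote active features
                w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
            elif y == 0 and y_hat == 1:     # false positive: demote active features
                w = [wi / alpha if xi else wi for wi, xi in zip(w, x)]
    return w, theta

# Toy target concept y = x1 OR x3 over 5 Boolean attributes
X = [[1, 0, 0, 0, 0], [0, 0, 1, 0, 0], [0, 1, 0, 1, 1], [0, 0, 0, 0, 1]]
Y = [1, 1, 0, 0]
print(winnow_train(X, Y))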


SNoW

•  A learning architecture that supports several linear update rules (Perceptron, Winnow, naïve Bayes)
•  Allows regularization; voted Winnow/Perceptron; pruning; many options
•  True multi-class classification
•  Variable-size examples; very good support for large-scale domains in terms of the number of examples and the number of features
•  "Explicit" kernels (blowing up the feature space)
•  Very efficient (1-2 orders of magnitude faster than SVMs)
•  Stand-alone, implemented in LBJ

[Download from: http://L2R.cs.uiuc.edu/~cogcomp ]


Passive-Aggressive: Motivation

•  Perceptron: no guarantee of a margin after the update
•  PA: enforce a minimal non-zero margin after the update
•  In particular:
   –  If the margin is large enough, do nothing
   –  If the margin is less than unit, update so that the margin after the update is enforced to be unit


Aggressive Update Step

•  Set the new weight vector wt+1 to be the solution of the following optimization problem (the standard PA formulation):
   wt+1 = argminw ½ ||w - wt||²  subject to  yt (w • xt) ≥ 1
•  Closed-form update: wt+1 = wt + τt yt xt
   where τt = ℓt / ||xt||² and ℓt = max(0, 1 - yt (wt • xt)) is the hinge loss at round t.
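A sketch of the closed-form update above (hard-margin PA, labels in {-1, +1}):

import numpy as np

def pa_update(w, x, y):
    # Passive-Aggressive update: enforce a margin of at least 1 after the update
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss l_t
    if loss == 0.0:
        return w                              # passive: the margin is already large enough
    tau = loss / np.dot(x, x)                 # tau_t = l_t / ||x_t||^2
    return w + tau * y * x                    # aggressive: smallest change that restores the margin

w = pa_update(np.zeros(2), np.array([1.0, 2.0]), +1)
print(w, np.dot(w, [1.0, 2.0]))               # the margin after the update is exactly 1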


Passive-Aggressive Update

Online Passive-Aggressive Algorithms


Online Passive-Aggressive Algorithms – cont’d



Unrealizable Case
