1
Fast Asymmetric Learning for Cascade Face Detection
Jianxin Wu and Charles Brubaker, IEEE PAMI, 2008
Presented by Chun-Hao Chang (張峻豪), 2009/12/01
2
Outline
1. Introduction
2. Recall Adaboost
3. System Flowchart
4. Forward Feature Selection (FFS)
5. Linear Asymmetric Classifier (LAC)
6. Experimental Result
7. Conclusion
3
1. Introduction
2. Recall Adaboost
3. System Flowchart
4. Forward Feature Selection (FFS)
5. Linear Asymmetric Classifier (LAC)
6. Experimental Result
7. Conclusion
4
1. Introduction
Observe three asymmetries in the face detection problem:
1. Uneven class priors – training database: # of positives vs. # of negatives.
2. Goal asymmetry – detection rate vs. false positive rate => EER.
3. Unequal complexity of the positive and negative classes –
   face vs. car (non-face) => easy to classify;
   face vs. animal (non-face) => hard to classify.
This paper presents a framework similar to Adaboost, but one that is faster in learning and leaves the freedom to design the ensemble classifier.
5
1. Introduction
The classifier design step is decoupled into feature selection and ensemble classifier design (the ensemble can be e.g. FDA, SVM, ...).
Proposed: Forward Feature Selection (FFS) for feature selection and the Linear Asymmetric Classifier (LAC) for the ensemble.
Advantages:
1. In training, FFS is about 2.5~3.5 times faster than Fast Adaboost and 50~100 times faster than Adaboost.
2. FFS requires only about 3% of the memory used by Adaboost.
3. We have the freedom to design the ensemble classifier.
6
1. Introduction: Adaboost Vs. FFS+LAC
(Diagram) Input: N = Np + Nn training images zi (Np positives, Nn negatives), each with weight wi = 1/Np; k = # of weak classifiers.
Adaboost: the weak classifiers h1, ..., h5 and their weights α1, ..., α5 are chosen together, H(z) = Σk αk hk(z). For the first chosen classifier,
  ε1 = Σ_{i=1}^{N} wi |h1(zi) − ci|,   α1 = log((1 − ε1) / ε1),
where ci is the label of zi and h1 is the weak classifier with weight α1.
FFS+LAC: FFS first selects the weak classifiers h1, ..., h5, each with weight 1 (H(z) = Σk hk(z)); LAC then assigns new weights α1', ..., α5' to the selected classifiers (H(z) = Σk αk' hk(z)).
7
1. Introduction
2. Recall Adaboost
3. System Flowchart
4. Forward Feature Selection (FFS)
5. Linear Asymmetric Classifier (LAC)
6. Experimental Result
7. Conclusion
8
2. Recall Adaboost (1/2)
1. Input Data: N training samples, N = Np + Nn.
2. Cascaded Framework: a node is learned, the learning goal is checked (T/F), and if it is not yet satisfied a new node Hk+1 is added.
Node learning = feature selection and ensemble classifier (Adaboost); the two are coupled together (not separable). For T iterations:
  1. Normalize the weights.
  2. Pick an appropriate threshold for each weak classifier hi, 1 ≤ i ≤ M, where M is the number of features. A weak classifier has the form h(z) = sgn(z^T m − τ), with input example z and h's corresponding mask (feature) m and threshold τ.
  3. Choose the classifier ht with the lowest error; αt is the weight of ht.
  4. Update the weights.
  5. H(z) = 1 if Σ_{t=1}^{T} αt ht(z) ≥ (1/2) Σ_{t=1}^{T} αt, and 0 otherwise.
3. Cascaded Detector: the learned nodes H1, H2, H3, ... form the cascade.
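The cascade evaluation implied by the H1 → H2 → ... chain can be sketched as follows. This is an illustrative Python reading (node functions, names and types are my assumptions, not code from the paper): a window is accepted only if every node accepts it, and most non-faces are rejected by the first, cheapest nodes.

```python
from typing import Callable, List, Sequence

# A node classifier maps a window's pixel data to True (pass) or False (reject).
Node = Callable[[Sequence[float]], bool]

def cascade_detect(window: Sequence[float], nodes: List[Node]) -> bool:
    """Return True only if the window passes every node H_1, H_2, ..., H_K."""
    for node in nodes:
        if not node(window):
            return False  # rejected early by a cheap node
    return True
```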
9
2. Recall Adaboost (2/2)
αt is decided once ht is chosen.
The sample weights are updated with the error rate εt at the end of each iteration, where wt,i is the weight of sample i at iteration t:
  w_{t+1,i} = w_{t,i} · βt^{1 − ei},
where ei = 0 if example zi is classified correctly, ei = 1 otherwise, and βt = εt / (1 − εt).
Feature = (Filter, Position)
Feature Value = Feature * Example, where * denotes convolution
Classifier = (Feature, Threshold)
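As a concrete illustration of the update rule above, here is a minimal numpy sketch of one AdaBoost round (the function name, array shapes, and the assumption that all weak classifiers' 0/1 outputs are precomputed are mine, not the paper's):

```python
import numpy as np

def adaboost_round(weights, predictions, labels):
    """One Viola-Jones style AdaBoost round.

    weights     : (N,) current sample weights
    predictions : (M, N) 0/1 output of each of the M candidate weak classifiers
    labels      : (N,) 0/1 ground-truth labels c_i
    Returns (index of chosen classifier, its vote weight alpha, new weights).
    """
    w = weights / weights.sum()                # 1. normalize the weights
    errors = np.abs(predictions - labels) @ w  # weighted error of every h_i
    t = int(np.argmin(errors))                 # 3. choose h_t with the lowest error
    eps = errors[t]
    beta = eps / (1.0 - eps)
    e = np.abs(predictions[t] - labels)        # e_i = 0 if correct, 1 otherwise
    new_w = w * beta ** (1.0 - e)              # 4. w_{t+1,i} = w_{t,i} * beta^(1 - e_i)
    alpha = np.log(1.0 / beta)                 # alpha_t = log((1 - eps_t) / eps_t)
    return t, alpha, new_w
```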
10
1. Introduction
2. Recall Adaboost
3. System Flowchart
4. Forward Feature Selection (FFS)
5. Linear Asymmetric Classifier (LAC)
6. Experimental Result
7. Conclusion
11
3. System Flowchart: Notations
z: input example.
x: vector of feature values of a positive example.
y: vector of feature values of a negative example.
Σx: covariance matrix of x.
a: optimal weight vector.
b: optimal threshold.
(Diagram) A sample xi is passed through the selected weak classifiers h1, ..., h4 (each a convolution with its mask followed by thresholding), giving the feature-value vector
  h(xi) = (h1(xi), h2(xi), h3(xi), h4(xi))^T.
x̄ = (1/nx) Σi h(xi): average vector of positive feature values.
ȳ = (1/ny) Σi h(yi): average vector of negative feature values.
12
3. System Flowchart: FFS+LAC
1. Input Data: N training samples, N = Np + Nn.
2. Cascaded Framework: node learning, checking the learning goal, and adding a new node Hk+1 when the goal is not yet satisfied. Inside a node, feature selection and the ensemble classifier are separable:
Feature Selection (FFS), T iterations:
  1. Build the feature table.
  2. Choose the weak classifier hi that makes H' have the smallest error rate.
  3. H(z) = sgn(Σ_{h∈S} h(z) − θ), where θ is the threshold of H(z).
Ensemble Classifier (LAC):
  a = Σx^{-1}(x̄ − ȳ), b = a^T ȳ;  H(z) = sgn(Σ_{t=1}^{T} at ht(z) − b) = sgn(a^T h(z) − b).
3. Cascaded Detector: H1, H2, H3, ...
13
3. System Flowchart: Q&A (1/2)
Q1: What is the difference between Adaboost and FFS+LAC?
A1: Adaboost cannot be separated into a feature-selection step and an ensemble-classifier step: αi is decided as soon as hi is chosen, and each sample weight wi is updated at the end of every round. In FFS, αi is 1 for all hi.
Q2: Why use FFS instead of Adaboost?
A2: FFS stores 1 bit per table entry (only about 3% of the memory); Adaboost needs 32 bits each (see the sketch below).
Q3: Can Adaboost be sped up by a pre-computing strategy?
A3: Yes. If the weights are kept unchanged (no weight update) => Fast Adaboost.
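A rough illustration of the memory claim in A2 (the table size and the 3% figure here are a back-of-the-envelope sketch, not the paper's measurement): storing 1 bit per entry instead of a 32-bit value gives roughly a 32x saving.

```python
import numpy as np

# Hypothetical table of binary weak-classifier outputs, M features x N samples.
M, N = 4000, 1000
bits = np.random.randint(0, 2, size=(M, N), dtype=np.uint8)

packed = np.packbits(bits, axis=1)   # 1 bit per entry (FFS-style table)
floats = bits.astype(np.float32)     # 32 bits per entry (precomputed real values)

print(packed.nbytes / floats.nbytes)  # ~0.03, i.e. about 3% of the memory
```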
14
3. System Flowchart: Q&A (2/2)
Conclusion:
1. In the training process, FFS is:
  a. about 2.5~3.5 times faster than Fast Adaboost;
  b. 50~100 times faster than Adaboost;
  c. needs only about 3% of the memory.
2. It is much easier to implement on different platforms.
3. We have the freedom to design our own ensemble algorithms (e.g. SVM, FDA, ...) for different problems.
15
1. Introduction
2. Recall Adaboost
3. System Flowchart
4. Forward Feature Selection (FFS)
5. Linear Asymmetric Classifier (LAC)
6. Experimental Result
7. Conclusion
16
4. Forward Feature Selection (FFS)
Fig. 1. Adaboost vs. FFS (node learning, T iterations):
(a) Adaboost – repeated every iteration: train all weak classifiers, O(NMT log N) over the whole node; add the feature with the minimum weighted error to the ensemble, O(T); adjust the threshold of the ensemble to meet the learning goal, O(N).
(b) FFS – train all weak classifiers once, O(NM log N); then per iteration: add the feature that minimizes the error of the current ensemble, O(NMT) over the whole node; adjust the threshold of the ensemble to meet the learning goal, O(N).
17
4. FFS: Adaboost Vs. FFS – Adaboost
Samples: w1 w2 w3 w4 w5 w6 (sample weights).
Weak classifiers: h1, h2, h3, h4 with errors ε1 = 2, ε2 = 5, ε3 = 7, ε4 = 4, where
  εj = Σ_{i=1}^{6} wi |hj(zi) − ci|, and ci is the label of zi.
Iteration 1: take the minimum => h1 (ε1) is chosen; the sample weights are then updated to w1', ..., w6'.
Iteration 2: the errors are recomputed with the updated weights: ε2' = 9, ε3' = 5, ε4' = 3. Take the minimum => h4 (ε4') is chosen.
18
4. FFS: Adaboost Vs. FFS – FFS
Samples: w1 w2 w3 w4 w5 w6 (the weights stay unchanged).
Weak classifiers: h1, h2, h3, h4 with errors ε1 = 2, ε2 = 5, ε3 = 7, ε4 = 4.
Iteration 1: take the minimum => h1 is chosen (the chosen one in the first iteration).
Iteration 2: each remaining classifier is evaluated together with h1, i.e. S' = {h1, hj}, H'(z) = sgn(Σ_{h∈S'} h(z) − θ), and
  εj' = Σ_{i=1}^{6} wi |H'(zi) − ci|, where ci is the label of zi.
This gives ε2' = 6, ε3' = 10, ε4' = 8. Take the minimum => h2 is chosen.
19
4. FFS: Training Process (with N samples (images) and M features)
1. Train all weak classifiers. For each feature i: a. sort the feature values Vi1 ~ ViN; b. choose the threshold τ with the smallest error. The weak classifier is hi(z) = sgn(z^T mi − τ), with input example z and h's corresponding mask mi and threshold τ.
2. Build the feature table, of size M×N.
3. Add the feature that minimizes the error of the current ensemble. Start with S <= ∅ and repeat while t < T (t = t + 1):
   a. for i = 1 to M do: S' <= S ∪ {hi}; H'(z) = sgn(Σ_{h∈S'} h(z) − θ); find the θ that makes H' have the smallest error rate; εi <= the error rate of H' with the chosen θ.
   b. k <= arg min_{1≤i≤M} εi.
   c. S <= S ∪ {hk}, i.e. add the hk that makes H' have the smallest error rate.
4. Adjust the value of θ: once t = T, fix S and adjust θ so that H(x) = sgn(Σ_{h∈S} h(x) − θ) has a 50% false positive rate on the training set.
(A Python sketch of this loop follows.)
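The following is a minimal sketch of the greedy FFS loop above over a precomputed bit table. The function name `ffs_select`, the fixed majority-vote threshold θ = (t+1)/2 taken from the worked example later on, and the plain unweighted error are simplifying assumptions; the paper's version also searches θ and enforces the node learning goal.

```python
import numpy as np

def ffs_select(table, labels, T):
    """Greedy forward feature selection over a precomputed M x N table.

    table[i, j] in {0, 1}: output of weak classifier h_i on sample j.
    labels[j]   in {0, 1}: true class of sample j.
    T: number of weak classifiers to select.
    Returns (list of selected row indices, final vote vector v).
    """
    M, N = table.shape
    v = np.zeros(N)                          # current total vote for every sample
    selected = []
    for t in range(1, T + 1):
        theta = (t + 1) / 2.0                # majority-vote threshold for t votes
        best_i, best_err = None, np.inf
        for i in range(M):
            if i in selected:
                continue
            v_new = v + table[i]             # add candidate h_i's column of votes
            H = (v_new >= theta).astype(int) # ensemble prediction H'(z)
            err = int(np.abs(H - labels).sum())
            if err < best_err:
                best_i, best_err = i, err
        selected.append(best_i)
        v = v + table[best_i]                # commit the best candidate
    return selected, v
```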
20
4. FFS: Example – Train all weak classifiers (Paper: p. 5, Algorithm 3)
For a given feature i, 1 ≤ i ≤ M, with N = 6 examples (faces and non-faces), the feature value z^T m of every example is computed and the examples are sorted by this value, together with their weights w1, ..., w6.
The threshold is then swept across the sorted values, and the error ε is updated incrementally as each example switches sides:
  Initial ε = 0.2 + 0.3 + 0.1 + 0.4 = 1
  1. ε = 1 − 0.2 = 0.8
  2. ε = 0.8 − 0.3 = 0.5
  3. ε = 0.5 − 0.1 = 0.4
  4. ε = 0.4 − 0.4 = 0
  5. ε = 0 + 0.6 = 0.6
  6. ε = 0.6 + 0.5 = 1.1
The smallest error (ε = 0) is obtained at τ = 16, so the trained weak classifier is h(z) = sgn(z^T m − 16), where m is the feature's mask and τ its threshold. (A sketch of this sort-and-sweep search follows.)
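A minimal sketch of the sort-and-sweep threshold search for a single feature (the function name and the ±1 label convention are mine; only one polarity is considered for brevity, whereas a full implementation would also try the flipped polarity):

```python
import numpy as np

def best_threshold(values, labels, weights):
    """Find the threshold with the smallest weighted error for one feature.

    values  : (N,) feature value z^T m of each example
    labels  : (N,) +1 for face, -1 for non-face
    weights : (N,) sample weights
    Returns (tau, error) for the weak classifier h(z) = sign(z^T m - tau).
    """
    order = np.argsort(values)
    v, y, w = values[order], labels[order], weights[order]
    # Threshold below every value: everything is predicted +1,
    # so the error is the total weight of the negatives.
    err = w[y == -1].sum()
    best_err, best_tau = err, v[0] - 1.0
    for i in range(len(v)):
        # Move the threshold just above v[i]: example i is now predicted -1.
        err += w[i] if y[i] == +1 else -w[i]
        tau = v[i] + 1.0 if i == len(v) - 1 else (v[i] + v[i + 1]) / 2.0
        if err < best_err:
            best_err, best_tau = err, tau
    return best_tau, best_err
```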
21
4. FFS: Example – Feature Selection Using the Table (1/2)
Feature table V (M = 4 features, N = 6 samples; an entry is the classification result of a weak classifier on a sample, 1 = face, 0 = non-face; samples 1–2 are positives, samples 3–6 negatives):
  h1: 1 1 0 0 1 0
  h2: 1 0 0 0 0 0
  h3: 1 1 0 0 0 0
  h4: 0 1 1 0 0 0
(For example, V[3,2] = 1 is the classification result of applying h3 to sample 2.)
Labels: c = [1 1 0 0 0 0]; initial confidence values: v = [0 0 0 0 0 0]; threshold θ = (t+1)/2.
Iteration t = 1 (θ = 1): for each candidate hi, compute v' = v + Vi (the confidence value of each example) and H'(z) = sgn(v' − θ):
  i = 1: v' = [1 1 0 0 1 0], H' = [1 1 0 0 1 0], ε1 = Σ abs(H'(z) − c) = 1
  i = 2: v' = [1 0 0 0 0 0], H' = [1 0 0 0 0 0], ε2 = 1
  i = 3: v' = [1 1 0 0 0 0], H' = [1 1 0 0 0 0], ε3 = 0
  i = 4: v' = [0 1 1 0 0 0], H' = [0 1 1 0 0 0], ε4 = 2
Take h3 as the weak classifier of the first round, and update v <= v + V3 = [1 1 0 0 0 0].
22
4. FFS: Example – Feature Selection Using the Table (2/2)
M = 4, N = 6; v = [1 1 0 0 0 0]; c = [1 1 0 0 0 0]; θ = (t+1)/2; the table V is the same as before.
Iteration t = 2 (θ = 3/2): h3 is already in the ensemble, so the remaining candidates are tried:
  i = 1: v' = v + V1 = [2 2 0 0 1 0], H' = sgn(v' − θ) = [1 1 0 0 0 0], ε1 = 0
  i = 2: v' = v + V2 = [2 1 0 0 0 0], H' = [1 0 0 0 0 0], ε2 = 1
  i = 4: v' = v + V4 = [1 2 1 0 0 0], H' = [0 1 0 0 0 0], ε4 = 1
Take h1 as the weak classifier of the second round, and update v <= v + V1 = [2 2 0 0 1 0]. (A short snippet reproducing these picks follows.)
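Using the hypothetical `ffs_select` sketch from the training-process section, the toy table above reproduces the slide's choices (h3 in the first round, then h1):

```python
import numpy as np

# Toy feature table from the slides: rows are h1..h4, columns are samples 1..6.
V = np.array([[1, 1, 0, 0, 1, 0],
              [1, 0, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 1, 1, 0, 0, 0]])
c = np.array([1, 1, 0, 0, 0, 0])  # labels: samples 1 and 2 are faces

selected, votes = ffs_select(V, c, T=2)
print(selected)  # [2, 0] -> h3 chosen first, then h1 (0-based indices)
print(votes)     # [2. 2. 0. 0. 1. 0.]
```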
23
4. FFS: FFS Vs. Adaboost
Three major differences between FFS and Adaboost in implementation:
1. No weight update – faster, thanks to the table.
2. The total vote (confidence value before normalization) in FFS is between 0 and T; in Adaboost it can be any real number.
3. Criterion: FFS selects the feature that makes the ensemble classifier have the smallest error on the training set; Adaboost chooses the feature with the smallest weighted error on the training set.
24
1. Introduction
2. Recall Adaboost
3. System Flowchart
4. Forward Feature Selection (FFS)
5. Linear Asymmetric Classifier (LAC)
6. Experimental Result
7. Conclusion
25
5. Linear Asymmetric Classifier (LAC)
Let
  x: vector of feature values of a positive example, i.e. h(x) = (h1(x), ..., hT(x))^T;
     x̄ = (1/nx) Σi h(xi), nx = number of positives; Σx: covariance matrix of x.
  y: vector of feature values of a negative example;
     ȳ = (1/ny) Σi h(yi), ny = number of negatives; Σy: covariance matrix of y.
The linear classifier to be learned can be written as H(a, b), and z is an example with unknown class label:
  H(z) = +1 if a^T h(z) ≥ b, −1 if a^T h(z) < b.
The node learning goal is expressed as:
  max_{a≠0, b}  Pr_{x ~ (x̄, Σx)} { a^T x ≥ b }        <- we want to optimize this (the detection rate)
  s.t.          Pr_{y ~ (ȳ, Σy)} { a^T y ≤ b } = β     (1)
We can treat β as (1 − false positive rate).
26
5. LAC: Definitions
Let xa denote the standardized version of a^T x (x projected onto the direction of a), which has zero mean and unit variance, i.e. ~ (0, 1):
  xa = (a^T x − a^T x̄) / sqrt(a^T Σx a)        (sqrt(a^T Σx a) is the normalizing term)
Let Ψ_{x,a} denote the cumulative distribution function (c.d.f.) of xa:
  Ψ_{x,a}(k) = Pr{ xa ≤ k }.
ya and Ψ_{y,a} are defined similarly:
  ya = (a^T y − a^T ȳ) / sqrt(a^T Σy a),  Ψ_{y,a}(k) = Pr{ ya ≤ k }.
Here Σx = (1/nx) Σi (h(xi) − x̄)(h(xi) − x̄)^T.
27
5. LAC: Derivation (1/3)
Constraint (1) can be re-written as
  β = Pr{ a^T y ≤ b }
    = Pr{ (a^T y − a^T ȳ) / sqrt(a^T Σy a) ≤ (b − a^T ȳ) / sqrt(a^T Σy a) }
    = Ψ_{y,a}( (b − a^T ȳ) / sqrt(a^T Σy a) ),
so
  b = a^T ȳ + Ψ_{y,a}^{-1}(β) · sqrt(a^T Σy a).        (2)
We want to maximize
  Pr{ a^T x ≥ b } = Pr{ (a^T x − a^T x̄) / sqrt(a^T Σx a) ≥ (b − a^T x̄) / sqrt(a^T Σx a) }
                  = 1 − Ψ_{x,a}( (b − a^T x̄) / sqrt(a^T Σx a) ),
which is equivalent to minimizing Ψ_{x,a}( (b − a^T x̄) / sqrt(a^T Σx a) ).
Substituting (2) for b gives
  min_{a≠0} Ψ_{x,a}( ( a^T (ȳ − x̄) + Ψ_{y,a}^{-1}(β) sqrt(a^T Σy a) ) / sqrt(a^T Σx a) ).
28
5. LAC: Derivation (2/3)
Recall that Ψ_{x,a}(k) = Pr{ xa ≤ k }. Since Ψ_{x,a} is non-decreasing (Ψ_{x,a}(k1) ≥ Ψ_{x,a}(k2) when k1 ≥ k2; see Fig. 2), minimizing Ψ_{x,a} of the expression above is equivalent to minimizing its argument k:
  min_{a≠0} ( a^T (ȳ − x̄) + Ψ_{y,a}^{-1}(β) sqrt(a^T Σy a) ) / sqrt(a^T Σx a)
  = max_{a≠0} ( a^T (x̄ − ȳ) − Ψ_{y,a}^{-1}(β) sqrt(a^T Σy a) ) / sqrt(a^T Σx a).
Assume ya follows a symmetric distribution; then Ψ_{y,a}^{-1}(0.5) = 0. For β = 0.5 we have
  max_{a≠0} a^T (x̄ − ȳ) / sqrt(a^T Σx a).
Also, from β = Ψ_{y,a}( (b − a^T ȳ) / sqrt(a^T Σy a) ): when β = 0.5 = (1 − FP), b is at the median of a^T y, and for a symmetric distribution mean = median, so b = a^T ȳ.
29
5. LAC: Derivation (3/3)
Fig. 3. Normality test for a^T y, in which y is a feature vector extracted from non-face data and each component of a is drawn from the uniform distribution on [0, 1]. The closer the points lie to the red line, the closer a^T y is to a normal distribution.
30
5. LAC: Optimal Result (compared with FDA)
  LAC: max_{a≠0} a^T (x̄ − ȳ) / sqrt(a^T Σx a)
  FDA: max_{a≠0} a^T (x̄ − ȳ) / sqrt(a^T (Σx + Σy) a)
Optimal results:
  LAC: a* = Σx^{-1} (x̄ − ȳ),          b* = a*^T ȳ
  FDA: a* = (Σx + Σy)^{-1} (x̄ − ȳ),   b* = a*^T ȳ
The output is a classifier
  H(z) = sgn( Σ_{t=1}^{T} at ht(z) − b ) = sgn( a^T h(z) − b ).
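A minimal numpy sketch of these two closed-form solutions, assuming the weak-classifier outputs of the training examples have already been collected into matrices (function names are illustrative, and a practical version would regularize the covariance before inverting it):

```python
import numpy as np

def linear_ensemble_fit(pos_feats, neg_feats, method="lac"):
    """Closed-form (a, b) from the formulas above.

    pos_feats : (n_x, T) rows are h(x_i) for the positive examples
    neg_feats : (n_y, T) rows are h(y_i) for the negative examples
    Accept an example z when a . h(z) >= b.
    """
    x_bar = pos_feats.mean(axis=0)
    y_bar = neg_feats.mean(axis=0)
    cov_x = np.cov(pos_feats, rowvar=False)
    cov_y = np.cov(neg_feats, rowvar=False)
    if method == "lac":
        a = np.linalg.solve(cov_x, x_bar - y_bar)           # a* = Σx^{-1}(x̄ − ȳ)
    else:  # "fda"
        a = np.linalg.solve(cov_x + cov_y, x_bar - y_bar)   # a* = (Σx + Σy)^{-1}(x̄ − ȳ)
    b = a @ y_bar                                            # b* = a*^T ȳ
    return a, b

def linear_ensemble_predict(a, b, feats):
    """H(z) = sign(a^T h(z) − b): +1 accept, −1 reject."""
    return np.where(feats @ a >= b, 1, -1)
```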
31
1. Introduction
2. Recall Adaboost
3. System Flowchart
4. Forward Feature Selection (FFS)
5. Linear Asymmetric Classifier (LAC)
6. Experimental Result
7. Conclusion
32
6. Experimental Result: LAC Vs. FDA
Fig. 4. Comparing LAC and FDA on a synthetic data set in which both x and y are Gaussians (red: positives, blue: negatives).
33
6. Experimental Result: Synthetic Data
Fig. 5. Synthetic data where y is not symmetric.
34
Fig. 6. Experiments comparing different linear discriminant functions. In 6(a), the training data sets were collected from nodes 11 to 21 of the AdaBoost+FDA cascade; in 6(b), they were collected from the AdaBoost+LAC cascade.
6. Experimental Result: Adaboost Vs. FDA&LAC
35
Fig. 7. Experiments comparing different linear discriminant functions. The training sets were collected from the AdaBoost cascade's nodes.
6. Experimental Result: Adaboost Vs. FDA&LAC
36
6. Experimental Result: Adaboost Vs. FFS
Fig. 8. Experiments comparing cascade performances on the MIT+CMU test set (ROC).
37
Fig. 8. Experiments comparing cascade performances on the MIT+CMU test set: (a) with post-processing, (b) without post-processing.
6. Experimental Result: Effect of Post-Processing?
Post-processing: ?
38
1. Introduction
2. Recall Adaboost
3. System Flowchart
4. Forward Feature Selection (FFS)
5. Linear Asymmetric Classifier (LAC)
6. Experimental Result
7. Conclusion
39
7. Conclusion (Contributions)
Three types of asymmetry are categorized.
The classifier design step is decoupled into feature selection and ensemble classifier design.
FFS is proposed for feature selection; it is 2.5~3.5 times faster than Fast Adaboost (50~100 times faster than Adaboost) and uses only about 3% of Adaboost's memory.
LAC is proposed as the ensemble classifier to address the asymmetric learning goal.
Problems / Q&A
40
41
Reference
[1] J. Wu, S. C. Brubaker, M. D. Mullin, and J. M. Rehg, "Fast Asymmetric Learning for Cascade Face Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3), pp. 369-382, March 2008.
[2] P. Viola and M. Jones, "Robust Real-Time Face Detection", Intl. J. Computer Vision, 57(2), pp. 137-154, 2004.
[3] P. Viola and M. Jones, "Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade", NIPS, pp. 1311-1318, 2001.