1
Fast Asymmetric Learning for Cascade Face Detection
Jianxin Wu and Charles Brubaker, IEEE PAMI, 2008
Presented by Chun-Hao Chang (張峻豪), 2009/12/01
2
Outline
1. Introduction
2. Recall Adaboost
3. System Flowchart
4. Forward Feature Selection (FFS)
5. Linear Asymmetric Classifier (LAC)
6. Experimental Result
7. Conclusion
3
1. Introduction
2. Recall Adaboost
3. System Flowchart
4. Forward Feature Selection (FFS)
5. Linear Asymmetric Classifier (LAC)
6. Experimental Result
7. Conclusion
4
1. Introduction
Observe three asymmetries in the face detection problem:
1. Uneven class priors – training database: # of positives vs. # of negatives.
2. Goal asymmetry – detection rate vs. false positive rate => EER.
3. Unequal complexity of the positive and negative classes –
   face vs. car (non-face) => easy to classify;
   face vs. animal (non-face) => hard to classify.
This paper presents a framework similar to Adaboost, but one that is faster in learning and leaves the freedom to design the ensemble classifier.
5
1. Introduction
The classifier design step is decoupled into feature selection and ensemble classifier design (the ensemble can be e.g. FDA, SVM, ...).
Proposed: Forward Feature Selection (FFS) for feature selection and the Linear Asymmetric Classifier (LAC) for the ensemble.
Advantages:
1. In training, FFS is about 2.5~3.5 times faster than Fast Adaboost and 50~100 times faster than Adaboost.
2. FFS requires only about 3% of the memory used by Adaboost.
3. We have the freedom to design the ensemble classifier.
6
1. Introduction: Adaboost Vs. FFS+LAC
(Diagram) Input: N = Np + Nn training images zi (Np positives, Nn negatives), each with weight wi = 1/Np; k = # of weak classifiers.
Adaboost: the weak classifiers h1, ..., h5 and their weights α1, ..., α5 are chosen together, H(z) = Σk αk hk(z). For the first chosen classifier,
  ε1 = Σ_{i=1}^{N} wi |h1(zi) − ci|,   α1 = log((1 − ε1) / ε1),
where ci is the label of zi and h1 is the weak classifier with weight α1.
FFS+LAC: FFS first selects the weak classifiers h1, ..., h5, each with weight 1 (H(z) = Σk hk(z)); LAC then assigns new weights α1', ..., α5' to the selected classifiers (H(z) = Σk αk' hk(z)).
7
1. Introduction
2. Recall Adaboost
3. System Flowchart
4. Forward Feature Selection (FFS)
5. Linear Asymmetric Classifier (LAC)
6. Experimental Result
7. Conclusion
8
2. Recall Adaboost (1/2)
1. Input Data: N training samples, N = Np + Nn.
2. Cascaded Framework: a node is learned, the learning goal is checked (T/F), and if it is not yet satisfied a new node Hk+1 is added.
Node learning = feature selection and ensemble classifier (Adaboost); the two are coupled together (not separable). For T iterations:
  1. Normalize the weights.
  2. Pick an appropriate threshold for each weak classifier hi, 1 ≤ i ≤ M, where M is the number of features. A weak classifier has the form h(z) = sgn(z^T m − τ), with input example z and h's corresponding mask (feature) m and threshold τ.
  3. Choose the classifier ht with the lowest error; αt is the weight of ht.
  4. Update the weights.
  5. H(z) = 1 if Σ_{t=1}^{T} αt ht(z) ≥ (1/2) Σ_{t=1}^{T} αt, and 0 otherwise.
3. Cascaded Detector: the learned nodes H1, H2, H3, ... form the cascade.
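The cascade evaluation implied by the H1 → H2 → ... chain can be sketched as follows. This is an illustrative Python reading (node functions, names and types are my assumptions, not code from the paper): a window is accepted only if every node accepts it, and most non-faces are rejected by the first, cheapest nodes.

```python
from typing import Callable, List, Sequence

# A node classifier maps a window's pixel data to True (pass) or False (reject).
Node = Callable[[Sequence[float]], bool]

def cascade_detect(window: Sequence[float], nodes: List[Node]) -> bool:
    """Return True only if the window passes every node H_1, H_2, ..., H_K."""
    for node in nodes:
        if not node(window):
            return False  # rejected early by a cheap node
    return True
```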
9
2. Recall Adaboost (2/2)
αt is decided once ht is chosen.
The sample weights are updated with the error rate εt at the end of each iteration, where wt,i is the weight of sample i at iteration t:
  w_{t+1,i} = w_{t,i} · βt^{1 − ei},
where ei = 0 if example zi is classified correctly, ei = 1 otherwise, and βt = εt / (1 − εt).
Feature = (Filter, Position)
Feature Value = Feature * Example, where * denotes convolution
Classifier = (Feature, Threshold)
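As a concrete illustration of the update rule above, here is a minimal numpy sketch of one AdaBoost round (the function name, array shapes, and the assumption that all weak classifiers' 0/1 outputs are precomputed are mine, not the paper's):

```python
import numpy as np

def adaboost_round(weights, predictions, labels):
    """One Viola-Jones style AdaBoost round.

    weights     : (N,) current sample weights
    predictions : (M, N) 0/1 output of each of the M candidate weak classifiers
    labels      : (N,) 0/1 ground-truth labels c_i
    Returns (index of chosen classifier, its vote weight alpha, new weights).
    """
    w = weights / weights.sum()                # 1. normalize the weights
    errors = np.abs(predictions - labels) @ w  # weighted error of every h_i
    t = int(np.argmin(errors))                 # 3. choose h_t with the lowest error
    eps = errors[t]
    beta = eps / (1.0 - eps)
    e = np.abs(predictions[t] - labels)        # e_i = 0 if correct, 1 otherwise
    new_w = w * beta ** (1.0 - e)              # 4. w_{t+1,i} = w_{t,i} * beta^(1 - e_i)
    alpha = np.log(1.0 / beta)                 # alpha_t = log((1 - eps_t) / eps_t)
    return t, alpha, new_w
```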
10
1. Introduction
2. Recall Adaboost
3. System Flowchart
4. Forward Feature Selection (FFS)
5. Linear Asymmetric Classifier (LAC)
6. Experimental Result
7. Conclusion
11
3. System Flowchart: Notations
z: input example.
x: vector of feature values of a positive example.
y: vector of feature values of a negative example.
Σx: covariance matrix of x.
a: optimal weight vector.
b: optimal threshold.
(Diagram) A sample xi is passed through the selected weak classifiers h1, ..., h4 (each a convolution with its mask followed by thresholding), giving the feature-value vector
  h(xi) = (h1(xi), h2(xi), h3(xi), h4(xi))^T.
x̄ = (1/nx) Σi h(xi): average vector of positive feature values.
ȳ = (1/ny) Σi h(yi): average vector of negative feature values.
12
3. System Flowchart: FFS+LAC
1. Input Data: N training samples, N = Np + Nn.
2. Cascaded Framework: node learning, checking the learning goal, and adding a new node Hk+1 when the goal is not yet satisfied. Inside a node, feature selection and the ensemble classifier are separable:
Feature Selection (FFS), T iterations:
  1. Build the feature table.
  2. Choose the weak classifier hi that makes H' have the smallest error rate.
  3. H(z) = sgn(Σ_{h∈S} h(z) − θ), where θ is the threshold of H(z).
Ensemble Classifier (LAC):
  a = Σx^{-1}(x̄ − ȳ), b = a^T ȳ;  H(z) = sgn(Σ_{t=1}^{T} at ht(z) − b) = sgn(a^T h(z) − b).
3. Cascaded Detector: H1, H2, H3, ...
13
3. System Flowchart: Q&A (1/2)
Q1: What is the difference between Adaboost and FFS+LAC?
A1: Adaboost cannot be separated into a feature-selection step and an ensemble-classifier step: αi is decided as soon as hi is chosen, and each sample weight wi is updated at the end of every round. In FFS, αi is 1 for all hi.
Q2: Why use FFS instead of Adaboost?
A2: FFS stores 1 bit per table entry (only about 3% of the memory); Adaboost needs 32 bits each (see the sketch below).
Q3: Can Adaboost be sped up by a pre-computing strategy?
A3: Yes. If the weights are kept unchanged (no weight update) => Fast Adaboost.
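A rough illustration of the memory claim in A2 (the table size and the 3% figure here are a back-of-the-envelope sketch, not the paper's measurement): storing 1 bit per entry instead of a 32-bit value gives roughly a 32x saving.

```python
import numpy as np

# Hypothetical table of binary weak-classifier outputs, M features x N samples.
M, N = 4000, 1000
bits = np.random.randint(0, 2, size=(M, N), dtype=np.uint8)

packed = np.packbits(bits, axis=1)   # 1 bit per entry (FFS-style table)
floats = bits.astype(np.float32)     # 32 bits per entry (precomputed real values)

print(packed.nbytes / floats.nbytes)  # ~0.03, i.e. about 3% of the memory
```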
14
3. System Flowchart: Q&A (2/2)
Conclusion:
1. In the training process, FFS is:
  a. about 2.5~3.5 times faster than Fast Adaboost;
  b. 50~100 times faster than Adaboost;
  c. needs only about 3% of the memory.
2. It is much easier to implement on different platforms.
3. We have the freedom to design our own ensemble algorithms (e.g. SVM, FDA, ...) for different problems.
15
1. Introduction
2. Recall Adaboost
3. System Flowchart
4. Forward Feature Selection (FFS)
5. Linear Asymmetric Classifier (LAC)
6. Experimental Result
7. Conclusion
16
4. Forward Feature Selection (FFS)
Fig. 1. Adaboost vs. FFS (node learning, T iterations):
(a) Adaboost – repeated every iteration: train all weak classifiers, O(NMT log N) over the whole node; add the feature with the minimum weighted error to the ensemble, O(T); adjust the threshold of the ensemble to meet the learning goal, O(N).
(b) FFS – train all weak classifiers once, O(NM log N); then per iteration: add the feature that minimizes the error of the current ensemble, O(NMT) over the whole node; adjust the threshold of the ensemble to meet the learning goal, O(N).
17
4. FFS: Adaboost Vs. FFS – Adaboost
Samples: w1 w2 w3 w4 w5 w6 (sample weights).
Weak classifiers: h1, h2, h3, h4 with errors ε1 = 2, ε2 = 5, ε3 = 7, ε4 = 4, where
  εj = Σ_{i=1}^{6} wi |hj(zi) − ci|, and ci is the label of zi.
Iteration 1: take the minimum => h1 (ε1) is chosen; the sample weights are then updated to w1', ..., w6'.
Iteration 2: the errors are recomputed with the updated weights: ε2' = 9, ε3' = 5, ε4' = 3. Take the minimum => h4 (ε4') is chosen.
18
4. FFS: Adaboost Vs. FFS – FFS
Samples: w1 w2 w3 w4 w5 w6 (the weights stay unchanged).
Weak classifiers: h1, h2, h3, h4 with errors ε1 = 2, ε2 = 5, ε3 = 7, ε4 = 4.
Iteration 1: take the minimum => h1 is chosen (the chosen one in the first iteration).
Iteration 2: each remaining classifier is evaluated together with h1, i.e. S' = {h1, hj}, H'(z) = sgn(Σ_{h∈S'} h(z) − θ), and
  εj' = Σ_{i=1}^{6} wi |H'(zi) − ci|, where ci is the label of zi.
This gives ε2' = 6, ε3' = 10, ε4' = 8. Take the minimum => h2 is chosen.
19
4. FFS: Training Process (with N samples (images) and M features)
1. Train all weak classifiers. For each feature i: a. sort the feature values Vi1 ~ ViN; b. choose the threshold τ with the smallest error. The weak classifier is hi(z) = sgn(z^T mi − τ), with input example z and h's corresponding mask mi and threshold τ.
2. Build the feature table, of size M×N.
3. Add the feature that minimizes the error of the current ensemble. Start with S <= ∅ and repeat while t < T (t = t + 1):
   a. for i = 1 to M do: S' <= S ∪ {hi}; H'(z) = sgn(Σ_{h∈S'} h(z) − θ); find the θ that makes H' have the smallest error rate; εi <= the error rate of H' with the chosen θ.
   b. k <= arg min_{1≤i≤M} εi.
   c. S <= S ∪ {hk}, i.e. add the hk that makes H' have the smallest error rate.
4. Adjust the value of θ: once t = T, fix S and adjust θ so that H(x) = sgn(Σ_{h∈S} h(x) − θ) has a 50% false positive rate on the training set.
(A Python sketch of this loop follows.)
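The following is a minimal sketch of the greedy FFS loop above over a precomputed bit table. The function name `ffs_select`, the fixed majority-vote threshold θ = (t+1)/2 taken from the worked example later on, and the plain unweighted error are simplifying assumptions; the paper's version also searches θ and enforces the node learning goal.

```python
import numpy as np

def ffs_select(table, labels, T):
    """Greedy forward feature selection over a precomputed M x N table.

    table[i, j] in {0, 1}: output of weak classifier h_i on sample j.
    labels[j]   in {0, 1}: true class of sample j.
    T: number of weak classifiers to select.
    Returns (list of selected row indices, final vote vector v).
    """
    M, N = table.shape
    v = np.zeros(N)                          # current total vote for every sample
    selected = []
    for t in range(1, T + 1):
        theta = (t + 1) / 2.0                # majority-vote threshold for t votes
        best_i, best_err = None, np.inf
        for i in range(M):
            if i in selected:
                continue
            v_new = v + table[i]             # add candidate h_i's column of votes
            H = (v_new >= theta).astype(int) # ensemble prediction H'(z)
            err = int(np.abs(H - labels).sum())
            if err < best_err:
                best_i, best_err = i, err
        selected.append(best_i)
        v = v + table[best_i]                # commit the best candidate
    return selected, v
```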
20
4. FFS: Example – Train all weak classifiers (Paper: p. 5, Algorithm 3)
For a given feature i, 1 ≤ i ≤ M, with N = 6 examples (faces and non-faces), the feature value z^T m of every example is computed and the examples are sorted by this value, together with their weights w1, ..., w6.
The threshold is then swept across the sorted values, and the error ε is updated incrementally as each example switches sides:
  Initial ε = 0.2 + 0.3 + 0.1 + 0.4 = 1
  1. ε = 1 − 0.2 = 0.8
  2. ε = 0.8 − 0.3 = 0.5
  3. ε = 0.5 − 0.1 = 0.4
  4. ε = 0.4 − 0.4 = 0
  5. ε = 0 + 0.6 = 0.6
  6. ε = 0.6 + 0.5 = 1.1
The smallest error (ε = 0) is obtained at τ = 16, so the trained weak classifier is h(z) = sgn(z^T m − 16), where m is the feature's mask and τ its threshold. (A sketch of this sort-and-sweep search follows.)
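A minimal sketch of the sort-and-sweep threshold search for a single feature (the function name and the ±1 label convention are mine; only one polarity is considered for brevity, whereas a full implementation would also try the flipped polarity):

```python
import numpy as np

def best_threshold(values, labels, weights):
    """Find the threshold with the smallest weighted error for one feature.

    values  : (N,) feature value z^T m of each example
    labels  : (N,) +1 for face, -1 for non-face
    weights : (N,) sample weights
    Returns (tau, error) for the weak classifier h(z) = sign(z^T m - tau).
    """
    order = np.argsort(values)
    v, y, w = values[order], labels[order], weights[order]
    # Threshold below every value: everything is predicted +1,
    # so the error is the total weight of the negatives.
    err = w[y == -1].sum()
    best_err, best_tau = err, v[0] - 1.0
    for i in range(len(v)):
        # Move the threshold just above v[i]: example i is now predicted -1.
        err += w[i] if y[i] == +1 else -w[i]
        tau = v[i] + 1.0 if i == len(v) - 1 else (v[i] + v[i + 1]) / 2.0
        if err < best_err:
            best_err, best_tau = err, tau
    return best_tau, best_err
```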
21
4. FFS: Example – Feature Selection Using the Table (1/2)
Feature table V (M = 4 features, N = 6 samples; an entry is the classification result of a weak classifier on a sample, 1 = face, 0 = non-face; samples 1–2 are positives, samples 3–6 negatives):
  h1: 1 1 0 0 1 0
  h2: 1 0 0 0 0 0
  h3: 1 1 0 0 0 0
  h4: 0 1 1 0 0 0
(For example, V[3,2] = 1 is the classification result of applying h3 to sample 2.)
Labels: c = [1 1 0 0 0 0]; initial confidence values: v = [0 0 0 0 0 0]; threshold θ = (t+1)/2.
Iteration t = 1 (θ = 1): for each candidate hi, compute v' = v + Vi (the confidence value of each example) and H'(z) = sgn(v' − θ):
  i = 1: v' = [1 1 0 0 1 0], H' = [1 1 0 0 1 0], ε1 = Σ abs(H'(z) − c) = 1
  i = 2: v' = [1 0 0 0 0 0], H' = [1 0 0 0 0 0], ε2 = 1
  i = 3: v' = [1 1 0 0 0 0], H' = [1 1 0 0 0 0], ε3 = 0
  i = 4: v' = [0 1 1 0 0 0], H' = [0 1 1 0 0 0], ε4 = 2
Take h3 as the weak classifier of the first round, and update v <= v + V3 = [1 1 0 0 0 0].
22
4. FFS: Example – Feature Selection Using the Table (2/2)
M = 4, N = 6; v = [1 1 0 0 0 0]; c = [1 1 0 0 0 0]; θ = (t+1)/2; the table V is the same as before.
Iteration t = 2 (θ = 3/2): h3 is already in the ensemble, so the remaining candidates are tried:
  i = 1: v' = v + V1 = [2 2 0 0 1 0], H' = sgn(v' − θ) = [1 1 0 0 0 0], ε1 = 0
  i = 2: v' = v + V2 = [2 1 0 0 0 0], H' = [1 0 0 0 0 0], ε2 = 1
  i = 4: v' = v + V4 = [1 2 1 0 0 0], H' = [0 1 0 0 0 0], ε4 = 1
Take h1 as the weak classifier of the second round, and update v <= v + V1 = [2 2 0 0 1 0]. (A short snippet reproducing these picks follows.)
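Using the hypothetical `ffs_select` sketch from the training-process section, the toy table above reproduces the slide's choices (h3 in the first round, then h1):

```python
import numpy as np

# Toy feature table from the slides: rows are h1..h4, columns are samples 1..6.
V = np.array([[1, 1, 0, 0, 1, 0],
              [1, 0, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 1, 1, 0, 0, 0]])
c = np.array([1, 1, 0, 0, 0, 0])  # labels: samples 1 and 2 are faces

selected, votes = ffs_select(V, c, T=2)
print(selected)  # [2, 0] -> h3 chosen first, then h1 (0-based indices)
print(votes)     # [2. 2. 0. 0. 1. 0.]
```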
23
4. FFS: FFS Vs. Adaboost
Three major differences between FFS and Adaboost in implementation:
1. No weight update – faster, thanks to the table.
2. The total vote (confidence value before normalization) in FFS is between 0 and T; in Adaboost it can be any real number.
3. Criterion: FFS selects the feature that makes the ensemble classifier have the smallest error on the training set; Adaboost chooses the feature with the smallest weighted error on the training set.
24
1. Introduction
2. Recall Adaboost
3. System Flowchart
4. Forward Feature Selection (FFS)
5. Linear Asymmetric Classifier (LAC)
6. Experimental Result
7. Conclusion
25
5. Linear Asymmetric Classifier (LAC)
Let
  x: vector of feature values of a positive example, i.e. h(x) = (h1(x), ..., hT(x))^T;
     x̄ = (1/nx) Σi h(xi), nx = number of positives; Σx: covariance matrix of x.
  y: vector of feature values of a negative example;
     ȳ = (1/ny) Σi h(yi), ny = number of negatives; Σy: covariance matrix of y.
The linear classifier to be learned can be written as H(a, b), and z is an example with unknown class label:
  H(z) = +1 if a^T h(z) ≥ b, −1 if a^T h(z) < b.
The node learning goal is expressed as:
  max_{a≠0, b}  Pr_{x ~ (x̄, Σx)} { a^T x ≥ b }        <- we want to optimize this (the detection rate)
  s.t.          Pr_{y ~ (ȳ, Σy)} { a^T y ≤ b } = β     (1)
We can treat β as (1 − false positive rate).
26
5. LAC: Definitions
Let xa denote the standardized version of a^T x (x projected onto the direction of a), which has zero mean and unit variance, i.e. ~ (0, 1):
  xa = (a^T x − a^T x̄) / sqrt(a^T Σx a)        (sqrt(a^T Σx a) is the normalizing term)
Let Ψ_{x,a} denote the cumulative distribution function (c.d.f.) of xa:
  Ψ_{x,a}(k) = Pr{ xa ≤ k }.
ya and Ψ_{y,a} are defined similarly:
  ya = (a^T y − a^T ȳ) / sqrt(a^T Σy a),  Ψ_{y,a}(k) = Pr{ ya ≤ k }.
Here Σx = (1/nx) Σi (h(xi) − x̄)(h(xi) − x̄)^T.
27
5. LAC: Derivation (1/3)
Constraint (1) can be re-written as
  β = Pr{ a^T y ≤ b }
    = Pr{ (a^T y − a^T ȳ) / sqrt(a^T Σy a) ≤ (b − a^T ȳ) / sqrt(a^T Σy a) }
    = Ψ_{y,a}( (b − a^T ȳ) / sqrt(a^T Σy a) ),
so
  b = a^T ȳ + Ψ_{y,a}^{-1}(β) · sqrt(a^T Σy a).        (2)
We want to maximize
  Pr{ a^T x ≥ b } = Pr{ (a^T x − a^T x̄) / sqrt(a^T Σx a) ≥ (b − a^T x̄) / sqrt(a^T Σx a) }
                  = 1 − Ψ_{x,a}( (b − a^T x̄) / sqrt(a^T Σx a) ),
which is equivalent to minimizing Ψ_{x,a}( (b − a^T x̄) / sqrt(a^T Σx a) ).
Substituting (2) for b gives
  min_{a≠0} Ψ_{x,a}( ( a^T (ȳ − x̄) + Ψ_{y,a}^{-1}(β) sqrt(a^T Σy a) ) / sqrt(a^T Σx a) ).
28
5. LAC: Derivation (2/3)
Recall that Ψ_{x,a}(k) = Pr{ xa ≤ k }. Since Ψ_{x,a} is non-decreasing (Ψ_{x,a}(k1) ≥ Ψ_{x,a}(k2) when k1 ≥ k2; see Fig. 2), minimizing Ψ_{x,a} of the expression above is equivalent to minimizing its argument k:
  min_{a≠0} ( a^T (ȳ − x̄) + Ψ_{y,a}^{-1}(β) sqrt(a^T Σy a) ) / sqrt(a^T Σx a)
  = max_{a≠0} ( a^T (x̄ − ȳ) − Ψ_{y,a}^{-1}(β) sqrt(a^T Σy a) ) / sqrt(a^T Σx a).
Assume ya follows a symmetric distribution; then Ψ_{y,a}^{-1}(0.5) = 0. For β = 0.5 we have
  max_{a≠0} a^T (x̄ − ȳ) / sqrt(a^T Σx a).
Also, from β = Ψ_{y,a}( (b − a^T ȳ) / sqrt(a^T Σy a) ): when β = 0.5 = (1 − FP), b is at the median of a^T y, and for a symmetric distribution mean = median, so b = a^T ȳ.
29
5. LAC: Derivation (3/3)
Fig. 3. Normality test for a^T y, in which y is a feature vector extracted from non-face data and each component of a is drawn from the uniform distribution on [0, 1]. The closer the points lie to the red line, the closer a^T y is to a normal distribution.
30
5. LAC: Optimal Result (compared with FDA)
  LAC: max_{a≠0} a^T (x̄ − ȳ) / sqrt(a^T Σx a)
  FDA: max_{a≠0} a^T (x̄ − ȳ) / sqrt(a^T (Σx + Σy) a)
Optimal results:
  LAC: a* = Σx^{-1} (x̄ − ȳ),          b* = a*^T ȳ
  FDA: a* = (Σx + Σy)^{-1} (x̄ − ȳ),   b* = a*^T ȳ
The output is a classifier
  H(z) = sgn( Σ_{t=1}^{T} at ht(z) − b ) = sgn( a^T h(z) − b ).
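A minimal numpy sketch of these two closed-form solutions, assuming the weak-classifier outputs of the training examples have already been collected into matrices (function names are illustrative, and a practical version would regularize the covariance before inverting it):

```python
import numpy as np

def linear_ensemble_fit(pos_feats, neg_feats, method="lac"):
    """Closed-form (a, b) from the formulas above.

    pos_feats : (n_x, T) rows are h(x_i) for the positive examples
    neg_feats : (n_y, T) rows are h(y_i) for the negative examples
    Accept an example z when a . h(z) >= b.
    """
    x_bar = pos_feats.mean(axis=0)
    y_bar = neg_feats.mean(axis=0)
    cov_x = np.cov(pos_feats, rowvar=False)
    cov_y = np.cov(neg_feats, rowvar=False)
    if method == "lac":
        a = np.linalg.solve(cov_x, x_bar - y_bar)           # a* = Σx^{-1}(x̄ − ȳ)
    else:  # "fda"
        a = np.linalg.solve(cov_x + cov_y, x_bar - y_bar)   # a* = (Σx + Σy)^{-1}(x̄ − ȳ)
    b = a @ y_bar                                            # b* = a*^T ȳ
    return a, b

def linear_ensemble_predict(a, b, feats):
    """H(z) = sign(a^T h(z) − b): +1 accept, −1 reject."""
    return np.where(feats @ a >= b, 1, -1)
```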
31
1. Introduction
2. Recall Adaboost
3. System Flowchart
4. Forward Feature Selection (FFS)
5. Linear Asymmetric Classifier (LAC)
6. Experimental Result
7. Conclusion
32
6. Experimental Result: LAC Vs. FDA
Fig. 4. Comparing LAC and FDA on a synthetic data set in which both x and y are Gaussians (red: positives, blue: negatives).
33
6. Experimental Result: Synthetic Data
Fig. 5. Synthetic data where y is not symmetric.
34
Fig. 6. Experiments comparing different linear discriminant functions. In 6(a), the training data sets were collected from nodes 11 to 21 of the AdaBoost+FDA cascade; in 6(b), they were collected from the AdaBoost+LAC cascade.
6. Experimental Result: Adaboost Vs. FDA&LAC
35
Fig. 7. Experiments comparing different linear discriminant functions. The training sets were collected from the AdaBoost cascade's nodes.
6. Experimental Result: Adaboost Vs. FDA&LAC
36
6. Experimental Result: Adaboost Vs. FFS
Fig. 8. Experiments comparing cascade performances on the MIT+CMU test set (ROC).
37
Fig. 8. Experiments comparing cascade performances on the MIT+CMU test set: (a) with post-processing, (b) without post-processing.
6. Experimental Result: Effect of Post-Processing?
Post-processing: ?
38
1. Introduction
2. Recall Adaboost
3. System Flowchart
4. Forward Feature Selection (FFS)
5. Linear Asymmetric Classifier (LAC)
6. Experimental Result
7. Conclusion
39
7. Conclusion (Contributions)
Three types of asymmetry are categorized.
The classifier design step is decoupled into feature selection and ensemble classifier design.
FFS is proposed for feature selection; it is 2.5~3.5 times faster than Fast Adaboost (50~100 times faster than Adaboost) and uses only about 3% of Adaboost's memory.
LAC is proposed as the ensemble classifier to address the asymmetric learning goal.
Problems / Q&A
40
41
Reference
[1] J. Wu, S. C. Brubaker, M. D. Mullin, and J. M. Rehg, "Fast Asymmetric Learning for Cascade Face Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3), pp. 369-382, March 2008.
[2] P. Viola and M. Jones, "Robust Real-Time Face Detection", Intl. J. Computer Vision, 57(2), pp. 137-154, 2004.
[3] P. Viola and M. Jones, "Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade", NIPS, pp. 1311-1318, 2001.