Lecture 1
Object Detection
Bill Triggs, Laboratoire Jean Kuntzmann, Grenoble, France
International Computer Vision Summer School, Sicily
July 2008
What do we need to build a good object detector?
● There may be lighting variations, changes in appearance, complex backgrounds
– We need robust visual features
● Instances may have variable geometry or internal degrees of freedom
– Orientation, 3D pose, body pose, facial expression
– We need a flexible recognition method
● Instances may occur anywhere in the image and at any scale
– We need a good search control strategy
● There may be overlapping instances or detections
– We need a detection postprocessing strategy
● The method is likely to be based on learning and will need to be validated
– We need labelled training and validation sets
● Computational cost or embeddability may be an issue
– We need to review the whole system for efficiency
A Naive Image Scanning Detector – Template Matching
Match window against a rigid template, e.g. by correlation
Scan image at all scales and locations
Return above-threshold matches as detections
[Figure: detection phase, showing the scale-space pyramid, the detection window and the resulting object detections]
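The naive detector above can be sketched in a few lines. This is only an illustration under assumptions: the function names, the 0.9 threshold and the use of zero-mean normalized correlation are choices made here, and only a single scale is scanned (a full detector would repeat the scan on each level of the scale-space pyramid).

```python
import numpy as np

def normalized_correlation(window, template):
    """Zero-mean normalized correlation between two equally sized patches."""
    w = window - window.mean()
    t = template - template.mean()
    denom = np.sqrt((w * w).sum() * (t * t).sum())
    return float((w * t).sum() / denom) if denom > 0 else 0.0

def scan_single_scale(image, template, threshold=0.9, stride=1):
    """Slide the template over the image; return (row, col, score) matches."""
    th, tw = template.shape
    matches = []
    for r in range(0, image.shape[0] - th + 1, stride):
        for c in range(0, image.shape[1] - tw + 1, stride):
            score = normalized_correlation(image[r:r + th, c:c + tw], template)
            if score >= threshold:  # keep above-threshold matches only
                matches.append((r, c, score))
    return matches

# Toy example: the template is cut out of the image itself at (5, 7).
rng = np.random.default_rng(0)
image = rng.random((20, 20))
template = image[5:11, 7:13].copy()
matches = scan_single_scale(image, template)
```

Even in this toy form the scan returns raw window matches only, so nearby windows on a real image would all fire and nothing merges them.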
Problems with this approach
• It is photometrically too rigid to resist changes in lighting and appearance variations
• It is geometrically too rigid to resist shape variations
• It does not have a strategy for overlapping detections
Anatomy of a Modern Object Detector
• Strong image preprocessing and feature normalization for resistance to illumination changes
• Local rectification and pooling for resistance to small shape variations
• Overcomplete feature set for rich description
• Machine learning based decision rule to capture statistics and variability of real application
• Postprocessing to fuse multiple detections
Image Scanning Detectors
Detection phase:
– Scan image(s) at all scales and locations (scale-space pyramid, detection window)
– Extract features over windows
– Run window classifier at all locations
– Fuse multiple detections in 3-D position & scale space
– Object detections with bounding boxes
Image Preprocessing
● Preprocessing is often neglected but it can make a huge difference in performance
● One example of a preprocessing chain
– input image → strong gamma compression → centre-surround filter → robust local contrast normalization → highlight suppression
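A minimal sketch of such a chain is given below. All numeric values (gamma, the two Gaussian widths, alpha, tau) are assumptions chosen for illustration, the centre-surround filter is taken to be a difference of Gaussians, and the contrast normalization is done globally in two stages as a simplification of the "robust local" step named on the slide.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with edge padding (pure NumPy)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    k /= k.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def preprocess(img, gamma=0.2, sigma_inner=1.0, sigma_outer=2.0, alpha=0.1, tau=10.0):
    """Gamma compression, centre-surround filter, contrast normalization, highlight suppression."""
    out = np.power(np.maximum(np.asarray(img, dtype=float), 0.0), gamma)  # strong gamma compression
    out = gaussian_blur(out, sigma_inner) - gaussian_blur(out, sigma_outer)  # centre-surround (DoG)
    # Two-stage robust contrast normalization (global here, as a simplification)
    out = out / (np.mean(np.abs(out) ** alpha) ** (1.0 / alpha) + 1e-12)
    out = out / (np.mean(np.minimum(tau, np.abs(out)) ** alpha) ** (1.0 / alpha) + 1e-12)
    return tau * np.tanh(out / tau)  # highlight suppression: squash large values

rng = np.random.default_rng(0)
img = rng.random((16, 16)) * 255
a = preprocess(img)
b = preprocess(4 * img)  # same scene, four times brighter
```

In this sketch a global gain change is removed entirely, since gamma compression turns it into a scale factor that the normalization divides out.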
Performance Improvements from Preprocessing
Face Recognition Grand Challenge 1.0.4 Dataset, various features, baseline LDA classifier
Local Binary Pattern Features
● Descriptors based on local thresholding or ranking of pixel or edge intensities are very resistant to illumination changes
● Local Binary Patterns
– threshold ring of pixels at value of central pixel
– locally histogram resulting binary codes
– currently one of the best descriptors for face recognition
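The two LBP steps (threshold a ring at the central value, then histogram the codes) can be sketched as follows. The ring is assumed here to be the 8-neighbourhood at radius 1, the bit ordering is arbitrary, and a single histogram over the whole patch stands in for the local histograms.

```python
import numpy as np

def lbp_codes(img):
    """8-neighbour LBP code for each interior pixel of a 2-D grayscale array."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    centre = img[1:-1, 1:-1]
    # Neighbour offsets, clockwise from top-left; each contributes one bit
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(centre.shape, dtype=np.int64)
    for bit, (dr, dc) in enumerate(offsets):
        neighbour = img[1 + dr:h - 1 + dr, 1 + dc:w - 1 + dc]
        codes |= (neighbour >= centre).astype(np.int64) << bit
    return codes

def lbp_histogram(img):
    """256-bin histogram of the LBP codes of a patch."""
    return np.bincount(lbp_codes(img).ravel(), minlength=256)

patch = np.array([[5, 9, 1],
                  [4, 6, 7],
                  [2, 3, 8]])
code = lbp_codes(patch)[0, 0]            # neighbours 9, 7, 8 exceed the centre 6
same = lbp_codes(3 * patch + 10)[0, 0]   # monotonic lighting change
```

Because only the ordering of pixel values matters, any monotonically increasing change of intensity leaves the code unchanged.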
Detectors using Local Filters
● Convolution filters inspired by V1 simple cell responses, multiscale image representations
– Gaussian derivatives, Gabor filters, log-polar Gabor filters, steerable filters, Haar wavelets
– use a number of orientations (4-12)
– output is typically squared or rectified before use
2nd & 3rd order Gaussian derivative, scaled Gaussian derivative and log-polar Gabor filters
2nd order steerable filter and its frequency response
Haar wavelets
Haar Wavelet / SVM Human Detector
[Papageorgiou & Poggio, 1998]
Training: training set (2k positive / 10k negative) → Haar wavelet descriptors (1326-D descriptor) → support vector machine
Testing: test image → multi-scale search → test descriptors → results
Which Descriptors are Important?
32x32 descriptors (HVX) 16x16 descriptors (HVX)
Mean response difference between positive & negative training examples
Essentially just a coarse-scale human silhouette template!
Some Detection Results
Detectors using Edge / Gradient Orientation Histograms
● Divide local region into spatial cells
● Calculate orientation of image gradient at each pixel
● Pool quantized orientations over each cell
– descriptor contains an orientation histogram for each cell
– weight votes by gradient magnitude
● Can also use edge orientations from a discrete edge detector
● Basis of the popular SIFT, HOG, Generalized Shape Context methods
orientation voting and pooling into spatial cells
C.f. Shape context
– pool counts of edge pixels into log-polar spatial bins
– centre descriptor on regularly spaced / all edge pixels
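The voting and pooling steps can be sketched as below. This is a simplified illustration: votes go to a single hard orientation bin with no interpolation between bins or cells, orientations are taken as unsigned, and the cell size and bin count are assumed values.

```python
import numpy as np

def cell_orientation_histograms(img, cell=8, bins=9):
    """Per-cell histograms of gradient orientation, weighted by gradient magnitude."""
    img = np.asarray(img, dtype=float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # centred [-1 0 1] derivative
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    n_r, n_c = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((n_r, n_c, bins))
    for r in range(n_r):
        for c in range(n_c):
            sl = (slice(r * cell, (r + 1) * cell), slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(bin_idx[sl].ravel(),
                                     weights=mag[sl].ravel(), minlength=bins)
    return hist

# A vertical step edge produces purely horizontal gradients (orientation bin 0).
step = np.zeros((16, 16))
step[:, 8:] = 1.0
h_vertical_edge = cell_orientation_histograms(step)
h_horizontal_edge = cell_orientation_histograms(step.T)
```

The descriptor for a region is then the concatenation of these per-cell histograms, usually after some form of local normalization.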
Histogram of Oriented Gradient (HOG) Person Detector
● This simple detector is still one of the best generic human detectors
● It is a good illustration of
– the power that modern features and training methods have given to basic template matching
– the need for good engineering and attention to detail
N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, 2005
Feature Extraction
Input image → detection window
– Normalise gamma
– Compute gradients
– Weighted vote in spatial & orientation cells
– Contrast normalise over overlapping spatial cells
– Collect HOGs over detection window: feature vector f = [ ..., ..., ...]
– Linear SVM
[Figure labels: cell, block, overlap of blocks]
Overview of Learning Phase
Input: Annotations on training images
– Create fixed-resolution normalised training image data set
– Encode images into feature spaces
– Learn binary classifier
– Resample negative training images to create hard examples
– Encode images into feature spaces
– Learn binary classifier
– Object/Non-object decision
Re-training reduces false positives by an order of magnitude!
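The retraining loop can be sketched as follows. To stay self-contained, a least-squares linear classifier is swapped in for the linear SVM, and the function names, the zero threshold and the one-dimensional toy data are all assumptions made for this example.

```python
import numpy as np

def train_linear(X, y):
    """Least-squares linear classifier; a stand-in for the linear SVM."""
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def score(w, X):
    """Signed classifier output; positive means 'object'."""
    return np.hstack([X, np.ones((len(X), 1))]) @ w

def retrain_with_hard_negatives(pos, neg_initial, neg_pool, threshold=0.0):
    """Train, collect false positives from the negative pool, then retrain."""
    X = np.vstack([pos, neg_initial])
    y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg_initial))])
    w = train_linear(X, y)
    hard = neg_pool[score(w, neg_pool) > threshold]  # negatives the first model accepts
    X2 = np.vstack([X, hard])
    y2 = np.concatenate([y, -np.ones(len(hard))])
    return train_linear(X2, y2), len(hard)

# One-dimensional toy features
pos = np.array([[2.0], [3.0]])
neg_initial = np.array([[-3.0], [-2.0]])
neg_pool = np.array([[0.5], [1.0], [-5.0]])
w2, n_hard = retrain_with_hard_negatives(pos, neg_initial, neg_pool)
```

In this toy case the first classifier accepts two of the pooled negatives, and after retraining all pooled negatives score below zero while the positives stay above it.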
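HOG Descriptors
Parameters
– Gradient scale
– Orientation bins
– Percentage of block overlap
Schemes
– RGB or Lab, colour/gray-space
– Block normalisation: L2-norm, $v \leftarrow v / \sqrt{\|v\|_2^2 + \epsilon}$, or L1-norm, $v \leftarrow v / (\|v\|_1 + \epsilon)$
[Figure: R-HOG/SIFT layout (cells grouped into blocks) and C-HOG layout (with a centre bin)]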
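As a sketch of block normalisation, the code below groups a grid of cell histograms into overlapping blocks and applies either an L2-norm scheme, v ← v / sqrt(‖v‖₂² + ε), or an L1-norm scheme, v ← v / (‖v‖₁ + ε). The value of ε, the 2×2-cell block and the one-cell stride are assumptions for illustration.

```python
import numpy as np

def l2_normalise(v, eps=1e-3):
    """L2-norm scheme: v / sqrt(||v||_2^2 + eps)."""
    v = np.asarray(v, dtype=float)
    return v / np.sqrt(np.sum(v ** 2) + eps)

def l1_normalise(v, eps=1e-3):
    """L1-norm scheme: v / (||v||_1 + eps)."""
    v = np.asarray(v, dtype=float)
    return v / (np.sum(np.abs(v)) + eps)

def normalise_blocks(cell_hist, block=2, stride=1, norm=l2_normalise):
    """Normalise each overlapping block of cells and concatenate the results."""
    n_r, n_c, _ = cell_hist.shape
    out = []
    for r in range(0, n_r - block + 1, stride):
        for c in range(0, n_c - block + 1, stride):
            out.append(norm(cell_hist[r:r + block, c:c + block].ravel()))
    return np.concatenate(out)

cells = np.ones((3, 3, 9))               # 3x3 cells, 9 orientation bins each
feat = normalise_blocks(cells)           # 4 overlapping blocks of 36 values
feat_bright = normalise_blocks(5.0 * cells)
```

With a stride smaller than the block size each cell contributes to several blocks, which is why overlap increases the descriptor length.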
Evaluation Data Sets
MIT pedestrian database (overall 709 annotations + reflections)
  Train: 507 positive windows; negative data unavailable
  Test:  200 positive windows; negative data unavailable
INRIA person database (overall 1774 annotations + reflections)
  Train: 1208 positive windows; 1218 negative images
  Test:  566 positive windows; 453 negative images
Performance on MIT Dataset
● R-HOG and C-HOG give near perfect separation on MIT database
● Both have 1-2 orders of magnitude fewer false positives than wavelets and similar descriptors
Performance on INRIA Database
Influence of Parameters
Gradient smoothing, σ: reducing gradient scale from 3 to 0 decreases false positives by 10 times
Orientation bins, β: increasing orientation bins from 4 to 9 decreases false positives by 10 times
Influence of Parameters
Normalisation method: strong local normalisation is essential
Block overlap: overlapping blocks improve performance, but descriptor size increases
Influence of Block and Cell Size
● Trade off between need for local spatial invariance and need for finer spatial resolution
[Figure: 64×128 detection window]
Which Cues are Important?
● Most important cues are head, shoulder, leg silhouettes
● Vertical gradients inside a person are counted as negative
● Overlapping blocks just outside the contour are most important
[Figure panels: average gradients, input example, weighted pos wts, weighted neg wts, outside-in weights]
Merging Overlapping Detections
Robust mode detection (mean shift)
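$f(\mathbf{x}) = \sum_{i=1}^{n} w_i \exp\left(-\|\mathbf{x} - \mathbf{x}_i\|^2_{H_i^{-1}} / 2\right)$
$H_i = \mathrm{diag}[\exp(s_i)\,\sigma_x,\ \exp(s_i)\,\sigma_y,\ \sigma_s]$
[Figure: detections plotted in (x, y, s) space, with scale s in log units]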
Clip Detection Score
Multi-scale dense scan of detection window
Final detections
Threshold
Bias
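A minimal sketch of mean-shift fusion over (x, y, log scale) is given below. It assumes each detection carries a diagonal bandwidth with variances [(exp(s_i)σ_x)², (exp(s_i)σ_y)², σ_s²], that the weights are already clipped to be positive, and that the σ values and the merge tolerance are free choices; none of these numbers come from the slides.

```python
import numpy as np

def fuse_detections(points, weights, sigma=(8.0, 16.0, 0.5), iters=100, merge_tol=1.0):
    """points: (n, 3) rows of [x, y, log(scale)]; weights: positive clipped scores."""
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    scale = np.exp(points[:, 2])
    # Per-detection diagonal variances: spatial bandwidth grows with scale
    var = np.stack([(scale * sigma[0]) ** 2,
                    (scale * sigma[1]) ** 2,
                    np.full(len(points), sigma[2] ** 2)], axis=1)
    modes = []
    for start in points:
        x = start.copy()
        for _ in range(iters):
            d2 = np.sum((x - points) ** 2 / var, axis=1)  # squared Mahalanobis distances
            omega = weights * np.exp(-0.5 * d2)
            # Fixed-point update: bandwidth-weighted mean of the detections
            x_new = ((omega[:, None] * points / var).sum(axis=0)
                     / (omega[:, None] / var).sum(axis=0))
            done = np.allclose(x_new, x, rtol=0.0, atol=1e-6)
            x = x_new
            if done:
                break
        # Keep the mode only if it is not a duplicate of one already found
        if all(np.sum((x - m) ** 2 / var.min(axis=0)) >= merge_tol for m in modes):
            modes.append(x)
    return np.array(modes)

detections = [[100, 100, 0.00], [104, 98, 0.05], [97, 103, -0.05],
              [300, 200, 0.00], [303, 202, 0.02]]
scores = [1.0, 0.8, 0.6, 1.2, 0.9]
modes = fuse_detections(detections, scores)
```

Each detection is used as a starting point, so every mode is found at least once and duplicates are merged afterwards.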
Influence of Mean Shift Kernel
Spatial smoothing aspect ratio as per window shape, smallest sigma approx. equal to stride/cell size
Relatively independent of scale smoothing, sigma equal to 0.4 to 0.7 octaves gives good results
Influence of Other Parameters
Different mappings Effect of scale-ratio
Hard clipping of SVM scores gives better results than simple probabilistic mapping of the scores
Fine scale sampling improves recall
Results Using Static HOG
No temporal smoothing of detections
Conclusions for Static HOG Human Detector
● Fine grained features improve performance
– Rectify fine gradients then pool spatially
● No gradient smoothing, [1 0 -1] derivative mask
● Orientation voting into fine bins
● Spatial voting into coarser bins
– Use gradient magnitude (no thresholding)
– Strong local normalization
– Use overlapping blocks
– Robust non-maximum suppression
● Fine scale sampling, hard clipping & anisotropic kernel
Human detection rate of 90% at 10^-4 false positives per window
Slower than integral images of Viola & Jones, 2001
Applications to Other Classes
M. Everingham et al. The 2005 PASCAL Visual Object Classes Challenge. Proceedings of the PASCAL Challenge Workshop, 2006.
Parameter Settings
● Most HOG parameters are stable across different classes
● Parameters that change
– Gamma compression
– Normalisation methods
– Signed/un-signed gradients
Results from Pascal VOC 2006
Class       TKK     TUD     Laptev  HOG     ENSMP   Cambridge
Cat         0.160   -       -       -       -       0.151
Horse       0.137   -       0.140   -       -       0.091
Motorbike   0.265   0.153   0.318   0.390   -       0.178
Bicycle     0.303   -       0.440   0.414   -       0.249
Bus         0.169   -       -       0.117   -       0.138
Person      0.039   0.074   0.114   0.164   -       0.030
Sheep       0.227   -       -       0.251   -       0.131
Cow         0.252   -       0.224   0.212   0.159   0.149
Dog         0.113   -       -       -       -       0.118
Car         0.222   -       -       0.444   0.398   0.254
(Laptev = HOG + AdaBoost)
HOG outperformed other methods for 4 out of 10 classes
Its AdaBoost variant outperformed other methods for 2 out of 10 classes
Finding People in Videos
● Motivation
– Human motion is very characteristic
● Requirements
– Must work for moving camera and background
– Robust coding of relative motion of human parts
● Method
– Use differential flow for resistance to camera motion
– HOG like spatial histogramming for robust coding of relative motion
Motion HOG Processing Chain
Input image, consecutive image
– Normalise gamma & colour
– Compute optical flow
– Compute differential flow
– Accumulate votes for differential flow orientation over spatial cells
– Normalise contrast within overlapping blocks of cells
– Collect HOGs for all blocks over detection window
[Figure panels: input image, consecutive image, flow field, magnitude of flow, differential flow X, differential flow Y; detection windows with cells, blocks and overlap of blocks]
Overview of Feature Extraction
Collect HOGs over detection window
Object/Non-object decision
Linear SVM
Static HOG Encoding
Motion HOG Encoding
Input image Consecutive image(s)
Appearance Channel
Motion Channel
Data Set
Train: 5 DVDs, 182 shots, 5562 positive windows
Test 1: same 5 DVDs, 50 shots, 1704 positive windows
Test 2: 6 new DVDs, 128 shots, 2700 positive windows
Motion Boundary Histograms
First frame
Second frame
Estd. flow
Flow mag.
y-flow diff
x-flow diff
Avg. x-flow diff
Avg. y-flow diff
Treat x, y-flow components as independent images
Take their local gradients separately, and compute HOGs as in static images
Flow discontinuities follow occlusion boundaries, so this encodes depth and motion boundaries
Internal Motion Histograms
● Alternatively, we can use orientations of flow differences not boundaries
● This captures relative motions of body parts
● We tested several different coding schemes based on finite spatial (inter-part) displacements
IMH Encoding Schemes
● Simple difference
– Take x, y differentials of flow vector images [Ix, Iy]
– Variants may use larger spatial displacements while differencing, e.g. [1 0 0 0 -1]
● Center cell difference
● Wavelet-style cell differences
[Figure: cell masks with +1 / -1 / -2 weights illustrating the centre cell and wavelet-style differences]
Flow Methods
● Proesmans' flow [Proesmans et al., ECCV 1994]
– 15 seconds per frame
● Our flow method
– Multi-scale pyramid based method, no regularization
– Brightness constancy based damped least squares solution on 5×5 window: $[x, y]^T = (A^T A + \beta I)^{-1} A^T b$
– 1 second per frame
● MPEG-4 based block matching
– Runs in real-time
[Figure: input image, Proesmans' flow, our multi-scale flow]
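The damped least squares step can be sketched for a single pixel and a single pyramid level as below. It assumes A stacks the spatial derivatives [Ix, Iy] over the 5×5 window and b holds the negated temporal differences, which is the usual brightness-constancy setup; the value of β and the function name are illustrative choices, and the multi-scale pyramid is omitted.

```python
import numpy as np

def damped_flow(prev, curr, row, col, half=2, beta=1e-2):
    """Flow [u, v] at (row, col) from a (2*half+1)^2 window via (A^T A + beta I)^-1 A^T b."""
    prev = np.asarray(prev, dtype=float)
    curr = np.asarray(curr, dtype=float)
    Iy, Ix = np.gradient(prev)   # derivatives along rows (y) and columns (x)
    It = curr - prev             # temporal difference
    win = (slice(row - half, row + half + 1), slice(col - half, col + half + 1))
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)  # 25 x 2 for half=2
    b = -It[win].ravel()
    return np.linalg.solve(A.T @ A + beta * np.eye(2), A.T @ b)

# Synthetic check: build the second frame from the linearised brightness
# constancy model Ix*u + Iy*v + It = 0 with a known flow.
rng = np.random.default_rng(0)
prev = rng.random((15, 15))
gy, gx = np.gradient(prev)
true_flow = np.array([0.5, -0.25])
curr = prev - (true_flow[0] * gx + true_flow[1] * gy)
estimate = damped_flow(prev, curr, row=7, col=7)
```

The damping term β I keeps the 2×2 system solvable in textureless windows, at the cost of shrinking the estimate slightly towards zero.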
Performance Comparison
Only motion information Appearance + motion
With motion only, MBH scheme on Proesmans’ flow works best
Combined with appearance, centre difference IMH performs best
Trained on Static & Flow
Tested on flow only Tested on appearance + flow
Adding static images during test reduces performance margin
No deterioration in performance on static images
Motion HOG Video
No temporal smoothing, each pair of frames treated independently
AdaBoost Cascade Face Detector
● A computationally efficient architecture that rapidly rejects unpromising windows
– A chain of classifiers that each reject some fraction of the negative training samples while keeping almost all positive ones
● Each classifier is an AdaBoost ensemble of rectangular Haar-like features sampled from a large pool
[Viola & Jones, 2001]
Rectangular Haar features and the first two features chosen by AdaBoost
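The run-time side of such a cascade can be sketched as follows: an integral image makes each rectangular feature a handful of lookups, and each stage sums the votes of its weak learners and rejects the window if the total falls below the stage threshold. AdaBoost training is omitted, only a two-rectangle feature is shown, and the stage parameters in the example are made up.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with an extra zero row and column."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def box_sum(ii, r, c, h, w):
    """Sum of the h x w rectangle with top-left corner (r, c), in four lookups."""
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def two_rect_feature(ii, r, c, h, w):
    """Left h x w rectangle minus the adjacent right h x w rectangle."""
    return box_sum(ii, r, c, h, w) - box_sum(ii, r, c + w, h, w)

def cascade_accepts(ii, stages):
    """stages: list of (weak_learners, stage_threshold);
    each weak learner is (feature_args, theta, polarity, alpha)."""
    for weak_learners, stage_threshold in stages:
        total = 0.0
        for args, theta, polarity, alpha in weak_learners:
            if polarity * two_rect_feature(ii, *args) < polarity * theta:
                total += alpha   # this weak learner votes "object"
        if total < stage_threshold:
            return False         # rejected early; later stages are never evaluated
    return True

# Hypothetical one-stage cascade looking for "bright left, dark right"
stages = [([((0, 0, 4, 2), 4.0, -1, 1.0)], 0.5)]
edge = np.zeros((4, 4))
edge[:, :2] = 1.0
flat = np.ones((4, 4))
```

Most windows in an image are rejected by the first one or two stages, which is where the speed of the architecture comes from.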
Dynamic Pedestrian Detection
Viola, Jones and Snow, ICCV 2003
Similar to the above face detector but also includes motion derivative filters
Convolutional Neural Nets
● A series of banks of convolution filters that alternately analyse the output images of the previous bank (“simple cells”) and spatially pool the resulting rectified responses (“complex cells”)
● Trained by gradient descent on large training sets
AT&T system – reads ~10% of U.S. cheques
[LeCun, 1992-8]
Rotation Invariant Neural Net Face Detector
Learn rectifier network for rotations, then upright face detector
[Rowley et al., 1998]
Convolutional Net Multipose Face Detector
● Net is trained to produce zero for a non-face, a unit-vector encoding the facial pose for a face
● At run time must run a descent search to find best putative pose for observed image, then check whether “face” is likely given this
[Osadchy, 2007]
Video
Exemplar based Pedestrian Detector
● Build model by clustering training examples hierarchically
● At run-time, use similarity tree to find similar examples quickly
[D.Gavrila, ICPR'98]
Distance Transform based Edge Template Matching
[Gavrila, Philomin, ICCV'99]
For best results, use DT over orientated edges
Learning to Detect Object Contours by Cue Combination
Brightness, colour & texture gradient, combined with boosted logistic regression
[Martin et al., PAMI'04]
Capturing Local Statistics
● Many approaches capture local image content using statistics or distributions of primitive descriptors over local image regions
– e.g. tile local region with small cells, find statistics in each cell
– captures local context, increases robustness to spatial displacements
● Capture distributions using mixture models, histograms of quantized descriptor values
● Capture statistics using moments, pairwise correlations...
Wavelet Histogram Face Detector
● Crudely quantize wavelet responses (3-5 levels)
● Partition wavelets into groups of 5-8 with strong mutual information
● For each group build histogram of log P(object)/P(non-object)
● Final classifier is naïve Bayes combination of histogram lookups
● Learn frontal and profile face detectors and combine outputs
Wavelets with strong MI with indicated one, and a chosen coefficient pair
Some detections
Green regions strongly support face, red regions support non-face
[Schneiderman, IJCV'02]
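The histogram lookup and naïve Bayes combination can be sketched as below. It assumes each wavelet group has already been quantized to a single integer code per window; the additive smoothing, the zero threshold and the function names are choices made for this example.

```python
import math
from collections import Counter

def learn_llr_table(obj_codes, bg_codes, n_codes, smoothing=1.0):
    """Table of log P(code | object) / P(code | non-object) for one wavelet group."""
    obj, bg = Counter(obj_codes), Counter(bg_codes)
    table = []
    for code in range(n_codes):
        p_obj = (obj[code] + smoothing) / (len(obj_codes) + smoothing * n_codes)
        p_bg = (bg[code] + smoothing) / (len(bg_codes) + smoothing * n_codes)
        table.append(math.log(p_obj / p_bg))
    return table

def classify(window_codes, tables, threshold=0.0):
    """Naive Bayes: sum the per-group log-likelihood ratios and threshold the total."""
    total = sum(table[code] for table, code in zip(tables, window_codes))
    return total > threshold, total

# Toy data: code 0 is common on objects, code 1 on background
table = learn_llr_table([0, 0, 0, 1], [1, 1, 1, 0], n_codes=2)
tables = [table, table]          # two groups sharing one table, for brevity
is_face, total = classify([0, 0], tables)
is_face_bg, total_bg = classify([1, 1], tables)
```

At run time the classifier is just a table lookup per group plus an addition, which keeps evaluation cheap.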
Learning Based Feature Detectors
● Many kinds of local cues are informative, but responses are typically strongly correlated
● Naïve Bayes feature combination doesn't work well, but we can learn to combine cues to produce a stronger detector
● e.g. Maximum Entropy learning of distribution, ML of presence decisions
Maximum Entropy Learning
● Models joint distribution by matching predicted and empirical 1D projections (e.g. histograms of linear filter responses)
[Sidenbladh & Black]
Local Descriptor Methods
● Represent image as a set of descriptors over local image regions (patches)
● Patches contain a lot of information about image content
● Locality reduces interference from
– occlusion & clutter
– lighting variations (local normalization)
– global effects of changes in form or viewpoint
● But it fragments the scene – global form is harder to see
“Texton” / “Bag of Features” Image Classification
● Classify images by their distributions of local patch appearances
– Sample patches densely, randomly, at salient interest points...
– Characterize appearance using any local descriptor (e.g. SIFT)
– Characterize descriptor set or distribution by vector quantizing descriptors against a large dictionary of patches and histogramming results
– Learn classification rules for classes of images using ML over the BoF histograms
● Inter-patch relationships and global image structure are ignored
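The vector quantization and histogramming step can be sketched as follows. The dictionary is assumed to be given (for example from clustering training descriptors), nearest-neighbour assignment in Euclidean distance is assumed, and the two-dimensional toy descriptors stand in for real ones such as SIFT.

```python
import numpy as np

def bof_histogram(descriptors, dictionary):
    """Assign each descriptor to its nearest codeword and return the normalized counts."""
    descriptors = np.asarray(descriptors, dtype=float)
    dictionary = np.asarray(dictionary, dtype=float)
    # Squared Euclidean distance from every descriptor to every codeword
    d2 = ((descriptors[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(dictionary)).astype(float)
    return hist / max(hist.sum(), 1.0)

dictionary = [[0, 0], [10, 10], [0, 10]]
descriptors = [[1, 0], [9, 9], [11, 10], [0, 1]]
hist = bof_histogram(descriptors, dictionary)
```

The resulting histogram discards where each patch came from, which is exactly the loss of inter-patch structure noted above; a classifier is then trained on these fixed-length vectors.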
Extremely Randomized Clustering Forests
● Instead of vector quantization, quantize against an ensemble of discriminatively trained random decision trees
● Each leaf of each tree has a separate bin
● Then learn linear SVM classifier over these bins
● Fast and works very well
Object Localization in Bag of Features Models
● BoF models work surprisingly well for content based image classification because certain patches are very characteristic of certain object classes
– e.g. this can be seen in the linear SVM weights
● We can use these to approximately localize the object
– iterate updating the location mask and using it to remake histogram
Bicycle localization with randomized forest features
Local Feature Based Pedestrian Detector
Combines
● bottom-up local cues from bag of interest point recognition
● probabilistic top-down segmentation for good handling of occlusions
(Leibe & Schiele, CVPR'05)
Implicit Shape Model - Leibe and Schiele, 2003
Backprojected Hypotheses
Interest Points
Matched Codebook Entries
Probabilistic Voting
Voting Space (continuous)
Backprojection of Maxima
Segmentation
Refined Hypotheses (uniform sampling)
Leibe and Schiele, 2003, 2005
Learning Surface Orientation & Type
● Learning based features (detectors) for vertical, horizontal left/right/centre facing and “porous” vs. solid surfaces
– logistic AdaBoosted decision trees over a large set of local cues
Using Geometric Context to Aid Detection
● Making sense of city scenes by combining surface orientation cues, object detector responses, horizon estimates
Image
P(object | surfaces, viewpoint)
P(object)
P(surfaces) P(viewpoint)
[Hoiem, CVPR'06]
Image Parsing
● Attempts to synthesize entire scenes from component models using multilevel MCMC sampling
– faces, letters, background...
[Zhu et al, 2003...]
The End