# a mixed generative-discriminative framework for pedestrian … · 2008-07-08 · a mixed...

Post on 21-Mar-2020

4 views

Embed Size (px)

TRANSCRIPT

A Mixed Generative-Discriminative Framework for Pedestrian Classification

Markus Enzweiler1 Dariu M. Gavrila2,3

1 Image & Pattern Analysis Group, Dept. of Math. and Comp. Sc., Univ. of Heidelberg, Germany2 Environment Perception, Group Research, Daimler AG, Ulm, Germany

3 Intelligent Systems Lab, Faculty of Science, Univ. of Amsterdam, The Netherlands{uni-heidelberg.enzweiler,dariu.gavrila}@daimler.com

AbstractThis paper presents a novel approach to pedestrian clas-

sification which involves utilizing the synthesized virtualsamples of a learned generative model to enhance the clas-sification performance of a discriminative model. Our gen-erative model captures prior knowledge about the pedes-trian class in terms of a number of probabilistic shapeand texture models, each attuned to a particular pedestrianpose. Active learning provides the link between the genera-tive and discriminative model, in the sense that the former isselectively sampled such that the training process is guidedtowards the most informative samples of the latter.

In large-scale experiments on real-world datasets oftens of thousands of samples, we demonstrate a signif-icant improvement in classification performance of thecombined generative-discriminative approach over thediscriminative-only approach (the latter exemplified by aneural network with local receptive fields and a support vec-tor machine using Haar wavelet features).

1. IntroductionThe ability to automatically detect pedestrians in images

is key for a number of application domains such as surveil-lance and intelligent vehicles. Large variations in pedestrianappearance (e.g. clothing, pose) and environmental condi-tions (e.g. lighting, background) make this problem partic-ularly challenging. A typical approach starts by identifyingregions of interest in the image using a computationally ef-ficient method (e.g. background subtraction, motion detec-tion, obstacle detection) and thereafter moves on to a moreexpensive pattern classification step [8, 23, 31]. In this pa-per, we focus on the classification step, see also [5, 18].

Recently, an experimental study on pedestrian classifica-tion investigated the combination of several state-of-the-artfeatures and classifiers [21]. Some combinations performed

The authors acknowledge the support of the Studienstiftung desdeutschen Volkes (German National Academic Foundation).

Figure 1. Framework overview. Utilizing the synthesized samplesof a learned generative model to enhance the classification perfor-mance of a discriminative model.

better than others, but interestingly, the benefit obtained byselecting the best combination was less pronounced than thegain obtained by increasing the training set (even thoughthe latter was already quite large, involving many thou-sands of training samples). Methods to acquire additionaltraining samples of the non-target class are commonly used[21, 23, 27], although it was observed that the performancetends to saturate fairly quickly, after a few bootstrapping it-erations. The enlargement of the training set with respectto the target class yielded more gain [21], but this usuallyrequires time consuming (and thus costly) manual labeling.

This paper proposes a novel combined generative-discriminative approach to pedestrian classification, aimedat addressing the bottleneck caused by the scarcity of sam-ples of the target class. A generative model is learned froma pedestrian dataset captured in real urban traffic and used

978-1-4244-2243-2/08/$25.00 ©2008 IEEE

to synthesize virtual samples of the target class, thus en-larging the training set of a discriminative pattern classifierat little cost. This set of virtual samples can be consideredas a regularization term to the real data to be fitted, whichincorporates prior knowledge about the target object class.The paper proposes the use of selective sampling, by meansof probabilistic active learning, to guide the training processtowards the most informative samples. See Figure 1.

The general idea is independent of the particular genera-tive and discriminative model used, and can in principle ex-tend to other object classes than pedestrians. In this paper,we propose a generative model which consists of a numberof probabilistic shape and texture models, each attuned toa generic object pose. For this, we require the existence ofa registration method amongst samples associated with thesame generic pose. Our use of active learning furthermorerequires a confidence measure associated with the output ofthe discriminative model, but this assumption is easily metin practice.

2. Previous WorkA sizeable body of literature exists on the acquisition of

object models from training data. Roughly, these modelscan be categorized into generative and discriminative mod-els, see [29]. Generative models capture the (unknown) datageneration process by representing the appearance of an ob-ject class in terms of its class-conditional density function.Using associated prior distributions, posterior probabilitiesfor object classification can be inferred using Bayes rule. Incontrast, discriminative models directly approximate poste-rior probabilities by learning the parameters of a discrim-inant function (decision boundary) between object classes.Regarding object classification, discriminative models aretypically faster and more robust with regard to the predic-tion of class labels. On the other hand, generative modelscan handle partially labeled data and allow to synthesizenew examples [29].

A main representative of generative models are linearshape models [3], learned from a set of training shapes,given the existence of an appropriate registration method.Extensions to cover non-rigid shape deformations involvedusing a single global shape model in combination with non-linear PCA techniques [25, 26] or a probabilistic model[4], as well as the use of multiple local-linear shape mod-els [7, 11]. Other work considered joint linear shape andtexture models [3, 6, 15] to capture underlying dependen-cies in a principled way. For this, texture models are de-rived from shape-normalized examples. The downside ofthe joint shape-texture approach is that it requires data nor-malization procedures and that it results in a significantlyincreased feature space dimensionality. Recently, layeredshape-texture models have been introduced to add addi-tional robustness against missing features and substantial

Authors Shape Model Texture Model SamplePlausibility

Cootes et al. [3]Fan et al. [6]

Jones et al. [15]

global linear(PCA)

global linear(PCA)

limit ondeviation from

meanJones et al. [14] multi-layer

global linear(weighted PCA)

multi-layerglobal linear(weighted PCA)

limit ondeviation from

meanGavrila et al. [7]Heap et al. [11]

pose-specific lin-ear (PCA)

- limit ondeviation from

meanRomdhani et al.

[25]global non-linear(Kernel PCA)

- limit ondeviation from

meanSozou et al. [26] global non-linear

(polynomialregression)

- limit ondeviation from

meanCootes et al. [4] global linear

(PCA)- probabilistic

(GMM)current paper pose-specific

linear (PCA)pose-specificlinear (PCA),decomposed

probabilistic(KDE)

Table 1. Overview of existing and proposed generative shape andtexture models.

spatial rearrangement of object parts [14]. See Table 1.While generative models implicitly establish a feature-

space tailored to the object class under consideration, dis-criminative approaches to object classification involve acombination of a feature extraction method, e.g. local re-ceptive fields, Haar wavelets, and a pattern classificationtechnique, e.g. neural networks, support vector machines,[13, 21].

Several techniques which combine generative and dis-criminative models have recently been proposed [17, 20, 28,33]. Discriminative models have been employed to learn agenerative model in an iterative fashion [28]. One line ofresearch has been concerned with designing objective func-tions which incorporate both generative and discriminativeterms, where their balance is controlled by both heuristic[20] and probabilistic [17] weighting schemes. Further,likelihood ratios in generative models have been replacedby more powerful discriminative models [33].

Aside from the particular models used, incorporatingprior knowledge about the target class has been suggestedas a way out of the bias-variance dilemma to increaserobustness [22]. Prior knowledge can be both incorpo-rated directly into the error function of a discriminativemodel (vicinal risk minimization) [30] and during trainingin terms of enlarging the training set with additional sam-ples [21, 23, 24, 27, 30, 31]. While virtual samples of thenon-target class can be easily collected using bootstrapping[21, 23, 27, 31], acquiring additional target class samplesis typically burdensome. Besides the trivial approach of la-borious manual labeling, a number of techniques to synthe-size virtual patterns of the target-class have been proposed.Some require controlled data acquisition (e.g. same indi-vidual with respect to changes in viewpoint, facial expres-sion and lighting) to obtain prototypical images to be lin-early combined [1, 2, 9]. Others utilize explicit 3D models

[12]. If such prerequisites cannot be satisfied, the synthesisof virtual examples has been limited to simple geometricand photometric jittering in terms of adding mirrored, ro-tated, shifted or intensity-manipulated versions of the orig-inal training patterns [21, 24, 27, 30].

Finally, relevant to current work are sampling techniqueswhich assess the information content of training samplesand select the most informative training examples by boot-strapping [21, 23, 27, 31] or active learning [10, 16, 19].

The contributions of this paper are twofold. We con-sider our main contribution to be the novel framework il-lustrated in Figure 1. A learned generative model is usedto enhance the performance of a discriminative model interms of synthesizing virtual training samples combinedwith active learning. This is quite unlike previous com-bination strategies for generative and discriminative mod-els [17, 20, 28, 33] and unlike previous applications of ac-tive learning. We neither require controlled data acquisition[1, 2, 9], nor do we have 3D models [12] to our disposition.At the same time, we go beyond the synthesis of samplesbased on simple transformations [21, 24, 27, 30] and takeinto account sample probabilities.

A secondary contribution concerns the generative pedes-trian model proposed, see Table 1. Similar to [7, 11], our ap-proach uses separate feature-spaces to model topologicallydiverse shapes (e.g. pedestrian with feet apart and with feetclosed), in order to increase model specificity. However,we extend the shape representation of [7, 11] with a texturecomponent, distinguishing between texture variations at thecoarse and the detail level. We establish a statistical shape-texture model along with the associated class-conditionaldensity functions. This provides a sound basis for the syn-thesis of virtual pedestrian samples by means of three com-ponents: foreground shape, foreground texture and back-ground texture.

3. Generative Pedestrian ModelInput to our pedestrian model is a set D of pedestrians

(xi, ω0) ∈ D with class label ω0. We apply an integratedshape registration and clustering approach [7] to obtain a setof K view-specific clusters, Ψk, from the shapes underlyingD, with prototype shapes pk (we use K = 12 in the exper-iments). See Figure 1. Let xk,i denote the i-th example inthe k-th pose-specific cluster Ψk, with i = 1, . . . , Nk. Apedestrian sample xk,i = (sk,i, tk,i) ⊕ bk,i is representedas the composition ⊕ of a foreground texture tk,i over abackground bk,i, partitioned by a discrete shape contoursk,i.

The introduction of pose-specific feature-spaces Ψk ef-fectively reduces correlations between pedestrian textureand their pose or heading. Within each pose-specific space,a generative model is instantiated describing the pedes-trian class-conditional density function for the shape and

Figure 2. Shape registration. a) - c) automatically determined con-tour point correspondences, d) Delaunay triangulation

foreground texture component separately. Foreground andbackground are assumed uncorrelated, thus the backgroundtexture component bk is not included into the generativemodel.

We now outline the learning procedure for the proposedpose-specific generative pedestrian shape-texture model in-volving the setup of separate shape and texture model-spaces, as well as the estimation of the class-conditionaldensities therein.

Shape Model-Space As a result of shape registration, [7],it is possible to embed the shapes within a cluster Ψk intoa common feature-space. The features involve the pixel co-ordinates of corresponding points sampled at a given (arc-length normalized) distance along the contour. See Figure2a-c. PCA is applied to the shape space to obtain a com-pact representation utilizing Nsk dimensions (e.g. to model95% of the total variance). The parametric representationmsk,i of a pedestrian shape sk,i in terms of shape modelcoordinates is given by

msk,i = ΦTsk

(sk,i − s̄k) . (1)

Here, s̄k denotes the mean shape within Ψk, and Φsk is amatrix containing Nsk eigenvectors in its columns.

Foreground Texture Model-Space To establish a fore-ground texture feature-space within each cluster Ψk, all tex-ture vectors tk,i are first shape-normalized to t̂k,i by warp-ing them with respect to the cluster prototype pk, see Figure3. A Delaunay triangulation-based piecewise-affine warp-ing function Wsk,i is employed, utilizing shape correspon-dences between shape sk,i and prototype pk to map trian-gles (Figure 2):

t̂k,i = Wsk,i(tk,i) (2)

Shape-normalization can be seen as a partial linearization ofnon-linear interdependencies within each pose-specific tex-ture feature-space resulting from (slightly) different bodyposes and headings.

As done before, PCA is applied to establish a parametrictexture model-space representation of t̂k,i in terms of mean¯̂tk and eigenvectors Φt̂k :

mt̂k,i = ΦTt̂k

(t̂k,i − ¯̂tk

)(3)

Figure 3. Shape-normalized examples for a pose-specific sub-space.

Figure 4 depicts the mean texture along with the first foureigenvectors for a pose-specific texture model.

Given the scarcity of available texture samples (mean-while subdivided by pose) and the high dimensionality ofthe shape-normalized texture model-space, we cannot reli-ably establish a generative texture model to capture a size-able amount of variance (e.g. 95%), as done before forshape. Using solely a subspace spanned by fewer principalcomponents is however not a viable option, as projectionleads to subtle texture details being washed-out, which inlarge part determine pedestrian appearance. As a way out,we propose to decompose the full Nt̂k -dimensional texturemodel-space obtained by PCA into two subspaces. The firstsubspace represents coarse texture components (e.g. model-ing overall appearance of clothing parts such as trousers andcoat). Its dimensionality N ′

t̂kis selected such that a reliable

estimation of the relevant pdf from training data is possi-ble (e.g. we model 65% of the total variance). The secondand complementary subspace captures fine texture compo-nents. Here no pdf estimation takes place, for synthesis (seeSection 4) the associated entries are derived from particulartraining samples.

Hence, the parametric model-space representation mt̂k,i(cf. Eq. (3)) of a shape-normalized texture vector t̂k,i isdecomposed into:

mt̂k,i =(m′

t̂k,i,m′′

t̂k,i

)(4)

with

m′t̂k,i

=(mt̂k,i,1, . . . ,mt̂k,i,N ′t̂k

)(5)

m′′t̂k,i

=(mt̂k,i,N ′t̂k+1

, . . . ,mt̂k,i,Nt̂k)

(6)

Class-Conditional Density Estimation After establish-ing K pose-specific shape and shape-normalized texturemodel-spaces, we estimate the class-conditional densitiespsk

(msk |ω0

)and pt̂k

(m′

t̂k|ω0

)with respect to the pedes-

trian class ω0 within each subspace. In preliminary ex-periments, we found Gaussian Kernel Density Estimation(KDE) to outperform Gaussian Mixture Models (GMM),based on the likelihood of model-fit.

Temporarily dropping the distinction between shape skand texture t̂k, the Kernel Density estimate of the class-

Figure 4. Mean texture and eigenvectors for a pose-specific texturemodel (background masked out).

conditional densities is given by:

pk (m |ω0) =1

Nk

Nk∑n=1

1det(H)

K{H−1(m −mn)

}(7)

where K denotes the kernel function and H represents adiagonal matrix containing kernel bandwidths. We useanisotropic multivariate Gaussian kernels K, with band-widths optimized via maximum likelihood on the trainingset [13], for both the shape and shape-normalized texturespace, respectively.

The class-conditional density functions psk(msk |ω0

)and pt̂k

(m′

t̂k|ω0

)provide the basis for the proposed syn-

thesis of virtual pedestrians. As opposed to [3, 6, 15], whereplausibility has been enforced by limiting the deviation ofthe model coordinates from the mean (which does not ex-tend to a multimodal distribution), the probabilistic formu-lation allows for a direct assessment of plausibility for agiven shape or texture vector.

4. Model-Based Virtual Pedestrian SynthesisThe model-based synthesis of virtual pedestrian samples

utilizing the proposed pose-specific generative shape andtexture models involves the variation of three components:shape, foreground texture and background texture. See Fig-ure 5 for an overview.

Shape Variation Model coordinates m∗sk,j representinga new virtual shape s∗k,j can be sampled directly from thegenerative shape model psk

(msk |ω0

):

m∗sk,j ∼ psk(msk |ω0

)(8)

Sampling the KDE estimate of psk(msk |ω0

)involves uni-

formly selecting the j-th example msk,j in model-space andsampling from the local kernel K, centered at msk,j . Plau-sibility of the virtual shape model coordinates is enforcedby requiring psk

(m∗sk,j |ω0

)> csk , with csk a threshold pa-

rameter learned from the distribution of the training set sothat the large majority of training samples (e.g. 99%) arecovered.

Transforming m∗sk,j from shape model-space back to theshape feature-space yields a new virtual shape contour:

s∗k,j = s̄k + Φskm∗sk,j

(9)

Figure 5. Overview of the proposed model-based pedestrian synthesis procedure within a pose-specific cluster Ψk. Existing pedestrianexamples are projected onto a generative shape-texture model which is re-sampled to create virtual pedestrian samples.

The virtual shape s∗k,j is utilized to warp an existing pedes-trian example into a new shape, as shown in Figure 6b.

Foreground Texture Variation Regarding the synthesisof virtual texture samples for the pedestrian class, we uti-lize the proposed decomposed representation of the shape-normalized texture space in terms of coarse and detailedcomponents, as outlined in Section 3. The main idea is,to employ the main modes of variation to control coarse ap-pearance variations (e.g. individual clothing parts or globalillumination) and induce pose-specific effects of differenttypes of wear (e.g. closed coat vs. coat-shirt pattern, seeFigure 4 mode 2 vs. mode 4, respectively), while at thesame time retaining fine-scales details (e.g. internal bodyor face contours), which are crucial for pedestrian appear-ance.

Hence, to obtain virtual shape-normalized texture pa-rameters m∗

t̂k,j, we first sample model parameters pertain-

ing to coarse texture components m′∗t̂k,j

from the generative

texture model pt̂k(m′

t̂k|ω0

), by uniformly selecting the j-th

example in model-space and sampling from the local kernel:

m′∗t̂k,j

∼ pt̂k(m′

t̂k|ω0

)(10)

Similar to the way the shape component is addressed, plau-sibility is enforced by applying a coverage threshold ct̂k(e.g. 99% coverage), with pt̂k

(m

′∗t̂k,j

|ω0)

> ct̂k . Model pa-rameters m′′

t̂k,jrepresenting the original shape-normalized

texture details of the j-th example mt̂k,j are retained andcombined with the synthesized coarse model coordinatesm

′∗t̂k,j

to yield (cf. Eq. (4)):

m∗t̂k,j

=(m

′∗t̂k,j

,m′′t̂k,j

)(11)

Thereafter, m∗t̂k,j

is projected from the model-space back tothe feature-space of shape-normalized texture:

t̂∗k,j =¯̂tk + Φt̂km

∗t̂k,j

(12)

Finally, the inverse of the shape-normalization operator,W−1s∗k,j

, is applied to warp the virtual shape-normalized tex-

ture t̂∗k,j to a shape s∗k,j (which can be a new virtual shape

or an existing shape) within the same pose-specific model(cf. Eq. (2)):

t∗k,j = W−1s∗k,j

(̂t∗k,j) (13)

An example of this technique is depicted in Figure 6c-e.Note how fine-scale details, e.g. the internal contour of theright arm (Figure 6c-e, first row) are preserved, while theoverall texture exhibits sensible variations.

Background Texture Variation The background texturecomponent is assumed independent from pedestrian ap-pearance and is represented by a non-parametric exemplar-based model. Virtual background texture vectors b∗k,j areuniformly sampled U from a set of non-pedestrian imagesB that can be obtained at low cost:

b∗k,j ∼ U(B) (14)

Application-specific constraints regarding likely target lo-cations (e.g. flat-world assumption, people standing on theground) can be incorporated at this point.

Joint Variation and Compositing Joint variation ofshape, foreground and background texture involves sam-pling virtual examples for each component. Virtual tex-ture t∗k,j is sampled from the generative texture modelpt̂k

(m′

t̂k|ω0

)(cf. Eqs. (10)-(13)) and warped to a vir-

tual shape s∗k,j , sampled from the generative shape modelpsk

(msk |ω0

)(cf. Eqs. (8)-(9)). Finally, background b∗k,j

is sampled from the the non-parametric background model(cf. Eq. (14)) and a virtual pedestrian example x∗k,j is ob-tained by compositing the textured pedestrian shape overthe background, see Figure 6:

x∗k,j = (s∗k,j , t

∗k,j) ⊕ b∗k,j (15)

5. Probabilistic Selective SamplingA probabilistic least-certain querying scheme, an in-

stance of an active learning algorithm [10, 16, 19], is uti-lized to directly link the discriminative with the generativemodel in terms of assessing the information content of vir-tual pedestrian samples. Resampling a generative model al-lows to create a virtually infinite number of training sam-ples for a discriminative model. Here, selective sampling

Figure 6. Example of virtual pedestrian synthesis. a) originalpedestrian examples, b) shape variation, c) foreground texturevariation, d) - e) joint variation of shape, foreground and back-ground texture

becomes a necessity to remove redundancy from the train-ing set and focus the resources of the discriminative learn-ing procedure on the examples with the highest informationcontent. In classification tasks, there exists a region of un-certainty RD, where the classification result is not unam-biguously defined (see the hatched area in Figure 1, ActiveLearning). That is, the discriminative model can learn amultitude of decision boundaries which are consistent withthe given training patterns, but yet disagree in some regionsof the decision space. If a sample is drawn from RD, thesize of RD and thus the global uncertainty can be reduced.

In our probabilistic least-certain querying scheme, weapproximate RD using the probability of error for each sam-ple xi. Given a two-class problem with classes ω0 (targetclass) and ω1 (non-target class), we assume the discrim-inative model to approximate posterior probabilities andto make a Bayesian decision, i.e. xi is classified as ω0,if P (ω0|xi) > P (ω1|xi). Then, the probability of errorP (error|xi) is given by

P (error|xi) = min {P (ω0|xi), P (ω1|xi)} . (16)

Obviously, P (error|xi) has a peak at P (ω0|xi) =P (ω1|xi) = 0.5, which represents the decision boundary.To base uncertainty on P (error|xi), we introduce a thresh-old Θ ∈ [0, 0.5] on P (error|xi) and consider only thosesamples xi as informative examples, where P (error|xi) >Θ. This is equivalent to putting a threshold on the absolutedifference of the posterior probabilities:

0 ≤ |P (ω0|xi)− P (ω1|xi)| ≤ 1− 2Θ (17)

Hence, the approximation of the region of uncertainty RDis defined as a symmetric region centered at P (ω0|x) =P (ω1|x) = 0.5, the decision boundary of the discriminativemodel. This technique requires an estimate of the underly-ing (unknown) probabilities. The outputs of many state-of-the-art classifiers, e.g. neural networks or support vector

Pedestrians Pedestrians Non-(labeled) (jittered) Pedestrians

Init. Train Set 10946 43784 82698Test Set 13971 251478 133813

Table 2. Training and test set statistics.

machines can be converted to an estimate of posterior prob-abilities [13, 16, 19]. We use this in our experiments.

The aforementioned selective sampling strategy is usedin an iterative scheme to link the training of the discrimina-tive model with the generative pedestrian synthesis. In eachiteration l, the set of virtual examples D∗l is resampled toD̂∗l by retaining only the informative samples x

∗j ∈ D∗l , as

evaluated by the discriminative model trained on Dl, usingEq. (17). Finally, the discriminative model is retrained onthe joint dataset Dl+1 = Dl ∪ D̂∗l .

6. ExperimentsThe proposed generative-discriminative framework was

tested in large-scale experiments on pedestrian classifica-tion. Our purpose is not to establish the best absolute clas-sification performance amongst the various state-of-the-artmethods (Section 2). Rather, our aim is to examine the rel-ative performance gain that can be obtained by using theproposed mixed generative-discriminative framework overa particular discriminative-only approach. To illustrate thegenerality with respect to the discriminative model used, weconsidered two diverse instances: a neural network with lo-cal receptive fields of size 5 × 5 pixels (NN/LRF) [32] anda linear1 support vector machine using Haar wavelet fea-tures at scales of 4 × 4 and 8 × 8 pixels (Haar SVM) [23].Results are expected to generalize to other pedestrian clas-sifiers that are sufficiently complex to represent the largetraining datasets e.g. [5, 13, 18].

See Table 2 for the datasets used. Training and test setscontain manually labeled pedestrian bounding boxes withadditional contour labels for the training set. All trainingsamples are scaled to 18 × 36 pixels with a two-pixel bor-der in order not to lose contour information. The sampleswere acquired in daylight conditions from a moving vehicleand depict non-occluded pedestrians in front of a changingbackground. The non-pedestrian samples were the resultof a pedestrian shape detection pre-processing step with re-laxed threshold setting, i.e. containing a bias towards more”difficult” patterns, similar to [21]. Training and test setwere strictly separated: no instance of the same real-worldpedestrian appears in both training and test set, similarly forthe non-target samples. See Figure 7 for some examples ofthe dataset. Discriminative models trained on this datasetare referred to as base classifiers.

1training a non-linear SVM on our large datasets was not feasible dueto excessive memory requirements

a) b)

Figure 7. Dataset overview. a) training set examples, b) test set ex-amples. Top and bottom rows show target and non-target samples,respectively.

We examine the effect of introducing jittering to pedes-trian training samples; this represents the applicable state-of-the-art, see Section 2. Geometric jittering is introducedin terms of creating four patterns from each pedestrian sam-ple in the training set by applying a random shift (±2 pix-els) and mirroring. Since we employ contrast normalizationduring training of the classifiers, photometric jittering is notconsidered. Discriminative models utilizing this dataset arereferred to as jittered classifiers.

In all experiments with our mixed generative-discriminative framework (Figure 1), we perform severaliterations of virtual sampling and discriminative modelretraining, up to performance saturation. In each suchiteration, the training set is extended by 10946 synthesizedpedestrians (plus additional four jittered versions of eachvirtual pedestrian), guided by selective sampling (Eq. (17)),with Θ = 0.35. For the case of non-targets, we performa similar iterative dataset extension approach (4 × 10946samples, now obtained by selective sampling on images notcontaining targets, without jittering).

In a first experiment with a NN/LRF classifier (Figure8a), the number of non-target training samples is kept con-stant and the benefit of jittering and virtual pedestrian syn-thesis is studied. From Figure 8a one observes that jitter-ing leads to a significant performance improvement overthe base classifier (more jittered samples did not yield fur-ther improvement). Yet we obtained additional performancegain using the proposed framework, by incrementally incor-porating shape, foreground and background texture varia-tion.

Furthermore, we compare target-class resampling in-volving joint shape, foreground and background variation(the best performing synthesis variant in Figure 8a) to non-target class resampling, see Figure 8b and 8c. The total per-formance gain by adding non-target training samples onlyis significant, yet less than in the case of augmenting thepedestrian set only (Figure 8b and 8c, magenta vs. greencurve). Best performance is reached by joint augmentationof the pedestrian and non-pedestrian class. This variant sat-urated after three iterations, compared to two iterations forall others.

a)

b)

c)

Figure 8. ROC performance for classification experiments. a) vir-tual pedestrian synthesis (NN/LRF), b)+c) target class vs. non-target class resampling for NN/LRF and Haar SVM

For comparison, we added 10946 real pedestrian sam-ples plus four jittered versions, manually labeled from anauxiliary data pool, to the base dataset (without syntheticsamples and active learning). Remarkably, the proposedgenerative-discriminative framework even outperforms themanual approach (see Figure 8b and 8c, green vs. red cir-cled curve). This is not an aberration caused by overfit-ting; the datasets used are truly large. Rather, it is the

consequence of the fact that, although the manually la-beled samples are more realistic, they are not necessarilymore informative (we tediously label samples that the clas-sifier already knows). Of course, the aim of our proposedgenerative-discriminative framework is to avoid this addi-tional manual labeling in the first place.

We finally note that, although absolute performances forthe two considered discriminative models are different, therelative order in which the various resampling techniquesperform is identical, see Figure 8b vs. Figure 8c.

7. ConclusionThis paper presented a novel framework for pedestrian

classification which involves utilizing the synthesized sam-ples of a learned generative model to enhance the classifi-cation performance of a discriminative model. In extensiveexperiments, we obtained the non-trivial result that classifi-cation performance is substantially enhanced by the aug-mented training set; the false positive rate of the mixedgenerative-discriminative approach was reduced by up to afactor of two compared to discriminative-only approach, atthe same detection rate. Our approach also outperformedclassifiers bootstrapped by non-target data or by jitteredsamples of the target class. We take this as evidence of thestrength of our generative pedestrian model and selectivesampling method. Future work involves applying the pro-posed framework to other object classes.

References[1] D. Beymer and T. Poggio. Face recognition from one exam-

ple view. In Proc. ICCV, pages 500–507, 1995.[2] H.-P. Chiu et al. Virtual training for multi-view object class

recognition. In Proc. CVPR, 2007.[3] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appear-

ance models. IEEE PAMI, 23(6):681–685, 2001.[4] T. F. Cootes and C. J. Taylor. A mixture model for represent-

ing shape variation. IVC, 17(8):567–574, 1999.[5] N. Dalal and B. Triggs. Histograms of oriented gradients for

human detection. In Proc. CVPR, pages 886–893, 2005.[6] L. Fan, K.-K. Sung, and T.-K. Ng. Pedestrian registration

in static images with unconstrained background. PatternRecognition, 36:1019–1029, 2003.

[7] D. M. Gavrila and J. Giebel. Virtual sample generation fortemplate-based shape matching. In Proc. CVPR, pages 676–681, 2001.

[8] D. M. Gavrila and S. Munder. Multi-cue pedestrian detec-tion and tracking from a moving vehicle. IJCV, 73(1):41–59,2007.

[9] A. S. Georghiades et al. From few to many: Illuminationcone models for face recognition under variable lighting andpose. IEEE PAMI, 23(6):643–660, 2001.

[10] M. Hasenjaeger and H. Ritter. Active learning in neural net-works. New learning paradigms in soft computing, pages137–169, 2002.

[11] T. Heap and D. Hogg. Improving specificity in PDMs usinga hierarchical approach. In Proc. BMVC, pages 80–89. A. F.Clark (ed.), 1997.

[12] B. Heisele et al. Categorization by learning and combiningobject parts. In NIPS, pages 1239–1245, 2001.

[13] A. K. Jain, R. P. W. Duin, and J. Mao. Statistical patternrecognition: A review. IEEE PAMI, 22(1):4–37, 2000.

[14] E. Jones and S. Soatto. Layered active appearance models.In Proc. ICCV, volume 2, pages 1097– 1102, 2005.

[15] M. J. Jones and T. Poggio. Multidimensional morphablemodels. In Proc. ICCV, pages 683–688, 1998.

[16] A. Kapoor et al. Active learning with Gaussian processes forobject categorization. In Proc. ICCV, 2007.

[17] J. A. Lasserre, C. M. Bishop, and T. P. Minka. Principledhybrids of generative and discriminative models. In Proc.CVPR, 2006.

[18] B. Leibe, E. Seemann, and B. Schiele. Pedestrian detectionin crowded scenes. In Proc. CVPR, pages 878–885, 2005.

[19] M. Li and I. K. Sethi. Confidence-based active learning.IEEE PAMI, 28(8):1251–1261, 2006.

[20] A. McCallum et al. Multi-conditional learning: Genera-tive/discriminative training for clustering and classification.In Proc. AAAI, 2006.

[21] S. Munder and D. M. Gavrila. An experimental study onpedestrian classification. IEEE PAMI, 28(11):1863–1868,2006.

[22] P. Niyogi, F. Girosi, and T. Poggio. Incorporating prior in-formation in machine learning by creating virtual examples.In IEEE Proc. Int. Sig. Proc., pages 2196–2209, 1998.

[23] C. Papageorgiou and T. Poggio. A trainable system for objectdetection. IJCV, 38:15–33, 2000.

[24] D. Pomerleau. Neural network vision for robot driving. InThe Handbook of Brain Theory and Neural Networks. M.Arbib, ed., 1995.

[25] S. Romdhani, S. Gong, and A. Psarrou. A multi-view non-linear active shape model using kernel PCA. In Proc. BMVC,pages 483–492. A. F. Clark (ed.), 1999.

[26] P. D. Sozou et al. A non-linear generalisation of PDMs us-ing polynomial regression. In Proc. BMVC, pages 397–406,1994.

[27] K.-K. Sung and T. Poggio. Example-based learning for view-based human face detection. IEEE PAMI, 20(1):39–51, 1995.

[28] Z. Tu. Learning generative models via discriminative ap-proaches. In Proc. CVPR, 2007.

[29] I. Ulusoy and C. M. Bishop. Generative versus discrimina-tive methods for object recognition. In Proc. CVPR, pages258–265, 2005.

[30] A. Vedaldi, P. Favaro, and E. Grisan. Boosting invarianceand efficiency in supervised learning. In Proc. ICCV, 2007.

[31] P. Viola, M. Jones, and D. Snow. Detecting pedestrians usingpatterns of motion and appearance. IJCV, 63(2):153–161,2005.

[32] C. Wöhler and J. K. Anlauf. A time delay neural network al-gorithm for estimating image-pattern shape and motion. IVC,17:281–294, 1999.

[33] D.-Q. Zhang and S.-F. Chang. A generative-discriminativehybrid method for multi-view object detection. In Proc.CVPR, 2006.