

Unsupervised and Transfer Learning under Uncertainty: from Object Detections to Scene Categorization

Grégoire Mesnil (1,2), Salah Rifai (1), Antoine Bordes (3), Xavier Glorot (1), Yoshua Bengio (1) and Pascal Vincent (1)

(1) LISA, Université de Montréal, Québec, Canada
(2) LITIS, Université de Rouen, France
(3) CNRS - Heudiasyc UMR 7253, Université de Technologie de Compiègne, France
[email protected]

Keywords: Unsupervised Learning, Transfer Learning, Deep Learning, Scene Categorization, Object Detection

Abstract: Classifying scenes (e.g. into “street”, “home” or “leisure”) is an important but complicated task nowadays, because images come with variability, ambiguity, and a wide range of illumination or scale conditions. Standard approaches build an intermediate representation of the global image and learn classifiers on it. Recently, it has been proposed to depict an image as an aggregation of the objects it contains: the representation on which classifiers are trained is composed of many heterogeneous feature vectors derived from various object detectors. In this paper, we study different approaches to efficiently combine the data extracted by these detectors. We use the features provided by Object-Bank (Li-Jia Li and Fei-Fei, 2010a) (177 different object detectors producing 252 attributes each), and show on several benchmarks for scene categorization that careful combinations, taking into account the structure of the data, greatly improve over the original results (from +5% to +11%) while drastically reducing the dimensionality of the representation by 97% (from 44,604 to 1,000).

1 Introduction

Automatic scene categorization is crucial for many applications such as content-based image indexing (Smeulders et al., 2000) or image understanding. It is defined as the task of assigning images to predefined categories (“office”, “sailing”, “mountain”, etc.). Classifying scenes is complicated because of the large variability in quality, subject and conditions of natural images, which leads to many ambiguities w.r.t. the corresponding scene label.

Standard methods build an intermediate representation before classifying scenes by considering the image as a whole (Torralba, 2003; Vogel and Schiele, 2004; Fei-Fei and Perona, 2005; Oliva and Torralba, 2006). In particular, many such approaches rely on power spectral information, such as the magnitude of spatial frequencies (Oliva and Torralba, 2006), or on local texture descriptors (Fei-Fei and Perona, 2005). They have been shown to perform well in cases where there are large numbers of scene categories.

Another line of work shows promising potential for scene categorization. First applied to object recognition (Farhadi et al., 2009), attribute-based methods have now proved to be effective for dealing with complex scenes. These models define high-level representations by combining semantic lower-level elements, e.g., detections of object parts. A precursor of this tendency for scenes was an adaptation of pLSA (Hofmann, 2001) to deal with “visual words”, proposed by (Bosch et al., 2006). An extension of this idea consists in modeling an image based on its content, i.e., its objects (Espinace et al., 2010; Li-Jia Li and Fei-Fei, 2010a). Hence, the Object-Bank (OB) project (Li-Jia Li and Fei-Fei, 2010b) aims at building high-dimensional, over-complete representations of scenes (of dimension 44,604) by combining the outputs of many object detectors (177) taken at various poses, scales and positions in the original image (leading to 252 attributes per detector). Experimental results indicate that this approach is effective, since simple classifiers such as Support Vector Machines trained on these representations achieve state-of-the-art performance. However, this approach suffers from two flaws: (1) the curse of dimensionality (a very large number of features) and (2) individual object detectors have poor precision (30% at most).


To solve (1), the original paper proposes to use structured norms and group sparsity to make the best use of the large input. Our work studies new ways to combine the very rich information provided by these multiple detectors, dealing with the uncertainty of the detections. A method designed to select and combine the most informative attributes would be able to carefully manage redundancy, noise and structure in the data, leading to better scene categorization performance.

Hence, in the following, we propose a sequential two-step strategy for combining the feature representations provided by the OB object detectors, on which a linear SVM classifier is then trained to categorize scenes. The first step adapts Principal Components Analysis (PCA) to this particular setting: we show that it is crucial to take into account the structure of the data in order for PCA to perform well. The second step is based on Deep Learning. Deep Learning has emerged recently (see (Bengio, 2009) for a review) and is based on neural network algorithms able to discover data representations in an unsupervised fashion (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007; Kavukcuoglu et al., 2009; Jarrett et al., 2009). We propose to use this ability to combine multiple detector features. Hence, we present a model trained using Contractive Auto-Encoders (Rifai et al., 2011b; Rifai et al., 2011a), which have already proved efficient on many image tasks and contributed to winning a transfer learning challenge (Mesnil et al., 2012).

We validate the quality of our models in an extensive set of experiments in which several setups of the sequential feature extraction process are evaluated on benchmarks for scene classification (Lazebnik et al., 2006; Li and Fei-Fei, 2007; Quattoni and Torralba, 2009; Xiao et al., 2010). We show that our best results substantially outperform the original methods developed on top of OB features, while producing representations of much lower dimension. The performance gap is usually large, indicating that advanced combinations are highly beneficial. We also show that our method, based on dimensionality reduction followed by deep learning, offers a flexibility that allows it to benefit from semi-supervised and transfer learning.

2 Scene Categorization with Object-Bank

Let us begin by introducing the approach of the OB project (Li-Jia Li and Fei-Fei, 2010a). First, the 177 most useful (or frequent) objects were selected from popular image datasets such as LabelMe (Russell et al., 2008), ImageNet (Deng et al., 2009) and Flickr. For each of these 177 objects, a specific detector, existing in the literature (Felzenszwalb et al., 2008; Hoiem et al., 2005), was trained. Every detector is composed of 2 root filters depending on the pose, each one coming with its own deformable pattern of parts; e.g., there is one root filter for the front view of a bike and one for the side view. These 354 = 177 × 2 part-based filters (each composed of a root and its parts) are used to produce features of natural images. For a given image, a filter is convolved at 6 different scales. At each scale, the max response among 21 = 1 + 4 + 16 positions (whole image, quadrants, quadrants within each quadrant) is kept, producing a response map of dimension 126 = 6 × 21. All 2 × 177 maps are finally concatenated to produce an over-complete representation x ∈ R^44,604 of the original image.
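To make the layout of this representation concrete, here is a small illustrative sketch (not the original OB code; the detector response maps are simulated with random arrays, and the helper name pool_filter_responses is hypothetical) of how max-pooling each filter's responses over 6 scales and 21 regions yields 126 values per filter and 44,604 values overall:

    import numpy as np

    # Illustrative layout of the OB representation: 177 objects x 2 pose filters,
    # each convolved at 6 scales, keeping the max response over 21 regions
    # (whole image + 4 quadrants + 16 sub-quadrants).
    N_OBJECTS, N_POSES, N_SCALES = 177, 2, 6

    def pool_filter_responses(response_maps):
        """response_maps: list of 6 per-scale 2-D response maps for one filter.
        Returns the 126-dimensional (6 scales x 21 regions) pooled vector."""
        features = []
        for resp in response_maps:
            h, w = resp.shape
            for grid in (1, 2, 4):          # whole image, quadrants, sub-quadrants
                for i in range(grid):
                    for j in range(grid):
                        block = resp[i * h // grid:(i + 1) * h // grid,
                                     j * w // grid:(j + 1) * w // grid]
                        features.append(block.max())
        return np.array(features)           # 6 * (1 + 4 + 16) = 126 values

    rng = np.random.RandomState(0)
    per_filter = [pool_filter_responses([rng.rand(32, 32) for _ in range(N_SCALES)])
                  for _ in range(N_OBJECTS * N_POSES)]
    x = np.concatenate(per_filter)
    print(x.shape)                          # (44604,) = 2 * 177 * 126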

In the original OB paper (Li-Jia Li and Fei-Fei, 2010a), classifiers for scene categorization are learned directly on these feature vectors of dimension 44,604. More precisely, C classifiers (Linear SVM or Logistic Regression) are trained in a 1-versus-all setting in order to predict the correct scene category y_category(x) among C different categories. Various strategies using structured sparsity with combinations of ℓ1/ℓ2 norms have been proposed to handle the very large input.

3 Unsupervised Feature Learning

The approach of OB for the task of scene categorization, based on specific object detectors, is appealing since it works well in practice. This suggests that a scene is better recognized by first identifying basic objects and then exploiting the underlying semantics in the dependencies between the corresponding detectors.

However, it appears that none of the individual object detectors reaches a recognition precision of more than 30%. Hence, one may question whether the ideal view that inspired this approach (and expressed above) is indeed the reason for OB's success. Alternatively, one may hypothesize that the 44,604 OB features are useful for scene categorization more because they represent high-level statistical properties of images than because they precisely report the absence/presence of objects (see Figure 1). OB tried structured sparsity to handle this feature selection, but there may be other ways, simpler or not.

This paper investigates several ways of learning higher-level features on top of the high-dimensional representation provided by OB, expecting that capturing further structure may improve categorization performance.


Figure 1: Left: Cloud. Middle: Man. Right: Television. Top: False detections. Bottom: True detections. Images from SUN (Xiao et al., 2010) for which we compute the OB representation and display the bounding box around the average position of various object detectors. For instance, the television detector can be viewed either as a television detector or as a rectangle-shape detector, i.e., it captures high-order statistical properties of the image.

Our approach employs unsupervised feature learning/extraction algorithms, i.e., generic feature extraction methods which were not developed specifically for images. We consider both standard Principal Component Analysis and Contractive Auto-Encoders (Rifai et al., 2011b; Rifai et al., 2011a). The latter is a recent machine learning method which has proved to be a robust feature extraction tool.

3.1 Principal Component Analysis

Principal Component Analysis (PCA) (Pearson, 1901; Hotelling, 1933) is the most prevalent technique for linear dimensionality reduction. A PCA with k components finds the k orthonormal directions of projection in input space that retain most of the variance of the training data. These correspond to the eigenvectors associated with the leading eigenvalues of the training data's covariance matrix. Principal components are ordered, so that the first corresponds to the direction along which the data varies the most (largest eigenvalue), and so on.

Since we will consider an auto-encoder variant (presented next), we should mention here a well-known result: a linear auto-encoder with k hidden units, trained to minimize squared reconstruction error, will learn projection directions that span the same subspace as a k-component PCA (Baldi and Hornik, 1989). However, the regularized non-linear auto-encoder variant that we consider below is capable of extracting qualitatively different, and usually more useful, non-linear features.

3.2 Contractive Auto-Encoders

Contractive Auto-Encoders (CAEs) (Rifai et al., 2011b; Rifai et al., 2011a) are among the latest developments in a line of machine learning research on non-linear feature learning methods that started with the success of Restricted Boltzmann Machines (Hinton et al., 2006) for pre-training deep networks and was followed by other variants of auto-encoders such as sparse (Ranzato et al., 2007; Kavukcuoglu et al., 2009; Goodfellow et al., 2009) and denoising auto-encoders (Vincent et al., 2008). The CAE was selected here mainly due to its practical ease of use and recent empirical successes.

Unlike PCA, which decomposes the input space into leading global directions of variation, the CAE learns features that capture local directions of variation (in some regions of input space). This is achieved by penalizing the norm of the Jacobian of a latent representation h(x) with respect to its input x at training samples. Rifai et al. (Rifai et al., 2011b) show that the resulting features provide a local coordinate system for a low-dimensional manifold of the input space. This corresponds to an atlas of charts, each corresponding to a different region in input space, associated with a different set of active latent features. One can think of this as being similar to a mixture of PCAs, each computed on a different set of training samples that were grouped together using a similarity criterion (and corresponding to a different input region), but without using an independent parametrization for each component of the mixture, i.e., allowing generalization across the charts and away from the training examples.

In the following, we summarize the formulation of the CAE as a regularized extension of a basic Auto-Encoder (AE). In our experiments, the parametrization of this AE consists of a non-linear encoder or latent representation h of m hidden units, with a linear decoder or reconstruction g back towards an input space of dimension d.

Formally, the latent variables are parametrized by

h(x) = s(Wx + b_h),   (1)

where s is the element-wise logistic sigmoid s(z) = 1 / (1 + e^{-z}), and W ∈ M_{m×d}(R) and b_h ∈ R^m are the parameters to be learned during training. Conversely, the units of the decoder are linear projections of h(x) back into the input space:

g(h(x)) = W^T h(x).   (2)

Using the mean squared error as the reconstruction objective and the L2-norm of the Jacobian of h with respect to x as regularization, training is carried out by minimizing the following criterion by stochastic gradient descent:

J_{CAE}(\Theta) = \sum_{x \in D} \| x - g(h(x)) \|^2 + \lambda \sum_{i=1}^{m} \sum_{j=1}^{d} \left| \frac{\partial h_i}{\partial x_j}(x) \right|^2,   (3)

where Θ = {W, b_h}, D = {x^{(i)}}_{i=1,...,n} corresponds to a set of n training samples x ∈ R^d, and λ is a hyper-parameter controlling the level of contraction of h. A notable difference between CAEs and PCA is that the features extracted by CAEs are non-linear w.r.t. the inputs, so that multiple layers of CAEs can be usefully composed (stacked), whereas stacking linear PCAs is pointless.
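As a concrete illustration of Eq. (1)-(3), the following is a minimal numpy sketch of the CAE criterion (the paper's experiments used a Theano implementation trained by stochastic gradient descent; the data, sizes and λ below are toy placeholders, and the function name cae_cost is hypothetical). For the sigmoid encoder, ∂h_i/∂x_j = h_i(1 − h_i) W_ij, so the contraction penalty factorizes per hidden unit:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cae_cost(X, W, b_h, lam):
        """X: (n, d) training samples; W: (m, d); b_h: (m,); lam: contraction weight."""
        H = sigmoid(X.dot(W.T) + b_h)        # encoder h(x), Eq. (1)
        R = H.dot(W)                         # tied linear decoder g(h(x)), Eq. (2)
        recon = np.sum((X - R) ** 2)         # squared reconstruction error
        # dh_i/dx_j = h_i (1 - h_i) W_ij for a sigmoid encoder, so the squared
        # Frobenius norm of the Jacobian factorizes per hidden unit.
        contraction = np.sum((H * (1 - H)) ** 2 @ np.sum(W ** 2, axis=1))
        return recon + lam * contraction     # J_CAE, Eq. (3)

    rng = np.random.RandomState(0)
    X = rng.rand(10, 50)                     # toy data in place of OB features
    W = 0.01 * rng.randn(20, 50)
    b_h = np.zeros(20)
    print(cae_cost(X, W, b_h, lam=0.1))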

4 Extracting Better Features with Advanced Combination Strategies

In this work, we study two different sub-structures of OB. We consider the pose response, defined as the output of only one part-based filter at all positions and scales, and the object response, which is the concatenation of all pose responses associated with an object. The combination strategies are depicted in Figure 2.

4.1 Simplistic Strategies: Mean and Max Pooling

The idea of pooling responses at different locations or poses has been successfully used in Convolutional Neural Networks such as LeNet-5 (LeCun et al., 1999) and in other visual processing architectures inspired by the visual cortex (Serre et al., 2005).

Here, we pool the 252 responses of each object detector into one component (using the mean or max operator), leading to a representation of size 177 = 44,604 / 252. This corresponds to the mean/max over the object responses at different scales and locations. One may view the object max responses as features encoding the absence/presence of objects while discarding all the information about the detectors' positions.
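A minimal sketch of this pooling, assuming (for illustration only) that the 44,604-dimensional OB vector is laid out with each object's 252 responses contiguous:

    import numpy as np

    def pool_per_object(x, n_objects=177, per_object=252, op=np.mean):
        """x: raw OB vector of dimension 44,604; returns a 177-dimensional vector."""
        return op(x.reshape(n_objects, per_object), axis=1)

    x = np.random.rand(44604)
    mean_feat = pool_per_object(x, op=np.mean)   # object-MEAN
    max_feat = pool_per_object(x, op=np.max)     # object-MAX
    print(mean_feat.shape, max_feat.shape)       # (177,) (177,)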

4.2 Combination Strategies with PCA

PCA is a standard method for extracting features from high-dimensional input, so it is a good starting point. However, as we find in our experiments, exploiting the particular structure of the data, e.g., according to poses, scales, and locations, can yield improved results.

Whole PCA. An ordinary PCA is trained on the raw output of OB (x ∈ R^44,604) without looking for any structure. Given the high dimensionality of OB's representation, we used the Randomized PCA algorithm of the scikits toolbox [1].

Pose-PCA. Each of the two poses associated with each object detector is considered independently. This results in 354 = 2 × 177 different PCAs, which are trained on pose outputs (x ∈ R^126) – see Figure 2.

Object-PCA. Each object response (x ∈ R^252) is considered separately; therefore 177 PCAs are trained in total. This allows the model to capture variations among all pose responses at various scales and positions – see Figure 2.

Note that, in all cases, whitening the PCA (i.e., dividing each eigenvector's response by the square root of the corresponding eigenvalue) performs very poorly. For post-processing, the PCA outputs x are always normalized: x ← (x − µ)/σ according to the mean µ and the deviation σ of the whole, per-object or per-pose PCA outputs. This ensures that the contributions from all objects or poses lie in the same range. The number of components has in all cases been selected according to the classification accuracy estimated by 5-fold cross-validation.
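The sketch below illustrates the pose-PCA strategy (object-PCA is the same loop over 252-dimensional blocks). It assumes a contiguous block layout, substitutes scikit-learn's PCA for the randomized PCA of the scikits toolbox, and uses a simple per-dimension standardization as a stand-in for the normalization described above; the data and helper names are illustrative:

    import numpy as np
    from sklearn.decomposition import PCA

    def fit_pose_pcas(X, n_poses=354, pose_dim=126, n_components=5):
        """X: (n_images, 44604) raw OB matrix. Returns one fitted PCA per pose."""
        blocks = X.reshape(X.shape[0], n_poses, pose_dim)
        return [PCA(n_components=n_components).fit(blocks[:, p, :])
                for p in range(n_poses)]

    def transform_pose_pcas(X, pcas, pose_dim=126):
        blocks = X.reshape(X.shape[0], len(pcas), pose_dim)
        Z = np.hstack([pca.transform(blocks[:, p, :]) for p, pca in enumerate(pcas)])
        return (Z - Z.mean(axis=0)) / (Z.std(axis=0) + 1e-8)   # standardize outputs

    X = np.random.rand(20, 44604)            # toy stand-in for OB features
    Z = transform_pose_pcas(X, fit_pose_pcas(X))
    print(Z.shape)                           # (20, 1770) with 5 components per pose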

4.3 Improving upon PCA with CAE

Due to hardware limitations and the high-dimensional input, we could not train a CAE on the whole OB output (“whole CAE”).

[1] Available from http://scikits.appspot.com/


Figure 2: Different combination strategies. (a) and (b): 354 pose-PCAs or 177 object-PCAs applied to the raw OB output (grouped by object and pose), each followed by an SVM. (c) High-level CAE: pose-PCA as a dimensionality reduction technique in the first layer and a CAE stacked on top, followed by an SVM. We call it high-level because it can learn context information, i.e., plausible joint appearances of different objects.

However, we address this problem with the sequential feature extraction steps below.

To overcome the tractability problem that prevents a CAE from being trained on the whole OB output, we pre-process it by using the pose-PCAs as a dimensionality reduction method. We keep only the first 5 components of each pose. Given this low-dimensional representation (of dimension 1,770), we are able to train a CAE – see Figure 2. The CAE has a global view of all object detectors and can thus learn to capture context information, defined by the joint appearance of combinations of various objects. Moreover, instead of using an SVM on top of the learned representations, we can use a Multi-Layer Perceptron whose weights are initialized by those of this CAE. This is the setting in which the CAE has been shown to perform best in practice (Rifai et al., 2011a).

5 Experiments

5.1 Datasets

We evaluate our approach on 3 scene datasets: cluttered indoor images (MIT Indoor Scene), natural scenes (15-Scenes), and event/activity images (UIUC-Sports). Images from a large-scale scene recognition dataset (the SUN-397 database) have also been used for unsupervised learning.

• MIT Indoor is composed of 67 categories and, following (Li-Jia Li and Fei-Fei, 2010a; Quattoni and Torralba, 2009), we used 80 images from each category for training and 20 for testing.

• 15-Scenes is a dataset of 15 natural scene classes. Following (Lazebnik et al., 2006), we used 100 images per class for training and the rest for testing.

• UIUC-Sports contains 8 event classes. We randomly chose 70 / 60 images for our training / test set respectively, following the setting of (Li-Jia Li and Fei-Fei, 2010a; Li and Fei-Fei, 2007).

• SUN-397 contains a full variety of 397 well-sampled scene categories (100 samples per class), composed of 108,754 images in total.

5.2 Tasks

We consider 3 different tasks to evaluate and compare the considered combination strategies. In particular, various supervision settings for learning the CAE are explored. Indeed, a great advantage of this kind of method is that it can make use of vast quantities of unlabeled examples to improve its representations. We illustrate this by proposing experiments in which the CAE has been trained in a supervised or semi-supervised way, as well as in a transfer context.

MIT Indoor (plain). Only the official training set of the MIT Indoor scene dataset (5,360 images) is used for unsupervised feature learning. Each representation is evaluated by training a linear SVM on top of the learned features.

MIT+SUN (semi-supervised). This task, like the previous one, uses the official train/test split of the MIT Indoor scene dataset for its supervised training and evaluation of scene categorization performance. For the initial unsupervised feature extraction, however, we augmented the MIT Indoor training set with the whole dataset of images from SUN-397 (108,754 images). This yields a total of 123,034 images for unsupervised feature learning and corresponds to a semi-supervised setting. Our motivation for adding scene images from SUN, besides increasing the number of training samples, is that on MIT Indoor, which contains only indoor scenes, OB detectors specialized on


Method                          | MIT (plain) | MIT+SUN (semi-supervised)
object-MAX + SVM                | 24.3%       | –
object-MEAN + SVM               | 41.0%       | –
whole-PCA + SVM                 | 40.2%       | –
object-PCA + SVM                | 42.6%       | 46.1%
pose-PCA + SVM                  | 40.1%       | 46.0%
pose-PCA + MLP                  | 42.9%       | 46.3%
pose-PCA + CAE (MLP)            | 44.0%       | 49.1%
Object Bank + SVM               | 37.6%       | –
Object Bank + rbf-SVM           | 37.7%       | –
DPM + Gist + SP                 | 43.1%       | –
Improvement w.r.t. Object Bank  | +6.4%       | +11.5%

Table 1: MIT Indoor. Results are reported on the official split (Quattoni and Torralba, 2009) for all combination strategies described in Section 4. Only the unsupervised feature learning strategies (PCA- and CAE-based) can benefit from the addition of unlabeled scenes from SUN. Object Bank + SVM refers to the original system (Li-Jia Li and Fei-Fei, 2010a) and DPM + Gist + SP (Pandey and Lazebnik, 2011) corresponds to the state-of-the-art method on MIT Indoor.

outdoor objects would likely be mostly inactive (e.g., a sailboat detector applied to indoor scenes) and irrelevant, introducing harmful noise into the unsupervised feature learning. As SUN is composed of a wide range of indoor and outdoor scene images, its addition to MIT Indoor ensures that each detector meaningfully covers its whole range of activity (having a "balanced" number of positive/negative detections throughout the training set) so that the feature extraction methods can be efficiently trained to capture it.

One may object that training on additional images does not provide a fair comparison w.r.t. the original OB method. Nevertheless, we recall that (1) the supervised classifiers do not benefit from these additional examples and (2) the object detectors, which are the core of OB representations (and of all detector-based approaches), have obviously also been trained on additional images.

UIUC-Sports and 15-Scenes (transfer). We would also like to evaluate the discriminative power of the various representations learned on the MIT+SUN dataset, but on new scene images and categories that were not part of the MIT+SUN dataset. This might be useful in case other researchers would like to use our compact representation on a different set of images. Using the representation output by the feature extractors learned with MIT+SUN, we train and evaluate classifiers for scene categorization on images from UIUC-Sports and 15-Scenes (not used during unsupervised training). This corresponds to a transfer learning setting for the feature extractors.

5.3 SVMs on Features Learned with each Strategy

In order to evaluate the quality of the features generated by each strategy, a linear SVM is trained on the features extracted by each combination method. We used LibLinear (Fan et al., 2008) as the SVM solver and chose the best C according to a 5-fold cross-validation scheme. We compare the accuracies obtained with features provided by all considered combination methods against the original OB performance (Li-Jia Li and Fei-Fei, 2010a). Results obtained with SVM classifiers on all MIT-related tasks are displayed in Table 1 and those concerning UIUC-Sports and 15-Scenes in Table 2.
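A sketch of this evaluation protocol with scikit-learn's LIBLINEAR-backed LinearSVC and C selected by 5-fold cross-validation (the features and labels below are random placeholders for the combined OB features and scene labels):

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import LinearSVC

    rng = np.random.RandomState(0)
    Z = rng.rand(160, 1770)                  # e.g. pose-PCA features
    y = np.repeat(np.arange(8), 20)          # e.g. 8 event classes
    # Linear SVM with C chosen by 5-fold cross-validation.
    grid = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
    grid.fit(Z, y)
    print(grid.best_params_, grid.best_score_)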

The simplistic object mean-pooling strategy performs surprisingly well on all datasets and tasks, whereas object max-pooling obtains the worst results. This suggests that taking the mean response of an object detector across various scales and positions is actually more meaningful than considering the presence/absence of objects, as max-pooling does.

On MIT and MIT+SUN, object and pose PCAs reach almost the same range of performance, slightly above the current state-of-the-art performance (Pandey and Lazebnik, 2011), except for whole-PCA, which performs poorly: one must consider the structure of OB to combine features efficiently. In the experiments, keeping the first 10 (resp. 15) principal components gave us the best results for pose-PCA (resp. object-PCA).

Moreover, Table 3 shows that both the PCAs and the PCA+CAE allow a huge reduction of the dimension of the OB feature representation.

Results obtained for the UIUC-Sports and 15-Scenes transfer learning tasks are displayed in Table 2.


Method                 | UIUC-Sports     | 15-Scenes
object-MAX + SVM       | 67.23 ± 1.29%   | 71.08 ± 0.57%
object-MEAN + SVM      | 81.88 ± 1.16%   | 83.17 ± 0.53%
object-PCA + SVM       | 83.90 ± 1.67%   | 85.58 ± 0.48%
pose-PCA + SVM         | 83.81 ± 2.22%   | 85.69 ± 0.39%
pose-PCA + MLP         | 84.29 ± 2.23%   | 84.93 ± 0.39%
pose-PCA + CAE (MLP)   | 85.13 ± 1.07%   | 86.44 ± 0.21%
Object Bank + SVM      | 78.90%          | 80.98%
Object Bank + rbf-SVM  | 78.56 ± 1.50%   | 83.71 ± 0.64%
Improvement w.r.t. OB  | +6.23%          | +5.46%

Table 2: UIUC-Sports and 15-Scenes. Results are reported over 10 random splits (available at www.anonymous.org) and compared to the original OB results (Li-Jia Li and Fei-Fei, 2010a) – Object Bank + SVM – on a single split.

Representations learned on MIT+SUN generalize quite well and can easily be used for other datasets, even though images from those datasets were not seen at all during unsupervised learning.

5.4 Deep Learning with Fine Tuning

Previous work on Deep Learning (Larochelle et al., 2009) generally showed that the features learned through unsupervised learning can be improved by fine-tuning them through a supervised training stage. In this stage (which follows the unsupervised pre-training stage), the features and the classifier on top of them are together considered to be a supervised neural network, a Multi-Layer Perceptron (MLP) whose hidden layer is the output of the trained features. Hence, we apply this strategy to the pose-PCA+CAE architecture, keeping the PCA transformation fixed but fine-tuning the CAE and the MLP together. These results are given at the bottom of Tables 1 and 2. The MLPs are trained with early stopping on a validation set (taken from the original training set) for 50 epochs.
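As a rough stand-in for this supervised stage, the sketch below trains an MLP with 1,000 hidden units and early stopping on a held-out validation split using scikit-learn; the paper's actual model instead initializes the hidden layer with the pre-trained CAE weights and fine-tunes it in Theano, which is not what MLPClassifier does. The data below are toy placeholders:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.RandomState(0)
    Z = rng.rand(200, 1770)                  # toy stand-in for pose-PCA features
    y = np.repeat(np.arange(8), 25)          # toy scene labels
    # MLP with 1,000 hidden units, early stopping on a validation split,
    # capped at 50 epochs (mirroring the training budget described above).
    mlp = MLPClassifier(hidden_layer_sizes=(1000,), early_stopping=True,
                        validation_fraction=0.1, max_iter=50, random_state=0)
    mlp.fit(Z, y)
    print(mlp.score(Z, y))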

This yields 44.0% test accuracy on plain MIT and 49.1% on MIT+SUN: we obtain state-of-the-art performance with or without semi-supervised training of the CAEs, although these additional examples are highly beneficial. As a check, we also evaluate the effect of the unsupervised pre-training stage by completely skipping it and only training a regular supervised MLP of 1,000 hidden units on top of the PCA output, yielding worse test accuracies of 42.9% on MIT and 46.3% on MIT+SUN. This improvement with fine-tuning on labeled data is a great advantage of the CAE compared to PCA. Fine-tuning is also beneficial on UIUC-Sports and 15-Scenes. On these two datasets, it leads to improvements of +6% and +5% w.r.t. the original system.

Finally, we trained a non-linear SVM (with an rbf kernel) to verify whether this gap in performance was simply due to the replacement of a linear classifier (SVM) by a non-linear one (MLP), or to the combination of the detectors' outputs. The poor results of the rbf-SVM (see Tables 1 and 2) suggest that the careful combination strategies are essential to reach good performance.

6 Discussion

In this work, we add one or more levels of trained representations on top of the layer of object and part detectors (OB features) that has constituted the basis of a very promising line of approaches for scene classification (Li-Jia Li and Fei-Fei, 2010a). These higher-level representations are mostly trained in an unsupervised way, following the trend of so-called Deep Learning (Hinton et al., 2006; Bengio, 2009; Jarrett et al., 2009), but can be fine-tuned using the supervised classification objective.

These learned representations capture statistical dependencies in the co-occurrence of detections from the object detectors of (Li-Jia Li and Fei-Fei, 2010a). Indeed, Table 4 shows plausible contexts of joint appearance of several objects learned by the CAE. These detectors, which can be quite imperfect when seen as actual detectors, contain a lot of information when combined. Extracting these context semantics with unsupervised feature-learning algorithms empirically yields better performance.

In particular, we find that Contractive Auto-Encoders (Rifai et al., 2011b; Rifai et al., 2011a) can substantially improve performance on top of the pose-PCAs as a way to extract non-linear dependencies between these lower-level OB detectors (especially when fine-tuned). They also improve greatly upon the use of the detectors as inputs to an SVM or a logistic regression (which were, with structured regularization, the original methods used by OB).


Object-Bank | Pooling | whole-PCA | object-PCA | pose-PCA | pose-PCA+CAE
44,604      | 177     | 1,300     | 2,655     | 1,770    | 1,000

Table 3: Dimensionality reduction. Dimensions of the representations obtained on MIT Indoor. The pose-PCA+CAE produces a compact and powerful combination.

Context semantics learned by the CAE
sailboat, rock, tree, coral, blind
roller coaster, building, rail, keyboard, bridge
sailboat, autobus, bus stop, truck, ship
curtain, bookshelf, door, closet, rack
soil, seashore, rock, mountain, duck
attire, horse, bride, groom, bouquet
bookshelf, curtain, faucet, screen, cabinet
desktop computer, printer, wireless, computer screen

Table 4: Context semantics: names of the detectors corresponding to the highest weights of 8 hidden units of the CAE. Each of these hidden units fires when the corresponding objects are detected together.


This trained post-processing allows us to reach the state of the art on MIT Indoor and UIUC-Sports (85.13% against 85.30% obtained by LScSPM (Gao et al., 2010)), while being competitive on 15-Scenes (86.44% versus 89.70% for LScSPM). On these last two datasets, we reach the best performance among methods relying only on object/part detectors. Compared to other kinds of methods, we are limited by the accuracy of those detectors (trained only on HOG features), whereas competing methods can make use of other descriptors such as SIFT (Gao et al., 2010), known to achieve excellent performance in image recognition.

Besides its good accuracy, it is worth noting that the feature representation obtained by the pose-PCA+CAE is also very compact, allowing a 97% reduction compared to the original data (see Table 3). Handling a dense input of dimension 44,604 is not common. By providing this compact representation, we think that researchers will be able to use the rich information provided by OB in the same way they use low-level image descriptors such as SIFT.

As future work, we are planning other ways of combining OB features, e.g., considering the output of all detectors at a given scale and position and combining them afterwards in a hierarchical manner. This would be a kind of dual view of the OB features. Other plausible departures could take into account the topology (e.g., the spatial structure) of the pattern of detections, rather than treating the response at each location and scale as an attribute and the set of attributes as unordered. This could be done in the same spirit as in Convolutional Networks (LeCun et al., 1999), aggregating the responses for various object detectors/locations/scales in a way that explicitly takes into account the object category, location and scale of each response, similarly to the way filter outputs at neighboring locations are pooled in each layer of a Convolutional Network.

Acknowledgments: We would like to thank Gloria Zen for her helpful comments. This work was supported by NSERC, CIFAR, the Canada Research Chairs, Compute Canada and by the French ANR Project ASAP ANR-09-EMER-001. The code for the experiments was implemented using the Theano (Bergstra et al., 2010) machine learning library.

REFERENCES

Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2:53–58.
Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127. Also published as a book, Now Publishers, 2009.
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In Adv. Neural Inf. Proc. Sys. 19, pages 153–160.
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral presentation.
Bosch, A., Zisserman, A., and Munoz, X. (2006). Scene classification via pLSA. In Proc. ECCV, pages 517–530.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR 2009.
Espinace, P., Kollar, T., Soto, A., and Roy, N. (2010). Indoor scene recognition through object detection. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Anchorage, AK.
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. (2008). LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871–1874.
Farhadi, A., Endres, I., Hoiem, D., and Forsyth, D. (2009). Describing objects by their attributes. IEEE Conference on Computer Vision and Pattern Recognition, pages 1778–1785.
Fei-Fei, L. and Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Volume 2, pages 524–531. IEEE Computer Society.
Felzenszwalb, P., McAllester, D., and Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. CVPR.
Gao, S., Tsang, I., Chia, L., and Zhao, P. (2010). Local features are not lonely: Laplacian sparse coding for image classification. IEEE Conference on Computer Vision and Pattern Recognition.
Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep networks. In NIPS'09, pages 646–654.
Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554.
Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn., 42:177–196.
Hoiem, D., Efros, A., and Hebert, M. (2005). Automatic photo pop-up. SIGGRAPH, 24(3):577–584.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417–441, 498–520.
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV'09), pages 2146–2153. IEEE.
Kavukcuoglu, K., Ranzato, M., Fergus, R., and LeCun, Y. (2009). Learning invariant features through topographic filter maps. In Proc. CVPR'09, pages 1605–1612. IEEE.
Larochelle, H., Bengio, Y., Louradour, J., and Lamblin, P. (2009). Exploring strategies for training deep neural networks. JMLR, 10:1–40.
Lazebnik, S., Schmid, C., and Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. IEEE Conference on Computer Vision and Pattern Recognition.
LeCun, Y., Haffner, P., Bottou, L., and Bengio, Y. (1999). Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision, pages 319–345. Springer.
Li, L.-J. and Fei-Fei, L. (2007). What, where and who? Classifying events by scene and object recognition. ICCV.
Li-Jia Li, Hao Su, E. P. X. and Fei-Fei, L. (2010a). Object bank: A high-level image representation for scene classification and semantic feature sparsification. Proceedings of the Neural Information Processing Systems (NIPS).
Li-Jia Li, Hao Su, Y. L. and Fei-Fei, L. (2010b). Objects as attributes for scene classification. In European Conference of Computer Vision (ECCV), International Workshop on Parts and Attributes, Crete, Greece.
Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E., Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra, J. (2012). Unsupervised and transfer learning challenge: a deep learning approach. In Guyon, I., Dror, G., Lemaire, V., Taylor, G., and Silver, D., editors, JMLR W&CP: Proceedings of the Unsupervised and Transfer Learning challenge and workshop, volume 27, pages 97–110.
Oliva, A. and Torralba, A. (2006). Building the gist of a scene: The role of global image features in recognition. Visual Perception, Progress in Brain Research, 155.
Pandey, M. and Lazebnik, S. (2011). Scene recognition and weakly supervised object localization with deformable part-based models. ICCV.
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(6):559–572.
Quattoni, A. and Torralba, A. (2009). Recognizing indoor scenes. CVPR.
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. In NIPS'06.
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X. (2011a). Higher order contractive auto-encoder. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD).
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011b). Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the Twenty-eighth International Conference on Machine Learning (ICML'11).
Russell, B. C., Torralba, A., Murphy, K. P., and Freeman, W. T. (2008). LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vision, 77:157–173.
Serre, T., Wolf, L., and Poggio, T. (2005). Object recognition with features inspired by visual cortex. IEEE Conference on Computer Vision and Pattern Recognition.
Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A., and Jain, R. (2000). Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell., 22:1349–1380.
Torralba, A. (2003). Contextual priming for object detection. International Journal of Computer Vision, 53(2):169–191.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Cohen, W. W., McCallum, A., and Roweis, S. T., editors, ICML'08, pages 1096–1103. ACM.
Vogel, J. and Schiele, B. (2004). Natural scene retrieval based on a semantic modeling step. In Proceedings of the International Conference on Image and Video Retrieval (CIVR 2004), Dublin, Ireland, LNCS, volume 3115.
Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., and Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3485–3492. IEEE.