ICCV & CVPR paper reading

池晨 (Chi Chen) @jdl.ac.cn, 2009.11.27


TRANSCRIPT

Page 1: ICCV  &  CVPR  paper reading

ICCV & CVPR paper reading

池晨 (Chi Chen) @jdl.ac.cn, 2009.11.27

Page 2: ICCV  &  CVPR  paper reading

CVPR09, #2128: Recognizing Indoor Scenes

Page 3: ICCV  &  CVPR  paper reading

Recognizing Indoor Scenes
Ariadna Quattoni & Antonio Torralba

Ariadna Quattoni, Ph.D. student, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)

• A. Quattoni, X. Carreras, M. Collins, T. Darrell. An Efficient Projection for L1,Infinity Regularization. ICML 2009.
• A. Quattoni, A. Torralba. Recognizing Indoor Scenes. CVPR 2009.
• A. Quattoni, M. Collins, T. Darrell. Transfer Learning for Image Classification with Sparse Prototype Representations. CVPR 2008.
• A. Quattoni, M. Collins, T. Darrell. Learning Visual Representations using Images with Captions. CVPR 2007.
• A. Quattoni, S. Wang, L.P. Morency, M. Collins, T. Darrell. Hidden-state Conditional Random Fields. IEEE PAMI, 2007.

Page 4: ICCV  &  CVPR  paper reading

Recognizing Indoor Scenes
Ariadna Quattoni & Antonio Torralba

Ariadna Quattoni, Ph.D. student, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)

• L.P. Morency, A. Quattoni, T. Darrell. Latent-Dynamic Discriminative Models for Continuous Gesture Recognition. CVPR 2007.
• S. Wang, A. Quattoni, L.P. Morency, D. Demirdjian, T. Darrell. Hidden Conditional Random Fields for Gesture Recognition. CVPR 2006.
• A. Quattoni, M. Collins, T. Darrell. Incorporating Semantic Constraints into a Discriminative Categorization and Labeling Model. Workshop on Semantic Knowledge in Vision, ICCV, 2005.
• A. Quattoni, M. Collins, T. Darrell. Conditional Random Fields for Object Recognition. In Proceedings of NIPS, 2004.

Page 5: ICCV  &  CVPR  paper reading

Recognizing Indoor Scenes
Ariadna Quattoni & Antonio Torralba

Antonio Torralba, Associate Professor, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)

Research Interests:
• Computer vision
• Machine learning
• Human visual perception
• Scene and object recognition

Page 6: ICCV  &  CVPR  paper reading

Recognizing Indoor Scenes
Ariadna Quattoni & Antonio Torralba

Antonio Torralba, Associate Professor, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)

• LabelMe: online image annotation and applications. A. Torralba, B. C. Russell, J. Yuen. MIT CSAIL Technical Report, 2009.
• How many pixels make an image? A. Torralba. Visual Neuroscience, vol. 26, issue 01, pp. 123-131, 2009.
• Small codes and large databases for recognition. A. Torralba, R. Fergus, Y. Weiss. CVPR, 2008.
• 80 million tiny images: a large dataset for non-parametric object and scene recognition. A. Torralba, R. Fergus, W. T. Freeman. IEEE Transactions on PAMI, vol. 30(11), pp. 1958-1970, 2008.
• Sharing visual features for multiclass and multiview object detection. A. Torralba, K. P. Murphy, W. T. Freeman. PAMI, 2007.

Page 7: ICCV  &  CVPR  paper reading

Motivating question: most scene recognition models that work well for outdoor scenes perform poorly in the indoor domain.

Fig. 1. Comparison of spatial SIFT and Gist features for a scene recognition task. Both sets of features show strongly correlated performance across the 15 scene categories. Average performance of the different features: Gist 73.0%, pyramid matching 73.4%, bag of words 64.1%, and color pixels (SSD) 30.6%. In all cases an SVM is used.

Page 8: ICCV  &  CVPR  paper reading

Abstract

• Indoor scene recognition is a challenging open problem.
• Should indoor scenes be recognized by their global spatial properties or by the objects they contain?
• A prototype-based model that can successfully combine both sources of information.
• A dataset of 67 indoor scene categories.
• Good results.

Page 9: ICCV  &  CVPR  paper reading

What is a ‘prototype-based model’?

Page 10: ICCV  &  CVPR  paper reading

Prototype Image

Prototype image?

Page 11: ICCV  &  CVPR  paper reading

ROI (Regions of Interest)

Page 12: ICCV  &  CVPR  paper reading

A Prototype-Based Model

[Diagram: a prototype image T with ROIs 1, 2, ..., m_k, linking global spatial properties and contained objects.]

For each scene category, a set of p prototype images is selected:

$$S = \{T_1, T_2, \ldots, T_p\}$$

For each prototype $T_k$, a set of $m_k$ ROIs is annotated:

$$T_k = \{t_{k1}, t_{k2}, \ldots, t_{km_k}\}$$

The prototype as a whole carries the global spatial properties; its ROIs carry the contained objects.

Page 13: ICCV  &  CVPR  paper reading

How does it work?

Page 14: ICCV  &  CVPR  paper reading

Image Descriptor

[Diagram repeated from Page 12: prototype image T with its ROIs, capturing global spatial properties and contained objects.]

How to represent global spatial properties? Using the Gist descriptor.

How to represent each ROI? Using a spatial pyramid of visual words.

Page 15: ICCV  &  CVPR  paper reading

Gist (1/2)

The original image is decomposed by a bank of multiscale oriented filters (several scales and orientations). The magnitude of each filter output is taken, and the local average response is computed over 4x4 windows. The sampled filter outputs are concatenated and reduced with PCA to form the Gist feature.
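To make the pipeline concrete, here is a minimal Python sketch, assuming Gabor filters as the multiscale oriented filter bank (via scikit-image) and a grayscale input; the original Gist implementation's exact filters, scales, and normalization may differ.

```python
import numpy as np
from skimage.filters import gabor
from skimage.transform import resize

def gist_like(image, frequencies=(0.1, 0.2, 0.4), n_orient=4, grid=4):
    """Gist-like descriptor: decompose the image with multiscale oriented
    filters, take the magnitude of each filter output, and average the
    local responses over a grid x grid (here 4x4) partition of the image.
    `image` is a 2D grayscale array; frequencies play the role of scales."""
    image = image.astype(float)
    features = []
    for freq in frequencies:              # one frequency per scale
        for k in range(n_orient):         # evenly spaced orientations
            theta = k * np.pi / n_orient
            real, imag = gabor(image, frequency=freq, theta=theta)
            magnitude = np.hypot(real, imag)
            # downsampling to grid x grid approximates averaging the
            # response over 4x4 local windows
            pooled = resize(magnitude, (grid, grid), anti_aliasing=True)
            features.append(pooled.ravel())
    return np.concatenate(features)       # 3 scales * 4 orients * 16 = 192-d
```

A PCA step (e.g. sklearn.decomposition.PCA) would then reduce this vector, matching the final box of the pipeline.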

Page 16: ICCV  &  CVPR  paper reading

Gist (2/2)

Top row: original images. Bottom row: noise images coerced to have the same global features (N=64) as the target image.

The Gist feature coarsely encodes the edge and texture information of the original image.

Page 17: ICCV  &  CVPR  paper reading

Image Descriptor

(Slide repeated from Page 14: Gist for global spatial properties; a spatial pyramid of visual words for each ROI.)

Page 18: ICCV  &  CVPR  paper reading

ROI Descriptor: a spatial pyramid of visual words. The visual words are obtained by vector-quantizing SIFT descriptors, applying K-means to a random subset of images.

The color of each pixel represents the visual word to which it was assigned.
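A short sketch of how such a vocabulary could be built with scikit-learn; the SIFT extraction itself is assumed to have been done elsewhere (e.g. with OpenCV), yielding one (n_i, 128) descriptor array per image.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_sets, n_words=200, seed=0):
    """Vector-quantize SIFT descriptors into visual words: stack the
    descriptors of a random subset of images and run K-means on them."""
    stacked = np.vstack(descriptor_sets)
    return KMeans(n_clusters=n_words, n_init=10, random_state=seed).fit(stacked)

def word_histogram(kmeans, descriptors, n_words=200):
    """Assign each descriptor to its nearest visual word and build a
    normalized word histogram; a spatial pyramid would repeat this
    per spatial cell at several resolutions."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=n_words).astype(float)
    return hist / max(hist.sum(), 1.0)
```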

Page 19: ICCV  &  CVPR  paper reading

Image Descriptor

(Slide repeated from Page 14: Gist for global spatial properties; a spatial pyramid of visual words for each ROI.)

Page 20: ICCV  &  CVPR  paper reading

Model Formulation

Given:
• A training set of n labeled images: $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
• A set of p segmented images, called prototypes: $S = \{T_1, T_2, \ldots, T_p\}$

Goal: use D and S to learn a mapping $h : X \to \mathbb{R}$.

Page 21: ICCV  &  CVPR  paper reading

Model Formulation: contained object information

The mapping should capture the fact that images containing similar objects should have similar scene labels, and that some objects are more important than others in defining a scene's identity.

$$f_{kj}(x) = \min_{s} \, d(t_{kj}, x_s)$$

where $t_{kj}$ is the jth ROI of the kth prototype image, and the minimum is taken over the segments $x_s$ of image x, so $f_{kj}(x)$ is the distance from $t_{kj}$ to the most similar segment in x.

Distances between two regions are computed using histogram intersection.

Page 22: ICCV  &  CVPR  paper reading

Searching Strategy

Given a new image, how do we find ROIs similar to the ROIs of a given prototype image T?

Search within a small window around the ROI's original location in the prototype image T.

Histogram intersection function:

$$\cap(H_x, H_{kj}) = \sum_{i=1}^{D} \min\big(H_x(i), H_{kj}(i)\big)$$

where $H_x$ and $H_{kj}$ are D-bin histograms of a candidate region of x and of ROI $t_{kj}$.
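In code, the matching step reduces to a few lines; this sketch assumes the histograms of the candidate windows around the ROI's original location have already been computed.

```python
import numpy as np

def histogram_intersection(h_a, h_b):
    """Similarity of two region histograms: sum_i min(H_a(i), H_b(i))."""
    return float(np.minimum(h_a, h_b).sum())

def best_match(roi_hist, candidate_hists):
    """Return the index and score of the candidate window (near the ROI's
    original location in the prototype) most similar to the ROI."""
    scores = [histogram_intersection(roi_hist, h) for h in candidate_hists]
    best = int(np.argmax(scores))
    return best, scores[best]
```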

Page 23: ICCV  &  CVPR  paper reading

Searching Strategy

Figure 5. Example of detection of similar image patches. The top three images correspond to the query patterns. For each image, the algorithm tries to detect the region selected on the query image. The next three rows show the top three matches for each region. The last row shows the three worst-matching regions.

Page 24: ICCV  &  CVPR  paper reading

Model Formulation: global spatial information

For some scene categories, global image information can be very important.

$$g_k(x) = \big\| \mathrm{Gist}(x) - \mathrm{Gist}(T_k) \big\|_2$$

Global information is computed as the L2 norm between the Gist representation of image x and the Gist representation of prototype k.

Page 25: ICCV  &  CVPR  paper reading

Model Formulation

$$h(x) = \sum_{k=1}^{p} \lambda_k \exp\Big( -\sum_{j=1}^{m_k} \beta_{kj}\, f_{kj}(x) - \beta_{kG}\, g_k(x) \Big)$$

The first term in the exponent carries the contained object information; the second carries the global spatial information.

Parameters:
• $\beta_{kj}$ captures the importance of a particular ROI inside a given prototype.
• $\lambda_k$ captures how relevant the similarity to prototype k is for predicting the scene label.
• $\beta_{kG}$ captures the importance of global features when considering the kth prototype.
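A small Python sketch of evaluating this classifier on one image, assuming the f_kj and g_k values have been precomputed and, for simplicity, that every prototype has the same number of ROIs (the model allows m_k to vary per prototype).

```python
import numpy as np

def h(f, g, lam, beta, beta_g):
    """Evaluate the prototype-based classifier reconstructed above.
    f:      (p, m) array, f[k, j] = f_kj(x), distance of x to ROI j of prototype k
    g:      (p,) array,   g[k] = g_k(x), Gist distance of x to prototype k
    lam:    (p,) array of prototype relevances (lambda_k)
    beta:   (p, m) array of ROI weights (beta_kj)
    beta_g: (p,) array of global-feature weights (beta_kG)
    Assumes every prototype has the same number of ROIs m."""
    exponent = -(beta * f).sum(axis=1) - beta_g * g
    return float(np.sum(lam * np.exp(exponent)))
```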

Page 26: ICCV  &  CVPR  paper reading

Learning

How do we estimate the model parameters from the training set D?

$$L(\lambda, \beta) = \sum_{i=1}^{n} l\big(h(x_i), y_i\big) + C_b \|\beta\|_2^2 + C_l \|\lambda\|_2^2$$

The loss function l measures the error that the classifier incurs on the training examples of $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$; the regularization terms and the constants $C_b$ and $C_l$ dictate the amount of regularization in the model.
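As a sketch, the objective can be written as below, assuming a hinge loss l(h(x), y) = max(0, 1 - y*h(x)) with labels in {-1, +1}; the slide does not name l, so the hinge choice is an assumption, though it is consistent with the set Δ of non-zero-loss examples used on the next slide.

```python
import numpy as np

def regularized_loss(scores, labels, lam, beta, C_b, C_l):
    """L = sum_i l(h(x_i), y_i) + C_b * ||beta||^2 + C_l * ||lambda||^2.
    scores: (n,) array of h(x_i); labels: (n,) array of y_i in {-1, +1}."""
    hinge = np.maximum(0.0, 1.0 - labels * scores)  # l(h(x), y), assumed hinge
    data_term = hinge.sum()
    reg_term = C_b * np.sum(beta ** 2) + C_l * np.sum(lam ** 2)
    return data_term + reg_term
```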

Page 27: ICCV  &  CVPR  paper reading

Learning

Using the training set D, a gradient-based method estimates the model parameters:

$$\frac{\partial L}{\partial \lambda_k} = -\sum_{i \in \Delta} y_i \exp\Big( -\sum_{j=1}^{m_k} \beta_{kj} f_{kj}(x_i) - \beta_{kG}\, g_k(x_i) \Big) + 2 C_l \lambda_k$$

$$\frac{\partial L}{\partial \beta_{kj}} = \sum_{i \in \Delta} y_i \lambda_k f_{kj}(x_i) \exp\Big( -\sum_{j'=1}^{m_k} \beta_{kj'} f_{kj'}(x_i) - \beta_{kG}\, g_k(x_i) \Big) + 2 C_b \beta_{kj}$$

Δ is the set of indices of examples in D that attain non-zero loss.

Page 28: ICCV  &  CVPR  paper reading

Model Formulation

$$h(x) = \sum_{k=1}^{p} \lambda_k \exp\Big( -\sum_{j=1}^{m_k} \beta_{kj}\, f_{kj}(x) - \beta_{kG}\, g_k(x) \Big)$$

The number in parentheses is the classification confidence.

Page 29: ICCV  &  CVPR  paper reading

How is the performance?

Page 30: ICCV  &  CVPR  paper reading

Indoor Database

Compared with the state of the art:

• The largest dataset available: 67 categories, 15620 images.
• More difficult: high in-class variability.

Figure 2. Summary of the 67 indoor scene categories used in our study. To show the variety of scene categories considered, we have organized them into 5 broad scene groups. The database contains 15620 images. All images have a minimum resolution of 200 pixels on the smallest axis.

Page 31: ICCV  &  CVPR  paper reading

Results (1/3)

Four different variations of the model, crossing two choices: manually annotated ROIs vs. automatically segmented ROIs, and local features only vs. both local and global features.

Page 32: ICCV  &  CVPR  paper reading

Results (1/3)

Four different variations of the model.

• Both local and global information are useful for the indoor scene recognition task.
• Using automatic segmentations instead of manual segmentations causes only a small drop in performance.

Page 33: ICCV  &  CVPR  paper reading

Results (2/3)

Figure 7. The 67 indoor categories sorted by multiclass average precision (training with 80 images per class; testing on 20 images per class).

Page 34: ICCV  &  CVPR  paper reading

Results (3/3)

How is the performance of the proposed model affected by the number of prototypes used?

We observed a logarithmic growth of the average precision as a function of the number of prototypes.

Exploiting more prototypes might further improve the performance.

Page 35: ICCV  &  CVPR  paper reading

Conclusion (1/3)

[Diagram repeated from Page 12: prototype image T with its ROIs, global spatial properties, and contained objects.]

Combination of global information and contained object information.

Page 36: ICCV  &  CVPR  paper reading

Conclusion (2/3)

$$h(x) = \sum_{k=1}^{p} \lambda_k \exp\Big( -\sum_{j=1}^{m_k} \beta_{kj}\, f_{kj}(x) - \beta_{kG}\, g_k(x) \Big)$$

The first term in the exponent captures the contained object information; the second captures the global spatial information.

Page 37: ICCV  &  CVPR  paper reading

Conclusion (3/3)

$$h(x) = \sum_{k=1}^{p} \lambda_k \exp\Big( -\sum_{j=1}^{m_k} \beta_{kj}\, f_{kj}(x) - \beta_{kG}\, g_k(x) \Big)$$

Page 38: ICCV  &  CVPR  paper reading

ICCV09: Learning to Predict Where Humans Look

Page 39: ICCV  &  CVPR  paper reading

Learning to Predict Where Humans Look
Tilke Judd, Krista Ehinger, Frédo Durand, Antonio Torralba

Tilke Judd, Ph.D. student, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)

Education Background:
• Massachusetts Institute of Technology, Cambridge, MA: Ph.D. candidate in Computer Science (Graphics), expected graduation June 2010; Master of Science, Computer Science, Jan 2007; Bachelor of Science in Mathematics, June 2003.
• École Polytechnique, Palaiseau, France: International Program, Computer Science major, Sept 2003 to April 2004.
• Cambridge University, Cambridge, England: Junior Year Abroad, read Part IB Mathematics Tripos, Sept 2001 to June 2002.

Page 40: ICCV  &  CVPR  paper reading

Learning to Predict Where Humans Look
Tilke Judd, Krista Ehinger, Frédo Durand, Antonio Torralba

Tilke Judd, Ph.D. student, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)

Research Interests:
• Computer graphics
• Computational photography
• Image processing
• Perception
• Non-photorealistic rendering

Page 41: ICCV  &  CVPR  paper reading

Learning to Predict Where Humans Look
Tilke Judd, Krista Ehinger, Frédo Durand, Antonio Torralba

Tilke Judd, Ph.D. student, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)

• Judd, T., Ehinger, K., Durand, F., Torralba, A. Learning to Predict Where People Look. ICCV 2009.
• Judd, T., Durand, F., Adelson, T. Apparent Ridges for Line Drawing. Proceedings of ACM SIGGRAPH 2007.
• Judd, Tilke. Apparent Ridges for Line Drawing. Master's Thesis, Computer Science, MIT, Jan 2007.
• Ju, W., R. Hurwitz, T. Judd, B. Lee. CounterActive: An Interactive Cookbook for the Kitchen Counter. Proceedings of SIGCHI 2001, Short Papers and Abstracts, Seattle WA, April 2001, p. 269.
• Ju, W., L. Bonanni, R. Fletcher, R. Hurwitz, T. Judd, J. Yoon, E. R. Post, M. Reynolds. Origami Desk. Exhibited at SIGGRAPH 2001, Los Angeles CA. SIGGRAPH Conference Abstracts and Applications, August 2001, p. 280.
• Judd, Tilke. The JPEG Compression Algorithm. The MIT Undergraduate Mathematics Journal, vol. 5, p. 119.

Page 42: ICCV  &  CVPR  paper reading

Learning to Predict Where Humans Look
Tilke Judd, Krista Ehinger, Frédo Durand, Antonio Torralba

Krista Ehinger, graduate student, Department of Brain & Cognitive Sciences at MIT

Education Background:
• University of Edinburgh, Edinburgh, UK: B.Sc. Psychology, 2007.
• California Institute of Technology, Pasadena, CA, USA: B.S. Engineering & Applied Science, 2003.

Page 43: ICCV  &  CVPR  paper reading

Learning to Predict Where Humans Look
Tilke Judd, Krista Ehinger, Frédo Durand, Antonio Torralba

Frédo Durand, Associate Professor, Computer Graphics Group, CSAIL, MIT

Education Background:
• He received his PhD from Grenoble University, France, in 1999.
• From 1999 to 2002, he was a postdoc in the MIT Computer Graphics Group.

Page 44: ICCV  &  CVPR  paper reading

Learning to Predict Where Humans Look
Tilke Judd, Krista Ehinger, Frédo Durand, Antonio Torralba

Frédo Durand, Associate Professor, Computer Graphics Group, CSAIL, MIT

Research Interests:
• Synthetic image generation
• Computational photography

Page 45: ICCV  &  CVPR  paper reading

Learning to Predict Where Humans Look
Tilke Judd, Krista Ehinger, Frédo Durand, Antonio Torralba

Frédo Durand, Associate Professor, Computer Graphics Group, CSAIL, MIT

• Co-organized the first Symposium on Computational Photography and Video in 2005.
• Co-organized the first International Conference on Computational Photography in 2009.
• Served on the advisory board of the Image and Meaning 2 conference.
• Received an inaugural Eurographics Young Researcher Award in 2004.
• Received an NSF CAREER award in 2005.
• Received an inaugural Microsoft Research New Faculty Fellowship in 2005.
• Received a Sloan fellowship in 2006.
• Received a Spira award for distinguished teaching in 2007.

Page 46: ICCV  &  CVPR  paper reading

Motivating question: how can we understand where humans look in a scene without an eye tracker?

Figure 2. Current saliency models do not accurately predict human fixations. In row one, the low-level model selects bright spots of light as salient while viewers look at the human. In row two, the low-level model selects the building's strong edges and windows as salient while viewers fixate on the text.

Page 47: ICCV  &  CVPR  paper reading

Abstract

• For many applications in graphics, design, and human-computer interaction, it is essential to understand where humans look in a scene.
• Models of saliency can be used to predict fixation locations.
• A saliency model based on both top-down and bottom-up information.
• A large eye tracking database.

Page 48: ICCV  &  CVPR  paper reading

Database of Eye Tracking Data

15 viewers
1003 random images
Free viewing
3 seconds per image
Recording the gaze path

Page 49: ICCV  &  CVPR  paper reading

Database of Eye Tracking Data

Collect each viewer's fixations.

Convolve a Gaussian filter across each viewer's fixations, then average all the viewers' data to obtain a continuous saliency map.

Select the top n percent most salient locations to generate a binary map.
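A minimal Python sketch of this ground-truth construction; the Gaussian width sigma and the top-percent threshold are illustrative placeholders, not the paper's values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def saliency_from_fixations(fixations_per_viewer, shape, sigma=20, top_pct=10):
    """Build the ground truth described above: accumulate each viewer's
    fixations on an empty map, blur with a Gaussian, average over viewers
    to get a continuous saliency map, then keep the top n percent of
    locations as a binary map. Fixations are (row, col) integer tuples."""
    blurred = []
    for fixations in fixations_per_viewer:
        m = np.zeros(shape)
        for r, c in fixations:
            m[r, c] += 1.0
        blurred.append(gaussian_filter(m, sigma))
    continuous = np.mean(blurred, axis=0)        # continuous saliency map
    threshold = np.percentile(continuous, 100 - top_pct)
    return continuous, continuous >= threshold   # (continuous, binary) maps
```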

Page 50: ICCV  &  CVPR  paper reading

Analysis of Dataset

• For some images, all viewers fixate on the same locations, while in other images viewers' fixations are dispersed all over the image.
• The fixations in the database have a strong bias towards the center.
• Fixations from the database are often on animals, cars, and human body parts such as eyes and hands.
• There is a characteristic size for a region of interest (ROI) that a person fixates on.

Page 51: ICCV  &  CVPR  paper reading

How to use the analysis above?

Page 52: ICCV  &  CVPR  paper reading

Features Used for Machine Learning

Low-level features:
• Local energy of the steerable pyramid filters [3].
• Features used in the simple saliency model described by Torralba [1] and Rosenholtz [2].
• Orientation and color contrast.
• Values of the red, green, and blue channels, as well as the probabilities of each of these channels, as features [4].
• The probability of each color as computed from 3D color histograms of the image filtered with a median filter at 6 different scales.

Mid-level features:
• The location of the horizon.

Page 53: ICCV  &  CVPR  paper reading

Features Used for Machine Learning

High-level features:
• Running the Viola-Jones face detector [5] and the Felzenszwalb person detector [6].

Center prior:
• The distance from each pixel to the center of the image.
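The center prior is simple enough to write out in code; a sketch (the normalization to [0, 1] is an assumption):

```python
import numpy as np

def center_prior(shape):
    """Center-prior feature map: the distance from each pixel to the image
    center, normalized here so the farthest corners have value 1."""
    rows, cols = shape
    r0, c0 = (rows - 1) / 2.0, (cols - 1) / 2.0
    rr, cc = np.mgrid[0:rows, 0:cols]
    dist = np.hypot(rr - r0, cc - c0)
    return dist / dist.max()
```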

Page 54: ICCV  &  CVPR  paper reading

Features Used for Machine Learning

Fig. 8. Features. A sample image (bottom right) and 33 of the features that we use to train the model.

Page 55: ICCV  &  CVPR  paper reading

How to use the eye data?

Page 56: ICCV  &  CVPR  paper reading

Features Used for Machine Learning

Using the binary saliency map to generate positive and negative labels.

Page 57: ICCV  &  CVPR  paper reading

Features Used for Machine Learning

[Figure: a binary saliency map with positively and negatively labeled pixels marked.]

Page 58: ICCV  &  CVPR  paper reading

How is the performance?

Page 59: ICCV  &  CVPR  paper reading

Training

From the binary saliency map of each training image, 10 positively labeled pixels and 10 negatively labeled pixels are sampled.

903 training images (9030 positive and 9030 negative training samples) and 100 testing images, with a liblinear SVM.
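A sketch of this training setup with scikit-learn, whose LinearSVC wraps liblinear; the per-pixel feature maps and binary ground-truth maps are assumed to be precomputed, with at least 10 pixels in each class.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_saliency_svm(feature_maps, binary_maps, n_per_class=10, seed=0):
    """Sample 10 positively and 10 negatively labeled pixels per image from
    the binary saliency maps and fit a linear SVM on their feature vectors.
    feature_maps[i]: (H, W, n_features); binary_maps[i]: (H, W) boolean."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for feats, binary in zip(feature_maps, binary_maps):
        for coords, label in ((np.argwhere(binary), 1),
                              (np.argwhere(~binary), -1)):
            picks = coords[rng.choice(len(coords), n_per_class, replace=False)]
            for r, c in picks:
                X.append(feats[r, c])
                y.append(label)
    return LinearSVC().fit(np.array(X), np.array(y))
```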

Page 60: ICCV  &  CVPR  paper reading

Testing

Figure 9. Comparison of saliency maps. Each row of images compares the predictions of our SVM saliency model, the Itti saliency map, the center prior, and the human ground truth, all thresholded to show the top 10 percent most salient locations.

Page 61: ICCV  &  CVPR  paper reading

Performance On Testing Images

Figure 10. ROC curves for SVMs trained on each set of features individually and on all features combined. Human performance and chance are plotted for comparison.

Page 62: ICCV  &  CVPR  paper reading

Application

Rendering more detail at the locations users fixate on and less detail in the rest of the image.

Page 63: ICCV  &  CVPR  paper reading

Conclusion (1/4)

Created a database containing real eye tracking data.

Page 64: ICCV  &  CVPR  paper reading

Conclusion (2/4)

The saliency model combines:
• Low-level features
• Mid-level features
• High-level features
• Center prior

Page 65: ICCV  &  CVPR  paper reading

Conclusion (3/4)

Compared the effect of each subset of the features on the saliency map.

Page 66: ICCV  &  CVPR  paper reading

Conclusion (4/4)

Gave an example of the model's application.