Robust Classification of Objects, Faces, and Flowers
Using Natural Image Statistics
Presenter: 王崇秀
Outline
- Authors
- Abstract
- Background
- Framework and Implementation
- Experiments and Results
- Conclusions
23/4/20 2
Authors
Christopher Kanan: Ph.D. student at the University of California, San Diego (UCSD); intends to graduate in 2012. Research interests: fusing findings and methods from computer vision, machine learning, psychology, and computational neuroscience. Homepage: http://cseweb.ucsd.edu/~ckanan/index.html Email: [email protected]
Authors
Garrison Cottrell: Professor in the Computer Science & Engineering Department at UCSD. Research: strongly interdisciplinary; it concerns using neural networks as a computational model applied to problems in cognitive science and artificial intelligence, engineering, and biology. He has had success using them for such disparate tasks as modeling how children acquire words, studying how lobsters chew, and nonlinear data compression.
Abstract
Classification of images in many category datasets has rapidly improved in recent years. However, systems that perform well on particular datasets typically have one or more limitations, such as a failure to generalize across visual tasks (e.g., requiring a face detector or extensive retuning of parameters), insufficient translation invariance, inability to cope with partial views and occlusion, or significant performance degradation as the number of classes is increased.
Here we attempt to overcome these challenges using a model that combines sequential visual attention using fixations with sparse coding. The model's biologically inspired filters are acquired using unsupervised learning applied to natural image patches. Using only a single feature type, our approach achieves 78.5% accuracy on Caltech-101 and 75.2% on the 102 Flowers dataset when trained on 30 instances per class, and it achieves 92.7% accuracy on the AR Face database with 1 training instance per person. The same features and parameters are used across these datasets to illustrate its robust performance.
Abstract (translated from Chinese)
Image classification performance on many multi-category datasets has improved rapidly in recent years. However, systems that perform well on a particular dataset often have one or more limitations, such as failing to generalize across visual tasks (e.g., requiring a face detector or extensive parameter retuning), insufficient translation invariance, inability to handle partial occlusion, or significant performance degradation as the number of classes grows.
Here we attempt to overcome these challenges with a model that combines sequential visual attention using fixations with sparse coding. The model's biologically inspired filters are learned by unsupervised learning on natural image patches. Using only a single feature type and 30 training instances per class, the method reaches 78.5% accuracy on Caltech-101 and 75.2% on the 102-class flower dataset; with 1 training instance per person, it reaches 92.7% accuracy on the AR face database. The same features and parameters are used across these datasets, demonstrating the method's robustness.
Background — Using Natural Image Statistics
- Hand-designed features: Haar, DoG, Gabor, HOG, SIFT, and so on.
- Self-taught learning: applied to unlabeled natural images to learn basis vectors/filters that are good for representing natural images. The training data is generally distinct from the datasets the system will be evaluated on. Self-taught learning works well because it represents natural scenes efficiently while not overfitting to a particular dataset.
- Sparse coding
Background — Visual Attention
- A saliency map is a topographically organized map that indicates interesting regions in an image, based on the spatial organization of the features and an agent's current goal.
- Computational models: there are many; they typically produce maps that assign high saliency to regions with rare features.
Background — Sequential Object Recognition
- Although many saliency-map algorithms have been used to predict the locations of human eye movements, little work has been done on using them to recognize individual objects.
- There are a few notable exceptions [1, 23, 27, 15], and these approaches share several similarities.
- Framework: extract features -> compute saliency maps from the features -> extract a small window representing a fixation, classify it, and use the result to guide subsequent fixations -> combine information across fixations.
- NIMBLE framework
Framework and Implementation
High-level description of the model:
- Pre-process the image to cope with luminance variation.
- Extract sparse ICA features from the image.
- Use the sparse ICA features to compute a saliency map, which is treated as a probability distribution; locations are randomly sampled from the map.
- Extract fixations from the feature maps at the sampled locations, followed by probabilistic classification.
Framework and Implementation
Image pre-processing:
- Resize so that the smallest dimension is 128 pixels, with the other dimension scaled to maintain the aspect ratio.
- Grayscale images are converted to color.
- RGB → LMS: a color space representing the responses of the three types of cones in the human eye, named for their responsivity (sensitivity) at long, medium, and short wavelengths.
- Normalization to [0,1]: r_linear(z) ∈ [0,1] is a pixel of the image in LMS color space at location z; note that r_nonlinear(z) ∈ [0,1] as well.
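These pre-processing steps can be sketched as follows. The slide does not give the exact RGB → LMS conversion, so a common sRGB → XYZ → LMS (Hunt-Pointer-Estevez) path is assumed here, and the resizing step is omitted for brevity:

```python
import numpy as np

# Linear sRGB -> XYZ -> LMS (Hunt-Pointer-Estevez). This is one common
# conversion path, assumed here; the paper may use a different matrix.
RGB2XYZ = np.array([[0.4124, 0.3576, 0.1805],
                    [0.2126, 0.7152, 0.0722],
                    [0.0193, 0.1192, 0.9505]])
XYZ2LMS = np.array([[ 0.3897, 0.6890, -0.0787],
                    [-0.2298, 1.1834,  0.0464],
                    [ 0.0,    0.0,     1.0   ]])
RGB2LMS = XYZ2LMS @ RGB2XYZ

def preprocess(img):
    """img: H x W x 3 (or H x W grayscale) float array in [0, 1].
    Returns an LMS image normalized to [0, 1]."""
    if img.ndim == 2:                  # grayscale -> replicate to 3 channels
        img = np.stack([img] * 3, axis=-1)
    lms = img @ RGB2LMS.T              # per-pixel 3x3 color transform
    lms = lms - lms.min()
    return lms / lms.max()             # normalize to [0, 1]
```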
Framework and Implementation
Image pre-processing:
Framework and Implementation
Feature learning: To learn the ICA filters, we preprocess 584 images from the McGill color image dataset. From each image, 100 b×b×3 patches are extracted at random locations. The channel means (L, M, and S), computed across images, are subtracted from each patch. Each patch is then treated as a 3b²-dimensional vector.
PCA is applied to the patch collection to reduce the dimensionality (the first principal component is discarded; the next d principal components are retained).
FastICA is then applied, yielding d ICA filters. An m×n×3 image is mapped to m×n×d filter responses, a sparse representation.
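A minimal sketch of this filter-learning pipeline, using scikit-learn's PCA and FastICA. The patch size b, patches per image, and d below are placeholder values, not the paper's settings:

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

def learn_ica_filters(images, b=8, patches_per_image=100, d=20, seed=0):
    """Sample b x b x 3 patches, subtract per-channel means, reduce with
    PCA (dropping the 1st component), then run FastICA.
    `images` is a list of H x W x 3 float arrays."""
    rng = np.random.default_rng(seed)
    patches = []
    for img in images:
        H, W, _ = img.shape
        for _ in range(patches_per_image):
            y = rng.integers(0, H - b + 1)
            x = rng.integers(0, W - b + 1)
            patches.append(img[y:y + b, x:x + b, :].reshape(-1))
    X = np.asarray(patches)                      # N x 3b^2 patch vectors
    # subtract the mean of each color channel, computed across all patches
    X3 = X.reshape(len(X), -1, 3)
    X3 = X3 - X3.mean(axis=(0, 1), keepdims=True)
    X = X3.reshape(len(X), -1)
    # PCA: discard the first principal component, keep the next d
    pca = PCA(n_components=d + 1).fit(X)
    Xp = pca.transform(X)[:, 1:]
    ica = FastICA(n_components=d, whiten="unit-variance", random_state=seed)
    ica.fit(Xp)
    return ica                                   # d ICA filters (reduced space)
```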
Framework and Implementation
Feature learning:
The learned ICA filters (figure).
Framework and Implementation
Saliency Maps: Use the SUN model to generate the saliency map.
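The SUN model scores a location by the rarity of its features, roughly saliency(z) = -log p(F = f(z)). The sketch below is a simplification: it estimates each filter's response density with a histogram (SUN itself fits generalized Gaussian distributions), and includes the later step of sampling fixation locations from the map treated as a probability distribution:

```python
import numpy as np

def sun_saliency(responses, bins=32):
    """responses: m x n x d ICA filter responses. Returns an m x n map,
    normalized to sum to 1, where rare responses get high saliency."""
    m, n, d = responses.shape
    sal = np.zeros((m, n))
    for j in range(d):
        f = responses[:, :, j]
        # histogram density estimate of this filter's responses (assumed
        # simplification of SUN's generalized-Gaussian fit)
        hist, edges = np.histogram(f, bins=bins, density=True)
        idx = np.clip(np.digitize(f, edges) - 1, 0, bins - 1)
        sal += -np.log(hist[idx] + 1e-12)   # rarity -> high saliency
    sal -= sal.min()
    return sal / sal.sum()

def sample_fixations(saliency, T, rng):
    """Draw T fixation locations with probability given by the map."""
    flat = saliency.ravel()
    idx = rng.choice(flat.size, size=T, p=flat)
    return np.column_stack(np.unravel_index(idx, saliency.shape))
```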
Framework and Implementation
Spatial Pooling:
- The saliency map is normalized to sum to one and treated as a probability distribution.
- It is sampled T times; during each fixation t, a location ℓ_t is chosen according to the saliency map, giving a w×w×d (w = 51) stack of filter responses.
- The dimensionality of the stack is reduced by spatially subsampling it with a spatial pyramid, which divides each w×w set of filter responses into 1×1, 2×2, and 4×4 grids; the mean filter response in each grid cell is computed, the cell means are concatenated to form a vector, and the vector is normalized to unit length.
- This reduces the dimensionality of the fixation from w×w×d (51²d) to 21d. The location ℓ_t is normalized by the height and width of the image and stored along with the corresponding features.
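The pyramid pooling step can be sketched as follows; how the grid edges are rounded when w = 51 is not evenly divisible is an implementation assumption:

```python
import numpy as np

def pyramid_pool(window):
    """Pool a w x w x d stack of filter responses with a 1x1, 2x2, 4x4
    spatial pyramid: mean response per grid cell, concatenated and
    normalized to unit length -> a (1 + 4 + 16) * d = 21d vector."""
    w, _, d = window.shape
    feats = []
    for g in (1, 2, 4):
        # integer cell boundaries; handles w not divisible by g (e.g. w = 51)
        edges = np.linspace(0, w, g + 1).astype(int)
        for i in range(g):
            for j in range(g):
                cell = window[edges[i]:edges[i + 1], edges[j]:edges[j + 1], :]
                feats.append(cell.mean(axis=(0, 1)))   # d values per cell
    v = np.concatenate(feats)
    return v / np.linalg.norm(v)
```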
Framework and Implementation
Spatial Pooling: After acquiring T fixations from every training image, PCA is applied to the collected feature vectors. The first 500 principal components are retained and then whitened. Finally, the post-PCA fixation features, denoted w_{k,i}, are each made unit length.
Framework and Implementation
Training and Classification
- Naïve Bayes' assumption: P(g_1, …, g_T | C = k) = ∏_{t=1}^{T} P(g_t | C = k), where g_t is the vector of fixation features at fixation t.
- Bayes' rule: P(C = k | g_1, …, g_T) ∝ P(C = k) ∏_{t=1}^{T} P(g_t | C = k).
- P(C = k) is uniform, and we fix T = 100, which would be about 30 s of viewing time for a person, assuming 3 fixations/second.
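Combining fixations under the naive Bayes assumption might look like the following sketch. The Gaussian kernel density estimate and its bandwidth h are stand-ins for the paper's nonparametric density model, not its exact form:

```python
import numpy as np

def classify(fixations, exemplars, h=0.1):
    """Pick argmax_k sum_t log P(g_t | C = k) (uniform prior drops out).
    P(g_t | C = k) is estimated with a Gaussian kernel density over the
    stored training fixations of class k -- an assumed simplification.
    `fixations`: T x D array; `exemplars`: dict class -> N_k x D array."""
    scores = {}
    for k, E in exemplars.items():
        logp = 0.0
        for g in fixations:                      # g: D-dim fixation feature
            sq = ((E - g) ** 2).sum(axis=1)      # squared distances to exemplars
            dens = np.exp(-sq / (2 * h * h)).mean() + 1e-300
            logp += np.log(dens)                 # naive Bayes: sum over fixations
        scores[k] = logp
    return max(scores, key=scores.get)
```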
Experiments and Results
Caltech101 results
Experiments and Results
Caltech256 results
Experiments and Results
AR face database
Experiments and Results
102 Flower database
Conclusions
One of the reasons we think our approach works well is because it employs a nonparametric exemplar-based classifier.
The Naïve Bayes' assumption is obviously false, and learning a more flexible model could lead to performance improvements.