Transcript
  • Efficient Image Search and Retrieval using Compact Binary Codes. Rob Fergus (NYU), Antonio Torralba (MIT), Yair Weiss (Hebrew U.)

  • Large scale image search: the Internet contains many billions of images. How can we search them based on visual content? The challenge: we need a way of measuring similarity between images, and it needs to scale to the Internet.

  • Existing approaches to Content-Based Image Retrieval focus on scaling rather than on understanding the image. They use a variety of simple, hand-designed cues (color and/or texture histograms, shape, PCA, etc.) and various distance metrics, e.g. the Earth Mover's Distance (Rubner et al., 1998).

    Most recognition approaches are slow (~1 sec/image).

  • Our approach: learn the metric from training data, and use compact binary codes for speed.

    DO BOTH TOGETHER.

  • Large scale image/video search: the representation must fit in memory (disk is too slow).

    Facebook has ~10 billion images (10^10). A PC has ~10 GB of memory (~10^11 bits), giving a budget of ~10^1 bits/image.

    YouTube has ~a trillion video frames (10^12). A big cluster of PCs has ~10 TB (~10^14 bits), giving a budget of ~10^2 bits/frame.
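    The bit budgets above are simple order-of-magnitude arithmetic; a minimal sketch, using only the numbers on the slide:

```python
# Order-of-magnitude bit budgets from the slide.
images_facebook = 1e10       # ~10 billion images
pc_memory_bits = 1e11        # ~10 GB of RAM is roughly 10^11 bits
print(pc_memory_bits / images_facebook)        # -> 10.0 bits per image

frames_youtube = 1e12        # ~1 trillion video frames
cluster_memory_bits = 1e14   # ~10 TB across a cluster is roughly 10^14 bits
print(cluster_memory_bits / frames_youtube)    # -> 100.0 bits per frame
```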

  • Binary codes for images: we want images with similar content to have similar binary codes. (Cf. Semantic Hashing [Salakhutdinov & Hinton, 2007], originally for text documents.)

    Use the Hamming distance between codes, i.e. the number of bit flips. E.g.:
    Ham_Dist(10001010, 10001110) = 1
    Ham_Dist(10001010, 11101110) = 3
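    A minimal Python illustration of the Hamming distance, reproducing the two examples above (the helper name is just for illustration):

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of bit positions at which two equal-length binary codes differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("10001010", "10001110"))  # -> 1
print(hamming_distance("10001010", "11101110"))  # -> 3

# Equivalently, for codes stored as integers: XOR the codes and count the set bits.
print(bin(0b10001010 ^ 0b11101110).count("1"))   # -> 3
```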

  • Semantic Hashing [Salakhutdinov & Hinton, 2007], originally for text documents. Quite different to a (conventional) randomizing hash.
    [Diagram: query image -> semantic hash function -> binary code used as the query address; semantically similar images in the database lie at nearby addresses in the address space.]

  • Semantic Hashing: each image code is a memory address. Find neighbors by exploring the Hamming ball around the query address.

    [Diagram: query address and database images in the address space.] We choose the code length and the radius. Lookup time is independent of the number of data points; it depends only on the radius of the ball and the length of the code.
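    A toy sketch of this lookup, assuming the code is used directly as an integer address and the query probes every address within a chosen Hamming radius; the 8-bit codes and function names are illustrative only:

```python
from collections import defaultdict
from itertools import combinations

def build_index(items):
    """Store each database item at the address given by its integer binary code."""
    index = defaultdict(list)
    for item_id, code in items:
        index[code].append(item_id)
    return index

def hamming_ball(code, n_bits, radius):
    """Yield every integer code within the given Hamming radius of `code`."""
    yield code
    for r in range(1, radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for p in positions:
                flipped ^= 1 << p          # flip bit p
            yield flipped

def lookup(index, query_code, n_bits, radius):
    """Probes sum_{r<=radius} C(n_bits, r) addresses: independent of database size."""
    hits = []
    for address in hamming_ball(query_code, n_bits, radius):
        hits.extend(index.get(address, []))
    return hits

index = build_index([("imgA", 0b10001010), ("imgB", 0b10001110), ("imgC", 0b11101110)])
print(lookup(index, 0b10001010, n_bits=8, radius=1))   # -> ['imgA', 'imgB']
```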

  • Code requirements: similar images should map to similar codes, and the codes must be very compact (cf. the bit budgets above).
  • Input image representation: Gist vectors. Pixels are not a convenient representation, so use the Gist descriptor instead (Oliva & Torralba, IJCV 2001): 512 dimensions per image (real-valued, i.e. 512 x 32 = 16,384 bits). The L2 distance between Gist vectors is not a bad substitute for human perceptual distance. Note: no color information is used.

  • 1. Locality Sensitive Hashing (Gionis, Indyk & Motwani, 1999)

    Take random projections of the data (the Gist descriptor) and quantize each projection with a few bits. No learning involved.
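    A minimal sketch of this baseline, assuming one sign bit per random projection (the slide quantizes each projection with a few bits); the shapes and names are illustrative:

```python
import numpy as np

def lsh_codes(gist, n_bits, seed=0):
    """LSH baseline: project Gist descriptors onto random directions and keep
    one sign bit per projection. No learning involved."""
    rng = np.random.default_rng(seed)
    projections = rng.standard_normal((gist.shape[1], n_bits))   # random directions
    return (gist @ projections > 0).astype(np.uint8)             # one bit per projection

gist = np.random.randn(1000, 512)          # 1000 stand-in 512-D Gist descriptors
print(lsh_codes(gist, n_bits=30).shape)    # -> (1000, 30)
```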

  • 2. Boosting: a modified form of BoostSSC [Shakhnarovich, Viola & Darrell, 2003]. Positive examples are pairs of similar images; negative examples are pairs of unrelated images.

    Learn a threshold & dimension for each bit (a weak classifier).
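    A toy sketch in the spirit of this step (not the exact BoostSSC formulation): each bit is a decision stump on one Gist dimension, chosen greedily so that similar pairs agree on the bit and dissimilar pairs disagree, with boosting-style reweighting of the training pairs.

```python
import numpy as np

def learn_bits(x1, x2, similar, n_bits, n_thresholds=16):
    """x1, x2: (n_pairs, d) Gist vectors of paired images; similar: bool array per pair."""
    n_pairs, d = x1.shape
    w = np.full(n_pairs, 1.0 / n_pairs)          # weights over training pairs
    stumps = []                                  # chosen (dimension, threshold) per bit
    for _ in range(n_bits):
        best = (np.inf, 0, 0.0)
        for dim in range(d):
            for t in np.linspace(x1[:, dim].min(), x1[:, dim].max(), n_thresholds):
                agree = (x1[:, dim] > t) == (x2[:, dim] > t)
                err = float(np.sum(w[agree != similar]))   # weighted pair error
                if err < best[0]:
                    best = (err, dim, t)
        err, dim, t = best
        stumps.append((dim, t))
        # Boosting-style reweighting: upweight pairs this bit handled wrongly.
        agree = (x1[:, dim] > t) == (x2[:, dim] > t)
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        w *= np.exp(alpha * (agree != similar))
        w /= w.sum()
    return stumps
```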

  • 3. Restricted Boltzmann Machine (RBM): a type of Deep Belief Network (Hinton & Salakhutdinov, Science 2006). A single RBM layer attempts to reconstruct the input at the visible layer from the activation of the hidden layer (via the weights W).
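    A minimal numpy sketch of what a single binary-hidden RBM layer does with the weights W (the biases and the sampling step are assumptions of the sketch, not details from the slide):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_pass(v, W, b_hidden, b_visible, rng=np.random.default_rng(0)):
    """Compute hidden activations from the visible layer, then attempt to
    reconstruct the visible layer from a sample of the hidden units."""
    p_hidden = sigmoid(v @ W + b_hidden)                        # hidden activation probabilities
    h = (rng.random(p_hidden.shape) < p_hidden).astype(float)   # stochastic binary hidden states
    v_reconstructed = sigmoid(h @ W.T + b_visible)              # reconstruction at the visible layer
    return p_hidden, v_reconstructed
```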

  • Multi-Layer RBM: non-linear dimensionality reduction.
    [Architecture: input Gist vector (512 dimensions) -> Layer 1 (512 units, weights w1; linear units at the first layer) -> Layer 2 (256 units, weights w2) -> Layer 3 (N units, weights w3) -> output binary code (N dimensions).]
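    Stacking the layer sizes from the slide, the encoder is just three layers applied in sequence; a deterministic forward-pass sketch (the logistic non-linearity on hidden units is an assumption here; the weights would come from the pre-training and fine-tuning described next):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(gist, w1, b1, w2, b2, w3, b3):
    """512-D real-valued Gist -> 512 -> 256 -> N unit activations (binarized later)."""
    h1 = sigmoid(gist @ w1 + b1)   # Layer 1: 512 -> 512 (input units are linear/real-valued)
    h2 = sigmoid(h1 @ w2 + b2)     # Layer 2: 512 -> 256
    return sigmoid(h2 @ w3 + b3)   # Layer 3: 256 -> N
```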

  • Training RBM models. 1st Phase: Pre-training

    Unsupervised

    Can use unlabeled data (unlimited quantity)

    Learn parameters greedily per layer

    Gets them to the right ballpark (see the CD-1 sketch after this list)

    2nd Phase: Fine-tuning

    Supervised

    Requires labeled data (limited quantity)

    Back-propagate gradients of a chosen error function

    Moves the parameters to a local minimum
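    A sketch of one greedy pre-training step for a single layer, assuming contrastive divergence with a single Gibbs step (CD-1), the usual choice for training such stacks; the learning rate and the mean-field reconstruction are assumptions of the sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_hidden, b_visible, lr=0.01, rng=np.random.default_rng(0)):
    """One CD-1 parameter update for a single RBM layer on a mini-batch v0 (updates in place)."""
    p_h0 = sigmoid(v0 @ W + b_hidden)                     # positive-phase hidden probabilities
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)    # sample binary hidden states
    v1 = sigmoid(h0 @ W.T + b_visible)                    # mean-field reconstruction
    p_h1 = sigmoid(v1 @ W + b_hidden)                     # negative-phase hidden probabilities
    W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / len(v0)       # approximate likelihood gradient
    b_hidden += lr * (p_h0 - p_h1).mean(axis=0)
    b_visible += lr * (v0 - v1).mean(axis=0)
```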

  • Greedy pre-training (Unsupervised). Layer 1: input Gist vector (512 real dimensions) -> 512 hidden units (weights w1).

  • Greedy pre-training (Unsupervised). Layer 2: activations of the hidden units from layer 1 (512 binary dimensions) -> 256 hidden units (weights w2).

  • Greedy pre-training (Unsupervised). Layer 3: activations of the hidden units from layer 2 (256 binary dimensions) -> N hidden units (weights w3).

  • Fine-tuning: back-propagation of the Neighborhood Components Analysis objective through the whole stack: input Gist vector (512 real dimensions) -> Layer 1 (512, w1) -> Layer 2 (256, w2) -> Layer 3 (N, w3) -> output binary code (N dimensions).

  • Neighborhood Components Analysis (Goldberger, Roweis, Salakhutdinov & Hinton, NIPS 2004). Tries to preserve the neighborhood structure of the input space; assumes this structure is given (explained later). Points live in the output space, where each coordinate is the activation probability of a unit. [Toy example with 2 classes & N=2 units at the top of the network.]

  • Neighborhood Components Analysis: adjust the network parameters (weights and biases) to move points of the SAME class closer together and points of DIFFERENT classes further apart. Points that are close in the input (Gist) space will then be close in the output code space.
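    A numpy sketch of the NCA quantity being maximized: the expected number of points whose stochastically picked neighbor (softmax over negative squared distances in code space) shares their label; the unit temperature of the softmax is an assumption of the sketch.

```python
import numpy as np

def nca_objective(codes, labels):
    """codes: (n, N) top-layer activations; labels: (n,) class ids."""
    diff = codes[:, None, :] - codes[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)            # pairwise squared distances in code space
    np.fill_diagonal(d2, np.inf)               # a point never picks itself
    p = np.exp(-d2)
    p /= p.sum(axis=1, keepdims=True)          # p[i, j]: prob. that i picks j as its neighbor
    same_class = labels[:, None] == labels[None, :]
    return float(np.sum(p * same_class))       # expected number of same-class picks
```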

  • Simple Binarization Strategy: set a threshold, e.g. use the median.

    Deliberately add noise
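    A sketch of this binarization, assuming a per-unit median threshold computed over the database (so each bit fires for roughly half the images); treating the "deliberately add noise" point as noise added before thresholding is an assumption here:

```python
import numpy as np

def binarize(activations, noise_std=0.0, rng=np.random.default_rng(0)):
    """activations: (n_images, N) top-layer outputs -> (n_images, N) binary code bits."""
    a = activations + noise_std * rng.standard_normal(activations.shape)
    thresholds = np.median(a, axis=0)          # one threshold per output unit (the median)
    return (a > thresholds).astype(np.uint8)
```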

  • Overall Query Scheme: query image -> compute Gist descriptor -> RBM -> binary code -> Semantic Hash -> retrieved images from the database.
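    Tying the stages together, the query path is a straight pipeline; in this sketch every argument is a placeholder for one of the components sketched above:

```python
def query(image, compute_gist, encode, binarize, lookup_fn):
    """image -> Gist descriptor -> trained network -> binary code -> semantic-hash lookup."""
    gist = compute_gist(image)       # 512-D Gist descriptor
    activations = encode(gist)       # forward pass through the trained stack
    code = binarize(activations)     # N-bit binary code
    return lookup_fn(code)           # Hamming-ball probe of the semantic hash
```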
  • Retrieval Experiments

  • Test set 1: LabelMe. 22,000 images (20,000 train | 2,000 test), with ground-truth segmentations for all. A ground-truth distance between images can be defined using these segmentations.

  • Defining ground truth: Boosting and NCA back-propagation require a ground-truth distance between images. Define this using the labeled images from LabelMe.

  • Defining ground truth: Pyramid Match (Lazebnik et al. 2006; Grauman & Darrell 2005), varying the spatial resolution to capture approximate spatial correspondence.
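    A rough sketch of a pyramid-match-style similarity between two LabelMe label maps: compare object-label histograms over grids of increasing spatial resolution, weighting finer levels more heavily. The weighting scheme and helper names are assumptions, not the authors' exact ground-truth definition.

```python
import numpy as np

def cell_histogram(label_map, row, col, n_cells, n_classes):
    """Normalized histogram of object labels inside one cell of an n_cells x n_cells grid."""
    h, w = label_map.shape
    cell = label_map[row * h // n_cells:(row + 1) * h // n_cells,
                     col * w // n_cells:(col + 1) * w // n_cells]
    hist = np.bincount(cell.ravel(), minlength=n_classes).astype(float)
    return hist / max(hist.sum(), 1.0)

def pyramid_similarity(labels_a, labels_b, n_classes, levels=3):
    """Histogram-intersection similarity over a spatial pyramid of label histograms."""
    score = 0.0
    for level in range(levels):
        n_cells = 2 ** level
        weight = 1.0 / 2 ** (levels - 1 - level)   # finer grids get larger weight
        for row in range(n_cells):
            for col in range(n_cells):
                ha = cell_histogram(labels_a, row, col, n_cells, n_classes)
                hb = cell_histogram(labels_b, row, col, n_cells, n_classes)
                score += weight * np.minimum(ha, hb).sum()
    return score
```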

  • Examples of LabelMe retrieval: the 12 closest neighbors under different distance metrics.

  • LabelMe Retrieval. [Plot: % of 50 true neighbors in the retrieval set vs. size of the retrieval set (0 to 20,000).]

  • LabelMe Retrieval. [Plots: % of 50 true neighbors in the retrieval set vs. size of the retrieval set; % of 50 true neighbors in the first 500 retrieved vs. number of bits.]

  • Test set 2: Web images. 12.9 million images collected from the Internet. No labels, so the Euclidean distance between Gist vectors is used as the ground-truth distance.

  • Web images retrieval. [Plot: % of 50 true neighbors in the retrieval set vs. size of the retrieval set.]

  • Web images retrieval. [Plots (two panels): % of 50 true neighbors in the retrieval set vs. size of the retrieval set.]

  • Examples of Web retrieval: 12 neighbors using different distance metrics.

  • Retrieval Timings

  • Summary: explored various approaches to learning binary codes for hashing-based retrieval. Very quick, with performance comparable to complex descriptors.

    More recent work on binarization: Spectral Hashing (Weiss, Torralba & Fergus, NIPS 2009).


