# 16 17 bag_words

Post on 20-May-2015

148 views

Embed Size (px)

DESCRIPTION

Lecture 8 & 9 for Computer Vision by Dr. Usman GhaniTRANSCRIPT

<ul><li> 1. Bag-of-Words modelsComputer Vision Lecture 8Slides from: S. Lazebnik, A. Torralba, L. Fei-Fei, D. Lowe, C. Szurka</li></ul>
<p> 2. Bag-of-features models 3. Overview: Bag-of-features models Origins and motivation Image representation Discriminative methods Nearest-neighbor classification Support vector machines Generative methods Nave Bayes Probabilistic Latent Semantic Analysis Extensions: incorporating spatial information 4. Origin 1: Texture recognition Texture is characterized by the repetition of basicelements or textons For stochastic textures, it is the identity of the textons,not their spatial arrangement, that mattersJulesz, 1981; Cula & Dana, 2001; Leung & Malik 2001; Mori, Belongie & Malik, 2001; Schmid2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003 5. Origin 1: Texture recognitionhistogramUniversal texton dictionaryJulesz, 1981; Cula & Dana, 2001; Leung & Malik 2001; Mori, Belongie & Malik, 2001; Schmid2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003 6. Origin 2: Bag-of-words models Orderless document representation: frequenciesof words from a dictionary Salton & McGill (1983) 7. Origin 2: Bag-of-words models Orderless document representation: frequenciesof words from a dictionary Salton & McGill (1983)US Presidential Speeches Tag Cloud http://chir.ag/phernalia/preztags/ 8. Origin 2: Bag-of-words models Orderless document representation: frequenciesof words from a dictionary Salton & McGill (1983)US Presidential Speeches Tag Cloud http://chir.ag/phernalia/preztags/ 9. Origin 2: Bag-of-words models Orderless document representation: frequenciesof words from a dictionary Salton & McGill (1983)US Presidential Speeches Tag Cloud http://chir.ag/phernalia/preztags/ 10. Bags of features for imageclassification1. Extract features 11. Bags of features for imageclassification1. Extract features2. Learn visual vocabulary 12. Bags of features for imageclassification1. Extract features2. Learn visual vocabulary3. Quantize features using visual vocabulary 13. Bags of features for imageclassification1. Extract features2. Learn visual vocabulary3. Quantize features using visual vocabulary4. Represent images by frequencies of visual words 14. 1. Feature extraction Regular grid Vogel & Schiele, 2003 Fei-Fei & Perona, 2005 15. 1. Feature extraction Regular grid Vogel & Schiele, 2003 Fei-Fei & Perona, 2005 Interest point detector Csurka et al. 2004 Fei-Fei & Perona, 2005 Sivic et al. 2005 16. 1. Feature extraction Regular grid Vogel & Schiele, 2003 Fei-Fei & Perona, 2005 Interest point detector Csurka et al. 2004 Fei-Fei & Perona, 2005 Sivic et al. 2005 Other methods Random sampling (Vidal-Naquet & Ullman, 2002) Segmentation-based patches (Barnard et al. 2003) 17. 1. Feature extractionCompute SIFT descriptorNormalize patch [Lowe99]Detect patches [Mikojaczyk and Schmid 02] [Mata, Chum, Urban & Pajdla, 02] [Sivic & Zisserman, 03] Slide credit: Josef Sivic 18. 1. Feature extraction 19. 2. Learning the visual vocabulary 20. 2. Learning the visual vocabulary Clustering Slide credit: Josef Sivic 21. 2. Learning the visual vocabulary Visual vocabularyClustering Slide credit: Josef Sivic 22. K-means clustering Want to minimize sum of squaredEuclidean distances between points xi andtheir nearest cluster centers mkD( X , M ) = ( xi mk ) 2 cluster k point i in cluster k Algorithm: Randomly initialize K cluster centers Iterate until convergence: Assign each data point to the nearest center Recompute each cluster center as the mean ofall points assigned to it 23. From clustering to vector quantization Clustering is a common method for learning a visualvocabulary or codebook Unsupervised learning process Each cluster center produced by k-means becomes a codevector Codebook can be learned on separate training set Provided the training set is sufficiently representative, the codebook will be universal The codebook is used for quantizing features A vector quantizer takes a feature vector and maps it to the index of the nearest codevector in a codebook Codebook = visual vocabulary Codevector = visual word 24. Example visual vocabularyFei-Fei et al. 2005 25. Image patch examples of visual words Sivic et al. 2005 26. Visual vocabularies: Issues How to choose vocabulary size? Too small: visual words not representative of allpatches Too large: quantization artifacts, overfitting Generative or discriminative learning? Computational efficiency Vocabulary trees(Nister & Stewenius, 2006) 27. 3. Image representationfrequency..codewords 28. Image classification Given the bag-of-features representations ofimages from different classes, how do welearn a model for distinguishing them? 29. Discriminative and generative methods for bags of features Zebra Non-zebra 30. Discriminative methods Learn a decision rule (classifier) assigning bag-of-features representations of images todifferent classesDecisionZebraboundaryNon-zebra 31. Classification Assign input vector to one of two or moreclasses Any decision rule divides input space intodecision regions separated by decisionboundaries 32. Nearest Neighbor Classifier Assign label of nearest training data pointto each test data pointfrom Duda et al.Voronoi partitioning of feature space for two-category 2D and 3D dataSource: D. Lowe 33. K-Nearest Neighbors For a new point, find the k closest points from trainingdata Labels of the k points vote to classify Works well provided there is lots of data and the distancefunction is good k=5Source: D. Lowe 34. Functions for comparing histograms L1 distance ND(h1 , h2 ) = | h1 (i ) h2 (i ) | i =1D(h1 , h2 ) = N ( h1 (i) h2 (i) ) 2 2 distancei =1 h1 (i ) + h2 (i ) 35. Linear classifiers Find linear function (hyperplane) to separatepositive and negative examples x i positive : xi w + b 0 x i negative : xi w + b < 0Which hyperplaneis best? Slide: S. Lazebnik 36. Support vector machines Find hyperplane that maximizes the marginbetween the positive and negative examplesC. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining andKnowledge Discovery, 1998 37. Support vector machines Find hyperplane that maximizes the marginbetween the positive and negative examples x i positive ( yi = 1) : xi w + b 1 x i negative ( yi = 1) :x i w + b 1 For support vectors,x i w + b = 1 Distance between point| xi w + b | and hyperplane: || w || Therefore, the margin is 2 / ||w|| Support vectorsMarginC. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining andKnowledge Discovery, 1998 38. Finding the maximum marginhyperplane1. Maximize margin 2/||w||2. Correctly classify all training data: x i positive ( yi = 1) :xi w + b 1 x i negative ( yi = 1) : x i w + b 1C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining andKnowledge Discovery, 1998 39. Nonlinear SVMs Datasets that are linearly separable work out great: 0x But what if the dataset is just too hard?0 x We can map it to a higher-dimensional space:x2 0x Slide credit: Andrew Moore 40. Nonlinear SVMs General idea: the original input space canalways be mapped to some higher-dimensional feature space where the trainingset is separable: : x (x) Slide credit: Andrew Moore 41. Nonlinear SVMs The kernel trick: instead of explicitlycomputing the lifting transformation (x),define a kernel function K such thatK(xi, xj) = (xi ) (xj) (to be valid, the kernel function must satisfyMercers condition) This gives a nonlinear decision boundary inthe original feature space: y K ( x , x) + b ii iiC. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining andKnowledge Discovery, 1998 42. Kernels for bags of features Histogram intersection kernel:NI (h1 , h2 ) = min(h1 (i ), h2 (i )) i =1 Generalized Gaussian kernel: 12K (h1 , h2 ) = exp D(h1 , h2 ) A D can be Euclidean distance, 2 distanceJ. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, Local Features and Kernels for Classifcationof Texture and Object Categories: A Comprehensive Study, IJCV 2007 43. What about multi-class SVMs? Unfortunately, there is no definitive multi-class SVMformulation In practice, we have to obtain a multi-class SVM bycombining multiple two-class SVMs One vs. others Traning: learn an SVM for each class vs. the others Testing: apply each SVM to test example and assign to it the class of the SVM that returns the highest decision value One vs. one Training: learn an SVM for each pair of classes Testing: each learned SVM votes for a class to assign to the test example Slide: S. Lazebnik 44. SVMs: Pros and cons Pros Many publicly available SVM packages:http://www.kernel-machines.org/software Kernel-based framework is very powerful, flexible SVMs work very well in practice, even with verysmall training sample sizes Cons No direct multi-class SVM, must combine two-class SVMs Computation, memory During training time, must compute matrix of kernel values for every pair of examples Learning can take a very long time for large-scale problemsSlide: S. Lazebnik 45. Summary: Discriminative methods Nearest-neighbor and k-nearest-neighbor classifiers L1 distance, 2 distance, quadratic distance, Support vector machines Linear classifiers Margin maximization The kernel trick Kernel functions: histogram intersection, generalized Gaussian, pyramid match Multi-class Of course, there are many other classifiers out there Neural networks, boosting, decision trees, 46. Generative learning methods for bags of features Model the probability of a bag of featuresgiven a class 47. Generative methods We will cover two models, both inspired bytext document analysis: Nave Bayes Probabilistic Latent Semantic Analysis 48. The Nave Bayes model Assume that each feature is conditionallyindependent given the class N p ( f1 , , f N | c) = p ( f i | c)i =1 fi: ith feature in the image N: number of features in the imageCsurka et al. 2004 49. The Nave Bayes model Assume that each feature is conditionallyindependent given the class NM p( f1 , , f N | c) = p ( f i | c) = p ( w j | c)n(w j ) i =1 j =1 fi: ith feature in the image N: number of features in the imagewj: jth visual word in the vocabularyM: size of visual vocabularyn(wj): number of features of type wj in the image Csurka et al. 2004 50. The Nave Bayes model Assume that each feature is conditionallyindependent given the class NM p( f1 , , f N | c) = p ( f i | c) = p ( w j | c) n(w j ) i =1 j =1 No. of features of type wj in training images of class cp(wj | c) = Total no. of features in training images of class c Csurka et al. 2004 51. The Nave Bayes model Maximum A Posteriori decision:M c* = arg max c p (c) p ( w j | c)n(w j )j =1 M= arg max c log p (c) + n( w j ) log p ( w j | c) j =1 (you should compute the log of the likelihood instead of the likelihood itself in order to avoid underflow) Csurka et al. 2004 52. The Nave Bayes model Graphical model: c w N Csurka et al. 2004 53. Probabilistic Latent Semantic Analysis= p1+ p2+ p3Image zebragrass treevisual topicsT. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999 54. Probabilistic Latent Semantic Analysis Unsupervised technique Two-level generative model: a document is amixture of topics, and each topic has its owncharacteristic word distributiondzwdocument topicwordP(z|d)P(w|z)T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999 55. Probabilistic Latent Semantic Analysis Unsupervised technique Two-level generative model: a document is amixture of topics, and each topic has its owncharacteristic word distributiondzw Kp ( wi | d j ) = p ( wi | z k ) p ( z k | d j )k =1T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999 56. The pLSA modelKp ( wi | d j ) = p ( wi | z k ) p ( z k | d j )k =1Probability of word iProbability of Probability of in document jword i giventopic k given(known) topic kdocument j(unknown)(unknown) 57. The pLSA modelK p ( wi | d j ) = p ( wi | z k ) p ( z k | d j )k =1 documents topicsdocumentswordswordstopicsp(zk|dj)p(wi|dj)=p(wi|zk)Observed codeword Codeword distributionsClass distributions distributions per topic (class)per image(MN) (MK)(KN) 58. Learning pLSA parametersMaximize likelihood of data: Observed counts of word i in document j M number of codewords N number of imagesSlide credit: Josef Sivic 59. Inference Finding the most likely topic (class) for an image: z = arg max p ( z | d )z 60. Inference Finding the most likely topic (class) for an image: z = arg max p ( z | d ) z Finding the most likely topic (class) for a visualword in a given image: z = arg max p ( z | w, d ) z 61. Topic discovery in imagesJ. Sivic, B. Russell, A. Efros, A. Zisserman, B. Freeman, Discovering Objects and their Locationin Images, ICCV 2005 62. Application of pLSA: Action recognitionSpace-time interest pointsJuan Carlos Niebles, Hongcheng Wang and Li Fei-Fei, Unsupervised Learning of Human ActionCategories Using Spatial-Temporal Words, IJCV 2008. 63. Application of pLSA: Action recognitionJuan Carlos Niebles, Hongcheng Wang and Li Fei-Fei, Unsupervised Learning of Human ActionCategories Using Spatial-Temporal Words, IJCV 2008. 64. pLSA modelKp ( wi | d j ) = p ( wi | z k ) p ( z k | d j )k =1Probability of word iProbability of Probability of in video j word i giventopic k given(known) topic kvideo j(unknown)(unknown) wi = spatial-temporal word dj = video n(wi, dj) = co-occurrence table(# of occurrences of word wi in video dj) z = topic, corresponding to an action 65. Action recognition example 66. Multiple Actions 67. Multiple Actions 68. Summary: Generative models Nave Bayes Unigram models in document analysis Assumes conditional independence of words givenclass Parameter estimation: frequency counting Probabilistic Latent Semantic Analysis Unsupervised technique Each document is a mixture of topics (image is amixture of classes) Can be thought of as matrix decomposition</p>