
D5.1: Visual Search Tools – Report – Version A

Project ref. no. FP7-ICT-610691

Project acronym BRIDGET

Start date of project (duration) 2013-11-01 (36 months)

Document due date 2015-04-30

Actual date of delivery 2015-04-30

Leader of this document Miroslaw Bober

Reply to

Document status Draft for QA


Deliverable Identification Sheet

Project ref. no. FP7-ICT-610691

Project acronym BRIDGET

Project full title BRIDging the Gap for Enhanced broadcasT

Document name D5.1: Visual Search Tools – Report – Version A (First set of Visual Search Tools)

Security (distribution level) PU

Contractual date of delivery 30/04/2015

Actual date of delivery 30/04/2015

Document number

Type Deliverable

Status & version Version 2.0

Number of pages 27

WP / Task responsible WP5

Other contributors Gianluca Francini, Syed Husain, Skjalg Lepsoy, Simone Madeo, Stavros Paschalakis.

Author(s) Miroslaw Bober

Project Officer Alberto Rabbachin

Abstract This report describes version A of BRIDGET’s Visual Search Tools.

Keywords

Sent to peer reviewer 2015/04/23

Peer review completed 2015/04/29

Circulated to partners 2015/04/29

Read by partners 2015/04/30

Mgt. Board approval N/A


Version Date Reason of change

0.1 2014-04-11 Miroslaw Bober – Initial draft

0.2 2014-12-13 Contribution from S.Lepsoy (TI) added

0.3 2014-12-13 Contribution from S.Paschalakis (VA) added

1.0 2014-12-30 Miroslaw Bober – Integration and first review

1.1 2015-01-12 S.Lepsoy, G. Francini– Additions to Sections 5 and 6

1.2 2015-01-13 Miroslaw Bober – second review, cross-references

1.3 2015-01-13 QA review (M.Bober, S.Paschalakis, S.Lepsoy)

1.4 2015-01-15 QA review (M.Bober, S.Paschalakis, S.Lepsoy)

1.5 2015-01-15 Minor corrections, S. Lepsoy

1.6 2015-04-15 M. Bober – Major update

2.0 2015-04-23 M. Bober - Integrated with contributions from VA and TI

2.01 2015-04-23 M. Bober – minor comments from partners integrated

2.02 2015-04-30 M. Bober – Addressing comments from QA


Table of Contents

1 Executive summary
2 Introduction
  2.1 Algorithmic workflow
3 RVD – Robust Aggregation of Local Features
  3.1 New aggregation mechanism
  3.2 Rank-based assignment
  3.3 Scalability via cluster selection and bit selection mechanisms
  3.4 Experimental evaluation
  3.5 Conclusions
4 Local descriptors, descriptor compression schemes and local binary descriptors
  4.1 Prior art evaluation
  4.2 Joint Transform and Scalar Quantisation Coding (JTSQC)
  4.3 Conclusions
5 Development of a new tool for the geometric consistency check
  5.1 Multiframe DISTRAT
  5.2 DISTRAT: Histograms and logarithmic distance ratios
  5.3 The method: DISTRAT for multiple image pairs
    5.3.1 When values from different sources are measured
    5.3.2 How the method works
  5.4 Why it works
    5.4.1 Robustness increases with shot length
    5.4.2 The peak in the pmf often moves little
  5.5 Additional processing step
  5.6 Experiment: test on video sequences
    5.6.1 The alternative methods: late fusion strategies
    5.6.2 The outcome
  5.7 Conclusion
6 Fast computation of Scale Space
7 Conclusions
8 References


Table of Figures

Figure 1: BRIDGET Visual Search Engine descriptor extraction pipeline (work ongoing)
Figure 2: RVD processing pipeline
Figure 3: RVD aggregation approach: (a) rank-based cluster assignment and L1 normalisation, (b) Rank-1 aggregation (NR1), (c) Rank-2 aggregation, and (d) combination of ranks.
Figure 4: Example retrieval results for the RVD and FV descriptors on the Oxford1M dataset. For each query image (left) the corresponding ranked lists are shown for RVD (top centre-right) and FV (bottom centre-right); images correctly retrieved are marked with a red border.
Figure 5: Performance of state-of-the-art local binary descriptors on MPEG CDVS patches.
Figure 6: Performance of selected new designs for local descriptors (non-binary) on MPEG CDVS patches.
Figure 7: (a) SIFT descriptor layout and example histogram pairs for joint transform, (b) single histogram bin layout, (c) example descriptor elements.
Figure 8: True positive rates at a false positive rate of 0.01.
Figure 9: Scale space representation
Figure 10: Gaussian kernels used by the ALP keypoint detector adopted in MPEG CDVS
Figure 11: The Gaussian kernel of the first scale, its second derivatives and the approximation by means of box filtering
Figure 12: The Gaussian kernel of the first scale, its second derivatives and the approximation by means of BRIDGET's Fast Gaussian approach


1 Executive summary

This deliverable presents the main technical developments and research achievements within WP5 – Visual Search Tools. The key objective of WP5 is to develop advanced tools for automated Visual Search (VS) in image and video databases (broadcast, Internet), enabling fast creation of content-based links for the Authoring Tools (ATs) and user-originated search capabilities for the Player. This deliverable is structured according to the active WP5 tasks and focuses on the main results of the research between M3 and M18 of the project and on the deployment within the BRIDGET pipeline.

Robust Visual Search methods were advanced significantly in several areas. Firstly, a novel descriptor aggregation scheme called Robust Visual Descriptor (RVD) has been developed. It advances significantly beyond the state of the art (SOTA) in terms of recognition performance and speed, and forms the basis for a scalable, binary global descriptor. Secondly, we completed a comprehensive benchmarking and complexity evaluation of state-of-the-art local descriptors, descriptor compression schemes and local binary descriptors as a precursor to a new design specifically for the project. A new local descriptor compression scheme, called Joint Transform and Scalar Quantisation Coding (JTSQC), has been developed which maintains the performance and scalability of the SIFT descriptor and TSQC (CDVS standard-based) compression combination, but achieves a 50% reduction in computational complexity compared to TSQC. Thirdly, work has advanced on a new, fast method for determining geometric consistency and on its extension to video, called DISTRAT. Finally, computationally efficient methods to extract descriptors in the scale space have been successfully developed and integrated.

On the standardisation front, the project was deeply engaged in the CDVS (Compact Descriptors for Visual Search) and CDVA (Compact Descriptors for Visual Analysis) groups in MPEG. Seven contributions to the MPEG CDVS standardization work with significant impact were submitted, including (1) a bit-selection mechanism adopted for the CDVS global descriptor, (2) re-factoring of the CDVS Test Model into a new structure with an abstract C++ API, and (3) a MATLAB interface to the CDVS Test Model.


2 Introduction

The key objective of WP5 is to develop advanced tools for automated Visual Search (VS) in image and video databases (broadcast, Internet), enabling fast creation of content-based links for the Authoring Tools (ATs) and user-originated search capabilities for the Player. In particular, this WP is developing and implementing efficient and robust VS tools suitable for ultra-large databases of videos and images, with a focus on the following component technologies:

• local descriptors, global (aggregated) descriptors, and matching algorithms and related tools for VS and object recognition in large image databases, characterised by a high true-positive detection rate at a very low false-alarm rate and supporting ultra-fast matching;

• fast algorithms for VS in video content, including video-to-video matching, utilising fast and robust 2D and 3D tracking based on local feature descriptors and segmentation;

• tools supporting suitable feature selection and matching for use in 3D model extraction in WP6;

• dedicated data structures and indexing schemes to be used in ATs, terminals and service databases.

WP5 is building on and improving the existing solutions, in particular the VS pipeline developed within MPEG's CDVS subgroup. The WP also pursues standardisation of the developed VS and recognition tools, to increase the value of the developed technologies and boost their exploitation potential. We first outline the current algorithmic flow of the Visual Search Engine under development and then present the key innovations and achievements in key elements of the processing pipeline.

2.1 Algorithmic workflow

The BRIDGET Visual Search Engine employs a classical processing flow; Figure 1 shows the current descriptor extraction pipeline. In order to achieve our objectives of significantly improved recognition and processing speed, WP5 is developing several components with beyond state-of-the-art (BSOTA) performance, namely the global aggregation mechanism, local descriptors and a new local descriptor compression scheme.

Figure 1: BRIDGET Visual Search Engine descriptor extraction pipeline (work ongoing)

The current BRIDGET pipeline first detects feature points using the ALP detector, then selects a subset of more robust features (Key-point selection block) and extracts the corresponding SIFT descriptors. The descriptors are compressed using the Transform and Scalar Quantisation Coding method (TSQC), which offers scalable, high-performance compression with a ternary representation. Global descriptors are computed using a novel RVD* aggregation method developed in the project, as described in Section 3. The ALP detector and TSQC compression were developed by two consortium members prior to the start of the BRIDGET project and now form part of the MPEG CDVS specification.

As per the project plan, in the reporting period the work concentrated on the following four areas:

1. Development of a novel approach to local descriptor aggregation in order to derive a global descriptor with beyond state-of-the-art recognition performance (UNIS);

2. Evaluation of local descriptors, descriptor compression schemes and local binary descriptors (UNIS + VA);

3. Development of a new tool for the geometric consistency check (TI), capable of processing multiple frames; and

4. Extraction pipeline speedup (TI).

The following sections describe progress in each of the above areas.

3 RVD – Robust Aggregation of Local Features

Major progress has been achieved in the development of a novel approach to local descriptor aggregation. A novel descriptor, called Robust Visual Descriptor (RVD), has been proposed which combines rank-based multi-assignment with a robust residual aggregation framework, followed by sign-based binarisation. Furthermore, to enable descriptor size scalability, new cluster-level and bit-level element selection mechanisms were developed. The RVD pipeline is shown in Figure 2.

Figure 2: RVD processing pipeline

3.1 New aggregation mechanism

We have developed a novel approach to aggregation [1] [2] based on concepts from robust statistics. The central idea is that the aggregated representations should match even if only a small proportion of the component local descriptors, corresponding to the objects or regions present in both images, are matching. In other words, the aggregated representation should be robust to outliers (i.e. local vectors that do not match). Since it is not possible to determine a priori (i.e. at the extraction stage and before a search query is posed) which are the matching descriptors, it follows that the aggregation scheme should be designed so that all residual vectors have the same influence on the aggregated representation. This has not been the case in prior-art methods, such as VLAD and Fisher kernels. In the conventional VLAD representation, local vectors located further from the class centres have a larger influence, which is something we prefer to avoid. In RVD, the robustness objective is achieved by applying an L1 normalisation to the residual vectors before aggregation, which limits the influence of outliers on the representation.
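The contrast with VLAD can be made concrete with a short sketch. The following Python fragment is illustrative only; all function and variable names are hypothetical, and it is not the project code.

```python
# Minimal sketch contrasting plain VLAD-style residual summation with the
# L1-normalised residual aggregation described above (hypothetical code).
import numpy as np

def aggregate_vlad(descriptors, centre):
    """Sum raw residuals: vectors far from the centre dominate the result."""
    residuals = descriptors - centre
    return residuals.sum(axis=0)

def aggregate_robust(descriptors, centre, eps=1e-12):
    """Sum L1-normalised residuals: every local vector has equal influence,
    limiting the effect of outliers (non-matching local descriptors)."""
    residuals = descriptors - centre
    norms = np.abs(residuals).sum(axis=1, keepdims=True) + eps
    return (residuals / norms).sum(axis=0)
```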


Figure 3: RVD aggregation approach: (a) rank-based cluster assignment and L1 normalisation, (b) Rank-1 aggregation (NR1), (c) Rank-2 aggregation, and (d) combination of ranks.

In the example in Figure 3, local vectors x1, x4, x6 belong to the rank-1 neighbourhood NR1 of C1, and their corresponding residual vectors are L1-normalised before aggregation into rv1, as shown in subfigure (b). Similarly, local vectors x2, x3, x5 belonging to NR2 of C1 are separately aggregated into rv2, as shown in (c). Subsequently, rv1 and rv2 are combined with weights 2 and 1, as shown in (d).

3.2 Rank-based assignment

Prior-art methods employ either hard assignment or soft assignment based on the distance to the nearest neighbour. We propose a rank-based assignment for RVD, where the assignment weight is constant within each rank. The assignment process works as follows. First, K-means clustering is performed to learn a codebook K = k1, …, kn of n cluster centres, typically between 64 and 512, with 256 used in our current design. The codebook size is selected to give a good trade-off between performance, extraction speed, complexity and memory use. For each local descriptor, the distances to all centres are computed and ranked in increasing order. Each descriptor is then rank-assigned to the KN closest visual words (typically KN = 3) and the rank information is subsequently used for aggregation. By increasing the number of vectors assigned to each cluster and using the rank information for aggregation, our scheme delivers improved performance. Note that our assignment weight depends only on the rank information (NR) and is independent of the actual distance from the respective cluster centres, in contrast to soft assignment. The rank-based approach also differs from multiple assignment with equal weights, because rank information guides the aggregation process. Within our robust framework, rank-based multiple assignment brings significant benefits, especially in the binary domain.
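As a rough illustration of the assignment step, the sketch below (hypothetical code, assuming local descriptors and the codebook are held as NumPy arrays) returns, for each local descriptor, the indices of its KN nearest centres in rank order:

```python
# Hypothetical sketch of rank-based multi-assignment: each descriptor is
# assigned to its KN nearest codebook centres together with the rank
# (column 0 = rank 1), which later guides aggregation.
import numpy as np

def rank_assign(descriptors, codebook, kn=3):
    """descriptors: (M, D) local descriptors; codebook: (n, D) centres
    (e.g. n = 256). Returns an (M, KN) array of centre indices in rank
    order; weights in aggregation depend only on these ranks."""
    # Pairwise squared Euclidean distances, shape (M, n)
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :kn]
```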

3.3 Scalability via cluster selection and bit selection mechanisms

The cluster occupancy and rank are used to estimate the reliability of each cluster-level representation, and this estimate is used to select a subset of clusters with high reliability to form the image-level RVD descriptor. It is also used for rate control of the produced RVD representation. A particular cluster is rejected if the number of local descriptors assigned to that cluster is less than the cluster selection threshold Cth. The threshold value Cth is selected based on the typical (median) cluster occupancy to achieve the required size of the RVD descriptor for each bitrate. The RVD vector is binarised by applying the sign function, which assigns the value 1 to non-negative values and the value 0 to negative values. To further compress RVD, a subset of bits from the aforementioned binary representation is selected from each cluster, based on a separability criterion: we select those bits which provide the best separability between Hamming distances for matching and non-matching pairs (trained off-line).
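These scalability mechanisms can be summarised in a few lines. The sketch below is a simplified, hypothetical rendering: for brevity it assumes the selected bit positions are shared across clusters, whereas the actual off-line training may select a different subset per cluster.

```python
# Sketch of cluster selection, sign binarisation and bit selection
# (hypothetical code; simplified to shared bit positions across clusters).
import numpy as np

def binarise_rvd(rvd, occupancy, c_th, selected_bits):
    """rvd: (n, D) per-cluster aggregated vectors; occupancy: (n,) counts
    of descriptors assigned per cluster; c_th: cluster selection threshold;
    selected_bits: bit positions trained off-line for best Hamming-distance
    separability between matching and non-matching pairs."""
    keep = occupancy >= c_th                  # reject low-occupancy clusters
    bits = (rvd[keep] >= 0).astype(np.uint8)  # sign binarisation: 1 if >= 0
    return bits[:, selected_bits]             # keep only the selected bits
```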

3.4 Experimental evaluation

An extensive performance evaluation has been performed, including comparisons to the state-of-the-art pipeline developed by the MPEG group standardising Compact Descriptors for Visual Search (CDVS) and to the latest published work, using the reference databases Holidays, Oxford and UKB. The evaluation has clearly demonstrated that RVD descriptors significantly and consistently outperform all known techniques at comparable bitrates, including BoW, VLAD, Fisher Vector and the most recent Democratic Aggregation. The results were presented at ICIP 2014 [2] and also presented to the MPEG CDVS group [1]. Additional changes were recently introduced, leading to further improvements in performance (indicated as RVD* in the table); this work is still ongoing.

Table 1 compares binary RVD representations with leading methods on the Oxford, Holidays and UKB datasets. Compared to the state-of-the-art binary Fisher Vector (bottom part of the table), RVD shows consistent and significant improvement on all datasets; in fact, RVD even at 0.5kB outperforms FV at the 1kB operating point. To get a full picture of relative performance, we also show results for non-binary representations of the same vector dimensionality. These are at least an order of magnitude larger (a floating-point representation uses 32 bits vs one bit per dimension for a binary representation) and significantly slower in matching (Hamming pop-count vs L2 norm). We note that even in this case RVD outperforms all reference methods, except on the Holidays dataset, where the non-binary VLAD* representation achieves 62.2% vs 61% for binary RVD (non-binary RVD achieves 73.2% in a comparable pipeline). We treat the non-binary results as reference only, as they do not conform to the low-memory and speed requirements of our project.


Method           Dimension  Size   Oxford 5k  Holidays  UKB
BoW [3]          20k        10kB   35.4       43.7      2.87
BoW [3]          200k       12kB   36.4       54.0      2.81
VLAD [3]         8k         32kB   37.8       55.6      3.28
FV [3]           8k         32kB   41.8       60.5      3.35
VLAD Intra [4]   8k         32kB   44.8       56.5      -
VLAD* [4]        8k         32kB   50.0       62.2      -
RVD*             8k         32kB   59.5       73.2      3.59
BoW Binary [5]   20k        8.3kB  -          45.8      3.02
FV Binary [6]    8k         1kB    -          58.0      3.25
FV Binary [6]    4k         0.5kB  -          57.4      3.21
RVD Binary       8k         1kB    50.3       61.0      3.34
RVD Binary       4k         0.5kB  44.3       58.9      3.28
RVD* Binary      8k         1kB    59.0       62.1      -
RVD* Binary      4k         0.5kB  57.0       60.7      -

Table 1: Comparison of RVD with the state of the art

Finally, in order to demonstrate the performance of the RVD* global descriptor in a retrieval system, we show raw retrieval results for the case when only the global descriptor is used for search (i.e. where the short-list is not re-ranked based on local descriptor matching and there is no geometric verification). The database used here is Oxford + 1 million distractors. We observed that, out of a total of 55 queries, RVD obtains better recall on 20 queries and FV has better recall on 7 queries; the ratio of RVD outperforming FV is approximately 3:1. For an intuitive understanding, Figure 4 shows three queries where the difference in recall between RVD* and FV is most significant and one query where the difference in recall between FV and RVD* is the biggest (preserving the 3:1 ratio for fairness). We show the query and the top 4 ranked results obtained by the RVD and FV methods for each query; correctly retrieved images are marked in red.


Figure 4: Example retrieval results for the RVD and FV descriptors on the Oxford1M dataset. For each query image (left) the corresponding ranked lists are shown for RVD (top centre-right) and FV (bottom centre-right); images correctly retrieved are marked with a red border.


3.5 Conclusions

In summary, we have developed a novel descriptor aggregation scheme with significantly improved performance. Based on this approach, we have devised a compact, binary and scalable image representation that is easy to compute, fast to match, and delivers beyond state-of-the-art performance in visual recognition of objects, buildings and scenes. These new methods form the core of the BRIDGET Visual Search Engine. The bit selection mechanism from our proposal was in fact adopted in CDVS to improve the performance of the CDVS standard. The RVD representation is now being further extended to video shots and will be submitted to the forthcoming CDVA standardisation.

4 Local descriptors, descriptor compression schemes and local binary descriptors

4.1 Prior art evaluation

Within T5.1 of WP5, we defined, implemented and conducted a very detailed benchmarking evaluation of state-of-the-art local descriptors, descriptor compression schemes and local binary descriptors. The descriptors implemented and evaluated include: SIFT [7], ALOHA [17], BRIEF [19], BinBoost [12], BRIGHT [14], BRISK [13], D-BRIEF [19], FREAK [18], LDB [15], ORB [20], and HOG [10]. Our study includes patch-level ROC evaluation (including robustness to image noise, compression and other image deterioration), image-level evaluation and database-level performance evaluation. The analysis has shown that Transform and Scalar Quantisation Coding (TSQC) of the SIFT descriptor, designed by consortium members before the start of the BRIDGET project and used in the MPEG-7 CDVS standard [27], offers the best performance-size trade-off for SIFT descriptors. Furthermore, it also matches the performance of the best binary descriptors. Here we present a summary of the results. Figure 5 below shows the performance of state-of-the-art local binary descriptors on the MPEG-7 CDVS patches; it also shows the original SIFT descriptor (128 dimensions, float) for reference. The SIFT descriptor compressed using TSQC at a 213-bit size is marked as CDVS-TM [213] in this figure (this representation is used in CDVS-TM). It can be seen that four descriptors perform well: BRISK (512 bits), CDVS-TM (213 bits), BinBoost (256 bits) and BRIGHT-plus (162 bits). BRIGHT-plus is our modified version of BRIGHT with improved performance.


Figure 5: Performance of state-of-the-art local binary descriptors on MPEG CDVS patches.

We also extensively evaluated some new non-binary descriptor designs, such as Lattice-HOG [10], LIOP [9], MROGH and MRRID [8]; results are shown in Figure 6. It can be seen that LIOP offers performance improvements over SIFT at lower false-alarm rates (FAR < 0.001, TPR ~ 0.86). However, LIOP belongs to the class of segmentation-based descriptors and increases extraction complexity significantly.


Figure 6: Performance of selected new designs for local descriptors (non-binary) on MPEG CDVS patches.

From the results of our detailed evaluation we concluded that the SIFT descriptor combined with Transform and Scalar Quantisation Coding (TSQC), as in MPEG-7 CDVS, achieves close to optimal performance and scalability, and is also competitive with the latest designs of binary descriptors. The evaluation of newly proposed non-binary descriptors showed that there may be some scope for improvement over the SIFT design; however, there are associated complexity costs.

4.2 Joint Transform and Scalar Quantisation Coding (JTSQC)

Based on the conclusions of our evaluation, T5.1 of WP5 saw the development of a new local descriptor compression scheme which maintains the performance and scalability of the SIFT descriptor and TSQC compression combination, but achieves approximately 50% reduction in computational complexity compared to TSQC. The main idea behind the new compression scheme, Joint Transform and Scalar Quantisation Coding (JTSQC), is to generate a compressed descriptor through joint transformation of the histograms comprising a SIFT descriptor, followed by coarse scalar quantisation of the transform values. This is illustrated in Figure 7. Compared with TSQC, which transforms each histogram individually using multiple sets of transform functions of varying complexity, the joint histogram transformation process of JTSQC proved to be more robust, allowing for simple transform functions and a reduction in computational complexity while maintaining the performance and scalability of TSQC. Table 2 shows a comparison of TSQC and JTSQC in the MPEG-7 CDVS evaluation framework [28].


Figure 7: (a) SIFT descriptor layout and example histogram pairs for joint transform, (b) single histogram bin layout, (c) example descriptor elements.

         TPR % at 1% FPR    MAP %            PTM %
Size     TSQC    JTSQC      TSQC    JTSQC    TSQC    JTSQC
512B     88.0    87.6       77.2    76.8     84.3    84.2
1KB      91.0    90.7       82.5    81.9     89.1    88.7
2KB      93.7    93.7       85.1    85.0     91.6    91.4
1KB/4KB  91.5    91.4       NA      NA       NA      NA
2KB/4KB  94.2    94.1       NA      NA       NA      NA
4KB      95.3    95.3       87.9    88.0     93.4    93.5
8KB      96.2    96.1       88.1    88.2     93.5    93.6
16KB     96.6    96.6       88.1    88.2     93.5    93.6

Table 2: A comparison of TSQC and JTSQC in the MPEG-7 CDVS evaluation framework.
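The exact JTSQC transform functions are not disclosed in this report, but the general pattern described above can be sketched. In the fragment below, a simple sum/difference transform over hypothetical histogram pairings stands in for the real joint transform, and the ternary dead-zone quantisation merely mimics the coarse scalar quantisation; none of this is the actual JTSQC design.

```python
# Illustrative stand-in only: pairs of the 16 SIFT sub-histograms are
# transformed jointly and the transform values are coarsely quantised to a
# ternary alphabet with a dead zone. The pairing and transform here are
# hypothetical placeholders, not the JTSQC specification.
import numpy as np

def jtsqc_sketch(sift, pairs, threshold=0.1):
    """sift: (16, 8) array of SIFT orientation histograms;
    pairs: list of (i, j) histogram index pairs chosen for joint transform
    (layout-dependent; hypothetical here)."""
    elements = []
    for i, j in pairs:
        elements.append(sift[i] + sift[j])   # joint "sum" component
        elements.append(sift[i] - sift[j])   # joint "difference" component
    t = np.concatenate(elements)
    # Coarse scalar quantisation to {-1, 0, +1} with a dead zone
    q = np.zeros_like(t, dtype=np.int8)
    q[t > threshold] = 1
    q[t < -threshold] = -1
    return q
```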

4.3 Conclusions

A new local descriptor compression scheme, called Joint Transform and Scalar Quantisation Coding (JTSQC), has been developed which maintains the performance and scalability of the SIFT descriptor and TSQC compression combination, but achieves approximately 50% reduction in computational complexity compared to TSQC.

5 Development of a new tool for the geometric consistency check

A new geometric consistency check is under development: the multiframe DISTRAT. It combines multiple frames from each shot, both as queries and as reference shots (e.g. in a database). Its goal is to discover whether point-wise matches across all combinations of frames arise from a coherent motion, indicating a match, as opposed to random (false-alarm) matches. This analysis is done simultaneously for all combinations of frames, which boosts the discriminative power. This technique helps discover matches also when the searched object is small or poor in detail. To our knowledge, no other existing technique can check geometric consistency simultaneously over multiple frames.

5.1 Multiframe DISTRAT


Most modern methods for determining whether two images display the same object involve detection and matching of interest points, followed by a test to see whether some of the matches may correspond to a common image transformation. This latter operation is often called a geometric consistency test. We present a geometric consistency test for video shots, understood as

[...] one or more frames generated and recorded contiguously and representing a continuous action in time and space [21].

The method compares a query image to a shot, and the goal is to establish whether the same object can be seen in both. A shot that displays the same object as that in a query will be called a matching shot, and a shot that does not will be called non-matching. Traditionally, in a retrieval setting, a shot is represented by a single frame (key frame). The goal is usually to detect whether a certain object is present in the shot by comparing the key frame to a query that depicts the desired object. This may cause some difficulties, notably when the object depicted in the query image is visible in parts of the shot but not in the key frame. Another difficulty arises when the object is small, blurred, or poor in detail, such that a simple comparison between two images (query and key frame) does not reveal the presence of the object. We propose to overcome these problems by letting a shot be represented by several frames, obtained by sampling the shot over its length (uniformly in time or by some other rule). The proposed test thus accumulates evidence from interest point matching over several image pairs (query, a frame in the shot). The method is robust, in the sense that the presence of common objects is detected even when the ratio of correct to incorrect matches is low across all image pairs. In the first sub-section, we review the DISTRAT geometric consistency check. We then introduce the proposed method and provide arguments for why it should be successful. Finally, we report on an experiment comparing the method to some extensions of the state of the art for single image pair matching.

5.2 DISTRAT: Histograms and logarithmic distance ratios

A histogram provides a count of how many times, out of a total of $N$ trials, a measured value $z$ is contained within each of $K$ intervals (or bins) $\varsigma_1, \ldots, \varsigma_K$. For each bin, the histogram component $g(k)$ contains the frequency

$$g(k) = \text{number of trials such that } z \in \varsigma_k, \quad k = 1, \ldots, K.$$

The sum of the histogram components is equal to the number of trials,

$$g(1) + \cdots + g(K) = N.$$

Histograms are useful when there is uncertainty about which values will be measured: this is the case when the sequence of measurements cannot be predicted from first principles but seems to be guided by chance. It may often be possible to make hypotheses about how often the measurements will be within certain intervals, in the form of the probability for outcome $k$,

$$p(k) = P(z \in \varsigma_k).$$

Together, the probabilities $p(k)$ for all the bins $k = 1, \ldots, K$ are called a probability mass function (pmf). The number of trials and the probability mass function, $N, p(1), \ldots, p(K)$, form the parameters of a multinomial random variable $[G(1) \; \cdots \; G(K)]$. The joint probability

$$P\bigl(G(1) = g(1), \ldots, G(K) = g(K)\bigr),$$

i.e. the probability that a given combination of values $[g(1) \; \cdots \; g(K)]$ is found in the histogram, is a multinomial.

The matches of interest points in images can be analysed with histograms. Such matches form a set

$$\{(x_1, y_1), \ldots, (x_M, y_M)\},$$

where $x_m$ denotes the coordinates of a point in a first image that is matched to a point in a second image with the coordinates $y_m$. For any pair of matches $(x_i, y_i)$ and $(x_j, y_j)$, the logarithm of the distance ratio (LDR) is

$$z_{ij} = \log \frac{\lVert x_i - x_j \rVert}{\lVert y_i - y_j \rVert}.$$


For a set of $M$ matches, there are $N = M(M-1)/2$ distinct pairs of matches, yielding $N$ LDRs. The frequencies with which these LDRs occur in a set of intervals $\varsigma_1, \ldots, \varsigma_K$ can be represented by a histogram. The hypothesis that all matches are wrong (outliers) is represented by a certain probability mass function, and the histogram may be tested against this hypothesis. This method, called DISTRAT, is presented in [22].
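A direct rendering of this construction is given below (hypothetical helper, assuming NumPy and Euclidean distances between matched coordinates):

```python
# Sketch of the LDR histogram used by DISTRAT. Given M coordinate matches,
# all N = M(M-1)/2 pairwise logarithmic distance ratios are computed and
# binned (hypothetical code, not the DISTRAT implementation).
import numpy as np

def ldr_histogram(x, y, bins):
    """x, y: (M, 2) matched keypoint coordinates in the two images;
    bins: K+1 bin edges. Returns the K-bin LDR histogram."""
    m = len(x)
    ldrs = []
    for i in range(m):
        for j in range(i + 1, m):
            dx = np.linalg.norm(x[i] - x[j])
            dy = np.linalg.norm(y[i] - y[j])
            if dx > 0 and dy > 0:            # skip degenerate pairs
                ldrs.append(np.log(dx / dy))
    hist, _ = np.histogram(ldrs, bins=bins)
    return hist
```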

5.3 The method: DISTRAT for multiple image pairs

The method performs interest point matching independently for all pairs of the query and frames from the shot. These matches are processed according to the DISTRAT method, producing a sequence of histograms and a sequence of probability mass functions. This part will be called the preliminary operation. The method then proceeds to the decision operation, which merges both the histograms and the probability mass functions and then performs a hypothesis test to decide whether the shot matches the query.

5.3.1 When values from different sources are measured

Suppose that $M$ image pairs are available for comparison, and that the first step in the preliminary operation produces the individual histograms

$$[g_1(1) \; \cdots \; g_1(K)], \quad \ldots, \quad [g_M(1) \; \cdots \; g_M(K)].$$

Each of these represents the frequencies of LDR values for one image pair. The numbers of LDR values taken for the individual image pairs are $N_1, \ldots, N_M$, and the total number of LDR values is their sum,

$$N = N_1 + \cdots + N_M.$$

According to the DISTRAT method, the hypothetical probability mass function (representing the case where all matches are outliers) is specific to an image pair, since it depends on the actual configuration of interest points in the two images. It follows that $M$ different pmfs are needed to model the outlier behaviour in the $M$ image pairs,

$$p_1 = [p_1(1) \; \cdots \; p_1(K)], \quad \ldots, \quad p_M = [p_M(1) \; \cdots \; p_M(K)].$$

We shall call these the individual outlier pmfs. The histogram that represents frequencies over the whole set of image pairs is obtained by summing the individual histograms,

$$g = g_1 + \cdots + g_M.$$

This histogram will be called the accumulated histogram. To arrive at an outlier pmf for the total set of image pairs, we must take into account the changing pmfs. The appropriate model in this case is a mixture, such that the pmf is a linear combination of the individual outlier pmfs,

$$p(k) = \sum_{m=1}^{M} P(m) \cdot p(k \mid m),$$

where $P(m)$ is the probability (i.e. relative frequency) that an LDR value is taken from image pair $m$, and $p(k \mid m)$ is the probability that the LDR value is within bin $\varsigma_k$ given that the LDR is taken from image pair $m$. Since the relative frequency of values in image pair $m$ is the fraction of its number of LDR values to the total,

$$P(m) = \frac{N_m}{N}, \quad m = 1, \ldots, M.$$

The conditional probability mass functions are simply the individual pmfs, $p(k \mid m) = p_m(k)$, $m = 1, \ldots, M$.

Therefore, the outlier pmf for the total set of image pairs is

$$p(k) = \frac{1}{N} \sum_{m=1}^{M} N_m \cdot p_m(k).$$

We may now introduce the method.

5.3.2 How the method works

The method consists of a preliminary operation (known from previous techniques) and a hypothesis test that uses accumulated information produced in the preliminary operation. It can be summarized as follows.

Procedure 1 [multiframe DISTRAT for video shots]

Let a query and a video shot be given. Select a subset of frames from the shot, creating pairs each consisting of the query and a selected frame.

PRELIMINARY OPERATION

The following steps coincide with the initial processing in DISTRAT, see [22]. The output is a set of histograms and probability mass functions.

1. For each image (query and frames), detect local features.
2. For each image pair, match local features to produce lists of coordinate matches.
3. For each list of coordinate matches, compute a histogram of logarithmic distance ratios.
4. For each list of coordinate matches, compute a probability mass function for outliers (incorrect matches).

DECISION OPERATION

The following steps prepare and carry out a hypothesis test for establishing whether the shot depicts an object depicted in the query.

1. Sum the previously computed LDR histograms.
2. Compute a mixture of the previously computed outlier pmfs.
3. Compute Pearson's test statistic [24].
4. If Pearson's test statistic exceeds a threshold, declare that the shot matches the query. Otherwise, declare that the shot does not.
5. An algorithmic step that may not yet be disclosed, as a patent application is under preparation.
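Steps 1-4 of the decision operation map directly onto the quantities defined in Section 5.3.1; the undisclosed step 5 is omitted. A minimal sketch follows (hypothetical names, assuming all bins of the mixed outlier pmf are non-zero):

```python
# Sketch of the decision operation: accumulate the per-pair LDR histograms,
# mix the per-pair outlier pmfs weighted by their sample counts, and apply
# Pearson's chi-square test (hypothetical code; step 5 is not shown).
import numpy as np

def multiframe_distrat_decision(histograms, outlier_pmfs, threshold):
    """histograms: list of M per-pair LDR histograms (length-K arrays);
    outlier_pmfs: list of M per-pair outlier pmfs (length-K arrays, all
    bins assumed non-zero); threshold: chosen for the target FPR."""
    g = np.sum(histograms, axis=0)                    # accumulated histogram
    counts = np.array([h.sum() for h in histograms])  # N_1, ..., N_M
    n_total = counts.sum()
    # Mixture pmf: p(k) = (1/N) * sum_m N_m * p_m(k)
    p = np.einsum('m,mk->k', counts, np.asarray(outlier_pmfs)) / n_total
    expected = n_total * p
    c = np.sum((g - expected) ** 2 / expected)        # Pearson's statistic
    return c > threshold                              # True => shot matches
```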

5.4 Why it works

There are two mechanisms at work ensuring that the proposed method works well. The first is that it effectively provides a large number of observations, which increases the reliability of the hypothesis test. The second is that the LDR statistics show small variation across all the considered image pairs, such that similar histograms are accumulated. We discuss these two mechanisms in the following paragraphs.


5.4.1 Robustness increases with shot length

We examine the expected value of Pearson's test statistic for two cases: in the first case, the histogram is exclusively due to outliers; in the second case, there are some inliers. A histogram due to outliers is modelled by a multinomial random variable $H$ with parameters $N, p(1), \ldots, p(K)$. The expected value of the test statistic $C$ in this case is $E(C) = K - 1$, which depends on the number of bins and not on the number of samples used for making the histogram. In the second case, with some inliers present, the histogram is modelled by a multinomial random variable $G$ with parameters $N, q(1), \ldots, q(K)$, such that the probabilities $q(k)$ are different from the probabilities $p(1), \ldots, p(K)$ that characterise the LDR for outliers. If the two mass functions $p$ and $q$ are held fixed, then the expected value of the test statistic $C$ grows as a function of $N$,

$$E(C) = f(N) = a \cdot N + b,$$

where the two constants

$$a = \sum_{k=1}^{K} \frac{(q(k) - p(k))^2}{p(k)}, \qquad b = \sum_{k=1}^{K} \frac{q(k)(1 - q(k))}{p(k)}$$

are positive. A proof is straightforward. Thus, by just letting $N$ become large enough, the expected value of $C$ in the case of inliers may exceed any threshold. Since the constants are functions of $p$ and $q$, the rate at which this happens will vary: if $q$ is close to the outlier pmf $p$ (meaning that there are few inliers), then the number $N$ of matches will have to be large -- but eventually, $E(C \mid q)$ will become larger than any fixed threshold. Therefore, if the pmfs $p$ and $q$ remain unchanged, increasing the number $N$ of samples helps separate the values of the test statistic for matching images (some inliers) from the values for non-matching images (no inliers).
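The proof is indeed short; for completeness, here is a sketch using the mean and variance of the multinomial components $G(k)$:

```latex
% Expectation of Pearson's statistic C = \sum_k (G(k) - Np(k))^2 / (Np(k))
% when G is multinomial(N, q): E[G(k)] = Nq(k), Var[G(k)] = Nq(k)(1-q(k)).
\[
\mathbb{E}\!\left[(G(k) - Np(k))^2\right]
  = \operatorname{Var}[G(k)] + \bigl(\mathbb{E}[G(k)] - Np(k)\bigr)^2
  = Nq(k)\bigl(1 - q(k)\bigr) + N^2\bigl(q(k) - p(k)\bigr)^2 ,
\]
\[
\mathbb{E}[C]
  = \sum_{k=1}^{K} \frac{Nq(k)(1-q(k)) + N^2 (q(k)-p(k))^2}{N\,p(k)}
  = N \underbrace{\sum_{k=1}^{K} \frac{(q(k)-p(k))^2}{p(k)}}_{a}
  + \underbrace{\sum_{k=1}^{K} \frac{q(k)\,(1-q(k))}{p(k)}}_{b} .
\]
```

Setting $q = p$ recovers the outlier-only case: $a = 0$ and $b = \sum_k (1 - p(k)) = K - 1$.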

5.4.2 The peak in the pmf often moves little

Inliers give rise to local maxima in the probability mass functions (and therefore peaks in the LDR histograms). For all pairs (query, frame) taken from one shot and containing the same object, these local maxima will be found over roughly the same bins. As a consequence, the accumulated probability mass function $q$ will also display such a local maximum. This is rooted in the nature of video shots, as stated by Cotsaces et al. [23]:

[...] video content within a shot tends to be continuous, due to the continuity of both the physical scene and the parameters (motion, zoom, focus) of the camera that images it.

This continuity is manifested as a small and gradual change in the individual LDR histograms across the sequence. Since the logarithm compresses the range of the LDR, this will most often lead to narrow peaks over the same bins for all histograms over the sequence.

5.5 Additional processing step

A processing step has been developed in order to provide an advantage over alternative methods (see Section 5.6.1). This new step ensures that multiframe DISTRAT surpasses the alternatives at a low number of frames taken from each shot. Details will be provided at a later stage, when a patent application has been filed.

5.6 Experiment: test on video sequences

The method has been tested on a large set of video shots, extracted from the 180 video sequences contained in the CTurin180 dataset (used in the development of the CDVS standard) by dividing each sequence into 10 shots. For the test, 16200 matching pairs (query image, shot) and 72000 non-matching pairs have been identified. The method, as well as two alternative methods for comparison, has been applied to this material. All frames are represented by CDVS descriptors of length 512 bytes, that is, the shortest of the alternatives defined by the standard.

5.6.1 The alternative methods: late fusion strategies

The test framework for CDVS represents the state of the art for matching pairs of single images. The score used for declaring a match in CDVS is a sum of distinctness scores for those interest point matches that are estimated to be correct. This method therefore involves more algorithmic steps than the basic DISTRAT method. We consider extensions of this technique to multiple image pairs using late fusion methods, commonly used for combining scores in information retrieval [25] [26]. The methods are combMAX and combSUM, applied to the CDVS scores for all pairs (query, frame). The first method produces a score as the maximum over the individual scores in a set of (query, frame) pairs; the second produces a score as their sum.
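For reference, the two fusion rules are trivial to state in code (hypothetical function names; `scores` holds the per-pair CDVS scores of one shot):

```python
# The two late-fusion baselines: combMAX takes the maximum CDVS score over
# the (query, frame) pairs of a shot, combSUM takes their sum.
def comb_max(scores):
    return max(scores)

def comb_sum(scores):
    return sum(scores)
```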

5.6.2 The outcome

The experiment considers cases in which the shots are represented by 1, 2, …, 15 frames. Results are presented in terms of the True Positive Rate at thresholds such that the False Positive Rate is 0.01. (The True Positive Rate is the fraction of positive outcomes across the matching pairs. The False Positive Rate is the fraction of positive outcomes – wrongly declared matches – across the non-matching pairs.) Figure 8 summarizes the results.


Figure 8: True positive rates at a false positive rate of 0.01.

It is seen that the proposed method significantly outperforms late fusion of the CDVS scores when shots are represented by more than 4 frames. The advantage in TPR is approximately 2% when shots are represented by more than 5 frames. It may be noted that when only 1 frame is used from a shot, CDVS has an advantage of 5%. In this case, the proposed method boils down to the first of several steps in the CDVS process, and the 5% advantage is exactly the gain provided by the subsequent steps in CDVS.

5.7 Conclusion

We propose a method to identify a match between a query image and a video shot, meaning that the two display the same object. The method consists of comparing a histogram of logarithmic distance ratios to a certain probability mass function. The proposed method outperforms late fusion of CDVS scores over a large dataset. As soon as the patent application under development has been filed, an additional processing step will be disclosed.

6 Fast computation of Scale Space

Reliable keypoint detectors, such as those used in MPEG CDVS (ALP) or in SIFT, are based on the scale space representation of the image. This representation is computed by smoothing the image with a set of Gaussian filters, a computationally intensive operation that heavily influences the total time required to extract local features from the image.


Figure 9: Scale space representation

For example, in MPEG CDVS the input image is smoothed with four Gaussian kernels with ascending sigma values 1.5199, 1.6, 2.2627 and 3.2. The number of taps used to represent each kernel is nTaps = 2 * ceil(4*sigma) + 1. The resulting kernels thus use 15, 15, 21 and 27 taps, and are shown in the following figure.

Figure 10: Gaussian kernels used by the ALP keypoint detector adopted in MPEG CDVS
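The quoted tap counts can be checked with a few lines (illustrative code, not the CDVS implementation):

```python
# Builds the four CDVS smoothing kernels and verifies the tap counts:
# nTaps = 2*ceil(4*sigma) + 1 gives 15, 15, 21 and 27 taps for the
# sigmas 1.5199, 1.6, 2.2627 and 3.2 (helper names are hypothetical).
import math
import numpy as np

def gaussian_kernel(sigma):
    n_taps = 2 * math.ceil(4 * sigma) + 1
    x = np.arange(n_taps) - n_taps // 2
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()                          # normalise to unit sum

for sigma in (1.5199, 1.6, 2.2627, 3.2):
    print(sigma, len(gaussian_kernel(sigma)))   # -> 15, 15, 21, 27 taps
```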

The filtering is performed first by rows and then by columns, exploiting the separability of the filters. This means that for each pixel of the input image, 156 multiplications and 148 sums are computed, a large number of operations that affects the speed of the process. The scale space computation is based on a pyramidal approach, and for all the following octaves only three Gaussian filters are applied, because the first scale of each octave is obtained by decimating one scale of the previous octave. This means that the number of operations per pixel for the octaves following the first is reduced to 126 multiplications and 120 sums; even so, most of the time of the scale space computation is spent filtering the first octave, due to its larger resolution.

To speed up Gaussian filtering, many approaches have been developed. One of the fastest and most used is based on box filtering: by iterating convolutions of box kernels it is possible to approximate a Gaussian filter. The box convolution is easily computed by means of integral images; with this technique each convolution requires, for each pixel, three sums (two for computing the integral image) and two subtractions. To approximate a Gaussian kernel the number of box convolutions typically varies from 4 to 6, for a total number of sums and subtractions from 20 to 30. Compared to applying the actual Gaussian kernels this represents a huge saving in processing time, but the approach has a limit when used for computing the scale space. The problem is related to the small values of sigma adopted for calculating the first scales of the octave: even the convolution of a small 3x3 box approximates Gaussians with sigmas greater than those adopted in a keypoint detector, as one can see in Table 3.


Table 3: Equivalent sigmas obtained convolving a 3x3 box
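To see where the low per-pixel operation counts come from, here is a minimal one-dimensional version of box filtering via a running (integral) sum; the 2D case applies the same idea with an integral image (illustrative code, hypothetical names):

```python
# One 1-D box convolution via a cumulative sum: per output sample, the
# window total costs one subtraction regardless of the window width, and
# building the running sum costs one addition per sample.
import numpy as np

def box_filter_1d(signal, width):
    """Moving average of odd `width`; edges are cropped for brevity."""
    integral = np.concatenate(([0.0], np.cumsum(signal)))
    return (integral[width:] - integral[:-width]) / width

# Iterating such box filters approximates a Gaussian, but for the small
# sigmas of the first scales even a 3-tap box is already too wide.
```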

To make matters worse, the scale space is based on the second derivative of the Gaussian filter, and this requires one or two additional convolutions to obtain a good approximation of the derivative. In the following figure, the original Gaussian kernel of the first scale, its second derivatives and the corresponding approximations are shown.

Figure 11: The Gaussian kernel of the first scale, its second derivatives and the approximation by means of box filtering

As one can appreciate, the approximation error is high and increases for the second derivative. To overcome this problem, a new algorithm called Fast Gaussian has been developed in BRIDGET, with the objective of maintaining a speed close to box filtering while obtaining a very good approximation. The new technique is based only on integer operations and requires 70 sums for computing all the scales of the first octave. The details of this innovative technique will be disclosed as soon as the patent filing is completed. To give an idea of the results: for computing the first scale of the first octave only 16 sums are performed, with an approximation that is much closer to the original Gaussian kernel than what one can obtain with box filtering, as shown in the following figure.


Figure 12: The Gaussian kernel of the first scale, its second derivatives and the approximation by means of BRIDGET's Fast Gaussian approach

Using BRIDGET’s Fast Gaussian approach, the original Gaussian function and its second derivative are far better approximated than when using box filtering.

7 Conclusions

In the first 18 months of the project, WP5 has made significant progress in the development of component technologies with BSOTA performance and has integrated them into the first BRIDGET VS engine. The most significant results include:

(1) Developed a novel descriptor aggregation scheme – Robust Visual Descriptor (RVD*) – which advances significantly beyond SOTA in terms of recognition performance and speed.

(2) Achieved scalability via a bit selection mechanism, adopted in the MPEG CDVS standard.

(3) Published research papers at ICIP [2].

(4) Conducted a comprehensive benchmarking and complexity evaluation of state-of-the-art local descriptors, descriptor compression schemes and local binary descriptors.

(5) Developed a new local descriptor compression scheme, JTSQC, which maintains the performance and scalability of the SIFT descriptor and TSQC compression combination, but achieves approximately 50% reduction in computational complexity compared to TSQC.

(6) Developed a new geometric consistency check, the multiframe DISTRAT, with unique features and promising performance.

(7) Developed accelerated component tools, including scale-space computation.

(8) Made seven contributions to the MPEG CDVS standardization work with significant impact, including: (1) a bit-selection mechanism adopted for the CDVS global descriptor, (2) re-factoring of the CDVS Test Model into a new structure with an abstract C++ API, and (3) a MATLAB interface to the CDVS Test Model.

(9) Developed a complete Visual Search pipeline (mark 1), performed testing and integration into the first release of the BRIDGET AT.


8 References

[1] M. Bober, S. Husain, S. Paschalakis, and K. Wnukowicz, "Improvements to TM6.0 with a robust visual descriptor proposal from University of Surrey and Visual Atoms," MPEG standardisation contribution, ISO/IEC JTC1/SC29/WG11 M30311, Vienna, Austria, 2013.

[2] S. Husain and M. Bober, "Robust and scalable aggregation of local features for ultra large scale retrieval," in IEEE International Conference on Image Processing (ICIP), Paris, France, Oct. 2014.

[3] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid, "Aggregating local image descriptors into compact codes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 9, pp. 1704–1716, Sept. 2012.

[4] J. Delhumeau, P. Gosselin, H. Jegou, and P. Perez, "Revisiting the VLAD image representation," in ACM Multimedia, Barcelona, Spain, Oct. 2013.

[5] H. Jegou, M. Douze, and C. Schmid, "Packing bag-of-features," in IEEE International Conference on Computer Vision (ICCV), Sept. 2009, pp. 2357–2364.

[6] F. Perronnin, Y. Liu, J. Sanchez, and H. Poirier, "Large-scale image retrieval with compressed Fisher vectors," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3384–3391.

[7] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[8] B. Fan, F. Wu, and Z. Hu, "Rotationally invariant descriptors using intensity order pooling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 10, pp. 2031–2045, Oct. 2012.

[9] Z. Wang, B. Fan, and F. Wu, "Local intensity order pattern for feature description," in IEEE International Conference on Computer Vision (ICCV), 2011, pp. 603–610.

[10] V. Chandrasekhar, G. Takacs, D. Chen, S. Tsai, M. Makar, and B. Girod, "Feature matching performance of compact descriptors for visual search," in IEEE Data Compression Conference (DCC), 2014.

[11] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, June 2008.

[12] T. Trzcinski, M. Christoudias, V. Lepetit, and P. Fua, "Boosting binary keypoint descriptors," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[13] S. Leutenegger, M. Chli, and R. Siegwart, "BRISK: Binary robust invariant scalable keypoints," in IEEE International Conference on Computer Vision (ICCV), Nov. 2011, pp. 2548–2555.

[14] K. Iwamoto, R. Mase, and T. Nomura, "BRIGHT: A scalable and compact binary descriptor for low-latency and high accuracy object identification," in IEEE International Conference on Image Processing (ICIP), Sept. 2013, pp. 2915–2919.

[15] X. Yang and K. Cheng, "LDB: An ultra-fast feature for scalable augmented reality on mobile devices," in IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Nov. 2012, pp. 49–57.

[16] CDVS Standard Specification.

[17] S. Saha and V. Demoulin, "ALOHA: An efficient binary descriptor based on Haar features," in IEEE International Conference on Image Processing (ICIP), Sept. 2012, pp. 2345–2348.

[18] R. Ortiz, "FREAK: Fast retina keypoint," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Washington, DC, USA, 2012, pp. 510–517.

[19] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: Binary robust independent elementary features," in European Conference on Computer Vision (ECCV), 2010, pp. 778–792.

[20] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in IEEE International Conference on Computer Vision (ICCV), Washington, DC, USA, 2011, pp. 2564–2571.

[21] G. Davenport, T. A. Smith, and N. Pincever, "Cinematic primitives for multimedia," IEEE Computer Graphics and Applications, vol. 11, no. 4, pp. 67–74, 1991.

[22] S. Lepsoy, G. Francini, G. Cordara, and P. P. de Gusmao, "Statistical modelling of outliers for fast visual search," in IEEE International Conference on Multimedia and Expo (ICME), 2011, pp. 1–6.

[23] C. Cotsaces, N. Nikolaidis, and I. Pitas, "Video shot detection and condensed representation: a review," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 28–37, 2006.

[24] R. J. Larsen and M. L. Marx, An Introduction to Mathematical Statistics and its Applications. Prentice-Hall, 1986.

[25] E. A. Fox and J. A. Shaw, "Combination of multiple searches," NIST Special Publication SP, 1994, pp. 243–243.

[26] X. Zhou, A. Depeursinge, and H. Muller, "Information fusion for combining visual and textual image retrieval," in International Conference on Pattern Recognition (ICPR), 2010, pp. 1590–1593.

[27] S. Paschalakis, M. Bober, G. Cordara, and G. Francini (eds.), "Text of ISO/IEC FDIS 15938-13 Compact descriptors for visual search," MPEG output doc. N14956, 110th MPEG meeting, Strasbourg, FR, Oct. 2014.

[28] "Evaluation Framework for Compact Descriptors for Visual Search," MPEG output doc. N12202, 97th MPEG meeting, Torino, IT, July 2011.
