visual object analysis using regions and local features

Visual Object Analysis using Regions and Local Features

Carles Ventura Royo

Co-advisors: Xavier Giró i Nieto, Verónica Vilaplana Besler

Tutor: Ferran Marqués Acosta

Outline
• Introduction
• Part I: Context analysis in semantic segmentation
• Part II: Multiresolution co-clustering for uncalibrated multiview segmentation
• Conclusions

Outline
• Introduction
• Part I: Context analysis in semantic segmentation
  • Introduction
  • Related Work
  • Contributions
  • Experiments
  • Conclusions
• Part II: Multiresolution co-clustering for uncalibrated multiview segmentation
  • Introduction
  • Related Work
  • Contributions
  • Experiments
  • Conclusions
• Conclusions

Introduction: Semantic segmentation

Instance segmentation

Class segmentation

boat

Introduction: Semantic segmentation

Part I: Single view Part II: Multiview

STATE OF THE ART

OUR RESULTS


Introduction: Visual Object Analysis

Objects vs Scene

Introduction: Regions

[Figure: image partition with numbered regions and its BINARY PARTITION TREE]

[Figure: image partition with numbered regions and its REGION ADJACENCY GRAPH]
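As a rough illustration of the region adjacency graph named above, the sketch below builds one from a small label map. The function name, the toy partition and the choice of 4-connectivity are assumptions made for the example, not details taken from the slides.

```python
def region_adjacency_graph(labels):
    """Build a region adjacency graph from a 2-D label map (list of rows).

    Returns a dict mapping each region label to the set of labels of the
    regions that touch it. 4-connectivity is an assumption of this sketch.
    """
    rag = {}
    rows, cols = len(labels), len(labels[0])
    for r in range(rows):
        for c in range(cols):
            a = labels[r][c]
            rag.setdefault(a, set())
            # Looking right and down is enough to visit every neighbouring pair once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    b = labels[rr][cc]
                    if a != b:
                        rag[a].add(b)
                        rag.setdefault(b, set()).add(a)
    return rag


# Hypothetical 3x3 partition with four regions.
toy_partition = [
    [1, 1, 2],
    [3, 3, 2],
    [3, 4, 4],
]
rag = region_adjacency_graph(toy_partition)
```

In this toy partition regions 1 and 4 never touch, so no edge links them in the graph.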

Introduction: Local Features

Local Features Global Features

Introduction: Local Features Aggregation
• Bag of Features (BoF) [1]

[Figure: local features, vector quantization, codebook, Bag of Features]

[1] G Csurka et al, Visual Categorization with Bags of Keypoints. ECCV’04
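The Bag of Features idea can be sketched in a few lines: each local feature is quantized to its nearest codeword and the codeword counts form the descriptor. The codebook below is hand-written for illustration (in practice it would be learned, for example with k-means), and normalizing the histogram is an assumption of this sketch.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def bag_of_features(local_features, codebook):
    """Quantize each local feature to its nearest codeword and return
    the normalized histogram of codeword counts."""
    counts = [0] * len(codebook)
    for x in local_features:
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(x, codebook[k]))
        counts[nearest] += 1
    total = len(local_features)
    return [c / total for c in counts]


# Hypothetical 2-D codebook and local features.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
features = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.8), (0.2, 0.9)]
histogram = bag_of_features(features, codebook)
```

The descriptor length equals the codebook size, regardless of how many local features the image or region contains.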

Introduction: Local Features Aggregation
• Pooling

First Order Average Pooling (O1P) [1]: $\frac{1}{N}\sum_{i=1}^{N} x_i$

Second Order Average Pooling (O2P) [2]: $\frac{1}{N}\sum_{i=1}^{N} x_i x_i^T$

$x_i$: local features

No need of codebook
High dimensionality

[1] Y Boureau et al, A Theoretical Analysis of Feature Pooling in Visual Recognition. ICML’10
[2] J Carreira et al, Semantic segmentation with second-order pooling. ECCV’12
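A minimal sketch of the two pooling operators, written with plain Python lists so that it stays self-contained. The function names and the toy features are assumptions for illustration only.

```python
def first_order_average_pooling(features):
    """O1P: the mean of the local features, a vector of length d."""
    n, d = len(features), len(features[0])
    return [sum(x[j] for x in features) / n for j in range(d)]


def second_order_average_pooling(features):
    """O2P: the mean of the outer products x_i x_i^T, a d x d matrix."""
    n, d = len(features), len(features[0])
    return [[sum(x[j] * x[k] for x in features) / n for k in range(d)]
            for j in range(d)]


# Two hypothetical 2-D local features.
features = [(1.0, 2.0), (3.0, 4.0)]
o1p = first_order_average_pooling(features)
o2p = second_order_average_pooling(features)
```

Neither operator needs a codebook, but O2P grows quadratically with the feature dimension d, which is the high-dimensionality cost noted on the slide.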

Part I: Context analysis in semantic segmentation

Introduction: Context

Semantic context [1,2]

Spatial context

GOAL: Analyze the influence of the spatial context in object recognition

[1] M Bar, Visual Objects in Context. Nature Reviews Neuroscience 2004
[2] A Rabinovich et al, Objects in Context. ICCV’07

Related Work: Ideal scenario

Ground-truth object location

Conclusion: Aggregating the local features over three region pools (interior, border and surround) improves performance [1]

[1] J.R.R. Uijlings et al., The Visual Extent of an Object. IJCV’12

Related Work: Realistic scenario
• Pipeline [1]

Input image → Generate object candidates → Rank object candidates → Predict class scores → Aggregate high-rank candidates → Semantic partition

[1] J Carreira et al, Object Recognition as Ranking Holistic Figure-Ground Hypotheses. CVPR’10

Related Work: Realistic scenario
• How is each class predictor trained? [1]

TRAINING DATA: 0.8179, 0.6861, 0.9013, 0.7381, 0.7105, 0.6462

An SVR is used to learn the function that predicts the overlap for each class

SVR: os = f([O2PF O2PG])

overlap score    O2PF O2PG
os_1             [O2PF_1 O2PG_1]
os_2             [O2PF_2 O2PG_2]
…                …
os_N             [O2PF_N O2PG_N]

GOAL: CHANGE SPATIAL CODIFICATION

[1] J Carreira et al, Semantic segmentation with second-order pooling. ECCV’12
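The regression target in this setup is an overlap score between an object candidate and a ground-truth object. The sketch below assumes that the overlap is measured as intersection over union of binary masks, which the slide does not state explicitly; the function name and the toy masks are hypothetical.

```python
def overlap_score(candidate, ground_truth):
    """Intersection over union of two binary masks given as lists of rows.

    Treating the overlap score as intersection over union is an assumption
    of this sketch.
    """
    intersection = 0
    union = 0
    for cand_row, gt_row in zip(candidate, ground_truth):
        for c, g in zip(cand_row, gt_row):
            if c and g:
                intersection += 1
            if c or g:
                union += 1
    return intersection / union if union else 0.0


# Hypothetical 2x3 masks: the candidate covers the left part, the object the right part.
candidate = [[1, 1, 0],
             [1, 1, 0]]
ground_truth = [[0, 1, 1],
                [0, 1, 1]]
score = overlap_score(candidate, ground_truth)
```

Each training candidate would contribute one such score, paired with its pooled descriptor, to the regressor of the corresponding class.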


Contributions
• Figure-Border-Ground spatial pooling in the realistic scenario

SVR: os = f([O2PF O2PB O2PG])

os_1    [O2PF_1 O2PB_1 O2PG_1]
os_2    [O2PF_2 O2PB_2 O2PG_2]
…       …
os_N    [O2PF_N O2PB_N O2PG_N]
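The descriptor [O2PF O2PB O2PG] can be sketched as three second-order poolings, one per region pool, concatenated into a single vector. How each local feature is assigned to the Figure, Border or Ground pool is taken as given here (the `pools` list is a hypothetical input), and every pool is assumed to be non-empty.

```python
def o2p_vector(features):
    """Second-order average pooling, flattened row by row into a list."""
    n, d = len(features), len(features[0])
    return [sum(x[j] * x[k] for x in features) / n
            for j in range(d) for k in range(d)]


def figure_border_ground_descriptor(features, pools):
    """Concatenate the pooled descriptors of the Figure ('F'), Border ('B')
    and Ground ('G') pools. Assumes each pool holds at least one feature."""
    descriptor = []
    for pool_name in ("F", "B", "G"):
        members = [x for x, p in zip(features, pools) if p == pool_name]
        descriptor.extend(o2p_vector(members))
    return descriptor


# Hypothetical 2-D local features with their pool assignment.
features = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 0.0)]
pools = ["F", "F", "B", "G"]
descriptor = figure_border_ground_descriptor(features, pools)
```

With d-dimensional features the concatenated vector has 3·d² entries in this sketch, one d×d block per pool.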

Contributions
• Contour-based spatial pyramid [1]: crown-based

SVR: os = f([O2PF O2PSR1 O2PSR2 O2PSR3 O2PSR4])

os_1    [O2PF_1 O2PSR1_1 O2PSR2_1 O2PSR3_1 O2PSR4_1]
os_2    [O2PF_2 O2PSR1_2 O2PSR2_2 O2PSR3_2 O2PSR4_2]
…       …
os_N    [O2PF_N O2PSR1_N O2PSR2_N O2PSR3_N O2PSR4_N]

[1] S Lazebnik et al, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR’06

Contributions
• Contour-based spatial pyramid [1]: Cartesian-based

SVR: os = f([O2PF O2PSR1 O2PSR2 O2PSR3 O2PSR4])

os_1    [O2PF_1 O2PSR1_1 O2PSR2_1 O2PSR3_1 O2PSR4_1]
os_2    [O2PF_2 O2PSR1_2 O2PSR2_2 O2PSR3_2 O2PSR4_2]
…       …
os_N    [O2PF_N O2PSR1_N O2PSR2_N O2PSR3_N O2PSR4_N]

[1] S Lazebnik et al, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR’06

Experiments
• Pascal VOC segmentation challenge 2011 & 2012 [1]
• Train, validation and test subsets
  • Train: 1,112 (2011) / 1,464 (2012)
  • Validation: 1,111 (2011) / 1,449 (2012)
  • Test: 1,111 (2011) / 1,456 (2012)
• 20 semantic classes
  • aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor
• Evaluation measure: Average Accuracy Classification (AAC)

[1] M Everingham et al, The PASCAL Visual Object Classes (VOC) Challenge. IJCV’10


Experiments
• Ideal scenario
  • Train set: train11
  • Test set: val11

              F [1]   F-B    F-G [1]   F-B-G
eSIFT [1]     63.9    66.2   66.4      68.6
eMSIFT [1]    64.8    68.9   67.7      70.8


                      F [1]   F-B    F-B-G
Non SP                64.8    68.9   70.8
Crown-based SP        68.7    71.1   71.7
Cartesian-based SP    67.7    71.6   72.7


Figure               SP (Figure)   Border         Ground         AAC
eSIFT+eMSIFT+eLBP    -             -              eSIFT          72.98 [1]
eSIFT+eMSIFT         -             eSIFT+eMSIFT   eSIFT+eMSIFT   73.84
eSIFT+eMSIFT+eLBP    eMSIFT        eSIFT+eMSIFT   eSIFT+eMSIFT   75.86

Experiments
• Realistic scenario (CPMC [1])
  • Train set: train11
  • Test set: val11

Figure               SP (Figure)   Border   Ground   AAC
eSIFT                -             -        eSIFT    28.6 [2]
eSIFT                -             eSIFT    eSIFT    34.8
eSIFT+eMSIFT+eLBP    -             -        eSIFT    37.2 [2]
eSIFT                eSIFT         eSIFT    eSIFT    37.4
eSIFT+eMSIFT+eLBP    eSIFT         eSIFT    eSIFT    39.6

[1] J Carreira et al, Constrained parametric min-cuts for automatic object segmentation. CVPR’10

Experiments
• Realistic scenario (CPMC [1])
  • Train set: trainval11/12
  • Test set: test11/12

         F-G [2]   F-B-G   SP(F)-B-G
VOC11    38.8      43.8    40.3
VOC12    39.9      42.2    40.8

[1] J Carreira et al, Constrained parametric min-cuts for automatic object segmentation. CVPR’10

Experiments
• Realistic scenario (MCG [1])
  • Train set: train11
  • Test set: val11

        F-G [2]   F-B-G   SP(F)-B-G
CPMC    37.2      38.9    39.6
MCG     30.9      34.1    36.1

[1] P Arbeláez et al, Multiscale combinatorial grouping. CVPR’14

Experiments: Qualitative evaluation

[Figure: qualitative comparison of F-G and F-B-G results, with class labels, for the 20 VOC classes]

Conclusions
• Figure-Border-Ground spatial pooling improves on the original Figure-Ground pooling in both ideal and realistic scenarios
• The Border region pool carries the richest contextual information
• The Cartesian-based spatial pyramid outperforms the crown-based spatial pyramid, but both of them may result in overfitting
• Both Figure-Border-Ground pooling and the Cartesian-based spatial pyramid have been validated with MCG object candidates
• Published in ICIP’15

Part II: Multiresolution co-clustering for uncalibrated multiview segmentation

Introduction

STATE OF THE ART

OUR RESULTS

Introduction
• First goal: improving generic segmentation
  • Motion-based region adjacency graph
  • New resolution parameterization
  • Relaxing hierarchical constraints with a two-step architecture
  • Practical framework for a global optimization
• Second goal: improving semantic segmentation
  • Semantic-based generic segmentation
  • Automatic resolution selection technique
  • Generic segmentation based semantic segmentation

Introduction
• Co-segmentation
• Video segmentation
• Co-clustering

Related Work: Co-clustering framework [1,2]
• Objective: Find the clusters that define the coherent regions across the different views at multiple resolutions

[Figure: INPUT IMAGES, HIERARCHIES, LEAVES PARTITIONS, CO-CLUSTERED PARTITIONS]

[1] D Glasner et al, Contour-based joint clustering of multiple segmentations. CVPR’11
[2] D Varas et al, Multiresolution hierarchy co-clustering for semantic segmentation in sequences with small variations. ICCV’15

Related Work: Co-clustering framework [1,2]
• Objective: Find the clusters that define the coherent regions across the different views

[Figure: LEAVES PARTITIONS and CO-CLUSTERED PARTITIONS of view 1 and view 2]

[1] D Glasner et al, Contour-based joint clustering of multiple segmentations. CVPR’11
[2] D Varas et al, Multiresolution hierarchy co-clustering for semantic segmentation in sequences with small variations. ICCV’15

Related Work: Co-clustering framework
• Representation with boundary variables
  • Intra-image boundary variables: D1,2, D1,3, D2,3, D4,5, D5,6
  • Inter-image boundary variables: D1,4, D1,5, D2,4, D2,5, D3,6

[Figure: view 1 and view 2]

Intra-image    Inter-image
D1,2 = 0       D1,4 = 0
D1,3 = 1       D1,5 = 0
D2,3 = 1       D2,4 = 0
D4,5 = 0       D2,5 = 0
D5,6 = 1       D3,6 = 0
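To make the boundary-variable representation concrete, the sketch below recovers the clusters implied by the values listed on this slide. It assumes that a value of 0 means the two regions are merged and 1 means the boundary between them is kept; the union-find helper and its names are illustrative, not the authors' implementation.

```python
def clusters_from_boundaries(regions, boundary):
    """Group regions into clusters given boundary variables.

    `boundary` maps a pair (i, j) to 0 or 1. Reading 0 as "merged" and
    1 as "boundary kept" is an assumption of this sketch.
    """
    parent = {r: r for r in regions}

    def find(r):
        # Union-find lookup with path halving.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for (i, j), value in boundary.items():
        if value == 0:
            parent[find(i)] = find(j)

    groups = {}
    for r in regions:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(g) for g in groups.values())


# Values copied from the slide: intra-image pairs first, then inter-image pairs.
boundary = {
    (1, 2): 0, (1, 3): 1, (2, 3): 1, (4, 5): 0, (5, 6): 1,
    (1, 4): 0, (1, 5): 0, (2, 4): 0, (2, 5): 0, (3, 6): 0,
}
clusters = clusters_from_boundaries(range(1, 7), boundary)
```

Under that reading, regions 1, 2, 4 and 5 form one cluster spanning both views and regions 3 and 6 form the other, and every pair with a boundary value of 1 ends up in different clusters.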


Related Work: Co-clustering framework
• How are the values of the boundary variables chosen?

[Figure: LEAVES PARTITIONS of view 1 and view 2]

INTRA INTERACTIONS: Q1,2, Q1,3, Q2,3, Q4,5, Q5,6
INTER INTERACTIONS: Q1,4, Q1,5, Q2,4, Q2,5, Q3,6

Related Work: Co-clustering framework
• Hierarchical constraint

[Figure: hierarchies of view 1 and view 2]

Co-clustered partitions cannot violate the hierarchical structures

Related Work: Co-clustering framework
• Multiresolution parameterization

[Figure: LEAVES PARTITIONS of view 1 and view 2, …]

Related Work: Co-clustering framework
• Iterative approach

Contribution I: Motion-based adjacency
• Similarity computation
• RAG definition

[Figure: View #i and View #i-1]

Contribution II: Resolution parameterization

view 1 view 2

LEAVES PARTITIONS…

Original parameterization

Proposed parameterization

= ???

= 2


Contribution III: Two-step iterative architecture
• Hierarchical constraints are not imposed in the second step

[Figure: First step, Second step]

Contribution IV: Generic global co-clustering

• All co-clustered partitions resulting from the iterative architecture are fed into a global optimization

• The reduction in the number of regions makes the global optimization feasible

Contribution V: Semantic global co-clustering

• Semantic information is introduced in the global optimization

[Figure: GENERIC CO-CLUSTERING, SEMANTIC SEGMENTATIONS, SEMANTIC CO-CLUSTERING]

Contribution VI: Automatic resolution selection
• We propose a method that automatically selects the resolution that best fits the semantic information

[Figure: LEAVES PARTITIONS of view 1 and view 2, MULTIRESOLUTION CO-CLUSTERING, SEMANTIC PARTITIONS, SINGLE RESOLUTION CO-CLUSTERING]

Contribution VII: Coherent semantic partitions

[Figure: LEAVES PARTITIONS of view 1 and view 2, SEMANTIC PARTITIONS, SINGLE RESOLUTION CO-CLUSTERING, COHERENT SEMANTIC PARTITIONS]

[Figure: STATE OF THE ART [1] and OUR RESULTS]

[1] S Zheng et al, Conditional Random Fields as Recurrent Neural Networks. ICCV’15

Experiments: Dataset
• Multiview dataset [1]

[1] A Kowdle et al, Multiple view object cosegmentation using appearance and stereo cues. ECCV’12

Experiments: Generic co-clustering

Co-segmentation techniques

Video segmentation techniques

Co-clustering techniques
• I-1S: Motion-compensated one-step iterative (baseline)
• I-2S: Two-step iterative
• UCM+I-1S: First step is replaced by a cut from a hierarchical segmentation algorithm
• I-2S+GG: Two-step iterative followed by generic global optimization

Experiments: Generic co-clustering

I-2S UCM+I-1S I-2S+GG

[KX12] [JBP12] [XXC12] [GKHE10] [GCS13] UCM+Pr I-1S

BMW 0.72 0.68 0.70 0.42 0.56 0.70 0.65 0.63 0.62 0.67

Chair 0.79 0.77 0.76 0.53 0.78 0.80 0.76 0.47 0.59 0.78

Couch 0.93 0.95 0.94 0.78 0.90 0.85 0.88 0.73 0.89 0.90

GardenChair 0.84 0.63 0.87 0.31 0.52 0.70 0.68 0.63 0.84 0.80

Motorbike 0.76 0.77 0.77 0.39 0.39 0.71 0.73 0.46 0.54 0.70

Teddy 0.92 0.92 0.92 0.69 0.87 0.88 0.84 0.85 0.82 0.90

Average 0.83 0.79 0.83 0.52 0.67 0.77 0.76 0.63 0.72 0.79

CO-CLUSTERING CO-SEGMENTATION VIDEO SEGMENTATION BASELINES

• Two-step iterative co-clustering techniques (I-2S and I-2S+GG) outperform other state-of-the-art techniques

Experiments: Semantic co-clustering

Co-clustering techniques
• I-2S+GG(MR): Multiresolution global generic co-clustering
• I-2S+SG(MR): Multiresolution global semantic co-clustering
• I-2S+GG(SR): Single resolution global generic co-clustering
• I-2S+SG(SR): Single resolution global semantic co-clustering

Semantic segmentation techniques
• SCSS: Semantic co-clustering based semantic segmentation
• GCSS: Generic co-clustering based semantic segmentation
• [ZJRP+15]: state-of-the-art

[ZJRP+15] S Zheng et al, Conditional Random Fields as Recurrent Neural Networks. ICCV’15

Experiments: Qualitative assessment

[Figure: leaves partition, I-2S, I-2S+GG, I-2S+SG, SCSS and [ZJRP+15] results]

[Figure: Occlusion/Object Boundary Detection Dataset [GVB11], Ballet and Breakdancers datasets [ZKU+04]]

Conclusions
• The use of motion cues significantly improved the performance
• The new resolution parameterization gave a more uniform distribution of resolutions
• The two-step architecture improved on the performance of the original one-step architecture
• Although global optimization is now feasible, there is no clear gain for generic co-clustering. However, it is useful for semantic co-clustering
• Applying the resolution selection technique causes a small decrease in performance
• Submitted to ECCV’16 (awaiting decision)

Future Work
• Extending experiments to video datasets
  • VSB100 (Video Segmentation Benchmark) [1]
  • Cityscapes [2]
• Extending experiments to calibrated scenarios
• Training end-to-end CNNs for multiview semantic segmentation

[1] F Galasso et al, A Unified Video Segmentation Benchmark: Annotation, Metrics and Analysis. ICCV’13
[2] M Cordts et al, The Cityscapes dataset for semantic urban scene understanding. CVPR’16

Conclusions
• Results achieved in the first part by considering new spatial configurations are now obsolete after the outstanding results achieved by deep learning techniques.
• Results from deep learning techniques were used in the second part.
• The proposed multiresolution co-clustering has improved state-of-the-art results, but we should consider an end-to-end deep learning approach to achieve a more significant improvement.
• Semantic segmentation techniques evolve really fast, making this field very competitive and challenging.

Publications
• Related to the thesis

• C. Ventura, D. Varas, X. Giro-i-Nieto, V. Vilaplana, F. Marques. Semantically driven multiresolution co-clustering for uncalibrated multiview segmentation. Submitted to the European Conference on Computer Vision (ECCV) 2016. Under review.

• C. Ventura, X. Giro-i-Nieto, V. Vilaplana, K. McGuinness, F. Marques, N. E. O'Connor. Improving spatial codification in semantic segmentation. International Conference on Image Processing (ICIP) 2015.

• C. Ventura. Visual object analysis using regions and interest points. ACM International Conference on Multimedia 2013.

Publications
• Other publications:

• K. McGuinness, E. Mohedano, Z. Zhang, F. Hu, R. Albatal, Cathal Gurrin, N.E O'Connor, A. F. Smeaton, A. Salvador, X. Giro-i-Nieto, C. Ventura. Insight Centre for Data Analytics (DCU) at TRECVid 2014: instance search and semantic indexing tasks. TRECVID Workshop 2014.

• C. Ventura, V. Vilaplana, X. Giro-i-Nieto, F. Marques. Improving retrieval accuracy of Hierarchical Cellular Trees for generic metric spaces. Multimedia Tools and Applications, 2014.

• C. Ventura, X. Giro-i-Nieto, V. Vilaplana, D. Giribet, E. Carasusan. Automatic keyframe selection based on mutual reinforcement algorithm. International Workshop on Content-Based Multimedia Indexing (CBMI) 2013.

• C. Ventura, M. Tella-Amo, X. Giro-i-Nieto. UPC at MediaEval 2013 Hyperlinking Task. MediaEval 2013.

• C. Ventura, M. Martos, X. Giro-i-Nieto, V. Vilaplana, F. Marques. Hierarchical navigation and visual search for video keyframe retrieval. International Conference on Multimedia Modeling 2012.



Source: A. Oliva and A. Torralba, The role of context in object recognition



Source: T. Malisiewicz and A. A. Efros, Improving spatial support for objects via multiple segmentations.

Related Work: Realistic scenario
• Input image
• Object segment hypotheses
• Ranked object segment hypotheses (class independent), with object plausibility score
• Predict overlap estimate of each segment to each object class and sort segments by maximal score
• Aggregate high-rank segments

Source: J. Carreira et al., Semantic segmentation with second-order pooling

Related Work: Realistic scenario

TRAINING DATA: 0.8179, 0.6861, 0.9013, 0.7381, 0.7105, 0.6462

TEST DATA: ? 0.4905

Related Work: Co-clustering framework
• What are the contour elements?

[Figure: LEAVES PARTITIONS of view 1 and view 2]

Which contour elements are considered to compute Q1,4?
• Contour elements of R1
• Contour elements of R4

Related Work: Co-clustering framework

INTRA INTERACTIONS INTER INTERACTIONS



LINEAR PROGRAMMING RELAXATION

[Figure: regions 1 and 2 in one view, regions 3, 4 and 5 in the other]

Intra: Q1,2 = -0.81, Q3,4 = -0.81, Q3,5 = -0.81, Q4,5 = -0.49
Inter: Q1,3 = 2.81e+03, Q1,4 = -1.36e+03, Q1,5 = -1.45e+03, Q2,3 = -2.81e+03, Q2,4 = 1.36e+03, Q2,5 = 1.45e+03

x 0    x 0    x 1

Q4,5 = -0.49 → D4,5 = 1 ??

D4,5 ≤ D4,2 + D2,5

D4,2 = 0, D2,5 = 0 → D4,5 = 0
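The constraint D4,5 ≤ D4,2 + D2,5 is a triangle inequality on the boundary variables: if regions 4 and 2 are merged and regions 2 and 5 are merged, then 4 and 5 cannot be separated. The sketch below lists the triples that break it. The function name and the dictionary layout are assumptions for illustration, and every pair among the given regions is expected to have a value.

```python
from itertools import permutations


def violated_constraints(regions, d):
    """Return the triples (i, j, k) with i < j for which D(i,j) > D(i,k) + D(k,j).

    `d` maps each unordered pair, stored as (smaller, larger), to 0 or 1.
    """
    def value(i, j):
        return d[(min(i, j), max(i, j))]

    violations = []
    for i, j, k in permutations(regions, 3):
        if i < j and value(i, j) > value(i, k) + value(k, j):
            violations.append((i, j, k))
    return violations


# The case questioned on the slide: 4-2 and 2-5 merged, but 4-5 kept apart.
inconsistent = {(2, 4): 0, (2, 5): 0, (4, 5): 1}
# Setting D4,5 = 0 restores consistency.
consistent = {(2, 4): 0, (2, 5): 0, (4, 5): 0}

bad = violated_constraints([2, 4, 5], inconsistent)
good = violated_constraints([2, 4, 5], consistent)
```

Here `bad` flags the triple (4, 5, 2), matching the slide's conclusion that D4,5 has to be 0 once D4,2 and D2,5 are 0.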



PARENT NODE 11

Inter-sibling boundaries:

Intra-sibling boundaries:

Related Work: Co-clustering framework
• Multiresolution parameterization
  • Number of active contours to encode leaves contours: 9
  • Maximum fraction to describe the r-th coarse level: 0.5
  • Maximum difference between consecutive levels: 0.1

4.5    3.6
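One reading of the two numbers on this slide, offered as an assumption because the parameter symbols did not survive: both follow from scaling the number of active contours, first by the maximum fraction and then by that fraction reduced by the maximum difference between consecutive levels.

$$9 \times 0.5 = 4.5, \qquad 9 \times (0.5 - 0.1) = 3.6$$

Whether each value acts as an upper or a lower limit for its level cannot be recovered from the slide text.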

Related Work: Co-clustering framework
• Iterative approach

Contribution II: Resolution parameterization

Selected inter-sibling boundaries:

Contributions
• Semantic global co-clustering

1. Class assignment to regions
2. Similarity penalizations
  • Regions from same partition with different classes
3. Optimization constraints
  • Regions from same partition with same class
  • Regions from different partitions with different class

Contribution VI: Automatic resolution selection
• Some applications require a single resolution

[Figure: resolutions l1 and l2 with classes C1, C2 and C3; l1 or l2?]

Experiments: Semantic co-clustering

Conclusions
• Multiresolution co-clustering framework for uncalibrated multiview sequences
  • Two-step architecture
  • Global optimization
  • Semantic-based co-clustering with resolution selection
• Submitted to ECCV’16 (awaiting decision)

Conclusions
• Part I: Improving spatial codification in semantic segmentation
  • Figure-Border-Ground in realistic scenario
  • Contour-based spatial pyramid
• Part II: Multiresolution co-clustering for uncalibrated multiview segmentation
  • Results from Part I are replaced by SoA deep learning techniques
  • Generic co-clustering for multiview sequences
  • Semantic co-clustering for multiview sequences
