vireo@trecvid 2016: multimedia event detection, ad-hoc video … · concept-bank vs object-vlad...
TRANSCRIPT
RESEARCH POSTER PRESENTATION DESIGN © 2015
www.PosterPresentations.com
Hao Zhang, Lei Pang, Yi-Jie Lu, Chong-Wah NgoVideo Retrieval Group (VIREO), City University of Hong Kong
http://vireo.cs.cityu.edu.hk
VIREO@TRECVID 2016: Multimedia Event Detection, Ad-hoc Video Search,Video to Text Description
Multimedia Event Detection (MED) Video to Text Description (VTT)
Motivation
Framework
Experiments
Conv 1Max Pool
Conv 2Max Pool
Conv 3Conv 4Conv 5
Extarction Scaling
Scaling
RoI
RoI
Feature Map ofScaled RoI
Feature Map
Feature Map of RoI
Scaled Feature Map of RoI
Image
RoI
Dee
p F
eatu
re E
xtra
ctio
n
Conv 1Max Pool
Conv 2Max Pool
Conv 3Conv 4Conv 5
COS Similarity Only around 40% of Regional Information left
Discriminatory power
of deep features consistently improves
0.4
0.45
0.5
0.55
0.6
0.65
0.7
conv1 conv2 conv3 conv4 conv5
AlexNet
cos-sim
0.4 0.5 0.6 0.7 0.8 0.9
conv1-1conv1-2conv2-1conv2-2conv3-1conv3-2conv3-3conv3-4conv4-1conv4-2conv4-3conv4-4conv5-1conv5-2conv5-3conv5-4
VGG-19-Net
cos-sim
0.4
0.5
0.6
0.7
0.8
0.9
1
conv1Res2ARes2BRes2CRes3ARes3BRes3CRes3DRes4ARes4BRes4CRes4DRes4ERes4FRes5ARes5BRes5C
Deep-ResNet-50
cos-sim
Video Tim
eFrame
Selective search
Candidate Objects
Temporal line
Spatial & temporal features clustering
Primary Scenes
Regional objects
Deep features K-means & VLAD
Video Tim
e
Frame People
Concept Bank
Baby Cake People
Time
Sequential Features
Average pooling
Chi2 SVM
Concept Bank Vocabulary
Object-VLAD Spatial Pyramid Pooling VLAD Details
Deep Feature MapExtraction
Feature Map7 X 7
Spatial Pyramid Pooling
7 X 7 6 X 6 5 X 5 2 X 2
50
de
scri
pto
rs
VLAD
Max pool filter:
50
de
scri
pto
rs
Concept-Bank VS Object-VLAD
CNN-VLAD
CNN-VLAD VS Object-VLAD
0
5
10
15
20
25
30
35
40
45
Team
2
VIR
EO
Team
3
Team
4
Team
5
Team
6
Team
7
Team
8
Team
9
Team
10
Team
11
MED16-EvalSub (MinfAP200%)
25
27
29
31
33
35
37
39
41
Team2 VIREO Team3 Team4
MED16-EvalFull (MinfAP200%)
Evaluation: PS-10Ex (Object-VLAD + Concept-Bank)
Evaluation: AH-10Ex
(Object-VLAD + Concept-Bank)
25
30
35
40
45
50
Team2 VIREO Team3 Team4
MED16-EvalFull (MinfAP200%)
0
5
10
15
20
25
30
35
40
45
50MED16-EvalSub (MinfAP200%)
Evaluation: PS-100Ex (Object-VLAD)
Pri
mary
Ru
ns
Pri
mary
Ru
ns
Pri
mary
Ru
ns
Framework
Concept-based Text Description Matching
Stacked Attention Model
Pool5 Feature Map
Image embedding layer
Some guy is playing Frisbee with a dog
Text CNN Layer
Soft-m
ax
Soft-m
ax
Multi-modal embedding
layer
Multi-modal embedding
layer
tanh
hA P1
vT u
hA
(2)
PI
(2)
vI
u(2)
S(vI, vT )
: feature for each region
1st Attention Layer 2nd Attention Layer
Objective-Function
Training Dataset Deep Features
Experiments
Ad-hoc Video Search (AVS)
OnlineOnline
OfflineOffline
Semantic Query Generation
Event QueryCleaning an appliance Tokens
NLP Parser
Concept matching and selection
Shot Corpus
$
€
₤
ƒ
$
¥
Concept Bank
Shot Ranking
Concept-based Video Representation
cleaning
refrigeratordishwasherstovemicrowave
spray bottle
kitchen
Semantic Query
Vocabulary
Framework
Performance for different concept settings
in Video-Search 2008 dataset (MinfAP%)
Experiments
Performance for different concept settings in
AVS 2016 evaluation dataset (MinfAP%)
Performance with and without highlighted concept
ranking (HCR)
Highlighted Concept Reranking (HCR)
guitar (#RC)
acoustic guitar (#I-1K)
electric guitar (#I-1K)
daytime outdoor (#SIN)
outdoor (#SIN)
outdoor (#RC)
a person playing guitar outdoors
Rerank 1
500
Without reranking With reranking