- Winston H.-M. Hsu -
digital video | multimedia lab
COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK
Discovery and Fusion of Salient Multi-modal Features towards News Story Segmentation
@ TRECVID 2003 Workshop, 11/17/2003
Winston Hsu [1], Shih-Fu Chang [1], Lyndon Kennedy [1], Chih-wei Huang [1], Ching-Yung Lin [2], and Giridharan Iyengar [3]
[1] Dept. of Electrical Engineering, Columbia University, New York, NY
[2] IBM T. J. Watson Research Center, Hawthorne, NY
[3] IBM T. J. Watson Research Center, Yorktown Heights, NY
Story Segmentation
• Story definition (from LDC)
  - A news story is defined as a segment of a news broadcast with a coherent news focus which contains at least two independent, declarative clauses.
  - Misc. segments: commercials, reporter chitchat, station identifications, public service announcements, long musical interludes (>9 sec), etc.
[Figure: broadcast timeline with each segment labeled as news (N) or miscellaneous (M)]
Challenging problems due to diverse syntax
[Figure: sample broadcast timelines marking visual anchors, stories, music/animation, and sports segments]
* Visual anchors alone reach an F1 of only 0.67 on ABC and 0.51 on CNN

Modalities    Set   P     R     F1
Anchor Face   ABC   0.67  0.67  0.67
Anchor Face   CNN   0.80  0.38  0.51
Our Goal
• A robust statistical framework to fuse diverse features from different modalities
• A unified framework that can be adapted to different new video sources
  - Automatically generate customized models (parameters) for CNN and ABC channels with the same framework
• An efficient mechanism for inducing dominant features for any specific domain
  - Allows us to handle large pools of features smoothly
  - Allows us to incorporate computationally noisy feature detectors
More information: "Discovery and Fusion of Salient Multi-modal Features towards News Story Segmentation," invited talk, Jan. 18-22, San Jose, SPIE/Electronic Imaging 2004.
Need for Multi-modal Fusion
• Issue: is there a story boundary at the candidate point $t_k$?
• Use the perceptual multi-modal features computed from surrounding windows to infer decisions, with observation $x_k$ to estimate the posterior probability $q(b \mid x_k)$
[Figure: timeline with candidate points $t_{k-1}$, $t_k$, $t_{k+1}$ and observation windows $B_p$, $B_c$, $B_n$ around $t_k$. Example cues:
  - an anchor face?
  - motion energy changes?
  - a commercial starts in 15 sec.
  - change from music to speech?
  - significant pause occurs?
  - just starts a speech segment?
  - {cue phrase}_i appears
  - {cue phrase}_j appears]
Our Proposed Framework: Exponential Model w/ Perceptual Binary Features
[Figure: training samples shown as a binary matrix with rows labeled b, 1, 2, ..., 9, ...]
\[
q_\lambda(b \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_i \lambda_i \cdot f_i(x, b) \Big), \qquad f_i(x, b),\; b \in \{0, 1\}
\]
Raw features:
§ Face
§ Motion
§ Significant Pause
§ Speech segment
§ Commercial
§ Text segmentation score
§ …
* Use supervised learning to find the optimal exponential weight $\lambda_i$ for each binary feature $f_i$
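A minimal Python sketch of how the exponential model on this slide turns binary features into a boundary posterior. The two feature functions, the observation dictionary keys, and the helper name `posterior` are hypothetical; the two weights are borrowed from the CNN selected-features slide purely for illustration.

```python
import math

def posterior(x, weights, features):
    """Return {b: q(b | x)} for b in {0, 1} under the exponential model."""
    scores = {
        b: math.exp(sum(w * f(x, b) for w, f in zip(weights, features)))
        for b in (0, 1)
    }
    z = scores[0] + scores[1]  # normalizer Z_lambda(x)
    return {b: s / z for b, s in scores.items()}

# Hypothetical binary features: each fires only for b = 1 when its cue is present.
def f_anchor(x, b):
    return 1 if b == 1 and x["anchor_face_starts"] else 0

def f_pause(x, b):
    return 1 if b == 1 and x["significant_pause"] else 0

weights = [0.4771, 0.7947]  # illustrative values only
features = [f_anchor, f_pause]

q_both = posterior({"anchor_face_starts": True, "significant_pause": True},
                   weights, features)
q_none = posterior({"anchor_face_starts": False, "significant_pause": False},
                   weights, features)
```

With no active feature both labels score exp(0), so the posterior is uniform; each active feature with a positive weight pushes q(b = 1 | x) upward.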
Parameter Estimation
• Estimate $q_\lambda(b \mid x)$ from training data $T = \{(x_k, b_k)\}$ by minimizing the Kullback-Leibler divergence between the empirical distribution $\tilde{p}$ and the estimated model $q_\lambda$, defined as
\[
D(\tilde{p} \,\|\, q_\lambda) = \sum_{x,b} \tilde{p}(x, b) \log \frac{\tilde{p}(b \mid x)}{q_\lambda(b \mid x)} = -\sum_{x,b} \tilde{p}(x, b) \log q_\lambda(b \mid x) + \mathrm{constant}(\tilde{p})
\]
• This also maximizes the log-likelihood (whose maximum possible value is 0)
\[
L_{\tilde{p}}(q_\lambda) \equiv \sum_{x,b} \tilde{p}(x, b) \log q_\lambda(b \mid x)
\]
• Iteratively find $\lambda_i$
\[
\lambda_i' = \lambda_i + \Delta\lambda_i, \qquad \Delta\lambda_i = \frac{1}{M} \log \frac{\sum_{x,b} \tilde{p}(x, b)\, f_i(x, b)}{\sum_{x,b} \tilde{p}(x)\, q_\lambda(b \mid x)\, f_i(x, b)}
\]
• Because of the convexity of the objective function, the iterative process is guaranteed to converge to the global optimum.
• Our Matlab implementation converges efficiently, in ~30 mins when using 30 features and 11,705 training samples.
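The iterative weight update on this slide can be sketched in a few lines of Python. This is an illustration on toy data, not the Matlab implementation mentioned above. The slide does not define M; the sketch assumes the usual generalized iterative scaling convention (the largest number of features active for any (x, b) pair). The toy samples, the feature `f_cue`, and the function names are hypothetical.

```python
import math

def q_lambda(x, lam, feats):
    """Model posterior [q(0|x), q(1|x)] for the current weights."""
    s = [math.exp(sum(l * f(x, b) for l, f in zip(lam, feats))) for b in (0, 1)]
    z = s[0] + s[1]
    return [s[0] / z, s[1] / z]

def train(samples, feats, iters=200):
    n = len(samples)
    lam = [0.0] * len(feats)
    # Assumption: M is the maximum total feature count over all (x, b) pairs.
    M = max(1, max(sum(f(x, b) for f in feats)
                   for x, _ in samples for b in (0, 1)))
    # Numerator of the update: empirical expectation of each feature.
    emp = [sum(f(x, b) for x, b in samples) / n for f in feats]
    for _ in range(iters):
        # Denominator: expectation under p~(x) * q_lambda(b | x).
        model = [0.0] * len(feats)
        for x, _ in samples:
            q = q_lambda(x, lam, feats)
            for i, f in enumerate(feats):
                model[i] += sum(q[b] * f(x, b) for b in (0, 1)) / n
        for i in range(len(feats)):
            if emp[i] > 0 and model[i] > 0:
                lam[i] += math.log(emp[i] / model[i]) / M
    return lam

# Toy data: (observation, boundary label). The cue is present in four samples,
# three of which are true boundaries.
def f_cue(x, b):
    return 1 if b == 1 and x == 1 else 0

samples = [(1, 1), (1, 1), (1, 1), (1, 0), (0, 0), (0, 0)]
lam = train(samples, [f_cue])
```

On this toy set the constraint is satisfied when q(1 | cue) = 0.75, so the single weight approaches log 3.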
Feature Selection
• Input: collection of candidate features $C$, training samples, and the desired model size
• Output: selected features and their corresponding exponential weights
• Current model $q$ augmented with feature $h$ with weight $\alpha$:
\[
q_{\alpha,h}(b \mid x) = \frac{e^{\alpha h(x, b)}\, q(b \mid x)}{Z_\alpha(x)}
\]
• At each iteration, select the candidate which improves the current model the most (reduces the divergence, increases the log-likelihood):
\[
h^* = \arg\max_{h \in C} \sup_\alpha \big\{ D(\tilde{p} \,\|\, q) - D(\tilde{p} \,\|\, q_{\alpha,h}) \big\}
    = \arg\max_{h \in C} \sup_\alpha \big\{ L_{\tilde{p}}(q_{\alpha,h}) - L_{\tilde{p}}(q) \big\}
\]
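One greedy selection step can be sketched as follows in Python. The slide does not say how the supremum over the weight is found; this sketch assumes a plain ternary search, which is valid because the log-likelihood is concave in the weight. The toy samples, the candidate features, and all function names are hypothetical.

```python
import math

def log_lik(samples, q):
    """Average log-likelihood of the labels under model q(x) -> [q0, q1]."""
    return sum(math.log(q(x)[b]) for x, b in samples) / len(samples)

def augment(q, h, alpha):
    """Return the model q_{alpha,h}: q reweighted by exp(alpha * h(x, b))."""
    def q_new(x):
        base = q(x)
        s = [base[b] * math.exp(alpha * h(x, b)) for b in (0, 1)]
        z = s[0] + s[1]
        return [s[0] / z, s[1] / z]
    return q_new

def best_gain(samples, q, h, lo=-10.0, hi=10.0, steps=60):
    """Approximate sup over alpha of L(q_{alpha,h}) - L(q) by ternary search."""
    base = log_lik(samples, q)
    for _ in range(steps):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_lik(samples, augment(q, h, m1)) < log_lik(samples, augment(q, h, m2)):
            lo = m1
        else:
            hi = m2
    alpha = (lo + hi) / 2
    return log_lik(samples, augment(q, h, alpha)) - base, alpha

def select(samples, q, candidates):
    """Pick the candidate feature with the largest log-likelihood gain."""
    scored = [best_gain(samples, q, h) + (name,) for name, h in candidates.items()]
    gain, alpha, name = max(scored)
    return name, alpha, gain

def uniform(x):
    return [0.5, 0.5]

# Toy observations x = (pause, noise); only `pause` is informative.
samples = [((1, 0), 1), ((1, 1), 1), ((1, 0), 1), ((1, 1), 0),
           ((0, 0), 0), ((0, 1), 0), ((0, 0), 0), ((0, 1), 1)]
candidates = {
    "pause": lambda x, b: 1 if b == 1 and x[0] == 1 else 0,
    "noise": lambda x, b: 1 if b == 1 and x[1] == 1 else 0,
}
name, alpha, gain = select(samples, uniform, candidates)
```

Repeating this step, with the selected feature folded into the current model each time, grows the model up to the desired size.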
Examples of Raw Features
[Figure: raw feature tracks (sigpas, music, comm., text seg. score, face, shot, motion) plotted over time around a candidate point]

Modality  Raw Features            Time Index  Value
Text      V-OCR cue terms         point       boolean
Text      ASR cue terms           point       boolean
Text      text seg. score         point       real
Audio     speech seg./rapidity    segment     real
Audio     music/speech disc.      segment     boolean
Audio     significant pause       point       real
Audio     pitch jump              point       real
Audio     pause                   point       real
Video     commercial              segment     boolean
Video     face                    segment     real
Video     shot boundary           point       boolean
Video     motion                  segment     real
Video     sports                  segment     boolean
Misc.     combinatorial           point       boolean

§ Features exist at different time scales and at asynchronous points
§ Need a unified wrapper to convert them to a consistent representation & imitate human perceptions
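Feature Wrapper
[Figure: the feature library of raw features $\{f_i^r(\cdot)\}$ feeds the feature wrapper $F_w(f_i^r, t_k, dt, v, B)$, which outputs candidate features $\{g_j\}$; the maximum entropy stage yields the selected features and weights $\{h_k; \lambda_k\}$ and the posterior $q(b \mid \cdot)$. Example binary outputs are shown for observation windows B1, B2, B3.]
Wrapper arguments:
• raw features: $f_i^r$
• candidate point: $t_k$
• delta interval: $dt$ (delta operation, $\Delta f_i^r$)
• binarization levels: $v$ (binarization thresholds)
• observation windows: $B$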
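A rough Python sketch of what such a wrapper could look like. The slide only names the arguments, so the concrete behavior here is an assumption: the output is 1 when any raw sample inside the window around the candidate point exceeds the threshold, and the delta variant first subtracts the value held dt seconds earlier. The function name `wrap`, the window offsets, and the sample data are hypothetical.

```python
def wrap(raw, t_k, window, v, dt=None):
    """Binarize one raw feature around candidate point t_k.

    raw:    list of (time, value) pairs sorted by time
    window: (lo, hi) offsets in seconds relative to t_k (observation window B)
    v:      binarization threshold
    dt:     optional delta interval; if set, threshold f(t) - f(t - dt)
    """
    def value_at(t):
        # Most recent raw value at or before time t (0.0 if there is none).
        val = 0.0
        for ts, x in raw:
            if ts <= t:
                val = x
            else:
                break
        return val

    lo, hi = t_k + window[0], t_k + window[1]
    for ts, x in raw:
        if lo <= ts <= hi:
            if dt is not None:
                x = x - value_at(ts - dt)
            if x > v:
                return 1
    return 0

# Hypothetical pause durations (seconds) detected at three time points.
pauses = [(10.0, 0.1), (20.0, 0.6), (30.0, 2.5)]

g_short = wrap(pauses, 20.0, (-2.5, 2.5), 0.25)          # pause > 0.25 s nearby
g_long = wrap(pauses, 20.0, (-2.5, 2.5), 2.0)            # pause > 2.0 s nearby
g_next = wrap(pauses, 30.0, (0.0, 5.0), 2.0)             # pause > 2.0 s afterwards
g_delta = wrap(pauses, 30.0, (-2.5, 2.5), 1.5, dt=10.0)  # jump of > 1.5 s
```

Sweeping several thresholds and windows over each raw feature would produce a large pool of binary candidates for the selection stage.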
Selected Features (from CNN)
* The first 10 "A+V" features automatically discovered for the CNN channel

no  raw feature set                     λ        gain    interpretation
1   Anchor face                         0.4771   0.3879  An anchor face segment just starts after the boundary point.
2   Significant pause & non-commercial  0.7471   0.0160  A significant pause within the non-commercial section appears in the surrounding observation window.
3   Pause                               0.2434   0.0058  An audio pause with a duration larger than 2.0 seconds appears after the boundary point.
4   Significant pause                   0.7947   0.0024  The surrounding observation window has a significant pause with a pitch jump intensity larger than the normalized pitch threshold 1.0 and a pause duration larger than 0.5 second.
5   Speech segment                      -0.3566  0.0019  A speech segment before the candidate point.
6   Speech segment                      0.3734   0.0015  A speech segment starts in the surrounding observation window.
7   Commercial                          1.0782   0.0015  A commercial starts 15 to 20 seconds after the candidate point.
8   Speech segment                      -0.4127  0.0022  A speech segment ends after the candidate point.
9   Anchor face                         0.7251   0.0016  An anchor face segment occupies at least 10% of the next window.
10  Pause                               0.0939   0.0008  The surrounding observation window has a pause with a duration larger than 0.25 second.
Significant Pause
(sample: sigpas_seg_0202cnn)

            Significant Pause     Uniform
Set  ε      P     R     F1        P     R     F1
ABC  5.0    0.20  0.38  0.26      0.10  0.22  0.14
ABC  2.5    0.16  0.34  0.22      0.10  0.22  0.14
CNN  5.0    0.40  0.45  0.42      0.20  0.24  0.22
CNN  2.5    0.37  0.43  0.39      0.20  0.24  0.22
Precision vs. Recall Curves
[Figure: precision vs. recall curves for A+V (ABC), A+V+T (ABC), A+V (CNN), A+V+T (CNN), Anchor Face (CNN), and Anchor Face (ABC)]
§ A single P/R point is not sufficient for assessment -> need more samples of P/R curves
§ Improvement by multi-modal fusion is significant
  • CNN improves more in the high-recall area
  • ABC improves more in the high-precision area

F1            ABC   CNN
A+V           0.69  0.63
A+V+T         0.74  0.67
A+V (BM)      0.71  0.69
A+V+T (BM)    0.76  0.73

\[
F1 = \frac{2 \cdot P \cdot R}{P + R}
\]
Decision boundary: $q(b \mid \cdot) > b_m$
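A small Python sketch of how sweeping the decision threshold b_m traces out a P/R curve and the matching F1 values. It scores each candidate point independently, which is a simplification; the toy scores and labels and the function names are hypothetical.

```python
def f1(p, r):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def pr_curve(scores, labels, thresholds):
    """Return (b_m, precision, recall, F1) for each decision threshold b_m."""
    points = []
    for bm in thresholds:
        pred = [s > bm for s in scores]  # decision boundary: q(b|.) > b_m
        tp = sum(1 for p, y in zip(pred, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(pred, labels) if p and y == 0)
        fn = sum(1 for p, y in zip(pred, labels) if not p and y == 1)
        prec = tp / (tp + fp) if tp + fp else 1.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        points.append((bm, prec, rec, f1(prec, rec)))
    return points

# Toy posteriors q(b = 1 | x_k) and ground-truth boundary labels.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
curve = pr_curve(scores, labels, thresholds=[0.5, 0.25])
```

Lowering b_m trades precision for recall, which is why a single operating point says little about the whole curve.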
Story Typing
Modalities  Set   P     R     F1
A+V         ABC   0.93  0.92  0.92
A+V         CNN   0.92  0.90  0.91
A+V+T       ABC   0.89  0.94  0.91
A+V+T       CNN   0.91  0.90  0.90
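[Figure: news detection pipeline. Keyframes of a test video are compared against templates by a match filter, giving a binary news/non-news decision; median filters then produce the news detection result.]
• Morphological filtering:
\[
A' = (A \circ M_T) \bullet M_T, \qquad M_T = u[n] - u[n - T], \qquad T = 450 \text{ (frames)}
\]
  where $\circ$ is morphological OPEN and $\bullet$ is morphological CLOSE
• A story segment is assigned as "News" if its overlap with the non-commercial segments is larger than a threshold:
\[
\mathrm{type}_i = \mathbf{1}\left\{ \frac{|S_i \cap S_N|}{|S_i|} > \varepsilon \right\}
\]
  where $S_i$ is the $i$-th story segment and $S_N$ denotes the non-commercial segments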
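The open-then-close smoothing and the overlap rule can be sketched in plain Python. The sketch assumes that A is a per-frame binary news/non-news sequence, that the structuring element is a flat run of T ones, and that the overlap is measured as a fraction of the story length. The threshold 0.5, the "Misc" label for non-news stories, the toy mask, and the function names are hypothetical, and T is shrunk from 450 to 3 to keep the example readable.

```python
def erode(a, T):
    """1 at i only if a[i .. i+T-1] are all 1 (flat element of length T)."""
    n = len(a)
    return [1 if i + T <= n and all(a[i:i + T]) else 0 for i in range(n)]

def dilate(a, T):
    """1 at i if any of a[i-T+1 .. i] is 1 (reflected flat element)."""
    n = len(a)
    return [1 if any(a[max(0, i - T + 1):i + 1]) else 0 for i in range(n)]

def open_close(a, T):
    """A' = (A OPEN M_T) CLOSE M_T, zero-padded to avoid edge effects."""
    pad = [0] * T
    x = pad + list(a) + pad
    x = dilate(erode(x, T), T)  # OPEN: drops runs of 1s shorter than T
    x = erode(dilate(x, T), T)  # CLOSE: fills gaps of 0s shorter than T
    return x[T:-T]

def story_type(story, news_mask, eps):
    """Label a story (start, end) frame span by its overlap with news frames."""
    start, end = story  # end is exclusive
    overlap = sum(news_mask[start:end])
    return "News" if overlap / (end - start) > eps else "Misc"

raw_mask = [0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
clean = open_close(raw_mask, T=3)
first = story_type((4, 14), clean, eps=0.5)
second = story_type((14, 20), clean, eps=0.5)
```

The opening removes the isolated one- and two-frame detections, and the closing bridges the single-frame gap, leaving one contiguous news run.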
Summary
• We have developed a statistical framework that can be systematically applied to diverse news video sources
• The results are promising and show multi-modal improvement
• The same framework can be used to flexibly select dominant features from any modality
• The performance shows room for further research
  - How to go beyond 75% and reach 90%?
• Evaluation metrics should include complete P/R curves
• Future work
  - Address imbalanced data distributions
  - Explore temporal dynamics in stories
  - Expand the feature pool, e.g., speech phoneme rapidity, video OCR, and high-level concept detection
Acknowledgements
• Thanks to
  - Martin Franz of IBM Research for providing an ASR-only story segmentation system
  - Dongqing Zhang of Columbia University for providing the geometric active contour face detection system
  - The TRECVID 2003 organizing team for providing the evaluation platform and precious video corpora
Q & A
Thank You!!
* More information: "Discovery and Fusion of Salient Multi-modal Features towards News Story Segmentation," invited talk, Jan. 18-22, San Jose, SPIE/Electronic Imaging 2004.