discovery and fusion of salient multi-modal features ... winston h.-m. hsu - digital video |...

- Winston H.-M. Hsu -

digital video | multimedia lab

COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK

Discovery and Fusion of Salient Multi-modal Features towards News Story Segmentation - @TRECVID 2003 Workshop

Winston Hsu1, Shih-Fu Chang1, Lyndon Kennedy1, Chih-wei Huang1, Ching-Yung Lin2, and Giridharan Iyengar3

11/17/2003

1Dept. of Electrical Engineering, Columbia University, New York, NY

2IBM T. J. Watson Research Center, Hawthorne, NY

3IBM T. J. Watson Research Center, Yorktown Heights, NY

-2-- Winston H.-M. Hsu -digital video | multimedia lab


n Story definition (from LDC)n A News story is defined as a segment of a news

broadcast with a coherent news focus which contains at least two independent, declarative clauses.

n Misc. segments like commercials, reporter chitchat, station identifications, public service, long musical (>9 sec), interludes, etc…

Story Segmentation

N N N N N N NM M M NM



Challenging problems due to diverse syntax

music/animation

: visual anchors

: story

sports

>> samples

* Visual anchors alone account for 51% and 67% of stories only on ABC/CNN

0.510.380.80CNN

0.670.670.67ABCAnchor Face

F1RPSetModalities



Our Goaln A robust statistical framework to fuse diverse

features from different modalitiesn An unified framework that can be adapted to different

new video sourcesn Automatically generate customized models (parameters) for

CNN and ABC channels with the same framework

n An efficient mechanism for inducing dominant features for any specific domainn Allow us to handle large pools of features smoothlyn Allow us to incorporate computational noisy feature

detectors::More information,"Discovery and Fusion of Salient Multi-modal Features towards News Story Segmentation," invited talk, Jan. 18-22, San Jose, SPIE/Electronic Imaging 2004.



n Issue: a story boundary at the candidate point ?n Use the perceptual multi-modal features computed from

surrounding windows to infer decisionswith observation xk to estimate posterior probability q(b|xk)

a anchor face?

motion energy changes?a commercial starts in 15 sec.

change from music to speech?significant pause occurs?just starts a speech segment?

{cue phrase}i appears {cue phrase}j appears

tk tk+1tk-1

Bp BnBc

Need for Multi-modal Fusionkt



Our Proposed Framework –Exponential Model w/ Perceptual Binary Features

0001010100101010100…

00101010001010101009

00010001010000001008

00010010001010001007

00010010101010001006

00010110000010101005

00010000001001001004

01011010101101000003

00010001000000101102

00010000001000001001

0001000000100000100b

training samples

( , )1( | ) , ( , ), {0,1}

( )

i ii

f x b

iq b x e f x b bZ x

λ

λλ

⋅∑= ∈

Raw features:

§Face

§Motion

§Significant Pause

§Speech segment

§Commercial

§Text segmentation score

§… * Use supervised learning to find optimal exponentialweight for binary featureiλ i

if



n Estimate from training data by minimizing Kullback-Leibler divergence, defined as

n Also maximize the log-likelihood (with max extreme “0”)

n Iteratively find

Parameter Estimation

( | )( || ) ( , ) log

( | )

( , ) log ( | ) constant( )x b

x b

p b xD p q p b x

q b x

p x b q b x p

λλ

λ

=

= − +

∑∑

∑∑

%% %

% %

iii λλλ ∆+=′

iλ

( | )q b xλ {( , )}k kT x b=

,

,

( , ) ( , )1log

( ) ( | ) ( , )ix b

iix b

p x b f x b

M p x q b x f x bλ

λ ∆ =

∑∑

%%

( ) ( , ) log ( | )px b

L q p x b q b xλ λ≡ ∑∑% %empirical distribution

estimated model

•Because of the convexity of objective function, the iterative process is guaranteed to the global optima.•Our Matlab implementations show efficient convergence in ~30 mins when using 30 features and 11,705 training samples

( || )D p q%p% q



n Input: collection of candidate features, training samples, and the desired model size

n Output: selected features and their corresponding exponential weights

n Current model augmented with feature with weight ,

n Select the candidate which improves current model the most, at each iteration;

Feature Selection

{ }{ }{ }{ }

*,

,

arg max sup ( || ) ( || )

arg max sup ( ) ( )

hh C

p h ph C

h D p q D p q

L q L q

αα

αα

∈

∈

= −

= −% %

% %

( , )

,

( | )( | )

( )

h x b

h

e q b xq b x

Z x

α

αα

=

h α

q

reduce divergence

increase log-likelihood

q



sigpas

music

comm.

text seg. score

face

shot

motion

Examples of Raw Features

booleanpointcombinatorial Misc.

booleansegmentsports

realpointtext seg. score

realsegmentmotion

Videobooleanpointshot boundary

realsegmentface

booleansegmentcommercial

realpointpause

Audiorealpointpitch jump

realpointsignificant pause

booleansegmentmusc./spch. disc.

realsegmentspch seg./rapidity

booleanpointASR cue terms

Text booleanpointV-OCR cue terms

ValueTime IndexRaw FeaturesModality § Features exist at different time scales, asynchronous points§Need an unified wrapper to convert them to consistent representation & imitate human perceptions

candidate point



Feature Wrapper

Feature Library

{ }( )rif t

{ }( )rif ⋅

MaximumEntropy

( | )q b ⋅Feature Wrapper

( , , , , )rw i kF f t dt v B

{ }jg

{ };k kh λ

rif

rif∆

dt : delta operation

v : binarization thresholds

B1B2

B3

110

B1B2

11

B1 1

ndelta interval:nobservation windows:n binarization levels:n raw features:n candidate point:

dt

vB

rif

kt



Selected Features (from CNN)

0.0008

0.0016

0.0022

0.0015

0.0015

0.0019

0.0024

0.0058

0.0160

0.3879

gain

The surrounding observation window has a pause with the duration larger than 0.25 second.

0.0939Pause10

A speech segment starts in the surrounding observation window0.3734Speech segment6

A commercial starts in 15 to 20 seconds after the candidate point.1.0782Commercial7

A speech segment ends after the candidate point-0.4127Speech segment8

A speech segment before the candidate point-0.3566 Speech segment 5

An anchor face segment occupies at least 10% of next window0.7251Anchor face9

An audio pause with the duration larger than 2.0 second appears after the boundary point.

0.2434Pause3

An anchor face segment just starts after the boundary point0.4771Anchor Face1

Significant pause

Significant pause & non-commercial

raw feature set

The surrounding observation window has a significant pause with the pitch jump intensity larger than the normalized pitch threshold 1.0 and the pause duration larger than 0.5 second.

0.79474

A significant pause within the non-commercial section appears in surrounding observation window.

0.74712

interpretationno

* The first 10 “A+V”features automatically discovered for the CNN channel

>>

>>>>

λ



Significant Pause

::sigpas_seg_0202cnn

UniformSignificant Pause

0.24

0.24

0.22

0.22

R

0.39

0.42

0.22

0.26

F1

0.20

0.20

0.10

0.10

P

0.22

0.22

0.14

0.14

F1

0.430.372.5

0.450.405.0CNN

0.340.162.5

0.380.205.0ABC

RPεSet



Precision vs. Recall CurvesA+V (ABC)

A+V+T (ABC)

A+V (CNN)

A+V+T (CNN)

Anchor Face (CNN)

Anchor Face (ABC)

§ Single point P/R is not sufficient for assessment -> need more samples of P/R curves§ Improvement by multi-modal fusion is significant

• CNN improves more in high recall area• ABC improve more in high precision area

0.670.74A+V+T

0.690.71A+V (BM)

A+V+T (BM)

CNNABC

0.73

0.63

0.76

0.69A+V

21

P RF

P R⋅ ⋅

=+

decision boundary: ( | ) mq b b⋅ >



Story Typing

0.900.900.91CNN

0.910.940.89ABCA+V+T

0.910.900.92CNN

0.920.920.93ABCA+V

F1RPSetModalities

Match filter

Keyframes of a test video

templates

Binary decision: news/non-news

Median Filters

News detection

result

{ }| |1 N

i Jt

i

i S S

S

typeε

>

= ∩

n A story segment is assigned as “News”if overlapping with the non-commercial segments larger than a threshold

' ( )T TA A M M= •o[ ] [ ]TM u n u n T= − −

o• : Morphological OPEN: Morphological CLOSE

450 (frames)T =



Summaryn We have developed a statistical framework that can be

systematically applied to diverse news video sourcesn The results are promising and show multi-modal improvement

n The same framework can be used to select dominant features from any modalities flexibly

n The performance shows room for further research n How to go beyond 75% and reach 90%?

n Evaluation metrics should include complete P/R curvesn Future works

n Address imbalanced data distributionsn Explore temporal dynamics in storiesn Expand feature pool such as speech phoneme rapidity, video OCR,

and high-level concept detection, etc.



Acknowledgementsn Thanks to

n Martin Franz of IBM Research for providing an ASR only story segmentation system

n Dongqing Zhang of Columbia University for providing the geometric active contour face detection system

n TRECVID 2003 organizing team for providing the evaluation platform and precious video corpora

- Winston H.-M. Hsu -

digital video | multimedia lab


Q & AThank You!!

*More information:"Discovery and Fusion of Salient Multi-modal Features towards News Story Segmentation," invited talk, Jan. 18-22, San Jose, SPIE/Electronic Imaging 2004.

discovery and fusion of salient multi-modal features ... winston h.-m. hsu - digital video |...

Documents