TRANSCRIPT
22/02/2016
Multimodal Information Fusion
Ling Guan
Ryerson Multimedia Laboratory & Centre for Interactive Multimedia Information Mining
Ryerson University, Toronto, Ontario
[email protected]
http://www.rml.ryerson.ca/
Acknowledgement
The presenter would like to thank his former and current students, P. Muneesawang, Y. Wang, R. Zhang, A. Bulzacki, M.T. Ibrahim, N. Joshi, Z. Xie, L. Gao and N. El Madany, for their contributions to this research.
Slides on fusion fundamentals provided by Prof. S-Y Kung of Princeton University are greatly appreciated.
This research is supported by the Canada Research Chair (CRC) Program, the Canada Foundation for Innovation (CFI), the Ontario Research Fund (ORF), and Ryerson University.
Ryerson University – Multimedia Research LaboratoryL. Guan, P. Muneesawang, Y. Wang, R. Zhang, Y. Tie, A. Bulzacki, M.T. Ibrahim, N. Joshi, Z.Xie, L. Gao, N. El Madany 2
Major Publications
Y. Wang, L. Guan and A.N. Venetsanopoulos, “Kernel based fusion with application to audiovisual emotion recognition,” IEEE Trans. on Multimedia, vol. 14, no. 3, pp. 597-607, Jun 2012.
P. Muneesawang, T. Amin and L. Guan, “A new learning algorithm for the fusion of adaptive audio-visual features for the retrieval and classification of movie clips,” Journal of Signal Processing Systems for Signal, Image and Video Technology, vol. 59, no. 2, pp. 177-188, May 2010.
Y. Wang and L. Guan, “Combining speech and facial expression for recognition of human emotional state,” IEEE Transactions on Multimedia, vol. 10, no. 5, pp. 936-946, August 2008.
N. Joshi and L. Guan, “ASR with the combination of recognizers under non-stationary noise conditions,” Journal of Signal Processing Systems, vol. 58, no. 3, pp. 359-370, Mar 2010.
L. Guan, P. Muneesawang, Y. Wang, R. Zhang, Y. Tie, A. Bulzacki and M.T. Ibrahim, “Multimedia Multimodal Technologies,” Proc. IEEE Workshop on Multimedia Signal Processing and Novel Parallel Computing (in conjunction with ICME 2009), pp. 1600-1603, NYC, USA, Jul 2009 (overview paper).
R. Zhang and L. Guan, “Multimodal image retrieval via Bayesian information fusion,” Proc. IEEE Int. Conf. on Multimedia and Expo, pp. 830-833, NYC, USA, Jun/Jul 2009.
Z. Xie and L. Guan, “Multimodal information fusion of audio emotion recognition based on kernel entropy component analysis,” Int. J. of Semantic Computing, vol. 7, no. 1, pp. 25-42, Aug 2013.
Why Multimedia Multimodal Methodology? (revisit)
Multimedia is a domain of multiple facets, e.g., audio, visual, text, graphics, etc.
A central aspect of multimedia processing is the coherent integration of media from different sources or modalities.
It is easy to define each facet individually, but difficult to treat them as a combined identity.
Humans are natural and generic multimedia processing machines
Can we teach computers/machines to do the same (via fusion technologies)?
Potential Applications
Human–Computer Interaction, Learning Environments, Consumer Relations, Entertainment, Digital Home, Domestic Helper, Security/Surveillance, Educational Software, Computer Animation, Call Centers
Source of Fusion for Classification
[Diagram: two inputs, Data/Feature #1 and Data/Feature #2, may be fused at the data/feature level, at the interaction level, or at the score (decision) level.]
Feature (Data) Level Fusion
Direct Data (Feature) Level Fusion
Prior knowledge can be incorporated into the fusion models by modifying
Interaction Level Fusion
[Diagram: an HMM per modality (e.g., a face model) coupled into a fused HMM.]
Decision (Score) Level Fusion
Modular Networks (Decision Level)
[Diagram: modular network. The network input feeds sub-networks E1, E2, ..., Er; their outputs y1, y2, ..., yj feed a decision module that outputs the recognized emotion.]
• Hierarchical structure
• Each sub-network Er is an expert system
• The decision module classifies the input vector as a particular class when Ynet = arg max_j yj
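A minimal sketch of the decision rule Ynet = arg max_j yj, assuming for illustration that the experts' class scores are simply averaged (the actual combination inside the network may differ):

```python
import numpy as np

classes = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]
expert_outputs = np.array([
    [0.10, 0.20, 0.10, 0.40, 0.10, 0.10],   # sub-network E1
    [0.20, 0.10, 0.10, 0.50, 0.05, 0.05],   # sub-network E2
])
y = expert_outputs.mean(axis=0)              # combined module outputs y_j
y_net = classes[int(np.argmax(y))]           # Ynet = arg max_j y_j
```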
A Different interpretation
Score Fusion
1a. Score Fusion (without supervision)
• Linear score fusion (confidence/prior knowledge)
• Nonlinear score fusion (ROC-based)
1b. Score Fusion (via supervision)
• Linear score fusion (adaptive supervision)
• Nonlinear score fusion (adaptive supervision)
Score Fusion Architecture (Audio-Visual)
• The lower layer contains local experts, each of which produces a local score based on a single modality.
• The upper layer combines the scores.
The scores are obtained independently and then combined.
Linear Fusion
[Figure: decision regions in the (Score 1, Score 2) plane, with a non-uniformly weighted linear boundary separating Accept from Reject.]
Linear SVM (supervised) fusion is an appealing alternative.
The most prevalent unsupervised approaches estimate the confidence weights from prior knowledge or training data.
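A minimal sketch of non-uniformly weighted linear score fusion; the weight values and the threshold below are invented for illustration:

```python
import numpy as np

def linear_fusion(scores, weights, threshold=0.5):
    """Weighted linear combination of per-modality scores, thresholded to a decision."""
    fused = float(np.dot(scores, weights))
    return fused, ("accept" if fused >= threshold else "reject")

weights = np.array([0.7, 0.3])       # e.g., trust the audio score more than the visual
fused, decision = linear_fusion(np.array([0.9, 0.4]), weights)
# fused = 0.7*0.9 + 0.3*0.4 = 0.75 -> "accept"
```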
Nonlinear Adaptive Fusion (via supervision) (Kernel, SVM)
[Figure: a nonlinear decision boundary in the (Score 1, Score 2) plane learned by a kernel SVM.]
Data/Feature Fusion (early work)
Simple and straightforward (good); suffers from the curse of dimensionality (bad); raises normalization issues.
Case study: bimodal human emotion recognition (also with a score fusion flavor)
Y. Wang and L. Guan, “Recognizing human emotional state from audiovisual signals,” IEEE Transactions on Multimedia, vol. 10, no. 5, pp. 936-946, August 2008.
Major indicators of emotion
• Speech
• Facial expression
• Body language: highly dependent on personality, gender, age, etc.
• Semantic meaning: two sentences can have the same lexical meaning but carry different emotional information
• …
Objective
To develop a generic, language- and cultural-background-independent system for recognizing human emotional state from audiovisual signals
Audio feature extraction
[Pipeline: Input speech → Preprocessing (noise reduction by wavelet thresholding; leading and trailing edge elimination) → Windowing (Hamming window, 512 points, 50% overlap) → Prosodic, MFCC, and Formant feature extraction → Audio feature set.]
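The windowing stage can be sketched in NumPy. The frame length (512 points), overlap (50%), and Hamming window follow the slide; the helper name `frame_signal` and the synthetic test signal are illustrative assumptions:

```python
import numpy as np

def frame_signal(x, frame_len=512, overlap=0.5):
    """Split a 1-D signal into overlapping frames, each multiplied by a Hamming window."""
    hop = int(frame_len * (1.0 - overlap))              # 256 samples at 50% overlap
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([x[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])

# Example: one second of a synthetic 440 Hz tone sampled at 8 kHz
x = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000.0)
frames = frame_signal(x)      # shape (n_frames, 512); these frames would feed
                              # the prosodic/MFCC/formant feature extractors
```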
Visual feature extraction
[Pipeline: Input image sequence → Key frame extraction (at maximum speech amplitude) → Face detection → Gabor filter bank → Feature mapping → Visual features.]
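As a rough illustration of the Gabor filter bank stage, the sketch below builds real-valued Gabor kernels at several orientations and scales. All parameter values (kernel size, wavelengths, sigma) and the feature computation are invented for illustration and do not come from the paper:

```python
import numpy as np

def gabor_kernel(ksize, theta, wavelength, sigma, gamma=0.5):
    """Real part of a 2-D Gabor kernel at orientation theta."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)          # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr)**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * xr / wavelength)
    return envelope * carrier

# Bank of 4 orientations x 2 wavelengths applied to a toy 32x32 face patch
patch = np.random.default_rng(0).random((32, 32))
bank = [gabor_kernel(15, t * np.pi / 4, wl, sigma=4.0)
        for t in range(4) for wl in (4.0, 8.0)]
# One crude feature per kernel: magnitude of the response at the patch centre
features = [abs(float(np.sum(patch[8:23, 8:23] * k))) for k in bank]
```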
The recognition system - with Decision Fusion
[Diagram: Input speech → preprocessing → windowing → prosodic, MFCC, and formant feature extraction (audio). Input video → key frame extraction → face detection → Gabor wavelet (visual). Feature selection feeds one classifier per emotion class (AN, DI, FE, HA, SA, SU); a decision module outputs the recognized emotion.]
Modular Networks
[Diagram: the network input feeds sub-networks E1, E2, ..., Er; their outputs y1, y2, ..., yj feed a decision module that outputs the recognized emotion.]
• Hierarchical structure
• Each sub-network Er is an expert system
• The decision module classifies the input vector as a particular class when Ynet = arg max_j yj
Experimental results
Accuracy (%):

Method            Anger   Disgust  Fear    Happiness  Sadness  Surprise  Overall
Audio             88.46   61.90    59.09   48.00      57.14    80.00     66.43
Visual            34.62   47.62    68.18   52.00      47.62    48.00     49.29
Audiovisual       76.92   76.19    77.27   56.00      61.90    72.00     70.00
Stepwise          80.77   61.90    77.27   80.00      76.19    76.00     75.71
Multi-classifier  88.46   80.95    77.27   80.00      80.95    84.00     82.14
Experiments were performed on 500 video samples from 8 subjects speaking 6 languages. Six emotion labels: Anger, Disgust, Fear, Happiness, Sadness, and Surprise. 360 samples (from six subjects) were used for training and the remaining 140 (from the other two subjects) for testing; there is no overlap between training and testing subjects.
Interaction Fusion
In general, not straightforward: fusion takes place dynamically on intermediate results.
Case studies: 1. Speech recognition 2. Image retrieval with audio information
Speech Recognition: Fusing information obtained from different processing methods
1. N. Joshi and L. Guan, “ASR with the combination of recognizers under non-stationary noise conditions,” Journal of Signal Processing Systems, vol. 58, no. 3, pp. 359-370, Mar 2010.
2. N. Joshi and L. Guan, “Combination of recognizers and fusion of features approach to missing data ASR under non-stationary noise conditions,” Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. IV, pp. 1041-1044, Honolulu, USA, April 2007.
The System Diagram
Interaction Level Fusion
Two separate HMM-based models: one on spectral features with missing data (MD), and one on MFCC features. The fused HMM model is used for the interaction-level fusion.
Speech Recognition
Ryerson University – Multimedia Research LaboratoryLing Guan, Paisarn Muneesawang, Yongjin Wang, Rui Zhang, Yun Tie, Adrian Bulzacki, Muhammad Talal Ibrahim
• CMN - Cepstral Mean Normalization
TABLE I. Recognition results (%) with test corpus + factory noise

Method                              SNR 18 dB   SNR 6 dB
Conventional MFCC                   83.7        64.7
Conventional MFCC, CMN              66.0        60.3
Conventional spectral features, MD  76.5        67.4
Proposed MFCC + MD                  84.5        67.5
Proposed MFCC, CMN + MD             88.6        73.5
Image Retrieval with Audio Cues
R. Zhang and L. Guan, “A collaborative Bayesian image retrieval framework,” Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc., pp. 1593-1596, Taipei, Taiwan, Apr 2009.
General Idea
[Diagram: a query image has visual information Q_V and audio information Q_A. The audio information drives semantic class weighting and weight propagation over the database; the visual information drives nearest-neighbor retrieval; Bayesian fusion combines the two for image matching.]
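A toy sketch of the Bayesian fusion idea above, assuming the audio query yields a prior over semantic classes and visual similarity plays the likelihood role. The class assignments, prior values, and similarity scores are invented for illustration:

```python
import numpy as np

# Toy Bayesian fusion for retrieval: audio induces a class prior,
# visual similarity acts as the likelihood, and their product ranks images.
rng = np.random.default_rng(1)
n_images, n_classes = 10, 3
image_class = rng.integers(0, n_classes, n_images)   # semantic class of each image
audio_prior = np.array([0.7, 0.2, 0.1])              # p(class | Q_A) from the audio query
visual_sim = rng.random(n_images)                    # visual similarity to the query

posterior = visual_sim * audio_prior[image_class]    # unnormalized p(image | Q)
posterior /= posterior.sum()
ranking = np.argsort(-posterior)                     # best-matching images first
```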
Experimental Setup
• Database: 4400 images collected from Flickr, featuring 44 kinds of animals
• Visual feature selection
• Audio feature selection: MFCC with frame length equal to 256
Experimental Results
Precision as a function of the number of retrieval iterations. Observations: 1. The major improvement is obtained within the first iterations. 2. Information fusion improves the performance, and audio relevance feedback further improves it.
Interface for Subjective Evaluation
Interface for Subjective Evaluation (2)
Not visually similar
Score Level Fusion
Can be straightforward or involve more analysis. Rigid, due to the limited information remaining at this level.
Case study: 1. Video retrieval based on audiovisual cues
Video Retrieval by Audiovisual Cues
1. P. Muneesawang, T. Amin and L. Guan, “A new learning algorithm for the fusion of adaptive audio-visual features for the retrieval and classification of movie clips,” Journal of Signal Processing Systems, vol. 59, no. 2, pp. 177-188, May 2010.
2. T. Amin, M. Zeytinoglu and L. Guan, “Application of Laplacian mixture model for image and video retrieval,” IEEE Transactions on Multimedia, vol. 9, no. 7, pp. 1416-1429, November 2007.
3. P. Muneesawang and L. Guan, “Adaptive video indexing and automatic/semi-automatic relevance feedback,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 15, no. 8, pp. 1032-1046, August 2005.
2. Video Retrieval
The scores are obtained independently and then combined.
[Diagram: audio features X^(A) feed a classifier for the audio channel, producing an audio score; visual features X^(V) feed a classifier for the video channel, producing a visual score; an SVM combines the two into a fused score.]
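The slide fuses the two scores with an SVM. As a dependency-free stand-in, the sketch below learns a linear decision on (audio score, visual score) pairs with a perceptron; it illustrates the same supervised score-fusion idea but does not reproduce the paper's actual classifier, and all score values are invented:

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Learn a linear decision w.x + b on the 2-D score vectors."""
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:      # misclassified: update
                w += yi * xi
                b += yi
    return w, b

X = np.array([[0.9, 0.8], [0.8, 0.9], [0.7, 0.9],    # relevant clips   (+1)
              [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]])   # irrelevant clips (-1)
y = np.array([1, 1, 1, -1, -1, -1])
w, b = train_perceptron(X, y)
fused = np.sign(X @ w + b)      # fused accept/reject decisions
```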
SVM Fusion
Visual Feature Representation
Visual: Adaptive Video Indexing (AVI) using visual templates; TFxIDF model; cosine distance for similarity matching.

TFxIDF weight of visual template j:

    f_v[j] = (fr[j] / max_j{fr[j]}) · log(N / n[j])

where fr[j] is the frequency of template j in the clip, N is the total number of clips, and n[j] is the number of clips containing template j. Each frame feature vector h_i is assigned to its nearest template (and its second nearest):

    c_i^(1) = arg min_{j ∈ {0,1,...,N_T−1}} || h_i − ĥ_j ||^2
    c_i^(2) = arg min_{j ∈ {0,1,...,N_T−1} \ {c_i^(1)}} || h_i − ĥ_j ||^2
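The TFxIDF weighting and cosine matching can be sketched directly; the template counts and database size below are toy values:

```python
import numpy as np

def tfidf(fr, n, N):
    """TFxIDF weight per visual template: (fr / max fr) * log(N / n)."""
    return (fr / fr.max()) * np.log(N / n)

def cosine(a, b):
    """Cosine similarity between two weighted template vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

N = 100                                    # clips in the database
n = np.array([50.0, 10.0, 80.0, 5.0])      # clips containing each of 4 templates
clip1 = tfidf(np.array([3.0, 1.0, 0.0, 2.0]), n, N)   # template counts in clip 1
clip2 = tfidf(np.array([2.0, 1.0, 1.0, 2.0]), n, N)   # template counts in clip 2
sim = cosine(clip1, clip2)                 # similarity score in (0, 1]
```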
Audio Feature Representation
Laplacian Mixture Model (LMM) of the wavelet coefficients of the audio signal.
Audio feature vector built from the model parameters (using an EM estimator):

    p(w) = p_1 L(w | b_1) + p_2 L(w | b_2),  p_1 + p_2 = 1

    f_a = [m_0, b_0, p_{1,i}, b_{1,i}, b_{2,i}],  i = 1, 2, ..., L−1

where p_{1,i}, b_{1,i}, b_{2,i} are the model parameters obtained from the i-th high-frequency subband.
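A compact EM sketch for a two-component, zero-mean Laplacian mixture, in the spirit of the LMM above; the data here are synthetic Laplacian samples rather than actual wavelet coefficients, and the initial parameter values are arbitrary:

```python
import numpy as np

def laplace_pdf(w, b):
    """Zero-mean Laplacian density with scale b."""
    return np.exp(-np.abs(w) / b) / (2.0 * b)

def em_lmm(w, iters=50):
    """Fit p(w) = p1 L(w|b1) + p2 L(w|b2) by EM."""
    p = np.array([0.5, 0.5])        # mixture weights p1, p2
    b = np.array([0.5, 5.0])        # scale parameters b1, b2
    for _ in range(iters):
        # E-step: responsibility of each component for each coefficient
        lik = np.stack([p[k] * laplace_pdf(w, b[k]) for k in range(2)])
        r = lik / lik.sum(axis=0)
        # M-step: weights and scales (weighted mean absolute deviation)
        p = r.mean(axis=1)
        b = (r * np.abs(w)).sum(axis=1) / r.sum(axis=1)
    return p, b

rng = np.random.default_rng(0)
w = np.concatenate([rng.laplace(0, 0.3, 2000),   # many small coefficients
                    rng.laplace(0, 3.0, 500)])   # a few large ones
p, b = em_lmm(w)
```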
LMM vs GMM (2 COMPONENTS)
Video Classification Result
Recognition rate obtained by the SVM-based fusion model, using a video database of 6,000 clips and five semantic concepts.
Multimodal Human Authentication with Signature, Iris and Fingerprint
Signature Recognition
Fingerprint Image Enhancement System Overview
[Diagram: a filter-bank based matcher and a minutiae-based matcher each produce a match decision; fusion of the two decisions yields the final decision.]
Iris Segmentation System Overview
[Diagram: two feature extraction streams produce iris codes; feature-level fusion combines the codes; matching yields the final decision.]
Fingerprint/Signature/Iris Fusion
[Diagram: each modality passes through feature extraction and matching to produce a match score; score fusion of the three match scores yields the final decision.]
Robot Applications
• Body tracking
• Camera tracking
• Movement recognition
• Face tracking
• Emotion recognition
• Hand gesture recognition
• Stereo vision / depth calculation
Robot Application: Domestic Helper - via emotion/intention recognition
1. Target people group: elderly and disabled people at homes or in community houses
2. Capable of simple gestures and body language
3. Capable of simple, and sometimes incomplete, verbal communication
Robot Application: Domestic Helper - via emotion/intention recognition
1. Help the elderly and the disabled with their daily life.
2. Entertain the people they look after.
3. Call the nurse or emergency when in need.
[Diagram: modules for behavior analysis, emotion analysis, and head tracking.]
Multimodal Fusion for Human Intention Recognition
[Diagram: audio cues and visual cues feed two analysis streams - analyze human emotions, and analyze human actions (hand gesture, body movement); their outputs are fused into the possible human intention.]
Challenges
Will fusion help in the problem at hand? What is the best fusion model for the problem at hand?
◦ Data/feature level,
◦ Interaction level,
◦ Score/decision level,
◦ Or multilevel.
New data analysis and mining tools need to be developed to address these issues, or existing tools may be revisited.
Challenge 1: Does fusion help?
[Diagram: Sensor 1 (audio) and Sensor 2 (visual) feed a fusion block; recognition by audio alone, by video alone, and by fusion are compared.]
Challenge 2: Fusion at which level?
[Diagram: two inputs, Data/Feature #1 and Data/Feature #2, may be fused at the data/feature level, at the interaction level, or at the score (decision) level.]
Information Entropy
Entropy is a measure of the uncertainty associated with a random variable. Uncertainty is useful information.

    Entropy: H(X) = −Σ_x p(x) log p(x)
    Conditional entropy: H(X|Y) = −Σ_{x,y} p(x, y) log p(x|y)
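Both quantities can be computed directly from a joint distribution; the 2x2 joint below is a toy example:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a distribution given as an array of probabilities."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])       # toy joint distribution p(x, y)
px = pxy.sum(axis=1)               # marginal p(x)
H_x = entropy(px)                  # H(X) = 1 bit here
H_xy = entropy(pxy.ravel())        # joint entropy H(X, Y)
H_y_given_x = H_xy - H_x           # chain rule: H(Y|X) = H(X,Y) - H(X)
```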
Multimodal source type
• Conflict: H(x), H(y) < H(x,y) - the joint uncertainty is larger (negative relevance)
• Redundancy: H(x,y) < H(x), H(y) - the joint uncertainty is smaller (positive relevance)
• Complementary: H(x) ≈ H(x,y) ≈ H(y) - the joint uncertainty is unchanged (weak relevance)
Multimodal fusion method
• A feature may conflict with other features. If the conflict is beyond a threshold, eliminate the conflicting feature to keep the total uncertainty at an acceptable level.
• Redundancy fusion ( A ∩ B ), according to the ranking of uncertainty values.
• Complementary fusion ( A ∪ B ).
Mapping from entropy to weight
Weights are the inverse of entropies: high entropy -> low confidence -> low weight; a corrupted feature -> maximum entropy -> zero weight; ∑w = 1.
H = 0 gives w = 1; H = Hmax gives w = 0; Hi = Hj gives wi = wj = 0.5.
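One concrete mapping consistent with all three rules above is w_i proportional to (Hmax − H_i), normalized to sum to 1; this specific formula is an assumption for illustration, not necessarily the exact mapping used in the paper:

```python
import numpy as np

def entropy_weights(H, H_max):
    """Map per-modality entropies to fusion weights: w_i ∝ (H_max - H_i), Σw = 1."""
    conf = H_max - np.asarray(H, dtype=float)   # residual certainty per modality
    return conf / conf.sum()

w = entropy_weights([0.0, 1.0], H_max=1.0)      # H=0 -> w=1, H=Hmax -> w=0
w_eq = entropy_weights([0.4, 0.4], H_max=1.0)   # equal entropies -> equal weights
```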
Entropy vs. Correlation in Fusion
Developed an entropy-based method, Kernel Entropy Component Analysis (KECA). Compared with correlation-based methods: KPCA (kernel principal component analysis) and KCCA (kernel canonical correlation analysis).
Z. Xie and L. Guan, “Multimodal information fusion of audio emotion recognition based on kernel entropy component analysis,” Int. J. of Semantic Computing, vol. 7, no. 1, pp. 25-42, Aug 2013.
• Experimental results
[Figure: recognition results of KECA compared with KPCA and KCCA on the RML and eNTERFACE databases.]
Summary
• Fusion: the coherent integration of multimedia multimodal information.
• It is a natural process for human beings, but not straightforward for machines.
• It may be carried out at different information levels, but how do we choose the right model?
• Several case studies were used to demonstrate the power of information fusion.
• Multiple challenges remain to be addressed.