Fei Wu

College of Computer Science, Zhejiang University

2014, April

Deep multi-modal embedding for cross-media retrieval

Outline

The shallow multi-modal embedding model:

Statistical dependency modeling

Probabilistic graphical modeling

Other methods: Ranking and Hashing

The deep multi-modal embedding model

How to utilize data of different modalities from different sources to understand our real world is a great challenge.

The Emerging Shift: From Multimedia to Cross-media

[Figure: data sources — image sharing sites, social media, short text, microblogs, webpages, video sharing sites, surveillance video, and other sensors]

Nowadays, many real-world applications involve multi-modal data.

Multi-modal data is very useful for describing events or topics: it is better to describe social events (e.g., VALSE) through the integration of webpages, images, video and other data objects.

Motivation and Background:

Cross-media Retrieval

Features: textual, visual, acoustic, temporal, attributes, and others

Properties: high-dimensional, heterogeneous, high-order

Issues: feature fusion; heterogeneous feature selection; cross-modal metric learning…

From Multimedia to Cross-media: Three Properties

Heterogeneous features are obtained from data in different modalities to denote their corresponding semantics.

(e.g., Flickr, YouTube, CNN, Yahoo, Facebook, Twitter)

Issues: near-duplicate detection; cross-domain learning; transfer learning…

Media data about the same topic/event comes from multiple sources, such as news websites, microblogs, mobile phones, social networking websites, and photo/video sharing websites.

From Multimedia to Cross-media: Three Properties

The virtual world (cyberspace) and the real world (reality) complement each other, as in Google Flu Trends.

From Multimedia to Cross-media: Three Properties

From Multimedia to Cross-media: Three Properties

“Big data hubris” is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis... The core challenge is that most big data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.

Lazer, D., Kennedy, R., King, G., Vespignani, A., The Parable of Google Flu: Traps in Big Data Analysis, Science, 343:1203-1205, 2014

The correlated data in different modalities is linked: the cross over data with different modalities.

The correlated data across multiple sources (domains or collections) is linked: the cross over multiple data collections/domains.

[Figure: crosses between images, tags, webpages and audio, within and across collections]

From Multimedia to Cross-media

The steps in utilizing cross-media: how to leverage different kinds of data across multiple sources for discovering knowledge:

Collect all correlated data from multiple sources to boost the understanding of objects, events, topics and knowledge.

Heterogeneous data: audio, video, webpages, …

Mission: can we map the heterogeneous data into one uniform space and perform multi-modal metric learning?

Multi-modal metric learning

Multi-modal embedding

The 21st century is the century of correlation learning. Terry Speed, A Correlation for the 21st Century, Science, 2011, 334:1502-1503

Professor Terry Speed, former chair of the Department of Statistics at UC Berkeley, published "A Correlation for the 21st Century" in Science in December 2011, arguing that the 21st century is the era of correlation learning: discovering the important latent relationships hidden in massive datasets has become essential.

Note: ever since the Pearson correlation was proposed in 1880, learning correlations from data has been regarded as a hard problem.

The 21st century is the century of correlation learning: the relationship between Asian haze and Pacific storms.

Wang, Y., M. Wang, R. Zhang, S. Ghan, Y. Lin, J. Hu, B. Pan, M. Levy, J. Jiang, M.J. Molina, Assessing the Impacts of Anthropogenic Aerosols on Pacific Storm Track Using A Multi-Scale Global Climate Model, Proc. Natl Acad. Sci. (PNAS) USA 111, doi/10.1073/pnas.1403364111 (2014)

How to select the most discriminative features to build an interpretable model for semantic understanding?

High-dimensional heterogeneous features are often over-complete for representing a given semantic concept.

Global features: color, texture, shape, …

Local features: SIFT, GLOH, LBP, …

SIFT or other local features? Color or other global features?

Non-embedding methods: Sparse representation with structural priors

Peng Zhao, Guilherme Rocha, and Bin Yu, The composite absolute penalties family for grouped and hierarchical variable selection, Annals of Statistics, 37:3468–3497, 2009

F. Bach, Structured Sparsity-Inducing Norms through Submodular Functions, Advances in Neural Information Processing Systems (NIPS), 2010

X. Chen, Q. Lin, S. Kim, J. Carbonell, E.P. Xing, Smoothing proximal gradient method for general structured sparse learning, Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), 2011

J. Mairal, B. Yu, Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows, arXiv:1204.4539v1, 2012

Structural penalties: composite absolute penalty, dictionary tree, tree-guided penalty, path coding penalty

Better utilization of the natural structures in data is critical to boost semantic understanding.

Structures in data: group, graph, tree, path…

Non-embedding methods: Sparse representation with structural priors

The number of features (p) is often much larger than the number of samples (n), i.e., p >> n (high-dimensional features).

Seek an interpretable model for feature selection, such as the lasso (Tibshirani, 1996), subset selection (Breiman et al., 1996), the group lasso (Yuan et al., 2006) and the elastic net (Zou et al., 2005).

Heterogeneous feature machines (Cao and Luo et al., 2009); face recognition via sparse representation (Wright and Ma, 2009).

p >> n (feature selection)

Tibshirani, R., Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58(1):267-288, 1996

Breiman, L., Heuristics of Instability and Stabilization in Model Selection, The Annals of Statistics, 24(6):2350-2383, 1996

L. Cao, J. Luo, F. Liang, and T. Huang, Heterogeneous Feature Machines for Visual Recognition, ICCV, 2009

Wright, J., Yang, A., Ganesh, A., Sastry, S., Ma, Y., Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210-227, 2009
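The lasso and its relatives cited above obtain interpretability by driving coefficients exactly to zero via soft-thresholding. A minimal coordinate-descent sketch, stdlib only; the names `soft_threshold` and `lasso_cd` and the unstandardized-design handling are illustrative assumptions, not any cited implementation:

```python
# A sketch of lasso regression via cyclic coordinate descent (pure Python).
def soft_threshold(z, t):
    """Shrink z toward zero by t; the source of exact zeros in the lasso."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize 0.5*||y - X w||^2 + lam*||w||_1, one coordinate at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # residual with feature j's own contribution removed
            r = [y[i] - sum(w[k] * X[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            zj = sum(X[i][j] ** 2 for i in range(n))
            w[j] = soft_threshold(rho, lam) / zj if zj else 0.0
    return w
```

With a moderate `lam`, weakly informative features receive exactly zero weight, which is the interpretable model the slide asks for.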

Non-embedding methods: Sparse representation with structural priors

Given large numbers of highly correlated features (structure), either avoid overfitting, as in Penalized Discriminant Analysis (Hastie et al., 1995) and Sparse Discriminant Analysis (Clemmensen et al., 2008), or introduce a structural penalty, as in Structured Sparsity-Inducing Norms (Bach, 2010) and Structural Grouping Sparsity (Fei Wu et al., 2010).

T. Hastie, A. Buja, and R. Tibshirani, Penalized Discriminant Analysis, The Annals of Statistics, 23(1):73–102, 1995

L. Clemmensen, T. Hastie, and B. Ersbll, Sparse Discriminant Analysis, online: http://www-stat.stanford.edu/ hastie/Papers/, 2008

F. Bach, Structured Sparsity-Inducing Norms through Submodular Functions, NIPS, 2010

Fei Wu, Yahong Han, Qi Tian, Yueting Zhuang, Multi-label Boosting for Image Annotation by Structural Grouping Sparsity, ACM Multimedia, 2010 (Full Paper)

Given high-dimensional features, there are many highly correlated features (structural priors).

Non-embedding methods: Sparse representation with structural priors

Our Solution: Input-Output structural grouping sparsity for image annotation

What kinds of structures can be exploited during image annotation?

Input (heterogeneous features): naturally grouped due to their different modalities.

Output (annotated tags): the correlations of tags can be modeled by a hierarchical tree to reflect their respective strong or weak correlations.

Non-embedding methods: Sparse representation with structural priors

Input: the high-dimensional heterogeneous features are naturally encoded into different groups due to their different modalities.

[Figure: the high-dimensional heterogeneous features are encoded into three groups]

Non-embedding methods: Sparse representation with structural priors

Our Solution: Input-Output structural grouping sparsity for image annotation

The grouping effect in feature selection: highly correlated features within the same group tend to be selected together, so some features within the same group can be selected at the same time.

Our Solution: Input-Output structural grouping sparsity for image annotation

Non-embedding methods: Sparse representation with structural priors

Different from the traditional lasso, group lasso and elastic net, our structural grouping penalty not only selects the groups of heterogeneous features, but also identifies the subgroup of homogeneous features within each selected group (group selection plus subgroup identification over high-dimensional heterogeneous features).

Our Solution: Input-Output structural grouping sparsity for image annotation

Non-embedding methods: Sparse representation with structural priors
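The group-then-subgroup selection above is typically realized with a group-wise shrinkage operator. A stdlib-only sketch of the group soft-threshold, the proximal operator of the l2 group norm; the function name and pure-Python form are illustrative assumptions, not the paper's solver:

```python
import math

def group_soft_threshold(v, t):
    """Proximal operator of t * ||v||_2: either zero out the whole
    group (group selection) or shrink it uniformly toward zero."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= t:
        return [0.0] * len(v)   # the whole group is dropped
    scale = 1.0 - t / norm      # otherwise shrink, keeping the direction
    return [scale * x for x in v]
```

Applying an elementwise soft-threshold inside each surviving group before this operator yields sparse-group behavior: selected groups with zeros inside them, mirroring the group/subgroup distinction above.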

Example annotated label sets: {animals, clouds, plant_life, sky}; {clouds, sky, structure}; {people, transport, water}; {animals, flower, plant}

Tree structure of annotated labels

Output: the correlations among tags can be well modeled by a tree structure obtained through hierarchical clustering, to boost image annotation.

Our Solution: Input-Output structural grouping sparsity for image annotation

Non-embedding methods: Sparse representation with structural priors

Input-Output structural grouping sparsity: heterogeneous group selection on the input side and tag correlation via a hierarchical tree on the output side, combined in an input-output penalty term.

Fei Wu, Yahong Han, Qi Tian, Yueting Zhuang, Multi-label Boosting for Image Annotation by Structural Grouping Sparsity, ACM Multimedia 2010 (Full Paper)

Yahong Han, Fei Wu, Qi Tian, Yueting Zhuang, Image Annotation by Input-Output Structural Grouping Sparsity, IEEE Transactions on Image Processing, 2012, 21(6):3066-3079

The experimental results of feature selection with Structural Grouping Sparsity

Non-embedding methods: Sparse representation with structural priors

Fei Wu, Ying Yuan, Yong Rui, Shuicheng Yan, Yueting Zhuang, Annotating Web Images using NOVA: NOn-conVex group spArsity, ACM Multimedia 2012 (Full Paper, non-convex group sparsity)

Yanan Liu, Fei Wu, Zhihua Zhang, Yueting Zhuang, Shuicheng Yan, Sparse Representation using nonnegative curds and whey, CVPR 2010, 3578-3585 (class labels)

Yahong Han, Fei Wu, Jinzhu Jia, Yueting Zhuang, Bin Yu, Multi-task Sparse Discriminant Analysis (MtSDA) with Overlapping Categories, AAAI 2010, 469-474 (multi-task learning)

Fei Wu, Yahong Han, Qi Tian, Yueting Zhuang, Multi-label Boosting for Image Annotation by Structural Grouping Sparsity, ACM Multimedia 2010 (Full Paper, group sparsity)

Yahong Han, Fei Wu, Jian Shao, Qi Tian, Yueting Zhuang, Graph-Guided Sparse Reconstruction for Region Tagging, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, 2981-2988 (graph structure)

Yahong Han, Fei Wu, Xinyan Lu, Yueting Zhuang, Qi Tian, Jiebo Luo, Correlated Attribute Transfer with Multi-task Graph-Guided Fusion, ACM Multimedia 2012 (Full Paper, graph structure)

Depending on whether feature selection operates on individual features or on groups of features, and whether the regularizers are convex or non-convex, the existing approaches can be classified into four categories.

Non-embedding methods: Sparse representation with structural priors

Cross-media retrieval

Text-query-image: finding a set of images that visually best illustrate a given text description;

Image-query-text: finding relevant textual documents that best match a given image.

Query textual document → ranked listwise image results

Motivation and Background:

Cross-media Retrieval

Challenge: the heterogeneity gap.

Many kinds of heterogeneous features can be obtained from multi-modal data (e.g., bags of visual words for images, bags of words for text). How to compare the similarity between low-level heterogeneous features?

Motivation and Background:

Cross-media Retrieval

CCA (Canonical Correlation Analysis) and its extensions: Kernel CCA, Sparse CCA, Sparse Structure CCA, 2D-CCA, local 2D-CCA, sparse 2D-CCA, 3D-CCA

Correlated multi-modal data (audio, video, webpages): find appropriate linear mappings that preserve the maximum correlation between multi-modal data.

Multi-modal embedding:

Statistical dependency modeling

[Figure: feature extraction produces 500-dimensional visual feature vectors and 400-dimensional acoustic feature vectors, between which a latent statistical relationship is learned]

Hong Zhang, Yueting Zhuang, Fei Wu, Cross-modal correlation learning for clustering on image-audio dataset, ACM Multimedia 2007, 273-276

Fei Wu, Hong Zhang, Yueting Zhuang, Learning Semantic Correlation for Cross-media Retrieval, ICIP 2006, 1465-1468

Canonical correlation analysis learns the statistical correlation between the low-level features of different media types and builds an isomorphic cross-media space, providing an effective mechanism for measuring similarity across media types.

Multi-modal embedding:

Statistical dependency modeling

[Figure: an image database and an audio database are mapped into a low-dimensional isomorphic subspace]

Multi-modal embedding:

Statistical dependency modeling

Let X' = X Wx and Y' = Y Wy; CCA seeks

    max_{Wx, Wy} corr(X', Y') = corr(X Wx, Y Wy)

The computation of the subspace mapping matrices:

X is the visual feature matrix (n × 500) extracted from the image training set; Y is the acoustic feature matrix (n × 400) extracted from the audio training set.

Canonical correlation analysis computes the statistical relationship between the two; the transformation matrices Wx and Wy are found via Lagrange multipliers, so that after linear dimensionality reduction the correlation between the two matrices is preserved to the greatest possible extent.
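These mappings can be computed by whitening each feature matrix and taking an SVD of the cross-covariance. A minimal NumPy sketch; the function `cca` and the small `reg` ridge term are illustrative assumptions, not the exact procedure of the cited work:

```python
import numpy as np

def cca(X, Y, k=1, reg=1e-6):
    """Find Wx, Wy maximizing corr(X @ Wx, Y @ Wy) for the top k pairs.
    Returns the two mappings and the canonical correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])  # ridge keeps Cholesky stable
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # whiten both views, then SVD the whitened cross-covariance
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(M)
    Wx = np.linalg.solve(Lx.T, U[:, :k])   # un-whiten the directions
    Wy = np.linalg.solve(Ly.T, Vt[:k].T)
    return Wx, Wy, s[:k]
```

The singular values `s` are the canonical correlations; after projection, cross-media items can be compared directly in the shared subspace.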

Multi-modal embedding:

Statistical dependency modeling

Latent Dirichlet Allocation and its extensions (probabilistic graphical models): Correspondence LDA, Topic-regression Multi-modal LDA

[Example image caption:] The lion (Panthera leo) is one of the four big cats in the genus Panthera and a member of the family Felidae. With some males exceeding 250 kg in weight, it is the second-largest living cat after the tiger. Wild lions currently exist in sub-Saharan Africa and in Asia (where an endangered remnant population resides in Gir Forest National Park in India), while other types of lions have disappeared from North Africa and Southwest Asia in historic times.

Model the correlations of multi-modal data at the latent semantic (topic) level across modalities.

Multi-modal embedding:

Probabilistic graphical modeling

Each topic is a distribution over words

Each document is a mixture of topics

Each word is drawn from one of those topics

D. Blei, A. Ng, and M. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research, 3:993–1022, January 2003

Multi-modal embedding:

Probabilistic graphical modeling

Variables:
◦ Latent variables: θ, β and z denote the topic proportions, the topics and the topic assignments;
◦ Observed variables: w denotes the observed words;
◦ Hyperparameters: α and η are the priors on the topic proportions and the topics;
◦ In addition, without loss of generality, D, N and K denote the number of documents, the number of words per document, and the number of topics.

Generative model:
1. Draw each topic βk ∼ Dir(η).
2. For each document, draw topic proportions θd ∼ Dir(α).
3. For each position (token) in each document, sample a topic zn ∼ Mult(θd).
4. For each position in each document, generate a word wn ∼ Mult(βzn).

Characteristics:
◦ The discovered topics and the model itself are interpretable;
◦ It captures word co-occurrence;
◦ It generalizes well and is extensible.
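The four generative steps above can be sampled directly. A stdlib-only sketch; `lda_generate` and the normalized-Gamma Dirichlet draw are illustrative assumptions:

```python
import random

def lda_generate(beta, alpha, n_words, seed=0):
    """Sample one document from the LDA generative process.
    beta: K x V topic-word distributions; alpha: length-K Dirichlet prior."""
    rng = random.Random(seed)
    # theta_d ~ Dir(alpha), via normalized Gamma draws
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    theta = [x / sum(g) for x in g]
    doc = []
    for _ in range(n_words):
        z = rng.choices(range(len(beta)), weights=theta)[0]       # z_n ~ Mult(theta_d)
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]  # w_n ~ Mult(beta_{z_n})
        doc.append(w)
    return doc
```

Inference inverts this process: given the observed words, recover θ, β and z, which is what the multi-modal LDA extensions build on.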

Multi-modal embedding:

Probabilistic graphical modeling

D. Blei and M. Jordan, Modeling annotated data, SIGIR 2003

(Unsupervised) Correspondence-LDA models: the joint distribution of an image and its caption; the conditional distribution of words given an image; and the conditional distribution of words given a particular region of an image.

Multi-modal embedding:

Probabilistic graphical modeling

Upstream

Downstream

Multi-modal embedding:

Probabilistic graphical modeling

LDA itself is an unsupervised clustering model and cannot be applied directly to classification. Supervised LDA methods therefore emerged:

Upstream supervised LDA: the assignment of (visual) words is conditioned on the document class.

Downstream supervised LDA: the document class is generated conditioned on the topic assignments of the (visual) words.

Multi-modal embedding:

Probabilistic graphical modeling

Upstream supervised LDA: the assignment of (visual) words is conditioned on the document class.

Downstream supervised LDA: the document class is generated conditioned on the topic assignments of the (visual) words.

Fei-Fei Li, Pietro Perona, A Bayesian Hierarchical Model for Learning Natural Scene Categories, CVPR 2005

Chong Wang, David Blei, Li Fei-Fei, Simultaneous Image Classification and Annotation, CVPR 2009

Upstream Downstream

Multi-modal embedding:

Probabilistic graphical modeling

Multi-Instance Multi-Label LDA: the topic decided by the visual information and the topic decided by the tag information should be consistent, leading to the correct label assignment.

C.-T. Nguyen, D.-C. Zhan, and Z.-H. Zhou, Multi-modal image annotation with multi-instance multi-label LDA, IJCAI, 2013

Pairwise ranking: PAMIR [Grangier & Bengio 2008] and SSI [Bai et al. 2010]

Listwise ranking: LSCMR [Lu et al. 2013] or bi-directional ranking [Wu et al. 2013]

B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi, O. Chapelle, and K. Weinberger, Learning to rank with (a lot of) word features, Information Retrieval, 13(3):291–314, 2010

D. Grangier and S. Bengio, A discriminative kernel-based approach to rank images from text queries, T-PAMI, 30(8):1371–1384, 2008

Fei Wu, Xinyan Lu, Yin Zhang, Zhongfei Zhang, Shuicheng Yan, Yueting Zhuang, Cross-Media Semantic Representation via Bi-directional Learning to Rank, ACM Multimedia (Full Paper), 877-886, 2013

Xinyan Lu, Fei Wu, Siliang Tang, Zhongfei Zhang, Xiaofei He, Yueting Zhuang, A low rank structural large-margin method for cross-modal ranking, SIGIR 2013 (Full Paper), 433-442, 2013

Multi-modal embedding:

Ranking based embedding

Pairwise ranking: PAMIR [Grangier & Bengio 2008] and SSI [Bai et al. 2010]

D. Grangier and S. Bengio, A discriminative kernel-based approach to rank images from text queries, T-PAMI, 30(8):1371–1384, 2008

B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi, O. Chapelle, and K. Weinberger, Learning to rank with (a lot of) word features, Information Retrieval, 13(3):291–314, 2010

Multi-modal embedding:

Ranking based embedding

Listwise ranking: Learn a multi-modal ranking function to preserve the orders of relevance of multi-modal data.

Latent space embedding: discover the correlations between multi-modal data.

Multi-modal embedding:

Ranking based embedding

Multi-modal embedding is considered from the perspective of optimizing a listwise ranking and a latent space embedding while taking advantage of bi-directional ranking examples.

Bi-directional ranking: both text-query-image and image-query-text ranking examples are utilized during training to achieve better performance.

Fei Wu, Xinyan Lu, Yin Zhang, Zhongfei Zhang, Shuicheng Yan, Yueting Zhuang, Cross-Media Semantic Representation via Bi-directional Learning to Rank, ACM Multimedia (Full Paper), 877-886, 2013

Xinyan Lu, Fei Wu, Siliang Tang, Zhongfei Zhang, Xiaofei He, Yueting Zhuang, A low rank structural large-margin method for cross-modal ranking, SIGIR 2013 (Long Paper), 433-442, 2013

Multi-modal embedding:

Ranking based embedding

Step 1: Modeling of multi-modal correlation

Text-query-image ranking examples (text queries → ranked images) capture the text-to-image correlation; image-query-text ranking examples (image queries → ranked text documents) capture the image-to-text correlation.

Multi-modal embedding:

Ranking based embedding

Step 2: The latent space embedding

All m-dimensional queries q and n-dimensional target documents d are mapped into a k-dimensional latent space by U and V respectively, in which data objects with the same semantics are grouped so as to directly minimize a listwise ranking loss (e.g., MAP).

Only a single latent layer: a shallow model.

Multi-modal embedding:

Ranking based embedding
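Once U and V are learned, cross-modal relevance reduces to an inner product in the shared k-dimensional latent space. A minimal NumPy sketch; `latent_score` and the matrix shapes are illustrative assumptions:

```python
import numpy as np

def latent_score(q, d, U, V):
    """Relevance of an n-dim document d to an m-dim query q:
    map both into the k-dim latent space (U: k x m, V: k x n)
    and take the inner product there."""
    return float((U @ q) @ (V @ d))
```

Ranking a document collection for a query is then just sorting by this score, which is what the listwise loss optimizes end to end.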

Bi-directional structural learning: bi-directional ranking examples (the image-query-text direction and the text-query-image direction) drive both the image-to-text and text-to-image correlations within one latent semantic embedding.

Multi-modal embedding:

Ranking based embedding

How to learn the mapping matrices (i.e., U and V) and perform the latent space embedding? How to learn the ranking function (query text document → ranked listwise image results)?

Multi-modal embedding:

Ranking based embedding

The bi-directional structural learning to rank is formulated as a supervised structural learning problem combining a structural risk (U and V for latent space embedding), a bi-directional empirical risk (an upper-bound loss), and the constraints for bi-directional ranking examples:

Maximize the margins between the true ranking and all other possible rankings while the constraints of bi-directional ranking are enforced.

Multi-modal embedding:

Ranking based embedding

Here are the constraints for the image-query-text direction, which state that the slack variable (or empirical risk) is at least the ranking loss whenever a wrong ranking is predicted; they involve the compatibility functions and the loss function (MAP in this paper).

The constraints for the text-query-image direction can be defined in the same way.

Multi-modal embedding:

Ranking based embedding

The compatibility functions F are defined separately for the two directions: ranking images from text queries, and ranking texts from image queries.

Note that the compatibility functions care about the relative ranking position between a relevant document and an irrelevant document. As a result, the ranking that maximizes the compatibility function F equals the ranking given by the ranking function.

Multi-modal embedding:

Ranking based embedding

Alternate between optimizing the model parameters (U and V) and updating the constraint set with a new batch of rankings.

Multi-modal embedding:

Ranking based embedding

Experiments:

Datasets: Wikipedia feature articles and NUS-WIDE

                               Wikipedia      NUS-WIDE
BoVW vocabulary size (image)   1000           500
BoW vocabulary size (text)     5000           1000
Avg. # of words / image        117.5          7.73
Documents partition            1500/500/866   2664/23977/106567
Queries partition              1500/500/866   2664/2000/2000

Partitions are ordered by training/validation/test.

Performance measurement: MAP@R (Mean Average Precision), with R set to 50 or all.
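MAP@R averages, over all queries, the average precision of the top R results. A stdlib-only sketch of the per-query building block; the function name and the 1/0 relevance-flag format are illustrative assumptions:

```python
def average_precision(ranked_relevance):
    """Average precision of one ranked list, given 1/0 relevance
    flags in ranked order; MAP is the mean of this over all queries."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank   # precision at each relevant hit
    return total / hits if hits else 0.0
```

For MAP@R, each list is truncated to its top R entries before this computation.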

Wikipedia dataset in terms of MAP@R:

            Text Query   Text Query   Image Query   Image Query
            (R=50)       (R=all)      (R=50)        (R=all)
CCA         0.2343       0.1433       0.2208        0.1451
PAMIR       0.3093       0.1734       0.1797        0.1779
SSI         0.2821       0.1664       0.2344        0.1759
Uni-CMSRM   0.3663       0.2021       0.2570        0.2229
Bi-CMSRM    0.3981       0.2123       0.2599        0.2528

NUS-WIDE dataset in terms of MAP@R:

            Text Query   Text Query   Image Query   Image Query
            (R=50)       (R=all)      (R=50)        (R=all)
CCA         0.1497       0.0851       0.1523        0.0883
PAMIR       0.2046       0.1184       0.5003        0.2410
SSI         0.2156       0.1140       0.4101        0.1992
Uni-CMSRM   0.2781       0.1424       0.4997        0.2491
Bi-CMSRM    0.3224       0.1453       0.4950        0.2380

Experiments:

Comparative Results

Experiments:

Illustrative examples

Mission: learn hashing functions that faithfully preserve the intra-modality and inter-modality similarities and map the high-dimensional multi-modal data to compact binary codes.

A multi-modal document (one image with its narrative text) is mapped by the hashing function to binary codes such as 011101, 100111, 011100, …

Multi-modal embedding:

Multi-modal hashing

Hashing is a promising way to speed up approximate nearest neighbor (ANN) similarity search, which makes a tradeoff between accuracy and efficiency.

Multi-modal embedding:

Multi-modal hashing

Three kinds of hashing approaches:

Homogeneous features: Locality Sensitive Hashing; Spectral Hashing

Heterogeneous features (e.g., color, texture, shape, …): Multiple Feature Hashing; Composite Hashing (CHMIS)

Multimodal data (image, text, audio, …): Cross-Modality Similarity-Sensitive Hashing (CMSSH); Cross-View Hashing (CVH); Multimodal Latent Binary Embedding (MLBE)

Multi-modal embedding:

Multi-modal hashing
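The locality-sensitive hashing listed above has a classic cosine-similarity variant: one bit per random hyperplane, so nearby vectors collide with high probability. A stdlib-only sketch; the function names and tuple encoding are illustrative assumptions:

```python
import random

def make_planes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_hash(x, planes):
    """Cosine-similarity LSH: one bit per hyperplane, set by the sign
    of the projection of x onto that hyperplane's normal."""
    return tuple(1 if sum(p_i * x_i for p_i, x_i in zip(p, x)) >= 0 else 0
                 for p in planes)
```

Vectors with a small angle between them agree on most bits, so buckets keyed by the code give sublinear candidate retrieval, the accuracy/efficiency tradeoff the slide mentions.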

Multi-modal hashing exploits the intrinsic intra-modality and inter-modality similarities to learn the appropriate relationships among data objects and to provide efficient search across different modalities.

Our Approach: Sparse Multi-modal Hashing

Fei Wu, Zhou Yu, Yi Yang, Siliang Tang, Yueting Zhuang, Sparse multi-modal hashing, IEEE Transactions on Multimedia, 16(2):427-439, 2014

Multi-modal embedding:

Multi-modal hashing

Step 1: The joint learning of multi-modal dictionaries

[Figure: multi-modal objects (e.g., a dinosaur image with the text "dinosaur, jaw, Jurassic" and a football image with the text "sport, football, NFL") are encoded against a coupled image dictionary and text dictionary; multi-modal correlation modeling preserves both the intra-modality and the inter-modality similarity]

Multi-modal embedding:

Multi-modal hashing

Our approach couples multi-modal dictionary learning (approximately reconstructing each data object as a weighted linear combination of a small number of "basis vectors", i.e., sparse reconstruction) with a regularized hypergraph Laplacian penalty (modeling the multi-modal correlation).

Multi-modal embedding:

Multi-modal hashing

Step 2: The generation of sparse codesets

Sparse reconstruction maps multi-modal objects to sparse codesets. Both intra-modality and inter-modality similarities are preserved. For example, two "dinosaur" images share the same sparse codeset, e.g., (3, 4, 6, 7), and have a similar codeset, e.g., (2, 4, 6, 7), to their relevant text ("dinosaur, ancient, fossil", etc.). In contrast, the "dinosaur" images have a clearly different codeset, e.g., (1, 4, 5, 8), from irrelevant text such as "sport, football, NFL".

Multi-modal embedding:

Multi-modal hashing

Activate the most relevant components and induce a compact codeset for each data object from its corresponding sparse coefficients.

Multi-modal embedding:

Multi-modal hashing
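Inducing a codeset from the sparse coefficients amounts to recording which dictionary atoms are active. A stdlib-only sketch; the function name, the top-k rule and the threshold are illustrative assumptions, not the paper's exact procedure:

```python
def sparse_codeset(coeffs, k=4, eps=1e-6):
    """Compact codeset for one data object: the indices of the k
    largest-magnitude (i.e., most relevant) sparse coefficients."""
    top = sorted(range(len(coeffs)), key=lambda i: -abs(coeffs[i]))[:k]
    return tuple(sorted(i for i in top if abs(coeffs[i]) > eps))
```

Two objects are then compared by the overlap of their codesets, which carries over the neighborhood structure of the underlying sparse codes.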

[Figures: mAP scores on NUS-WIDE and WIKI]

Multi-modal embedding:

Multi-modal hashing

Our Approach: Supervised coupled dictionary learning with group structures (SliM2)

SliM2 extends uni-modal dictionary learning (DL) to multi-modal DL and jointly learns a set of mapping functions across different modalities. Furthermore, SliM2 utilizes the label information to discover the shared structures inside each modality among samples of the same class.

Yueting Zhuang, Yanfei Wang, Fei Wu, Yin Zhang, Weiming Lu, Supervised Coupled Dictionary Learning with Group Structures for Multi-modal Retrieval, AAAI 2013

Multi-modal embedding:

Multi-modal hashing

Dictionary learning methods: the penalty for traditional DL versus the penalty for DL with a group norm.

If all images in one class (category) are taken as a group, dictionary learning can utilize the label information as follows:
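The label-aware objective just described can be written down directly. A minimal NumPy sketch; `dl_group_objective`, the Frobenius-norm group penalty and the grouping of code columns by class label are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def dl_group_objective(X, D, A, groups, lam):
    """Reconstruction error plus a group norm over the sparse codes.
    X: d x n data, D: d x k dictionary, A: k x n sparse codes,
    groups: lists of column indices, one list per class label."""
    recon = np.linalg.norm(X - D @ A, 'fro') ** 2
    # samples of the same class form one group, encouraging them
    # to activate the same dictionary entries
    group_norm = sum(np.linalg.norm(A[:, g], 'fro') for g in groups)
    return recon + lam * group_norm
```

Minimizing over D and A alternately (codes with D fixed, then D with A fixed) is the standard dictionary-learning loop this penalty plugs into.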

Multi-modal embedding:

Multi-modal hashing

The objective combines: the reconstruction errors of the different modalities; a group norm with label information, which assumes data of the same category share the same dictionary entries; and relatively simple (linear) mappings between the sparse coefficients.

Multi-modal embedding:

Multi-modal hashing

Most existing cross-media hashing approaches share the common idea of learning a separate hash function for each modality and mapping the data from different modalities into a shared low-dimensional Hamming space. However, such a binary embedding strategy often yields poor indexing performance, because the shared embedding space is not semantically discriminative, which is critically important for cross-media retrieval.

Our Approach: Discriminative coupled dictionary hashing

Zhou Yu, Fei Wu, Yi Yang, Qi Tian, Jiebo Luo, Yueting Zhuang, Discriminative Coupled Dictionary Hashing for Fast Cross-media Retrieval, SIGIR 2014 (Long Paper, accepted)

Multi-modal embedding:

Multi-modal hashing

Goal: learn dictionaries that are both discriminative and coupled.

Discriminative: data from the same category have similar sparse representations (i.e., sparse codes).

Coupled: not only the intra-modality similarity but also the inter-modality correlation is preserved.

Zhou Yu, Fei Wu, Yi Yang, Qi Tian, Jiebo Luo, Yueting Zhuang, Discriminative Coupled Dictionary Hashing for Fast Cross-media Retrieval, SIGIR 2014 (Long Paper, accepted)

Multi-modal embedding:

Multi-modal hashing

Multi-modal embedding:

Multi-modal hashing

The Discriminative Capability of the Coupled Dictionary Space

Multi-modal embedding:

Multi-modal hashing

Deep learning attempts to learn multiple levels of representation, corresponding to different levels of abstraction. The levels in these learned statistical models correspond to distinct levels of concepts, where higher-level concepts are defined from lower-level ones, and the same lower-level concepts can help to define many higher-level concepts.

Boltzmann machines, auto-encoders, recursive neural networks, convolutional neural networks

Bengio, Y., Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2:1–15, 2009

Nicola Jones, The learning machines, Nature, 08 January 2014

Multi-modal embedding:

From the shallow model to the deep model

Galen Andrew, Raman Arora, Jeff Bilmes and Karen Livescu, Deep Canonical Correlation Analysis, International Conference on Machine Learning, 2013

Multi-modal embedding:

From the shallow model to the deep model

Multi-modal embedding:

From the shallow model to the deep model

J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, A.Y. Ng, Multimodal deep learning, ICML 2011

A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov, DeViSE: A Deep Visual-Semantic Embedding Model, NIPS 2013

Images: convolutional neural network; documents: recursive neural network

Multi-modal embedding:

From the shallow model to the deep model

Nitish Srivastava, Ruslan Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, NIPS 2012

Multi-modal embedding:

From the shallow model to the deep model

Nitish Srivastava, Ruslan Salakhutdinov, Geoffrey Hinton, Modeling Documents with Deep Boltzmann Machines, UAI 2013

Multi-modal embedding:

From the shallow model to the deep model

W. Wang, B.C. Ooi, X. Yang, D. Zhang, Y. Zhuang, Effective MultiModal Retrieval based on Stacked AutoEncoders, VLDB 2014

Multi-modal embedding:

From the shallow model to the deep model

Yan Liu, Sheng-hua Zhong, Wenjie Li, Query-oriented Multi-document Summarization via Unsupervised Deep Learning, AAAI 2012

Multi-modal embedding:

From the shallow model to the deep model

Dong Yu, Shizhen Wang, and Li Deng, Sequential Labeling Using Deep-Structured Conditional Random Fields, IEEE Journal of Selected Topics in Signal Processing

Multi-modal embedding:

From the shallow model to the deep model

Deep structured models: deep conditional random fields, combining feature representation and structured learning

Conclusion: introduce natural priors into the feedback of deep learning, e.g., by introducing a concept hierarchy tree at the top layer of a deep network.

Freebase: more than 40 million entities and over 2 billion facts describing entity-entity relations

NELL (CMU, Never-Ending Language Learning): more than 50 million descriptions of entity-entity relations

ReVerb: 15 million descriptions of entity-entity relations

Conclusion: emphasize the exploitation of knowledge graphs.

Clickage: Towards bridging semantic and intent gaps via mining click logs of search engines, ACM MM 2013

Conclusion: emphasize user interaction behavior (crowd intelligence).

How to better exploit the massive relevance feedback provided by real-world search engines?

Conclusion: emphasize user interaction behavior (crowd intelligence).

Crowdsourcing behavior can be collected via Amazon's Mechanical Turk, reCAPTCHA, the ESP Game, and so on.

ICML 13 Workshop: Machine Learning Meets Crowdsourcing; NIPS 13 Workshop on Crowdsourcing: Theory, Algorithms and Applications; ACM MM 2013 CrowdMM'13; Crowdsourcing for Multimedia 2014

How to better use crowd behavior to understand cross-media big data?

Incorporating user interaction behavior (crowd computing):

In January 2013, IEEE Computer published an article calling for the establishment of such a computing paradigm.

Turing Award winner and then ACM president Vinton G. Cerf raised the idea of "cognitive implants" in Communications of the ACM in February 2014.

Noam Miller et al., Both information and social cohesion determine collective decisions in animal groups, PNAS, 110(13):5263-5268, March 2013

A PNAS article published in March 2013 pointed out that the crucial decision-making step in human cognition is influenced by an individual's prior knowledge (past experiences or priors), by data (i.e., other people's behavior in cyberspace and the real world), and by models of interpersonal interaction (e.g., isolation, social identity); all of these factors are crucial to the final decision. However, the interpersonal-interaction model in that paper is descriptive, not computable.

Unifying algorithms, platforms and interaction: Berkeley's new lab founded under the US NSF "BIGDATA" program, the AMP Lab (algorithms, machines and people).

Put it together (adapted from Dr. Wei-Ying Ma's slides):

Unsupervised deep learning (bottom-up) / supervised deep learning (top-down for tuning): learn representations (features) from data and disentangle the structure in data; big data + big infrastructure -> big model + big learning

Knowledge mining (top-down) / the construction of knowledge bases: entities and relations, alternative expressions, etc.; inference and reasoning

Crowdsourcing (human computation in the loop): assist automatic algorithms to achieve higher accuracy; treat human-computer interaction as a new way to collect more signals and labelled data for supervised learning