Source: valser.org/2014/word/feiwu.pdf
TRANSCRIPT
Fei Wu
College of Computer Science, Zhejiang University
2014, April
Deep multi-modal embedding for cross-media retrieval
The shallow multi-modal embedding model:
Statistical dependency modeling
Probabilistic graphical modeling
Other methods: Ranking and Hashing
The deep multi-modal embedding model
Outline
How to utilize data with different
modalities from different sources to
understand our real world becomes
a great challenge
Image Sharing Sites
Social Media
Short Text
Microblog
Other Sensors
… Video
Surveillance
Video Sharing Sites
The Emerging Shift: From Multimedia to Cross-media
Webpages
Nowadays, many real-world applications involve multi-modal
data.
Multi-modal data is very useful for the descriptions of events or topics.
It is better to describe social events (e.g., VALSE) by the integration of
webpages, images, video and other data objects …
Webpages/Documents, Images
Motivation and Background:
Cross-media Retrieval
textual
visual
acoustical
temporal
attributes
Others
……
High-dimensional
Heterogeneous
High-order
Features
Issues:
Feature fusion; Heterogeneous feature selection; Cross-modal
metric learning…
From Multimedia to Cross-media: Three Properties
Heterogeneous features are
obtained from data in different
modalities to denote their
corresponding semantics.
flickr
YouTube
CNN
Yahoo
Issues:
Near-duplicate detection; Cross-domain learning; Transfer learning…
Media data about the same
topic/event come from multiple
sources, such as news media
websites, microblogs, mobile
phones, social networking
websites, and photo/video sharing
websites.
From Multimedia to Cross-media: Three Properties
The virtual world (cyberspace) and the real
world (reality) complement each other,
e.g., Google Flu Trends
Cyberspace Reality Complement
From Multimedia to Cross-media: Three Properties
“Big data hubris” is the often
implicit assumption that big
data are a substitute for, rather
than a supplement to,
traditional data collection and
analysis...The core challenge is
that most big data that have
received popular attention are
not the output of instruments
designed to produce valid and
reliable data amenable for
scientific analysis.
Lazer, D., Kennedy, R., King, G., Vespignani, A., The Parable of Google Flu: Traps
in Big Data Analysis, Science, 343:1203-1205, 2014
The correlated data in different modalities is linked.
The correlated data across multiple sources (domains or collections) is linked.
The cross from data with different modalities
The cross from multiple data collections/domains
(Figure: images, tags, webpages, and audio cross-linked across modalities and collections.)
From Multimedia to Cross-media
The steps of utilization of cross-media
How to leverage different kinds of data across
multiple sources for discovering knowledge:
Collect all correlated data from multiple sources to
boost the understanding of objects, events, topics and
knowledge.
Audio
Video
Webpage
Heterogeneous Data
Mission: Can we map the heterogeneous data into one
uniform space and perform multi-modal metric learning?
Multi-modal metric learning
Multi-modal embedding
21世纪是数据关联学习的世纪 Terry Speed, A Correlation for the 21st Century, Science, 2011,334,1502-1503
4
加州大学伯克利分校统计系前任系主任Terry Speed教授于2011年12月在Science发表题为“A Correlation for the 21st
Century”的论文,提出“21世纪是关联性学习的时代”,即从庞大数据集中发现数据之间所潜在的重要关系变得十分重要。
注:从1880年提出Pearson
correlation 以来,数据关联学习一直被认为是一个难题。
21世纪是数据关联学习的世纪 亚洲雾霾与太平洋风暴的关系
4
Wang, Y. , M. Wang, R. Zhang, S. Ghan, Y. Lin, J. Hu, B. Pan, M. Levy, J. Jiang, M.J. Molina,
Assessing the Impacts of Anthropogenic Aerosols on Pacific Storm Track Using A Multi-Scale Global
Climate Model, Proc. Natl Acad. Sci(PNAS). USA 111, doi/10.1073/pnas.1403364111 (2014).
How to select the most discriminative features to build an interpretable
model for semantic understanding?
High-dimensional heterogeneous features are often over-complete for the
representation of certain semantics.
Global Features
Local Features
Color
Texture
Shape
….
SIFT
GLOH
LBP
….
SIFT or other local features?
Color or other global features?
Non-embedding methods: Sparse representation with structural priors
Peng Zhao, Guilherme Rocha, and Bin Yu, The composite absolute penalties family for grouped and hierarchical
variable selection, Annals of Statistics, 37:3468–3497,2009
F. Bach, Structured Sparsity-Inducing Norms through Submodular Functions, Advances in Neural Information
Processing Systems (NIPS), 2010
X. Chen, Q. Lin, S. Kim, J. Carbonell, E.P. Xing, Smoothing proximal gradient method for general structured
sparse learning, Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), 2011
J. Mairal, B. Yu, Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows ,
arXiv:1204.4539v1. 2012.
Composite Absolute Penalty; Dictionary Tree; Tree-guided penalty; Path Coding Penalty
The better utilization of natural structures in data is critical to boost
semantic understanding
Structures in data : Group, Graph, Tree, Path…
Non-embedding methods: Sparse representation with structural priors
The number of features (p) is often much larger than the number of samples (n),
that is to say p >> n (high-dimensional features)
Seek an interpretable model for feature
selection, such as the lasso (Tibshirani, 1996), subset
selection (Breiman et al., 1996), group lasso (Yuan
et al., 2006) and elastic net (Zou et al., 2005)
Heterogeneous feature machines (Cao, Luo et al.,
2009); face recognition via sparse representation (Wright
et al., 2009)
p >> n (feature
selection)
Tibshirani, R., Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society:
Series B (Statistical Methodology), 58(1): 267-288,1996
Breiman, L., Heuristics of Instability and Stabilization in Model Selection, The Annals of Statistics,
24(6):2350-2383,1996
L. Cao, J. Luo, F. Liang, and T. Huang, Heterogeneous Feature Machines for Visual Recognition, ICCV,
2009
Wright, J., Yang, A., Ganesh, A., Sastry, S., Ma, Y., Robust face recognition via sparse representation,
IEEE Transactions on Pattern Analysis and Machine intelligence, 31(2):210-227,2009
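The lasso above admits a simple closed-form coordinate update built on soft-thresholding; the sketch below (plain Python, illustrative only, not any of the cited implementations) shows that update on a tiny dense problem:

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: the closed-form update behind lasso
    coordinate descent; shrinks z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, iters=100):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # partial residual that excludes feature j's current contribution
            r = [y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            norm = sum(X[i][j] ** 2 for i in range(n)) / n
            b[j] = soft_threshold(rho, lam) / norm if norm > 0 else 0.0
    return b
```

With lam = 0 this reduces to least squares; increasing lam drives more coefficients exactly to zero, which is what makes the model interpretable.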
Non-embedding methods: Sparse representation with structural priors
Highly correlated features
(structure)
Avoid overfitting in situations with large
numbers of highly correlated features, e.g.,
Penalized Discriminant Analysis (Hastie
et al., 1995) and Sparse Discriminant Analysis
(Clemmensen et al., 2008), or introduce a
structural penalty, e.g., Structured
Sparsity-Inducing Norms (Bach, 2010) and
Structural Grouping Sparsity (Fei Wu et
al., 2010)
T. Hastie, A. Buja, and R. Tibshirani, Penalized Discriminant Analysis, The Annals of Statistics,23(1):73–
102, 1995
L. Clemmensen, T. Hastie, and B. Ersbøll, Sparse Discriminant Analysis, online: http://www-stat.stanford.edu/~hastie/Papers/, 2008
F. Bach. Structured Sparsity-Inducing Norms through Submodular Functions, NIPS, 2010
Fei Wu, Yahong Han, Qi Tian, Yueting Zhuang, Multi-label Boosting for Image Annotation by Structural
Grouping Sparsity, ACM Multimedia,2010 (FULL Paper)
Given high-dimensional features, there are many highly correlated features
(structural priors)
Non-embedding methods: Sparse representation with structural priors
Our Solution: Input-Output structural grouping sparsity for image annotation
What kinds of structures can be conducted
during image annotation
Input (heterogeneous features): naturally grouped due to their
different modalities.
Output (annotated tags): The correlations of tags can be
modeled by a hierarchical tree to reflect their respective strong
or weak correlations.
Non-embedding methods: Sparse representation with structural priors
Input: The high-dimensional heterogeneous features are naturally
encoded into different groups due to their different modalities.
The high-dimensional heterogeneous features are encoded into three groups
First Group Second Group Third Group
Non-embedding methods: Sparse representation with structural priors
Our Solution: Input-Output structural grouping sparsity for image annotation
The grouping effect in feature selection: highly correlated features
within the same group tend to be selected together
Some features within the same group can be selected at the same time
Our Solution: Input-Output structural grouping sparsity for image annotation
Non-embedding methods: Sparse representation with structural priors
Different from traditional lasso, group lasso and elastic net, our
structural grouping penalty not only selects the groups of
heterogeneous features, but also identifies the subgroup of
homogeneous features within each selected group.
Group Selection Subgroup Identification High-dimensional
heterogeneous features
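The penalty described above can be sketched as a sparse-group-lasso-style sum: a group-level l2 term selects whole feature groups, and an l1 term identifies a subgroup within each selected group (the lam1/lam2 weighting here is an illustrative assumption, not the paper's exact formulation):

```python
import math

def structural_grouping_penalty(beta, groups, lam1, lam2):
    """Group-level l2 norm (selects whole groups of heterogeneous features)
    plus an l1 term (identifies a sparse subgroup within each selected group).
    beta: coefficient list; groups: list of index lists partitioning beta."""
    group_term = sum(math.sqrt(sum(beta[i] ** 2 for i in g)) for g in groups)
    l1_term = sum(abs(b) for b in beta)
    return lam1 * group_term + lam2 * l1_term
```

Setting lam2 = 0 recovers the plain group lasso (whole groups in or out); lam2 > 0 additionally zeroes individual coefficients inside surviving groups.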
Our Solution: Input-Output structural grouping sparsity for image annotation
Non-embedding methods: Sparse representation with structural priors
animals, clouds, plant_life, sky
clouds, sky, structure
people, transport, water
animals, flower, plant
Tree Structure of Annotated Labels
Output: The correlations among tags could be well modeled by a tree
structure by performing hierarchical clustering to boost image
annotation.
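A toy sketch of how such a tag tree could be built, assuming a hypothetical precomputed tag-distance matrix and plain single-linkage agglomerative clustering (not the paper's exact procedure):

```python
def single_linkage(names, dist):
    """Agglomerative single-linkage clustering over tags.
    names: list of tag names; dist: dict mapping frozenset({a, b}) -> distance.
    Returns the merge history as (cluster1, cluster2, distance) tuples."""
    clusters = [[n] for n in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between closest members
                d = min(dist[frozenset((a, b))]
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

Strongly correlated tags (small distance) merge low in the tree; weakly correlated ones merge near the root, giving exactly the strong/weak hierarchy described above.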
Our Solution: Input-Output structural grouping sparsity for image annotation
Non-embedding methods: Sparse representation with structural priors
Heterogeneous group selection
Input
Output
Input-out penalty term Tag correlation by
hierarchical tree
Input-Output structural grouping sparsity
Fei Wu, Yahong Han, Qi Tian, Yueting Zhuang, Multi-label Boosting for Image Annotation by Structural Grouping
Sparsity, ACM Multimedia 2010 (FULL Paper)
Yahong Han, Fei Wu, Qi Tian, Yueting Zhuang, Image Annotation by Input-Output Structural Grouping Sparsity, IEEE
Transactions on Image Processing, 2012,21(6):3066-3079
The experimental result of feature selection with Structural Grouping Sparsity
Non-embedding methods: Sparse representation with structural priors
Fei Wu, Ying Yuan, Yong Rui, Shuicheng Yan, Yueting Zhuang, Annotating Web Images using NOVA: NOn-conVex
group spArsity, ACM Multimedia 2012 (Full Paper, non-convex group sparsity)
Yanan Liu, Fei Wu, Zhihua Zhang, Yueting Zhuang, Shuicheng Yan, Sparse Representation using nonnegative curds
and whey, CVPR 2010, 3578-3585 (class labels)
Yahong Han, Fei Wu, Jinzhu Jia, Yueting Zhuang, Bin Yu, Multi-task Sparse Discriminant Analysis (MtSDA) with
Overlapping Categories, AAAI 2010, 469-474 (multi-task learning)
Fei Wu, Yahong Han, Qi Tian, Yueting Zhuang, Multi-label Boosting for Image Annotation by Structural Grouping
Sparsity, ACM Multimedia 2010 (Full Paper, group sparsity)
Yahong Han, Fei Wu, Jian Shao, Qi Tian, Yueting Zhuang, Graph-Guided Sparse Reconstruction for Region Tagging,
CVPR 2012, 2981-2988 (graph structure)
Yahong Han, Fei Wu, Xinyan Lu, Yueting Zhuang, Qi Tian, Jiebo Luo, Correlated Attribute Transfer with Multi-task
Graph-Guided Fusion, ACM Multimedia 2012 (Full Paper, graph structure)
Depending on if the feature selection is based on individual or group features and if the
regularizers are convex or non-convex, the existing approaches can be classified into four
categories.
Non-embedding methods: Sparse representation with structural priors
Cross-media retrieval
Text-query-image: finding a set of images that visually
best illustrate a given text description;
Image-query-text: finding relevant textual documents
that best match a given image.
Query Textual Document Ranked Listwise Image Results
Motivation and Background:
Cross-media Retrieval
Challenge: heterogeneity-gap
Many kinds of heterogeneous features can be obtained from
multi-modal data.
How to compare the similarity between low-level
heterogeneous features?
Bag of visual words
Bag of words
Motivation and Background:
Cross-media Retrieval
CCA (Canonical Correlation Analysis) and its extensions
Kernel CCA, Sparse CCA, Sparse Structure CCA
2D CCA, local 2D-CCA, sparse 2D-CCA, 3-D CCA
Audio
Video
Webpage
Correlated multi-modal Data Find appropriate linear mappings to preserve the
maximum correlation between multi-modal data
Multi-modal embedding:
Statistical dependency modeling
Latent statistical relationships
Feature extraction
(Figure: feature extraction yields 500-dimensional visual feature vectors from images and 400-dimensional acoustic feature vectors from audio.)
Hong Zhang, Yueting Zhuang, Fei Wu, Cross-modal correlation learning for clustering on image-audio dataset, ACM Multimedia 2007, 273-276
Fei Wu, Hong Zhang, Yueting Zhuang, Learning Semantic Correlation for Cross-media Retrieval, 1465-1468 , ICIP 2006
Canonical correlation analysis learns the statistical correlation between the low-level features of different media types and builds an isomorphic cross-media space, providing an effective mechanism for measuring similarity between different types of media data.
Multi-modal embedding:
Statistical dependency modeling
max_{Wx, Wy} corr(X', Y'),  where X' = X Wx and Y' = Y Wy
The matrix computation for subspace mapping:
Extract the visual feature matrix X from the image training set
and the acoustic feature matrix Y from the audio training set.
Canonical correlation analysis computes the statistical relationship between the two.
A Lagrangian method finds the two transformation matrices Wx and Wy.
After linear dimensionality reduction, the correlation between the two matrices
is preserved to the maximum extent.
X = | x_{1,1}  x_{1,2}  ...  x_{1,500} |
    | x_{2,1}  x_{2,2}  ...  x_{2,500} |
    |   ...      ...    ...     ...    |
    | x_{n,1}  x_{n,2}  ...  x_{n,500} |

Y = | y_{1,1}  y_{1,2}  ...  y_{1,400} |
    | y_{2,1}  y_{2,2}  ...  y_{2,400} |
    |   ...      ...    ...     ...    |
    | y_{n,1}  y_{n,2}  ...  y_{n,400} |
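A minimal numerical sketch of CCA on such matrices, using Cholesky whitening plus an SVD rather than the Lagrangian derivation on the slide (the `reg` ridge term is an added assumption for numerical stability):

```python
import numpy as np

def cca(X, Y, reg=1e-8):
    """Canonical correlation analysis: find Wx, Wy maximizing
    corr(X @ Wx, Y @ Wy). Returns (Wx, Wy, canonical correlations)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = len(X)
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n
    Kx = np.linalg.inv(np.linalg.cholesky(Sxx))  # whitener for X
    Ky = np.linalg.inv(np.linalg.cholesky(Syy))  # whitener for Y
    # SVD of the whitened cross-covariance gives the canonical directions
    U, corrs, Vt = np.linalg.svd(Kx @ Sxy @ Ky.T)
    return Kx.T @ U, Ky.T @ Vt.T, corrs
```

When Y is an exact linear function of X, the leading canonical correlations approach 1, which is the "maximum correlation preserved" property stated above.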
Multi-modal embedding:
Statistical dependency modeling
Latent Dirichlet Allocation and its extensions (probabilistic
graphical models):
Correspondence LDA, Topic-regression Multi-modal LDA
The lion (Panthera leo) is one of the four
big cats in the genus Panthera and a
member of the family Felidae. With some
males exceeding 250 kg in weight, it is the
second-largest living cat after the tiger.
Wild lions currently exist in sub-Saharan
Africa and in Asia (where an endangered
remnant population resides in Gir Forest
National Park in India) while other types
of lions have disappeared from North
Africa and Southwest Asia in historic
times
model the correlations of multi-modal data at latent semantic (topic) level across modalities
Multi-modal embedding:
Probabilistic graphical modeling
Each topic is a distribution over words
Each document is a mixture of topics
Each word is drawn from one of those topics
D. Blei, A. Ng, and M. Jordan, Latent Dirichlet allocation, Journal of Machine
Learning Research, 3:993–1022, January 2003
Multi-modal embedding:
Probabilistic graphical modeling
Variables:
Latent variables: θ, β and z denote the topic proportions, the topics and the topic assignments;
Observed variables: w denotes the observed words;
Hyperparameters: α and η are the prior parameters of the topic proportions and the topics;
Without loss of generality, D, N and K denote the number of documents, the number of words per document, and the number of topics.
Generative model: 1. Draw each topic β_k ~ Dir(η). 2. For each document, draw topic proportions θ_d ~ Dir(α). 3. For each position (token) in each document, sample a topic z_n ~ Mult(θ_d). 4. For each position in each document, generate a word w_n ~ Mult(β_{z_n}).
Characteristics: the discovered topics and the model are interpretable; it captures word co-occurrence; it generalizes well and is extensible.
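The generative story above can be simulated directly; a small sketch using numpy's Dirichlet and multinomial samplers (corpus sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_lda_corpus(D, N, K, V, alpha=0.5, eta=0.1):
    """Simulate LDA's generative process: topics beta_k ~ Dir(eta),
    per-document proportions theta_d ~ Dir(alpha), then for each token
    a topic z ~ Mult(theta_d) and a word w ~ Mult(beta_z)."""
    beta = rng.dirichlet(np.full(V, eta), size=K)      # K topics over V words
    docs = []
    for _ in range(D):
        theta = rng.dirichlet(np.full(K, alpha))       # topic proportions
        z = rng.choice(K, size=N, p=theta)             # topic assignments
        words = [rng.choice(V, p=beta[k]) for k in z]  # observed words
        docs.append(words)
    return beta, docs
```

Inference (e.g., variational Bayes or Gibbs sampling) inverts this process: given only the words, recover θ, β and z.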
Multi-modal embedding:
Probabilistic graphical modeling
D. Blei and M. Jordan, Modeling annotated data, SIGIR 2003
(unsupervised) Correspondence-LDA modeling the joint distribution of an
image and its caption
modeling the conditional distribution of
words given an image
modeling the conditional distribution of
words given a particular region of an
image.
Multi-modal embedding:
Probabilistic graphical modeling
Upstream
Downstream
Multi-modal embedding:
Probabilistic graphical modeling
LDA itself is an unsupervised clustering model and cannot be directly applied to classification. This motivates supervised LDA methods:
Upstream supervised LDA: the topic assignment of (visual) words is conditioned on the document class.
Downstream supervised LDA: the document class is generated conditioned on the topic assignments of (visual) words.
Multi-modal embedding:
Probabilistic graphical modeling
Upstream supervised LDA: the topic assignment of (visual) words is conditioned on the document class.
Downstream supervised LDA: the document class is generated conditioned on the topic assignments of (visual) words.
Fei-Fei Li, Pietro Perona, A Bayesian Hierarchical Model for Learning Natural Scene
Categories, CVPR 2005
Chong Wang, David Blei, Li Fei-Fei, Simultaneously Image Classification and Annotation,
CVPR 2009
Upstream Downstream
Multi-modal embedding:
Probabilistic graphical modeling
Multi-Instance Multi-Label LDA
The topic decided by the visual information and the topic decided by the tag
information should be consistent, leading to the correct label assignment.
C.-T. Nguyen, D.-C. Zhan, and Z.-H. Zhou,Multi-modal image annotation with multi-instance
multi-label LDA, In IJCAI,2013
Pairwise ranking: PAMIR [Grangier & Bengio
2008] and SSI [Bai et al. 2010]
Listwise ranking: LSCMR [Lu et al. 2013] or
Bi-directional ranking [Wu et al. 2013]
B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi, O. Chapelle, and K.
Weinberger, Learning to rank with (a lot of) word features, Information Retrieval,
13(3):291–314, 2010
D. Grangier and S. Bengio. A discriminative kernel-based approach to rank images from text
queries, T-PAMI, 30(8):1371–1384, 2008
Fei Wu, Xinyan Lu, Yin Zhang, Zhongfei Zhang, Shuicheng Yan, Yueting Zhuang, Cross-
Media Semantic Representation via Bi-directional Learning to Rank, ACM Multimedia(Full
Paper),877-886, 2013
Xinyan Lu, Fei Wu, Siliang Tang, Zhongfei Zhang, Xiaofei He, Yueting Zhuang, A low rank
structural large-margin method for cross-modal ranking, SIGIR 2013 (Full Paper),433-
442,2013
Multi-modal embedding:
Ranking based embedding
Pairwise ranking: PAMIR [Grangier & Bengio 2008]
and SSI [Bai et al. 2010]
D. Grangier and S. Bengio, A discriminative kernel-based approach to rank images from text
queries, T-PAMI, 30(8):1371–1384, 2008
B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi, O. Chapelle, and K.
Weinberger, Learning to rank with (a lot of) word features, Information Retrieval,
13(3):291–314, 2010
Multi-modal embedding:
Ranking based embedding
Listwise ranking: Learn a multi-modal ranking function to preserve the orders of relevance of multi-modal data.
Latent space embedding: discover the correlations between multi-modal data.
Multi-modal embedding:
Ranking based embedding
Multi-modal embedding is considered from the perspective of
optimizing a listwise ranking and latent space embedding while
taking advantage of bi-directional ranking examples.
Bi-directional ranking: both text-query-image and image-
query-text ranking examples are utilized during training
to achieve better performance.
Fei Wu, Xinyan Lu, Yin Zhang, Zhongfei Zhang, Shuicheng Yan, Yueting Zhuang,
Cross-Media Semantic Representation via Bi-directional Learning to Rank, ACM
Multimedia(Full Paper),877-886, 2013
Xinyan Lu, Fei Wu, Siliang Tang, Zhongfei Zhang, Xiaofei He, Yueting Zhuang, A
low rank structural large-margin method for cross-modal ranking, SIGIR 2013 (Long
Paper),433-442,2013
Multi-modal embedding:
Ranking based embedding
Text-query-image ranking examples
Image-query-text ranking examples
Image queries Ranked text documents
Text queries Ranked images
Text-to-image correlation
Image-to-text correlation
Step 1: Modeling of Multi-modal Correlation
Multi-modal embedding:
Ranking based embedding
All the m-dimensional queries q and n-dimensional target documents d are mapped
into a k-dimensional latent space by U and V respectively, in which data objects
with the same semantics are grouped so as to directly optimize a listwise ranking measure (e.g., MAP)
Step 2: The latent space embedding
Only a single latent layer: a shallow model
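A minimal sketch of this scoring scheme, with hypothetical mapping matrices U and V: project both sides into the shared latent space and rank by inner product:

```python
import numpy as np

def relevance_score(q, d, U, V):
    """Score a query against a document by mapping both into the shared
    k-dimensional latent space (via U and V) and taking an inner product."""
    return float((U @ q) @ (V @ d))

def rank_documents(q, docs, U, V):
    """Return document indices ordered from most to least relevant."""
    scores = [relevance_score(q, d, U, V) for d in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])
```

Learning U and V so that this induced ranking matches the ground-truth listwise ranking is exactly the training objective described above.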
Multi-modal embedding:
Ranking based embedding
Image-query-text direction
bi-directional ranking
examples
bi-directional structural
Learning
Text-query-image direction Image-to-text correlation
and Text-image correlation
Latent semantic embedding
Multi-modal embedding:
Ranking based embedding
How to learn the mapping matrices (i.e., U and V) and perform
latent space embedding?
How to learn the ranking function?
Query Textual Document Ranked Listwise Image Results
Ranking
function
Multi-modal embedding:
Ranking based embedding
Structural risk
(U and V for latent space embedding)
Bi-directional empirical risk
(upper bound loss)
The constraints for bi-directional
ranking examples
The Bi-directional structural learning to rank is formulated as a supervised structural learning problem:
maximize the margins between the true ranking and all the other
possible rankings when the constraints of bi-directional ranking are
enforced.
Multi-modal embedding:
Ranking based embedding
Here are the constraints for the image-query-text direction, which mean that the slack variable (or empirical risk) is at least the ranking loss whenever a wrong ranking is predicted:
The compatible functions; the loss function (MAP in this paper)
The constraints for the text-query-image direction can be defined in the same way.
Multi-modal embedding:
Ranking based embedding
The compatible functions F are respectively defined for two
directions:
Ranking images from
text queries
Ranking texts from
image queries
Note that the compatible function cares about the relative ranking position between a
relevant document and an irrelevant document. As a result, the ranking that
maximizes the compatible function F equals the ranking given by the
ranking function.
Multi-modal embedding:
Ranking based embedding
optimizing the model
parameters (U and V)
updating the constraints set
with a new batch of rankings
Multi-modal embedding:
Ranking based embedding
Experiments:
Datasets: Wikipedia feature articles and NUS-WIDE
                              Wikipedia       NUS-WIDE
BoVW vocabulary size (image)  1000            500
BoW vocabulary size (text)    5000            1000
Avg. # of words / image       117.5           7.73
Documents partition           1500/500/866    2664/23977/106567
Queries partition             1500/500/866    2664/2000/2000
Partitions are ordered as training/validation/test.
Performance Measurement: MAP@R Mean Average Precision
R is set to 50 or all.
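For reference, MAP@R can be computed as below (a standard definition, assuming binary relevance lists already ordered by the ranker):

```python
def average_precision(relevance, R=None):
    """AP@R over a ranked binary relevance list (1 = relevant, 0 = not)."""
    ranked = relevance if R is None else relevance[:R]
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at each relevant position
    n_rel = sum(ranked)
    return total / n_rel if n_rel else 0.0

def mean_average_precision(runs, R=None):
    """MAP@R: mean of per-query average precision."""
    return sum(average_precision(r, R) for r in runs) / len(runs)
```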
Wikipedia dataset in terms of MAP@R:
            Text Query   Text Query   Image Query   Image Query
            (R=50)       (R=all)      (R=50)        (R=all)
CCA         0.2343       0.1433       0.2208        0.1451
PAMIR       0.3093       0.1734       0.1797        0.1779
SSI         0.2821       0.1664       0.2344        0.1759
Uni-CMSRM   0.3663       0.2021       0.2570        0.2229
Bi-CMSRM    0.3981       0.2123       0.2599        0.2528

NUS-WIDE dataset in terms of MAP@R:
            Text Query   Text Query   Image Query   Image Query
            (R=50)       (R=all)      (R=50)        (R=all)
CCA         0.1497       0.0851       0.1523        0.0883
PAMIR       0.2046       0.1184       0.5003        0.2410
SSI         0.2156       0.1140       0.4101        0.1992
Uni-CMSRM   0.2781       0.1424       0.4997        0.2491
Bi-CMSRM    0.3224       0.1453       0.4950        0.2380
Experiments:
Comparative Results
Mission: attempt to learn hashing function(s) to faithfully preserve
the intra-modality and inter-modality similarities and map the high-
dimensional multi-modal data to compact binary codes.
Multi-modal Document
(one image with its narrative text)
0 1 1 1 0 1
1 0 0 1 1 1
0 1 1 1 0 0
… Hashing Function
Multi-modal embedding:
Multi-modal hashing
0 1 1 1 0 1
1 0 0 1 1 1
1 1 0 0 0 1
…
Hashing is a promising way to speed up approximate nearest
neighbor (ANN) similarity search, trading off
accuracy against efficiency.
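A toy sketch of Hamming-distance search over compact binary codes (codes stored as Python ints; a linear scan for clarity, not a real multi-probe index):

```python
def hamming(a, b):
    """Hamming distance between two binary codes stored as ints."""
    return bin(a ^ b).count("1")

def knn_binary(query, codes, k=3):
    """Approximate NN search: rank database codes by Hamming distance
    to the query code and return the k closest indices."""
    order = sorted(range(len(codes)), key=lambda i: hamming(query, codes[i]))
    return order[:k]
```

Because XOR and popcount are single machine instructions, this comparison is dramatically cheaper than a distance in the original high-dimensional feature space, which is the efficiency side of the tradeoff.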
Multi-modal embedding:
Multi-modal hashing
Three kinds of hashing approaches
Locality Sensitive Hashing
Spectral Hashing
Multiple Feature Hashing
Composite Hashing(CHMIS)
Homogeneous
Features
Heterogeneous
Features
Cross-Modal Similarity Sensitive
Hashing (CMSSH)
Cross View Hashing(CVH)
Multimodal latent binary
embedding(MLBE)
Multimodal
data
Color
Texture
Shape
….
Image
Text
Audio
….
Multi-modal embedding:
Multi-modal hashing
Multi-modal hashing tends to utilize the intrinsic intra-
modality and inter-modality similarity to learn the
appropriate relationships of the data objects and provide
efficient search across different modalities
Our Approach: Sparse Multi-modal Hashing
Fei Wu, Zhou Yu, Yi Yang, Siliang Tang, Yueting Zhuang, Sparse
multi-modal hashing, IEEE Transactions Multimedia, 16(2):427-
439,2014
Multi-modal embedding:
Multi-modal hashing
Multi-modal
dictionaries
Multi-modal correlation
modeling
dinosaur,jaw,
Jurassic
sport,football,
NFL
Intra-modality similarity
Inter-modality similarity
sport,football,
NFL
dinosaur,jaw,
Jurassic
ImageDictionary
TextDictionary
Multi-modal objects
Step 1: The Joint Learning of Multi-modal Dictionaries
Multi-modal embedding:
Multi-modal hashing
Sparse
Reconstruction
Hypergraph Laplacian
Penalty
Our approach is formulated by coupling the multi-modal
dictionary learning (in terms of approximate reconstruction of
each data object with a weighted linear combination of a small
number of “basis vectors”) and a regularized hypergraph penalty
(in terms of the modeling of multi-modal correlation).
Multi-modal embedding:
Multi-modal hashing
Sparse Reconstruction Sparse codesets Multi-modal objects
Both intra-modality and inter-modality similarities are preserved. For example, two "dinosaur"
images have the same sparse codeset, and the two "dinosaur" images have codesets similar
to those of their relevant text (dinosaur, ancient, fossil, etc.). On the contrary, the two "dinosaur"
images have apparently different sparse codesets from their irrelevant text (sport, football, etc.).
ImageDictionary
dinosaur,ancient,
fossil
sport,football,
NFL TextDictionary
(3, 4, 6, 7)
(3, 4, 6, 7)
(2, 4, 6, 7)
(1, 4, 5, 8)
dinosaur,jaw,
Jurassicdinosaur,ancient,
fossil
sport,football,
NFL
Step 2: The Generation of Sparse Codesets
Multi-modal embedding:
Multi-modal hashing
Activate the most relevant components and induce a compact
codeset for each data object from its corresponding sparse coefficients
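A minimal sketch of inducing such a codeset from sparse coefficients (the threshold and the Jaccard similarity between codesets are illustrative assumptions, not the paper's exact choices):

```python
def sparse_codeset(coeffs, thresh=1e-6):
    """Induce a compact codeset: the indices of the activated dictionary
    atoms, i.e., the coefficients whose magnitude exceeds the threshold."""
    return tuple(i for i, c in enumerate(coeffs) if abs(c) > thresh)

def codeset_similarity(a, b):
    """Jaccard overlap between two codesets (hypothetical similarity)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)
```

On the example above, codesets (3, 4, 6, 7) and (2, 4, 6, 7) overlap heavily (relevant image/text pair), while disjoint codesets signal irrelevance.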
Multi-modal embedding:
Multi-modal hashing
SliM2 extends uni-modal dictionary learning (DL) to multi-modal DL and jointly
learns a set of mapping functions across different
modalities. Furthermore, SliM2 utilizes label
information to discover the shared intra-modality
structures within the same classes.
Our Approach: Supervised coupled dictionary learning
with group structures
Yueting Zhuang, Yanfei Wang, Fei Wu, Yin Zhang, Weiming Lu,
Supervised Coupled Dictionary Learning with Group Structures for
Multi-modal Retrieval, AAAI 2013
Multi-modal embedding:
Multi-modal hashing
Dictionary learning methods:
Penalty for traditional DL :
Penalty for DL with group norm :
If all the images in one class (category) are taken as a group,
dictionary learning can utilize label information as follows:
Multi-modal embedding:
Multi-modal hashing
Group norm with label information,
which assumes that data from the same
categories share the same dictionary entries
Reconstruction
errors of
Different
modalities
Linear relationships
between sparse
coefficients
Relatively simple
mappings
Multi-modal embedding:
Multi-modal hashing
Most of the existing cross-media hashing approaches share the
common idea of learning different hash functions individually for
each modality and mapping the data from different modalities to a
shared low-dimensional Hamming space. However, such a binary
embedding strategy often results in poor indexing performance,
because the shared embedding space is not semantically discriminative,
which is crucially important for cross-media retrieval.
Our Approach: Discriminative coupled dictionary hashing
Zhou Yu, Fei Wu, Yi Yang, Qi Tian, Jiebo Luo, Yueting Zhuang,
Discriminative Coupled Dictionary Hashing for Fast Cross-media
Retrieval, SIGIR 2014 (Long Paper, accepted)
Multi-modal embedding:
Multi-modal hashing
Learn dictionaries that are both discriminative and coupled.
The discriminative capability indicates that data from the same category will have
similar sparse representations (i.e., sparse codes).
The coupling means that not only intra-modality similarity but also inter-modality
correlation will be preserved.
Our Approach: Discriminative coupled dictionary hashing
Zhou Yu, Fei Wu, Yi Yang, Qi Tian, Jiebo Luo, Yueting Zhuang,
Discriminative Coupled Dictionary Hashing for Fast Cross-media
Retrieval, SIGIR 2014 (Long Paper, accepted)
Multi-modal embedding:
Multi-modal hashing
The Discriminative Capability of the Coupled Dictionary Space
Multi-modal embedding:
Multi-modal hashing
Deep learning attempts to learn in multiple levels of
representation, corresponding to different levels of
abstraction. The levels in these learned statistical
models correspond to distinct levels of concepts, where
higher-level concepts are defined from lower-level ones,
and the same lower-level concepts can help to define
many higher-level concepts.
Boltzmann machine, auto-encoder, recursive neural network,
convolutional neural network
Bengio, Y., Learning Deep Architectures for AI, Foundations and Trends in
Machine Learning, 2(1):1–127, 2009
Nicola Jones, The learning machines, Nature, 08 January 2014
Multi-modal embedding:
From the shallow model to the deep model
Galen Andrew, Raman Arora, Jeff Bilmes and Karen Livescu, Deep Canonical
Correlation Analysis, International Conference on Machine Learning, 2013
Multi-modal embedding:
From the shallow model to the deep model
Multi-modal embedding:
From the shallow model to the deep model
J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, A.Y. Ng., Multimodal deep learning, ICML 2011
A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov, DeViSE:
A Deep Visual-Semantic Embedding Model, NIPS 2013
Images: Convolutional neural network Documents: Recursive neural network
Multi-modal embedding:
From the shallow model to the deep model
Nitish Srivastava , Ruslan Salakhutdinov , Multimodal Learning with Deep Boltzmann Machines,
NIPS 2012
Multi-modal embedding:
From the shallow model to the deep model
Nitish Srivastava, Ruslan Salakhutdinov, Geoffrey Hinton, Modeling Documents
with Deep Boltzmann Machines, UAI 2013
Multi-modal embedding:
From the shallow model to the deep model
W. Wang, B.C. Ooi, X. Yang, D. Zhang, Y. Zhuang, Effective MultiModal
Retrieval based on Stacked AutoEncoders,VLDB 2014
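A tiny numpy sketch of the basic building block behind such stacked autoencoder models: one tanh hidden layer with tied weights, trained by plain gradient descent (architecture and hyperparameters are illustrative assumptions, not any cited paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_tied_autoencoder(X, hidden, lr=0.01, epochs=300):
    """One-hidden-layer autoencoder with tied weights:
    h = tanh(X W + b), reconstruction = h W^T + c.
    Returns the learned parameters and the loss history."""
    n, d = X.shape
    W = rng.normal(0.0, 0.1, (d, hidden))
    b = np.zeros(hidden)
    c = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W + b)           # encode
        R = H @ W.T + c                  # decode (tied weights)
        err = R - X
        losses.append(float((err ** 2).mean()))
        dH = (err @ W) * (1.0 - H ** 2)  # backprop through tanh
        gW = (X.T @ dH + err.T @ H) / n  # encoder + decoder gradient paths
        W -= lr * gW
        b -= lr * dH.mean(axis=0)
        c -= lr * err.mean(axis=0)
    return W, b, c, losses
```

Stacking means training such a layer, then training another on its hidden codes H, and so on; multi-modal variants train one stack per modality and couple the top codes.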
Multi-modal embedding:
From the shallow model to the deep model
Yan Liu, Sheng-hua Zhong ,Wenjie Li. ,Query-oriented Multi-document
Summarization via Unsupervised Deep Learning, AAAI 2012
Multi-modal embedding:
From the shallow model to the deep model
Dong Yu, Shizhen Wang, and Li Deng, Sequential Labeling Using Deep-Structured
Conditional Random Fields,IEEE Journal of selected topics in signal processing
Multi-modal embedding:
From the shallow model to the deep model
Deep structured models:
Deep conditional random fields
Feature Representation
and Structured Learning
Freebase: more than 40 million entities and more than 2 billion facts
describing entity-entity relations
NELL (CMU): Never-Ending
Language Learning: more than 50 million descriptions of entity-entity relations
ReVerb: 15 million descriptions of entity-entity relations
Conclusion: exploit knowledge graphs
Clickage: Towards bridging semantic and intent gaps via mining click logs of search
engines, ACM MM 2013
Conclusion: value user interaction behavior (crowd intelligence)
How to better exploit the massive relevance feedback provided by search engines in the real world
Conclusion: value user interaction behavior (crowd intelligence)
Crowdsourcing behaviors collected via Amazon's Mechanical Turk, reCAPTCHA, the ESP
Game, etc.:
ICML 13 Workshop: Machine Learning Meets Crowdsourcing; NIPS 13 Workshop on Crowdsourcing:
Theory, Algorithms and Applications; ACM MM 2013 CrowdMM'13; Crowdsourcing for Multimedia 2014
How to better leverage crowd behavior to understand cross-media big data
Conclusion: value user interaction behavior (crowd intelligence)
Introducing user interaction behavior (crowd-intelligence computing)
In January 2013, IEEE Computer published an article calling for the establishment of a computational model for this.
In February 2014, Turing Award winner and ACM President Vinton G. Cerf proposed "cognitive implants" in Communications of the ACM.
Noam Miller,et.al, Both information and social cohesion determine collective decisions in
animal groups, PNAS, 110(13):5263-5268,2013,March
An article published in PNAS in March 2013 pointed out that the crucial decision-making step in human cognition is influenced by an individual's prior knowledge (past experiences or priors), by data
(i.e., the behavior of other people in cyberspace and the real world), and by models of interpersonal interaction (e.g., isolation, social identification); all of these factors are crucial to the final decision. However, the interpersonal interaction model in that paper is descriptive rather than computable.
Introducing user interaction behavior (crowd-intelligence computing)
Unsupervised deep learning (bottom-up) / supervised deep learning
(top-down for fine-tuning)
Learn representations (features) from data / disentangle structure in data
Big data + Big infrastructure -> Big model + Big learning
Knowledge mining (top-down)/The Construction of Knowledge base
Entities and relations, alternative expressions, etc.
Inference and reasoning
Crowdsourcing(human computation in the loop)
Assist automatic algorithm to achieve higher accuracy
Treat human computer interaction as new way to collect more signals
and labelled data for supervised learning
Put it together (based on Dr. Wei-Ying Ma's slides)