Multi-view, Multi-label Learning with Deep Neural Networks
Ziqiao Guan1
Research Proficiency Exam
September 2016
Advisor
Hong Qin1, Professor
Committee
Minh Hoai Nguyen1, Assistant Professor
Dimitris Samaras1, Associate Professor
Dantong Yu2, 3, Associate Professor
1 Department of Computer Science, Stony Brook University
2 Martin Tuchman School of Management, New Jersey Institute of Technology
3 Brookhaven National Laboratory
Abstract
Deep learning is a popular technique in modern online and offline services. Deep neural
network based learning systems have made groundbreaking progress in model size, training
and inference speed, and expressive power in recent years, but tailoring a model to a specific
problem and exploiting the structure of the data and the problem remain open research topics. We look
into two types of deep ‘‘multi-’’ objective learning problems: multi-view learning, which refers
to learning from data represented by multiple distinct feature sets, and multi-label learning,
which refers to learning from data instances that carry multiple class labels that are not mutually
exclusive. Work on both problems builds on existing successful deep
architectures and modifies layers, adds regularization terms, or even builds hybrid systems to
meet the problem constraints.
In this report we first explain the original artificial neural network (ANN) with the
backpropagation learning algorithm, and also its deep variants, e.g. deep belief network (DBN),
convolutional neural network (CNN) and recurrent neural network (RNN). Next we present a
survey of some multi-view and multi-label learning frameworks based on deep neural networks.
Finally, we introduce some applications of deep multi-view and multi-label learning, including
e-commerce item categorization, deep semantic hashing, dense image captioning, and our
preliminary work on x-ray scattering image classification.
Contents
1 Introduction
2 Deep Neural Network
2.1 Artificial Neural Network
2.2 Deep Belief Network
2.2.1 Structure
2.2.2 Learning
2.2.3 DBN and Deep Auto-encoder
2.3 Convolutional Neural Network
2.3.1 Structure
2.3.2 Normalizations and Overfitting Prevention
2.3.3 Notable Architectures
2.4 Recurrent Neural Network
2.4.1 Structure
2.4.2 Learning
2.4.3 Long Short Term Memory and Gated Recurrent Unit
3 Multi-view and Multi-label Learning
3.1 Multi-view Learning
3.1.1 Deep Multi-view Representation Learning
3.1.2 Network Co-training
3.2 Multi-label Learning
3.2.1 Going Deep with Multi-label Learning
3.2.2 Multi-label Ranking Loss
3.2.3 Multi-attribute Learning
3.2.4 Regional Object Detection
4 Applications
4.1 Large Scale Item Categorization in e-Commerce
4.2 Deep Semantic Ranking Based Hashing
4.3 Dense Image Captioning
4.4 Preliminary Work: X-ray Scattering Image Classification
5 Conclusion
Chapter 1
Introduction
Machine learning technology drives a vast number of today's online and offline services, from
web search to face recognition on mobile phones. The modern Internet has made massive
volumes of user-generated data available, which has greatly benefited the development of learning methods
and techniques in recent years; these in turn find their way into consumer electronic
devices such as cameras and smartphones. In general, machine learning systems tackle
tasks such as identifying objects in images, transcribing speech into text, and matching web
pages, products and knowledge to users’ interests [52]. In many domain-specific applications,
such as x-ray image categorization [46], weather prediction [28] and electronic medical record
monitoring [10], data analysis tasks also call for effective and scalable machine learning
systems. At the core of these applications is the search for a robust representation of large amounts of rich
data, in which the properties and features of interest become easily discernible and
interpretable; finding such a representation is a nontrivial task.
Classic machine learning techniques had many limitations when dealing with the intricacies
of natural data. Traditionally, learning models relied heavily on hand-crafted features, i.e.,
a designed set of transforms that convert raw data to a representation suitable for analysis tasks,
which required elaborate design and often profound domain knowledge. In computer vision, for
example, conventional object detection methods extract image features with descriptors such
as SIFT [56] or HOG [15], and then use classifiers such as the support vector machine (SVM) [14]
to classify the resulting feature vectors. Fixed hand-crafted features of this type usually do
not adapt well to datasets with different statistics, or require nontrivial transform
tricks for specific applications. Recently, deep learning, or more specifically deep neural
networks, has brought a paradigm shift in the search for good representations and achieved
large performance gains in machine learning problems across different areas.
Deep learning methods are essentially representation learning methods [6] that use multiple
layers to replace the hand-crafted features of traditional approaches with layered
features learned by general-purpose learning algorithms. Starting with the raw input, deep
learning methods construct multiple layers of nonlinear operations, each of which acts on the representation
from the previous layer and passes its response to the next, deeper layer. Stacking many
layers of transforms allows a deep structure to approximate representation mappings of high
complexity. We expect the stacked layers to reflect increasing levels of abstraction,
so that deeper layers capture abstract concepts and are robust to variations that look
dramatic from the perspective of the raw input. For example, in image classification, the first
layer should pick up low-level visual features such as edges, the second layer should handle the
arrangements of edge features, and deeper layers are supposed to focus on larger and more
complete elements that make up specific shapes.
Due to the high degrees of freedom of deep neural network structures, training a deep
network model effectively poses many new challenges. On the one hand, training
a complex network requires a huge amount of data and is computationally expensive; on the
other hand, a deep network tends to amplify small changes in responses through its layers, making
the system highly sensitive to specific settings. To address these numerical concerns, researchers have
introduced techniques such as normalization, data augmentation and dropout [72] to speed up
training and improve robustness. To adapt to specific problems, numerous works
have changed the network layers, proposed novel objectives and regularization
terms, and built more complex hybrid systems.
As deep learning achieves great success in generic classification and recognition
problems, more researchers are turning to exploiting the structure of the data and the problem in
the learning process, which leads to multi-view and multi-label learning problems. Multi-view
data is data with diverse views from multiple sources or feature subsets, e.g. text data
with accompanying images. Joint training methods take the consistency and complementarity
of different views into account for more effective and robust learning. In addition, data
may come with multiple correlated attributes, as opposed to a standard N-way classification
problem where the labels are mutually exclusive. By explicitly preserving those correlations,
for example via manifold embedding, the trained models may better capture the patterns of the
data. Multi-view and multi-label learning problems call for different modifications to the neural
network structure to fit specific applications, which shows the versatility of deep neural
networks.
In this report, we first explain the basics of the artificial neural network (ANN) and
the backpropagation training algorithm, as well as several popular variations of deep neural
networks, i.e. the deep belief network (DBN), convolutional neural network (CNN) and recurrent
neural network (RNN). Next we present a survey of some multi-view and multi-label adaptations
of deep neural networks. Finally, we introduce some applications of deep multi-view and multi-label learning
and our preliminary work on x-ray scattering image classification.
Chapter 2
Deep Neural Network
The popular deep neural networks of today are a recent development of the original artificial
neural network, which was mostly superseded in the 1990s by the support vector machine [14], owing to
the simplicity and effectiveness of the latter. In this chapter, we first introduce the ANN and the backpropagation
training algorithm, and then introduce some of its deep variations, i.e. DBN, CNN and RNN,
and the reasons for their recent success.
2.1 Artificial Neural Network
An artificial neural network is a network inspired by the biological neural network for machine
learning tasks. The earliest ANN is a multilayer perceptron (MLP), which consists of multiple
layers of nodes in a directed graph, and each layer is fully connected to the next one (see
Figure 2.1). Each node except the input nodes is a neuron with a nonlinear activation function.
As a multilevel variation of the linear perceptron, MLP is used to model and classify data that
is linearly inseparable via supervised learning.
Figure 2.1: Structure of an artificial neural network, with input, hidden and output layers.

Backpropagation [63] is a learning algorithm for ANNs. The idea is to look for
the minimum of some error function in the weight space via gradient descent. Given a feed
forward network and a training set $\{(x_1, t_1), \ldots, (x_p, t_p)\}$, where $x_i$ is a data instance and
$t_i$ is its label vector, we generally want to produce an output $o_i$ that is as close to the target
$t_i$ as possible by adjusting the network parameters. More specifically, we wish to
minimize an error function, say, the L2 error
\[ E = \frac{1}{2} \sum_{i=1}^{p} \| o_i - t_i \|^2. \tag{2.1} \]
In order to apply gradient descent during training, we need the activation function to be
differentiable, e.g. the sigmoid
\[ \sigma(x) = \frac{1}{1 + e^{-x}}, \tag{2.2} \]
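so that $E = E(w_1, w_2, \ldots, w_\ell)$ is a differentiable function of the $\ell$ weight parameters in the
network and the gradient can be calculated as
\[ \nabla E = \left( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_\ell} \right). \tag{2.3} \]
Then each weight is updated by
\[ \Delta w_i = -\gamma \frac{\partial E}{\partial w_i}, \quad i = 1, \ldots, \ell, \tag{2.4} \]
where $\gamma$ is the learning rate.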
Backpropagation is a realization of the chain rule. Suppose we extend the output layer
of the ANN to compute the error function in the end. In the feed forward step, we evaluate
not only the activation value at each node, but also the derivative of the activation function
given the input. In order to find the partial derivative of the training error E with respect
to an edge weight wij , the chain rule is applied and it is equivalent to running the neural
network backwards. More specifically, we feed the constant 1 to the error node, and traverse
the network backwards and multiply the derivatives at the nodes in between, using the same
weights. The weight correction can be expressed as
\[ \Delta w_{ij} = -\gamma\, o_i\, \delta_j, \tag{2.5} \]
where wij refers to the weight on the edge connecting node i to node j, oi is the output at node
i, and δj is the backpropagated error.
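The update rule (2.5) can be checked numerically. The following is a minimal NumPy sketch, not code from this report: the network size, the random data and the helper names (`forward`, `loss`, `backprop`) are assumptions chosen for illustration. It computes the backpropagated errors for a one-hidden-layer sigmoid network under the L2 error (2.1) and compares one gradient entry with a finite-difference estimate.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, W2):
    """Forward pass of a one-hidden-layer MLP with sigmoid units."""
    h = sigmoid(W1 @ x)   # hidden activations
    o = sigmoid(W2 @ h)   # network outputs
    return h, o

def loss(x, t, W1, W2):
    """L2 error (2.1) for a single training pair."""
    _, o = forward(x, W1, W2)
    return 0.5 * np.sum((o - t) ** 2)

def backprop(x, t, W1, W2):
    """Gradients of the error w.r.t. W1 and W2 via the chain rule."""
    h, o = forward(x, W1, W2)
    delta_out = (o - t) * o * (1.0 - o)             # error at the output nodes
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)  # error propagated backwards
    # dE/dw_ij = o_i * delta_j, written as outer products
    return np.outer(delta_hid, x), np.outer(delta_out, h)

# Hypothetical toy problem: 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
x, t = rng.normal(size=3), np.array([0.0, 1.0])
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

gW1, gW2 = backprop(x, t, W1, W2)

# Central finite difference on a single weight as a sanity check
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[1, 2] += eps
W1_minus[1, 2] -= eps
numeric = (loss(x, t, W1_plus, W2) - loss(x, t, W1_minus, W2)) / (2 * eps)

# One gradient descent step, as in (2.4), with learning rate gamma
gamma = 0.1
W1_new, W2_new = W1 - gamma * gW1, W2 - gamma * gW2
```

The outer products make the structure of (2.5) explicit: each weight gradient is the output of the source node times the backpropagated error of the target node.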
In 1995, Cortes and Vapnik devised a new type of learning machine called the support vector
machine [14]. The idea of SVM is to map input vectors to a high-dimensional feature space in which
they become linearly separable, and to find the decision surface that maximizes the margin between positive and
negative samples. They showed that the resulting separating hyperplane has good generalization
ability. At the time, multilevel ANNs did not deliver much better performance despite their
complex structure, due to the following difficulties of backpropagation:
• It requires a large amount of labeled data;
• It takes a long time to train multilevel networks;
• It can get stuck in local optima.
Figure 2.2: Structure of a deep belief network with input x and hidden layers h1, h2, h3. The top two layers have undirected symmetric connections between them and essentially form a Restricted Boltzmann Machine. The lower layers have top-down directed connections and form a Bayesian network.
In the 1990s, research on neural networks with multiple adaptive hidden layers was mostly
overshadowed by SVM.
2.2 Deep Belief Network
In 2006, Hinton et al. proposed the greedy layer-wise training algorithm for DBNs [35],
which was one of the first fast and effective deep learning algorithms. Different deep neural
network structures have been proposed since then and have outperformed traditional methods such as
SVMs in various learning tasks.
2.2.1 Structure
Deep belief network is a probabilistic generative model composed of multiple layers of
stochastic latent variables [32]. Hinton et al. [35] showed that a DBN can be modeled as
stacked Restricted Boltzmann Machines (RBMs) and thus trained layer-wise in a greedy manner.
The joint distribution of the observed vector $x = h^0$ and the hidden layers $h^k$ ($k = 1, \ldots, \ell$) is
\[ P(x, h^1, \ldots, h^\ell) = \left( \prod_{k=0}^{\ell-2} P(h^k \mid h^{k+1}) \right) P(h^{\ell-1}, h^\ell), \tag{2.6} \]
where $P(h^{k-1} \mid h^k)$ is the conditional distribution of the visible units at level $k-1$
given the hidden units at level $k$, and $P(h^{\ell-1}, h^\ell)$ is the visible-hidden joint distribution in the
top-level RBM, as illustrated in Figure 2.2.

Figure 2.3: Structure of an auto-encoder, consisting of an encoder and a decoder. An auto-encoder attempts to reconstruct the input through a low-dimensional bottleneck layer in the middle, which gives a compressed representation (the compressed feature vector) of the original input data.
2.2.2 Learning
Hinton et al. [35] proposed a greedy layer-wise unsupervised
training algorithm for DBNs based on contrastive divergence. The idea was to treat each level as a separate visible-hidden RBM
and maximize the probability of the visible representation, which could be efficiently approximated
by sampling.
The purpose of the unsupervised process was to learn a distribution that could generate
the given data, so labeled data was not required. The method could also be adapted to a supervised
learning setting by adding the labels to the network architecture and then fine-tuning the
network with labeled data. Experiments showed that a slightly fine-tuned DBN could easily
outperform a carefully trained backpropagation net or SVM [33].
2.2.3 DBN and Deep Auto-encoder
An auto-encoder is a feed forward neural network that directly learns a representation for a set
of data. It consists of two parts, the encoder and the decoder, where the encoder learns a
representation and the decoder tries to reconstruct the input (see Figure 2.3).

Figure 2.4: Structure of AlexNet [49]. The CNN consists of 5 convolutional layers, 3 of which are followed by max-pooling, and 3 fully connected layers. Layer sequence shown in the figure: Input, Conv1 (11×11×96×3), Maxpool1, Conv2 (5×5×96×256), Maxpool2, Conv3 (3×3×256×384), Conv4 (3×3×384×384), Conv5 (3×3×384×256), Maxpool5, FC6 (4096), FC7 (4096), FC8 (1000).
Hinton and Salakhutdinov [34] extended the idea of greedy pretraining to deep auto-encoders.
Analogous to DBN training, they proposed to greedily pretrain the weights of the
encoder side, which are initially shared by the decoder, and then fine-tune the auto-encoder
using backpropagation. They showed that pretrained networks had good generalization ability
because most of the weight information came from modeling the data.
2.3 Convolutional Neural Network
Convolutional Neural Network [50][51] is a type of deep, feed forward network that features a
convolution operation for processing data in the form of multi-dimensional arrays, for example,
1D for sequential or temporal signals, 2D for images and 3D for videos or volumetric images.
CNNs were used for handwritten digit recognition [50] back in the 1990s, but they were not widely
adopted until a recent breakthrough [49] in the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) [64].
2.3.1 Structure
The architecture of a typical CNN (see Figure 2.4) consists of layers of different operations,
including convolutional layers, pooling layers and fully-connected layers. Here we explain the
CNN structure with 2D image arrays as data input; there are adaptations to work with other
forms of data.
Convolutional Layer. A convolutional layer takes the feature maps from the previous
layer (the original image if it is the first layer) and performs convolution with a set of weights
called a filter bank. More precisely, let $x$ be an $M \times N \times K$ array of $M \times N$ pixels and $K$
channels, $w \in \mathbb{R}^{H \times W \times K \times K'}$ be $K'$ filters of size $H \times W$ with $K$ input channels, and $b$ the bias
term. The convolution can then be expressed as
\[ y_{k'} = \sum_{k} w_{\cdot,\cdot,k,k'} * x_k + b_{k'}, \tag{2.7} \]
where ∗ is a 2D convolution. Convolution computes each response from a local patch,
whose pixels are usually highly correlated in terms of image statistics; it also responds to a motif
in the same way wherever the motif appears, since the same filter window slides over the entire image. It is therefore effective for
identifying distinctive motifs in images.
It is also worth noting that the multi-channel convolution given by (2.7) is essentially
a weighted sum of the convolutions of all the input channels. Some CNN architectures involve
1 × 1 convolutions [74][31], which are trivial as conventional 2D convolutions but
act as dimensionality reduction operations by combining the input channels.
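To make the channel-mixing view of (2.7) concrete, here is a small NumPy sketch. It is an illustration under stated assumptions rather than an implementation from the report: the function name `conv_layer` is hypothetical, no padding or stride is used, and the filter is applied as a cross-correlation (no flip), which is the usual convention in CNN libraries. The last lines show that a 1 × 1 convolution reduces to a per-pixel linear map over the channels.

```python
import numpy as np

def conv_layer(x, w, b):
    """Multi-channel 'valid' convolution layer in the spirit of (2.7).

    x: (M, N, K) input, w: (H, W, K, K2) filter bank, b: (K2,) bias.
    Returns an array of shape (M - H + 1, N - W + 1, K2).
    """
    M, N, K = x.shape
    H, W, K_in, K2 = w.shape
    assert K == K_in, "filter input channels must match the input"
    y = np.empty((M - H + 1, N - W + 1, K2))
    for i in range(M - H + 1):
        for j in range(N - W + 1):
            patch = x[i:i + H, j:j + W, :]               # local H x W x K patch
            # sum over height, width and input channels for every output channel
            y[i, j, :] = np.tensordot(patch, w, axes=3) + b
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 5, 3))            # toy "image" with K = 3 channels
w = rng.normal(size=(3, 3, 3, 4))         # K' = 4 filters of size 3 x 3
b = rng.normal(size=4)
y = conv_layer(x, w, b)                   # shape (4, 3, 4)

# A 1 x 1 convolution only mixes channels: here it maps 3 channels to 2
w1 = rng.normal(size=(1, 1, 3, 2))
y1 = conv_layer(x, w1, np.zeros(2))
y1_as_matmul = x @ w1[0, 0]               # the same result as a per-pixel matrix product
```

The explicit loops are kept for readability; a practical implementation would vectorize them or call a deep learning library.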
Nonlinear Activation. The local weighted sum obtained from a convolutional layer is
passed through a nonlinearity that is applied element-wise, e.g. the Rectified Linear Unit (ReLU)
\[ y_{ijk} = \max\{0, x_{ijk}\}. \tag{2.8} \]
Glorot et al. [24] showed that ReLU performed better in neural networks than sigmoid or
hyperbolic tangent because it was free of the gradient saturation problem and preserved sparsity
of the signal.
Pooling Layer. Pooling operations combine a local patch into one output value as a
means of downsampling. The common practice is max-pooling that takes the maximum value
of a patch from each channel individually, which has better performance than sum-pooling.
There are also attempts to get rid of pooling by increasing the stride of the convolutional
layers [71].
2.3.2 Normalizations and Overfitting Prevention
Deep CNN training faces two major numerical difficulties: speed and
overfitting. Since training data is organized in mini-batches in stochastic gradient methods,
the network sees data with a different distribution in every batch, a phenomenon known
as covariate shift [67] that slows down convergence. Domain adaptation [41]
is usually applied to alleviate the effect. In the context of CNN learning, it has long been
known that network training converges faster if the inputs are whitened, i.e. transformed to have zero
means and unit variances and decorrelated [53][81]. However, even if the inputs are whitened,
the intermediate responses in the middle layers still experience changes in distribution due to
the weight updates in training, which is called internal covariate shift. Earlier models attempted
to normalize the middle layers with local response normalization [49], but it
required manual tuning, did not adapt well to different datasets, and was dropped by later
models.
Ioffe and Szegedy proposed batch normalization (BN) [39], which learns scale and shift
parameters along with the network training, so that the optimization is fully aware of and able
to accommodate the internal normalizations. They argued that in order to preserve the
nonlinear expressive power of the activations, the learned transforms should still be able to
represent the identity so that the normalization could be reverted if desired. Therefore, for each
activation $x$, BN learns a pair of parameters $\gamma, \beta$ such that
\[ \hat{x} = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}}, \tag{2.9} \]
\[ y = \gamma \hat{x} + \beta. \tag{2.10} \]
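The training-time forward pass of (2.9) and (2.10) can be sketched in a few lines of NumPy. This is a simplified illustration, not the full method of [39]: the function name `batch_norm` is hypothetical, the small constant `eps` is an assumption added for numerical stability, and the running statistics used at inference time are omitted. The last lines illustrate the identity argument: choosing the scale and shift to match the batch statistics undoes the normalization.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x: (batch, features); gamma, beta: (features,).
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # (2.9), with eps for stability
    return gamma * x_hat + beta               # (2.10)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 5))   # toy mini-batch

# With gamma = 1 and beta = 0 the output has roughly zero mean and unit variance
y = batch_norm(x, np.ones(5), np.zeros(5))

# Setting gamma and beta to the batch statistics recovers the input (identity transform)
eps = 1e-5
y_identity = batch_norm(x, np.sqrt(x.var(axis=0) + eps), x.mean(axis=0), eps=eps)
```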
To address the overfitting problem, the easiest and most common method is to augment
the dataset via label preserving transforms [49][68][12][13], e.g. random translation, flip, crop
and pixel intensity distortion. Srivastava et al. proposed dropout [72], which randomly drops neurons
during training to prevent co-adaptation, and showed that it improves neural
network performance. Wan et al. generalized dropout to DropConnect [77], which sets a randomly
selected subset of weights to zero, enabling finer control over which connections to drop. Bulo
et al. proposed dropout distillation [7] to improve the inference accuracy of a dropout network
by finding the best dropout configuration via another stochastic optimization. On the other
hand, maxout networks [26] and multi-bias non-linear activation [54] attempted to modify the
receptive behavior of the activations to generate more effective responses, thereby tackling the
overfitting issues.
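As an illustration of the dropout idea, the sketch below uses the "inverted dropout" formulation that is common in practice, which rescales the surviving activations during training; this is an assumed variant, since the formulation in [72] instead rescales the weights at test time. The function name `dropout` and the drop probability are hypothetical choices.

```python
import numpy as np

def dropout(x, p, rng, train=True):
    """Inverted dropout: zero each unit with probability p during training."""
    if not train:
        return x                        # inference: all units are kept as they are
    keep = rng.random(x.shape) >= p     # Boolean mask, True with probability 1 - p
    return x * keep / (1.0 - p)         # rescale so the expected activation is unchanged

rng = np.random.default_rng(0)
x = np.ones(1000)
y_train = dropout(x, p=0.5, rng=rng)                # entries are either 0 or 2
y_test = dropout(x, p=0.5, rng=rng, train=False)    # unchanged at inference
```

DropConnect can be sketched in the same way by applying the random mask to a weight matrix instead of to the activations.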
2.3.3 Notable Architectures
AlexNet, introduced by Krizhevsky et al. [49], was the first deep CNN architecture to gain huge
popularity in the computer vision community. It was one of the first ‘‘deep’’ networks to
use stacked convolutional layers and gained a remarkable performance boost from ReLU
activations, GPU-accelerated training and dropout.
VGGNet, proposed by Simonyan and Zisserman [69], featured a deeper network of
16 or 19 convolutional/fully-connected layers and demonstrated that deeper networks did deliver
better performance. The network had a homogeneous structure with only 3 × 3 convolutions and
2 × 2 pooling operations. They also showed that fully-connected layers could be removed
without hurting the network performance, so that the number of parameters could be reduced
drastically.
GoogLeNet from Szegedy et al. [74] took inspiration from Network in Network [55]
and replaced a single convolutional layer with an Inception module (see Figure 2.5). The Inception
module attempted to pick up sparsely distributed features computed from previous filters with
convolutions of small filter size, and concatenated the responses from a few different filters to
obtain multi-scale features. Because of the small filters used, the number of parameters was
cut down from approximately 60 million in AlexNet to 4 million, despite the much more
sophisticated 22-layer structure.

Figure 2.5: The Inception module of GoogLeNet [74]. Convolutional filters of 3 different sizes (1×1, 3×3 and 5×5), along with a 3×3 max-pooling, are applied to the same feature maps from the previous layer to collect feature responses of different spread. The resulting feature maps are concatenated into a long tensor.
ResNet, proposed by He et al. [31], went one step further in deepening CNN structures,
with a 152-layer network on ImageNet and even reaching 1000 layers on the CIFAR-10
dataset [48]. They pointed out that deeper networks had a degradation problem that caused higher
training error compared to shallower networks, and that it was not caused by overfitting. They
introduced shortcut connections (see Figure 2.6) that passed identity mappings over multiple
layers, so that the multilayer filters were essentially learning the residual values, hence the
name. They showed empirically that this construction sped up the convergence and improved
network performance.
2.4 Recurrent Neural Network
Recurrent Neural Network is a type of neural network with directed cycles connecting the units.
The cycle connections serve as internal states that are updated with sequential inputs, which
resembles an internal memory and allows RNN to model temporal behavior.
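Figure 2.6: The skip connection in ResNet [31]. The input x passes through two convolutional layers to give F(x), and the identity shortcut adds x back to produce F(x) + x before the ReLU.
Figure 2.7: An RNN and its unfolded form, with inputs x_t, hidden state s, outputs y_t and shared weights U, W and V.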
2.4.1 Structure
A typical RNN (see Figure 2.7), once unfolded, can be regarded as a very deep feed forward
network in which the weights are shared among all hidden layers. Let $x_t$ be the input vector, $s_t$ the
hidden state with $s_{-1} = 0$, and $y_t$ the output, for $t = 0, 1, 2, \ldots$. The forward pass of a vanilla RNN
works as follows:
\[ s_t = \tanh(U x_t + W s_{t-1}), \tag{2.11} \]
\[ y_t = V s_t, \tag{2.12} \]
where U, V and W are the weight parameters of the RNN. Bias terms may also be present,
but they can be absorbed into the weight matrices, and with a slight abuse of symbols, we still
refer to them as U, V and W.
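The recurrence (2.11) and the readout (2.12) translate directly into a loop over time steps. The sketch below is a minimal NumPy illustration; the function name `rnn_forward`, the dimensions and the random weights are assumptions made for the example, and bias terms are omitted as in the text.

```python
import numpy as np

def rnn_forward(xs, U, W, V):
    """Forward pass of a vanilla RNN over a sequence xs of shape (T, d_in)."""
    s = np.zeros(W.shape[0])             # initial hidden state s_{-1} = 0
    states, outputs = [], []
    for x in xs:
        s = np.tanh(U @ x + W @ s)       # hidden state update, as in (2.11)
        states.append(s)
        outputs.append(V @ s)            # output at this time step
    return np.array(states), np.array(outputs)

rng = np.random.default_rng(0)
T, d_in, d_s, d_out = 6, 3, 5, 2          # hypothetical sizes
U = rng.normal(scale=0.5, size=(d_s, d_in))
W = rng.normal(scale=0.5, size=(d_s, d_s))
V = rng.normal(scale=0.5, size=(d_out, d_s))
xs = rng.normal(size=(T, d_in))

states, outputs = rnn_forward(xs, U, W, V)
```

The same three matrices are reused at every time step, which is what makes the unfolded network a deep feed forward network with shared weights.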
RNNs can be stacked upon each other to create deeper structures. Bidirectional
RNN [65] uses two RNNs, one processing the sequence from left to right, the other from right
to left, to model the sequence exploiting both the past and the future context. Furthermore,
deep bidirectional RNNs have already been used in speech recognition [2], enabling much
higher learning capacity.
2.4.2 Learning
An RNN is trained via the Backpropagation Through Time (BPTT) algorithm, which is essentially
standard backpropagation acting on the unfolded RNN. A BPTT update involves all the previous inputs
since the weights are shared. For instance, for some error function $E_n$ of the $n$-th output,
the gradient with respect to the weight matrix $W$ is
\[ \frac{\partial E_n}{\partial W} = \frac{\partial E_n}{\partial y_n} \frac{\partial y_n}{\partial s_n} \frac{\partial s_n}{\partial W}. \tag{2.13} \]
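Note that $s_n$ depends on all preceding states, all of which depend on $W$. Thus we can unfold it
according to the chain rule as follows:
\[ \frac{\partial E_n}{\partial W} = \sum_{k=0}^{n} \frac{\partial E_n}{\partial y_n} \frac{\partial y_n}{\partial s_n} \frac{\partial s_n}{\partial s_k} \frac{\partial s_k}{\partial W}. \tag{2.14} \]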
This training method has been known to suffer from the exploding or vanishing gradient
problem [5][61] and thus standard RNN has difficulties modeling long-term dependencies.
This problem can be mitigated by careful choice of initial weights, using ReLU instead of
hyperbolic tangent or other normalization methods; alternatively, variants of RNNs have been
proposed to explicitly encode the internal memory of the neural network.
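The vanishing gradient effect can be observed numerically on the factor ∂sn/∂sk in (2.14), which is a product of per-step Jacobians diag(1 − st²)·W for the tanh recurrence. The following sketch is an illustration under assumptions that are not from the report: the state size, the random inputs and the rescaling of W to a largest singular value of 0.9 are arbitrary choices. Since every diagonal factor has entries in [0, 1], the spectral norm of the product is bounded by 0.9 raised to the number of steps, so it decays exponentially.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 30                               # hypothetical state size and sequence length
W = rng.normal(size=(d, d))
W *= 0.9 / np.linalg.norm(W, 2)            # rescale so the largest singular value is 0.9
U = rng.normal(size=(d, d))

s = np.zeros(d)                            # state s_0
J = np.eye(d)                              # accumulates d s_t / d s_0
norms = []
for t in range(1, T + 1):
    s = np.tanh(U @ rng.normal(size=d) + W @ s)
    # chain rule: d s_t / d s_{t-1} = diag(1 - s_t^2) W
    J = np.diag(1.0 - s ** 2) @ W @ J
    norms.append(np.linalg.norm(J, 2))     # spectral norm of d s_t / d s_0
```

With a largest singular value above 1 the same product can instead grow, which corresponds to the exploding gradient case.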
2.4.3 Long Short Term Memory and Gated Recurrent Unit
Figure 2.8: An LSTM unit and a GRU unit. Illustration from [1]. (a) LSTM: the 3 nonlinear gates σ(·), from left to right, are the forget, input and output gates. (b) A GRU unit contains only 2 gates.

Long Short Term Memory (LSTM) networks were proposed in 1997 by Hochreiter and
Schmidhuber [36] to resolve the vanishing gradient problem. The idea is to introduce a cell
state $c_t$, manipulated by several gates as follows (see Figure 2.8a):
\[ i = \sigma(x_t U^i + h_{t-1} W^i), \tag{2.15} \]
\[ f = \sigma(x_t U^f + h_{t-1} W^f), \tag{2.16} \]
\[ o = \sigma(x_t U^o + h_{t-1} W^o), \tag{2.17} \]
\[ g = \tanh(x_t U^g + h_{t-1} W^g), \tag{2.18} \]
\[ c_t = c_{t-1} \odot f + g \odot i, \tag{2.19} \]
\[ h_t = \tanh(c_t) \odot o, \tag{2.20} \]
where ⊙ is the element-wise Hadamard product, σ is the sigmoid function, i, f and o are the
input, forget and output gates respectively, g is the candidate cell update, ct is the internal cell state, and ht is the output
hidden state. The input, forget and output gates each pass the input and the hidden state through the
sigmoid, which squashes the values into [0, 1]. Then the cell state is updated via (2.19) with explicit
control of how much to remember or forget. Finally the output is generated through the output
gate o.
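A single LSTM step following (2.15) to (2.20) can be written as below. This is a minimal NumPy sketch that keeps the row-vector convention of the equations (input on the left, weights on the right); the function name `lstm_step`, the parameter dictionary `P` and the sizes are hypothetical, and bias terms are omitted.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, P):
    """One LSTM step; x, h_prev and c_prev are row vectors."""
    i = sigmoid(x @ P["Ui"] + h_prev @ P["Wi"])   # input gate
    f = sigmoid(x @ P["Uf"] + h_prev @ P["Wf"])   # forget gate
    o = sigmoid(x @ P["Uo"] + h_prev @ P["Wo"])   # output gate
    g = np.tanh(x @ P["Ug"] + h_prev @ P["Wg"])   # candidate cell update
    c = c_prev * f + g * i                        # keep part of the old cell, add new content
    h = np.tanh(c) * o                            # expose part of the cell as hidden state
    return h, c

d_in, d_h = 3, 4                                  # hypothetical sizes
rng = np.random.default_rng(0)
P = {kind + gate: rng.normal(scale=0.1, size=(d_in if kind == "U" else d_h, d_h))
     for kind in "UW" for gate in "ifog"}

# Run the cell over a short random sequence
h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x, h, c, P)

# With all weights at zero every gate equals 0.5 and the candidate is 0,
# so the cell state is simply halved
P_zero = {k: np.zeros_like(v) for k, v in P.items()}
h0, c0 = lstm_step(np.ones(d_in), np.zeros(d_h), np.full(d_h, 2.0), P_zero)
```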
There are also different extensions of the LSTM structure [27][44], e.g. peephole
connections [21] that take the cell state into account in the gating. The Gated Recurrent Unit (GRU) [11] is a
simpler variation of LSTM with two gates, a reset gate $r$ and an update gate $z$ (see Figure 2.8b):
\[ z = \sigma(x_t U^z + h_{t-1} W^z), \tag{2.21} \]
\[ r = \sigma(x_t U^r + h_{t-1} W^r), \tag{2.22} \]
\[ \tilde{h}_t = \tanh(x_t U^h + (h_{t-1} \odot r) W^h), \tag{2.23} \]
\[ h_t = (1 - z) \odot h_{t-1} + z \odot \tilde{h}_t. \tag{2.24} \]
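The GRU step can be sketched in the same style. The sketch assumes that the new hidden state interpolates between the previous state and a candidate state computed from the reset-gated previous state, with the update gate z weighting the candidate; some references swap the roles of z and 1 − z. The function name `gru_step`, the parameter dictionary `P` and the sizes are hypothetical, and bias terms are omitted.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, P):
    """One GRU step with row-vector inputs."""
    z = sigmoid(x @ P["Uz"] + h_prev @ P["Wz"])             # update gate
    r = sigmoid(x @ P["Ur"] + h_prev @ P["Wr"])             # reset gate
    h_cand = np.tanh(x @ P["Uh"] + (h_prev * r) @ P["Wh"])  # candidate state
    return (1.0 - z) * h_prev + z * h_cand                  # assumed interpolation convention

d_in, d_h = 3, 4                                            # hypothetical sizes
rng = np.random.default_rng(0)
P = {kind + gate: rng.normal(scale=0.1, size=(d_in if kind == "U" else d_h, d_h))
     for kind in "UW" for gate in "zrh"}

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h = gru_step(x, h, P)

# With all weights at zero, z = 0.5 and the candidate is 0, so the state is halved
P_zero = {k: np.zeros_like(v) for k, v in P.items()}
h_half = gru_step(np.ones(d_in), np.full(d_h, 0.8), P_zero)
```

Compared with the LSTM, there is no separate cell state and no output gate, so a GRU has fewer parameters for the same hidden size.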
Chapter 3
Multi-view and Multi-label Learning
The sophisticated nonlinear structures of deep neural networks have achieved great success in
modeling rich data of high complexity, and also shown great potential of smooth adaptation to
specific problems. On the other hand, due to the intricacies of the training process, exploiting
data and problem structures is essential for building effective models. Recent years have seen
an abundance of work addressing the following 4 ‘‘multi-’’ learning objectives:
• Multi-view learning (MVL) [73][83], or interchangeably called multimodal learning,
refers to learning from data represented by multiple distinct feature sets, e.g. mining text
data with accompanying images;
• Multi-instance learning (MIL) [3] refers to learning predictors from training data in
bags, where each bag contains multiple samples/feature vectors called instances. The
individual instances carry their own attributes/labels, some of which give rise to the bag
labels, but they are not necessarily consistent, e.g. object recognition with images from
multiple perspectives;
• Multi-label learning (MLL) [70] refers to learning from a set of instances, each of
which can belong to multiple classes that are not mutually exclusive, e.g. predicting
attribute tags of items on e-commerce websites or Yelp;
• Multi-task learning (MTL) [19] refers to learning a low dimensional representation
of the data to share among a set of related tasks, e.g. extracting image features for
recognition and segmentation simultaneously.
These objectives have correlated problem formulations and may share some methodologies
for working with structures. For example, MIL can be regarded as a special case of
MVL where each instance reveals a view of the entire bag; if we think of MTL as learning
a predictor function for each individual task, MLL can be seen as a set of tasks, each
predicting one of the labels of interest.
In this chapter we introduce some deep MVL and MLL frameworks, which roughly
fall into 2 categories: (a) modifying the network architecture to work with specific structures, and
(b) using the network model as a feature extractor to obtain a representation of the data, and
applying other techniques to perform multi-view or multi-label learning on it. Both approaches benefit from the
deep representations and show the strength of deep neural networks.
3.1 Multi-view Learning
3.1.1 Deep Multi-view Representation Learning
A common technique to process heterogeneous data types is to map them to a common low-dimensional
feature space. In deep MVL problems, there are mainly two types of approaches [78]:
auto-encoder based and canonical correlation analysis (CCA) [37] based.
Auto-encoder based methods attempt to learn a representation that best reconstructs the
inputs. Ngiam et al. [59] proposed to derive, from a view that was always available
at test time, a representation from which the other views could be reconstructed. This constituted an auto-encoder
network with one shared encoder and a separate decoder for each view, and was hence called the
split auto-encoder. The objective of the split auto-encoder can be expressed as
\[ \min_{W_f, W_p, W_q} \frac{1}{N} \sum_{i=1}^{N} \left( \| x_i - p(f(x_i)) \|^2 + \| y_i - q(f(x_i)) \|^2 \right), \tag{3.1} \]
where $x_i$, $y_i$ are data of different views, $f$ is the encoder network, $p$, $q$ are the decoder networks, and
$W_\cdot$ are the network parameters.
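The split auto-encoder objective (3.1) can be evaluated with a few lines of NumPy. The sketch below is an illustration only: the function name `split_autoencoder_loss`, the single-layer encoder and decoders, and the random data are assumptions, whereas the networks in [59] are deep. The point is that both views are reconstructed from a code computed from the first view alone.

```python
import numpy as np

def split_autoencoder_loss(X, Y, f, p, q):
    """Objective (3.1). X: (N, dx) and Y: (N, dy); f, p, q map batches of rows."""
    Z = f(X)                                   # shared code, computed from view x only
    err_x = np.sum((X - p(Z)) ** 2, axis=1)    # reconstruction error of view x
    err_y = np.sum((Y - q(Z)) ** 2, axis=1)    # reconstruction error of view y
    return np.mean(err_x + err_y)

rng = np.random.default_rng(0)
N, dx, dy, dz = 100, 6, 4, 3                   # hypothetical sizes
X = rng.normal(size=(N, dx))
Y = rng.normal(size=(N, dy))
Wf, Wp, Wq = (rng.normal(scale=0.1, size=s) for s in [(dx, dz), (dz, dx), (dz, dy)])

loss = split_autoencoder_loss(
    X, Y,
    f=lambda a: np.tanh(a @ Wf),               # toy one-layer encoder
    p=lambda z: z @ Wp,                        # toy linear decoder for view x
    q=lambda z: z @ Wq,                        # toy linear decoder for view y
)

# Sanity check: if y is a linear function of x and the networks are chosen to match,
# both views are reconstructed exactly and the loss is zero
A = rng.normal(size=(dx, dy))
perfect = split_autoencoder_loss(X, X @ A, lambda a: a, lambda z: z, lambda z: z @ A)
```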
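Andrew et al. proposed a deep extension of CCA called DCCA [4]. In DCCA, a neural network
is used for each view of the data as a per-view feature extractor, and CCA is performed on
the extracted features:
\[
\begin{aligned}
\max_{W_f, W_g, U, V} \quad & \frac{1}{N} \operatorname{tr}\!\left( U^\top f(X)\, g(Y)^\top V \right) \\
\text{s.t.} \quad & U^\top \left( \frac{1}{N} f(X) f(X)^\top + r_x I \right) U = I, \\
& V^\top \left( \frac{1}{N} g(Y) g(Y)^\top + r_y I \right) V = I, \\
& u_i^\top f(X)\, g(Y)^\top v_j = 0, \quad \text{for } i \neq j,
\end{aligned} \tag{3.2}
\]
where $U = [u_1, \ldots, u_L]$ and $V = [v_1, \ldots, v_L]$ are the CCA projection directions and
$r_x, r_y > 0$ are regularization parameters.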
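For fixed network outputs, the inner problem of (3.2) over $U$ and $V$ has a closed-form solution through a singular value decomposition of the whitened cross-covariance. The sketch below illustrates that step only; it assumes the features are already mean-centered, and the function names and toy data are hypothetical. Training the networks themselves is not shown.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(w ** -0.5) @ Q.T

def cca_on_features(Fx, Fy, L, rx=1e-4, ry=1e-4):
    """Solve (3.2) over U, V for fixed features.

    Fx, Fy: f(X) and g(Y) stored column-wise, shapes (dx, N) and (dy, N),
    assumed to be mean-centered already.
    Returns U (dx, L), V (dy, L) and the attained objective value.
    """
    N = Fx.shape[1]
    Sxx = Fx @ Fx.T / N + rx * np.eye(Fx.shape[0])   # regularized covariance of view 1
    Syy = Fy @ Fy.T / N + ry * np.eye(Fy.shape[0])   # regularized covariance of view 2
    Sxy = Fx @ Fy.T / N                              # cross-covariance
    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    A, s, Bt = np.linalg.svd(Kx @ Sxy @ Ky)          # whitened cross-covariance
    U = Kx @ A[:, :L]
    V = Ky @ Bt.T[:, :L]
    return U, V, float(s[:L].sum())

# Toy features: two shared latent signals plus independent noise dimensions.
rng = np.random.default_rng(0)
N = 500
shared = rng.normal(size=(2, N))
Fx = np.vstack([shared, rng.normal(size=(1, N))])
Fy = np.vstack([shared[::-1] + 0.1 * rng.normal(size=(2, N)), rng.normal(size=(2, N))])
Fx -= Fx.mean(axis=1, keepdims=True)
Fy -= Fy.mean(axis=1, keepdims=True)
U, V, total_corr = cca_on_features(Fx, Fy, L=2)
```

The returned value is the sum of the top $L$ singular values, which equals the trace objective at the optimum.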
Wang et al. combined these 2 objectives and proposed deep canonically correlated
autoencoders (DCCAE) [78], formulated as
\[
\begin{aligned}
\min_{W_f, W_g, W_p, W_q, U, V} \quad & -\frac{1}{N} \operatorname{tr}\!\left( U^\top f(X)\, g(Y)^\top V \right) \\
& + \frac{\lambda}{N} \sum_{i=1}^{N} \left( \| x_i - p(f(x_i)) \|^2 + \| y_i - q(f(x_i)) \|^2 \right) \\
\text{s.t.} \quad & \text{the constraints in (3.2)},
\end{aligned} \tag{3.3}
\]
where $\lambda > 0$ is a trade-off parameter. They argued that DCCAE offers a trade-off between
input-feature and feature-feature mappings.
Chang et al. [9] generalized the DCCA framework beyond matched multi-view data records to
explicitly handle heterogeneous networks, where similarity connections may exist between any
pair of data items, whether of the same type or not. They proposed heterogeneous network
embedding to jointly optimize the deep neural network modules and the linear embeddings of
all views. For example, for a set of image and text data, inputs might come in the form of
image-image, image-text or text-text tuples, and not necessarily as matching two-view pairs.
They used an AlexNet based network to extract image features and a 2-layer ANN to process
text data, and then defined an extra linear embedding for each feature vector
\[
\tilde{p}(X) = U^\top p(X), \tag{3.4}
\]
\[
\tilde{q}(z) = V^\top q(z), \tag{3.5}
\]
where $X$, $z$ are the image and text data, $p(\cdot)$, $q(\cdot)$ are the respective neural networks,
and $U$, $V$ are the linear embeddings cascaded on top of them. Denoting the parameters of
$p(\cdot)$ as $W_p$ and those of $q(\cdot)$ as $W_q$, the objective can be expressed as
\[
\begin{aligned}
\min_{W_p, W_q} \quad & \frac{1}{N_{II}} \sum_{i,j} L(\tilde{p}(X_i), \tilde{p}(X_j))
 + \frac{\lambda_1}{N_{TT}} \sum_{i,j} L(\tilde{q}(z_i), \tilde{q}(z_j)) \\
& + \frac{\lambda_2}{N_{IT}} \sum_{i,j} L(\tilde{p}(X_i), \tilde{q}(z_j)),
\end{aligned} \tag{3.6}
\]
where $L(\cdot, \cdot)$ is a log loss function and $N_{II}$, $N_{TT}$, $N_{IT}$ are the numbers of image-image,
text-text and image-text links in the heterogeneous graph.
3.1.2 Network Co-training
Split auto-encoders and DCCA methods both aim to bridge different modalities of data and
evaluate their correlations, but at times the focus is on making inferences from multi-view
data, and an end-to-end joint network model is required. In general there are 2 ways to set
up joint co-training networks: multiple networks with shared top layers, and network
cascading.
Figure 3.1: Structure of Deep Multi-view Hashing (DMVH) network [45].
Shared top layers in a joint network are usually one or several fully connected layers
that take the concatenated outputs of the individual networks. They can be considered a
top-level ANN that adds nonlinearity to the features extracted from the multiple views. Kang
et al. proposed deep multi-view hashing [45], which took two feature vectors, e.g. HOG [56]
and GIST [60], into a 4-layer network. Both hidden layers contained 3 groups of units, one
for each view as direct connections from input to output and the other fully connected to
the preceding layer to encode common characteristics (see Figure 3.1). Elkahky et al.
proposed Multi-view DNN [18] for recommendation systems to process user and webpage
features. They suggested using multiple neural networks to compute compact features of the
same length from the multi-view data, then computing the cosine similarities across the
views, and finally computing the softmax score of the cosine similarities. Wu et al. [82]
proposed HMM-FNN for dynamic gesture recognition, incorporating 2 HMM layers, one for hand
posture features and the other for trajectory features, which were merged into fully
connected layers. Ha et al. [30] proposed to use multiple RNNs to compute features from a
set of description text data on e-commerce websites, and then to run 2 fully connected
layers and a softmax output on the concatenated features.
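The cosine-similarity-plus-softmax scoring described for Multi-view DNN can be illustrated as follows. This is a minimal sketch, assuming the per-view networks have already produced feature vectors of equal length; the function name, the smoothing factor `gamma` and the toy vectors are assumptions of this example rather than details taken from [18].

```python
import numpy as np

def cosine_softmax(user_vec, item_vecs, gamma=1.0):
    """Softmax over cosine similarities between one user feature and candidate item features.

    user_vec: shape (d,); item_vecs: shape (M, d).
    gamma is a smoothing factor (an assumption of this sketch).
    """
    u = user_vec / np.linalg.norm(user_vec)
    V = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    sims = V @ u                        # cosine similarity per candidate
    z = gamma * sims
    z = z - z.max()                     # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return sims, probs

user = np.array([1.0, 0.0, 1.0])
items = np.array([[2.0, 0.0, 2.0],
                  [0.0, 1.0, 0.0],
                  [1.0, 1.0, 0.0]])
sims, probs = cosine_softmax(user, items)
```

Because cosine similarity ignores vector length, the first candidate, which is a scaled copy of the user vector, receives the highest probability.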
Network cascading is often used when the multi-view data contain a pivot view that the
other views depend on, so that the cascaded networks reflect these dependencies, most
notably in the image captioning problem. Show and Tell, proposed by Vinyals et al. [76], for
example, fed CNN visual features into an LSTM to generate complete sentences describing the
input images. Donahue et al. extended the model to stacked LSTMs [16], and Johnson et al.
added a localization layer to better capture regional features [43]. In the multimodal RNN
(m-RNN) proposed by Mao et al. [57], the CNN features were joined with the outputs of a
recurrent layer and a fully connected embedding layer from the text inputs (see Figure 3.2).
Kiros et al. proposed to first learn a joint image-sentence embedding with a CNN-LSTM
encoder, and then to process the embedding vector with a novel neural language model
decoder [47].
Figure 3.2: Structure of m-RNN network [57]. Diagram labels: w_{t-1}, w_t, Embedding, Recurrent, Joint Layer, SoftMax, CNN Features, Image.
3.2 Multi-label Learning
It seems natural to apply neural network models to MLL problems, since the $N$-way output
of the network may act as $N$ label classifiers. However, such naïve models ignore label
correlations and are therefore suboptimal. Furthermore, multi-label datasets often contain
labels that are very rare. Such severely imbalanced data are challenging to train on and
often require special processing.
In this section we first introduce some basic practices for incorporating neural networks
in MLL problems, then explain some loss functions for MLL, and finally present some deep
MLL models.
3.2.1 Going Deep with Multi-label Learning
One of the earliest neural network based MLL methods is BP-MLL for text categorization,
proposed by Zhang and Zhou [84]. It features a fully connected ANN with 1 hidden layer, 1
output layer that encodes the likelihood of each label, and a pairwise error function for
training
\[
E = \sum_{i=1}^{m} E_i = \sum_{i=1}^{m} \frac{1}{|C_i^+| \, |C_i^-|}
\sum_{(k,l) \in C_i^+ \times C_i^-} \exp\left( -(c_{ik} - c_{il}) \right), \tag{3.7}
\]
where $C_i^+$ is the positive label set of sample $i$, $C_i^-$ is the negative set, and $c_{ik}$, $c_{il}$
are outputs of the neural network. The pairwise error penalizes negative labels that receive
a higher output likelihood than positive ones. However, it was later shown that this form of
error function, along with other convex loss functions, is not consistent with the
non-convex, discontinuous rank loss [20][8], and empirically BP-MLL did not exhibit
significantly better performance than non-NN approaches such as k Nearest Neighbors [85].
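The per-sample term $E_i$ of (3.7) can be computed directly by forming every positive-negative pair of outputs. The sketch below is illustrative only; the function name and the toy outputs are hypothetical, and it assumes each sample has at least one positive and one negative label.

```python
import numpy as np

def bpmll_error(c, y):
    """Pairwise exponential error E_i of (3.7) for one sample.

    c: network outputs, shape (K,); y: binary label vector, shape (K,).
    Assumes the sample has at least one positive and one negative label.
    """
    pos = c[y == 1]
    neg = c[y == 0]
    diffs = pos[:, None] - neg[None, :]     # c_ik - c_il for every (k, l) pair
    return float(np.exp(-diffs).sum() / (len(pos) * len(neg)))

y = np.array([1, 0, 1, 0])
good = bpmll_error(np.array([0.9, -0.8, 0.7, -0.9]), y)   # positives score higher
bad = bpmll_error(np.array([-0.9, 0.8, -0.7, 0.9]), y)    # negatives score higher
flat = bpmll_error(np.zeros(4), y)                        # no separation at all
```

With no separation between outputs every pair contributes exp(0), so the error equals 1; correct orderings push it below 1 and incorrect ones above.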
Based on BP-MLL, Nam et al. incorporated modern deep learning practices and tricks into a
large-scale multi-label text classification method [58] and achieved state-of-the-art
performance at the time. They proposed to apply:
• Cross entropy
\[
E = \sum_{i} E_i = \sum_{i,k} -\left( y_{ik} \log c_{ik} + (1 - y_{ik}) \log (1 - c_{ik}) \right), \tag{3.8}
\]
where $c_{ik}$ is the activation and $y_{ik} \in \{0, 1\}$ is the target for label $k$ of sample $i$,
instead of error function (3.7);
• ReLU instead of hyperbolic tangent activations;
• Dropout;
• AdaGrad optimization [17].
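Three of the ingredients in the list above are simple enough to write out. The following is a hedged NumPy sketch of the cross entropy (3.8), the ReLU activation and a single AdaGrad update; the function names, learning rate and toy values are assumptions of this example, and dropout and the full training loop are omitted.

```python
import numpy as np

def multilabel_cross_entropy(c, y, eps=1e-12):
    """Cross entropy (3.8) summed over samples i and labels k.

    c: activations in (0, 1), shape (n, K); y: binary targets, same shape.
    """
    c = np.clip(c, eps, 1.0 - eps)      # avoid log(0)
    return float(-(y * np.log(c) + (1 - y) * np.log(1 - c)).sum())

def relu(a):
    """Rectified linear unit, applied element-wise."""
    return np.maximum(a, 0.0)

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update: per-parameter steps shrink as squared gradients accumulate."""
    accum = accum + grad ** 2
    w = w - lr * grad / (np.sqrt(accum) + eps)
    return w, accum

y = np.array([[1, 0, 1], [0, 0, 1]])
uniform = multilabel_cross_entropy(np.full((2, 3), 0.5), y)
w, accum = adagrad_step(np.zeros(2), np.array([1.0, -2.0]), np.zeros(2))
```

On the first step AdaGrad divides each gradient by its own magnitude, so both parameters move by roughly the learning rate regardless of gradient scale.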
An alternative construction of the output layer was suggested by Huang et al. [38]. They
used a 5-layer pre-trained DBN model and set up a pair of output units for each label $l$,
$c_{\cdot l}$ for positive and $\bar{c}_{\cdot l}$ for negative, and computed a softmax over the pair to obtain
the probability of having the label:
\[
p_{\cdot l} = \frac{\exp(c_{\cdot l})}{\exp(c_{\cdot l}) + \exp(\bar{c}_{\cdot l})}, \tag{3.9}
\]
and they also used cross entropy as the error function.
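A two-way softmax over a positive and a negative unit is algebraically the same as a sigmoid applied to the difference of the two outputs. The sketch below checks this numerically; the function name and the toy outputs are hypothetical.

```python
import numpy as np

def pair_softmax(c_pos, c_neg):
    """Probability of each label from its (positive, negative) output pair, as in (3.9)."""
    m = np.maximum(c_pos, c_neg)        # shift by the larger output for numerical stability
    e_pos = np.exp(c_pos - m)
    e_neg = np.exp(c_neg - m)
    return e_pos / (e_pos + e_neg)

c_pos = np.array([2.0, -1.0, 0.0])
c_neg = np.array([0.0, 1.0, 0.0])
p = pair_softmax(c_pos, c_neg)
sigmoid_of_diff = 1.0 / (1.0 + np.exp(-(c_pos - c_neg)))
```

The paired construction therefore has the same expressive form as a single sigmoid unit per label, with one redundant degree of freedom.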
3.2.2 Multi-label Ranking Loss
Gong et al. [25] explored several loss functions for MLL, which they referred to as the
multilabel image annotation problem. They adopted the AlexNet structure and experimented
with 3 loss functions. The first one was a softmax loss inspired by TagProp [29]. They first
computed the softmax
\[
p_{\cdot l} = \frac{\exp(c_{\cdot l})}{\sum_k \exp(c_{\cdot k})} \tag{3.10}
\]
as the posterior probability of each label, and set the ground truth as a label vector $y_\cdot$
such that
\[
y_{\cdot l} = \frac{I_l}{c_+}, \tag{3.11}
\]
where $c_+$ is the number of positive labels and $I_l$ is an indicator function that equals 1 when
label $l$ is positive and 0 otherwise. Finally they minimized the KL-divergence between the
predictions and the ground truths
\[
E = \frac{1}{n} \sum_{i=1}^{n} D_{KL}(p_i \,\|\, y_i). \tag{3.12}
\]
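The softmax loss can be sketched as below. One assumption needs flagging: the divergence is evaluated as $\sum_l y_l \log(y_l / p_l)$, with the ground truth as the reference distribution, because the zero entries of $y$ would make the opposite direction unbounded. The function names and toy scores are hypothetical.

```python
import numpy as np

def softmax(c):
    """Row-wise softmax (3.10) with the usual max-shift for stability."""
    z = c - c.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_kl_loss(c, labels):
    """Mean divergence between normalized label vectors (3.11) and softmax outputs.

    c: scores, shape (n, K); labels: binary matrix, shape (n, K), at least one positive per row.
    Direction sum_l y_l * log(y_l / p_l) is an assumption of this sketch.
    """
    p = softmax(c)
    y = labels / labels.sum(axis=1, keepdims=True)   # y_l = I_l / c_+
    ratio = np.where(y > 0, y / p, 1.0)              # zero-probability labels contribute 0
    return float((y * np.log(ratio)).sum(axis=1).mean())

labels = np.array([[1, 1, 0]])
good = softmax_kl_loss(np.array([[5.0, 5.0, -5.0]]), labels)
bad = softmax_kl_loss(np.array([[-5.0, -5.0, 5.0]]), labels)
```

Up to a constant that does not depend on the network, this quantity equals the cross entropy between the normalized labels and the softmax outputs.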
The second loss function was the pairwise ranking loss [42], which penalized negative
labels that scored higher than positive ones:
\[
E = \sum_{i} \sum_{j \in C_i^+} \sum_{k \in C_i^-} \max\left( 0, 1 - c_{ij} + c_{ik} \right), \tag{3.13}
\]
but they argued that (3.13) did not directly optimize the top-k accuracy, which was crucial
in the MLL setting, and that it was therefore suboptimal.
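The per-sample part of the pairwise ranking loss (3.13) is a hinge over every positive-negative pair of scores. A minimal sketch follows, with a hypothetical function name and toy scores.

```python
import numpy as np

def pairwise_ranking_loss(c, y, margin=1.0):
    """Hinge ranking loss of (3.13) for one sample.

    c: label scores, shape (K,); y: binary label vector, shape (K,).
    """
    pos = c[y == 1]
    neg = c[y == 0]
    # margin - c_ij + c_ik for every positive j and negative k, clipped at zero
    return float(np.maximum(0.0, margin - pos[:, None] + neg[None, :]).sum())

y = np.array([1, 0, 1, 0])
separated = pairwise_ranking_loss(np.array([2.0, 0.0, 3.0, -1.0]), y)
violating = pairwise_ranking_loss(np.array([0.5, 0.0, 3.0, -1.0]), y)
```

Only pairs whose score gap is smaller than the margin contribute, so every violating pair is weighted equally regardless of where the positive label sits in the ranking.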
The last loss function they considered was the weighted approximate ranking (WARP) loss,
first described in [80]:
\[
E = \sum_{i} \sum_{j \in C_i^+} \sum_{k \in C_i^-} L(r_j) \max\left( 0, 1 - c_{ij} + c_{ik} \right), \tag{3.14}
\]
where $r_j$ is the rank of the $j$-th label of image $i$ and $L(\cdot)$ is a weighting function that
increases with the rank, so that positive labels that were ranked low received a large
penalty and were pushed to the top. The rank $r_j$ of a label was estimated via sampling.
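A sampled version of the WARP loss can be sketched as follows. Several choices here are assumptions rather than details from the text: the weighting $L(r) = \sum_{t=1}^{r} 1/t$, the rank estimate $\lfloor (K-1)/s \rfloor$ from the number of trials $s$ needed to find a violating negative, and the use of one sampled negative per positive label instead of the full sum over $k$. All names are hypothetical.

```python
import numpy as np

def warp_weight(rank):
    """L(r) = sum_{t=1}^{r} 1/t, a common increasing weighting (assumed here)."""
    return float(np.sum(1.0 / np.arange(1, rank + 1)))

def warp_loss_sample(c, y, rng, margin=1.0):
    """Sampled WARP loss for one image with scores c and binary labels y."""
    K = len(c)
    neg_idx = np.flatnonzero(y == 0)
    loss = 0.0
    for j in np.flatnonzero(y == 1):
        # Draw negatives in random order until one violates the margin.
        for trials, k in enumerate(rng.permutation(neg_idx), start=1):
            violation = margin - c[j] + c[k]
            if violation > 0:
                rank = (K - 1) // trials      # few trials needed means a poorly ranked positive
                loss += warp_weight(rank) * violation
                break
    return loss

rng = np.random.default_rng(0)
no_violation = warp_loss_sample(np.array([5.0, 0.0, 6.0, -1.0]), np.array([1, 0, 1, 0]), rng)
all_violate = warp_loss_sample(np.zeros(4), np.array([1, 0, 0, 0]), rng)
```

When a violating negative is found on the first draw, the positive label is likely ranked low, so the estimated rank and hence the penalty weight are large.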
3.2.3 Multi-attribute Learning
In general, the multiple labels of data samples in MLL are non-mutually-exclusive attributes,
for example, an image of a tree can be labeled as tree, plant, green, etc. simultaneously. This
setting can be referred to as multi-attribute learning. Multi-attribute learning is closely related
to the ranking problem, as we generally predict the probability of a set of attributes, high and
low; and unlike the classification problem, we not only pick the one with highest probability,
but we normally consider all attributes exceeding a set threshold as positive.
Different attributes are often correlated. For example, attributes can be hierarchical,
or have a high chance to co-occur, or the other way around. Che et al. [10] proposed a graph
Laplacian prior to enforce label correlation. Let $\beta_i \in \mathbb{R}^k$ be the edge weights of unit $i$ in
the last but one layer to the output layer, referred to as the output weights, and let
$A \in \mathbb{R}^{k \times k}$ be the similarity matrix of the $k$ labels. The Laplacian matrix $L = C - A$, with
$C = \operatorname{diag}\big( \sum_j A_{ij} \big)$, has the property
\[
\operatorname{tr}(\beta^\top L \beta) = \frac{1}{2} \sum_{i,j} A_{ij} \| \beta_i - \beta_j \|_2^2, \tag{3.15}
\]
so that the regularized loss function can be written as
\[
\mathcal{L} = \sum_{i} E_i + \lambda R(\Theta) + \frac{\rho}{2} \operatorname{tr}(\beta^\top L \beta), \tag{3.16}
\]
where $E_i$ is a per-sample loss term, $\Theta$ denotes the model parameters, $R(\cdot)$ is a norm, and
the Laplacian regularizer encourages closely related labels to have similar predictions. The
similarity matrix $A$ can be either structure-based (e.g. graph adjacency matrix) or
data-driven (e.g. label co-occurrence matrix).
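The identity (3.15) is easy to check numerically. In the sketch below, `beta` is stored with one row per label so that row differences pair up with the entries of the label similarity matrix `A`; this layout, the symmetry of `A`, and all names are assumptions of the example.

```python
import numpy as np

def laplacian_penalty(beta, A):
    """tr(beta^T L beta) with L = C - A and C the diagonal matrix of row sums of A."""
    C = np.diag(A.sum(axis=1))
    L = C - A
    return float(np.trace(beta.T @ L @ beta))

def pairwise_penalty(beta, A):
    """0.5 * sum_ij A_ij * ||beta_i - beta_j||^2, the right-hand side of (3.15)."""
    diff = beta[:, None, :] - beta[None, :, :]       # all pairwise row differences
    return float(0.5 * (A * (diff ** 2).sum(axis=2)).sum())

rng = np.random.default_rng(0)
k, d = 4, 3
S = rng.random((k, k))
A = (S + S.T) / 2.0                  # symmetric label similarity matrix
np.fill_diagonal(A, 0.0)
beta = rng.normal(size=(k, d))       # row i: weights feeding label i (assumed layout)
lhs = laplacian_penalty(beta, A)
rhs = pairwise_penalty(beta, A)
```

The pairwise form makes the effect of the regularizer explicit: labels with a large similarity $A_{ij}$ are penalized for having distant weight vectors.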
Shankar et al. introduced a weakly supervised MLL problem and proposed deep-carving to
train CNNs for this setting [66]. Under weak supervision, all data instances may have
multiple labels as in MLL, but only one is known as a ground truth label and all others are
missing. They observed that deep CNNs were able to generate disentangled feature maps even
in a weakly supervised scenario, but were confused by the missing labels, which later
disrupted convergence. Using AlexNet as the base network architecture, they proposed a
deep-carving algorithm that updates the pseudo-labels used for training to reflect the
similarities of the convolutional feature maps between single samples and per-class
averages. The CNN was then able to carve itself iteratively for MLL.
3.2.4 Regional Object Detection
Regional object detection is a special type of MLL problem on images. Natural images often
contain multiple salient objects, and to detect them all accurately involves 2 different problems:
localization and classification. Prior to 2014, the mainstream methods were building complex
ensemble systems of low level features and recognition performance was stagnant for a few
years.
With the new developments of CNN based methods, Girshick et al. devised regional
CNN (R-CNN) [23] to address the multiple object detection problem. R-CNN took the possible
regions containing object instances generated by region proposal methods (e.g. [75]) and passed
the image patches to a fine-tuned CNN, and finally used SVM to classify CNN features. To
train R-CNN, ground truth bounding boxes were required to filter region proposals, and the
CNN was first pre-trained on ImageNet and then fine-tuned on the dedicated dataset.
Fast R-CNN [22] replaced the CNN-SVM pipeline of the original R-CNN with an end-to-end
architecture, with the help of a novel RoI max-pooling layer. The RoI pooling layer replaced
the last max-pooling layer of an existing CNN architecture, e.g. VGG16, and accepted the
convolutional feature maps produced by the preceding layers. It performed max-pooling on the
RoI window into a grid of the same size as the original max-pooling layer output, so that
the resulting feature maps could proceed through the rest of the CNN. The network ended with
two output layers: softmax probabilities and a per-class bounding box regressor. Fast R-CNN
got rid of the multi-stage CNN+SVM training, but still required ground truth bounding boxes
as well as an external region proposal method for inference.
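The RoI max-pooling operation can be illustrated with a small NumPy function that divides a window of the feature map into a fixed grid and takes the maximum in each cell. This is a simplified sketch: the integer bin boundaries, the RoI format and the function name are assumptions of the example and do not reproduce the exact quantization of [22].

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a (C, H, W) feature map into a fixed (C, out_h, out_w) grid.

    roi = (y0, x0, y1, x1) in feature-map coordinates, end-exclusive, non-empty.
    """
    y0, x0, y1, x1 = roi
    region = feature_map[:, y0:y1, x0:x1]
    C, h, w = region.shape
    ys = np.linspace(0, h, out_h + 1).astype(int)   # row bin edges
    xs = np.linspace(0, w, out_w + 1).astype(int)   # column bin edges
    out = np.empty((C, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Keep at least one cell per bin, even when the RoI is smaller than the grid.
            cell = region[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                             xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))
    return out

fmap = np.arange(16.0).reshape(1, 4, 4)
pooled = roi_max_pool(fmap, (0, 0, 4, 4), 2, 2)
tiny = roi_max_pool(fmap, (1, 1, 2, 2), 2, 2)
```

Whatever the size of the RoI, the output grid has a fixed shape, which is what lets the pooled features feed the fully connected layers that follow.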
Faster R-CNN [62] further sped up the region proposal process by jointly training a region
proposal network (RPN) and an object detection network that shared a large portion of the
CNN. After the last shared convolutional layer, a small network slid over the feature map
and split into a classifier and a bounding box regressor, as in Fast R-CNN. Region proposals
of different shapes and sizes were generated and run through this sliding window network,
enabling nearly cost-free region proposals during inference.
Wei et al. proposed Hypotheses-CNN-Pooling (HCP) [79], which eliminated the need for
ground truth bounding boxes. HCP was based on two key assumptions: (a) each region proposal,
referred to as a hypothesis, contained at most one object, and (b) all possible objects were
covered by some hypotheses. For a group of hypotheses, they avoided the bounding box
evaluation by introducing cross-hypothesis max-pooling. Every hypothesis $h_i$, valid or not,
was fed into a shared CNN to produce an output vector $v_i$, and the following fusion layer
computed
\[
v^{(j)} = \max_i v_i^{(j)}, \tag{3.17}
\]
where $v^{(j)}$ is the $j$-th component of the fused output. Cross-hypothesis max-pooling
effectively suppressed noisy hypotheses and preserved the high responses from valid
hypotheses, making it unnecessary to determine the bounding boxes of objects.
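Cross-hypothesis max-pooling (3.17) amounts to a maximum over the hypothesis axis. In the sketch below each row holds the output vector of one hypothesis; the function name and the toy scores are hypothetical.

```python
import numpy as np

def cross_hypothesis_max_pool(V):
    """Fuse per-hypothesis outputs: V has shape (num_hypotheses, num_classes)."""
    return V.max(axis=0)

V = np.array([[0.10, 0.90, 0.00],    # hypothesis covering an object of class 1
              [0.80, 0.20, 0.10],    # hypothesis covering an object of class 0
              [0.05, 0.10, 0.05]])   # noisy hypothesis with no clear object
fused = cross_hypothesis_max_pool(V)
```

The noisy third hypothesis never wins a component, so it has no influence on the fused prediction.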
Chapter 4
Applications
In this chapter we first introduce three application scenarios of deep MVL and MLL, and then
briefly present our preliminary work on x-ray scattering image classification.
4.1 Large Scale Item Categorization in e-Commerce
E-commerce websites like eBay and Amazon attract heavy traffic and large numbers of
transactions, driven by the development of web and mobile technologies. Many new items are
registered online every day, represented by metadata such as title, category, image, price,
etc., most of which are entered manually by human sellers. Automatic item categorization
attempts to infer the category of items based on the metadata provided. The categorization
problem is difficult because of noisy information in the text data, as well as the long-tail
data distribution, which makes the data extremely imbalanced against the smaller categories.
Ha et al. proposed the deep categorization network (DeepCN) [30], an end-to-end multiple
RNN system for large scale item categorization. They denoted an item $d$ as a record with a
category label $y$ and an attribute vector $x$ of 6 attributes:
\[
d = \{x, y\} = \{x^{(1)}, x^{(2)}, \ldots, x^{(6)}, y\}, \tag{4.1}
\]
where the 6 attributes $x^{(i)}$ are item name, brand name, high-level category, shopping mall ID,
manufacturer and image signature respectively; $x^{(1)}$ to $x^{(5)}$ are word sequences and $x^{(6)}$ is
an image descriptor of color and edge patterns.
Figure 4.1: Structure of DeepCN [30]. Diagram stages: Metadata Input, Attribute RNN, Concatenation, Fully Connected Layer, SoftMax Output.
DeepCN consisted of multiple attribute specific RNNs, fully connected layers and
one softmax output layer (see Figure 4.1). Each RNN processed one metadata attribute and
generated a semantic vector, and then all output vectors from RNNs were concatenated into
one and went through the fully connected layers. The output layer took the softmax of the last
fully connected layer to represent the probability of the categories. DeepCN could be trained
with a normal backpropagation in the top layers, and then BPTT for the attribute RNNs.
4.2 Deep Semantic Ranking Based Hashing
Representing images efficiently is crucial for content-based image retrieval. Binary hashing
is a popular representation method because it is straightforward to compute and store.
However, traditional hashing schemes that are data-independent or based on distance metrics
fail to capture semantic (dis)similarity among images. Therefore learning based hashing,
especially deep learning based hashing, is highly desirable.
Formally, a hash function $h : \mathbb{R}^D \to \{-1, 1\}$ maps a $D$-dimensional input onto a binary
code. For learning based image hashing, the objective is to learn a set of hash functions
$h(x) = [h_1(x), h_2(x), \ldots, h_K(x)]$ that preserve the semantic structures.
Figure 4.2: Structure of deep hash function [86]. Diagram labels: 5 convolutional layers, Fully connected Layer (FCa), Fully connected Layer (FCb), Hash Layer.
Zhao et al. devised a hashing scheme that used deep semantic features for multi-label
retrieval [86]. They proposed a deep hash function containing the 5 convolutional layers as in
AlexNet, followed by two fully connected layers (FCa, FCb) and a hash layer (see Figure 4.2).
Both FCa and FCb had direct connections to the hash layer, since they argued the features from
FCb alone had strong invariance and were not sensitive enough to semantic discrepancies. The
deep hash function was defined as
\[
h(x; w) = \operatorname{sign}\left( w^\top [f_a(x); f_b(x)] \right), \tag{4.2}
\]
where $w$ denotes the weights in the hash layer, and $f_a(\cdot)$ and $f_b(\cdot)$ are the output vectors of
FCa and FCb respectively.
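The hash layer (4.2) and the Hamming distance between the resulting codes can be sketched as follows. The weight matrix `W` stacks one column per hash bit, ties at zero are mapped to +1, and the feature vectors are toy values; all of these, along with the function names, are assumptions of this example.

```python
import numpy as np

def deep_hash(fa, fb, W):
    """K-bit code sign(W^T [fa; fb]) from the concatenated FCa and FCb outputs.

    W has shape (len(fa) + len(fb), K); zero responses map to +1 (assumed convention).
    """
    feats = np.concatenate([fa, fb])
    return np.where(W.T @ feats >= 0, 1, -1)

def hamming_distance(h1, h2):
    """Number of positions in which two codes differ."""
    return int(np.sum(h1 != h2))

W = np.array([[1.0, 0.0, -1.0, 1.0],
              [0.0, 1.0, 1.0, 1.0],
              [2.0, -2.0, 0.0, 0.0]])
code_a = deep_hash(np.array([1.0, -1.0]), np.array([0.5]), W)
code_b = deep_hash(np.array([-1.0, 1.0]), np.array([0.5]), W)
dist = hamming_distance(code_a, code_b)
```

For codes in {-1, 1}, the Hamming distance equals (K minus the inner product of the codes) divided by 2, which is why it can be computed with cheap bit operations at retrieval time.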
Given a training set $D$, the objective of supervised learning here was to make the hash
code similarity, measured by Hamming distance, consistent with the multi-label similarity
among images, i.e. how many labels were identical or different. For each query $q$, denote
$\{x_n\}_{n=1}^{N}$ as a ranked list of samples with decreasing multi-label similarity to the query
$q$. The loss function could be expressed as
\[
\mathcal{F}(W) = \sum_{q \in D,\; \{x_i\}_{i=1}^{M} \subset D}
L_\omega\left( h(q; W), \{h(x_i; W)\}_{i=1}^{M} \right)
+ \frac{\alpha}{2} \left\| E_q\left( h(q; W) \right) \right\|_2^2 + \frac{\beta}{2} \| W \|_2^2, \tag{4.3}
\]
where $W$ denotes the model parameters and $L_\omega$ is a surrogate loss of Hamming distances
between hash codes weighted by similarity rank.
Figure 4.3: Pipeline of FCLN [43]. Diagram labels: Image, CNN, Conv feature, Localization Layer, conv, Region proposal with scores, Region sampling, Sampling Grid, Bilinear Sampler, Regional feature, Recognition Network, LSTM; example captions: Striped gray cat, Cats watching TV.
4.3 Dense Image Captioning
Deep learning powered image understanding has made remarkable progress in two aspects:
multiple object detection in images (label density) and image captioning of high complexity
(label complexity). To unify the developments on both fronts, Johnson et al. [43] introduced
the dense captioning task to predict a set of detailed descriptions of multiple objects in an
image. They proposed a Fully Convolutional Localization Network (FCLN), as a fusion of
regional deep CNN and RNN language model, for the task.
FCLN consisted of a regional CNN based on Faster R-CNN [62] and a subsequent LSTM (see
Figure 4.3). They followed the configuration of VGG16 up to the last pooling layer, and then
candidate bounding boxes were sampled on the output feature map and passed through a
regressor. In order to have differentiable box coordinates for optimization, they replaced
the hard bounding boxes from the RoI layer in Fast R-CNN [22] with bilinear interpolation
[40]. The sampled regional features were then run through 2 fully connected layers to obtain
the final feature representations before being fed into the LSTM.
Figure 4.4: Examples of x-ray scattering image attributes. Panel titles: Halo, Ring, Peak, AgBH.
4.4 Preliminary Work: X-ray Scattering Image Classification
X-ray scattering is a powerful technique for probing the physical structure of materials at
the molecular scale and nano-scale. Modern x-ray scattering facilities can generate 50,000
to 1,000,000 images/day (1-4 TB/day), so automating the workflow is crucial for material
discovery endeavors.
A key problem in the study of x-ray scattering images is image classification. Scientists
are interested in image attributes of various aspects, which roughly involve experiments,
instrumentation, imaging, scattering features, samples, materials and specific substances.
In our experimental image dataset, all 2832 images from 13 x-ray scattering runs have been
labeled with 98 binary attributes of these 6 categories by a domain expert. Figure 4.4
provides a few examples of attributed images. A robust prediction method will help the
entire process of image inspection, exploration, etc. Previous approaches [46] used
traditional computer vision techniques for the classification task, and we attempted to
bring deep MLL into this problem.
Since the amount of experimental data is insufficient for deep neural network training,
we generated 100,000 synthetic images using simulation software to train the model. We
aggregated 15 higher-level major attributes (see Table 4.1) from the synthetic dataset that
covered typical visual patterns and physical meanings and had a sufficient number of
positive samples, among which 9 (see Table 4.2) were also present in the experimental data
and could be used to evaluate the performance on experimental data.
We preprocessed the image data with a log transform
\[
I' = \frac{\log(I + 1)}{\log(I_M + 1)}, \tag{4.4}
\]
where $I$ is the pixel intensity and $I_M$ is the maximum intensity, and trained the CNN on the
log images. We adopted the AlexNet model with a sigmoid output from the last fully connected
layer as probability predictions, used cross entropy as the loss function, and trained the
network with momentum 0.9. Performance on synthetic data and real data is listed as follows:
Table 4.1: Mean Average Precision on Synthetic Dataset
Category            mAP
Diffuse low-q       0.7823
Diffuse high-q      0.7624
Halo                0.7718
Higher orders       0.8943
Rings               0.8926
BCC                 0.0186
FCC                 0.0750
Hexagonal           0.1821
Lamellar            0.2725
Symmetry halo       0.4758
Symmetry ring       0.4055
Circular beamstop   0.3016
Wedge beamstop      0.6206
Linear beamstop     0.6402
Beam off image      0.8273
Table 4.2: Mean Average Precision on Experimental Dataset
Category            mAP
Diffuse low-q       0.5545
Diffuse high-q      0.1425
Halo                0.2414
Higher orders       0.6365
Rings               0.7485
Symmetry ring       0.0119
Circular beamstop   0.5649
Linear beamstop     0.3325
Beam off image      0.8750
We are considering several possible directions for improving the learning model: (a)
exploit the hierarchical structure of attributes, such as major visual pattern vs. style
variation and physical meaning vs. visual cue, to predict the attributes in a layered manner
instead of all at once; (b) incorporate the statistics along the radial direction to
facilitate inference; and (c) perform multi-scale classification.
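The log transform (4.4) used in preprocessing can be written as a short function. This is a minimal sketch, assuming non-negative pixel intensities and taking $I_M$ as the maximum over the single image being transformed; the function name and the handling of an all-zero image are assumptions of the example.

```python
import numpy as np

def log_transform(image):
    """Apply I' = log(I + 1) / log(I_M + 1), mapping intensities into [0, 1].

    image: array of non-negative pixel intensities; I_M is taken as its maximum (assumed).
    """
    image = np.asarray(image, dtype=np.float64)
    i_max = image.max()
    if i_max <= 0:
        return np.zeros_like(image)     # blank image: avoid dividing by log(1) = 0
    return np.log(image + 1.0) / np.log(i_max + 1.0)

raw = np.array([[0.0, 9.0],
                [99.0, 9999.0]])
scaled = log_transform(raw)
```

The transform compresses the large dynamic range of scattering intensities, so that weak features are not drowned out by a few very bright pixels.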
Chapter 5
Conclusion
In this report we introduced some deep learning models in conjunction with MVL and MLL
problems, and presented some applications of deep multi-learning. In recent years, deep
neural network models have enjoyed great success not only in their vanilla forms, but also
with heavy modifications for specific problems. Among these works, end-to-end models with
novel connections and units worked particularly well and demonstrated the versatility of
neural networks. They were easy to train and able to incorporate problem intuitions
naturally into the model.
It is worth noting that although MVL and MLL problems indicate that our understanding of
the data still plays an important role, this is not a return to the traditional approach of
defining good hand-crafted features. Deep learning methods are powerful at learning features
automatically, and what we provide is an understanding of abstract entities and relations.
In that sense, deep multi-learning enables high-level knowledge input and is one step
further ahead of generic deep models.
In the future, we plan to delve deeper into exploiting MVL and MLL structures with deep
learning. We will experiment with the following ideas for more effective training:
• Try new data augmentation techniques to improve the network performance with limited
data, and better methods to organize input modalities and features for an MVL setting;
• Devise new transform layers to carry out the computations required in the pipeline, so
that the network encompasses all model parameters and is able to perform stochastic
updates all at once;
• Apply manifold embedding methods to model data and label interactions for more robust,
structure-aware MLL.
References
[1] Understanding LSTM Networks. URL http://colah.github.io/posts/2015-08-Understanding-LSTMs/.
[2] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen,
M. Chrzanowski, A. Coates, G. Diamos, and others. Deep speech 2: End-to-end speech
recognition in english and mandarin. In Proceedings of The 33rd International Conference
on Machine Learning, 2016.
[3] J. Amores. Multiple instance classification: Review, taxonomy and comparative study.
Artificial Intelligence, 201:81--105, Aug 2013.
[4] G. Andrew, R. Arora, J. A. Bilmes, and K. Livescu. Deep Canonical Correlation Analysis.
In Proceedings of the 30th International Conference on Machine Learning, pages
1247--1255, 2013.
[5] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient
descent is difficult. IEEE Transactions on Neural Networks, 5(2):157--166, Mar 1994.
[6] Y. Bengio, A. Courville, and P. Vincent. Representation Learning: A Review and New
Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):
1798--1828, Aug 2013.
[7] S. R. Bulo, L. Porzi, and P. Kontschieder. Dropout distillation. In Proceedings of The
33rd International Conference on Machine Learning, pages 99--107, 2016.
[8] C. Calauzenes, N. Usunier, and P. Gallinari. On the (non-) existence of convex, calibrated
surrogate losses for ranking. In Neural Information Processing Systems, pages 197--205.
Curran Associates, Inc., 2012.
[9] S. Chang, W. Han, J. Tang, G.-J. Qi, C. C. Aggarwal, and T. S. Huang. Heterogeneous
network embedding via deep architectures. In Proceedings of the 21th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pages 119--128.
ACM, 2015.
[10] Z. Che, D. Kale, W. Li, M. T. Bahadori, and Y. Liu. Deep Computational Phenotyping.
In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, KDD ’15, pages 507--516, New York, NY, USA, 2015.
ACM.
[11] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and
Y. Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical
Machine Translation. In EMNLP 2014, 2014.
[12] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for
image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE
Conference on, pages 3642--3649. IEEE, 2012.
[13] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber.
High-performance neural networks for visual object classification. In IJCAI 2011.
[14] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273--297,
1995.
[15] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In
2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’05), volume 1, pages 886--893 vol. 1, Jun 2005.
[16] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan,
K. Saenko, and T. Darrell. Long-Term Recurrent Convolutional Networks for Visual
Recognition and Description. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 2625--2634, 2015.
[17] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning
and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121--2159,
2011.
[18] A. M. Elkahky, Y. Song, and X. He. A multi-view deep learning approach for cross
domain user modeling in recommendation systems. In Proceedings of the 24th International
Conference on World Wide Web, pages 278--288. ACM, 2015.
[19] A. Evgeniou and M. Pontil. Multi-task feature learning. In Advances in neural information
processing systems, volume 19, page 41, 2007.
[20] W. Gao and Z.-H. Zhou. On the Consistency of Multi-Label Learning. In COLT,
volume 19, pages 341--358, 2011.
[21] F. A. Gers and J. Schmidhuber. Recurrent nets that time and count. In IJCNN 2000,
Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks,
2000, volume 3, pages 189--194 vol.3, 2000.
[22] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on
Computer Vision, pages 1440--1448, 2015.
[23] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich Feature Hierarchies for Accurate
Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 580--587, 2014.
[24] X. Glorot, A. Bordes, and Y. Bengio. Deep Sparse Rectifier Neural Networks. In Aistats,
volume 15, page 275, 2011.
[25] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe. Deep Convolutional Ranking for
Multilabel Image Annotation. In ICLR 2014.
[26] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. C. Courville, and Y. Bengio. Maxout
networks. In Proceedings of the 30th International Conference on Machine Learning,
volume 28, pages 1319--1327, 2013.
[27] K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber. LSTM: A
Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems,
2016.
[28] A. Grover, A. Kapoor, and E. Horvitz. A Deep Hybrid Model for Weather Forecasting.
In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, KDD ’15, pages 379--386, New York, NY, USA, 2015.
ACM.
[29] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. TagProp: Discriminative metric
learning in nearest neighbor models for image auto-annotation. In 2009 IEEE 12th
International Conference on Computer Vision, pages 309--316, Sep 2009.
[30] J.-W. Ha, H. Pyo, and J. Kim. Large-Scale Item Categorization in e-Commerce Using
Multiple Recurrent Neural Networks. In Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pages 107--115. ACM,
2016.
[31] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
CVPR 2016.
[32] G. Hinton. Deep belief networks. Scholarpedia, 4(5):5947, 2009. URL
http://www.scholarpedia.org/article/Deep_belief_networks.
[33] G. E. Hinton. Deep belief nets, 2007. URL
http://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.pdf.
[34] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural
networks. Science, 313(5786):504--507, 2006.
[35] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets.
Neural computation, 18(7):1527--1554, 2006.
[36] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):
1735--1780, 1997.
[37] H. Hotelling. Relations Between Two Sets of Variates. Biometrika, 28(3/4):321--377,
1936.
[38] Y. Huang, W. Wang, L. Wang, and T. Tan. Multi-task deep neural network for multi-label
learning. In 2013 IEEE International Conference on Image Processing, pages 2897--2900.
IEEE, 2013.
[39] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In Proceedings of the 32nd International Conference on
Machine Learning, 2015.
[40] M. Jaderberg, K. Simonyan, A. Zisserman, and others. Spatial transformer networks.
In Advances in Neural Information Processing Systems, pages 2017--2025, 2015. URL
http://papers.nips.cc/paper/5854-spatial-transformer-networks.
[41] J. Jiang. A literature survey on domain adaptation of statistical classifiers.
Technical report, 2008. URL
http://sifaka.cs.uiuc.edu/jiang4/domain_adaptation/survey/da_survey.pdf.
[42] T. Joachims. Optimizing Search Engines Using Clickthrough Data. In Proceedings of
the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, KDD ’02, pages 133--142, New York, NY, USA, 2002. ACM.
[43] J. Johnson, A. Karpathy, and L. Fei-Fei. Densecap: Fully convolutional localization
networks for dense captioning. In CVPR 2016.
[44] R. Jozefowicz, W. Zaremba, and I. Sutskever. An Empirical Exploration of Recurrent
Network Architectures. In Proceedings of The 32nd International Conference on Machine
Learning, pages 2342--2350, 2015.
[45] Y. Kang, S. Kim, and S. Choi. Deep learning to hash with multiple representations. In
2012 IEEE 12th International Conference on Data Mining, pages 930--935. IEEE, 2012.
[46] M. H. Kiapour, K. Yager, A. C. Berg, and T. L. Berg. Materials discovery: Fine-grained
classification of X-ray scattering images. In IEEE Winter Conference on Applications of
Computer Vision, pages 933--940. IEEE, 2014.
[47] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying Visual-Semantic Embeddings
with Multimodal Neural Language Models. In TACL 2015.
[48] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images.
Technical report, 2009.
[49] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep
Convolutional Neural Networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q.
Weinberger, editors, Advances in Neural Information Processing Systems 25, pages
1097--1105, 2012.
[50] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D.
Jackel. Handwritten digit recognition with a back-propagation network. In Advances in
neural information processing systems, 1990.
[51] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11):2278--2324, 1998.
[52] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436--444, May
2015.
[53] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural
networks: Tricks of the trade, pages 9--48. Springer, 2012.
[54] H. Li, W. Ouyang, and X. Wang. Multi-Bias Non-linear Activation in Deep Neural
Networks. In Proceedings of The 33rd International Conference on Machine Learning,
2016.
[55] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR 2014.
[56] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of
the Seventh IEEE International Conference on Computer Vision, volume 2, pages
1150--1157, 1999.
[57] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Deep Captioning with
Multimodal Recurrent Neural Networks (m-RNN). In ICLR 2015.
[58] J. Nam, J. Kim, E. L. Mencía, I. Gurevych, and J. Fürnkranz. Large-scale multi-label text
classification—revisiting neural networks. In Joint European Conference on Machine
Learning and Knowledge Discovery in Databases, pages 437--452. Springer, 2014.
[59] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning.
In Proceedings of the 28th international conference on machine learning (ICML-11),
pages 689--696, 2011.
[60] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of
the spatial envelope. International journal of computer vision, 42(3):145--175, 2001.
[61] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural
networks. In Proceedings of the 30th International Conference on Machine Learning,
volume 28, pages 1310--1318, 2013.
[62] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object
Detection with Region Proposal Networks. In C. Cortes, N. D. Lawrence, D. D. Lee,
M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing
Systems 28, pages 91--99, 2015.
[63] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-
propagating errors. Nature, 323:533--536, 1986.
[64] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual
Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211--
252, 2015.
[65] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE
Transactions on Signal Processing, 45(11):2673--2681, 1997.
[66] S. Shankar, V. K. Garg, and R. Cipolla. Deep-carving: Discovering visual attributes by
carving deep neural nets. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 3403--3412, 2015.
[67] H. Shimodaira. Improving predictive inference under covariate shift by weighting the
log-likelihood function. Journal of statistical planning and inference, 90(2):227--244,
2000.
[68] P. Y. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural
networks applied to visual document analysis. In ICDAR, volume 3, pages 958--962,
2003.
[69] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
recognition. In ICLR 2015.
[70] M. S. Sorower. A literature survey on algorithms for multi-label learning. Technical
report, 2010. URL
http://people.oregonstate.edu/~sorowerm/pdf/Qual-Multilabel-Shahed-CompleteVersion.pdf.
[71] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity:
The all convolutional net. In ICLR 2015 Workshop.
[72] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout:
a simple way to prevent neural networks from overfitting. Journal of Machine Learning
Research, 15(1):1929--1958, 2014.
[73] S. Sun. A survey of multi-view machine learning. Neural Computing and Applications,
23(7-8):2031--2038, Feb 2013.
[74] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 1--9, 2015.
[75] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for
object recognition. International journal of computer vision, 104(2):154--171, 2013.
[76] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption
generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 3156--3164, 2015.
[77] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. Regularization of neural networks
using dropconnect. In Proceedings of the 30th International Conference on Machine
Learning (ICML-13), pages 1058--1066, 2013.
[78] W. Wang, R. Arora, K. Livescu, and J. Bilmes. On deep multi-view representation
learning. In Proc. of the 32nd Int. Conf. Machine Learning (ICML 2015), pages 1083--
1092, 2015.
[79] Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan. HCP: A Flexible
CNN Framework for Multi-Label Image Classification. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 38(9):1901--1907, Sep 2016.
[80] J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling Up To Large Vocabulary
Image Annotation. In Proceedings of the International Joint Conference on Artificial
Intelligence, IJCAI, 2011.
[81] S. Wiesler and H. Ney. A convergence analysis of log-linear training. In Advances in
Neural Information Processing Systems, pages 657--665, 2011.
[82] H. Wu, J. Wang, and X. Zhang. Combining hidden Markov model and fuzzy neural
network for continuous recognition of complex dynamic gestures. The Visual Computer,
pages 1--14, Aug 2015.
[83] C. Xu, D. Tao, and C. Xu. A Survey on Multi-view Learning. arXiv:1304.5634 [cs], Apr
2013.
[84] M.-L. Zhang and Z.-H. Zhou. Multilabel Neural Networks with Applications to
Functional Genomics and Text Categorization. IEEE Transactions on Knowledge and Data
Engineering, 18(10):1338--1351, Oct 2006.
[85] M.-L. Zhang and Z.-H. Zhou. ML-KNN: A lazy learning approach to multi-label learning.
Pattern Recognition, 40(7):2038--2048, Jul 2007.
[86] F. Zhao, Y. Huang, L. Wang, and T. Tan. Deep semantic ranking based hashing for
multi-label image retrieval. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 1556--1564, 2015.