Multi-view, Multi-label Learning with Deep Neural Networks

Ziqiao Guan1

Research Proficiency Exam

September 2016

Advisor: Hong Qin1, Professor

Committee: Minh Hoai Nguyen1, Assistant Professor; Dimitris Samaras1, Associate Professor; Dantong Yu2,3, Associate Professor

1 Department of Computer Science, Stony Brook University
2 Martin Tuchman School of Management, New Jersey Institute of Technology
3 Brookhaven National Laboratory

Abstract

Deep learning is a popular technique in modern online and offline services. Deep neural

network based learning systems have made groundbreaking progress in model size, training

and inference speed, and expressive power in recent years, but tailoring a model to specific
problems and exploiting data and problem structures is still an ongoing research topic. We look

into two types of deep ‘‘multi-’’ objective learning problems: multi-view learning, referring

to learning from data represented by multiple distinct feature sets, and multi-label learning,

referring to learning from data instances belonging to multiple class labels that are not mutually

exclusive. Research on both problems builds on existing successful deep
architectures and modifies layers or regularization terms, or even builds hybrid systems, to
meet the problem constraints.

In this report we first explain the original artificial neural network (ANN) with the

backpropagation learning algorithm, and also its deep variants, e.g. deep belief network (DBN),

convolutional neural network (CNN) and recurrent neural network (RNN). Next we present a

survey of some multi-view and multi-label learning frameworks based on deep neural networks.

Finally, we introduce some applications of deep multi-view and multi-label learning, including

e-commerce item categorization, deep semantic hashing, dense image captioning, and our

preliminary work on x-ray scattering image classification.

Contents

1 Introduction
2 Deep Neural Network
    2.1 Artificial Neural Network
    2.2 Deep Belief Network
        2.2.1 Structure
        2.2.2 Learning
        2.2.3 DBN and Deep Auto-encoder
    2.3 Convolutional Neural Network
        2.3.1 Structure
        2.3.2 Normalizations and Overfitting Prevention
        2.3.3 Notable Architectures
    2.4 Recurrent Neural Network
        2.4.1 Structure
        2.4.2 Learning
        2.4.3 Long Short Term Memory and Gated Recurrent Unit
3 Multi-view and Multi-label Learning
    3.1 Multi-view Learning
        3.1.1 Deep Multi-view Representation Learning
        3.1.2 Network Co-training
    3.2 Multi-label Learning
        3.2.1 Going Deep with Multi-label Learning
        3.2.2 Multi-label Ranking Loss
        3.2.3 Multi-attribute Learning
        3.2.4 Regional Object Detection
4 Applications
    4.1 Large Scale Item Categorization in e-Commerce
    4.2 Deep Semantic Ranking Based Hashing
    4.3 Dense Image Captioning
    4.4 Preliminary Work: X-ray Scattering Image Classification
5 Conclusion

Chapter 1

Introduction

Machine learning technology drives a vast number of today's online and offline services, from
web search to face recognition on mobile phones. The modern Internet has made massive
volumes of user-generated data available, greatly benefiting the development of learning methods
and techniques in recent years, which in turn have found their way into consumer electronic
devices such as cameras and smartphones. In general, machine learning systems tackle
tasks such as identifying objects in images, transcribing speech into text, and retrieving web
pages, products and knowledge that match users' interests [52]. In many domain-specific applications,
such as x-ray image categorization [46], weather prediction [28] and electronic medical record
monitoring [10], data analysis tasks also call for effective and scalable machine learning
systems. The core of these versatile applications is to seek a robust representation of rich and
large amounts of data, in which properties and features of interest become easily discernible and
interpretable; this is a nontrivial task.

Classic machine learning techniques had many limitations when dealing with the intricacies
of natural data. Traditionally, learning models relied heavily on hand-crafted features, i.e.,
a designed set of transforms that convert raw data to a representation suitable for analysis tasks,
which required elaborate design and often profound domain knowledge. In computer vision, for
example, conventional object detection methods extract image features with descriptors such
as SIFT [56] or HOG [15], and then use classifiers like the support vector machine (SVM) [14]
to classify the resulting feature vectors. Fixed hand-crafted features of this type usually do
not adapt well to datasets with different statistics, or require nontrivial transform
tricks for specific applications. Recently, deep learning, or more specifically deep neural
networks, has brought a paradigm shift in the search for good representations and achieved
a great performance boost in machine learning problems across different areas.

Deep learning methods are essentially representation learning methods [6] that replace
the hand-crafted features of traditional approaches with layered
features discovered by general-purpose learning algorithms. Starting with the raw input, deep
learning methods construct multiple layers of nonlinear operations, each of which acts on the representation
from the previous layer and passes its response to the next, deeper layer. The purpose of stacking many
layers of transforms is to approximate representation mappings of high
complexity. We expect the stacked structure to reflect levels of abstraction in understanding
data, so that deeper layers capture abstract concepts and are robust to variations that appear
dramatic from the perspective of the raw input. For example, in image classification, the first
layer should pick up low-level visual features such as edges, the second layer should handle the
arrangements of edge features, and deeper layers are supposed to focus on larger and more
complete elements that make up specific shapes.

Due to the high degrees of freedom of deep neural network structures, training a deep
network model effectively poses many new challenges. On one hand, training
a complex network requires a huge amount of data and is therefore computationally expensive; on the
other hand, deep networks tend to amplify small changes of responses through the layers, making
the system highly sensitive to specific settings. To address the numerical concerns, researchers have
introduced techniques such as normalization, data augmentation and dropout [72] to speed up
training and improve robustness. To adapt to specific problems, numerous works
have changed the network layers, proposed novel objectives and regularization
terms, and built more complex hybrid systems.

As deep learning achieves great success in generic classification and recognition
problems, more researchers have turned to exploiting the structure of the data and the problem in
the learning process, which gives rise to multi-view and multi-label learning problems. Multi-view
data is data with diverse views from multiple sources or feature subsets, e.g. text data
with accompanying images. Joint training methods take the consistency and complementarity
of different views into consideration for more effective and robust learning. Also, data
may come with multiple correlated attributes, as opposed to a standard N-way classification
problem where the labels are mutually exclusive. By explicitly preserving those correlations,
for example via manifold embedding, the trained models may better capture the patterns of the
data. Multi-view and multi-label learning problems call for different modifications of the neural
network structure to fit specific applications, which shows the versatility of deep neural
networks.

In this report, we first explain the basics of the artificial neural network (ANN) with
the backpropagation training algorithm, as well as several popular variations of deep neural
networks, i.e. deep belief network (DBN), convolutional neural network (CNN) and recurrent
neural network (RNN). Next we present a survey of some multi-view and multi-label adaptations
of deep neural networks. Finally, we introduce some applications of deep multi-view and multi-label learning
and our preliminary work on x-ray scattering image classification.

Chapter 2

Deep Neural Network

The popular deep neural networks of today are a recent development of the original artificial
neural network, which was largely superseded in the 1990s by the support vector machine [14] because of
the latter's simplicity and effectiveness. In this chapter, we first introduce the ANN and the backpropagation
training algorithm, and then introduce some of its deep variations, i.e. DBN, CNN and RNN,
and the reasons for their recent success.

2.1 Artificial Neural Network

An artificial neural network is a network inspired by the biological neural network for machine

learning tasks. The earliest ANN is a multilayer perceptron (MLP), which consists of multiple

layers of nodes in a directed graph, and each layer is fully connected to the next one (see

Figure 2.1). Each node except the input nodes is a neuron with a nonlinear activation function.

As a multilayer generalization of the linear perceptron, an MLP is trained via supervised learning
to model and classify data that is not linearly separable.

Figure 2.1: Structure of an artificial neural network, with input, hidden and output layers.

Backpropagation [63] is a learning algorithm for ANNs. The idea is to look for
the minimum of some error function in the weight space via gradient descent. Given a
feed-forward network and a training set $\{(x_1, t_1), \ldots, (x_p, t_p)\}$, where $x_i$ is a data instance and
$t_i$ is its label vector, we generally want to produce an output $o_i$ that is as close to the target
$t_i$ as possible by adjusting the network parameters. More specifically, we wish to
minimize an error function, say, the L2 error

\[ E = \frac{1}{2} \sum_{i=1}^{p} \| o_i - t_i \|^2. \tag{2.1} \]

In order to apply gradient descent during training, we need the activation function to be
differentiable, e.g. the sigmoid

\[ \sigma(x) = \frac{1}{1 + e^{-x}}, \tag{2.2} \]

so that $E = E(w_1, w_2, \ldots, w_\ell)$ is a differentiable function of the $\ell$ weight parameters in the
network and the gradient can be calculated as

\[ \nabla E = \left( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_\ell} \right). \tag{2.3} \]

Then each weight is updated by

\[ \Delta w_i = -\gamma \frac{\partial E}{\partial w_i}, \quad i = 1, \ldots, \ell, \tag{2.4} \]

where γ is the learning rate.

Backpropagation is a realization of the chain rule. Suppose we extend the output layer
of the ANN with an extra node that computes the error function. In the feed-forward step, we evaluate
not only the activation value at each node, but also the derivative of the activation function
at its input. To find the partial derivative of the training error $E$ with respect
to an edge weight $w_{ij}$, we apply the chain rule, which is equivalent to running the neural
network backwards: we feed the constant 1 to the error node, traverse
the network backwards through the same weights, and multiply by the stored derivatives at the nodes
along the way. The weight correction can be expressed as

\[ \Delta w_{ij} = -\gamma \, o_i \, \delta_j, \tag{2.5} \]

where $w_{ij}$ refers to the weight on the edge connecting node $i$ to node $j$, $o_i$ is the output at node
$i$, and $\delta_j$ is the backpropagated error.
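The update rules (2.1)-(2.5) can be sketched in a few lines of NumPy. This is only an illustrative sketch for a single training pair and one hidden layer; the layer sizes, the absence of bias terms, and all function names are assumptions made here, not part of the original formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(W1, W2, x):
    """Feed-forward pass; returns hidden and output activations."""
    h = sigmoid(W1 @ x)
    o = sigmoid(W2 @ h)
    return h, o

def error(W1, W2, x, t):
    """L2 error of equation (2.1) for a single training pair."""
    _, o = forward(W1, W2, x)
    return 0.5 * np.sum((o - t) ** 2)

def backprop(W1, W2, x, t):
    """Gradients of the error with respect to W1 and W2 via the chain rule."""
    h, o = forward(W1, W2, x)
    # Backpropagated error at the output nodes: dE/do times sigmoid'(net).
    delta_o = (o - t) * o * (1.0 - o)
    # Traverse backwards through the same weights W2.
    delta_h = (W2.T @ delta_o) * h * (1.0 - h)
    # dE/dw_ij = (output of source node i) * (delta of target node j).
    return np.outer(delta_h, x), np.outer(delta_o, h)

def sgd_step(W1, W2, x, t, lr=0.5):
    """One gradient descent update, Delta w = -lr * dE/dw."""
    g1, g2 = backprop(W1, W2, x, t)
    return W1 - lr * g1, W2 - lr * g2
```

Comparing `backprop` against a finite-difference estimate of `error` is a simple way to check that the chain-rule bookkeeping is right.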

In 1995, Cortes and Vapnik devised a special type of perceptron called the support vector
machine [14]. The idea of the SVM is to map input vectors to a high-dimensional feature space in which
the data becomes linearly separable, and to find the decision surface that maximizes the margin between positive and
negative samples. They showed that the separating hyperplane constructed this way has good generalization
ability. At the time, multilayer ANNs did not deliver much better performance despite their
complex structure, due to the following difficulties of backpropagation:

• It requires a large amount of labeled data;

• It takes a long time in multilevel networks;

• It can get stuck in local optima.

Figure 2.2: Structure of a deep belief network with input layer x and hidden layers h1, h2, h3. The top two layers have undirected symmetric connections between them and are essentially a Restricted Boltzmann Machine. The lower layers have top-down directed connections and form a Bayesian network.

In the 1990s, research on neural networks with multiple adaptive hidden layers was mostly
overshadowed by SVMs.

2.2 Deep Belief Network

In 2006, Hinton et al. proposed the greedy layer-wise training algorithm for DBNs [35],
which was one of the first fast and effective deep learning algorithms. Different deep neural
network structures have been proposed since then and have outperformed traditional methods like
SVMs in various learning tasks.

2.2.1 Structure

A deep belief network is a probabilistic generative model composed of multiple layers of
stochastic latent variables [32]. Hinton et al. [35] showed that a DBN can be modeled as
stacked Restricted Boltzmann Machines (RBMs) and thus trained layer-wise in a greedy manner.
The joint distribution of the observed vector $x = h^0$ and the hidden layers $h^k$ ($k = 1, \ldots, \ell$) is as
follows:

\[ P(x, h^1, \ldots, h^\ell) = \left( \prod_{k=0}^{\ell-2} P(h^k \mid h^{k+1}) \right) P(h^{\ell-1}, h^\ell), \tag{2.6} \]

where $x = h^0$, $P(h^{k-1} \mid h^k)$ is a conditional distribution for the visible units at level $k-1$
given the hidden units at level $k$, and $P(h^{\ell-1}, h^\ell)$ is the visible-hidden joint distribution in the
top-level RBM, as illustrated in Figure 2.2.

Figure 2.3: Structure of an auto-encoder, consisting of an encoder from the input to a compressed feature vector and a decoder from that vector to the output. An auto-encoder attempts to reconstruct the input through a low-dimensional bottleneck layer in the middle, which gives a compressed representation of the original input data.

2.2.2 Learning

Hinton et al. [35] proposed a greedy layer-wise unsupervised
training algorithm for DBNs. The idea was to treat each pair of adjacent levels as a separate visible-hidden RBM
and maximize the probability of its visible representation; the required gradient could be efficiently approximated
by sampling, using contrastive divergence.
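To illustrate how sampling approximates the gradient for one RBM in the stack, the following is a minimal sketch of a single contrastive divergence step with one Gibbs iteration (often written CD-1) for binary units. The function names, learning rate and batch layout are assumptions for illustration and do not reproduce the exact procedure of [35].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b_vis, b_hid, v0, rng, lr=0.1):
    """One CD-1 step on a batch v0 of binary visible vectors (one per row)."""
    # Positive phase: hidden probabilities and a sample given the data.
    ph0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step down to the visible layer and up again.
    pv1 = sigmoid(h0 @ W.T + b_vis)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b_hid)
    # Approximate gradient: data statistics minus reconstruction statistics.
    n = v0.shape[0]
    W = W + lr * (v0.T @ ph0 - v1.T @ ph1) / n
    b_vis = b_vis + lr * (v0 - v1).mean(axis=0)
    b_hid = b_hid + lr * (ph0 - ph1).mean(axis=0)
    return W, b_vis, b_hid

def hidden_representation(W, b_hid, v):
    """Deterministic up-pass; its output feeds the next RBM in the stack."""
    return sigmoid(v @ W + b_hid)
```

In a greedy layer-wise scheme, one RBM would be trained with repeated `cd1_update` calls, after which `hidden_representation` of the training data would serve as the visible data for the next RBM.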

The purpose of the unsupervised process was to learn a distribution that could generate
the given data, so labeled data was not required. It could also be adapted to a supervised
learning setting by adding the labels to the network architecture and then fine-tuning the
network with labeled data. Experiments showed that a slightly fine-tuned DBN could easily
outperform a carefully trained backpropagation net or SVM [33].

2.2.3 DBN and Deep Auto-encoder

An auto-encoder is a feed-forward neural network that directly learns a representation for a set
of data. It consists of two parts, the encoder and the decoder, where the encoder learns a
representation and the decoder tries to reconstruct the input (see Figure 2.3).

Figure 2.4: Structure of AlexNet [49]. The CNN consists of 5 convolutional layers, 3 of which are followed by max-pooling, and 3 fully connected layers. Layer sequence shown in the figure: Input, Conv1 (11×11×96×3), Maxpool1, Conv2 (5×5×96×256), Maxpool2, Conv3 (3×3×256×384), Conv4 (3×3×384×384), Conv5 (3×3×384×256), Maxpool5, FC6 (4096), FC7 (4096), FC8 (1000).

Hinton and Salakhutdinov [34] extended the idea of greedy pretraining to deep auto-encoders.
Analogous to DBN training, they proposed to greedily pretrain the weights of the

encoder side, which are initially shared by the decoder, and then fine-tune the auto-encoder

using backpropagation. They showed that pretrained networks had good generalization ability

because most of the weight information came from modeling the data.

2.3 Convolutional Neural Network

The convolutional neural network [50][51] is a type of deep feed-forward network built around a
convolution operation for processing data in the form of multi-dimensional arrays, for example,
1D for sequential or temporal signals, 2D for images and 3D for videos or volumetric images.
CNNs were used for handwritten digit recognition [50] back in the 1990s, but they were not adopted
widely until a recent breakthrough [49] in the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) [64].

2.3.1 Structure

The architecture of a typical CNN (see Figure 2.4) consists of layers of different operations,

including convolutional layers, pooling layers and fully-connected layers. Here we explain the

CNN structure with 2D image arrays as data input; there are adaptations to work with other

forms of data.

Convolutional Layer. A convolutional layer takes the feature maps from the previous
layer (the original image if it is the first layer) and performs convolution with a set of weights
called a filter bank. More precisely, let $x$ be an $M \times N \times K$ array of $M \times N$ pixels and $K$
channels, $w \in \mathbb{R}^{H \times W \times K \times K'}$ be $K'$ filters of size $H \times W$ with $K$ input channels, and $b$ be the bias
term; the convolution can then be expressed as

\[ y_{k'} = \sum_{k} w_{\cdot,\cdot,k,k'} * x_k + b_{k'}, \tag{2.7} \]

where $*$ is a 2D convolution. Convolution computes each response from a local patch,
whose pixels are usually highly correlated in terms of image statistics; it is also translation
equivariant, since the same filter window slides over the entire image, so a motif produces the same
response wherever it appears. It is therefore effective for
identifying distinctive motifs in images.

It is also worth noting that the multi-channel convolution given by (2.7) is essentially
a weighted sum of the convolutions of all the input channels. Some CNN architectures involve
$1 \times 1$ convolutions [74][31], which are degenerate as conventional 2D convolutions but
act as dimensionality reduction operations by combining the input channels.
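The multi-channel sum in (2.7), and the way a 1 × 1 filter reduces to a per-pixel mixing of channels, can be sketched as follows. The loop-based implementation, the "valid" boundary handling and the function name are assumptions chosen for readability rather than speed.

```python
import numpy as np

def conv_layer(x, w, b):
    """Multi-channel 'valid' convolution layer in the spirit of equation (2.7).

    x: (M, N, K) input, w: (H, W, K, K2) filter bank, b: (K2,) biases.
    Like most deep learning libraries, this slides the filter without
    flipping it (cross-correlation); since the weights are learned, the
    difference from a true convolution is a matter of convention.
    """
    M, N, K = x.shape
    H, W, Kw, K2 = w.shape
    assert K == Kw, "input channels must match the filter bank"
    y = np.empty((M - H + 1, N - W + 1, K2))
    for i in range(M - H + 1):
        for j in range(N - W + 1):
            patch = x[i:i + H, j:j + W, :]  # local H x W x K patch
            # Sum over the patch and all K input channels per output channel.
            y[i, j, :] = np.tensordot(patch, w, axes=3) + b
    return y
```

With a filter bank of shape `(1, 1, K, K2)`, the output at each pixel is just the input channel vector multiplied by a `K x K2` matrix, which is the channel-combining behavior described above.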

Nonlinear Activation. The local weighted sum acquired from a convolutional layer is
passed through a non-linearity that is applied element-wise, e.g. the Rectified Linear Unit (ReLU)

\[ y_{ijk} = \max\{0, x_{ijk}\}. \tag{2.8} \]

Glorot et al. [24] showed that ReLU performed better in neural networks than sigmoid or

hyperbolic tangent because it was free of the gradient saturation problem and preserved sparsity

of the signal.

Pooling Layer. Pooling operations combine a local patch into one output value as a
means of downsampling. The common practice is max-pooling, which takes the maximum value
of a patch from each channel individually and generally performs better than sum-pooling.
There are also attempts to eliminate pooling by increasing the stride of the convolutional
layers [71].
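As a small illustration of the element-wise rectifier in (2.8) followed by per-channel max-pooling, consider the sketch below. The non-overlapping window and the choice to drop incomplete border windows are simplifying assumptions; the function names are illustrative.

```python
import numpy as np

def relu(x):
    """Element-wise rectifier, equation (2.8)."""
    return np.maximum(0.0, x)

def max_pool(x, size=2):
    """Non-overlapping size x size max-pooling applied to each channel.

    x: (M, N, K) feature maps; rows and columns that do not fill a whole
    window are dropped (an assumption made here for simplicity).
    """
    M, N, K = x.shape
    m, n = M // size, N // size
    blocks = x[:m * size, :n * size, :].reshape(m, size, n, size, K)
    return blocks.max(axis=(1, 3))
```

For a 4 × 4 single-channel map, `max_pool(relu(x))` returns a 2 × 2 map in which each entry is the largest rectified value of the corresponding 2 × 2 window.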

2.3.2 Normalizations and Overfitting Prevention

There are two major difficulties in training deep CNNs: speed and
overfitting. Since training data is organized in mini-batches in stochastic gradient methods,
the network sees data of different distributions throughout training, a situation known as
covariate shift [67], which slows down convergence. Domain adaptation [41]
is usually applied to alleviate the effect. In the context of CNN learning, it has long been
known that network training converges faster if the inputs are whitened, i.e. decorrelated and
normalized to have zero means and unit variances [53][81]. However, even if the inputs are whitened,
responses in the intermediate layers still experience changes in distribution due to
the weight updates during training, which is called internal covariate shift. Earlier models attempted
to normalize the intermediate layers with local response normalization [49], but it
required manual tuning, did not adapt well to different datasets, and was dropped by later
models.

Ioffe and Szegedy proposed batch normalization (BN) [39], which learns scale and shift
parameters along with the network weights, so that the optimization is fully aware of and able
to accommodate the internal normalizations. They argued that, in order to preserve the
nonlinear expressive power of the activations, the learned transforms should still be able to
represent the identity, so that the normalization could be reverted if desired. Therefore, for each
activation $x$, BN learns a pair of parameters $\gamma$, $\beta$ such that

\[ \hat{x} = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}}, \tag{2.9} \]

\[ y = \gamma \hat{x} + \beta. \tag{2.10} \]

Setting $\gamma = \sqrt{\mathrm{Var}[x]}$ and $\beta = \mathrm{E}[x]$ recovers the original activation.
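Equations (2.9) and (2.10) amount to the following mini-batch computation. This is a sketch of the training-time forward pass only; the `eps` argument and the function name are assumptions, and the running statistics used at inference time are omitted.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=0.0):
    """Batch normalization of a mini-batch x with shape (batch, features).

    Each feature is standardized with the mini-batch mean and variance
    (equation 2.9) and then rescaled and shifted by the learned
    parameters gamma and beta (equation 2.10). A small eps > 0 is
    normally added to the variance for numerical stability.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

Calling `batch_norm(x, np.sqrt(x.var(axis=0)), x.mean(axis=0))` returns `x` itself up to rounding, which is the sense in which the learned transform can still represent the identity.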

To address the overfitting problem, the easiest and most common method is to augment
the dataset via label-preserving transforms [49][68][12][13], e.g. random translation, flipping, cropping
and pixel intensity distortion. Srivastava et al. proposed dropout [72], which randomly drops neurons
during training to prevent co-adaptation, and showed that it is efficient and improves neural
network performance. Wan et al. generalized dropout to DropConnect [77], which randomly
sets a subset of weights to zero, enabling finer control over which connections to drop. Bulo
et al. proposed dropout distillation [7] to improve the inference accuracy of a dropout network
by finding the best dropout configuration via another stochastic optimization. On the other
hand, maxout networks [26] and multi-bias non-linear activation [54] modify the
receptive behavior of the activations to generate more effective responses, thereby tackling the
overfitting issue.
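A minimal sketch of dropout on a layer's activations is given below. It uses the "inverted" variant, which rescales the surviving activations during training instead of rescaling weights at test time as in the original formulation [72]; this substitution, the function name and the argument names are assumptions made for brevity.

```python
import numpy as np

def dropout(x, p_drop, rng, training=True):
    """Inverted dropout: zero each activation with probability p_drop.

    Surviving activations are scaled by 1 / (1 - p_drop) during training
    so that the expected activation is unchanged and the layer becomes
    the identity at inference time.
    """
    if not training or p_drop == 0.0:
        return x
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)
```

DropConnect would apply the same kind of random mask to the entries of the weight matrix rather than to the activations.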

2.3.3 Notable Architectures

AlexNet, introduced by Krizhevsky et al. [49], was the first deep CNN architecture to gain huge
popularity in the computer vision community. It was one of the first ‘‘deep’’ networks to
use stacked convolutional layers and obtained a remarkable performance boost from ReLU
activations, GPU-accelerated training and dropout.

VGGNet, proposed by Simonyan and Zisserman [69], featured a deeper network of
16 or 19 convolutional/fully-connected layers and demonstrated that deeper networks did deliver
better performance. The network had a homogeneous structure with only 3 × 3 convolutions and
2 × 2 pooling operations. They also showed that fully-connected layers could be removed
without hurting network performance, so that the number of parameters could be reduced
drastically.

GoogLeNet, from Szegedy et al. [74], took inspiration from Network in Network [55]
and replaced a single convolutional layer with an Inception module (see Figure 2.5). The Inception
module attempted to pick up sparsely distributed features computed by previous filters using
convolutions of small filter size, and concatenated the responses from a few different filters to
obtain multi-scale features. Because of the small filters used, the number of parameters was
cut down from approximately 60 million in AlexNet to 4 million, despite the much more
sophisticated 22-layer structure.

Figure 2.5: The Inception module of GoogLeNet [74]. Three different sizes of convolutional filters (1×1, 3×3 and 5×5), along with a 3×3 max-pooling, are applied to the same feature maps from the previous layer to collect feature responses of different spread; the figure also shows additional 1×1 convolutions in the branches. The resulting feature maps are concatenated into a long tensor.

ResNet, proposed by He et al. [31], went one step further in deepening CNN structures
with a 152-layer network on ImageNet, even reaching 1000 layers on the CIFAR-10
dataset [48]. They pointed out that deeper networks had a degradation problem that caused higher
training error than shallower networks, and that it was not caused by overfitting. They
introduced shortcut connections (see Figure 2.6) that passed identity mappings over multiple
layers, so that the stacked layers were essentially learning the residual values, hence the
name. They showed empirically that this construction sped up convergence and improved
network performance.
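The shortcut connection can be illustrated with a deliberately simplified block. Dense layers stand in for the convolutions here, and the names `residual_block`, `W1` and `W2` are assumptions for illustration; this is not the block used in [31].

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """y = relu(F(x) + x), where F is a small two-layer transform.

    Dense layers replace the convolutions of the original block to keep
    the sketch short; the identity shortcut is the point here.
    """
    f = W2 @ relu(W1 @ x)   # residual branch F(x)
    return relu(f + x)      # shortcut adds the unchanged input
```

With all weights at zero, F(x) = 0 and a non-negative input passes through unchanged, so the layers only need to learn a correction on top of the identity.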

2.4 Recurrent Neural Network

A recurrent neural network is a type of neural network with directed cycles connecting the units.
The cycle connections serve as internal states that are updated with sequential inputs, which
resemble an internal memory and allow an RNN to model temporal behavior.

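Figure 2.6: The skip connection in ResNet [31]. Labels in the figure: input x, two conv layers computing F(x), an identity shortcut carrying x, the sum F(x) + x, and a ReLU.

Figure 2.7: An RNN and its unfolded form. Inputs x_{t-1}, x_t, x_{t+1} enter the hidden state s through U, successive states are connected through W, and outputs y_{t-1}, y_t, y_{t+1} are produced through V.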

2.4.1 Structure

A typical RNN (see Figure 2.7), once unfolded, can be regarded as a very deep feed-forward
network whose weights are shared among all hidden layers. Let $x_t$ be the input vector, $s_t$ the
hidden state with $s_{-1} = 0$, and $y_t$ the output, for $t = 0, 1, 2, \ldots$. The forward pass of a vanilla RNN
works as follows:

\[ s_t = \tanh(U x_t + W s_{t-1}), \tag{2.11} \]

\[ y_t = V s_t, \tag{2.12} \]

where $U$, $V$ and $W$ are the weight parameters of the RNN. Bias terms may also be present,
but they can be absorbed into the weight matrices, and with a slight abuse of notation, we still
refer to them as $U$, $V$ and $W$.
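The recurrence (2.11)-(2.12) translates directly into a loop over time steps. The sketch below assumes column-vector inputs stored as the rows of `xs` and omits bias terms; the function name is illustrative.

```python
import numpy as np

def rnn_forward(xs, U, W, V):
    """Forward pass of a vanilla RNN following equations (2.11)-(2.12).

    xs: sequence of input vectors x_0, x_1, ...; the initial state
    s_{-1} is the zero vector. Returns the hidden states and outputs.
    """
    s = np.zeros(W.shape[0])
    states, outputs = [], []
    for x in xs:
        s = np.tanh(U @ x + W @ s)  # same U, W reused at every time step
        states.append(s)
        outputs.append(V @ s)
    return np.array(states), np.array(outputs)
```

Unrolling the loop makes the weight sharing explicit: every time step applies the same three matrices.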

RNNs can be stacked upon each other to create deeper structures. The bidirectional
RNN [65] uses two RNNs, one processing the sequence from left to right and the other from right
to left, to model the sequence using both past and future context. Furthermore,
deep bidirectional RNNs have been used in speech recognition [2], enabling much
higher learning capacity.

2.4.2 Learning

An RNN is trained via the Backpropagation Through Time (BPTT) algorithm, which is essentially
standard backpropagation acting on the unfolded RNN. A BPTT update involves all the previous inputs
since the weights are shared. For instance, for some error function $E_n$ of the $n$-th output,
the gradient with respect to the weight matrix $W$ is

\[ \frac{\partial E_n}{\partial W} = \frac{\partial E_n}{\partial y_n} \frac{\partial y_n}{\partial s_n} \frac{\partial s_n}{\partial W}. \tag{2.13} \]

Note that $s_n$ depends on all preceding states, all of which depend on $W$. Thus we can unfold it
according to the chain rule as follows:

\[ \frac{\partial E_n}{\partial W} = \sum_{k=0}^{n} \frac{\partial E_n}{\partial y_n} \frac{\partial y_n}{\partial s_n} \frac{\partial s_n}{\partial s_k} \frac{\partial s_k}{\partial W}, \tag{2.14} \]

where $\partial s_k / \partial W$ in (2.14) denotes the immediate derivative with $s_{k-1}$ held fixed.
This training method is known to suffer from the exploding or vanishing gradient
problem [5][61], and thus a standard RNN has difficulty modeling long-term dependencies.
This problem can be mitigated by a careful choice of initial weights, by using ReLU instead of
hyperbolic tangent, or by other normalization methods; alternatively, variants of RNNs have been
proposed to explicitly encode the internal memory of the neural network.
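To see where the vanishing or exploding behavior comes from, the factor $\partial s_n / \partial s_k$ in (2.14) can be expanded for the vanilla RNN of (2.11). The following short derivation ignores bias terms; the norm bound is the standard informal argument and is only a sketch.

```latex
\frac{\partial s_n}{\partial s_k}
  = \frac{\partial s_n}{\partial s_{n-1}}
    \frac{\partial s_{n-1}}{\partial s_{n-2}}
    \cdots
    \frac{\partial s_{k+1}}{\partial s_k},
\qquad
\frac{\partial s_j}{\partial s_{j-1}}
  = \operatorname{diag}\!\left(1 - s_j \odot s_j\right) W,
\\[4pt]
\left\| \frac{\partial s_n}{\partial s_k} \right\|
  \le \prod_{j=k+1}^{n}
      \left\| \operatorname{diag}\!\left(1 - s_j \odot s_j\right) \right\|
      \, \| W \|
  \le \| W \|^{\,n-k}.
```

Since the derivative of tanh is at most 1, the product shrinks exponentially in $n - k$ whenever $\|W\| < 1$, which is the vanishing case; when $\|W\|$ is large the product is no longer bounded in this way and can grow instead.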

2.4.3 Long Short Term Memory and Gated Recurrent Unit

Figure 2.8: An LSTM unit and a GRU unit. (a) LSTM: the 3 nonlinear gates σ(·) from left to right are the forget, input and output gates. (b) A GRU unit contains only 2 gates. Illustration from [1].

Long Short Term Memory (LSTM) networks were proposed in 1997 by Hochreiter and
Schmidhuber [36] to resolve the vanishing gradient problem. The idea is to introduce a cell
state $c_t$, manipulated by several gates as follows (see Figure 2.8a):

\[ i = \sigma(x_t U^i + h_{t-1} W^i), \tag{2.15} \]

\[ f = \sigma(x_t U^f + h_{t-1} W^f), \tag{2.16} \]

\[ o = \sigma(x_t U^o + h_{t-1} W^o), \tag{2.17} \]

\[ g = \tanh(x_t U^g + h_{t-1} W^g), \tag{2.18} \]

\[ c_t = c_{t-1} \odot f + g \odot i, \tag{2.19} \]

\[ h_t = \tanh(c_t) \odot o, \tag{2.20} \]

where $\odot$ is the element-wise Hadamard product, $\sigma$ is the sigmoid activation function, $i$, $f$ and $o$ are
the input, forget and output gates respectively, $c_t$ is the internal cell state, and $h_t$ is the output
hidden state. Both the input and forget gates pass the input and the hidden state to an activation
function that squashes the value into $[0, 1]$. Then the cell state is updated via (2.19) with explicit
control of how much to remember or forget. Finally, the output is generated through the output
gate $o$.
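One time step of (2.15)-(2.20) can be written out as below, keeping the row-vector convention of the equations. Storing the weight matrices in dictionaries keyed by gate name and omitting bias terms are assumptions made for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, U, W):
    """One LSTM step following equations (2.15)-(2.20).

    x, h_prev, c_prev are row vectors; U and W are dicts holding the
    input and recurrent weight matrices for the keys 'i', 'f', 'o', 'g'.
    """
    i = sigmoid(x @ U["i"] + h_prev @ W["i"])  # input gate
    f = sigmoid(x @ U["f"] + h_prev @ W["f"])  # forget gate
    o = sigmoid(x @ U["o"] + h_prev @ W["o"])  # output gate
    g = np.tanh(x @ U["g"] + h_prev @ W["g"])  # candidate cell update
    c = c_prev * f + g * i                     # keep part of the old cell, add new content
    h = np.tanh(c) * o                         # expose a gated view of the cell
    return h, c
```

Because the cell state is updated additively in (2.19), a forget gate close to 1 lets information in `c` persist across many steps.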

There are also different extensions of the LSTM structure [27][44], e.g. peephole
connections [21] that feed the cell state into the gates. The Gated Recurrent Unit (GRU) [11] is a
simpler variation of LSTM with two gates, a reset gate $r$ and an update gate $z$ (see Figure 2.8b):

\[ z = \sigma(x_t U^z + h_{t-1} W^z), \tag{2.21} \]

\[ r = \sigma(x_t U^r + h_{t-1} W^r), \tag{2.22} \]

\[ \tilde{h} = \tanh(x_t U^h + (h_{t-1} \odot r) W^h), \tag{2.23} \]

\[ h_t = (1 - z) \odot h_{t-1} + z \odot \tilde{h}, \tag{2.24} \]

where $\tilde{h}$ is the candidate hidden state.
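A GRU step can be sketched in the same style. The code assumes that the new state interpolates between the previous state and the candidate computed in (2.23), with the update gate weighting the candidate; some references swap the roles of z and 1 - z. The dictionary layout and the omission of bias terms are also assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, U, W):
    """One GRU step; U and W are dicts keyed by 'z', 'r', 'h'."""
    z = sigmoid(x @ U["z"] + h_prev @ W["z"])             # update gate
    r = sigmoid(x @ U["r"] + h_prev @ W["r"])             # reset gate
    h_cand = np.tanh(x @ U["h"] + (h_prev * r) @ W["h"])  # candidate state
    # Interpolate between the previous state and the candidate.
    return (1.0 - z) * h_prev + z * h_cand
```

Unlike the LSTM, there is no separate cell state: the hidden state itself carries the memory, and a single gate decides how much of it to overwrite.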

Chapter 3

Multi-view and Multi-label Learning

The sophisticated nonlinear structures of deep neural networks have achieved great success in
modeling rich data of high complexity, and have also shown great potential for smooth adaptation to
specific problems. On the other hand, due to the intricacies of the training process, exploiting
data and problem structures is essential for building effective models. Recent years have seen
an abundance of work addressing the following four ‘‘multi-’’ learning objectives:

• Multi-view learning (MVL) [73][83], also called multimodal learning,

refers to learning from data represented by multiple distinct feature sets, e.g. mining text

data with accompanying images;

• Multi-instance learning (MIL) [3] refers to learning predictors from training data in

bags, where each bag contains multiple samples/feature vectors called instances. The

individual instances carry their own attributes/labels, some of which give rise to the bag

labels, but they are not necessarily consistent, e.g. object recognition with images from

multiple perspectives;

• Multi-label learning (MLL) [70] refers to learning from a set of instances, each of

which can belong to multiple classes that are not mutually exclusive, e.g. predicting

attribute tags of items on e-commerce websites or Yelp;

• Multi-task learning (MTL) [19] refers to learning a low dimensional representation

of the data to share among a set of related tasks, e.g. extracting image features for

recognition and segmentation simultaneously.

These objectives have related problem formulations and may share methodologies
for working with structure. For example, MIL can be regarded as a special case of
MVL where each instance reveals a view of the entire bag; and if we think of MTL as learning
a predictor function for each individual task, MLL amounts to a set of tasks, each
predicting one label of interest.

In this chapter we introduce some deep MVL and MLL frameworks, which roughly
fall into two categories: (a) those that modify the network architecture to work with specific structures, and
(b) those that use the network model as a feature extractor to obtain a representation of the data, and
apply other techniques to perform multi-learning on it. Both approaches benefit from
deep representations and demonstrate the strength of deep neural networks.

3.1 Multi-view Learning

3.1.1 Deep Multi-view Representation Learning

A common technique for processing heterogeneous data types is to map them to a common
low-dimensional feature space. In deep MVL problems, there are mainly two types of approaches [78]:
auto-encoder based and canonical correlation analysis (CCA) [37] based.

Auto-encoder based methods attempt to learn a representation that best reconstructs the inputs. Ngiam et al. [59] proposed to derive a representation from a view that was always available at test time, from which the other views could be reconstructed. This constituted an auto-encoder network with one shared encoder and a separate decoder for each view, and was hence called the split auto-encoder. The objective of the split auto-encoder can be expressed as follows

$$\min_{W_f, W_p, W_q} \frac{1}{N} \sum_{i=1}^{N} \left( \|x_i - p(f(x_i))\|^2 + \|y_i - q(f(x_i))\|^2 \right), \tag{3.1}$$

where $x_i$, $y_i$ are data of different views, $f$ is the encoder network, $p$, $q$ are the decoder networks, and $W_\cdot$ are the network parameters.
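As a minimal sketch of how objective (3.1) is evaluated, the snippet below assumes purely linear maps for $f$, $p$ and $q$ (real split auto-encoders use deep nonlinear networks). The helper names `matvec`, `sq_dist` and `split_ae_loss` are hypothetical and are not taken from [59].

```python
def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split_ae_loss(X, Y, Wf, Wp, Wq):
    """Average of ||x - p(f(x))||^2 + ||y - q(f(x))||^2 over all samples.

    Assumption: f, p and q are linear maps given by the matrices Wf, Wp, Wq.
    """
    total = 0.0
    for x, y in zip(X, Y):
        code = matvec(Wf, x)  # shared code computed from view x only
        total += sq_dist(x, matvec(Wp, code)) + sq_dist(y, matvec(Wq, code))
    return total / len(X)

identity = [[1.0, 0.0], [0.0, 1.0]]
doubling = [[2.0, 0.0], [0.0, 2.0]]
X = [[1.0, 2.0], [3.0, -1.0]]
Y = [[2.0, 4.0], [6.0, -2.0]]  # toy second view, exactly 2 * x

print(split_ae_loss(X, Y, identity, identity, doubling))  # 0.0
print(split_ae_loss(X, Y, identity, identity, identity))  # 7.5
```

Note that both reconstructions are driven by the code of the first view alone, which is what allows the second view to be absent at test time.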

Andrew et al. proposed a deep extension of CCA called DCCA [4]. In DCCA neural

networks were used for each view of the data as per-view feature extractors, and CCA was

performed on the extracted features:

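$$\begin{aligned}
\max_{W_f, W_g, U, V} \quad & \frac{1}{N} \operatorname{tr}\left(U^\top f(X)\, g(Y)^\top V\right) \\
\text{s.t.} \quad & U^\top \left( \frac{1}{N} f(X) f(X)^\top + r_x I \right) U = I, \\
& V^\top \left( \frac{1}{N} g(Y) g(Y)^\top + r_y I \right) V = I, \\
& u_i^\top f(X)\, g(Y)^\top v_j = 0, \quad \text{for } i \neq j,
\end{aligned} \tag{3.2}$$

where $U = [u_1, \dots, u_L]$ and $V = [v_1, \dots, v_L]$ are the CCA projection directions and $r_x, r_y > 0$ are regularization parameters.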
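To make the objective concrete, consider the special case $L = 1$ with scalar network outputs. Assuming the features are centered, the two normalization constraints fix $u$ and $v$ up to sign, and the optimum of (3.2) reduces to a regularized correlation coefficient. The sketch below computes that quantity; `regularized_correlation` is a hypothetical helper and this is not the DCCA training procedure of [4].

```python
import math

def regularized_correlation(f_out, g_out, rx=1e-4, ry=1e-4):
    """Optimal value of (3.2) for L = 1 and one-dimensional features.

    Assumptions: f_out and g_out hold the scalar outputs f(x_i), g(y_i);
    they are centered here before the covariance terms are computed.
    """
    n = len(f_out)
    mean_f = sum(f_out) / n
    mean_g = sum(g_out) / n
    fc = [a - mean_f for a in f_out]
    gc = [b - mean_g for b in g_out]
    cov = sum(a * b for a, b in zip(fc, gc)) / n
    var_f = sum(a * a for a in fc) / n + rx  # (1/N) f f^T + r_x
    var_g = sum(b * b for b in gc) / n + ry  # (1/N) g g^T + r_y
    return abs(cov) / math.sqrt(var_f * var_g)

f_out = [1.0, 2.0, 3.0, 4.0]
g_out = [2.0, 4.0, 6.0, 8.0]  # perfectly correlated toy features
print(regularized_correlation(f_out, g_out, rx=0.0, ry=0.0))  # 1.0
print(regularized_correlation(f_out, g_out) < 1.0)            # True
```

With positive $r_x$, $r_y$ the value drops slightly below the plain correlation, which is the intended shrinkage effect of the regularizers.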

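Wang et al. combined these two objectives and proposed deep canonically correlated autoencoders (DCCAE) [78], formulated as follows

$$\begin{aligned}
\min_{W_f, W_g, W_p, W_q, U, V} \quad & -\frac{1}{N} \operatorname{tr}\left(U^\top f(X)\, g(Y)^\top V\right) + \frac{\lambda}{N} \sum_{i=1}^{N} \left( \|x_i - p(f(x_i))\|^2 + \|y_i - q(g(y_i))\|^2 \right) \\
\text{s.t.} \quad & \text{the constraints in (3.2),}
\end{aligned} \tag{3.3}$$

where $\lambda > 0$ is a trade-off parameter and each decoder $p$, $q$ reconstructs a view from the output of its own encoder. They argued that DCCAE offered a trade-off between input-feature and feature-feature mappings.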


Chang et al. [9] generalized the DCCA framework to not only optimize matching multi-view data records, but explicitly handle heterogeneous networks where similarity connections could exist between any data pair of either the same type or not. They proposed heterogeneous network embedding to jointly optimize the deep neural network modules and the linear embeddings of all views. For example, for a set of image and text data, inputs might come in the form of image-image, image-text or text-text tuples and not necessarily a matching two-view pair. They used an AlexNet based network to extract image features and a 2-layer ANN to process text data, and then defined an extra linear embedding for each feature vector

$$\tilde{p}(X) = U^\top p(X), \tag{3.4}$$

$$\tilde{q}(z) = V^\top q(z), \tag{3.5}$$

where $X$, $z$ are the image and text data, $p(\cdot)$, $q(\cdot)$ are the neural networks for each, and $U$, $V$ are the respective cascaded linear embeddings. Denoting the parameters of $p(\cdot)$ as $W_p$ and those of $q(\cdot)$ as $W_q$, the objective could be expressed as

$$\min_{W_p, W_q} \frac{1}{N_{II}} \sum_{ij} L(\tilde{p}(X_i), \tilde{p}(X_j)) + \frac{\lambda_1}{N_{TT}} \sum_{ij} L(\tilde{q}(z_i), \tilde{q}(z_j)) + \frac{\lambda_2}{N_{IT}} \sum_{ij} L(\tilde{p}(X_i), \tilde{q}(z_j)), \tag{3.6}$$

where $L(\cdot, \cdot)$ is a log loss function and $N_{II}$, $N_{TT}$, $N_{IT}$ are the numbers of image-image, text-text and image-text links in the heterogeneous graph.

3.1.2 Network Co-training

Split auto-encoders and DCCA methods both aim to bridge different modalities of data and evaluate their correlations, but there are times when the focus is to make inferences based on multi-view data and an end-to-end joint network model is required. In general there are two ways to set up joint co-training networks: multiple networks with shared top layers, and network cascading.

Figure 3.1: Structure of Deep Multi-view Hashing (DMVH) network [45].

Shared top layers in a joint network are usually one or several fully connected layers that take the concatenated multi-network outputs. They can be considered a top level ANN adding nonlinearity to the multi-view extracted features. Kang et al. proposed deep multi-view hashing [45] that took two feature vectors, e.g. HOG [56] and GIST [60], in a 4-layer network. Both hidden layers contained 3 groups of units: one for each view, forming direct connections from input to output, and the other fully connected to the preceding layer to encode common characteristics (see Figure 3.1). Elkahky et al. proposed Multi-view DNN [18] for recommendation systems to process user and webpage features. They suggested using multiple neural networks to compute compact features of the same length from the multi-view data, then computing the cosine similarities across the views, and eventually computing the softmax score of the cosine similarities. Wu et al. [82] proposed HMM-FNN for dynamic gesture recognition, incorporating 2 HMM layers, one for hand posture features and the other for trajectory features, which were merged into fully connected layers. Ha et al. [30] proposed to use multiple RNNs to compute features from a set of description text data on e-commerce websites, and then run 2 fully connected layers and a softmax output on the concatenated features.
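The cosine-similarity-plus-softmax scoring described for Multi-view DNN can be sketched as follows. This is only an illustration under stated assumptions: the per-view networks are omitted and their outputs are given directly as vectors, and the scaling factor `gamma` as well as the function names are hypothetical rather than taken from [18].

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def match_probabilities(user_vec, item_vecs, gamma=1.0):
    """Softmax over cosine similarities between one user and several items.

    Assumption: user_vec and item_vecs are the compact features already
    produced by the per-view networks; gamma is an illustrative scale.
    """
    return softmax([gamma * cosine(user_vec, v) for v in item_vecs])

user = [1.0, 0.0]
items = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]  # aligned, orthogonal, opposite
probs = match_probabilities(user, items)
print(probs[0] > probs[1] > probs[2])  # True
```

Because cosine similarity ignores vector length, only the direction of the compact features matters for the ranking of items.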

Network cascading is often used when a pivot view is present in the multi-view data that other views depend on, so that cascading networks well reflect their dependencies, most notably in the image captioning problem. Show and Tell proposed by Vinyals et al. [76], for example, fed CNN visual features into an LSTM to generate complete sentences to describe the input images. Donahue et al. extended the model to stacked LSTMs [16] and Johnson et al. added in a localization layer to better capture regional features [43]. Multimodal RNN proposed by Mao et al. [57] had the CNN features join in along with the outputs of a recurrent layer and a fully connected embedding layer from the text inputs (see Figure 3.2). Kiros et al. proposed to first learn a joint image-sentence embedding with a CNN-LSTM encoder, and then process the embedding vector with a novel neural language model decoder [47].

Figure 3.2: Structure of m-RNN network [57]. Labels in the figure: $w_{t-1}$, $w_t$, Embedding, Recurrent, Joint Layer, SoftMax, CNN, Features, Image.

3.2 Multi-label Learning

It seems very natural to apply neural network models to MLL problems since the N -way

output of the network may act as N label classifiers. However, such naïve models ignore

label correlations and are therefore suboptimal. Furthermore, multi-label datasets often have

some labels that are very rare. The severely imbalanced data are challenging to train and often

require special processing.

In this section we first introduce some basic practice to incorporate neural networks in

MLL problems, then we explain some loss functions for MLL, and finally we present some

deep MLL models.


3.2.1 Going Deep with Multi-label Learning

One of the earliest neural network based MLL methods is BP-MLL for text categorization

proposed by Zhang and Zhou [84], featuring a fully-connected ANN with 1 hidden layer, 1

output layer to encode likelihood for each label, and a pairwise error function for training

$$E = \sum_{i=1}^{m} E_i = \sum_{i=1}^{m} \frac{1}{|C_i^+||C_i^-|} \sum_{(k,l) \in C_i^+ \times C_i^-} \exp\left(-(c_{ik} - c_{il})\right), \tag{3.7}$$

where $C_i^+$ is the positive label set of sample $i$, $C_i^-$ is the negative set, and $c_{ik}$, $c_{il}$ are outputs of the neural network. The pairwise error penalizes negative labels that receive a higher output likelihood than positive ones. However, it was later shown that this form of error function, along with other convex loss functions, is not consistent with the non-convex, discontinuous rank loss [20][8], and empirically BP-MLL did not exhibit significantly better performance than non-NN approaches such as k Nearest Neighbors [85].
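The per-sample term $E_i$ of (3.7) can be computed directly from the network outputs. The following is a minimal sketch; the function name and the convention of returning zero when either label set is empty are assumptions made for illustration.

```python
import math

def bpmll_error(outputs, positives):
    """Pairwise exponential error E_i of (3.7) for a single sample.

    outputs:   list of network outputs c_i, one per label
    positives: set of indices of the positive labels C_i^+
    Assumption: a sample with an empty positive or negative set adds 0.
    """
    pos = [k for k in range(len(outputs)) if k in positives]
    neg = [l for l in range(len(outputs)) if l not in positives]
    if not pos or not neg:
        return 0.0
    total = sum(math.exp(-(outputs[k] - outputs[l])) for k in pos for l in neg)
    return total / (len(pos) * len(neg))

well_ranked = bpmll_error([1.0, -1.0], {0})    # positive scored above negative
badly_ranked = bpmll_error([-1.0, 1.0], {0})   # positive scored below negative
print(well_ranked < badly_ranked)              # True
print(bpmll_error([0.0, 0.0], {0}))            # 1.0
```

The error never reaches zero for finite outputs, so even correctly ordered pairs keep contributing a small gradient.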

Based on BP-MLL, Nam et al. incorporated modern deep learning practices and tricks into a large-scale multi-label text classification method [58] and achieved state-of-the-art performance at the time. They proposed to apply:

• Cross entropy

$$E = \sum_i E_i = \sum_{i,k} -\left( y_{ik} \log c_{ik} + (1 - y_{ik}) \log (1 - c_{ik}) \right), \tag{3.8}$$

where $c_{ik}$ is the activation and $y_{ik} \in \{0, 1\}$ is the target for label $k$ of sample $i$, instead of error function (3.7);

• ReLU instead of hyperbolic tangent activations;

• Dropout;

• AdaGrad optimization [17].
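As an illustration of the cross entropy (3.8) in the list above, the sketch below evaluates the per-sample term for one instance. It assumes the activations are already sigmoid outputs in (0, 1); the clipping constant `eps` and the function name are hypothetical additions for numerical safety.

```python
import math

def multilabel_cross_entropy(activations, targets, eps=1e-12):
    """Per-sample cross entropy E_i of (3.8), summed over labels.

    activations: values c_ik in (0, 1), e.g. sigmoid outputs
    targets:     values y_ik in {0, 1}
    eps:         clipping constant (an assumption, avoids log(0))
    """
    total = 0.0
    for c, y in zip(activations, targets):
        c = min(max(c, eps), 1.0 - eps)
        total -= y * math.log(c) + (1 - y) * math.log(1.0 - c)
    return total

confident = multilabel_cross_entropy([0.9, 0.1, 0.8], [1, 0, 1])
uncertain = multilabel_cross_entropy([0.5, 0.5, 0.5], [1, 0, 1])
print(confident < uncertain)  # True
```

Unlike the pairwise error (3.7), each label contributes independently, so the cost grows linearly rather than quadratically with the number of labels.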


An alternative construction of the output layer was suggested by Huang et al. [38]. They used a 5-layer pre-trained DBN model and set up a pair of output units for each label $l$, $c_{\cdot l}$ for positive and $\bar{c}_{\cdot l}$ for negative, and they computed a softmax over the pair to obtain the probability of having the label:

$$p_{\cdot l} = \frac{\exp(c_{\cdot l})}{\exp(c_{\cdot l}) + \exp(\bar{c}_{\cdot l})}, \tag{3.9}$$

and they also used cross entropy as the error function.
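A short numerical check helps to relate the paired-unit output to the usual single sigmoid unit: the two-way softmax over a positive and a negative unit equals the logistic sigmoid of the difference of the two activations. The names `pair_softmax`, `c_pos` and `c_neg` below are hypothetical.

```python
import math

def pair_softmax(c_pos, c_neg):
    """Probability of the label from a (positive, negative) unit pair."""
    top = max(c_pos, c_neg)  # subtract the maximum for numerical stability
    e_pos = math.exp(c_pos - top)
    e_neg = math.exp(c_neg - top)
    return e_pos / (e_pos + e_neg)

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

# The paired softmax depends only on the difference c_pos - c_neg.
print(abs(pair_softmax(2.0, -1.0) - sigmoid(3.0)) < 1e-12)  # True
print(pair_softmax(0.3, 0.3))                               # 0.5
```

The pair therefore carries one redundant degree of freedom per label compared with a sigmoid output.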

3.2.2 Multi-label Ranking Loss

Gong et al. [25] explored some different loss functions for MLL, which they referred to as

multilabel image annotation problem. In the paper they adopted the AlexNet structure and

experimented with 3 loss functions. The first one was softmax loss inspired by TagProp [29].

They first computed the softmax

$$p_{\cdot l} = \frac{\exp(c_{\cdot l})}{\sum_k \exp(c_{\cdot k})} \tag{3.10}$$

as the posterior probability of each label, and set the ground truth as a label vector $y_\cdot$ such that

$$y_{\cdot l} = \frac{I_l}{c_+}, \tag{3.11}$$

where $c_+$ is the number of positive labels and $I_l$ is an indicator function that equals 1 when label $l$ is positive and 0 otherwise. Finally they minimized the KL-divergence between the ground truths and the predictions

$$E = \frac{1}{n} \sum_{i=1}^{n} D_{KL}(y_i \,\|\, p_i). \tag{3.12}$$
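The softmax loss of (3.10) to (3.12) can be sketched for a single image as below. The divergence is taken from the target distribution to the prediction, which is the direction that stays finite when the target contains zeros; labels with zero target probability then contribute nothing. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax, as in (3.10)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_label_loss(scores, positives):
    """KL divergence from the target (3.11) to the softmax prediction.

    scores:    raw network outputs, one per label
    positives: non-empty list of positive label indices
    """
    p = softmax(scores)
    target = 1.0 / len(positives)  # uniform mass on the positive labels
    return sum(target * math.log(target / p[l]) for l in positives)

print(softmax_label_loss([0.0, 0.0], [0, 1]))          # 0.0
print(softmax_label_loss([3.0, 3.0, -3.0], [0, 1])
      < softmax_label_loss([3.0, -3.0, 3.0], [0, 1]))  # True
```

The loss is zero exactly when the predicted distribution spreads its mass uniformly over the positive labels.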

The second loss function was the pairwise ranking loss [42] that penalized negative labels that scored higher than positive ones:

$$E = \sum_i \sum_{j \in C_i^+} \sum_{k \in C_i^-} \max(0, 1 - c_{ij} + c_{ik}), \tag{3.13}$$

but they argued that (3.13) did not directly optimize the top-k accuracy, which was crucial in the MLL setting, and that it was therefore suboptimal.
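For one image, the hinge-style pairwise ranking loss (3.13) can be sketched as follows; the function name is hypothetical and the margin of 1 follows the equation.

```python
def pairwise_ranking_loss(scores, positives):
    """Pairwise hinge ranking loss of (3.13) for a single sample.

    scores:    network outputs c_i, one per label
    positives: set of indices of the positive labels
    """
    pos = [j for j in range(len(scores)) if j in positives]
    neg = [k for k in range(len(scores)) if k not in positives]
    return sum(max(0.0, 1.0 - scores[j] + scores[k]) for j in pos for k in neg)

# The positive label beats every negative by at least the margin: no loss.
print(pairwise_ranking_loss([2.0, 0.0, 0.5], {0}))  # 0.0
# A tie between a positive and a negative still violates the margin.
print(pairwise_ranking_loss([0.0, 0.0], {0}))       # 1.0
```

Every violating pair is weighted equally here, regardless of where the positive label sits in the ranking.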

The last loss function they considered was the weighted approximate ranking (WARP) loss, first described in [80]:

$$E = \sum_i \sum_{j \in C_i^+} \sum_{k \in C_i^-} L(r_j) \max(0, 1 - c_{ij} + c_{ik}), \tag{3.14}$$

where $r_j$ is the rank of the $j$-th label of image $i$ and $L(\cdot)$ is a weighting function that is increasing with respect to ranks, so that positive labels that were ranked low received a large penalty and were pushed to the top. The rank $r_j$ of a label was estimated via sampling.
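The effect of the rank weighting in (3.14) can be illustrated with a small sketch. Two simplifications are assumed here and labeled in the code: the weighting is taken to be the harmonic sum $L(r) = \sum_{t=1}^{r} 1/t$, one common choice of an increasing function, and the rank is computed exactly as the number of margin-violating negatives instead of being estimated by sampling.

```python
def warp_weight(rank):
    """Assumed weighting L(r) = 1 + 1/2 + ... + 1/r, increasing in r."""
    return sum(1.0 / t for t in range(1, rank + 1))

def warp_loss(scores, positives):
    """Rank-weighted hinge loss in the spirit of (3.14) for one sample.

    Assumption: the rank of a positive label is the exact number of
    negatives violating the margin; the original method samples it.
    """
    pos = [j for j in range(len(scores)) if j in positives]
    neg = [k for k in range(len(scores)) if k not in positives]
    total = 0.0
    for j in pos:
        violators = [k for k in neg if 1.0 - scores[j] + scores[k] > 0.0]
        weight = warp_weight(len(violators))
        total += weight * sum(1.0 - scores[j] + scores[k] for k in violators)
    return total

print(warp_loss([3.0, 0.0, 0.0], {0}))  # 0.0
print(warp_loss([0.0, 1.0, 1.0], {0}))  # 6.0
```

In the second call both negatives violate the margin with hinge value 2, so the unweighted loss of 4 is scaled by $L(2) = 1.5$.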

3.2.3 Multi-attribute Learning

In general, the multiple labels of data samples in MLL are non-mutually-exclusive attributes,

for example, an image of a tree can be labeled as tree, plant, green, etc. simultaneously. This

setting can be referred to as multi-attribute learning. Multi-attribute learning is closely related

to the ranking problem, as we generally predict the probability of a set of attributes, high and

low; and unlike the classification problem, we not only pick the one with highest probability,

but we normally consider all attributes exceeding a set threshold as positive.

Different attributes are often correlated. For example, attributes can be hierarchical, be likely to co-occur, or be unlikely to co-occur. Che et al. [10] proposed a graph Laplacian prior to enforce label correlation. Let $\beta_i \in \mathbb{R}^k$ be the edge weights from unit $i$ in the second-to-last layer to the output layer, referred to as the output weights, and $A_{k \times k}$ be the similarity matrix of the $k$ labels. The Laplacian matrix $L = C - A$, with $C = \operatorname{diag}\left(\left(\sum_j A_{ij}\right)_i\right)$, has the property

$$\operatorname{tr}(\beta^\top L \beta) = \frac{1}{2} \sum_{ij} A_{ij} \|\beta_i - \beta_j\|_2^2, \tag{3.15}$$

so that the regularized loss function can be written as follows

$$\mathcal{L} = \sum_i E_i + \lambda R(\Theta) + \frac{\rho}{2} \operatorname{tr}(\beta^\top L \beta), \tag{3.16}$$

where $E_i$ is a per-sample loss term, $\Theta$ denotes the model parameters, $R(\cdot)$ is a norm, and the Laplacian regularizer encourages closely related labels to have similar predictions. The similarity matrix $A$ can be either structure-based (e.g. graph adjacency matrix) or data-driven (e.g. label co-occurrence matrix).
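Identity (3.15) can be checked numerically on a toy example. The sketch below assumes a symmetric similarity matrix and stores the output weights as a matrix `B` with one row per label, so that the rows compared in the pairwise form are indexed by label; this layout and all names are assumptions for illustration.

```python
def laplacian(A):
    """Graph Laplacian L = C - A with C the diagonal matrix of row sums."""
    k = len(A)
    return [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(k)]
            for i in range(k)]

def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def trace_form(B, L):
    """tr(B^T L B), written as sum_ij L_ij <B_i, B_j> over the rows of B."""
    k = len(B)
    return sum(L[i][j] * dot(B[i], B[j]) for i in range(k) for j in range(k))

def pairwise_form(B, A):
    """(1/2) * sum_ij A_ij * ||B_i - B_j||^2 over the rows of B."""
    k = len(B)
    return 0.5 * sum(A[i][j] * sum((a - b) ** 2 for a, b in zip(B[i], B[j]))
                     for i in range(k) for j in range(k))

A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 2.0],
     [0.0, 2.0, 0.0]]   # symmetric similarity between 3 labels
B = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 2.0]]        # one weight row per label (assumed layout)

print(trace_form(B, laplacian(A)), pairwise_form(B, A))  # 12.0 12.0
```

The pairwise form makes the role of the regularizer explicit: labels with a large similarity $A_{ij}$ are penalized for having dissimilar weight vectors.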

Shankar et al. introduced a weakly supervised MLL problem and proposed deep-

carving to train CNNs for this setting [66]. Under weak supervision, all data instances may

have multiple labels as in MLL, but only one is known as a ground truth label and all others

are missing. They observed deep CNNs were able to generate disentangled feature maps

even in a weakly supervised scenario, but would get befuddled by the lack of labels which

later disrupted convergence. Using AlexNet as base network architecture, they proposed a

deep-carving algorithm to update the pseudo-labels for training to reflect the similarities of the

convolutional feature maps between single samples and per class averages. The CNN was then

able to carve itself iteratively for MLL.

3.2.4 Regional Object Detection

Regional object detection is a special type of MLL problem on images. Natural images often contain multiple salient objects, and detecting them all accurately involves two different problems: localization and classification. Prior to 2014, the mainstream methods built complex ensemble systems on low level features, and recognition performance was stagnant for a few years.

With the new developments of CNN based methods, Girshick et al. devised regional

CNN (R-CNN) [23] to address the multiple object detection problem. R-CNN took the possible

regions containing object instances generated by region proposal methods (e.g. [75]) and passed

the image patches to a fine-tuned CNN, and finally used SVM to classify CNN features. To

train R-CNN, ground truth bounding boxes were required to filter region proposals, and the

CNN was first pre-trained on ImageNet and then fine-tuned on the dedicated dataset.

Fast R-CNN [22] replaced the CNN-SVM pipeline in original R-CNN with an end-

to-end architecture, with the help of a novel RoI max-pooling layer. The RoI pooling layer

replaced a middle max-pooling layer of an existing CNN architecture, e.g. VGG16, to accept

the convolutional feature maps produced by the preceding layers. It performed max-pooling on

the RoI window into a grid of the same size of the original max-pooling layer output so that

the resulting feature maps could proceed in the rest of the CNN. The network ended with two

output layers: softmax probabilities and a per-class bounding box regressor. Fast R-CNN got

rid of the multi-stage CNN+SVM training, but still required ground truth bounding boxes as

well as an external region proposal method for inference.
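The RoI max-pooling step can be sketched on a single-channel feature map as follows. This is a simplified illustration, not the Fast R-CNN implementation: the window is given in feature-map coordinates with exclusive bottom and right edges, and bin edges are rounded outward so that every bin is non-empty; these conventions and the function names are assumptions.

```python
def ceil_div(a, b):
    """Ceiling of a / b for non-negative integers."""
    return -(-a // b)

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the window roi = (top, left, bottom, right) to out_h x out_w.

    bottom and right are exclusive. Bin edges are rounded outward, so
    neighbouring bins may overlap when the window is not evenly divisible.
    """
    top, left, bottom, right = roi
    h, w = bottom - top, right - left
    pooled = []
    for i in range(out_h):
        r0 = top + (i * h) // out_h
        r1 = top + ceil_div((i + 1) * h, out_h)
        row = []
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = left + ceil_div((j + 1) * w, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled

fmap = [[4 * r + c for c in range(4)] for r in range(4)]  # values 0..15
print(roi_max_pool(fmap, (0, 0, 4, 4), 2, 2))  # [[5, 7], [13, 15]]
print(roi_max_pool(fmap, (1, 1, 4, 4), 2, 2))  # [[10, 11], [14, 15]]
```

Whatever the size of the input window, the output grid has a fixed shape, which is what lets the remaining fully connected layers be shared across all region proposals.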

Faster R-CNN [62] went on to speed up the region proposal process by joining a region proposal network (RPN) with an object detection network that shared a large portion of the CNN. After the last shared convolutional layer, a small network ran on a sliding window and split into a classifier and a bounding box regressor, as in Fast R-CNN. Region proposals of different shapes and sizes were generated and run through the sliding window network unit, enabling nearly cost-free region proposals during inference.

Wei et al. proposed Hypotheses-CNN-Pooling (HCP) [79] that eliminated the need for ground truth region bounding boxes. HCP was based on two key assumptions: (a) each region proposal, referred to as a hypothesis, contained at most one object, and (b) all possible objects were covered by some hypotheses. For a group of hypotheses, they avoided the bounding box evaluation by introducing cross-hypothesis max-pooling. Every hypothesis $h_i$, valid or not, was fed into a shared CNN to produce an output vector $v_i$, and the following fusion layer computed

$$v^{j} = \max_{i} v_i^{j}, \tag{3.17}$$

where $v_i^{j}$ is the $j$-th component of $v_i$. The cross-hypothesis max-pooling effectively suppressed noisy hypotheses and preserved high responses from valid hypotheses, making it unnecessary to determine bounding boxes of objects.
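The fusion step (3.17) is an element-wise maximum over the per-hypothesis output vectors. A minimal sketch, with a hypothetical function name and made-up scores for three hypotheses and three labels:

```python
def cross_hypothesis_max_pool(hypothesis_outputs):
    """Element-wise maximum over the output vectors of all hypotheses."""
    return [max(component) for component in zip(*hypothesis_outputs)]

outputs = [
    [0.10, 0.90, 0.00],  # hypothesis containing an object of label 1
    [0.80, 0.20, 0.10],  # hypothesis containing an object of label 0
    [0.05, 0.10, 0.05],  # noisy hypothesis with no confident label
]
print(cross_hypothesis_max_pool(outputs))  # [0.8, 0.9, 0.1]
```

The noisy third hypothesis never wins a component, so it has no influence on the fused prediction.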


Chapter 4

Applications

In this chapter we first introduce three application scenarios of deep MVL and MLL, and then

briefly present our preliminary work on x-ray scattering image classification.

4.1 Large Scale Item Categorization in e-Commerce

E-commerce websites like eBay and Amazon have attracted busy traffic and transactions

nowadays with the development of web and mobile technologies. Many new items are

registered online every day, represented by metadata such as title, category, image, price, etc.,

most of which are given manually by human sellers. Automatic item categorization attempts

to infer the category of items based on the metadata provided. The categorization problem is

difficult because of noisy information included in the text data, as well as the long tail data

distribution which makes data samples for certain smaller categories extremely imbalanced.

Ha et al. proposed deep categorization network (DeepCN) [30], which was an end-

to-end multiple RNN system for large scale item categorization. They denoted an item d as a

record with a category label y and an attribute vector x of 6 attributes:

$$d = \{x, y\} = \{x^{(1)}, x^{(2)}, \dots, x^{(6)}, y\}, \tag{4.1}$$

where the 6 attributes $x^{(i)}$ are item name, brand name, high-level category, shopping mall ID, manufacturer and image signature respectively; $x^{(1)}$ to $x^{(5)}$ are word sequences and $x^{(6)}$ is an image descriptor of color and edge patterns.

Figure 4.1: Structure of DeepCN [30]. Labels in the figure: Metadata Input, Attribute RNN, Concatenation, Fully Connected Layer, SoftMax Output.

DeepCN consisted of multiple attribute specific RNNs, fully connected layers and

one softmax output layer (see Figure 4.1). Each RNN processed one metadata attribute and

generated a semantic vector, and then all output vectors from RNNs were concatenated into

one and went through the fully connected layers. The output layer took the softmax of the last

fully connected layer to represent the probability of the categories. DeepCN could be trained

with a normal backpropagation in the top layers, and then BPTT for the attribute RNNs.

4.2 Deep Semantic Ranking Based Hashing

Representing images efficiently is crucial for content-based image retrieval. Binary hashing is a

popular representation method because it is straightforward to compute and store. However,

traditional hashing schemes that are data-independent or distance metrics based fail to grasp

semantic (dis)similarity among images. Therefore learning based, especially deep learning

based hashing is highly desired.

Formally, a hash function $h : \mathbb{R}^D \to \{-1, 1\}$ maps a $D$-dimensional input onto a binary code. For learning based image hashing, the objective is to learn a set of hash functions $h(x) = [h_1(x), h_2(x), \dots, h_K(x)]$ preserving the semantic structures.

Figure 4.2: Structure of deep hash function [86]. Labels in the figure: 5 convolutional layers, Fully connected Layer (FCa), Fully connected Layer (FCb), Hash Layer.

Zhao et al. devised a hashing scheme that used deep semantic features for multi-label

retrieval [86]. They proposed a deep hash function containing the 5 convolutional layers as in

AlexNet, followed by two fully connected layers (FCa, FCb) and a hash layer (see Figure 4.2).

Both FCa and FCb had direct connections to the hash layer, since they argued the features from

FCb alone had strong invariance and were not sensitive enough to semantic discrepancies. The

deep hash function was defined as

$$h(x; w) = \operatorname{sign}\left(w^\top [f_a(x); f_b(x)]\right), \tag{4.2}$$

where $w$ denotes the weight in the hash layer, and $f_a(\cdot)$ and $f_b(\cdot)$ are the output vectors of FCa and FCb respectively.
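A K-bit code built from hash functions of the form (4.2), together with the Hamming distance used to compare codes, can be sketched as follows. The convolutional and fully connected layers are omitted, so the FCa and FCb outputs are given directly; the names, the toy weights and the convention sign(0) = +1 are assumptions.

```python
def sign(value):
    """Binarize to {-1, +1}; assumption: sign(0) is taken as +1."""
    return 1 if value >= 0 else -1

def hash_code(fa_out, fb_out, W):
    """K-bit code: one hash function (4.2) per row of W.

    fa_out, fb_out: output vectors of FCa and FCb for one image
    W:              K weight vectors over the concatenation [fa; fb]
    """
    z = list(fa_out) + list(fb_out)
    return [sign(sum(w * x for w, x in zip(row, z))) for row in W]

def hamming(code_a, code_b):
    """Number of positions in which two codes differ."""
    return sum(1 for a, b in zip(code_a, code_b) if a != b)

W = [[1.0, 0.0, 0.0],
     [0.0, -1.0, 0.0],
     [0.0, 0.0, 1.0]]   # toy hash layer with K = 3 bits
code_1 = hash_code([0.5], [2.0, -1.0], W)
code_2 = hash_code([-0.5], [2.0, 1.0], W)
print(code_1, code_2, hamming(code_1, code_2))  # [1, -1, -1] [-1, -1, 1] 2
```

Since the sign function has zero gradient almost everywhere, training such a network requires a differentiable surrogate, which is the role of the surrogate loss discussed next.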

Given a training set $D$, the objective of supervised learning here was to make the hash code similarity, which was measured by Hamming distance, consistent with the multi-label similarity among images, i.e. how many labels were identical/different. For each query $q$, denote $\{x_n\}_{n=1}^{N}$ as a ranked list of samples with decreasing multi-label similarity with the query $q$. The loss function could be expressed as

$$F(W) = \sum_{q \in D,\ \{x_i\}_{i=1}^{M} \subset D} L_\omega\left(h(q; W), \{h(x_i; W)\}_{i=1}^{M}\right) + \frac{\alpha}{2} \left\|\mathrm{E}_q(h(q; W))\right\|_2^2 + \frac{\beta}{2} \|W\|_2^2, \tag{4.3}$$

where $W$ denotes the model parameters and $L_\omega$ is a surrogate loss of Hamming distances between hash codes weighted by similarity rank.

Figure 4.3: Pipeline of FCLN [43]. Labels in the figure: Image, CNN, conv, Conv feature, Localization Layer, Region proposal with scores, Region sampling, Sampling Grid, Bilinear Sampler, Regional feature, Recognition Network, LSTM; example captions "Striped gray cat" and "Cats watching TV".

4.3 Dense Image Captioning

Deep learning powered image understanding has made remarkable progress in two aspects:

multiple object detection in images (label density) and image captioning of high complexity

(label complexity). To unify the developments on both fronts, Johnson et al. [43] introduced

the dense captioning task to predict a set of detailed descriptions of multiple objects in an

image. They proposed a Fully Convolutional Localization Network (FCLN), as a fusion of

regional deep CNN and RNN language model, for the task.

FCLN consisted of a regional CNN based on Faster R-CNN [62] and a subsequent LSTM (see Figure 4.3). They followed the configuration of VGG16 before the last pooling layer, and then on the output feature map, candidate bounding boxes were sampled and passed through a regressor. In order to have differentiable box coordinates for optimization, they replaced the hard bounding boxes from the RoI layer in Fast R-CNN [22] with bilinear interpolation [40]. The sampled regional features were then run through 2 fully connected layers to obtain final feature representations before being fed into the LSTM.

Figure 4.4: Examples of x-ray scattering image attributes. Panel labels: Halo, Ring, Peak, AgBH.

4.4 Preliminary Work: X-ray Scattering Image Classification

X-ray scattering is a powerful technique for probing physical structure of materials at the

molecular and nano-scale. Modern x-ray scattering facilities can generate 50,000 to 1,000,000

images/day (1-4 TB/day), thus to automate the workflow is crucial for material discovery

endeavors.

A key problem in x-ray scattering image analysis is image classification. Scientists are interested in image attributes of various aspects, which roughly involve experiments, instrumentation, imaging, scattering features, samples, materials and specific substances. In our experimental image dataset, all 2832 images from 13 x-ray scattering runs have been labeled with 98 binary attributes of these 6 categories by a domain expert. Figure 4.4 provides a few examples of attributed images. A robust prediction method will help the entire process of image inspection, exploration, etc. Previous approaches [46] used traditional computer vision techniques for the classification task, and we attempted to bring deep MLL into this problem.

Since the amount of experimental data is insufficient for deep neural network training, we generated 100,000 synthetic images using simulation software to train the model. We aggregated 15 higher-level major attributes (see Table 4.1) from the synthetic dataset that covered typical visual patterns and physical meanings and had a good amount of positive samples, among which 9 (see Table 4.2) were also present in experimental data and could be used to evaluate the performance on experimental data.

We preprocessed the image data with the log transform

$$I' = \frac{\log (I + 1)}{\log (I_M + 1)}, \tag{4.4}$$

where $I$ is the pixel intensity and $I_M$ is the maximum intensity, and trained the CNN with the log images. We adopted the AlexNet model with a sigmoid output from the last fully connected layer as probability predictions, used cross entropy as the loss function, and trained the network with momentum 0.9. Performance on synthetic data and real data is listed as follows:

Table 4.1: Mean Average Precision on Synthetic Dataset

Category            mAP
Diffuse low-q       0.7823
Diffuse high-q      0.7624
Halo                0.7718
Higher orders       0.8943
Rings               0.8926
BCC                 0.0186
FCC                 0.0750
Hexagonal           0.1821
Lamellar            0.2725
Symmetry halo       0.4758
Symmetry ring       0.4055
Circular beamstop   0.3016
Wedge beamstop      0.6206
Linear beamstop     0.6402
Beam off image      0.8273

Table 4.2: Mean Average Precision on Experimental Dataset

Category            mAP
Diffuse low-q       0.5545
Diffuse high-q      0.1425
Halo                0.2414
Higher orders       0.6365
Rings               0.7485
Symmetry ring       0.0119
Circular beamstop   0.5649
Linear beamstop     0.3325
Beam off image      0.8750

We are considering several possible directions of improving the learning model: (a) exploit the hierarchical structure of attributes, such as major visual pattern vs. style variation, physical meaning vs. visual cue, to predict the attributes in a layered manner instead of all at once; (b) incorporate the statistics along the radial direction to facilitate inference; and (c) perform multi-scale classification.
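For reference, the log transform (4.4) used in the preprocessing step can be sketched as follows. The image is represented as a nested list of non-negative intensities; the function name and the handling of an all-zero image are assumptions made for illustration.

```python
import math

def log_normalize(image):
    """Apply I' = log(I + 1) / log(I_M + 1) to every pixel.

    image: list of rows of non-negative pixel intensities
    Assumption: an all-zero image is returned as all zeros.
    """
    i_max = max(max(row) for row in image)
    if i_max <= 0:
        return [[0.0 for _ in row] for row in image]
    scale = math.log(i_max + 1)
    return [[math.log(value + 1) / scale for value in row] for row in image]

image = [[0, 9], [99, 999]]   # intensities spanning three orders of magnitude
result = log_normalize(image)
print(result[0][0], result[1][1])  # 0.0 1.0
```

The transform maps the dynamic range onto [0, 1] and compresses bright pixels, so that weak scattering features are not dominated by a few very intense ones.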


Chapter 5

Conclusion

In this report we introduced some deep learning models in conjunction with MVL and MLL

problems, and presented some applications of deep multi-learning. Deep neural network

models have enjoyed great successes not only in their vanilla forms, but also with heavy

modifications for specific problems in recent years. Among these works end-to-end models

with novel connections and units worked particularly well and demonstrated the versatility of

neural networks. They were easy to train and able to incorporate problem intuitions naturally

into the model.

It is worth noting that although MVL and MLL problems indicate that our understanding of the data still plays an important role, this is not a return to the traditional approach of defining good hand crafted features. Deep learning methods are powerful at learning features automatically, and what we provide is an understanding of abstract entities and relations. In that sense, deep multi-learning enables high-level knowledge input and is one step ahead of generic deep models.

In the future, we plan to delve deeper into exploiting MVL and MLL structures with deep learning. We will be experimenting with the following ideas for more effective training:

• Try new data augmentation techniques to improve the network performance with limited

data, and better methods to organize input modalities and features for an MVL setting;

• Devise new transform layers to carry out the computations required in the pipeline so

that the network encompasses all model parameters and is able to perform stochastic

update all at once;

• Apply manifold embedding methods to model data and label interactions for more robust,

structure-aware MLL.


References

[1] Understanding LSTM Networks. URL http://colah.github.io/posts/2015-08-Understanding-LSTMs/.

[2] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen,

M. Chrzanowski, A. Coates, G. Diamos, and others. Deep speech 2: End-to-end speech

recognition in english and mandarin. In Proceedings of The 33rd International Conference

on Machine Learning, 2016.

[3] J. Amores. Multiple instance classification: Review, taxonomy and comparative study.

Artificial Intelligence, 201:81--105, Aug 2013.

[4] G. Andrew, R. Arora, J. A. Bilmes, and K. Livescu. Deep Canonical Correlation Analysis.

In Proceedings of the 30th International Conference on Machine Learning, pages 1247--

1255, 2013.

[5] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient

descent is difficult. IEEE Transactions on Neural Networks, 5(2):157--166, Mar 1994.

[6] Y. Bengio, A. Courville, and P. Vincent. Representation Learning: A Review and New

Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):

1798--1828, Aug 2013.

[7] S. R. Bulo, L. Porzi, and P. Kontschieder. Dropout distillation. In Proceedings of The

33rd International Conference on Machine Learning, pages 99--107, 2016.


[8] C. Calauzenes, N. Usunier, and P. Gallinari. On the (non-) existence of convex, calibrated

surrogate losses for ranking. In Neural Information Processing Systems, pages 197--205.

Curran Associates, Inc., 2012.

[9] S. Chang, W. Han, J. Tang, G.-J. Qi, C. C. Aggarwal, and T. S. Huang. Heterogeneous

network embedding via deep architectures. In Proceedings of the 21th ACM SIGKDD

International Conference on Knowledge Discovery and Data Mining, pages 119--128.

ACM, 2015.

[10] Z. Che, D. Kale, W. Li, M. T. Bahadori, and Y. Liu. Deep Computational Phenotyping.

In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge

Discovery and Data Mining, KDD ’15, pages 507--516, New York, NY, USA, 2015.

ACM.

[11] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and

Y. Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical

Machine Translation. In EMNLP 2014, 2014.

[12] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for

image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE

Conference on, pages 3642--3649. IEEE, 2012.

[13] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. High-

performance neural networks for visual object classification. In IJCAI 2011.

[14] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273--297,

1995.

[15] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In

2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition

(CVPR’05), volume 1, pages 886--893 vol. 1, Jun 2005.


[16] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan,

K. Saenko, and T. Darrell. Long-Term Recurrent Convolutional Networks for Visual

Recognition and Description. In Proceedings of the IEEE Conference on Computer

Vision and Pattern Recognition, pages 2625--2634, 2015.

[17] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning

and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121--2159,

2011.

[18] A. M. Elkahky, Y. Song, and X. He. A multi-view deep learning approach for cross domain user modeling in recommendation systems. In Proceedings of the 24th International Conference on World Wide Web, pages 278--288. ACM, 2015.

[19] A. Evgeniou and M. Pontil. Multi-task feature learning. In Advances in neural information

processing systems, volume 19, page 41, 2007.

[20] W. Gao and Z.-H. Zhou. On the Consistency of Multi-Label Learning. In COLT,

volume 19, pages 341--358, 2011.

[21] F. A. Gers and J. Schmidhuber. Recurrent nets that time and count. In IJCNN 2000,

Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks,

2000, volume 3, pages 189--194 vol.3, 2000.

[22] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on

Computer Vision, pages 1440--1448, 2015.

[23] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich Feature Hierarchies for Accurate

Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference

on Computer Vision and Pattern Recognition, pages 580--587, 2014.

[24] X. Glorot, A. Bordes, and Y. Bengio. Deep Sparse Rectifier Neural Networks. In Aistats,

volume 15, page 275, 2011.


[25] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe. Deep Convolutional Ranking for

Multilabel Image Annotation. In ICLR 2014.

[26] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. C. Courville, and Y. Bengio. Maxout

networks. In Proceedings of the 30th International Conference on Machine Learning,

volume 28, pages 1319--1327, 2013.

[27] K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber. LSTM: A

Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems,

2016.

[28] A. Grover, A. Kapoor, and E. Horvitz. A Deep Hybrid Model for Weather Forecasting.

In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge

Discovery and Data Mining, KDD ’15, pages 379--386, New York, NY, USA, 2015.

ACM.

[29] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. TagProp: Discriminative metric

learning in nearest neighbor models for image auto-annotation. In 2009 IEEE 12th

International Conference on Computer Vision, pages 309--316, Sep 2009.

[30] J.-W. Ha, H. Pyo, and J. Kim. Large-Scale Item Categorization in e-Commerce Using Multiple Recurrent Neural Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 107--115. ACM, 2016.

[31] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In

CVPR 2016.

[32] G. Hinton. Deep belief networks. Scholarpedia, 4(5):5947, 2009. URL http://www.scholarpedia.org/article/Deep_belief_networks.

[33] G. E. Hinton. Deep belief nets, 2007. URL http://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.pdf.

[34] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural

networks. Science, 313(5786):504--507, 2006.

[35] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets.

Neural computation, 18(7):1527--1554, 2006.

[36] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):

1735--1780, 1997.

[37] H. Hotelling. Relations Between Two Sets of Variates. Biometrika, 28(3/4):321--377,

1936.

[38] Y. Huang, W. Wang, L. Wang, and T. Tan. Multi-task deep neural network for multi-label

learning. In 2013 IEEE International Conference on Image Processing, pages 2897--2900.

IEEE, 2013.

[39] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by

reducing internal covariate shift. In Proceedings of the 32nd International Conference on

Machine Learning, 2015.

[40] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer

networks. In Advances in Neural Information Processing Systems, pages 2017--2025, 2015.

URL http://papers.nips.cc/paper/5854-spatial-transformer-networks.

[41] J. Jiang. A literature survey on domain adaptation of statistical classifiers. Technical

report, 2008. URL

http://sifaka.cs.uiuc.edu/jiang4/domain_adaptation/survey/da_survey.pdf.

[42] T. Joachims. Optimizing Search Engines Using Clickthrough Data. In Proceedings of

the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data

Mining, KDD ’02, pages 133--142, New York, NY, USA, 2002. ACM.

[43] J. Johnson, A. Karpathy, and L. Fei-Fei. Densecap: Fully convolutional localization

networks for dense captioning. In CVPR 2016.

[44] R. Jozefowicz, W. Zaremba, and I. Sutskever. An Empirical Exploration of Recurrent

Network Architectures. In Proceedings of The 32nd International Conference on Machine

Learning, pages 2342--2350, 2015.

[45] Y. Kang, S. Kim, and S. Choi. Deep learning to hash with multiple representations. In

2012 IEEE 12th International Conference on Data Mining, pages 930--935. IEEE, 2012.

[46] M. H. Kiapour, K. Yager, A. C. Berg, and T. L. Berg. Materials discovery: Fine-grained

classification of X-ray scattering images. In IEEE Winter Conference on Applications of

Computer Vision, pages 933--940. IEEE, 2014.

[47] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying Visual-Semantic Embeddings

with Multimodal Neural Language Models. In TACL 2015.

[48] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images.

Technical report, 2009.

[49] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep

Convolutional Neural Networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q.

Weinberger, editors, Advances in Neural Information Processing Systems 25, pages

1097--1105, 2012.

[50] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D.

Jackel. Handwritten digit recognition with a back-propagation network. In Advances in

neural information processing systems, 1990.

[51] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to

document recognition. Proceedings of the IEEE, 86(11):2278--2324, 1998.

[52] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436--444, May

2015.

[53] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural

networks: Tricks of the trade, pages 9--48. Springer, 2012.

[54] H. Li, W. Ouyang, and X. Wang. Multi-Bias Non-linear Activation in Deep Neural

Networks. In Proceedings of The 33rd International Conference on Machine Learning,

2016.

[55] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR 2014.

[56] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of

the Seventh IEEE International Conference on Computer Vision, volume 2, pages

1150--1157, 1999.

[57] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Deep Captioning with

Multimodal Recurrent Neural Networks (m-RNN). In ICLR 2015.

[58] J. Nam, J. Kim, E. L. Mencía, I. Gurevych, and J. Fürnkranz. Large-scale multi-label text

classification—revisiting neural networks. In Joint European Conference on Machine

Learning and Knowledge Discovery in Databases, pages 437--452. Springer, 2014.

[59] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning.

In Proceedings of the 28th international conference on machine learning (ICML-11),

pages 689--696, 2011.

[60] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of

the spatial envelope. International journal of computer vision, 42(3):145--175, 2001.

[61] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural

networks. In Proceedings of the 30th International Conference on Machine Learning,

volume 28, pages 1310--1318, 2013.

[62] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object

Detection with Region Proposal Networks. In C. Cortes, N. D. Lawrence, D. D. Lee,

M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing

Systems 28, pages 91--99, 2015.

[63] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-

propagating errors. Nature, 323:533--536, 1986.

[64] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,

A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual

Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211--

252, 2015.

[65] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions

on Signal Processing, 45(11):2673--2681, 1997.

[66] S. Shankar, V. K. Garg, and R. Cipolla. Deep-carving: Discovering visual attributes by

carving deep neural nets. In Proceedings of the IEEE Conference on Computer Vision

and Pattern Recognition, pages 3403--3412, 2015.

[67] H. Shimodaira. Improving predictive inference under covariate shift by weighting the

log-likelihood function. Journal of statistical planning and inference, 90(2):227--244,

2000.

[68] P. Y. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural

networks applied to visual document analysis. In ICDAR, volume 3, pages 958--962,

2003.

[69] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image

recognition. In ICLR 2015.

[70] M. S. Sorower. A literature survey on algorithms for multi-label learning. Technical

report, 2010. URL

http://people.oregonstate.edu/~sorowerm/pdf/Qual-Multilabel-Shahed-CompleteVersion.pdf.

[71] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity:

The all convolutional net. In ICLR 2015 Workshop.

[72] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout:

a simple way to prevent neural networks from overfitting. Journal of Machine Learning

Research, 15(1):1929--1958, 2014.

[73] S. Sun. A survey of multi-view machine learning. Neural Computing and Applications,

23(7-8):2031--2038, Feb 2013.

[74] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,

and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE

Conference on Computer Vision and Pattern Recognition, pages 1--9, 2015.

[75] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for

object recognition. International journal of computer vision, 104(2):154--171, 2013.

[76] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption

generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern

Recognition, pages 3156--3164, 2015.

[77] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. Regularization of neural networks

using dropconnect. In Proceedings of the 30th International Conference on Machine

Learning (ICML-13), pages 1058--1066, 2013.

[78] W. Wang, R. Arora, K. Livescu, and J. Bilmes. On deep multi-view representation

learning. In Proc. of the 32nd Int. Conf. Machine Learning (ICML 2015), pages 1083--

1092, 2015.

[79] Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan. HCP: A Flexible

CNN Framework for Multi-Label Image Classification. IEEE Transactions on Pattern

Analysis and Machine Intelligence, 38(9):1901--1907, Sep 2016.

[80] J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling Up To Large Vocabulary

Image Annotation. In Proceedings of the International Joint Conference on Artificial

Intelligence, IJCAI, 2011.

[81] S. Wiesler and H. Ney. A convergence analysis of log-linear training. In Advances in

Neural Information Processing Systems, pages 657--665, 2011.

[82] H. Wu, J. Wang, and X. Zhang. Combining hidden Markov model and fuzzy neural

network for continuous recognition of complex dynamic gestures. The Visual Computer,

pages 1--14, Aug 2015.

[83] C. Xu, D. Tao, and C. Xu. A Survey on Multi-view Learning. arXiv:1304.5634 [cs], Apr

2013.

[84] M.-L. Zhang and Z.-H. Zhou. Multilabel Neural Networks with Applications to Functional

Genomics and Text Categorization. IEEE Transactions on Knowledge and Data

Engineering, 18(10):1338--1351, Oct 2006.

[85] M.-L. Zhang and Z.-H. Zhou. ML-KNN: A lazy learning approach to multi-label learning.

Pattern Recognition, 40(7):2038--2048, Jul 2007.

[86] F. Zhao, Y. Huang, L. Wang, and T. Tan. Deep semantic ranking based hashing for

multi-label image retrieval. In Proceedings of the IEEE Conference on Computer Vision

and Pattern Recognition, pages 1556--1564, 2015.
