Modeling Images, Videos and Text
Using the Caffe Deep Learning
Library (Part 2)
Kate Saenko
Microsoft Summer Machine Learning School, St Petersburg 2015
PART I
INTRODUCTION
THE VISUAL DESCRIPTION PROBLEM
MODELING IMAGES
MODELING LANGUAGE
INTRO TO NEURAL NETWORKS
VIDEO-TO-TEXT NEURAL NETWORK
PART II
INTRO TO CAFFE
CAFFE IMAGE AND LANGUAGE MODELS
Intro to Caffe
Caffe Tour: follow along with the tutorial slides at http://tutorial.caffe.berkeleyvision.org/
Sequence Learning
• Instances of the form x = <x_1, x_2, x_3, …, x_T>
• Variable sequence length T
• Learn a transition function f with parameters W
• f should update the hidden state h_t and produce the output y_t:

    h_0 := 0
    for t = 1, 2, 3, …, T:
        <y_t, h_t> = f_W(x_t, h_{t-1})

[Diagram: module f takes x_t and the previous hidden state h_{t-1}, and outputs y_t and the new hidden state h_t]
Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
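The update loop above can be sketched directly in Python. This is a minimal sketch: `unroll` applies any transition function of the stated form, and `f_toy` is an illustrative stand-in (it just accumulates inputs), not a real RNN cell.

```python
import numpy as np

def unroll(f_W, xs, hidden_dim):
    """Run a transition function f_W over a sequence xs.

    f_W(x_t, h_prev) -> (y_t, h_t), applied for t = 1..T,
    starting from h_0 = 0, as in the pseudocode above.
    """
    h = np.zeros(hidden_dim)          # h_0 := 0
    ys = []
    for x_t in xs:                    # t = 1, 2, ..., T
        y_t, h = f_W(x_t, h)
        ys.append(y_t)
    return ys, h

# Toy transition function: accumulate inputs in the hidden state,
# output a scalar summary of the state at each step.
def f_toy(x_t, h_prev):
    h_t = h_prev + x_t
    return h_t.sum(), h_t

ys, h_T = unroll(f_toy, [np.ones(3)] * 4, hidden_dim=3)
# ys == [3.0, 6.0, 9.0, 12.0]
```

Note that the sequence length T is simply the length of `xs`; nothing in the loop depends on a fixed T, which is the point of sharing f across timesteps.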
Sequence Learning
[Diagram: the recurrence unrolled in time: starting from h_0 = 0, the same module f is applied at every step t, consuming x_t and h_{t-1} and producing z_t and h_t, for t = 1 … T]
Equivalent to a T-layer deep network, unrolled in time
Sequence Learning
• What should the transition function f be?
• At a minimum, we want something non-linear and differentiable
[Diagram: module f takes x_t and h_{t-1} and outputs z_t and h_t]
Sequence Learning
• A "vanilla" RNN:

    h_t = σ(W_hx x_t + W_hh h_{t-1} + b_h)
    z_t = σ(W_hz h_t + b_z)

• Problems:
  • Difficult to train: vanishing/exploding gradients
  • Unable to "select" inputs, hidden state, outputs
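One step of these equations can be sketched in NumPy. This is a minimal sketch of the vanilla RNN update above; the weight values are random placeholders for learned parameters, and the dimensions (D, H, Z) are illustrative.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, the sigma in the equations above."""
    return 1.0 / (1.0 + np.exp(-a))

def vanilla_rnn_step(x_t, h_prev, W_hx, W_hh, b_h, W_hz, b_z):
    """One vanilla RNN step:
       h_t = sigmoid(W_hx x_t + W_hh h_{t-1} + b_h)
       z_t = sigmoid(W_hz h_t + b_z)
    """
    h_t = sigmoid(W_hx @ x_t + W_hh @ h_prev + b_h)
    z_t = sigmoid(W_hz @ h_t + b_z)
    return h_t, z_t

rng = np.random.default_rng(0)
D, H, Z = 4, 3, 2                      # input, hidden, output sizes
W_hx = rng.normal(size=(H, D))
W_hh = rng.normal(size=(H, H))
W_hz = rng.normal(size=(Z, H))
b_h, b_z = np.zeros(H), np.zeros(Z)

h, z = vanilla_rnn_step(rng.normal(size=D), np.zeros(H),
                        W_hx, W_hh, b_h, W_hz, b_z)
```

Because every element of h_t passes through the sigmoid, repeated application squashes gradients, which is the vanishing-gradient problem noted above.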
Background: LSTM Unit (based on a slide by Geoff Hinton)
• RNNs have the vanishing gradient problem
• Solution: the LSTM of Hochreiter & Schmidhuber (1997)
• Keeps information for a long time in a memory cell:
  – keep gate = 1 maintains the state
  – write gate allows input from the rest of the RNN
  – read gate allows output to the rest of the RNN
  – all gates are differentiable
[Diagram: a memory cell storing the value 1.73, with a write gate on its input from the rest of the RNN, a read gate on its output to the rest of the RNN, and a keep gate maintaining its state]
[Diagram: the memory cell over time: the value 1.7 is written in (write = 1), maintained across steps while keep = 1 with all other gates closed, and finally read out when read = 1; with keep = 0 the stored value is discarded]
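The gated memory cell walked through above can be sketched as a toy cell with hard 0/1 gates, reproducing the 1.7 example. This is only an illustration of the gating idea; in a real LSTM the gate values are sigmoids computed from the input and hidden state, not constants.

```python
def gated_cell(cell, x, keep, write, read):
    """One step of the toy memory cell:
    - keep gate maintains the stored state,
    - write gate lets new input in,
    - read gate exposes the cell to the rest of the network.
    """
    cell = keep * cell + write * x
    output = read * cell
    return cell, output

# Reproduce the slide's example with the value 1.7:
c, out = gated_cell(0.0, 1.7, keep=0.0, write=1.0, read=0.0)  # write 1.7 in
c, out = gated_cell(c,   0.0, keep=1.0, write=0.0, read=0.0)  # keep it
c, out = gated_cell(c,   0.0, keep=1.0, write=0.0, read=1.0)  # read it out
# out == 1.7
```

Because the update is a sum gated by `keep`, the stored value can pass through many timesteps unchanged, which is exactly how the LSTM avoids vanishing gradients along the cell state.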
Sequence Learning
Long Short-Term Memory (LSTM), proposed by Hochreiter and Schmidhuber (1997)
• Allows long-term dependencies to be learned
• Effective for:
  • speech recognition
  • handwriting recognition
  • translation
  • parsing
Sequence Learning: LSTM (Hochreiter & Schmidhuber, 1997)
[Diagram: with the forget gate at 1 and the input gate at 0, the cell exactly remembers the previous cell state c_{t-1} and discards the input]
[Diagram: with the forget gate at 0 and the input gate at 1, the cell forgets the previous cell state and only remembers the input W x_t]
Image Description
[Diagram: a CNN encodes the image; its features condition a chain of LSTM units that produce the caption as sequential output: <BOS> a man is talking <EOS>]
Video Description
[Diagram: sequence-to-sequence model with sequential input and output; a CNN encodes each video frame, a stacked LSTM reads the frame features one per timestep, then generates the caption word by word: <BOS> a man is talking <EOS>]
Sequence learning features are now available in Caffe.
Check out PR #2033, "Unrolled recurrent layers (RNN, LSTM)".
Training Sequence Models
• At training time, we want the model to predict the next time step given all previous time steps: p(w_{t+1} | w_{1:t})
• Example: "A bee buzzes."

    t   input    output
    0   <BOS>    a
    1   a        bee
    2   bee      buzzes
    3   buzzes   <EOS>
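Building these (input, output) pairs from a tokenized caption is just a shift by one position. A minimal sketch (the function name is illustrative):

```python
def next_word_pairs(words):
    """Shift the sequence by one: at each step the model sees the
    current token as input and is trained to predict the next one."""
    seq = ["<BOS>"] + words + ["<EOS>"]
    return list(zip(seq[:-1], seq[1:]))

pairs = next_word_pairs(["a", "bee", "buzzes"])
# pairs == [("<BOS>", "a"), ("a", "bee"),
#           ("bee", "buzzes"), ("buzzes", "<EOS>")]
```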
Sequence Input Format
• First input: "cont" (continuation) indicators, shape T x N
• Second input: data, shape T x N x D
• Example with N = 2 streams and T = 6 timesteps per batch (cont value in parentheses; 0 marks the start of a new sequence):

    batch 1:
      stream 1:  a(0)   dog(1)  fetches(1)  <EOS>(1)  the(0)   bee(1)
      stream 2:  cat(0) in(1)   a(1)        hat(1)    <EOS>(1) a(0)
    batch 2:
      stream 1:  buzzes(1)  <EOS>(1)  a(0)     …
      stream 2:  tree(1)    falls(1)  <EOS>(1) …
• Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
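The cont indicators for one stream can be derived mechanically from the token sequences: 0 at the first token of each sequence, 1 everywhere else. A minimal sketch (function names are illustrative, not Caffe API):

```python
def cont_indicators(sentences):
    """Flatten sentences into one token stream plus a parallel
    0/1 'cont' stream: 0 marks the start of a new sequence,
    1 marks continuation of the current one."""
    tokens, cont = [], []
    for sent in sentences:
        for i, tok in enumerate(sent):
            tokens.append(tok)
            cont.append(0 if i == 0 else 1)
    return tokens, cont

tokens, cont = cont_indicators([
    ["a", "dog", "fetches", "<EOS>"],
    ["the", "bee", "buzzes", "<EOS>"],
])
# cont == [0, 1, 1, 1, 0, 1, 1, 1]
```

Slicing such a stream into chunks of T tokens gives the per-batch columns shown above; the 0 entries tell the recurrent layers where to reset the hidden state.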
Sequence Input Format
• Inference is exact over arbitrarily long sequences: the hidden state carries across batch boundaries
• Backpropagation is approximate: gradients are truncated at batch boundaries
Sequence Input Format
• Words are usually represented as one-hot vectors
[Diagram: with a vocabulary of 8000 words (a = 0, the = 1, am = 2, …, cat = 3999, bee = 4000, dog = 4001, …, run = 7999), the word "bee" becomes an 8000-D vector of all 0s except a 1 at the word's index, 4000]
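The one-hot encoding above is straightforward to sketch; the vocabulary size and the index of "bee" follow the slide's example:

```python
import numpy as np

def one_hot(index, vocab_size):
    """Return a vocab_size-dim vector of zeros with a 1 at `index`."""
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

bee = one_hot(4000, 8000)   # "bee": all 0s except index 4000
```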
Sequence Input Format
• EmbedLayer projects a one-hot vector into a dense, lower-dimensional embedding
• Rather than the full 8000-D one-hot vector, the layer takes the word's index (4000, the index of "bee" in our vocabulary) and outputs its 500-D embedding:

layer {
  type: "Embed"
  embed_param {
    input_dim: 8000
    num_output: 500
  }
}

[Diagram: the resulting 500-D embedding of "bee", e.g. (2.7e-2, 1.5e-1, 3.1e+2, …)]
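What an embedding layer computes is just a row lookup in a learned input_dim x num_output weight matrix, mathematically equivalent to multiplying the one-hot vector by that matrix but far cheaper. A NumPy sketch (the weights here are random placeholders for the learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8000, 500))   # input_dim x num_output; learned in practice

def embed(word_index):
    """Dense embedding of a word: select one row of W.
    Equivalent to one_hot(word_index) @ W, but without
    materializing the 8000-D one-hot vector."""
    return W[word_index]

emb = embed(4000)                  # 500-D embedding of "bee"

# Check the equivalence with the explicit one-hot product:
onehot = np.zeros(8000)
onehot[4000] = 1.0
```

This equivalence is why the layer can take raw word indices as input while still being trained by ordinary backpropagation: the gradient only touches the selected row of W.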
Image Description
[Example image captioning results]
Video Description
[Diagram: S2VT architecture; a CNN encodes each frame and a stacked LSTM first reads the frame encodings, then generates the caption word by word: <BOS> a man is talking <EOS>]
First N timesteps: watch the video; next M timesteps: produce the caption.
Venugopalan et al., "Sequence to Sequence -- Video to Text," 2015.
http://arxiv.org/abs/1505.00487
Demo: a language model trained on image captions (see the accompanying notebook).