
Learning Compact Recurrent Neural Networks with Block-Term Tensor Decomposition

Jinmian Ye1, Linnan Wang2, Guangxi Li1, Di Chen1, Shandian Zhe3, Xinqi Chu4, Zenglin Xu1

1 SMILE Lab, University of Electronic Science and Technology of China; 2 Brown University; 3 University of Utah; 4 Xjera Labs pte.ltd

Introduction

RNNs are powerful sequence modeling tools. We introduce a compact architecture based on Block-Term tensor decomposition (BTD) to address the redundancy problem.
Challenges: 1) traditional RNNs suffer from an excess of parameters; and 2) the Tensor-Train RNN has limited representation ability and flexibility because searching for the optimal setting of its ranks is difficult.
Contributions: 1) a new sparsely connected RNN architecture based on Block-Term tensor decomposition; and 2) better performance with far fewer parameters.

Figure: Architecture of BT-LSTM. The redundant dense connections between input and hidden state are replaced by a low-rank BT representation.

It should be noted that we substitute only the input-to-hidden matrix multiplication while retaining the overall design philosophy of the LSTM:

$(f_t', \, i_t', \, \tilde{c}_t', \, o_t') = W \cdot x_t + U \cdot h_{t-1} + b$    (1)

$(f_t, \, i_t, \, \tilde{c}_t, \, o_t) = \big(\sigma(f_t'), \, \sigma(i_t'), \, \tanh(\tilde{c}_t'), \, \sigma(o_t')\big)$    (2)
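For reference, here is a minimal NumPy sketch of one LSTM step following Eqs. (1)-(2); it is not the authors' implementation, and the shapes are assumptions. In BT-LSTM only the W @ x_t term would be replaced by the block-term contraction of Eq. (4) below.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, U, b):
        """One vanilla LSTM step. W: (4H, I), U: (4H, H), b: (4H,) hold the four gates stacked."""
        z = W @ x_t + U @ h_prev + b                                        # Eq. (1)
        f_, i_, c_, o_ = np.split(z, 4)                                     # gate pre-activations
        f, i, g, o = sigmoid(f_), sigmoid(i_), np.tanh(c_), sigmoid(o_)     # Eq. (2)
        c_t = f * c_prev + i * g                                            # cell-state update
        h_t = o * np.tanh(c_t)                                              # new hidden state
        return h_t, c_t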

Analysis

Comparison of complexity and memory usage of the vanilla RNN, Tensor-Train RNN (TT-RNN), and our BT-RNN. In this table, the weight matrix's shape is I × J. Here, J_max = max_k(J_k), k ∈ [1, d].

Method             Time                  Memory
RNN forward        O(IJ)                 O(IJ)
RNN backward       O(IJ)                 O(IJ)
TT-RNN forward     O(dIR^2 J_max)        O(RI)
TT-RNN backward    O(d^2 IR^4 J_max)     O(R^3 I)
BT-RNN forward     O(NdIR^d J_max)       O(R^d I)
BT-RNN backward    O(Nd^2 IR^d J_max)    O(R^d I)

Total #Parameters: $P_{\mathrm{BTD}} = N \left( \sum_{k=1}^{d} I_k J_k R + R^d \right)$

Figure: The number of parameters w.r.t. core-order d and Tucker-rank R, in the setting I = 4096, J = 256, N = 1; the vanilla RNN layer contains I × J = 1,048,576 parameters. When d is small, the first term $\sum_{k=1}^{d} I_k J_k R$ dominates the parameter count; when d is large, the second term $R^d$ does. Hence the number of parameters drops sharply at first but rises gradually as d grows (except when R = 1).

[Plot: number of parameters (log scale, 10^0 to 10^8) versus core-order d (1 to 12), one curve for each of R = 1, 2, 3, 4.]
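As a worked check of the formula above, the following Python sketch evaluates P_BTD for an assumed factorization of I = 4096 and J = 256 (the particular split into I_k and J_k is an illustration, not taken from the poster) and compares it with the dense count I × J.

    def btd_param_count(I_dims, J_dims, R, N=1):
        """P_BTD = N * (sum_k I_k * J_k * R + R^d), assuming equal Tucker ranks R."""
        d = len(I_dims)
        factor_params = sum(I_k * J_k * R for I_k, J_k in zip(I_dims, J_dims))
        core_params = R ** d
        return N * (factor_params + core_params)

    I_dims, J_dims = [8, 8, 8, 8], [4, 4, 4, 4]   # assumed: 8^4 = 4096, 4^4 = 256, so d = 4
    dense_params = 4096 * 256                     # 1,048,576 parameters in the vanilla layer
    for R in (1, 2, 3, 4):
        print("R =", R, "->", btd_param_count(I_dims, J_dims, R), "vs dense", dense_params)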

Method

Background 1: Tensor Product. Given $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_d}$ and $\mathcal{B} \in \mathbb{R}^{J_1 \times \cdots \times J_d}$ with $I_k = J_k$, and writing $i_k^-$ for the indices $(i_1, \ldots, i_{k-1})$ and $i_k^+$ for $(i_{k+1}, \ldots, i_d)$, the tensor product contracted along mode k is

$(\mathcal{A} \bullet_k \mathcal{B})_{i_k^-, i_k^+, j_k^-, j_k^+} = \sum_{p=1}^{I_k} \mathcal{A}_{i_k^-, p, i_k^+} \, \mathcal{B}_{j_k^-, p, j_k^+}$    (3)
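A minimal NumPy sketch of the tensor product in Eq. (3), assuming 0-indexed modes (not the authors' code): the operation is a contraction of the k-th mode of both tensors, which np.tensordot expresses directly, with the remaining modes of A followed by the remaining modes of B.

    import numpy as np

    def tensor_product(A, B, k):
        """Eq. (3): contract A and B along their (0-indexed) k-th mode.
        The result is indexed by (i_k^-, i_k^+, j_k^-, j_k^+)."""
        assert A.shape[k] == B.shape[k], "requires I_k = J_k"
        return np.tensordot(A, B, axes=(k, k))

    A = np.random.randn(2, 5, 3)               # I_1 x I_2 x I_3
    B = np.random.randn(4, 5, 6)               # J_1 x J_2 x J_3 with J_2 = I_2 = 5
    print(tensor_product(A, B, k=1).shape)     # (2, 3, 4, 6)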

Background 2: Block-Term Decomposition

Figure: $\mathcal{X} = \sum_{n=1}^{N} \mathcal{G}_n \bullet_1 \mathcal{A}^{(1)}_n \bullet_2 \mathcal{A}^{(2)}_n \bullet_3 \cdots \bullet_d \mathcal{A}^{(d)}_n$
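To make the decomposition concrete, here is a small NumPy sketch (3-order case; shapes and rank are assumed) that rebuilds X from N Tucker blocks, each a core G_n multiplied along every mode by its factor matrices.

    import numpy as np

    def btd_reconstruct(cores, factors):
        """X = sum_n G_n x_1 A_n^(1) x_2 A_n^(2) x_3 A_n^(3).
        cores[n]: (R, R, R); factors[n][k]: (X_k, R)."""
        X = 0
        for G, (A1, A2, A3) in zip(cores, factors):
            # the three mode products of one block written as a single einsum
            X = X + np.einsum('abc,ia,jb,kc->ijk', G, A1, A2, A3)
        return X

    N, R, dims = 2, 3, (4, 5, 6)
    cores = [np.random.randn(R, R, R) for _ in range(N)]
    factors = [[np.random.randn(dim, R) for dim in dims] for _ in range(N)]
    print(btd_reconstruct(cores, factors).shape)   # (4, 5, 6)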

BT-RNN Model: 1) tensorize W and x; 2) decompose W with BTD; 3) compute W · x; 4) train from scratch via BPTT!

Figure: Step 1: Tensorization operation in the case of 3-order tensors. (a) vector to tensor; (b) matrix to tensor.
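Step 1 in code, as a rough sketch with assumed dimensions and an assumed mode ordering: tensorization amounts to reshaping the flat input vector, and reshaping plus transposing the weight matrix into the interleaved J_1 × I_1 × ... × J_d × I_d layout described in the implementation note below.

    import numpy as np

    # Step 1 (sketch): tensorize a length-I vector and a J x I matrix
    # (d = 3, assumed factorizations I = 8*8*8, J = 4*4*4).
    I_dims, J_dims = (8, 8, 8), (4, 4, 4)
    x = np.random.randn(int(np.prod(I_dims)))                # vector of length I = 512
    X = x.reshape(I_dims)                                    # -> 3-order tensor (8, 8, 8)
    W = np.random.randn(int(np.prod(J_dims)), int(np.prod(I_dims)))
    W_tensor = W.reshape(J_dims + I_dims).transpose(0, 3, 1, 4, 2, 5)   # -> (J1, I1, J2, I2, J3, I3)
    print(X.shape, W_tensor.shape)                           # (8, 8, 8) (4, 8, 4, 8, 4, 8)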

Figure: Steps 2 & 3: Decomposing the weight matrix and computing y = W x.

Implementation: the weight matrix $W \in \mathbb{R}^{J \times I}$ is tensorized into $\mathcal{W} \in \mathbb{R}^{J_1 \times I_1 \times J_2 \times \cdots \times J_d \times I_d}$, where $I = I_1 I_2 \cdots I_d$ and $J = J_1 J_2 \cdots J_d$. $\mathcal{G}_n \in \mathbb{R}^{R_1 \times \cdots \times R_d}$ denotes the core tensor and $\mathcal{A}^{(d)}_n \in \mathbb{R}^{I_d \times J_d \times R_d}$ denotes the factor tensor.

$W \cdot x_t = \sum_{n=1}^{N} \mathcal{X}_t \bullet_1 \mathcal{A}^{(1)}_n \bullet_2 \cdots \bullet_d \mathcal{A}^{(d)}_n \bullet_{1,2,\ldots,d} \mathcal{G}_n$    (4)
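The following NumPy sketch implements Eq. (4) for d = 3 (shapes, ranks, and the einsum spelling are assumptions for illustration): the input vector is tensorized, contracted with the three factor tensors, and then with the core, so the dense W is never materialized.

    import numpy as np

    def bt_linear(x, factors, cores, I_dims, J_dims):
        """Eq. (4) for d = 3. factors[n][k]: (I_k, J_k, R); cores[n]: (R, R, R).
        Returns a vector of length J = prod(J_dims)."""
        X = x.reshape(I_dims)                            # Step 1: tensorize the input
        y = 0
        for (A1, A2, A3), G in zip(factors, cores):
            # contract X with each factor tensor, then with the core G_n
            y = y + np.einsum('abc,adp,beq,cfr,pqr->def', X, A1, A2, A3, G)
        return y.reshape(-1)                             # back to a flat vector of length J

    I_dims, J_dims, R, N = (8, 8, 8), (4, 4, 4), 2, 2    # assumed small example
    factors = [[np.random.randn(i, j, R) for i, j in zip(I_dims, J_dims)] for _ in range(N)]
    cores = [np.random.randn(R, R, R) for _ in range(N)]
    x = np.random.randn(int(np.prod(I_dims)))            # I = 512
    print(bt_linear(x, factors, cores, I_dims, J_dims).shape)   # (64,)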

Conclusion

We proposed the Block-Term RNN architecture to address the redundancy problem in RNNs. Experimental results on three challenging tasks show that the BT-RNN architecture not only uses several orders of magnitude fewer parameters but also improves model performance over the standard LSTM and the TT-LSTM.

References
[1] L. De Lathauwer. Decompositions of a Higher-Order Tensor in Block Terms, Part II: Definitions and Uniqueness. SIAM J. Matrix Anal. Appl. (SIMAX), 2008.
[2] Y. Yang, D. Krompass, and V. Tresp. Tensor-Train Recurrent Neural Networks for Video Classification. In ICML, 2017.
[3] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. DRAW: A Recurrent Neural Network for Image Generation. In ICML, 2015.
[4] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and Tell: A Neural Image Caption Generator. In CVPR, 2015.

Experiments

Tasks: 1. Action Recognition in Videos; 2. Image Generation; 3. Image Captioning; 4. Sensitivity Analysis on Hyper-Parameters.
Datasets: 1. UCF11 YouTube Action dataset; 2. MNIST; 3. MSCOCO.

Figure: Performance of different RNN models on the Action Recognition task trained on UCF11. [Training-loss curves vs. epoch (0 to 400); legend: LSTM, 58.9M params, top accuracy 0.697; BT-LSTM R=1, compression 81693x, top 0.784; BT-LSTM R=2, compression 40069x, top 0.803; BT-LSTM R=4, compression 17388x, top 0.853; TT-LSTM R=4, compression 17554x, top 0.781.]

Method                   Accuracy
Orthogonal approaches:
  Original               0.712
  Spatial-temporal       0.761
  Visual Attention       0.850
RNN approaches:
  LSTM                   0.697
  TT-LSTM                0.796
  BT-LSTM                0.853

Table: State-of-the-art results on the UCF11 dataset reported in the literature, in comparison with our best model.

Task 1: We use a single LSTM cell as the model architecture to evaluate BT-LSTM against LSTM and TT-LSTM. The video frames are fed directly to the LSTM cell. The figure on the left shows the training loss of the different models. From these experiments, we claim that BT-LSTM achieves: 1) a parameter reduction of about 8 × 10^4 times; 2) faster convergence; and 3) better model efficiency.

[Image-generation samples: (a) LSTM, #Params: 1.8M; (b) BT-LSTM, #Params: 1184.]

Task 2: In this experiment, we use an encoder-decoder architecture to generate images. We substitute only the encoder network to qualitatively compare LSTM and BT-LSTM. The results show that both LSTM and BT-LSTM generate comparable images.

(c) LSTM: A train traveling down tracks next to a forest. TT-LSTM: A train traveling down train tracks next to a forest. BT-LSTM: A train traveling through a lush green forest.
(d) LSTM: A group of people standing next to each other. TT-LSTM: A group of men standing next to each other. BT-LSTM: A group of people posing for a photo.
(e) LSTM: A man and a dog are standing in the snow. TT-LSTM: A man and a dog are in the snow. BT-LSTM: A man and a dog playing with a frisbee.
(f) LSTM: A large elephant standing next to a baby elephant. TT-LSTM: An elephant walking down a dirt road near trees. BT-LSTM: A large elephant walking down a road with cars.

Task 3: We use the architecture described in [4]; all three models generate proper sentences, with only a small improvement from BT-LSTM.

(g) Truth: W′, P=4096; (h) y = W · x, P=4096; (i) d=2, R=1, N=1, P=129; (j) d=2, R=4, N=1, P=528; (k) d=2, R=1, N=2, P=258; (l) d=4, R=4, N=1, P=384.

Figure: The trained W for different BT-LSTM settings. The closer to (g), the better W is.

Task 4: In this experiment, we again use a single LSTM cell as the model architecture to evaluate the sensitivity of BT-LSTM to the hyper-parameters d, R, and N (see the figure above), in comparison with LSTM and TT-LSTM.