
Learning Compact Recurrent Neural Networks with Block-Term Tensor Decomposition

Jinmian Ye1, Linnan Wang2, Guangxi Li1, Di Chen1, Shandian Zhe3, Xinqi Chu4, Zenglin Xu1

1 SMILE Lab, University of Electronic Science and Technology of China; 2 Brown University; 3 University of Utah; 4 Xjera Labs pte.ltd

Introduction

RNNs are powerful sequence modeling tools. We introduce a compact architecture based on Block-Term tensor decomposition (BTD) to address the redundancy problem.
Challenges: 1) traditional RNNs suffer from an excess of parameters; and 2) the Tensor-Train RNN has limited representation ability and flexibility because searching for the optimal setting of its ranks is difficult.
Contributions: 1) a new sparsely connected RNN architecture based on Block-Term tensor decomposition; and 2) better performance with far fewer parameters.

Figure: Architecture of BT-LSTM. The redundant dense connections between input and hidden state are replaced by a low-rank BT representation.

It should be noted that we substitute only the input-to-hidden matrix multiplication while retaining the overall design philosophy of the LSTM:

$(f_t', \, i_t', \, \tilde{c}_t', \, o_t') = W \cdot x_t + U \cdot h_{t-1} + b$    (1)

$(f_t, \, i_t, \, \tilde{c}_t, \, o_t) = \big(\sigma(f_t'), \, \sigma(i_t'), \, \tanh(\tilde{c}_t'), \, \sigma(o_t')\big)$    (2)
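For reference, here is a minimal NumPy sketch of one LSTM step following Eqs. (1)-(2); it is not the authors' implementation, and the shapes are assumptions. In BT-LSTM only the W @ x_t term would be replaced by the block-term contraction of Eq. (4) below.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, U, b):
        """One vanilla LSTM step. W: (4H, I), U: (4H, H), b: (4H,) hold the four gates stacked."""
        z = W @ x_t + U @ h_prev + b                                        # Eq. (1)
        f_, i_, c_, o_ = np.split(z, 4)                                     # gate pre-activations
        f, i, g, o = sigmoid(f_), sigmoid(i_), np.tanh(c_), sigmoid(o_)     # Eq. (2)
        c_t = f * c_prev + i * g                                            # cell-state update
        h_t = o * np.tanh(c_t)                                              # new hidden state
        return h_t, c_t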

Analysis

Comparison of complexity and memory usage of the vanilla RNN, Tensor-Train RNN (TT-RNN), and our BT-RNN. In this table, the weight matrix's shape is I × J. Here, J_max = max_k(J_k), k ∈ [1, d].

Method             Time                  Memory
RNN forward        O(IJ)                 O(IJ)
RNN backward       O(IJ)                 O(IJ)
TT-RNN forward     O(dIR^2 J_max)        O(RI)
TT-RNN backward    O(d^2 IR^4 J_max)     O(R^3 I)
BT-RNN forward     O(NdIR^d J_max)       O(R^d I)
BT-RNN backward    O(Nd^2 IR^d J_max)    O(R^d I)

Total #Parameters: $P_{\mathrm{BTD}} = N \left( \sum_{k=1}^{d} I_k J_k R + R^d \right)$

Figure: The number of parameters w.r.t. core-order d and Tucker-rank R, in the setting I = 4096, J = 256, N = 1; the vanilla RNN layer contains I × J = 1,048,576 parameters. When d is small, the first term $\sum_{k=1}^{d} I_k J_k R$ dominates the parameter count; when d is large, the second term $R^d$ does. Hence the number of parameters drops sharply at first but rises gradually as d grows (except when R = 1).

[Plot: number of parameters (log scale, 10^0 to 10^8) versus core-order d (1 to 12), one curve for each of R = 1, 2, 3, 4.]
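As a worked check of the formula above, the following Python sketch evaluates P_BTD for an assumed factorization of I = 4096 and J = 256 (the particular split into I_k and J_k is an illustration, not taken from the poster) and compares it with the dense count I × J.

    def btd_param_count(I_dims, J_dims, R, N=1):
        """P_BTD = N * (sum_k I_k * J_k * R + R^d), assuming equal Tucker ranks R."""
        d = len(I_dims)
        factor_params = sum(I_k * J_k * R for I_k, J_k in zip(I_dims, J_dims))
        core_params = R ** d
        return N * (factor_params + core_params)

    I_dims, J_dims = [8, 8, 8, 8], [4, 4, 4, 4]   # assumed: 8^4 = 4096, 4^4 = 256, so d = 4
    dense_params = 4096 * 256                     # 1,048,576 parameters in the vanilla layer
    for R in (1, 2, 3, 4):
        print("R =", R, "->", btd_param_count(I_dims, J_dims, R), "vs dense", dense_params)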

Method

Background 1: Tensor Product. Given $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_d}$ and $\mathcal{B} \in \mathbb{R}^{J_1 \times \cdots \times J_d}$ with $I_k = J_k$, and writing $i_k^-$ for the indices $(i_1, \ldots, i_{k-1})$ and $i_k^+$ for $(i_{k+1}, \ldots, i_d)$, the tensor product contracted along mode k is

$(\mathcal{A} \bullet_k \mathcal{B})_{i_k^-, i_k^+, j_k^-, j_k^+} = \sum_{p=1}^{I_k} \mathcal{A}_{i_k^-, p, i_k^+} \, \mathcal{B}_{j_k^-, p, j_k^+}$    (3)
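A minimal NumPy sketch of the tensor product in Eq. (3), assuming 0-indexed modes (not the authors' code): the operation is a contraction of the k-th mode of both tensors, which np.tensordot expresses directly, with the remaining modes of A followed by the remaining modes of B.

    import numpy as np

    def tensor_product(A, B, k):
        """Eq. (3): contract A and B along their (0-indexed) k-th mode.
        The result is indexed by (i_k^-, i_k^+, j_k^-, j_k^+)."""
        assert A.shape[k] == B.shape[k], "requires I_k = J_k"
        return np.tensordot(A, B, axes=(k, k))

    A = np.random.randn(2, 5, 3)               # I_1 x I_2 x I_3
    B = np.random.randn(4, 5, 6)               # J_1 x J_2 x J_3 with J_2 = I_2 = 5
    print(tensor_product(A, B, k=1).shape)     # (2, 3, 4, 6)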

Background 2: Block-Term Decomposition

Figure: $\mathcal{X} = \sum_{n=1}^{N} \mathcal{G}_n \bullet_1 \mathcal{A}^{(1)}_n \bullet_2 \mathcal{A}^{(2)}_n \bullet_3 \cdots \bullet_d \mathcal{A}^{(d)}_n$
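To make the decomposition concrete, here is a small NumPy sketch (3-order case; shapes and rank are assumed) that rebuilds X from N Tucker blocks, each a core G_n multiplied along every mode by its factor matrices.

    import numpy as np

    def btd_reconstruct(cores, factors):
        """X = sum_n G_n x_1 A_n^(1) x_2 A_n^(2) x_3 A_n^(3).
        cores[n]: (R, R, R); factors[n][k]: (X_k, R)."""
        X = 0
        for G, (A1, A2, A3) in zip(cores, factors):
            # the three mode products of one block written as a single einsum
            X = X + np.einsum('abc,ia,jb,kc->ijk', G, A1, A2, A3)
        return X

    N, R, dims = 2, 3, (4, 5, 6)
    cores = [np.random.randn(R, R, R) for _ in range(N)]
    factors = [[np.random.randn(dim, R) for dim in dims] for _ in range(N)]
    print(btd_reconstruct(cores, factors).shape)   # (4, 5, 6)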

BT-RNN Model: 1) tensorize W and x; 2) decompose W with BTD; 3) compute W · x; 4) train from scratch via BPTT!

Figure: Step 1: Tensorization operation in the case of 3-order tensors. (a) vector to tensor; (b) matrix to tensor.
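Step 1 in code, as a rough sketch with assumed dimensions and an assumed mode ordering: tensorization amounts to reshaping the flat input vector, and reshaping plus transposing the weight matrix into the interleaved J_1 × I_1 × ... × J_d × I_d layout described in the implementation note below.

    import numpy as np

    # Step 1 (sketch): tensorize a length-I vector and a J x I matrix
    # (d = 3, assumed factorizations I = 8*8*8, J = 4*4*4).
    I_dims, J_dims = (8, 8, 8), (4, 4, 4)
    x = np.random.randn(int(np.prod(I_dims)))                # vector of length I = 512
    X = x.reshape(I_dims)                                    # -> 3-order tensor (8, 8, 8)
    W = np.random.randn(int(np.prod(J_dims)), int(np.prod(I_dims)))
    W_tensor = W.reshape(J_dims + I_dims).transpose(0, 3, 1, 4, 2, 5)   # -> (J1, I1, J2, I2, J3, I3)
    print(X.shape, W_tensor.shape)                           # (8, 8, 8) (4, 8, 4, 8, 4, 8)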

Figure: Steps 2 & 3: Decomposing the weight matrix and computing y = W x.

Implementation: the weight matrix $W \in \mathbb{R}^{J \times I}$ is tensorized into $\mathcal{W} \in \mathbb{R}^{J_1 \times I_1 \times J_2 \times \cdots \times J_d \times I_d}$, where $I = I_1 I_2 \cdots I_d$ and $J = J_1 J_2 \cdots J_d$. $\mathcal{G}_n \in \mathbb{R}^{R_1 \times \cdots \times R_d}$ denotes the core tensor and $\mathcal{A}^{(d)}_n \in \mathbb{R}^{I_d \times J_d \times R_d}$ denotes the factor tensor.

$W \cdot x_t = \sum_{n=1}^{N} \mathcal{X}_t \bullet_1 \mathcal{A}^{(1)}_n \bullet_2 \cdots \bullet_d \mathcal{A}^{(d)}_n \bullet_{1,2,\ldots,d} \mathcal{G}_n$    (4)
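The following NumPy sketch implements Eq. (4) for d = 3 (shapes, ranks, and the einsum spelling are assumptions for illustration): the input vector is tensorized, contracted with the three factor tensors, and then with the core, so the dense W is never materialized.

    import numpy as np

    def bt_linear(x, factors, cores, I_dims, J_dims):
        """Eq. (4) for d = 3. factors[n][k]: (I_k, J_k, R); cores[n]: (R, R, R).
        Returns a vector of length J = prod(J_dims)."""
        X = x.reshape(I_dims)                            # Step 1: tensorize the input
        y = 0
        for (A1, A2, A3), G in zip(factors, cores):
            # contract X with each factor tensor, then with the core G_n
            y = y + np.einsum('abc,adp,beq,cfr,pqr->def', X, A1, A2, A3, G)
        return y.reshape(-1)                             # back to a flat vector of length J

    I_dims, J_dims, R, N = (8, 8, 8), (4, 4, 4), 2, 2    # assumed small example
    factors = [[np.random.randn(i, j, R) for i, j in zip(I_dims, J_dims)] for _ in range(N)]
    cores = [np.random.randn(R, R, R) for _ in range(N)]
    x = np.random.randn(int(np.prod(I_dims)))            # I = 512
    print(bt_linear(x, factors, cores, I_dims, J_dims).shape)   # (64,)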

Conclusion

We proposed the Block-Term RNN architecture to address the redundancy problem in RNNs. Experimental results on three challenging tasks show that the BT-RNN architecture not only uses several orders of magnitude fewer parameters but also improves model performance over the standard LSTM and the TT-LSTM.

References
[1] L. De Lathauwer. Decompositions of a Higher-Order Tensor in Block Terms, Part II: Definitions and Uniqueness. SIAM J. Matrix Anal. Appl. (SIMAX), 2008.
[2] Y. Yang, D. Krompass, and V. Tresp. Tensor-Train Recurrent Neural Networks for Video Classification. In ICML, 2017.
[3] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. DRAW: A Recurrent Neural Network for Image Generation. In ICML, 2015.
[4] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and Tell: A Neural Image Caption Generator. In CVPR, 2015.

Experiments

Tasks: 1. Action Recognition in Videos; 2. Image Generation; 3. Image Captioning; 4. Sensitivity Analysis on Hyper-Parameters.
Datasets: 1. UCF11 YouTube Action dataset; 2. MNIST; 3. MSCOCO.

Figure: Performance of different RNN models on the Action Recognition task trained on UCF11. [Training-loss curves vs. epoch (0 to 400); legend: LSTM, 58.9M params, top accuracy 0.697; BT-LSTM R=1, compression 81693x, top 0.784; BT-LSTM R=2, compression 40069x, top 0.803; BT-LSTM R=4, compression 17388x, top 0.853; TT-LSTM R=4, compression 17554x, top 0.781.]

Method                   Accuracy
Orthogonal approaches:
  Original               0.712
  Spatial-temporal       0.761
  Visual Attention       0.850
RNN approaches:
  LSTM                   0.697
  TT-LSTM                0.796
  BT-LSTM                0.853

Table: State-of-the-art results on the UCF11 dataset reported in the literature, in comparison with our best model.

Task 1: We use a single LSTM cell as the model architecture to evaluate BT-LSTM against LSTM and TT-LSTM. The video frames are fed directly to the LSTM cell. The figure on the left shows the training loss of the different models. From these experiments, we claim that BT-LSTM achieves: 1) a parameter reduction of about 8 × 10^4 times; 2) faster convergence; and 3) better model efficiency.

[Image-generation samples: (a) LSTM, #Params: 1.8M; (b) BT-LSTM, #Params: 1184.]

Task 2: In this experiment, we use an encoder-decoder architecture to generate images. We substitute only the encoder network to qualitatively compare LSTM and BT-LSTM. The results show that both LSTM and BT-LSTM generate comparable images.

(c) LSTM: A train traveling down tracks next to a forest. TT-LSTM: A train traveling down train tracks next to a forest. BT-LSTM: A train traveling through a lush green forest.
(d) LSTM: A group of people standing next to each other. TT-LSTM: A group of men standing next to each other. BT-LSTM: A group of people posing for a photo.
(e) LSTM: A man and a dog are standing in the snow. TT-LSTM: A man and a dog are in the snow. BT-LSTM: A man and a dog playing with a frisbee.
(f) LSTM: A large elephant standing next to a baby elephant. TT-LSTM: An elephant walking down a dirt road near trees. BT-LSTM: A large elephant walking down a road with cars.

Task 3: We use the architecture described in [4]; all three models generate proper sentences, with only a small improvement from BT-LSTM.

(g) Truth: W′, P=4096; (h) y = W · x, P=4096; (i) d=2, R=1, N=1, P=129; (j) d=2, R=4, N=1, P=528; (k) d=2, R=1, N=2, P=258; (l) d=4, R=4, N=1, P=384.

Figure: The trained W for different BT-LSTM settings. The closer to (g), the better W is.

Task 4: In this experiment, we again use a single LSTM cell as the model architecture to evaluate the sensitivity of BT-LSTM to the hyper-parameters d, R, and N (see the figure above), in comparison with LSTM and TT-LSTM.