
Spectral Pruning: Compressing Deep Neural Networks via Spectral Analysis and its Generalization Error

Taiji Suzuki 1,2,*, Hiroshi Abe 3,†, Tomoya Murata 4, Shingo Horiuchi 5, Kotaro Ito 4, Tokuma Wachi 4, So Hirai 5, Masatoshi Yukishima 4 and Tomoaki Nishimura 5,‡

1 The University of Tokyo, Japan, 2 Center for Advanced Intelligence Project, RIKEN, Japan,
3 iPride Co., Ltd., Japan, 4 NTT DATA Mathematical Systems Inc., Japan,
5 NTT Data Corporation, Japan
* [email protected], † [email protected], ‡ [email protected]

Abstract

Compression techniques for deep neural network models are becoming very important for the efficient execution of high-performance deep learning systems on edge-computing devices. The concept of model compression is also important for analyzing the generalization error of deep learning, known as the compression-based error bound. However, there is still a huge gap between practically effective compression methods and their rigorous background in statistical learning theory. To resolve this issue, we develop a new theoretical framework for model compression and propose a new pruning method called spectral pruning based on this framework. We define the "degrees of freedom" to quantify the intrinsic dimensionality of a model by using the eigenvalue distribution of the covariance matrix across the internal nodes and show that the compression ability is essentially controlled by this quantity. Moreover, we present a sharp generalization error bound of the compressed model and characterize the bias–variance tradeoff induced by the compression procedure. We apply our method to several datasets to justify our theoretical analyses and show the superiority of the proposed method.

1 Introduction

Currently, deep learning is the most promising approach adopted by various machine learning applications such as computer vision, natural language processing, and audio processing. Along with the rapid development of deep learning techniques, network structures are becoming considerably complicated. In addition to the model structure, the model size is also becoming larger, which prevents the implementation of deep neural network models in edge-computing devices for applications such as smartphone services, autonomous vehicle driving, and drone control. To overcome this problem, model compression techniques such as pruning, factorization [Denil et al., 2013; Denton et al., 2014], and quantization [Han et al., 2015] have been extensively studied in the literature.

Among these techniques, pruning is a typical approach that discards redundant nodes, e.g., by explicit regularization such as $\ell_1$ and $\ell_2$ penalization during training [Lebedev and Lempitsky, 2016; Wen et al., 2016; He et al., 2017]. It has been implemented as ThiNet [Luo et al., 2017], Net-Trim [Aghasi et al., 2017], NISP [Yu et al., 2018], and so on [Denil et al., 2013]. A similar effect can be realized by implicit randomized regularization such as DropConnect [Wan et al., 2013], which randomly removes connections during the training phase. However, only a few of these techniques (e.g., Net-Trim [Aghasi et al., 2017]) are supported by statistical learning theory. In particular, it is unclear which type of quantity controls the compression ability. On the theoretical side, compression-based generalization analysis is a promising approach for measuring the redundancy of a network [Arora et al., 2018; Zhou et al., 2019]. However, despite their theoretical novelty, the connection of these generalization error analyses to practically useful compression methods is not obvious.

In this paper, we develop a new compression-based generalization error bound and propose a new simple pruning method that is compatible with the generalization error analysis. Our method aims to minimize the information loss induced by compression; in particular, it minimizes the redundancy among nodes instead of merely looking at the amount of information of each individual node. It can be executed by simply observing the covariance matrix in the internal layers and is easy to implement. The proposed method is supported by a comprehensive theoretical analysis. Notably, the approximation error induced by compression is characterized by the notion of the statistical degrees of freedom [Mallows, 1973; Caponnetto and de Vito, 2007]. It represents the intrinsic dimensionality of a model and is determined by the eigenvalues of the covariance matrix between the nodes in each layer. Usually, we observe that the eigenvalues decrease rapidly (Fig. 1a) for several reasons such as explicit regularization (Dropout [Wager et al., 2013], weight decay [Krogh and Hertz, 1992]) and implicit regularization [Hardt et al., 2016; Gunasekar et al., 2018], which means that the amount of important information processed in each layer is not large.


In particular, the rapid decay of the eigenvalues leads to a low number of degrees of freedom. Then, we can effectively compress a trained network into a smaller one that has fewer parameters than the original. Behind the theory, there is essentially a connection to the random feature technique for kernel methods [Bach, 2017]. The compression error analysis is directly connected to the generalization error analysis. The derived bound is actually much tighter than the naive VC-theory bound on the uncompressed network [Bartlett et al., 2019] and even tighter than recent compression-based bounds [Arora et al., 2018]. Further, there is a tradeoff between the bias and the variance, where the bias is induced by the network compression and the variance is induced by the variation in the training data. In addition, we show the superiority of our method and experimentally verify our theory with extensive numerical experiments. Our contributions are summarized as follows:

• We give a theoretical compression bound that is compatible with a practically useful pruning method, and propose a new simple pruning method called spectral pruning for compressing deep neural networks.

• We characterize the model compression ability by utilizing the notion of the degrees of freedom, which represents the intrinsic dimensionality of the model. We also give a generalization error bound when a trained network is compressed by our method and show that a bias–variance tradeoff induced by model compression appears. The obtained bound is fairly tight compared with existing compression-based bounds and much tighter than the naive VC-dimension bound.

2 Model Compression Problem and its Algorithm

Suppose that the training data $D_{tr} = \{(x_i, y_i)\}_{i=1}^n$ are observed, where $x_i \in \mathbb{R}^{d_x}$ is an input and $y_i$ is an output that could be a real number ($y_i \in \mathbb{R}$), a binary label ($y_i \in \{\pm 1\}$), and so on. The training data are independently and identically distributed. To train the appropriate relationship between $x$ and $y$, we construct a deep neural network model as

$$ f(x) = (W^{(L)}\eta(\cdot) + b^{(L)}) \circ \cdots \circ (W^{(1)}x + b^{(1)}), $$

where $W^{(\ell)} \in \mathbb{R}^{m_{\ell+1} \times m_\ell}$, $b^{(\ell)} \in \mathbb{R}^{m_{\ell+1}}$ ($\ell = 1, \dots, L$), and $\eta : \mathbb{R} \to \mathbb{R}$ is an activation function (here, the activation function is applied in an element-wise manner; for a vector $x \in \mathbb{R}^d$, $\eta(x) = (\eta(x_1), \dots, \eta(x_d))^\top$). Here, $m_\ell$ is the width of the $\ell$-th layer such that $m_{L+1} = 1$ (output) and $m_1 = d_x$ (input). Let $f$ be a trained network obtained from the training data $D_{tr} = \{(x_i, y_i)\}_{i=1}^n$, where its parameters are denoted by $(W^{(\ell)}, b^{(\ell)})_{\ell=1}^L$, i.e., $f(x) = (W^{(L)}\eta(\cdot) + b^{(L)}) \circ \cdots \circ (W^{(1)}x + b^{(1)})$. The input to the $\ell$-th layer (after activation) is denoted by $\phi^{(\ell)}(x) = \eta \circ (W^{(\ell-1)}\eta(\cdot) + b^{(\ell-1)}) \circ \cdots \circ (W^{(1)}x + b^{(1)})$. We do not specify how to train the network $f$, and the following argument can be applied to any learning method such as the empirical risk minimizer, the Bayes estimator, or another estimator.
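To make this setup concrete, the following is a minimal NumPy sketch (ours, not the authors' code) of collecting the internal activations $\phi^{(\ell)}(x_i)$ and forming the noncentered empirical covariance $\Sigma^{(\ell)}$ that spectral pruning operates on; the random weights are hypothetical stand-ins for a trained network.

```python
# Minimal sketch: internal activations phi^{(l)} and the noncentered
# covariance Sigma^{(l)} of a small fully connected ReLU network.
# The weights are random stand-ins for a trained network.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

widths = [20, 50, 30, 1]  # m_1 = d_x, m_2, m_3, m_{L+1} = 1
W = [rng.standard_normal((widths[l + 1], widths[l])) / np.sqrt(widths[l])
     for l in range(len(widths) - 1)]
b = [0.1 * rng.standard_normal(widths[l + 1]) for l in range(len(widths) - 1)]

def activations(x):
    """Return [phi^{(2)}(x), ..., phi^{(L)}(x)] (inputs to layers 2..L)."""
    phis, h = [], x
    for l in range(len(W) - 1):  # the last affine map has no activation here
        h = relu(W[l] @ h + b[l])
        phis.append(h)
    return phis

X = rng.standard_normal((100, widths[0]))  # n = 100 training inputs
layer_acts = [np.stack(p) for p in zip(*(activations(x) for x in X))]
# Sigma^{(l)} = (1/n) * sum_i phi^{(l)}(x_i) phi^{(l)}(x_i)^T
Sigma = [P.T @ P / P.shape[0] for P in layer_acts]
print([S.shape for S in Sigma])  # [(50, 50), (30, 30)]
```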

We want to compress the trained network $f$ to another smaller one $f^\sharp$ having widths $(m^\sharp_\ell)_{\ell=1}^L$ while keeping the test accuracy as high as possible. To compress the trained network $f$ to a smaller one $f^\sharp$, we propose a simple strategy called spectral pruning. The main idea of the method is to find the most informative subset of the nodes. The amount of information of the subset is measured by how well the selected nodes can explain the other nodes in the layer and recover the output to the next layer. For example, if some nodes are heavily correlated with each other, then only one of them will be selected by our method. The information redundancy can be computed from a covariance matrix between nodes and a simple regression problem. We do not need to solve a specific nonlinear optimization problem, unlike the methods in [Lebedev and Lempitsky, 2016; Wen et al., 2016; Aghasi et al., 2017].

2.1 Algorithm Description

Our method simultaneously minimizes the input information loss and the output information loss, which will be defined as follows.

(i) Input information loss. First, we explain the input information loss. Denote $\phi(x) = \phi^{(\ell)}(x)$ for simplicity, and let $\phi_J(x) = (\phi_j(x))_{j \in J} \in \mathbb{R}^{m^\sharp_\ell}$ be a subvector of $\phi(x)$ corresponding to an index set $J \in [m_\ell]^{m^\sharp_\ell}$, where $[m] := \{1, \dots, m\}$ (here, duplication of the index is allowed). The basic strategy is to solve the following optimization problem so that we can recover $\phi(x)$ from $\phi_J(x)$ as accurately as possible:

$$ A_J := \big(A^{(\ell)}_J =\big)\ \mathrm{argmin}_{A \in \mathbb{R}^{m_\ell \times |J|}}\ E[\|\phi - A\phi_J\|^2] + \|A\|_\tau^2, \qquad (1) $$

where $E[\cdot]$ is the expectation with respect to the empirical distribution ($E[f] = \frac{1}{n}\sum_{i=1}^n f(x_i)$) and $\|A\|_\tau^2 = \mathrm{Tr}[A I_\tau A^\top]$ for a regularization parameter $\tau \in \mathbb{R}^{|J|}_+$ and $I_\tau := \mathrm{diag}(\tau)$ (how to set the regularization parameter will be given in Theorem 1). The optimal solution $A_J$ can be explicitly expressed by utilizing the (noncentered) covariance matrix in the $\ell$-th layer of the trained network $f$, which is defined as $\Sigma := \Sigma^{(\ell)} = \frac{1}{n}\sum_{i=1}^n \phi(x_i)\phi(x_i)^\top$, defined on the empirical distribution (here, we omit the layer index $\ell$ for notational simplicity). Let $\Sigma_{I,I'} \in \mathbb{R}^{K \times H}$ for $K, H \in \mathbb{N}$ be the submatrix of $\Sigma$ for the index sets $I \in [m_\ell]^K$ and $I' \in [m_\ell]^H$ such that $\Sigma_{I,I'} = (\Sigma_{i,j})_{i \in I, j \in I'}$. Let $F = \{1, \dots, m_\ell\}$ be the full index set. Then, we can easily see that

$$ A_J = \Sigma_{F,J}(\Sigma_{J,J} + I_\tau)^{-1}. $$

Hence, the full vector $\phi(x)$ can be decoded from $\phi_J(x)$ as $\phi(x) \approx A_J\phi_J(x) = \Sigma_{F,J}(\Sigma_{J,J} + I_\tau)^{-1}\phi_J(x)$. To measure the approximation error, we define $L^{(A)}_\tau(J) = \min_{A \in \mathbb{R}^{m_\ell \times |J|}} E[\|\phi - A\phi_J\|^2] + \|A\|_\tau^2$. By substituting the explicit formula $A_J$ into the objective, this is reformulated as

$$ L^{(A)}_\tau(J) = \mathrm{Tr}\big[\Sigma_{F,F} - \Sigma_{F,J}(\Sigma_{J,J} + I_\tau)^{-1}\Sigma_{J,F}\big]. $$
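As an illustration, the following minimal sketch (ours, not the authors' implementation) evaluates the closed-form solution $A_J$ and the input information loss $L^{(A)}_\tau(J)$ for a given index set $J$; a random positive semidefinite matrix stands in for $\Sigma^{(\ell)}$ and a constant vector stands in for the regularization parameter $\tau$, and the helper names are ours.

```python
# Minimal sketch of the closed forms above:
#   A_J = Sigma_{F,J} (Sigma_{J,J} + I_tau)^{-1}
#   L_A(J) = Tr[Sigma_{F,F} - Sigma_{F,J} (Sigma_{J,J} + I_tau)^{-1} Sigma_{J,F}]
# Sigma and tau below are stand-ins; in practice Sigma is the layer covariance.
import numpy as np

rng = np.random.default_rng(1)
m = 30
B = rng.standard_normal((m, m))
Sigma = B @ B.T / m                     # stand-in for Sigma^{(l)}
tau = np.full(m, 1e-3)                  # stand-in regularization (one entry per node)

def A_hat(Sigma, J, tau):
    """Optimal reconstruction matrix A_J of Eq. (1)."""
    S_FJ = Sigma[:, J]
    S_JJ = Sigma[np.ix_(J, J)]
    return S_FJ @ np.linalg.inv(S_JJ + np.diag(tau[J]))

def L_A(Sigma, J, tau):
    """Input information loss L^{(A)}_tau(J)."""
    S_FJ = Sigma[:, J]
    return np.trace(Sigma - A_hat(Sigma, J, tau) @ S_FJ.T)

J = [0, 3, 7, 12]                       # a candidate index set
print(A_hat(Sigma, J, tau).shape, L_A(Sigma, J, tau))
```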

(ii) Output information loss. Next, we explain the output information loss. Suppose that we aim to directly approximate the outputs $Z^{(\ell)}\phi$ for a weight matrix $Z^{(\ell)} \in \mathbb{R}^{m \times m_\ell}$ with an output size $m \in \mathbb{N}$. A typical situation is that $Z^{(\ell)} = W^{(\ell)}$, so that we approximate the output $W^{(\ell)}\phi$ (the concrete setting of $Z^{(\ell)}$ will be specified in Theorem 1). Then, we consider the objective

$$ L^{(B)}_\tau(J) := \sum_{j=1}^{m} \min_{\alpha \in \mathbb{R}^{m^\sharp_\ell}} \Big\{ E\big[(Z^{(\ell)}_{j,:}\phi - \alpha^\top \phi_J)^2\big] + \|\alpha^\top\|_\tau^2 \Big\} = \mathrm{Tr}\big\{ Z^{(\ell)}\big[\Sigma_{F,F} - \Sigma_{F,J}(\Sigma_{J,J} + I_\tau)^{-1}\Sigma_{J,F}\big] Z^{(\ell)\top} \big\}, $$

where $Z^{(\ell)}_{j,:}$ means the $j$-th row of the matrix $Z^{(\ell)}$. It can be easily checked that the optimal solution $\alpha_J$ of the minimum in the definition of $L^{(B)}_\tau$ is given as $\alpha_J = A_J^\top Z^{(\ell)\top}_{j,:}$ for each $j = 1, \dots, m$.

(iii) Combination of the input and output information losses. Finally, we combine the input and output information losses and aim to minimize this combination. To do so, we propose to use the convex combination of both criteria for a parameter $0 \le \theta \le 1$ and optimize it with respect to $J$ under a cardinality constraint $|J| = m^\sharp_\ell$ for a prespecified width $m^\sharp_\ell$ of the compressed network:

$$ \min_J \; L^{(\theta)}_\tau(J) = \theta L^{(A)}_\tau(J) + (1-\theta)L^{(B)}_\tau(J) \quad \text{s.t. } J \in [m_\ell]^{m^\sharp_\ell}. \qquad (2) $$

We call this method spectral pruning. It has a hyperparameter $\theta$ and a regularization parameter $\tau$; however, we see in the experiments (Sec. 5) that it is robust to the choice of these hyperparameters. Let $J^\sharp_\ell$ be the optimal $J$ that minimizes the objective. This optimization problem is NP-hard, but an approximate solution is obtained by a greedy algorithm, since it reduces to the maximization of a monotone submodular function [Krause and Golovin, 2014]. That is, we start from $J = \emptyset$, sequentially choose an element $j^* \in [m_\ell]$ that maximally reduces the objective $L^{(\theta)}_\tau$, and add this element $j^*$ to $J$ ($J \leftarrow J \cup \{j^*\}$) until $|J| = m^\sharp_\ell$ is satisfied, as sketched below.
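The sketch below (ours, not the authors' implementation) puts the pieces together: it evaluates the combined objective $L^{(\theta)}_\tau(J)$ in closed form and runs the greedy selection just described on random stand-ins for $\Sigma^{(\ell)}$ and $Z^{(\ell)}$; the helper names (`residual`, `L_theta`, `spectral_pruning_greedy`) are ours, and the matrix inverse is naively recomputed at every step for clarity only.

```python
# Minimal sketch of Eq. (2) and the greedy selection: evaluate the combined
# loss L_theta(J) from the closed forms above and grow J one index at a time.
# Sigma and Z below are random stand-ins for Sigma^{(l)} and Z^{(l)}.
import numpy as np

rng = np.random.default_rng(2)
m, m_out = 30, 10
B = rng.standard_normal((m, m))
Sigma = B @ B.T / m                       # stand-in for Sigma^{(l)}
Z = rng.standard_normal((m_out, m))       # stand-in for Z^{(l)} (e.g., W^{(l)})
tau = np.full(m, 1e-3)                    # stand-in regularization vector

def residual(Sigma, J, tau):
    """Sigma_{F,F} - Sigma_{F,J} (Sigma_{J,J} + I_tau)^{-1} Sigma_{J,F}."""
    S_FJ = Sigma[:, J]
    S_JJ = Sigma[np.ix_(J, J)]
    return Sigma - S_FJ @ np.linalg.inv(S_JJ + np.diag(tau[J])) @ S_FJ.T

def L_theta(Sigma, Z, J, tau, theta):
    R = residual(Sigma, J, tau)
    L_A = np.trace(R)                     # input information loss
    L_B = np.trace(Z @ R @ Z.T)           # output information loss
    return theta * L_A + (1.0 - theta) * L_B

def spectral_pruning_greedy(Sigma, Z, tau, m_sharp, theta=0.5):
    """Greedy approximation of Eq. (2): add the index that most reduces L_theta."""
    J = []
    for _ in range(m_sharp):
        candidates = [j for j in range(Sigma.shape[0]) if j not in J]
        j_star = min(candidates, key=lambda j: L_theta(Sigma, Z, J + [j], tau, theta))
        J.append(j_star)                  # J <- J ∪ {j*}
    return J

J_sharp = spectral_pruning_greedy(Sigma, Z, tau, m_sharp=8, theta=0.5)
print(sorted(J_sharp), L_theta(Sigma, Z, J_sharp, tau, 0.5))
```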

After we have chosen an index set $J^\sharp_\ell$ ($\ell = 2, \dots, L$) for each layer, we construct the compressed network $f^\sharp$ as $f^\sharp(x) = (W^{\sharp(L)}\eta(\cdot) + b^{\sharp(L)}) \circ \cdots \circ (W^{\sharp(1)}x + b^{\sharp(1)})$, where $W^{\sharp(\ell)} = W^{(\ell)}_{J^\sharp_{\ell+1}, F}\, A^{(\ell)}_{J^\sharp_\ell}$ and $b^{\sharp(\ell)} = b^{(\ell)}_{J^\sharp_{\ell+1}}$.
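For completeness, a small sketch (ours) of how the pruned layer is assembled from the selected indices, with random stand-ins for the trained parameters and with $A_J$ computed as in Eq. (1):

```python
# Minimal sketch: W_sharp = W^{(l)}_{J_{l+1},F} A^{(l)}_{J_l}, b_sharp = b^{(l)}_{J_{l+1}}
import numpy as np

rng = np.random.default_rng(3)
m_l, m_next = 30, 20
W_l = rng.standard_normal((m_next, m_l))   # stand-in for the trained W^{(l)}
b_l = rng.standard_normal(m_next)          # stand-in for the trained b^{(l)}
B = rng.standard_normal((m_l, m_l))
Sigma_l = B @ B.T / m_l                    # stand-in for Sigma^{(l)}
tau = np.full(m_l, 1e-3)

J_l = [0, 3, 7, 12, 15, 21]                # indices selected in layer l
J_next = [1, 2, 5, 9, 13, 16, 18]          # indices selected in layer l+1

A_J = Sigma_l[:, J_l] @ np.linalg.inv(Sigma_l[np.ix_(J_l, J_l)] + np.diag(tau[J_l]))
W_sharp = W_l[J_next, :] @ A_J             # shape (|J_{l+1}|, |J_l|)
b_sharp = b_l[J_next]
print(W_sharp.shape, b_sharp.shape)        # (7, 6) (7,)
```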

An application to a CNN is given in the appendix of the long version of this paper [Suzuki et al., 2020a]. The method can be executed in a layer-wise manner; thus, it can be applied to networks with complicated structures such as ResNet.

3 Compression Accuracy Analysis and Generalization Error Bound

In this section, we give a theoretical guarantee of our method. First, we give the approximation error induced by our pruning procedure in Theorem 1. Next, we evaluate the generalization error of the compressed network in Theorem 2. More specifically, we introduce a quantity called the degrees of freedom [Mallows, 1973; Caponnetto and de Vito, 2007; Suzuki, 2018; Suzuki et al., 2020b] that represents the intrinsic dimensionality of the model and determines the approximation accuracy.

For the theoretical analysis, we define a neural network model with norm constraints on the parameters $W^{(\ell)}$ and $b^{(\ell)}$ ($\ell = 1, \dots, L$). Let $R > 0$ and $R_b > 0$ be the upper bounds of the parameters, and define the norm-constrained model as

$$ \mathcal{F} := \Big\{ (W^{(L)}\eta(\cdot) + b^{(L)}) \circ \cdots \circ (W^{(1)}x + b^{(1)}) \;\Big|\; \max_j \|W^{(\ell)}_{j,:}\| \le \tfrac{R}{\sqrt{m_{\ell+1}}},\ \|b^{(\ell)}\|_\infty \le \tfrac{R_b}{\sqrt{m_{\ell+1}}} \Big\}, $$

where $W^{(\ell)}_{j,:}$ means the $j$-th row of the matrix $W^{(\ell)}$, $\|\cdot\|$ is the Euclidean norm, and $\|\cdot\|_\infty$ is the $\ell_\infty$-norm¹. We make the following assumption on the activation function, which is satisfied by ReLU and leaky ReLU [Maas et al., 2013].

Assumption 1. We assume that the activation function $\eta$ satisfies (1) scale invariance: $\eta(ax) = a\eta(x)$ for all $a > 0$ and $x \in \mathbb{R}^d$, and (2) 1-Lipschitz continuity: $\|\eta(x) - \eta(x')\| \le \|x - x'\|$ for all $x, x' \in \mathbb{R}^d$, where $d$ is arbitrary.

¹We are implicitly supposing $R, R_b \simeq 1$ so that $\|W^{(\ell)}\|_{\mathrm{F}}, \|b^{(\ell)}\| = O(1)$.

3.1 Approximation Error Analysis

Here, we evaluate the approximation error derived by our pruning procedure. Let $(m^\sharp_\ell)_{\ell=1}^L$ denote the width of each layer of the compressed network $f^\sharp$. We characterize the approximation error between $f^\sharp$ and $f$ on the basis of the degrees of freedom with respect to the empirical $L_2$-norm $\|g\|_n^2 := \frac{1}{n}\sum_{i=1}^n \|g(x_i)\|^2$, which is defined for a vector-valued function $g$. Recall that the empirical covariance matrix in the $\ell$-th layer is denoted by $\Sigma^{(\ell)}$. We define the degrees of freedom as

$$ N_\ell(\lambda) := \mathrm{Tr}\big[\Sigma^{(\ell)}(\Sigma^{(\ell)} + \lambda I)^{-1}\big] = \sum_{j=1}^{m_\ell} \mu^{(\ell)}_j / (\mu^{(\ell)}_j + \lambda), $$

where $(\mu^{(\ell)}_j)_{j=1}^{m_\ell}$ are the eigenvalues of $\Sigma^{(\ell)}$ sorted in decreasing order. Roughly speaking, this quantity quantifies the number of eigenvalues above $\lambda$, and thus it is monotonically decreasing w.r.t. $\lambda$. The degrees of freedom play an essential role in investigating the predictive accuracy of ridge regression [Mallows, 1973; Caponnetto and de Vito, 2007; Bach, 2017]. To characterize the output information loss, we also define the output-aware degrees of freedom with respect to a matrix $Z^{(\ell)}$ as

$$ N'_\ell(\lambda; Z^{(\ell)}) := \mathrm{Tr}\big[Z^{(\ell)}\Sigma^{(\ell)}(\Sigma^{(\ell)} + \lambda I)^{-1} Z^{(\ell)\top}\big]. $$

This quantity measures the intrinsic dimensionality of the output from the $\ell$-th layer for a weight matrix $Z^{(\ell)}$. If the covariance $\Sigma^{(\ell)}$ and the matrix $Z^{(\ell)}$ are near low rank, $N'_\ell(\lambda; Z^{(\ell)})$ becomes much smaller than $N_\ell(\lambda)$. Finally, we define $N^\theta_\ell(\lambda) := \theta N_\ell(\lambda) + (1-\theta) N'_\ell(\lambda; Z^{(\ell)})$.

To evaluate the approximation error induced by compression, we define $\lambda_\ell > 0$ as

$$ \lambda_\ell = \inf\{\lambda \ge 0 \mid m^\sharp_\ell \ge 5 N_\ell(\lambda)\log(80 N_\ell(\lambda))\}. \qquad (3) $$
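The following minimal sketch (ours) shows how the degrees of freedom $N_\ell(\lambda)$ can be computed from the eigenvalues of $\Sigma^{(\ell)}$ and how the smallest $\lambda$ satisfying relation (3) for a given width $m^\sharp_\ell$ can be located by bisection; the eigenvalues below are synthetic stand-ins with fast decay.

```python
# Minimal sketch: degrees of freedom N(lambda) and the lambda_l of Eq. (3).
# In practice mu = np.linalg.eigvalsh(Sigma_l)[::-1]; here we use synthetic
# eigenvalues with fast decay as a stand-in.
import numpy as np

mu = 1.0 / np.arange(1, 301) ** 2          # stand-in eigenvalues of Sigma^{(l)}

def dof(lam):
    """N(lambda) = sum_j mu_j / (mu_j + lambda)."""
    return np.sum(mu / (mu + lam))

def lambda_for_width(m_sharp, lo=0.0, hi=None, iters=200):
    """Approximate inf{lambda >= 0 : m_sharp >= 5 N(lambda) log(80 N(lambda))}."""
    feasible = lambda lam: m_sharp >= 5.0 * dof(lam) * np.log(80.0 * dof(lam))
    hi = 1e6 * mu[0] if hi is None else hi  # N(hi) is tiny, so hi is feasible
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if feasible(mid) else (mid, hi)
    return hi

for m_sharp in (10, 30, 100):
    lam = lambda_for_width(m_sharp)
    print(m_sharp, lam, dof(lam))          # larger width -> smaller lambda_l
```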


Conversely, we may determine $m^\sharp_\ell$ from $\lambda_\ell$ to obtain the theorems we will mention below. Along with the degrees of freedom, we define the leverage score $\tau^{(\ell)} \in \mathbb{R}^{m_\ell}$ as

$$ \tau^{(\ell)}_j := \frac{1}{N_\ell(\lambda_\ell)}\big[\Sigma^{(\ell)}(\Sigma^{(\ell)} + \lambda_\ell I)^{-1}\big]_{j,j} \quad (j \in [m_\ell]). $$

Note that $\sum_{j=1}^{m_\ell} \tau^{(\ell)}_j = 1$ originates from the definition of the degrees of freedom. The leverage score can be seen as the amount of contribution of node $j \in [m_\ell]$ to the degrees of freedom. For simplicity, we assume that $\tau^{(\ell)}_j > 0$ for all $\ell, j$ (otherwise, we just need to neglect such a node with $\tau^{(\ell)}_j = 0$).
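A minimal sketch (ours) of the leverage scores and of the regularization vector $m^\sharp_\ell \lambda_\ell \tau^{(\ell)}$ that will be plugged back into Eq. (1) below; $\Sigma^{(\ell)}$ is again a random stand-in, and $\lambda_\ell$ is set to a small fraction of $\mathrm{Tr}[\Sigma^{(\ell)}]$ purely for illustration.

```python
# Minimal sketch: leverage scores tau^{(l)}_j (they sum to one) and the
# regularization vector m#_l * lambda_l * tau^{(l)} fed back into Eq. (1).
import numpy as np

rng = np.random.default_rng(5)
m_l = 50
B = rng.standard_normal((m_l, m_l))
Sigma = B @ B.T / m_l                          # stand-in for Sigma^{(l)}
lam = 1e-3 * np.trace(Sigma)                   # illustrative choice of lambda_l

H = Sigma @ np.linalg.inv(Sigma + lam * np.eye(m_l))
N = np.trace(H)                                # degrees of freedom N_l(lambda_l)
leverage = np.diag(H) / N                      # tau^{(l)}_j
m_sharp = 20
tau_reg = m_sharp * lam * leverage             # tau <- m#_l * lambda_l * tau^{(l)}
print(round(leverage.sum(), 6), tau_reg[:3])   # leverage scores sum to 1
```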

For the approximation error bound, we consider two situations: (i) (Backward procedure) spectral pruning is applied from $\ell = L$ to $\ell = 2$ in order, and for pruning the $\ell$-th layer, we may utilize the selected index $J^\sharp_{\ell+1}$ in the $(\ell+1)$-th layer; and (ii) (Simultaneous procedure) spectral pruning is simultaneously applied for all $\ell = 2, \dots, L$. We provide a statement for only the backward procedure. The simultaneous procedure also achieves a similar bound with some modifications. For the complete statement of the theorem, see the appendix of the long version [Suzuki et al., 2020a].

As for $Z^{(\ell)}$ for the output information loss, we set

$$ Z^{(\ell)}_{k,:} = \sqrt{m_\ell\, q^{(\ell)}_{j_k}}\ \big(\max_{j'}\|W^{(\ell)}_{j',:}\|\big)^{-1} W^{(\ell)}_{j_k,:} \quad (k = 1, \dots, m^\sharp_{\ell+1}), $$

where we let $J^\sharp_{\ell+1} = \{j_1, \dots, j_{m^\sharp_{\ell+1}}\}$, and $q^{(\ell)}_j := \frac{(\tau^{(\ell+1)}_j)^{-1}}{\sum_{j' \in J^\sharp_{\ell+1}} (\tau^{(\ell+1)}_{j'})^{-1}}$ ($j \in J^\sharp_{\ell+1}$) and $q^{(\ell)}_j = 0$ (otherwise). Finally, we set the regularization parameter $\tau$ as $\tau \leftarrow m^\sharp_\ell \lambda_\ell \tau^{(\ell)}$.

Theorem 1 (Compression rate via the degrees of freedom). If we solve the optimization problem (2) with the additional constraint $\sum_{j \in J}(\tau^{(\ell)}_j)^{-1} \le \frac{5}{3} m_\ell m^\sharp_\ell$ for the index set $J$, then the optimization problem is feasible, and the overall approximation error of $f^\sharp$ is bounded by

$$ \|f - f^\sharp\|_n \le \sum_{\ell=2}^{L} \Big( \bar{R}^{L-\ell+1} \sqrt{\textstyle\prod_{\ell'=\ell}^{L} \zeta_{\ell',\theta}} \Big) \sqrt{\lambda_\ell} \qquad (4) $$

for $\bar{R} = \sqrt{c}R$, where $c$ is a universal constant, and $\zeta_{\ell,\theta} := N^\theta_\ell(\lambda_\ell)\Big(\theta\, \frac{\max_{j \in [m_{\ell+1}]}\|W^{(\ell)}_{j,:}\|^2}{\|(W^{(\ell)})^\top I_{q^{(\ell)}} W^{(\ell)}\|_{\mathrm{op}}} + (1-\theta)\, m_\ell\Big)^{-1}$.²

The proof is given in the appendix of the long version [Suzuki et al., 2020a]. To prove the theorem, we essentially need to use theories of random features in kernel methods [Bach, 2017; Suzuki, 2018]. The main message from the theorem is that the approximation error induced by compression is directly controlled by the degrees of freedom. Since the degrees of freedom $N_\ell(\lambda)$ are a monotonically decreasing function with respect to $\lambda$, they become large as $\lambda$ decreases to 0. The behavior of the eigenvalues determines how rapidly $N_\ell(\lambda)$ increases as $\lambda \to 0$. We can see that if the eigenvalues $\mu^{(\ell)}_1 \ge \mu^{(\ell)}_2 \ge \dots$ rapidly decrease, then the approximation error $\lambda_\ell$ can be much smaller for a given model size $m^\sharp_\ell$. In other words, $f^\sharp$ can be much closer to the original network $f$ if there are only a few large eigenvalues.

²$\|\cdot\|_{\mathrm{op}}$ represents the operator norm of a matrix (the largest absolute singular value).

The quantity $\zeta_{\ell,\theta}$ characterizes how well the approximation error $\lambda_{\ell'}$ of the lower layers $\ell' \le \ell$ propagates to the final output. We can see that a tradeoff between $\zeta_{\ell,\theta}$ and $\theta$ appears. By a simple evaluation, $N^\theta_\ell$ in the numerator of $\zeta_{\ell,\theta}$ is bounded by $m_\ell$; thus, $\theta = 1$ gives $\zeta_{\ell,\theta} \le 1$. On the other hand, the term $\max_{j \in [m_{\ell+1}]}\|W^{(\ell)}_{j,:}\|^2 / \|(W^{(\ell)})^\top I_{q^{(\ell)}} W^{(\ell)}\|_{\mathrm{op}}$ takes a value between $m_{\ell+1}$ and 1; thus, $\theta = 1$ is not necessarily the best choice to maximize the denominator. From this consideration, we can see that the value of $\theta$ that best minimizes $\zeta_{\ell,\theta}$ exists between 0 and 1, which supports our numerical result (Fig. 2b). In any situation, small degrees of freedom give a small $\zeta_{\ell,\theta}$, leading to a sharper bound.

3.2 Generalization Error Analysis

Here, we derive the generalization error bound of the compressed network with respect to the population risk. We will see that a bias–variance tradeoff induced by network compression appears. As usual, we train a network through the training error $\widehat{\Psi}(f) := \frac{1}{n}\sum_{i=1}^n \psi(y_i, f(x_i))$, where $\psi : \mathbb{R}\times\mathbb{R} \to \mathbb{R}$ is a loss function. Correspondingly, the expected error is denoted by $\Psi(f) := \mathrm{E}[\psi(Y, f(X))]$, where the expectation is taken with respect to $(X, Y) \sim P$. Our aim here is to bound the generalization error $\Psi(f^\sharp)$ of the compressed network. Let the marginal distribution of $X$ be $P_X$ and that of $Y$ be $P_Y$. First, we assume Lipschitz continuity of the loss function $\psi$.

Assumption 2. The loss function $\psi$ is $\rho$-Lipschitz continuous: $|\psi(y, f) - \psi(y, f')| \le \rho|f - f'|$ ($\forall y \in \mathrm{supp}(P_Y),\ \forall f, f' \in \mathbb{R}$). The support of $P_X$ is bounded: $\|x\| \le D_x$ ($\forall x \in \mathrm{supp}(P_X)$).

For a technical reason, we assume the following condition on the spectral pruning algorithm.

Assumption 3. We assume that $0 \le \theta \le 1$ is appropriately chosen so that $\zeta_{\ell,\theta}$ in Theorem 1 satisfies $\zeta_{\ell,\theta} \le 1$ almost surely, and spectral pruning is solved under the condition $\sum_{j \in J}(\tau^{(\ell)}_j)^{-1} \le \frac{5}{3} m_\ell m^\sharp_\ell$ on the index set $J$.

As for the choice of $\theta$, this assumption is always satisfied at least by the backward procedure. The linear constraint on $J$ is merely to ensure that the leverage scores are balanced for the chosen index set. Note that the bounds in Theorem 1 can be achieved even with this condition.

If the $L_\infty$-norm of networks is loosely evaluated, the generalization error bound of deep learning can be unrealistically large because the $L_\infty$-norm appears in its evaluation. However, we may consider a truncated estimator $[[f(x)]] := \max\{-M, \min\{M, f(x)\}\}$ for a sufficiently large $0 < M \le \infty$ to moderate the $L_\infty$-norm (if $M = \infty$, this does not affect anything). Note that the truncation procedure does not affect the classification error for a classification task. To bound the generalization error, we define $\delta_1$ and $\delta_2$ for $(m^\sharp_2, \dots, m^\sharp_L)$ and $(\lambda_2, \dots, \lambda_L)$ satisfying relation (3) as³

$$ \delta_1 = \sum_{\ell=2}^{L}\Big(\bar{R}^{L-\ell+1}\sqrt{\textstyle\prod_{\ell'=\ell}^{L}\zeta_{\ell',\theta}}\Big)\sqrt{\lambda_\ell}, $$


$$ \delta_2^2 = \frac{1}{n}\sum_{\ell=1}^{L} m^\sharp_\ell m^\sharp_{\ell+1} \log_+\!\Big(\frac{1 + 4G\max\{\bar{R}, \bar{R}_b\}}{R_\infty}\Big), $$

where $R_\infty := \min\{\bar{R}^L D_x + \sum_{\ell=1}^{L}\bar{R}^{L-\ell}\bar{R}_b,\ M\}$ and $G := L\bar{R}^{L-1}D_x + \sum_{\ell=1}^{L}\bar{R}^{L-\ell}$ for $\bar{R} = \sqrt{c}R$ and $\bar{R}_b = \sqrt{c}R_b$ with the constant $c$ introduced in Theorem 1. Let $R_{n,t} := \frac{1}{n}\big(t + \sum_{\ell=2}^{L}\log(m_\ell)\big)$ for $t > 0$. Then, we obtain the following generalization error bound for the compressed network $f^\sharp$.

³$\log_+(x) = \max\{1, \log(x)\}$.

Theorem 2 (Generalization error bound of the compressed network). Suppose that Assumptions 1, 2, and 3 are satisfied. Then, the spectral pruning method presented in Theorem 1 satisfies the following generalization error bound. There exists a universal constant $C_1 > 0$ such that for any $t > 0$, it holds that

$$ \Psi([[f^\sharp]]) \le \widehat{\Psi}([[f]]) + \rho\big\{\delta_1 + C_1 R_\infty\big(\delta_2 + \delta_2^2 + \sqrt{R_{n,t}}\big)\big\} \lesssim \widehat{\Psi}([[f]]) + \sum_{\ell=2}^{L}\sqrt{\lambda_\ell} + \sqrt{\frac{\sum_{\ell=1}^{L} m^\sharp_{\ell+1} m^\sharp_\ell}{n}\log_+(G)}, $$

uniformly over all choices of $m^\sharp = (m^\sharp_1, \dots, m^\sharp_L)$ with probability $1 - 2e^{-t}$.

The proof is given in the appendix of the long version [Suzuki et al., 2020a]. From this theorem, the generalization error of $f^\sharp$ is upper-bounded by the training error of the original network $f$ (which is usually small) and an additional term. By Theorem 1, $\delta_1$ represents the approximation error between $f$ and $f^\sharp$; hence, it can be regarded as a bias. The second term $\delta_2$ is the variance term induced by the sample deviation. It is noted that the variance term $\delta_2$ only depends on the size of the compressed network rather than the original network size. On the other hand, a naive application of the theorem implies $\Psi([[f]]) - \widehat{\Psi}([[f]]) \le O\big(\sqrt{\frac{1}{n}\sum_{\ell=1}^{L} m_{\ell+1} m_\ell}\big)$ for the original network $f$, which coincides with the VC-dimension based bound [Bartlett et al., 2019] but is much larger than $\delta_2$ when $m^\sharp_\ell \ll m_\ell$. Therefore, the variance is significantly reduced by model compression, resulting in a much improved generalization error. Note that the relation between $\delta_1$ and $\delta_2$ is a tradeoff due to the monotonicity of the degrees of freedom. When $m^\sharp_\ell$ is large, the bias $\delta_1$ becomes small owing to the monotonicity of the degrees of freedom, but the variance $\delta_2(m^\sharp)$ will be large. Hence, we need to tune the size $(m^\sharp_\ell)_{\ell=1}^L$ to obtain the best generalization error by balancing the bias ($\delta_1$) and the variance ($\delta_2$).

The generalization error bound is uniformly valid over the choice of $m^\sharp$ (to ensure this, the term $R_{n,t}$ appears). Thus, $m^\sharp$ can be arbitrary and chosen in a data-dependent manner. This means that the bound is a posteriori, and the best choice of $m^\sharp$ can depend on the trained network.

4 Relation to Existing Work

A seminal work [Arora et al., 2018] showed a generalization error bound based on how the network can be compressed. Although the theoretical insights provided by their analysis are quite instructive, the theory does not give a practical compression method. In fact, a random projection is proposed in the analysis, but it is not intended for practical use. The main difference is that their analysis exploits the near low rankness of the weight matrix $W^{(\ell)}$, while ours exploits the near low rankness of the covariance matrix $\Sigma^{(\ell)}$. They are not directly comparable; thus, we numerically compare the intrinsic dimensionality of both with a VGG-19 network trained on CIFAR-10. Table 1 summarizes the comparison of the intrinsic dimensionalities. For our analysis, we used $N_\ell(\lambda_\ell)N_{\ell+1}(\lambda_{\ell+1})k^2$ for the intrinsic dimensionality of the $\ell$-th layer, where $k$ is the kernel size⁴. This is the number of parameters in the $\ell$-th layer for the width $m^\sharp_\ell \simeq N_\ell(\lambda_\ell)$, where $\lambda_\ell$ was set as $\lambda_\ell = 10^{-3}\times\mathrm{Tr}[\Sigma^{(\ell)}]$, which is sufficiently small. We can see that the quantity based on our degrees of freedom gives significantly smaller values in almost all layers.

Layer | Original  | [Arora et al., 2018] | Spec Prun
1     | 1,728     | 1,645                | 1,013
4     | 147,456   | 644,654              | 84,499
6     | 589,824   | 3,457,882            | 270,216
9     | 1,179,648 | 36,920               | 50,768
12    | 2,359,296 | 22,735               | 4,583
15    | 2,359,296 | 26,584               | 3,886

Table 1: Comparison of the intrinsic dimensionality of our degrees of freedom and the existing one. They are computed for a VGG-19 network trained on CIFAR-10.
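The per-layer quantity reported for our method in Table 1 can be computed along the following lines (a minimal sketch, ours; random stand-ins replace the empirical covariances of the trained VGG-19).

```python
# Minimal sketch of the intrinsic dimensionality N_l(lambda_l) N_{l+1}(lambda_{l+1}) k^2
# with lambda_l = 1e-3 * Tr[Sigma^{(l)}], as used for Table 1.
import numpy as np

rng = np.random.default_rng(6)

def dof(Sigma, lam):
    return np.trace(Sigma @ np.linalg.inv(Sigma + lam * np.eye(Sigma.shape[0])))

def intrinsic_dim(Sigma_l, Sigma_next, kernel_size=3):
    lam_l = 1e-3 * np.trace(Sigma_l)
    lam_next = 1e-3 * np.trace(Sigma_next)
    return dof(Sigma_l, lam_l) * dof(Sigma_next, lam_next) * kernel_size ** 2

B1 = rng.standard_normal((64, 64))
B2 = rng.standard_normal((128, 128))
print(intrinsic_dim(B1 @ B1.T / 64, B2 @ B2.T / 128))  # stand-in covariances
```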

The PAC-Bayes bound [Dziugaite and Roy, 2017; Zhou et al., 2019] is also a promising approach for obtaining a non-vacuous generalization error bound of a compressed network. However, these studies "assume" the existence of effective compression methods and do not provide any specific algorithm. [Suzuki, 2018; Suzuki et al., 2020b] also pointed out the importance of the degrees of freedom for analyzing the generalization error of deep learning but did not give a practical algorithm.

5 Numerical Experiments

In this section, we conduct numerical experiments to show the validity of our theory and the effectiveness of the proposed method.

5.1 Eigenvalue Distribution and Compression Ability

We show how the rate of decrease of the eigenvalues affects the compression accuracy to justify our theoretical analysis. We constructed a network (namely, NN3) consisting of three hidden fully connected layers with widths (300, 1000, 300) following the settings in [Aghasi et al., 2017] and trained it with 60,000 images of MNIST and 50,000 images of CIFAR-10. Figure 1a shows the magnitudes of the eigenvalues of the 3rd hidden layers of the networks trained for each dataset (plotted on a semilog scale).

⁴We omitted quantities related to the depth $L$ and the log term, but the intrinsic dimensionality of [Arora et al., 2018] also omits these factors.


Figure 1: Eigenvalue distribution and compression ability of a fully connected network in MNIST and CIFAR-10. (a) Eigenvalue distributions in each layer for MNIST and CIFAR-10. (b) Approximation error with its s.d. versus the width $m^\sharp$. [Plots omitted.]

Figure 2: Relation between accuracy and the hyperparameters $\lambda_\ell$ and $\theta$; the best $\lambda$ and $\theta$ are indicated by the star symbol. (a) Accuracy versus $\lambda_\ell$ in CIFAR-10 for each compression rate. (b) Accuracy versus $\theta$ in CIFAR-10 for each compression rate. [Plots omitted.]

The eigenvalues are sorted in decreasing order, and they are normalized by division by the maximum eigenvalue. We see that the eigenvalues for MNIST decrease much more rapidly than those for CIFAR-10. This indicates that MNIST is "easier" than CIFAR-10, because the degrees of freedom (an intrinsic dimensionality) of the network trained on MNIST are relatively smaller than those of the network trained on CIFAR-10. Figure 1b presents the (relative) compression error $\|f - f^\sharp\|_n/\|f\|_n$ versus the width $m^\sharp_3$ of the compressed network, where we compressed only the 3rd layer and $\lambda_3$ was fixed to a constant $10^{-6}\times\mathrm{Tr}[\Sigma^{(\ell)}]$ and $\theta = 0.5$. It shows a much more rapid decrease in the compression error for MNIST than for CIFAR-10 (about 100 times smaller). This is because MNIST has faster eigenvalue decay than CIFAR-10.

Figure 2a shows the relation between the test classification accuracy and $\lambda_\ell$. It is plotted for a VGG-13 network trained on CIFAR-10. We chose the width $m^\sharp_\ell$ that gave the best accuracy for each $\lambda_\ell$ under the constraint of the compression rate (the relative number of parameters). We see that as the compression rate increases, the best $\lambda_\ell$ goes down. Our theorem tells us that $\lambda_\ell$ is related to the compression error through (3), that is, as the width goes up, $\lambda_\ell$ must go down. This experiment supports the theoretical evaluation. Figure 2b shows the relation between the test classification accuracy and the hyperparameter $\theta$. We can see that the best accuracy is achieved around $\theta = 0.3$ for all compression rates, which indicates the superiority of the "combination" of the input and output information losses and supports our theoretical bound. For low compression rates, the choice of $\lambda_\ell$ and $\theta$ does not affect the result much, which indicates the robustness of the hyperparameter choice.
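For reference, the quantities plotted in Figure 1 can be computed as in the following minimal sketch (ours, on stand-in data): the normalized eigenvalue spectrum corresponds to Fig. 1a and the relative compression error to Fig. 1b.

```python
# Minimal sketch: normalized eigenvalue spectrum of a layer covariance (Fig. 1a)
# and the relative compression error ||f - f#||_n / ||f||_n (Fig. 1b), on
# stand-in activations/outputs.
import numpy as np

rng = np.random.default_rng(7)
Phi = rng.standard_normal((1000, 300))           # stand-in activations (n x m_3)
Sigma = Phi.T @ Phi / Phi.shape[0]

eigs = np.linalg.eigvalsh(Sigma)[::-1]
normalized_spectrum = eigs / eigs[0]             # sorted, divided by the largest

def relative_error(F, F_sharp):
    """||f - f#||_n / ||f||_n, with rows f(x_i) on the training inputs."""
    return np.linalg.norm(F - F_sharp) / np.linalg.norm(F)

F = rng.standard_normal((1000, 10))              # stand-in outputs of f
F_sharp = F + 0.01 * rng.standard_normal(F.shape)    # stand-in outputs of f#
print(normalized_spectrum[:5], relative_error(F, F_sharp))
```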

5.2 Compression on ImageNet Dataset

We applied our method to the ImageNet (ILSVRC2012) dataset [Deng et al., 2009]. We evaluated our method using the ResNet-50 network [He et al., 2016] (experiments for the VGG-16 network [Simonyan and Zisserman, 2014] are also shown in the long version [Suzuki et al., 2020a]). Our method was compared with the following pruning methods: ThiNet [Luo et al., 2017], NISP [Yu et al., 2018], and sparse regularization [He et al., 2017] (which we call Sparse-reg). As the initial ResNet network, we used two types of networks: ResNet-50-1 and ResNet-50-2. For training ResNet-50-1, we followed the experimental settings in [Luo et al., 2017] and [Yu et al., 2018]. During training, images were resized to 256 × 256 as in [Luo et al., 2017]; then, a 224 × 224 random crop was fed into the network. In the inference stage, we center-cropped the resized images to 224 × 224. For training ResNet-50-2, we followed the same settings as in [He et al., 2017]. In particular, images were resized such that the shorter side was 256, and a center crop of 224 × 224 pixels was used for testing. The augmentation for fine tuning was a 224 × 224 random crop and its mirror.

We compared against ThiNet and NISP for ResNet-50-1 (we call our model for this situation "Spec-ResA") and against Sparse-reg for ResNet-50-2 (we call our model for this situation "Spec-ResB") for a fair comparison. The size of the compressed network $f^\sharp$ was determined to be as close to the compared network as possible (except that, for ResNet-50-2, we did not adopt the "channel sampler" proposed by [He et al., 2017] in the first layer of the residual block; hence, our model became slightly larger). The accuracies of the existing methods are borrowed from the scores presented in their papers, and thus we used different base models because each original paper reported results for a different model. We employed the simultaneous procedure for compression. After pruning, we carried out fine tuning over 10 epochs, where the learning rate was $10^{-3}$ for the first four epochs, $10^{-4}$ for the next four epochs, and $10^{-5}$ for the last two epochs. We employed $\lambda_\ell = 10^{-6}\times\mathrm{Tr}[\Sigma^{(\ell)}]$ and $\theta = 0.5$.

Model             | Top-1  | Top-5  | # Param. | FLOPs
ResNet-50-1       | 72.89% | 91.07% | 25.56M   | 7.75G
ThiNet-70         | 72.04% | 90.67% | 16.94M   | 4.88G
ThiNet-50         | 71.01% | 90.02% | 12.38M   | 3.41G
NISP-50-A         | 72.68% | —      | 18.63M   | 5.63G
NISP-50-B         | 71.99% | —      | 14.57M   | 4.32G
Spec-ResA         | 72.99% | 91.56% | 12.38M   | 3.45G
ResNet-50-2       | 75.21% | 92.21% | 25.56M   | 7.75G
Sparse-reg wo/ ft | —      | 84.2%  | 19.78M   | 5.25G
Sparse-reg w/ ft  | —      | 90.8%  | 19.78M   | 5.25G
Spec-ResB wo/ ft  | 66.12% | 86.67% | 20.69M   | 5.25G
Spec-ResB w/ ft   | 74.04% | 91.77% | 20.69M   | 5.25G

Table 2: Performance comparison of our method and existing ones for ResNet-50 on ImageNet. "ft" indicates fine tuning after compression.

Table 2 summarizes the performance comparison for ResNet-50. We can see that, for both settings, our method outperforms the others by about 1% in accuracy. This is an interesting result because ResNet-50 is already compact [Luo et al., 2017], and thus there is less room to produce better performance. Moreover, we remark that all layers were trained simultaneously in our method, while the other methods were trained one layer after another. Since our method did not adopt the channel sampler proposed by [He et al., 2017], our model was a bit larger. However, we could obtain better performance by combining it with our method.

6 Conclusion

In this paper, we proposed a simple pruning algorithm for compressing a network and gave its approximation and generalization error bounds using the degrees of freedom. Unlike existing compression-based generalization error analyses, our analysis is compatible with a practically useful method and further gives a tighter intrinsic dimensionality bound. The proposed algorithm is easily implemented and only requires linear algebraic operations. The numerical experiments showed that the compression ability is related to the eigenvalue distribution and that our algorithm has favorable performance compared to existing methods.

Acknowledgements

TS was partially supported by MEXT Kakenhi (18K19793, 18H03201 and 20H00576) and JST-CREST, Japan.

References

[Aghasi et al., 2017] Alireza Aghasi, Afshin Abdi, Nam Nguyen, and Justin Romberg. Net-Trim: Convex pruning of deep neural networks with performance guarantee. In Advances in Neural Information Processing Systems 30, pages 3180–3189, 2017.

[Arora et al., 2018] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 254–263. PMLR, 2018.

[Bach, 2017] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1–38, 2017.

[Bartlett et al., 2019] Peter Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019.

[Caponnetto and de Vito, 2007] Andrea Caponnetto and Ernesto de Vito. Optimal rates for regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

[Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009, pages 248–255, 2009.

[Denil et al., 2013] Misha Denil, Babak Shakibi, Laurent Dinh, and Nando de Freitas. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems 26, pages 2148–2156, 2013.

[Denton et al., 2014] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems 27, pages 1269–1277, 2014.

[Dziugaite and Roy, 2017] Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence, 2017.

[Gunasekar et al., 2018] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems 31, pages 9482–9491, 2018.

[Han et al., 2015] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[Hardt et al., 2016] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pages 1225–1234. PMLR, 2016.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[He et al., 2017] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1389–1397, 2017.

[Krause and Golovin, 2014] Andreas Krause and Daniel Golovin. Submodular function maximization. In Tractability, 2014.

[Krogh and Hertz, 1992] Anders Krogh and John A Hertz. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems 5, pages 950–957, 1992.

[Lebedev and Lempitsky, 2016] Vadim Lebedev and Victor Lempitsky. Fast ConvNets using group-wise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2554–2564, 2016.

[Luo et al., 2017] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pages 5068–5076, 2017.

[Maas et al., 2013] Andrew Maas, Awni Hannun, and Andrew Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.

[Mallows, 1973] Colin L Mallows. Some comments on Cp. Technometrics, 15(4):661–675, 1973.

[Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[Suzuki et al., 2020a] Taiji Suzuki, Hiroshi Abe, Tomoya Murata, Shingo Horiuchi, Kotaro Ito, Tokuma Wachi, So Hirai, Masatoshi Yukishima, and Tomoaki Nishimura. Spectral pruning: Compressing deep neural networks via spectral analysis and its generalization error. arXiv preprint arXiv:1808.08558v2, 2020.

[Suzuki et al., 2020b] Taiji Suzuki, Hiroshi Abe, and Tomoaki Nishimura. Compression based bound for non-compressed network: Unified generalization error analysis of large compressible deep neural network. In International Conference on Learning Representations, 2020.

[Suzuki, 2018] Taiji Suzuki. Fast generalization error bound of deep learning from a kernel perspective. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, volume 84, pages 1397–1406. PMLR, 2018.

[Wager et al., 2013] Stefan Wager, Sida Wang, and Percy S Liang. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26, pages 351–359, 2013.

[Wan et al., 2013] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning, volume 28, pages 1058–1066. PMLR, 2013.

[Wen et al., 2016] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems 29, pages 2074–2082, 2016.

[Yu et al., 2018] Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S Davis. NISP: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9194–9203, 2018.

[Zhou et al., 2019] Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, and Peter Orbanz. Non-vacuous generalization bounds at the ImageNet scale: A PAC-Bayesian compression approach. In International Conference on Learning Representations, 2019.
