abstract arxiv:1909.11321v2 [cs.cv] 29 dec 2020(pwconv), or 1 1 convolution, is a standard...

FALCON: LIGHTWEIGHT AND ACCURATE CONVOLU-TION

Jun-Gi Jang1∗, Chun Quan2*, Hyun Dong Lee3, U Kang1

1 Department of Computer Science and Engineering, Seoul National University2 CCB Fintech3 Department of Computer Science, Columbia University

ABSTRACT

How can we efficiently compress Convolutional Neural Network (CNN) while re-taining their accuracy on classification tasks? Depthwise Separable Convolution(DSConv), which replaces a standard convolution with a depthwise convolutionand a pointwise convolution, has been used for building lightweight architectures.However, previous works based on depthwise separable convolution are limitedwhen compressing a trained CNN model since 1) they are mostly heuristic ap-proaches without a precise understanding of their relations to standard convolu-tion, and 2) their accuracies do not match that of the standard convolution.In this paper, we propose FALCON, an accurate and lightweight method to com-press CNN. FALCON uses GEP, our proposed mathematical formulation to ap-proximate the standard convolution kernel, to interpret existing convolution meth-ods based on depthwise separable convolution. By exploiting the knowledge ofa trained standard model and carefully determining the order of depthwise sepa-rable convolution via GEP, FALCON achieves sufficient accuracy close to that ofthe trained standard model. Furthermore, this interpretation leads to developinga generalized version rank-k FALCON which performs k independent FALCONoperations and sums up the result. Experiments show that FALCON 1) provideshigher accuracy than existing methods based on depthwise separable convolutionand tensor decomposition, and 2) reduces the number of parameters and FLOPsof standard convolution by up to a factor of 8 while ensuring similar accuracy.We also demonstrate that rank-k FALCON further improves the accuracy whilesacrificing a bit of compression and computation reduction rates.

1 INTRODUCTION

How can we efficiently reduce the number of parameters and FLOPS of convolutional neural net-works (CNN) while maintaining their accuracy on classification tasks? Nowadays, CNN is widelyused in various areas including recommendation system (Kim et al. (2016a)), computer vision(Krizhevsky et al. (2012); Simonyan & Zisserman (2015); Szegedy et al. (2017)), natural languageprocessing (Yin et al. (2016)), etc. Due to an increase in the model capacity of CNNs, there has beena considerable interest in building lightweight CNNs. A major research direction utilizes depth-wise separable convolution (DSConv) (Sifre (2014)), which consists of a depthwise convolutionand a pointwise convolution, instead of standard convolution. The depthwise convolution appliesa separate 2D convolution kernel to each input channel, and the pointwise convolution changes thechannel size using 1×1 convolution. The depthwise convolution extracts spatial features, and thepointwise convolution merges features along the channel dimension, while a standard convolutionsimultaneously performs the two tasks. Decoupling the tasks preserves the representation power ofthe network, and reduces the number of parameters and FLOPS. Several recent methods (Howardet al. (2017); Sandler et al. (2018); Zhang et al. (2018); Ma et al. (2018)) have successfully builtlightweight architectures by 1) constructing CNN with DSConv, and 2) training the model fromscratch.

∗These authors contributed equally.

1

arX

iv:1

909.

1132

1v2

[cs

.CV

] 2

9 D

ec 2

020

0.0 0.2 0.4 0.6 0.8 1.00.0

0.5

1.0

FALCON FALCON-Branch StConv DPConv PDPConv GDGConv PDPConv-split Tucker CP0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.5

1.0

FALCON FALCON-Branch StConv DPConv PDPConv GDGConv PDPConv-split Tucker CP

0 10 20# of Parameters (M)

90

91

92

93

94

Acc

urac

y(%

)

Best

(a) VGG19-CIFAR10


93

94

95

96

Acc

urac

y(%

)

Best

(b) VGG19-SVHN


65

67

70

72

Acc

urac

y(%

)

Best

(c) VGG16-ImageNet


91

92

93

94

Acc

urac

y(%

)

Best

(d) ResNet34-CIFAR10


92

94

96

Acc

urac

y(%

)Best

(e) ResNet34-SVHN


60

65

70

Acc

urac

y(%

)

Best

(f) ResNet18-ImageNet

Figure 1: Accuracy w.r.t. number of parameters from different models and datasets. The three bluecircles correspond to rank-1, 2, and 3 FALCON (from left to right order), respectively. FALCONprovides the best accuracy for a given number of parameters.

However, existing methods based on depthwise separable convolution cannot effectively compresstrained CNNs. They fail to precisely formulate the relation between standard convolution and depth-wise separable convolution, leading to the following limitations. First, it is difficult to utilize theknowledge stored in the trained standard convolution kernels to fit the depthwise separable convolu-tion kernels; i.e., existing approaches based on depthwise separable convolution need to be trainedfrom scratch. Second, although existing methods may give reasonable compression and computa-tion reduction, there is a gap between the resulting accuracy and that of a standard-convolution-basedmodel. Finally, generalizing the methods based on depthwise separable convolution is difficult dueto their heuristic nature.

In this paper, we propose FALCON, an accurate and lightweight method to compress CNN by lever-aging depthwise separable convolution. We first precisely define the relationship between the stan-dard convolution and the depthwise separable convolution using GEP (Generalized ElementwiseProduct), our proposed mathematical formulation to approximate a standard convolution kernel witha depthwise convolution kernel and a pointwise convolution kernel. GEP allows FALCON to be fitby a trained standard model, leading to a better accuracy. FALCON minimizes the gap of the ac-curacy between the compressed model with FALCON and the standard-convolution-based model bycarefully aligning depthwise and pointwise convolutions via GEP. Based on the precise definition,we generalize FALCON to design rank-k FALCON, which performs k independent FALCON opera-tions and sums up the result. We also propose a variant of FALCON, FALCON-branch, by integratingFALCON with the channel split technique (Ma et al. (2018)), which splits feature channels into twobranches, independently processes each branch, and concatenates the results from the branches.As a result, FALCON and FALCON-branch provide a superior accuracy compared to other methodsbased on depthwise separable convolution and tensor decomposition, with similar compression (seeFigure 1) and the computation reduction rates (see Figure 4). Rank-k FALCON further improvesaccuracy, outperforming even the original convolution in many cases (see Table 5).

Our contributions are summarized as follows:

2

• Algorithm. We propose FALCON, a CNN compression method based on depthwise sepa-rable convolution. FALCON and its variant are carefully designed to compress CNN withlittle accuracy loss. We also give theoretical analysis of compression and computationreduction rates of FALCON.

• Generalization. We generalize FALCON to design rank-k FALCON, which further im-proves the accuracy, and often outperforms standard convolution, with a little sacrifice incompression and computation reduction rates.

• Experiments. Extensive experiments show that FALCON 1) outperforms other state-of-the-art methods based on depthwise separable convolution and tensor decompositions forcompressing CNN (see Figure 1), and 2) provides up to 8.61× compression and 8.44×computation reduction compared to the standard convolution while giving similar accura-cies.

The code of FALCON is at https://github.com/snudm-starlab/cnn-compression/tree/master/FALCON2. The rest of this paper is organized asfollows. We describe preliminaries, our proposed method FALCON, and experimental results. Afterdiscussing related works, we present conclusion.

2 PRELIMINARY

We describe preliminaries on depthwise separable convolution, and methods based on it. We use thesymbols listed in Table 1.

2.1 CONVOLUTIONAL NEURAL NETWORK

Convolution Neural Network (CNN) is a type of deep neural network used mainly for structureddata. CNN uses convolution operation in convolution layers. In the following, we discuss CNNwhen applied to typical image data with RGB channels.

Each convolution layer has three components: input feature maps, convolution kernel, and outputfeature maps. The input feature maps I ∈ RH×W×M and the output feature maps O ∈ RH′×W ′×N

are 3-dimensional tensors, and the convolution kernel K ∈ RD×D×M×N is a 4-dimensional tensor.

The convolution operation is defined as:

Oh′,w′,n =

D∑i=1

D∑j=1

M∑m=1

Ki,j,m,n · Ihi,wj ,m (1)

where the relations between height hi and width wj of input, and height h′ and width w′ of outputare as follows:

hi = (h′ − 1)s+ i− p and wj = (w′ − 1)s+ j − p (2)

where s is the stride size, and p is the padding size. The third and the fourth dimensions of theconvolution kernel K must match the number M of input channels, and the number N of outputchannels, respectively.

Convolution kernel K can be seen as N 3-dimensional filters Fn ∈ RD×D×M . Each filter Fn inkernel K performs convolution operation while sliding over all spatial locations on input featuremaps. Each filter produces one output feature map.

2.2 DEPTHWISE SEPARABLE CONVOLUTION

Depthwise Separable Convolution (DSConv) (Sifre (2014)) consists of two sub-layers: depthwiseconvolution and pointwise convolution. By decoupling tasks done by a standard convolution kernel,

3

https://github.com/snudm-starlab/cnn-compression/tree/master/FALCON2

https://github.com/snudm-starlab/cnn-compression/tree/master/FALCON2

Table 1: Symbols.Symbol Description

K convolution kernel of size RD×D×M×N

I input feature maps of size RH×W×M

O output feature maps of size RH′×W ′×N

D height and width of kernel (kernel size)M number of input feature map (input channels)N number of output feature map (output channels)H height of input feature mapW width of input feature mapH ′ height of output feature mapW ′ width of output feature map�E Generalized Elementwise Product (GEP)

each of the two decomposed kernels independently performs its own task. Note that standard con-volution performs two tasks: 1) extracts spatial features (depthwise convolution) and 2) merges fea-tures along the channel dimension (pointwise convolution). Decoupling the tasks allows DSConv topreserve the representation power while reducing the number of parameters and FLOPS at the sametime. Depthwise convolution (DWConv) kernel consists of several D × D 2-dimensional filters.The number of 2-dimension filters is the same as that of input feature maps. Each filter is appliedon the corresponding input feature map, and produces an output feature map. Pointwise convolution(PWConv), or 1 × 1 convolution, is a standard convolution with kernel size 1. DSConv is definedas:

O′h′,w′,m =

D∑i=1

D∑j=1

Di,j,m · Ihi,wj ,m (3)

Oh′,w′,n =

M∑m=1

Pm,n ·O′h′,w′,m (4)

where Di,j,m and Pm,n are depthwise convolution kernel and pointwise convolution kernel, respec-tively. O′h′,w′,m ∈ RH′×W ′×M denotes intermediate feature maps. DSConv performs DWConv oninput feature maps Ihi,wj ,m using equation 3, and generates intermediate feature maps O′h′,w′,m.Then, DSConv performs PWConv on O′h′,w′,m using equation 4, and generates output feature mapsOh′,w′,n.

3 PROPOSED METHOD

In this section, we describe FALCON, our proposed method for compressing CNN using depthwiseseparable convolution. We first present an overview of FALCON. Then, we define GeneralizedElementwise Product (GEP), a key mathematical formulation to generalize depthwise separableconvolution. We propose FALCON and explain why FALCON can replace standard convolution.Then, we propose rank-k FALCON, which extends the basic FALCON. We show that FALCON canbe easily integrated into a branch architecture for compressing CNN. We discuss relations of GEPand other modules based on depthwise separable convolution. Finally, we theoretically analyze theperformance of FALCON.

3.1 OVERVIEW

FALCON compresses a trained CNN model which consists of standard convolution. There are severalchallenges to be tackled to compress CNN using depthwise separable convolution.

1. Utilizing the knowledge of a trained model. The knowledge of a trained model improvesthe performance of a compressed model with depthwise separable convolution. How canwe utilize trained standard convolution kernels to better train depthwise separable convo-lution kernels?

4

2. Minimize accuracy gap. Existing modules based on depthwise separable convolutionsuffers from accuracy drop when we replace standard convolution with those modules.How can we design a depthwise separable convolution module to minimize the gap ofaccuracy between the trained model and the compressed model?

3. Flexibility. How can we provide a flexible method to adjust the accuracy and efficiency ofa compressed model for various application scenarios?

To address the above challenges, we propose the following three main ideas which we elaborate indetail in the following subsections.

1. Formulating the precise relation between standard convolution and depthwise separableconvolution allows us to fit depthwise separable convolution kernels to trained standardconvolution kernels.

2. Carefully determining the order of depthwise convolution and pointwise convolutionimproves the accuracy.

3. Generalizing the formulation provides an effective trade-off between accuracy and effi-ciency.

Given a trained model, we first replace the standard convolution with FALCON which performsdepthwise convolution after pointwise convolution. We fit kernels of FALCON to those of stan-dard convolution based on our precise formulation of the relation between standard convolution anddepthwise separable convolution. Then, we fine-tune the entire model using training data.

3.2 GENERALIZED ELEMENTWISE PRODUCT (GEP)

The first challenge of compression with depthwise separable convolution is how to exploit theknowledge stored in a trained CNN model. A naive approach would replace standard convolu-tion with heuristic modules based on depthwise separable convolution and randomly initialize thekernels; however, it does not leads to a good accuracy since it cannot use the knowledge of thetrained model. Our goal is to precisely figure out the relation between standard convolution anddepthwise separable convolution, such that we fit the kernels of depthwise separable convolution tothose of standard convolution.

For the purpose, we define Generalized Elementwise Product (GEP), a generalized elementwiseproduct for two operands of different shapes, to generalize the formulation of the relation betweenstandard convolution and depthwise separable convolution. Before generalizing the formulation, wegive an example of formulating the relation between standard convolution and depthwise separableconvolution. Suppose we have a 4-order standard convolution kernel K ∈ RI×J×M×N , a 3-orderdepthwise convolution kernel D ∈ RI×J×M , and a pointwise convolution kernel P ∈ RM×N . LetKi,j,m,n be (i, j,m, n)-th element of K, Di,j,m be (i, j,m)-th element of D, and Pm,n be (m,n)-th element of P. Then, it can be shown that applying depthwise convolution with D and pointwiseconvolution with P is equivalent to applying standard convolution kernel K where Ki,j,m,n =Di,j,m ·Pm,n (see Theorem 2).

To formally express this relation, we define Generalized Elementwise Product (GEP) as follows.Definition 1 (Generalized Elementwise Product). Given a p-order tensor D ∈ RI1×···×Ip−1×M anda q-order tensor P ∈ RM×J1×···×Jq−1 , the Generalized Elementwise Product D�E P of D and Pis defined to be the (p + q − 1)-order tensor K ∈RI1×...×Ip−1×M×J1×...×Jq−1 where the last axisof D and the first axis of P are the common axes such that for all elements of K,

Ki1,...,ip−1,m,j1,...,jq−1= Di1,...,ip−1,m ·Pm,j1,...,jq−1

�

Contrary to Hadamard Product which is defined only when the shapes of the two operands arethe same, Generalized Elementwise Product (GEP) deals with tensors of different shapes. Now,we define a special case of Generalized Elementwise Product (GEP) for a third-order tensor and amatrix.Definition 2 (Generalized Elementwise Product for a third order tensor and a matrix). Given athird-order tensor D ∈ RI×J×M and a matrix P ∈ RM×N , the Generalized Elementwise ProductD�E P of D and P is defined to be the tensor K ∈RI×J×M×N where the third axis of the tensorD and the first axis of the matrix P are the common axes such that for all elements of K,

Ki,j,m,n = Di,j,m ·Pm,n �

5

⋯

𝑁

𝐷 𝐷

𝑀=

Standard Convolution Convolution of FALCON

𝐷 𝐷

𝑁⊙'

𝑀

𝑁𝑇𝑇 ),+,,,-

𝓚 ∈ ℝ𝑫×𝑫×𝑴×𝑵 𝓓 ∈ ℝ𝑫×𝑫×𝑵 𝐏 ∈ ℝ𝑵×𝑴

Figure 2: Relation between standard convolution and FALCON expressed with GEP. The commonaxes correspond to the output channel-axis of standard convolution. TT(1,2,4,3) indicates tensortranspose operation to permute the third and the fourth dimensions of a tensor.

We will see that a standard convolution with kernel K is essentially equivalent to GEP of a depthwiseconvolution kernel D and a pointwise convolution kernel P. This allows to fit the kernels of depth-wise separable convolution to those of standard convolution. GEP is a core module of FALCON, andits variants rank-k FALCON and FALCON-branch. We also show that GEP is a core primitive thathelps us understand other modules based on DSConv.

3.3 FALCON: LIGHTWEIGHT AND ACCURATE CONVOLUTION

The second and the most important challenge is how to minimize the accuracy gap between a trainedmodel and a compressed model. Existing modules based on depthwise separable convolution fail toaddress the challenge since they are heuristically designed and randomly initialized.

We design FALCON, a novel lightweight and accurate convolution that replaces standard convolu-tion. FALCON is an efficient method with fewer parameters and computations than what the standardconvolution requires. In addition, FALCON has a better accuracy than competitors while having sim-ilar compression and computation reduction rates. Our main idea is to construct FALCON with thefollowing guidelines:

1. FALCON utilizes one depthwise convolution and one pointwise convolution to approximatethe trained standard convolution. As we will see in Theorem 1, applying one pointwiseconvolution and one depthwise convolution is equivalent to applying a standard convolutionwhose kernel is defined with GEP of depthwise and pointwise kernels.

2. We observe that a typical convolution has more output channels than input channels. Insuch a setting, performing depthwise convolution after pointwise convolution would al-low the depthwise convolution to extract more features from richer feature space; on theother hand, performing pointwise convolution after depthwise convolution only combinesfeatures extracted from a limited feature space.

Based on the guidelines, FALCON first applies pointwise convolution to generate an intermediatetensor O′ ∈ RH×W×N and then applies depthwise convolution.

We represent the relationship between standard convolution kernel K ∈ RD×D×M×N and FALCONusing an GEP operation of pointwise convolution kernel P ∈ RN×M and depthwise convolutionkernel D ∈ RD×D×N in Figure 2. In FALCON, the kernel K is represented as follows:

K = TT(1,2,4,3)(D�E P) s.t. Ki,j,m,n = Pn,m ·Di,j,n. (5)

where TT(1,2,4,3) indicates tensor transpose operation to permute the third and the fourth dimensionsof a tensor. Note that the common axis is the output channel axis of the standard convolution, unlikeGEP for depthwise separable convolution where the common axis is the input channel axis of thestandard convolution.

We show that applying FALCON is equivalent to applying standard convolution with a speciallyconstructed kernel.

Theorem 1. FALCON which applies pointwise convolution with kernel P ∈ RN×M and then depth-wise convolution with kernel D ∈ RD×D×N is equivalent to applying standard convolution withkernel K = TT(1,2,4,3)(D�E P). �

6

Algorithm 1: FALCON

Input: training data, and a trained model consisting of standard convolutionOutput: a trained compressed model

1: Stage 1: Replacement and initialization. Replace each standard convolution with FALCON. Then,initialize the pointwise convolution kernel D and the depthwise convolution kernel P of FALCON byfitting them to the standard convolution kernels of the trained model; i.e.,D,P = argminD′,P′ ||K− TT(1,2,4,3)(D

′ �E P′)||F . We repeatedly update D and P by minimizing||K− TT(1,2,4,3)(D

′ �E P′)||F until a maximum number of iterations is exceeded.2: Stage 2: Fine-tuning. Train all the parameters from the initialized model using training data.

Proof. Based on the Equation equation 5, we re-express Equation equation 1 by replacing the kernelKi,j,m,n with the pointwise convolution kernel Pn,m and the depthwise convolution kernel Di,j,n.

Oh′,w′,n =

M∑m=1

D∑i=1

D∑j=1

Pn,m ·Di,j,n · Ihi,wj ,m

where Ihi,wj ,m is the (hi, wj ,m)-th entry of the input I. We split the above equation into thefollowing two equations.

O′hi,wj ,n =

M∑m=1

Pn,m · Ihi,wj ,m (6)

Oh′,w′,n =

D∑i=1

D∑j=1

Di,j,n ·O′hi,wj ,n (7)

where I, O′, and O are the input, the intermediate, and the output tensors of convolution layer,respectively. Note that equation 6 and equation 7 correspond to pointwise convolution and depth-wise convolution, respectively. Therefore, the output Oh′,w′,n is equal to the output from applyingFALCON.

Algorithm 1 summarizes how FALCON works. Given a trained model with standard convolution,we replace each standard convolution with FALCON. Based on the equivalence, we fit the pointwiseconvolution and the depthwise convolution kernels D and P of FALCON to the convolution kernelsof the trained standard model; i.e., D,P = argminD′,P′ ||K− TT(1,2,4,3)(D′ �E P′)||F (stage 1in Algorithm 1). We repeatedly update D and P by minimizing ||K − TT(1,2,4,3)(D′ �E P′)||Funtil a maximum number of iterations is exceeded. D and P, obtained by minimizing ||K −TT(1,2,4,3)(D

′�E P′)||F , allow a compressed model to learn from better initialization than randominitialization. After pointwise convolution and depthwise convolution, we add batch-normalizationand ReLU activation function. Finally, we fine-tune all the parameters of the comrpessed model withFALCON using training data (stage 2 in Algorithm 1). We note that FALCON significantly reducesthe numbers of parameters and FLOPs compared to standard convolution.

3.4 RANK-k FALCON

The third challenge is to flexibly adjust the accuracy and efficiency of a compressed model for vari-ous application scenarios. In general, a compressed model provides a lower accuracy while requir-ing smaller number of parameters compared to the original model. How can we make a compressedmodel improve its accuracy while sacrificing compression?

We propose rank-k FALCON, an extended version of FALCON that further improves accuracy whilesacrificing a bit of compression and computation reduction rates. The main idea is to perform kindependent FALCON operations and sum up the result. Then, we apply batch-normalization (BN)and ReLU activation function to the summed result. Since each FALCON operation requires inde-pendent parameters for pointwise convolution and depthwise convolution, the number of parametersincreases and thus the compression and the computation reduction rates decrease; however, it im-proves accuracy by enlarging the model capacity. We formally define the rank-k FALCON with GEPas follows.

7

Input

StandardConvolution

Output

Concat

M/2 Channels

Channel Shuffle

Channel Split

M/2Channels

M/2 ChannelsBN+Relu

M Channels

M Channels

M Channels

(a) StConv-branch

Pointwise Convolution

Input

Depthwise Convolution

Output

Concat

M/2 Channels

Channel Shuffle

Channel Split

M/2 ChannelsBN+Relu

M Channels

M Channels

M Channels

M/2 Channels

M/2Channels

(b) FALCON-branch (proposed)

Figure 3: Comparison of architectures. BN denotes batch-normalization. Relu and Relu6 areactivation functions. (a) Standard convolution with branch (StConv-branch). (b) FALCON-branchwhich combines FALCON with StConv-branch.

Definition 3 (Rank-k FALCON with Generalized Elementwise Product). Rank-k FALCON expressesstandard convolution kernel K ∈ RD×D×M×N as GEP of depthwise convolution kernel D(r) ∈RD×D×N and pointwise convolution kernel P(r) ∈ RN×M for r = 1, 2, ..., k.

K =

k∑r=1

TT(1,2,4,3)(D(r) �E P(r)) s.t. Ki,j,m,n =

k∑r=1

P(r)n,m ·D

(r)i,j,n �

For each r = 1, 2, ..., k, we construct the tensor K(r) using GEP of the depthwise convolution kernelD(r) and the pointwise convolution kernel P(r). Then, we construct the standard kernel K by thesum of the tensors K(r) for all r.

3.5 FALCON-BRANCH

FALCON can be easily integrated into a CNN architecture called standard convolution with branch(StConv-branch), which consists of two branches: standard convolution on the left branch and aresidual connection on the right branch. ShuffleNetV2 (Ma et al. (2018)) improved the performanceof CNN by applying depthwise and pointwise convolutions on the left branch of StConv-branch.Since FALCON replaces standard convolution, we observe that StConv-branch can be easily com-pressed by applying FALCON on the left branch.

Figure 3(a) illustrates StConv-branch. StConv-branch first splits an input in half along the depth di-mension. A standard convolution operation is applied to one half, and no operation to the other half.The two are concatenated along the depth dimension, and an output is produced by shuffling thechannels of the concatenated tensor. As shown in Figure 3(b), FALCON-branch is constructed by re-placing the standard convolution branch (left branch) of StConv-branch with FALCON. Advantagesof FALCON-branch are that 1) the branch architecture improves the efficiency since convolutions areapplied to only half of input feature maps and that 2) FALCON further compresses the left brancheffectively. FALCON-branch is initialized by fitting kernels of FALCON to the standard convolutionkernel of the left branch of StConv-branch.

3.6 RELATION TO MODULES USING DSCONV

Recent works (Howard et al. (2017); Sandler et al. (2018); Howard et al. (2019); Zhang et al. (2018);Ma et al. (2018)) have constructed lightweight architectures by designing modules based on depth-wise separable convolution. Our proposed formulation GEP is crucial to understand modules basedon depthwise separable convolution. We discuss the relation between GEP and those modules.

8

3.6.1 DEPTHWISE-POINTWISE CONVOLUTION MODULE (DPCONV)

DPConv, which is used in MobilenetV1 (Howard et al. (2017)), performs pointwise convolutionafter depthwise convolution. We show that applying depthwise separable convolution with D andP is equivalent to applying standard convolution with a kernel K which is constructed from D andP using GEP.Theorem 2. Applying depthwise separable convolution with depthwise convolution kernel D ∈RD×D×M and pointwise convolution kernel P ∈ RM×N is equivalent to applying standard convo-lution with kernel K = D�E P.

Proof. From the definition of GEP, Ki,j,m,n = Di,j,m ·Pm,n. We replace the kernel Ki,j,m,n withthe depthwise convolution kernel Di,j,m and the pointwise convolution kernel Pm,n.

Oh′,w′,n =

D∑i=1

D∑j=1

M∑m=1

Di,j,m ·Pm,n · Ihi,wj ,m

where Ihi,wj ,m is the (hi, wj ,m)-th entry of the input. We split the above equation into the follow-ing two equations.

O′h′,w′,m =

D∑i=1

D∑j=1

Di,j,m · Ihi,wj ,m (8)

Oh′,w′,n =

M∑m=1

Pm,n ·O′h′,w′,m (9)

where O′h′,w′,m ∈ RH′×W ′×M is an intermediate tensor. Note that equation 8 and equation 9 cor-respond to applying a depthwise convolution and a pointwise convolution, respectively. Therefore,the output O′h′,w′,m is equal to the output after applying depthwise separable convolution, DPConv,used in MobilenetV1.

3.6.2 POINTWISE-DEPTHWISE-POINTWISE CONVOLUTION MODULE (PDPCONV)

PDPConv, which is used in Sandler et al. (2018); Howard et al. (2019), can be understood as DPConvpreceded with an additional pointwise convolution. This module is equivalent to a layer consistingof pointwise convolution followed by standard convolution expressed by GEP.

3.6.3 GROUP-DEPTHWISE-GROUP CONVOLUTION MODULE (GDGCONV)

GDGConv, which is used in Shufflenet (Zhang et al. (2018)), consists of four sublayers, group point-wise convolution, channel shuffle, depthwise convolution, and another group pointwise convolution,as well as a shortcut. We examine the relation between standard convolution and the last two convo-lutions of GDGConv using GEP as follows. Let g be the number of groups and Kl ∈ RD×D×M

g ×Ng

be the lth group standard convolution kernel. Note that the input tensor is split into g group ten-sors along the channel axis and each group standard convolution performs a convolution operationto its corresponding group tensor. Then, the relation of l-th group standard convolution kernelKl ∈ RD×D×M

g ×Ng with regard to l-th depthwise convolution kernel Dl ∈ RD×D×M

g and l-thpointwise group convolution kernel Pl ∈ R

Mg ×

Ng is

Kl = Dl �E Pl s.t. Kli,j,mg,ng

= Dli,j,mg

·Plmg,ng

where mg = 1, 2, ..., Mg and ng = 1, 2, ..., Ng . Each group standard convolution is equivalent tothe combination of a depthwise convolution and a pointwise convolution, and thus easily expressedwith GEP.

3.6.4 PDPCONV WITH SPLIT (PDPCONV-SPLIT)

PDPConv-split module is similar to PDPConv except that 1) the input channels are split into twobranches, 2) the left branch undergoes the same convolutions as those of PDPConv which are ex-pressed as GEP, and 3) the right branch is an identity connection. ShufflenetV2 (Ma et al. (2018)) isconstructed by adopting this module.

9

Table 2: Numbers of parameters and FLOPs of FALCON and competitors. Symbols are described inTable 1.

Convolution # of parameters # of FLOPs

FALCON MN +D2N HWMN +H ′W ′D2NFALCON-branch 1

4M2 + 1

2D2M 1

4HWM2 + 12HWD2M

DPConv MN +D2M HWD2M +H ′W ′MNPDPConv tM2 + tD2M + tMN tHWM2 + tH ′W ′D2M + tH ′W ′MN

GDGConv 14 (

MNg +D2N + N2

g ) 14 (

HWMNg +H ′W ′D2N + H′W ′N2

g )

PDPConv-split 12 (M

2 +D2M) 12HW (M2 +D2M)

StConv-branch 14D

2M2 14HWD2M2

Standard convolution D2MN H ′W ′D2MN

Since we can easily replace standard convolution with the modules mentioned above, we compareFALCON with such modules in the experiment section. Based on Theorem 2, we initialize kernels ofDPConv similar to kernels of FALCON. Kernels D and P of DPConv are fitted from the pretrainedstandard convolution kernel K; i.e., D,P = argminD′,P′ ||K − D′ �E P′||F . In contrast toDPConv, compressed models with PDPConv, GDGConv, and PDPConv-split fail to use GEP toinitialize kernels and are trained from scratch, since they are not equivalent to standard convolution.Therefore, the models with the three convolution modules are trained from scratch.

3.7 ANALYSIS

We evaluate the compression and the computation reduction of FALCON and rank-k FALCON. Allthe analysis is based on one convolution layer. The comparison of the numbers of parameters andFLOPs of FALCON and other competitors is in Table 2.

3.7.1 FALCON

We analyze the compression and the computation reduction rates of FALCON in Theorems 3 and 4.

Theorem 3. Compression Rate (CR) of FALCON is given by

CR =# of parameters in standard convolution

# of parameters in FALCON=

D2MN

MN +D2N

where D2 is the size of standard kernel, M is the number of input channels, and N is the numberof output channels. �

Proof. Standard convolution kernel has D2MN parameters. FALCON includes pointwise convolu-tion and depthwise convolution which requires MN and D2N parameters, respectively. Thus, thecompression rate of FALCON is CR = D2MN

MN+D2N .

Theorem 4. Computation Reduction Rate (CRR) of FALCON is described as:

CRR =# of FLOPs in standard convolution

# of FLOPs in FALCON=

H ′W ′MD2N

HWMN +H ′W ′D2N

where H ′ and W ′ are the height and the width of output, respectively, and H and W are the heightand the width of input, respectively.

Proof. The standard convolution operation requires H ′W ′D2MN FLOPs (Molchanov et al.(2017)). FALCON includes pointwise convolution and depthwise convolution. Pointwise convo-lution has kernel size D = 1 with stride s = 1 and no padding, so the intermediate tensor O′ hasthe same height and width as those of the input feature maps. Thus, pointwise convolution needsHWMN FLOPs. Depthwise convolution has the number of input channel M = 1, so it needsH ′W ′D2N FLOPs. The total FLOPs of FALCON is HWMN +H ′W ′D2N , thus the computation

reduction rate of FALCON is CRR =H ′W ′D2MN

HWMN +H ′W ′D2N.

10

Table 3: Datasets.dataset # of classes input size # of train # of test

CIFAR-101 10 32× 32× 3 10× 6000 10000CIFAR-1001 100 32× 32× 3 100× 600 10000SVHN2 10 32× 32 73257 26032ImageNet3 1000 224× 224× 3 1.2× 106 150000

3.7.2 RANK-K FALCON

We analyze the compression and computation reduction rates of rank-k FALCON in Theorem 5.Theorem 5. Compression Rate (CRk) and Computation Reduction Rate (CRRk) of rank-k FAL-CON are described as:

CRk =CR

kCRRk =

CRR

k�

Proof. The numbers of parameters and FLOPs increase by k times since rank-k FALCON duplicatesFALCON for k times. Thus, the compression rate and the computation reduction rate are calculated

as CRk =CR

kand CRRk =

CRR

k.

4 EXPERIMENTS

We validate the performance of FALCON through extensive experiments. We aim to answer thefollowing questions:

• Q1. Performance. Which method gives the best accuracy for a given compression andcomputation reduction rate?

• Q2. Ablation Study. How do the initialization and the alignment of depthwise separableconvolution affect the performance of FALCON?

• Q3. Rank-k FALCON. How do the accuracy, the number of parameters, and the numberof FLOPs change as the rank k increases in FALCON?

• Q4. Comparison with Tensor decomposition. How effectively does FALCON compressCNN models compared to CP and Tucker decomposition methods?

4.1 EXPERIMENTAL SETUP

We construct all models using Pytorch framework. All the models are trained and tested on amachine with GeForce GTX 1080 Ti GPU. We perform image classification task on four famousdatasets in Table 3: CIFAR10, CIFAR100, SVHN, and ImageNet.

4.1.1 PREDICTION MODELS

For CIFAR10, CIFAR100, and SVHN datasets, we choose VGG19 and ResNet34 to evaluate theperformance. We shrink the sizes of both models since the sizes of these three datasets are smallerthan that of Imagenet. On both models, we replace all standard convolution layers (except forthe first convolution layer) with those of FALCON or other competitors in order to compress andaccelerate the model. For ImageNet, we choose VGG16 BN (VGG16 with batch normalizationafter every convolution layer) and ResNet18. We use the pretrained model from Pytorch model zooas the baseline model with standard convolution.

In VGG19, we reduce the number of fully connected layers and the number of features in fullyconnected layers: three large fully connected layers (4096-4096-1000) in VGG19 are replaced with

1https://www.cs.toronto.edu/˜kriz/cifar.html2http://ufldl.stanford.edu/housenumbers/3http://www.image-net.org

11

https://www.cs.toronto.edu/~kriz/cifar.html

http://ufldl.stanford.edu/housenumbers/

http://www.image-net.org

two small fully connected layers (512-10 or 512-100). In ResNet34, we remove the first 7 × 7convolution layer and max-pooling layer since the input size (32 × 32) of these datasets is smallerthan the input size (224× 224) of ImageNet.

4.1.2 COMPETITORS AND EVALUATION

We compare FALCON and FALCON-branch with four convolution units consisting of depthwise con-volution and pointwise convolution: DPConv, PDPConv, GDGConv, PDPConv-split. To evaluatethe effectiveness of fitting depthwise and pointwise convolution kernels to standard convolutionkernel, we build GEP-in which is DPConv where kernels D and P are fitted from the pretrainedstandard convolution kernel K; i.e., D,P = argminD′,P′ ||K −D′ �E P′||F . We also compareFALCON with two tensor decomposition methods, CP and Tucker. We take each standard convo-lution layer (StConv) as a unit, and replace StConv with those from FALCON or other competitors.We evaluate the classification accuracy, the number of parameters in the model, and the number ofFLOPs needed for forwarding one image.

4.1.3 OPTIMIZATION

For kernel fitting of FALCON, we use the AdamW optimizer. We set the learning rate to 0.05 forCIFAR10, and 0.001 for the remaining datasets. We choose the maximum number of iterationsbetween 600 to 1, 000. For training and fine-tuning of all methods, we use the SGD optimizer. Thebest learning rate is chosen among 0.5, 0.1, and 0.01. Weight decay is set to 0.0001 and momentumis set to 0.9. For CIFAR10, CIFAR100, and SVHN, the models are trained and fine-tuned for 350epochs. At epochs 150 and 250, the learning rate is decreased by a factor of 10. For ImageNet, themodels are trained and fine-tuned for 90 epochs; the learning rate is decreased by a factor of 10 atepochs 30 and 60.

4.1.4 COMPRESSION METHODS

The hyperparameter settings for each compression method is as follows.

FALCON. When replacing StConv with FALCON, we use the same setting as that of StConv. That is,if there are BN and ReLU after StConv, we add BN and ReLU at the end of FALCON; if there is onlyReLU after StConv, we add only ReLU at the end of FALCON. This is because FALCON is initializedby approximating the StConv kernel using GEP; using the same setting for BN and ReLU as StConvis more effective for FALCON to approximate the StConv. We fit the pointwise convolution kerneland the depthwise convolution kernel of FALCON to the pretrained standard convolution kernel usingGEP, similarly to GEP-in. Rank-k FALCON uses the same fitting method.

DPConv. DPConv has the most similar architecture as FALCON among competitors, and thus DP-Conv has nearly the same number of parameters as that of FALCON. As in FALCON, the existenceof BN and ReLU at the end of DPConv depends on that of StConv.

PDPConv. In PDPConv, we adjust the numbers of parameters and FLOPs by changing the expan-sion ratio t, which is represented as ‘PDPConv-t’. We choose t = 0.5 as the baseline PDPConv tocompare with FALCON, since two pointwise convolutions impose lots of parameters and FLOPs toPDPConv.

GDGConv. In GDGConv, we adjust the numbers of parameters and FLOPs by changing the widthmultiplier α (Howard et al. (2017)) and the number g of groups, which is represented as ‘GDGConvα×(g=g)’. Note that the width multiplier is used to adjust the number M of input channels andthe number N of output channels of a convolution layer; if the width multiplier is α, the numbersof input and output channels become αM and αN , respectively. We also observe that GDGConvdoes not cooperate well with ResNet: ResNet34 with GDGConv does not converge. We suspect thatresidual block and GDGConv may conflict with each other because of redundant residual connec-tions: a gradient may not find the right path towards previous layers. For this reason, we delete theshortcuts of all residual blocks in ResNet34 when using GDGConv.

PDPConv-split. In PDPConv-split, we also adjust the number of parameters and FLOPs by chang-ing the width multiplier α, which is represented as ’PDPConv-split α×’. Other operations ofPDPConv-split stay the same as in Ma et al. (2018).

12

0.0 0.2 0.4 0.6 0.8 1.00.0

0.5

1.0

FALCON FALCON-Branch StConv DPConv PDPConv GDGConv PDPConv-split Tucker CP0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.5

1.0

FALCON FALCON-Branch StConv DPConv PDPConv GDGConv PDPConv-split Tucker CP

25 27 29

MFLOPs

90

91

92

93

94

Acc

urac

y(%

)

Best

(a) VGG19-CIFAR10

25 27 29

MFLOPs

93

94

95

96

Acc

urac

y(%

)

Best

(b) VGG19-SVHN

210 212 214

MFLOPs

65

67

70

72

Acc

urac

y(%

)

Best

(c) VGG16-ImageNet

25 27 29

MFLOPs

91

92

93

94

Acc

urac

y(%

)

Best

(d) ResNet34-CIFAR10

25 27 29

MFLOPs

92

94

96

Acc

urac

y(%

)

Best

(e) ResNet34-SVHN

28 210

MFLOPs

60

65

70

Acc

urac

y(%

)

Best

(f) ResNet18-ImageNet

Figure 4: Accuracy w.r.t. number of parameters on different models and datasets. The three bluecircles correspond to rank-1, 2, and 3 FALCON (from left to right order), respectively. FALCONprovides the best accuracy for a given number of parameters.

CP and Tucker. The ranks for CP and Tucker decomposition methods are determined from Varia-tional Bayesian Matrix Factorization (Nakajima et al. (2012)).

4.2 PERFORMANCE (Q1)

We evaluate the performance of FALCON in terms of accuracy, compression rate, and the amount ofcomputation.

Accuracy vs. Compression. We evaluate the accuracy and the compression rate of FALCON andcompetitors. Table 4 shows the results on four image datasets. Note that FALCON or FALCON-branch provides the highest accuracy in most cases (5 out of 6) while using a similar or smallernumber of parameters than competitors. Specifically, FALCON and FALCON-branch achieve up to8.61× compression rates with minimizing accuracy drop compared to that of the standard convolu-tion (StConv). Figure 1 shows the tradeoff between accuracy and the number of parameters. Notethat FALCON and FALCON-branch show the best tradeoff (closest to the “best” point) between ac-curacy and compression rate, giving the highest accuracy with similar compression rates.

Accuracy vs. Computation. We evaluate the accuracy and the amount of computation of FALCONand competitors. We use the number of multiply-add floating point operations (FLOPs) needed forforwarding one image to a model as the metric of computation. Table 4 also shows the accuraciesand the number of FLOPs of methods on four image datasets. Note that FALCON or FALCON-branch provides the highest accuracy in most cases while using similar FLOPs as competitors do.Compared to StConv, FALCON and FALCON-branch achieve up to 8.44× FLOPs reduction acrossdifferent models on different datasets. Figure 4 shows the tradeoff between accuracy and the numberof FLOPs. Note that FALCON and FALCON-branch show the best tradeoff (closest to the ”best”point) between accuracy and computation, giving the highest accuracy with a similar number ofFLOPs.

13

Table 4: FALCON and FALCON-branch give the best accuracy for similar number of parameters andFLOPs. Bold font indicates the best accuracy among competing compression methods.

ConvType VGG19-CIFAR100 ConvType ResNet34-CIFAR100Accuracy # param # FLOPs Accuracy # param # FLOPs

StConv 72.10% 20.35M 398.75M StConv 73.94% 21.34M 292.57M

FALCON 72.01% 2.61M 47.28M FALCON 71.83% 2.67M 46.38MFALCON without fitting 70.80% 2.61M 47.28M FALCON without fitting 71.80% 2.67M 46.38MFALCON-branch 1.75× 73.05% 2.68M 54.21M FALCON-branch 1.625× 70.26% 2.54M 58.93MGEP-in 68.29% 2.61M 46.46M GEP-in 66.88% 2.67M 38.45MDPConv 68.18% 2.61M 48.07M DPConv 66.30% 2.67M 38.45MPDPConv-0.5 72.50% 2.71M 51.85M PDPConv-0.5 65.00% 2.59M 39.83MGDGConv 2×(g=2) 72.73% 2.79M 46.71M GDGConv 2×(g=2) 68.97% 3.17M 49.88MPDPConv-split 1.375× 72.32% 2.91M 58.29M PDPConv-split 1.375× 67.38% 3.04M 51.36M

Tucker 57.32% 4.10M 84.89M Tucker 58.10% 2.66M 54.03MCP 61.08% 2.63M 68.71M CP 65.82% 2.66M 42.58M

ConvType VGG19-SVHN ConvType ResNet34-SVHNAccuracy # param # FLOPs Accuracy # param # FLOPs

StConv 95.28% 20.30M 398.70M StConv 94.83% 21.29M 292.52M

FALCON 95.45% 2.56M 47.23M FALCON 94.83% 2.63M 46.33MFALCON without fitting 94.51% 2.56M 47.23M FALCON without fitting 94.75% 2.63M 46.33MFALCON-branch 1.75× 94.56% 2.64M 54.17M FALCON-branch 1.625× 94.98% 2.47M 58.86MGEP-in 94.99% 2.56M 46.41M GEP-in 94.06% 2.62M 38.41MDPConv 94.37% 2.56M 48.02M DPConv 94.03% 2.62M 38.41MPDPConv-0.5 93.28% 2.67M 51.80M PDPConv-0.5 93.16% 2.55M 39.78MGDGConv 2×(g=2) 93.15% 2.74M 46.66M GDGConv 2×(g=2) 93.68% 3.08M 49.78MPDPConv-split 1.375× 94.36% 2.86M 58.24M PDPConv-split 1.375× 94.35% 2.98M 51.30M

Tucker 95.30% 2.35M 69.28M Tucker 91.22% 2.88M 125.23MCP 93.13% 2.54M 68.15M CP 94.09% 2.63M 81.34M

ConvType VGG16 BN-ImageNet ConvType ResNet18-ImageNetTop-1 Top-5 # param # FLOPs Top-1 Top-5 # param # FLOPs

StConv 73.37% 91.50% 138.37M 15484.82M StConv 69.76% 89.08% 11.69M 1814.07M

FALCON 71.93% 90.57% 125.33M 1950.75M FALCON 66.64% 87.09% 1.97M 395.40MFALCON without fitting 71.65% 90.47% 125.33M 1950.75M FALCON without fitting 66.19% 86.86% 1.97M 395.40MFALCON-branch 1.5× 68.24% 88.51% 125.30M 1898.39M FALCON-branch 1.375× 64.01% 85.16% 1.91M 434.44MGEP-in 70.98% 90.19% 125.33M 1910.56M GEP-in 66.21% 86.93% 1.96M 336.81MDPConv 70.34% 89.71% 125.33M 1989.49M DPConv 65.30% 86.30% 1.96M 336.81MPDPConv-0.5 67.80% 87.90% 125.44M 2180.49M PDPConv-0.5 58.99% 81.55% 1.90M 340.06MGDGConv 2×(g=2) 70.40% 89.84% 125.77M 2014.73M GDGConv 2×(g=2) 65.73% 86.75% 2.22M 438.89MPDPConv-split 1.25× 71.34% 90.34% 125.57M 2180.65M PDPConv-split 1.1875× 66.29% 87.32% 2.01M 376.15M

Tucker 70.23% 89.61% 126.30M 4214.26M Tucker 64.10% 85.58% 2.01M 334.73MCP 64.40% 85.58% 124.79M 1796.79M CP 64.10% 85.51% 1.97M 359.80M

4.3 ABLATION STUDY (Q2)

We perform an ablation study on two components: 1) fitting the depthwise and the pointwise kernelsto the trained standard convolution kernel, and 2) the order of depthwise and pointwise convolutions.From Table 4, we have two observations. First, fitting kernels using GEP gives better accuracy:FALCON and GEP-in give better accuracy than FALCON without fitting and DPConv, respectively.Second, with a similar number of parameters, FALCON and FALCON without fitting always result inbetter accuracy than GEP-in and DPConv, respectively; it shows that applying pointwise convolutionfirst and then depthwise convolution is better than the other way. These observations show that ourmain ideas of fitting kernels using GEP and careful ordering of convolutions are key factors for thesuperior performance.

4.4 RANK-K FALCON (Q3)

We evaluate the performance of rank-k FALCON by increasing k and monitoring the changes inthe numbers of parameters and FLOPs. In Table 5, we observe three trends as k increases: 1) the

14

Table 5: Rank-k FALCON further improves accuracy while sacrificing a bit of compression andcomputation reduction.

ConvType VGG19-CIFAR100 ResNet34-CIFAR100Accuracy # param # FLOPs Accuracy # param # FLOPs

StConv 72.10% 20.35M 398.75M 73.94% 21.34M 292.57M

FALCON-k1 72.01% 2.61M (7.80×) 47.28M (8.43×) 71.83% 2.67M (7.99×) 46.38M (6.31×)FALCON-k2 73.71% 4.88M (4.17×) 94.24M (4.23×) 72.94% 5.08M (4.20×) 88.25M (3.32×)FALCON-k3 73.41% 7.16M (2.84×) 141.21M (2.82×) 72.85% 7.49M (2.85×) 130.13M (2.25×)

ConvType VGG19-SVHN ResNet34-SVHNAccuracy # param # FLOPs Accuracy # param # FLOPs

StConv 95.28% 20.30M 398.70M 94.83% 21.29M 292.52M

FALCON-k1 95.45% 2.56M (7.93×) 47.23M (8.44×) 94.83% 2.63M (8.10×) 46.33M (6.31×)FALCON-k2 95.72% 4.84M (4.19×) 94.20M (4.23×) 95.34% 5.04M (4.22×) 88.21M (3.32×)FALCON-k3 95.54% 7.11M (2.86×) 141.16M (2.82×) 94.83% 7.45M (2.86×) 130.08M (2.25×)

ConvType VGG16 BN-ImageNet ResNet18-ImageNetTop-1 Top-5 # param # FLOPs Top-1 Top-5 # param # FLOPs

StConv 73.37% 91.50% 138.37M 15484.82M 69.76% 89.08% 11.69M 1814.07M

FALCON-k1 71.93% 90.57% 125.33M (1.10×) 1950.75M (7.94×) 66.64% 87.09% 1.97M (5.93×) 395.40M (4.59×)FALCON-k2 72.88% 91.19% 127.00M (1.09×) 3777.86M (4.10×) 68.03% 88.26% 3.22M (3.63×) 653.00M (2.78×)FALCON-k3 73.24% 91.54% 128.68M (1.08×) 5604.97M (2.76×) 69.07% 88.56% 4.48M (2.61×) 910.61M (1.99×)

accuracy becomes higher, 2) the number of parameters increases, and 3) the number of floatingpoint operations (FLOPs) increases. Although the k that provides the best tradeoff of rank andcompression/computation reduction varies, rank-k FALCON improves the accuracy of FALCON inall cases. Especially, we note that rank-k FALCON often results in even higher accuracy than thestandard convolution, while using smaller number of parameters and FLOPs. For example, rank-3 FALCON applied to VGG19 on CIFAR100 dataset shows 1.31 percentage point higher accuracycompared to the standard convolution, with 2.8× smaller number of parameters and 2.8× smallernumber of FLOPs. Thus, rank-k FALCON is a versatile method to further improve the accuracy ofFALCON while sacrificing a bit of compression and computation reduction.

4.5 FALCON VS. TENSOR DECOMPOSITION (Q4)

Since FALCON can be thought of as a variant of tensor decomposition, we compare FALCON withCP and Tucker decomposition methods. We compress standard convolution kernels using CP andTucker, and set the number of parameters and FLOPs to be approximately equal to those of FALCON.We have two main observations from Figure 1, Figure 4, and Table 4. First, FALCON outperformsthe two tensor decomposition methods for all datasets. Second, FALCON has a better model capacitythan those of CP and Tucker, for a given number of parameters and FLOPs. CP and Tucker givepoor accuracy compared to the standard convolution for CIFAR100, while the accuracy gap becomessmaller for CIFAR10 and SVHN. Note that CIFAR100 is a more complex dataset with more classescompared to those of CIFAR10 and SVHN, while the numbers of training instances are similar; CPand Tucker lack model capacity to handle CIFAR100. On the other hand, FALCON consistentlyshows the best accuracy due to its rich model capacity for handling convolutions.

5 RELATED WORK

Over the past several years, a lot of studies focused on compressing and accelerating deep neuralnetworks to reduce model size, running time, and energy consumption.

Deep neural networks are often over-parameterized. Weight-sharing (Ullrich et al. (2017); Chenet al. (2015); Han et al. (2016); Choi et al. (2017); Agustsson et al. (2017); Chen et al. (2016); Cho& Kang (2020)) is a common compression method which stores only assignments and centroidsof weights. While using the model, weights are loaded according to assignments and centroids.Pruning (Han et al. (2015); Li et al. (2017); Guo et al. (2020); Frankle & Carbin (2019); Zhuang

15

et al. (2018)) aims at removing useless weights or setting them to zero. Although weight-sharingand pruning can significantly reduce the model size, they are not efficient in reducing the amount ofcomputation. Quantizing (Courbariaux et al. (2015); Hubara et al. (2016); Hou et al. (2017); Zhuet al. (2017); Wang et al. (2018)) the model into binary or ternary weights reduces model size andcomputation simultaneously: replacing arithmetic operations with bit-wise operations remarkablyaccelerates the model. Layer-wise approaches are also employed to efficiently compress models. Atypical example of such approaches is low-rank approximation (Kim et al. (2016b); Lebedev et al.(2015); Novikov et al. (2015); Yoo et al. (2019)); it treats the weights as a tensor and uses generaltensor approximation methods like Tucker and CP to compress the tensor. To reduce computation,approximation methods should be carefully chosen, since some of approximation methods mayincrease computation of the model. A reinforcement learning based framework is also proposed inLee et al. (2020).

A recent major trend is to design a brand new architecture that is small and efficient using depth-wise separable convolution. Mobilenet (Howard et al. (2017)), MobilenetV2 (Sandler et al. (2018)),Shufflenet (Zhang et al. (2018)), and ShufflenetV2 (Ma et al. (2018)) are the most representativeapproaches, and many works (Chollet (2017); Howard et al. (2019); Wu et al. (2019); Wan et al.(2020)) including them use depthwise and pointwise convolutions as building blocks for designingconvolution layers. However, they do not utilize the knowledge of a trained CNN model, and pro-vide heuristic methods without precise understanding of the relation between depthwise separableconvolution and standard convolution. Our proposed FALCON gives a thorough interpretation ofdepthwise and pointwise convolutions, and exploits them into model compression, giving the bestaccuracies with regard to compression and computation.

6 CONCLUSION

We propose FALCON, an accurate and lightweight convolution method for CNN. By interpretingexisting convolution methods based on depthwise separable convolution using GEP operation, FAL-CON and its general version rank-k FALCON provide accurate and efficient compression on CNN.We also propose FALCON-branch, a variant of FALCON integrated into a branch architecture of CNNfor model compression. Extensive experiments show that FALCON and its variants give the best ac-curacy for a given number of parameter or computation, outperforming other compression modelsbased on depthwise separable convolution and tensor decompositions. Compared to the standardconvolution, FALCON and FALCON-branch give up to 8.61× compression and 8.44× computationreduction while giving similar accuracy. We also show that rank-k FALCON provides even betteraccuracy often outperforming the standard convolution, while using smaller numbers of parametersand computations. Future works include extending FALCON for other deep learning models beyondCNN.

REFERENCES

Eirikur Agustsson, Fabian Mentzer, Michael Tschannen, Lukas Cavigelli, Radu Timofte, LucaBenini, and Luc Van Gool. Soft-to-hard vector quantization for end-to-end learning compressiblerepresentations. In NeurIPS, pp. 1141–1151, 2017.

Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. Compressingneural networks with the hashing trick. In ICML, pp. 2285–2294, 2015.

Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. Compressingconvolutional neural networks in the frequency domain. In SIGKDD, pp. 1475–1484, 2016.

Ikhyun Cho and U Kang. Pea-kd: Parameter-efficient and accurate knowledge distillation on bert,2020.

Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. Towards the limit of network quantization. InICLR, 2017.

Francois Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, pp.1800–1807, 2017.

16

Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neuralnetworks with binary weights during propagations. In NeurIPS, pp. 3123–3131, 2015.

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neuralnetworks. In ICLR. OpenReview.net, 2019.

Jinyang Guo, Wanli Ouyang, and Dong Xu. Multi-dimensional pruning: A unified framework formodel compression. In CVPR, pp. 1505–1514. IEEE, 2020.

Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections forefficient neural network. In NeurIPS, pp. 1135–1143, 2015.

Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networkwith pruning, trained quantization and huffman coding. In ICLR, 2016.

Lu Hou, Quanming Yao, and James T. Kwok. Loss-aware binarization of deep networks. In ICLR,2017.

Andrew Howard, Ruoming Pang, Hartwig Adam, Quoc V. Le, Mark Sandler, Bo Chen, WeijunWang, Liang-Chieh Chen, Mingxing Tan, Grace Chu, Vijay Vasudevan, and Yukun Zhu. Search-ing for mobilenetv3. In ICCV, pp. 1314–1324. IEEE, 2019.

Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand,Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks formobile vision applications. CoRR, abs/1704.04861, 2017.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarizedneural networks. In NeurIPS, pp. 4107–4115, 2016.

Dong Hyun Kim, Chanyoung Park, Jinoh Oh, Sungyoung Lee, and Hwanjo Yu. Convolutionalmatrix factorization for document context-aware recommendation. In Recsys, 2016, pp. 233–240,2016a.

Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Com-pression of deep convolutional neural networks for fast and low power mobile applications. InICLR, 2016b.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convo-lutional neural networks. In NeurIPS, pp. 1106–1114, 2012.

Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan V. Oseledets, and Victor S. Lempitsky.Speeding-up convolutional neural networks using fine-tuned cp-decomposition. In ICLR, 2015.

Hyun Dong Lee, Seongmin Lee, and U Kang. Auber: Automated bert regularization, 2020.

Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters forefficient convnets. In ICLR, 2017.

Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet V2: practical guidelinesfor efficient CNN architecture design. In ECCV, pp. 122–138, 2018.

Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutionalneural networks for resource efficient inference. In ICLR, 2017.

Shinichi Nakajima, Ryota Tomioka, Masashi Sugiyama, and S. Derin Babacan. Perfect dimension-ality recovery by variational bayesian PCA. In NeurIPS, pp. 980–988, 2012.

Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry P. Vetrov. Tensorizing neuralnetworks. In NeurIPS, pp. 442–450, 2015.

Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen.Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, pp. 4510–4520, 2018.

Laurent Sifre. Rigid-motion scattering for image classification. PhD thesis, Ecole Polytechnique,2014.

17

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale imagerecognition. In ICLR, 2015.

Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4,inception-resnet and the impact of residual connections on learning. In AAAI, 2017, pp. 4278–4284, 2017.

Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compres-sion. In ICLR, 2017.

Alvin Wan, Xiaoliang Dai, Peizhao Zhang, Zijian He, Yuandong Tian, Saining Xie, Bichen Wu,Matthew Yu, Tao Xu, Kan Chen, Peter Vajda, and Joseph E. Gonzalez. Fbnetv2: Differentiableneural architecture search for spatial and channel dimensions. In CVPR, pp. 12962–12971. IEEE,2020.

Yunhe Wang, Chang Xu, Jiayan Qiu, Chao Xu, and Dacheng Tao. Towards evolutionary compres-sion. In SIGKDD, pp. 2476–2485, 2018.

Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian,Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design viadifferentiable neural architecture search. In CVPR, pp. 10734–10742. Computer Vision Founda-tion / IEEE, 2019.

Wenpeng Yin, Hinrich Schutze, Bing Xiang, and Bowen Zhou. ABCNN: attention-based convolu-tional neural network for modeling sentence pairs. TACL, 4:259–272, 2016.

Jaemin Yoo, Minyong Cho, Taebum Kim, and U Kang. Knowledge extraction with no observabledata. In Advances in Neural Information Processing Systems 32: Annual Conference on Neu-ral Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC,Canada, pp. 2701–2710, 2019.

Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficientconvolutional neural network for mobile devices. In CVPR, pp. 6848–6856, 2018.

Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. Trained ternary quantization. In ICLR,2017.

Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, JunzhouHuang, and Jin-Hui Zhu. Discrimination-aware channel pruning for deep neural networks. InSamy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolo Cesa-Bianchi, andRoman Garnett (eds.), NeurIPS, pp. 883–894, 2018.

18

abstract arxiv:1909.11321v2 [cs.cv] 29 dec 2020(pwconv), or 1 1 convolution, is a standard...

Documents