

Efficient Vision Transformers via Fine-Grained Manifold Distillation

Ding Jia1,2 Kai Han2 Yunhe Wang2 Yehui Tang1,2 Jianyuan Guo2,3 Chao Zhang1∗ Dacheng Tao3

1Key Lab of Machine Perception (MOE), Dept. of Machine Intelligence, Peking University  2Noah’s Ark Lab, Huawei Technologies  3University of Sydney

{jiading1,kai.han,yunhe.wang,jianyuan.guo}@huawei.com, {yhtang,c.zhang}@pku.edu.cn, [email protected]

Abstract

This paper studies the model compression problem of vision transformers. Benefiting from the self-attention module, transformer architectures have shown extraordinary performance on many computer vision tasks. Although the network performance is boosted, transformers often require more computational resources, including memory usage and inference complexity. Compared with existing knowledge distillation approaches, we propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches. We then explore an efficient fine-grained manifold distillation approach that simultaneously calculates cross-image, cross-patch, and randomly-selected manifolds in the teacher and student models. Experimental results conducted on several benchmarks demonstrate the superiority of the proposed algorithm for distilling portable transformer models with higher performance. For example, our approach achieves 75.06% Top-1 accuracy on the ImageNet-1k dataset when training a DeiT-Tiny model, which outperforms other ViT distillation methods.

1 Introduction

The Transformer model [30] has been widely used in natural language processing (NLP) tasks such as machine translation [19] and question answering [2], as the self-attention mechanism can build long-range dependencies. Inspired by its success in NLP, many works have made efforts to apply the transformer structure to the computer vision (CV) field, where the convolutional neural network (CNN) was once treated as the fundamental component. Carion et al. [4] introduce DETR, which uses a CNN backbone and a transformer-based encoder-decoder structure for object detection. Dosovitskiy et al. [10] propose ViT, a pure transformer structure that works well on visual recognition tasks with large-scale training on the JFT-300M dataset. To achieve data-efficient training of ViT, Touvron et al. [28] present DeiT, which shows that ViT can perform better when trained only on the ImageNet-1k dataset with data augmentation policies.

Although transformers have had a wide influence in the CV field and are able to support vision tasks [13, 6], they still suffer from a high demand for storage and computing resources when deployed on edge devices such as mobile phones and IoT devices. Many model acceleration technologies have been developed for CNNs, such as knowledge distillation [16], network quantization [38] and network pruning [21]. These methods do not consider the characteristics of transformers and cannot be directly applied to vision transformers. To build lightweight vision transformers, many new transformer architectures [9, 29, 11] and transformer pruning strategies [39, 26] have been proposed, but smaller transformers generally perform worse than larger ones.

Preprint. Under review.

arXiv:2107.01378v1 [cs.CV] 3 Jul 2021


Figure 1: (a) Image-level manifold space; (b) patch-level manifold space. Patch-level manifold learning can build a more fine-grained manifold space and transfer more information into the student network.

To improve the performance of vision transformers, DeiT [28] introduces a knowledge distillation method named ‘hard-label distillation’ and chooses a CNN as the teacher. Although the existing ‘hard-label distillation’ can improve the performance of the student vision transformer, the intermediate features and the intrinsic information calculated by the attention module are ignored. Several works [25, 18] propose knowledge distillation for BERT [19] models, where a student network learns from multiple intermediate layers of the teacher to extract internal knowledge. But these methods have strict requirements on matching the structures of the student and teacher transformers, and they are still limited to feature transfer within a single sample. Manifold learning is about discovering the manifold space embedded in the feature space and reconstructing features with inter-sample relationships. It has also been used for knowledge distillation [7] to enable the transfer of cross-sample information constructed by the teacher network to the student. However, existing manifold approaches focus only on modeling the relationships between images, without taking advantage of the fine-grained patch structure when applied to vision transformers.

To fully excavate the strong capacity of the teacher transformer, we introduce a fine-grained manifold knowledge distillation method, which builds a patch-level manifold space of all patch embeddings from different images and transfers information about this manifold space to the student network. However, building the whole patch-level manifold space requires measuring all relative distances between patch embeddings, which has prohibitively high computation and memory complexity. Thus, we approximate the whole manifold calculation with three decomposed manifolds, i.e., a cross-image manifold, a cross-patch manifold, and a randomly-selected manifold. The cross-image manifold and the cross-patch manifold sample and calculate the inter-image and within-image relationships, respectively. The randomly-selected manifold corrects the sampling error by randomly taking k patches and calculating their relationships. Our algorithm is parameter-free and places few requirements on the teacher and student transformer architectures. Extensive experiments demonstrate that our method improves DeiT-Tiny to 75.06% Top-1 accuracy on the ImageNet-1k dataset and outperforms other distillation strategies such as DeiT.

2 Related works

2.1 Vision transformer

Vision Transformer (ViT) was proposed by Google [10], and DeiT [28] improves its training process with data augmentation and knowledge distillation. Inspired by the design principles of CNN structures, PiT [15] and HVT [22] add pooling operations to reduce the spatial dimension and increase the number of channels. TNT [14] introduces the Transformer-in-Transformer model to capture the internal information of patches. DeepViT [36] and CaiT [29] address the performance saturation issue and build deeper vision transformers.


Figure 2: The diagram of the proposed manifold distillation scheme. A CaiT teacher distills a DeiT student through the classification loss, the hard-label distillation loss, and the manifold distillation loss; the manifold loss function for patch embeddings is embedded into the blocks of the teacher and the student networks.

2.2 Knowledge distillation

The pioneering work [3] uses an ensemble of base classifiers to generate synthetic labeled samples, which are used to train the student model together with the original training dataset. The concept of knowledge distillation was formally introduced by Hinton et al. [16], which means taking the softened logits of the teacher model to train a student model. Subsequent works [23, 34] train a CNN student with both the logits and the intermediate representations of the teacher model. Knowledge distillation is also applied in other visual tasks, such as object detection [5, 12], and in other fields, such as natural language processing [25], recommendation systems [37] and network compression [32, 31].

2.3 Manifold learning

Manifold learning was first used for non-linear dimensionality reduction. It discovers the smooth manifold embedded in the original feature space based on several discrete samples and uses the manifold space to construct low-dimensional features. Many manifold learning algorithms have been proposed. For instance, the Locally Linear Embedding (LLE) algorithm [24] proposes an embedding projection that keeps the local proximity of samples in the manifold space and reconstructs each input data point as a weighted combination of its neighbors. Isomap [27] calculates all pairwise distances between data points based on the shortest-path distance and Multi-Dimensional Scaling (MDS) [20]. Laplacian Eigenmaps (LE) [1] optimizes the low-dimensional embedding with a loss function that encourages adjacent data points x_i and x_j to stay close in the embedding space. Recently, several works have extended manifold learning to knowledge distillation. Chen et al. [7] adopt a linear method named Locality Preserving Projections (LPP) for knowledge distillation of convolutional neural networks by introducing a locality preserving loss. They describe the manifold space by the relative distances between images in a batch and encourage the student network to learn the embedding manifold space generated by the teacher network. These primary attempts are based on image-level manifolds and do not consider the characteristics of vision transformers.

3 Method

In this section, we review the structure of vision transformers and classic knowledge distillation methods, and then describe how to improve the distillation of vision transformers with our patch embedding manifold loss.


Figure 3: The sampling patterns used when building the simplified patch-level manifold space: (a) cross-patch, (b) cross-image, (c) randomly-selected, and (d) patch embedding. Four images are involved in this figure and each image is split into four patches, so the total relation map is 16×16.

3.1 Preliminaries

Vision transformer In a vision transformer, the input images are first divided into a sequence of patches, and the transformer network is utilized to extract image features for visual recognition. Given an input image, we can obtain p patches by uniform splitting. For example, if the resolution of the image is 224×224 and the patch size is set to 16×16, the image is split into 196 patches. These patches are flattened and projected into patch embeddings by a linear layer. Given a batch of n images, we have n×p patch embeddings X ∈ R^{n×p×c}, where c is the embedding dimension. These patch embeddings are fed into the transformer network for relationship modeling and feature aggregation. The structure of a vision transformer is composed of position encoding, multi-head self-attention (MSA) blocks, and multi-layer perceptron (MLP) blocks. The information flow can be formulated as

X ← MSA(LN(X)) + X,    (1)
X ← MLP(LN(X)) + X,    (2)

where the input before the first MSA is added with the position encoding, and LN is the layer normalization layer. The MSA mechanism can be described as

MSA(X) = FC_out(Attention(FC_q(X), FC_k(X), FC_v(X))),    (3)
Attention(Q, K, V) = Softmax(QK^T / √d) V.    (4)

The FLOPs of MSA and MLP are 4nc² + 2n²c and 8nc², respectively, when the hidden dimension of the MLP is 4c by default. A vision transformer with L blocks therefore has about L(12nc² + 2n²c) FLOPs.
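To make Eqs. (1)-(4) and the FLOPs estimate concrete, the following is a minimal PyTorch sketch, not the implementation used in this paper: the class name MSABlock and the helper vit_flops are illustrative only, and the projections FC_q, FC_k, FC_v, FC_out are delegated to torch.nn.MultiheadAttention.

```python
import torch
import torch.nn as nn


class MSABlock(nn.Module):
    """Pre-norm transformer block following Eqs. (1)-(2): X <- X + MSA(LN(X)); X <- X + MLP(LN(X))."""

    def __init__(self, c, heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(c)
        # FC_q, FC_k, FC_v and FC_out of Eq. (3) live inside MultiheadAttention.
        self.attn = nn.MultiheadAttention(c, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(c)
        self.mlp = nn.Sequential(nn.Linear(c, 4 * c), nn.GELU(), nn.Linear(4 * c, c))

    def forward(self, x):                                       # x: (batch, tokens, c)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]       # Eq. (1)
        x = x + self.mlp(self.norm2(x))                         # Eq. (2)
        return x


def vit_flops(n_tokens, c, depth):
    """Rough per-image FLOPs estimate from the text: L * (12 n c^2 + 2 n^2 c)."""
    return depth * (12 * n_tokens * c ** 2 + 2 * n_tokens ** 2 * c)


# Sanity check against the 17.6B figure quoted below for ViT-Base (197 tokens, c = 768, 12 blocks);
# the estimate ignores the patch-embedding layer and classifier head, so it lands slightly lower.
print(f"{vit_flops(197, 768, 12) / 1e9:.1f}B")  # ~17.4B
```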

Due to the large values of n and d (usually several hundred), the computational cost of vision transformers is large. For example, the base version of ViT requires 17.6B FLOPs to process a 224×224 input image. Lightweight ViTs with fewer FLOPs are desired for practical deployment on mobile devices. However, smaller vision transformers usually perform worse than larger ones. To improve the performance of small vision transformers, knowledge distillation (KD) is applied to utilize the knowledge in large teacher networks.

Knowledge distillation Knowledge distillation takes a well-trained teacher network, usually with a complex structure, to train a target student network. Besides the true label y for input X, the softened output of the teacher model is also used in the training loss:

L = αH(ψ(f_s(X)), y) + (1 − α) t² KL(ψ(f_s(X)/t), ψ(f_t(X)/t)),    (5)

where H is the cross-entropy loss and ψ is the softmax operation. f_s(X) and f_t(X) are the logits of the student and teacher for input X. t is a hyperparameter called temperature, which smooths the distribution of the logits and amplifies the probabilities of negative classes, so that the teacher network's judgment about negative classes is also learned. In this way, the student network is forced to fit the decision boundary of the teacher while fitting the data distribution, which helps the student avoid overfitting and achieve better performance.
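The soft-label loss of Eq. (5) can be written as a short PyTorch sketch. This is a generic illustration under the common KL-divergence formulation, not the authors' code; the default values of alpha and t below are placeholders rather than settings used in this paper.

```python
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, labels, alpha=0.5, t=3.0):
    """Soft-label KD of Eq. (5): alpha * CE(student, y) + (1 - alpha) * t^2 * KL(softened student || softened teacher)."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),   # psi(f_s(X) / t), in log space for kl_div
        F.softmax(teacher_logits / t, dim=-1),       # psi(f_t(X) / t)
        reduction="batchmean",
    )
    return alpha * ce + (1.0 - alpha) * (t ** 2) * kl
```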


The embedding features can also be transferred. Romero et al. [23] distill the student with the logits and the feature maps of middle layers. But this technique requires that the student and teacher have the same embedding dimension, or extra dimensional transformation layers must be added.

However, DeiT introduces a variant of knowledge distillation which fixes the hyperparameter α to 1/2 and sets the temperature to 1. It also replaces the softened output with the hard decision made by the teacher network:

L = (1/2) H(ψ(f_s(X)), y) + (1/2) H(ψ(f_s(X)), y_t),    (6)

where y_t = argmax(f_t(X)). The DeiT experiments show that this method achieves higher accuracy for DeiT models on the ImageNet-1k dataset with a CNN teacher.
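Eq. (6) itself reduces to the sketch below. Note that the actual DeiT recipe attaches the teacher term to the output of a dedicated distillation token's head, which this simplified illustration omits; the function is a sketch of the loss form only.

```python
import torch
import torch.nn.functional as F


def hard_label_distillation_loss(student_logits, teacher_logits, labels):
    """Hard-label distillation of Eq. (6): average of the CE against the true label and the teacher's argmax decision."""
    teacher_decision = teacher_logits.argmax(dim=-1)          # y_t = argmax(f_t(X))
    return 0.5 * F.cross_entropy(student_logits, labels) \
         + 0.5 * F.cross_entropy(student_logits, teacher_decision)
```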

3.2 Fine-grained manifold distillation

The previous knowledge distillation methods [28] for vision transformers mainly focus on distilling the output logits while ignoring the intermediate features and the relationships between them. Directly transferring the embedding features may help, but it has the limitation that both the student and the teacher must have the same feature embedding dimension, which greatly limits the choice of teacher networks. In addition, it does not take advantage of the inter-patch information.

We solve these problems by introducing a patch-level manifold learning algorithm for knowledge distillation of vision transformers. By viewing the transformer as a feature projector, we can embed all the image patches into a smooth manifold and obtain the corresponding patch embeddings X ∈ R^{n×p×c}. As shown in Figure 1, this manifold space contains much more information than the image-level one with the same amount of data. Passing this fine-grained manifold space to the student helps the student maintain the relationships between the patch embeddings generated by the teacher network and thus learn the teacher better. Restrictions on the embedding dimension and the number of heads are also lifted in our method, which expands the selection of teacher networks.

Specifically, we are given a labeled training set (X, Y) with n images (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) in a batch. We split each image into p patches and denote the embedding dimensions of each patch for the student and teacher as c_S and c_T. We denote the outputs extracted by the teacher and student networks from a specific block as F_T ∈ R^{n×p×c_T} and F_S ∈ R^{n×p×c_S}, respectively. Taking the student as an example, we build the patch-level manifold from the discrete samples F_S by calculating the relative distances between patches as

M(F_S) = γ(F'_S) γ(F'_S)^T,    (7)

where F'_S[i, j, :] = F_S[i, j, :] / ‖F_S[i, j, :]‖₂ normalizes each embedding and γ denotes the reshape operation that converts R^{n×p×c} → R^{np×c}. To learn the fine-grained manifold structure of the teacher for better relationship construction, we propose to minimize the gap between teacher and student:

L_manifold = ‖M(F_S) − M(F_T)‖²_F.    (8)

Calculating the whole relation map requires (np, np) memory space and the computational complexity is O(n²p²c). Taking the DeiT-Tiny training setting as an example, it uses a batch size of 128 and the number of patches is 192, so the whole relation map has over 600M 32-bit floating-point numbers. The embedding dimension of DeiT-Tiny is 192, so computing and storing a single layer's whole patch relation map needs over one hundred billion operations and over 2.2GB of GPU memory, which is an unbearable load for practical computing. Therefore, we must simplify it.
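As a reference point for the complexity discussion above, a direct (unsimplified) implementation of Eqs. (7)-(8) might look like the sketch below. The function names relation_map and full_manifold_loss are illustrative, and the closing comment just restates the memory estimate from the text under the same assumptions.

```python
import torch
import torch.nn.functional as F


def relation_map(feat):
    """M(F) of Eq. (7): L2-normalize each patch embedding, flatten (n, p, c) -> (n*p, c),
    and take the Gram matrix of cosine similarities, shape (n*p, n*p)."""
    n, p, c = feat.shape
    feat = F.normalize(feat, dim=-1).reshape(n * p, c)
    return feat @ feat.t()


def full_manifold_loss(feat_s, feat_t):
    """L_manifold of Eq. (8): squared Frobenius norm of the gap between the student and teacher maps.
    c_S and c_T may differ; both relation maps are (n*p, n*p), so the difference is well defined."""
    return (relation_map(feat_s) - relation_map(feat_t)).pow(2).sum()


# Illustration of the memory argument from the text (not meant to be run on a GPU as-is):
# with n = 128 images and p = 192 patches, one relation map holds (128 * 192)^2 ≈ 6.0e8 floats,
# i.e. roughly 2.2 GiB in fp32 per block, before any gradient buffers are counted.
```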

Table 1: Specifications of teacher and student models. The FLOPs are calculated at resolution 224×224. RegNetY-16GF is used in DeiT and we add it in for comparison.

Model          Embedding dim.   Heads   Layers   Params   FLOPs
RegNetY-16GF   —                —       —        84M      15.9B
DeiT-Tiny      192              3       12       6M       1.2B
DeiT-Small     384              6       12       22M      4.6B
CaiT-XXS24     192              4       24       12M      2.5B
CaiT-S24       384              8       24       46.9M    9.4B


Inspired by the orthogonal decomposition of matrices, we propose a simplified patch embedding manifold loss, which takes two patterns to sample the whole relations and one random term to reduce the sampling error. The loss can be expressed as:

L_manifold-sp = L_ci + α L_cp + β L_rs,    (9)

where α and β are two weight hyperparameters that control the relative contributions of the cross-image manifold loss L_ci, the cross-patch manifold loss L_cp and the randomly-selected manifold loss L_rs in the total manifold loss L_manifold-sp. These losses are calculated as:

L_ci = (1/p) Σ_{k=0}^{p} ‖M(F_S[:, k, :]) − M(F_T[:, k, :])‖²_F,    (10)

L_cp = (1/n) Σ_{s=0}^{n} ‖M(F_S[s, :, :]) − M(F_T[s, :, :])‖²_F,    (11)

L_rs = ‖M(F_S^r) − M(F_T^r)‖²_F,    (12)

where F_T^r and F_S^r are randomly selected from γ(F_T) and γ(F_S), and their dimensions are (k, c_T) and (k, c_S). k is a hyperparameter representing the number of patches we choose.

L_ci, L_cp and L_rs have distinct geometric meanings, as illustrated in Figure 3:

• The cosine similarity calculation between γ(F'_T[:, k, :]) and γ(F'_T[:, k, :])^T samples the relationships of patch embeddings with the same index k but from different images in the teacher model. It is then compared with that of the student network to calculate the loss. Adding up the losses over all indexes gives the cross-image manifold loss L_ci. The time complexity is O(n²pc).

• M(F_T[s, :, :]) builds the relationships of patch embeddings that belong to the same image. It is then compared with that of the student network to calculate the loss. Adding up the losses over the n images gives the cross-patch manifold loss L_cp. The time complexity is O(np²c).

• The blank blocks in Figure 3 are relations not reached during our sampling process. To reduce this sampling error, we add a random sampler loss L_rs as a correction term, which randomly takes k patches from all patches and builds their relations. L_rs represents the difference between the teacher and student networks in modeling these relations. The time complexity is O(k²c). We discuss the choice of k in Section 4.3.

The simplified patch embedding manifold loss has O(np²c + n²pc + k²c) time complexity when added to a specific block.

Table 2: Distillation results. The performance of the RegNetY-16GF and CaiT models is taken from [28, 29]. ‘Hard’ means the hard-label distillation strategy used in DeiT. All trained for 300 epochs on the ImageNet-1k.

Teacher             Acc1   Distillation method   Student      Acc1
RegNetY-16GF        82.9   Hard                  DeiT-Tiny    74.48
RegNetY-16GF        82.9   Hard                  DeiT-Small   81.16
CaiT-XXS24          77.6   Hard                  DeiT-Tiny    73.86
CaiT-XXS24          77.6   Hard                  DeiT-Small   80.11
CaiT-S24            82.7   Hard                  DeiT-Tiny    74.39
CaiT-S24            82.7   Hard                  DeiT-Small   81.27
CaiT-XXS24 (ours)   77.6   Manifold              DeiT-Tiny    74.46
CaiT-XXS24 (ours)   77.6   Manifold              DeiT-Small   80.25
CaiT-S24 (ours)     82.7   Manifold              DeiT-Tiny    75.06
CaiT-S24 (ours)     82.7   Manifold              DeiT-Small   81.48


It is used together with the hard-label distillation loss proposed by DeiT so that the student network can better understand the within-image and cross-image features of the teacher. The total training loss is:

L_total = (1/2) H(ψ(f_s(X)), y) + (1/2) H(ψ(f_s(X)), y_t) + Σ_l L_manifold-sp,    (13)

where l indexes the blocks to which we add the patch embedding manifold loss. The choice of blocks is discussed in Section 4.3.
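Putting Eqs. (9)-(12) together, a possible sketch of the simplified loss for a single block is given below. It is illustrative rather than the authors' released code; it assumes the same random patch indices are used for the teacher and the student, and the defaults alpha=1, beta=0.2, k=192 follow the settings reported in Section 4.2. Summing this quantity over the chosen blocks and adding the hard-label terms gives the total loss of Eq. (13).

```python
import torch
import torch.nn.functional as F


def _rel(x):
    """Cosine-similarity relation map of a 2-D matrix whose rows are patch embeddings."""
    x = F.normalize(x, dim=-1)
    return x @ x.t()


def decomposed_manifold_loss(feat_s, feat_t, alpha=1.0, beta=0.2, k=192):
    """Simplified patch embedding manifold loss of Eqs. (9)-(12) for one block.
    feat_s: (n, p, c_S) student patch embeddings; feat_t: (n, p, c_T) teacher patch embeddings."""
    n, p, _ = feat_s.shape

    # Eq. (10): cross-image term, relations among patches sharing the same index across images.
    loss_ci = sum(
        (_rel(feat_s[:, j, :]) - _rel(feat_t[:, j, :])).pow(2).sum() for j in range(p)
    ) / p

    # Eq. (11): cross-patch term, relations among the patches of the same image.
    loss_cp = sum(
        (_rel(feat_s[i, :, :]) - _rel(feat_t[i, :, :])).pow(2).sum() for i in range(n)
    ) / n

    # Eq. (12): randomly-selected term, relations among k patches sampled from the n*p flattened patches
    # (the same indices are applied to student and teacher, which is the reading assumed here).
    idx = torch.randperm(n * p, device=feat_s.device)[:k]
    loss_rs = (
        _rel(feat_s.reshape(n * p, -1)[idx]) - _rel(feat_t.reshape(n * p, -1)[idx])
    ).pow(2).sum()

    return loss_ci + alpha * loss_cp + beta * loss_rs
```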

4 Experiments

In this section, we verify the effectiveness of our proposed manifold distillation method on the ImageNet-1k dataset.

4.1 Experiment setup

Dataset ImageNet-1k contains over 1.2 million labeled images for training and 50,000 labeled images for validation, classified into 1,000 classes. It is a subset of ImageNet [17] and was used in ILSVRC 2012. It is widely used in the CV field and does not contain personally identifiable information or offensive content. We obtain and use the dataset under the ImageNet license, which is posted on https://image-net.org/download. Our distillation experiments follow the data augmentation strategies used in DeiT [28], which include Rand-Augment [8], Mixup [35] and CutMix [33].

Baseline We model our experiments on the DeiT work, which is released under the Apache License 2.0. To make a fair comparison, we follow the training settings of DeiT and take its distillation results as baselines. Our method is suitable for teacher and student models that are both vision transformers and share the same patch size, so we choose CaiT-XXS24 and CaiT-S24 [29] as the teachers to train DeiT-Tiny and DeiT-Small: CaiT-S24 has a performance similar to RegNetY-16GF, which is chosen as the teacher in DeiT, and CaiT-XXS24 is a lightweight model which can be used to indicate how our method benefits the student network even with a weaker teacher. Specifications of these models are shown in Table 1.

4.2 Patch-level manifold distillation on ImageNet-1k

Implementation details Figure 2 shows how we apply our patch embedding manifold loss between blocks of the DeiT student and the CaiT teacher, together with the classification loss and the hard-label distillation loss. We apply it in the first four and last four blocks and set α and β in Formula 9 to 1 and 0.2, respectively. We also set the number of patches taken in the random sampler loss L_rs to 192. We discuss the choice of hyperparameters and blocks in Section 4.3. When building the patch embedding manifold, we drop the class token and the distillation token because they do not carry features of the input images, and dropping them aligns the manifold structures between the teacher and the student.

Figure 4: Comparison of the Top-1 accuracy curves (epochs 200-300) of DeiT-Tiny without distillation, hard-label distilled DeiT-Tiny (RegNetY teacher, as in DeiT), and the proposed patch-level manifold distillation approach (CaiT-S24 teacher).


Table 3: Ablation study results for loss functions in our patch embedding manifold loss. All trained for 300 epochs on the ImageNet-1k.

Teacher    Student     Manifold losses        Acc1
CaiT-S24   DeiT-Tiny   L_ci, L_cp, L_rs       75.06
CaiT-S24   DeiT-Tiny   L_ci, L_cp             74.62
CaiT-S24   DeiT-Tiny   L_ci, L_rs             74.54
CaiT-S24   DeiT-Tiny   L_cp, L_rs             74.63

The student network's weights are initialized from a normal distribution and the biases are initialized to zero.

We use 8 Tesla V100 GPUs and train models for 300 epochs. The batch size is 128 and the learning rate is 5e-4. In all our experiments, the resolution of input images is fixed at 224×224. A complete training session takes about three days.

Results The training process is shown in Figure 4. The accuracy curve of our method is smoother, and it surpasses the DeiT baselines at different training epochs using the same network architecture.

We compare the final distillation results in Table 2, which demonstrates that our proposed patch-level manifold distillation strategy is more efficient: the DeiT-Tiny and DeiT-Small models distilled with our method outperform the baselines by 0.58% and 0.32%, respectively. Moreover, the performance of DeiT-Tiny distilled with our method is very close to the baseline even when using the much weaker teacher CaiT-XXS24. The potential negative societal impacts of our method mainly lie in the energy consumption of GPU computation during training.

4.3 Ablation study and analysis

Ablation study for loss functions As shown in Formula 9, our proposed patch embedding manifold loss consists of L_ci, L_cp and L_rs. The ablation experiments on these losses are shown in Table 3. The results confirm that each component of our patch embedding manifold loss contributes positively to the final accuracy.

We also test different combinations of the hyperparameters α and β in the loss function of Formula 9; the results are shown in Table 4. Our method shows strong performance when α and β are set to 1 and 0.2. The results suggest that the weight of L_rs should be set relatively small, because the random sampling mechanism introduces uncertainty and is less adjustable by gradient backpropagation.

Table 4: Results about the choices of hyper-parameters α and β in our patch embedding manifold loss. All trained for 300 epochs on the ImageNet-1k.

Teacher    Student     α     β     Acc1
CaiT-S24   DeiT-Tiny   1     0.2   75.06
CaiT-S24   DeiT-Tiny   0.5   0.2   74.72
CaiT-S24   DeiT-Tiny   2.5   0.2   74.90
CaiT-S24   DeiT-Tiny   1     1     74.73

Blocks to apply the manifold loss Since our patch embedding manifold loss can be plugged into any blocks of the student and teacher as long as they have the same number of patches, we conduct experiments to find out how to insert the manifold loss for better results, which are shown in Table 5. The results show that applying it in the first four and last four blocks of the teacher and student networks yields the best student performance. We believe this is because placing constraints on the head and tail of the model helps the student network learn the teacher's abilities of feature extraction and of fusing high-dimensional features, while leaving the middle of the model unconstrained prevents over-constraining the student network.


Table 5: Distillation results for different choices of blocks using our manifold distillation method. All trained for 300 epochs on the ImageNet-1k.

Teacher    Student     Teacher blocks extracted             Student blocks extracted       Acc1
CaiT-S24   DeiT-Tiny   {3,4,21,22}                          {3,4,9,10}                     74.50
CaiT-S24   DeiT-Tiny   {1,2,3,4,21,22,23,24}                {1,2,3,4,9,10,11,12}           75.06
CaiT-S24   DeiT-Tiny   {1,2,3,9,10,11,22,23,24}             {1,2,3,5,6,7,10,11,12}         74.30
CaiT-S24   DeiT-Tiny   {1,2,3,4,9,10,11,12,21,22,23,24}     {1,2,3,4,5,6,7,8,9,10,11,12}   74.31

Number of randomly sampled patches We randomly sample k patches in the loss L_rs, where k is a hyperparameter. To find an appropriate k, we launch four experiments, which are shown in Table 6. The results indicate that although L_rs does make a contribution, a larger k normally results in worse performance once a certain limit is exceeded.

Table 6: Distillation results for different k using our manifold distillation method. All trained for 300 epochs on the ImageNet-1k.

Teacher    Student     k     Acc1
CaiT-S24   DeiT-Tiny   0     74.62
CaiT-S24   DeiT-Tiny   192   75.06
CaiT-S24   DeiT-Tiny   384   74.95
CaiT-S24   DeiT-Tiny   768   74.81

Comparison with direct embedding transfer We compare our method with directly transferring the feature embeddings between the teacher and the student, which is also plugged into the first four and last four blocks. The results in Table 7 show that our knowledge distillation strategy outperforms direct embedding transfer, which also indicates that the student does benefit from the provided inter-patch information. Besides, our approach is less restrictive, since transferring feature embeddings between the student and the teacher requires both networks to have the same embedding dimension, or an extra MLP layer is needed for dimensional transformation.

Table 7: Distillation results compared with direct embedding transfer. ‘Embedding’ means direct embedding transfer. All trained for 300 epochs on the ImageNet-1k.

Teacher             Student      Distillation method   Acc1
CaiT-XXS24          DeiT-Tiny    Embedding             74.31
CaiT-XXS24 (ours)   DeiT-Tiny    Manifold              74.46
CaiT-S24            DeiT-Small   Embedding             81.41
CaiT-S24 (ours)     DeiT-Small   Manifold              81.48

5 Conclusion

In this work, we propose a novel patch-level manifold knowledge distillation strategy for vision transformers, which transfers information about the fine-grained manifold space of the teacher model to the student. Since the original algorithm has a high computational complexity, we simplify it with the sum of three sub-manifold losses. We apply our method to blocks of the teacher and student transformers, and the results outperform other ViT distillation methods such as DeiT. Our approach places few limitations on model choice, making it flexible in use. We have also conducted a series of experiments to verify the validity of our method and to discuss the design of the distillation structure. We hope our observations and methods can contribute to the study and application of vision transformers.


References

[1] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, volume 14, pages 585–591, 2001.

[2] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

[3] C. Bucila, R. Caruana, and A. Niculescu-Mizil. Model compression. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), 2006.

[4] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.

[5] Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. Learning efficient object detection models with knowledge distillation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 742–751, 2017.

[6] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12299–12310, 2021.

[7] Hanting Chen, Yunhe Wang, Chang Xu, Chao Xu, and Dacheng Tao. Learning student networks via feature embedding. IEEE Transactions on Neural Networks and Learning Systems, 2020.

[8] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.

[9] Stéphane d'Ascoli, Hugo Touvron, Matthew Leavitt, Ari Morcos, Giulio Biroli, and Levent Sagun. Convit: Improving vision transformers with soft convolutional inductive biases. arXiv preprint arXiv:2103.10697, 2021.

[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[11] Benjamin Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, and Matthijs Douze. Levit: a vision transformer in convnet's clothing for faster inference. arXiv preprint arXiv:2104.01136, 2021.

[12] Jianyuan Guo, Kai Han, Yunhe Wang, Han Wu, Xinghao Chen, Chunjing Xu, and Chang Xu. Distilling object detectors via decoupled features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2154–2164, 2021.

[13] Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. A survey on visual transformer. arXiv preprint arXiv:2012.12556, 2020.

[14] Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer. arXiv preprint arXiv:2103.00112, 2021.

[15] Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, and Seong Joon Oh. Rethinking spatial dimensions of vision transformers. arXiv preprint arXiv:2103.16302, 2021.

[16] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[17] D. Jia, D. Wei, R. Socher, L. J. Li, L. Kai, and F. F. Li. Imagenet: A large-scale hierarchical image database. In Proc. of IEEE Computer Vision & Pattern Recognition, pages 248–255, 2009.

[18] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351, 2019.

[19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.

[20] J. B. Kruskal, M. Wish, and E. M. Uslaner. Multidimensional scaling. Book on Demand POD, 1978.

[21] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pages 2736–2744, 2017.

[22] Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, and Jianfei Cai. Scalable visual transformers with hierarchical pooling. arXiv preprint arXiv:2103.10619, 2021.


[23] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. Computer Science, 2015.

[24] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

[25] Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for bert model compression. arXiv preprint arXiv:1908.09355, 2019.

[26] Yehui Tang, Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chao Xu, and Dacheng Tao. Patch slimming for efficient vision transformers. arXiv preprint arXiv:2106.02852, 2021.

[27] Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[28] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.

[29] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. arXiv preprint arXiv:2103.17239, 2021.

[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.

[31] Yunhe Wang, Chang Xu, Chao Xu, and Dacheng Tao. Adversarial learning of portable student networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

[32] Yixing Xu, Yunhe Wang, Hanting Chen, Kai Han, Chunjing Xu, Dacheng Tao, and Chang Xu. Positive-unlabeled compression on the cloud. arXiv preprint arXiv:1909.09757, 2019.

[33] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6023–6032, 2019.

[34] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.

[35] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

[36] Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi Feng. Deepvit: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886, 2021.

[37] Guorui Zhou, Ying Fan, Runpeng Cui, Weijie Bian, Xiaoqiang Zhu, and Kun Gai. Rocket launching: A universal and efficient framework for training well-performing light net. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

[38] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.

[39] Mingjian Zhu, Kai Han, Yehui Tang, and Yunhe Wang. Visual transformer pruning. arXiv preprint arXiv:2104.08500, 2021.
