

Supplementary Material: Video Compression through Image Interpolation

Chao-Yuan Wu, Nayan Singhal, and Philipp Krähenbühl

The University of Texas at Austin
{cywu, nayans, philkr}@cs.utexas.edu

1 Model Details

Context feature fusion. In experiments, we found that fusing the U-net features at resolution W/2 × H/2 yields good performance for all models. Fusing features at more resolutions improves the performance slightly, but requires more memory and computation. In this paper, we additionally use W/4 × H/4 features for the encoder of M1,2 and M3,3, and W/4 × H/4 and W/8 × H/8 features for the decoder of M3,3. Models are selected based on their performance on the validation set.

Probability estimation in AAC. Adaptive arithmetic coding (AAC) relies on a good probability estimate for each bit, given previously decoded bits, to efficiently encode a binary representation. To estimate these probabilities, we follow Mentzer et al. and use a 3D-PixelCNN model [3, 4]. The model contains 11 layers of masked convolutions. Each layer has 128 channels and is followed by batch normalization [1] and a ReLU. We train the models using Adam [2] with a learning rate of 0.0001 for 30K iterations.
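The interface between the probability model and the arithmetic coder can be illustrated with a minimal sketch. The paper uses a learned 3D-PixelCNN as the estimator; here, purely for illustration, a count-based estimator conditioned on the previous few bits stands in for it (the function names and the context length are ours, not from the paper). What matters is the contract: one causally computed P(bit = 1) per bit, and an ideal code length of -log2 P(observed bit).

```python
import math
from collections import defaultdict

def estimate_bit_probs(bits, context_len=3):
    """Causal, adaptive probability estimation for a bit stream.

    For each bit, estimate P(bit = 1) from counts of previously seen
    bits that shared the same context (the preceding `context_len`
    bits), with Laplace smoothing. A stand-in for the learned
    3D-PixelCNN: the arithmetic coder only needs one probability per
    bit, computed from already-decoded bits.
    """
    counts = defaultdict(lambda: [1, 1])  # context -> [count(0)+1, count(1)+1]
    probs = []
    ctx = (0,) * context_len
    for b in bits:
        c0, c1 = counts[ctx]
        probs.append(c1 / (c0 + c1))  # P(b = 1 | context)
        counts[ctx][b] += 1           # update only after predicting (causal)
        ctx = ctx[1:] + (b,)
    return probs

def code_length_bits(bits, probs):
    """Ideal arithmetic-coding length: -sum of log2 P(observed bit)."""
    return sum(-math.log2(p if b == 1 else 1.0 - p)
               for b, p in zip(bits, probs))
```

On a stream with exploitable structure, the estimated code length drops well below one bit per bit, which is exactly the gain AAC extracts from a good context model.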

2 Bitrate Optimization

We present detailed results of the bitrate optimization. Figure 1a shows the explored performance at the first level of the hierarchy. We pick good combinations from the envelope of the curves, and they proceed to the next level. Figure 1b presents the final combinations. We use these combinations for the experiments in our paper unless otherwise noted.

3 Computational Complexity and Memory Consumption

With 352 × 288 videos and batch size 1, encoding/decoding takes 20.4/13.0 ms per iteration, plus 13.7 ms for the context model on a TITAN Xp GPU. This is of the same order as the highly optimized H.264, which takes 4.0/1.4 ms for encoding/decoding on an Intel Xeon E5-2630 v3 CPU with the FFmpeg implementation. As for memory, our model consumes 506 MB at inference time.



[Plot: MS-SSIM (0.94–0.98) vs. BPP (0.5–1.5), one curve per K0 = 1, …, 10, with the picked (K0, K1) combinations marked.]

(a) MS-SSIM. Beam search for K1 given K0.

K0   K1   K2   K3   Average bitrate
 5    3    2    1   0.109
 7    3    2    2   0.151
10    4    2    2   0.188
10   10    2    2   0.219
10   10    6    2   0.260
10   10   10    2   0.302

(b) Final combinations.

Fig. 1: Bitrate optimization. K0, K1, K2 and K3 correspond to the number of iterations used in the I-frame model, M6,6, M3,3 and M1,2 respectively. (a) Beam search at the first level of the hierarchy. Picked (K0, K1) combinations proceed to the next level. (b) Final combinations returned by our algorithm.
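The envelope-picking step of the beam search can be sketched as follows: a candidate survives the level only if no other candidate achieves both a lower bitrate and a higher quality. The function name and the candidate tuples in the usage below are illustrative assumptions (the MS-SSIM values in particular are invented), not measurements from the paper.

```python
def pareto_envelope(candidates):
    """Keep candidates on the rate-quality envelope.

    candidates: list of (config, bitrate, quality) tuples. A candidate
    is dominated if another candidate has bitrate <= its bitrate and
    quality >= its quality (and is not the identical point). Surviving
    configurations proceed to the next level of the beam search.
    """
    kept = []
    for cfg, rate, qual in candidates:
        dominated = any(
            r <= rate and q >= qual and (r, q) != (rate, qual)
            for _, r, q in candidates
        )
        if not dominated:
            kept.append((cfg, rate, qual))
    return sorted(kept, key=lambda t: t[1])  # order by increasing bitrate
```

For example, with hypothetical candidates [((5, 3), 0.109, 0.96), ((7, 3), 0.151, 0.97), ((10, 4), 0.188, 0.965)], the last one is dominated by (7, 3) (higher bitrate, lower quality) and is pruned.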

4 Full System

Figure 2 illustrates how the full system compresses/decompresses a video step by step. Here we illustrate a 3-level model as an example. (The main model in the paper is a 4-level model.)

We first compress the key frames using the I-frame (image compression) model (Figure 2a). Second, we use the decompressed key frames as context and interpolate the frame in the next level of the hierarchy, i.e. the middle frame (Figure 2b). Finally, we again use the decompressed frames as context, and interpolate the frames in the final level of the hierarchy, i.e. the remaining frames (Figure 2c).
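The level-by-level schedule above can be made concrete by computing which frame indices are coded at each level of the hierarchy; this is a sketch under the assumption of a group whose interior frames are repeatedly bisected, and the helper name and indexing convention are ours, for illustration only.

```python
def coding_schedule(num_frames, num_levels):
    """Frame indices coded at each level of the hierarchy.

    Level 0 codes the two key frames at the ends of the group; each
    subsequent level interpolates the midpoint between every pair of
    adjacent already-coded frames. Assumes num_frames - 1 is divisible
    by 2 ** (num_levels - 1), so the bisection comes out even.
    """
    assert (num_frames - 1) % (2 ** (num_levels - 1)) == 0
    coded = [0, num_frames - 1]
    schedule = [list(coded)]
    for _ in range(num_levels - 1):
        coded.sort()
        # midpoint of each adjacent coded pair that still has a gap
        level = [(a + b) // 2 for a, b in zip(coded, coded[1:]) if b - a > 1]
        schedule.append(level)
        coded.extend(level)
    return schedule
```

For a 3-level model over frames 0–4 this yields [[0, 4], [2], [1, 3]]: key frames first, then the middle frame, then the remaining frames, matching the order in Figure 2.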

References

1. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML (2015)

2. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

3. Mentzer, F., Agustsson, E., Tschannen, M., Timofte, R., Van Gool, L.: Conditional probability models for deep image compression. In: CVPR (2018)

4. Oord, A.v.d., Kalchbrenner, N., Kavukcuoglu, K.: Pixel recurrent neural networks. In: ICML (2016)



[Diagram: blocks labeled EI/DI (I-frame model encoder/decoder) at level 0, and ER/DR (interpolation model encoder/decoder) at levels 1 and 2, connected by motion compensated context features.]

(a) Level 0.

(b) Level 1.

(c) Level 2.

Fig. 2: Full compression/decompression procedure for a 3-level model. Blue arrows represent motion compensated context features. Gray arrows represent input and output of the network.