Implementation And Improvement Of Wavefront Parallel Processing For HEVC Encoding On Many-core Platform Shaobo Zhang, Xiaoyun Zhang, Zhiyong Gao 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)

Upload: philippa-young

Post on 19-Dec-2015

TRANSCRIPT

Implementation And Improvement Of Wavefront Parallel Processing For HEVC Encoding On Many-core Platform

Shaobo Zhang, Xiaoyun Zhang, Zhiyong Gao

2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)

Outline

• Introduction

• Proposed Method

• Experimental Results

• Conclusion

Introduction

• In HEVC, two parallel tools, Tile and WPP, are presented to facilitate high level parallel processing.

• Compared with slice and Tile, WPP neither changes the regular raster scan order nor breaks coding dependencies at row boundaries.

• WPP may often provide better compression performance and avoid some visual artifacts that may be induced by Tile and slice parallelism.

Introduction(Cont.)

• Several related works focus on improving the parallelism of HEVC.

• Chi [4] presents an approach called Overlapped Wavefront (OWF) to enhance the parallel efficiency of WPP.

• Yan[5] utilizes the data dependencies among neighboring CTUs and PU regions to exploit the implicit parallelism.

• [4] C. C. Chi et al., “Parallel scalability and efficiency of HEVC parallelization approaches,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, pp. 1827–1838, Dec. 2012.

• [5] Chenggang Yan et al., “Highly parallel framework for HEVC motion estimation on many-core platform,” Proc. DCC, pp. 63-72, Mar. 2013.

Introduction(Cont.)

• WPP and its applications still have some shortcomings.

– The HEVC test model (HM) is a single-core codec, so its serial realization of WPP is not suitable for HEVC encoding on a many-core platform.

– The wavefront dependencies introduce parallelization inefficiencies, which become worse when a high number of processors is used.

Proposed Method

• Except for the first row of a slice, WPP requires control signaling that indicates, when a CTU is processed, whether the top-right CTU in the previous row has been encoded.

• Additional memory is required to store the side information and CABAC probabilities needed by the following rows.

Proposed Method(Cont.)

• A try-and-wait mechanism is presented to apply WPP to an HEVC encoder on a many-core platform.

– The control signals are stored CTU by CTU, so W × H bytes are required (W and H being the picture width and height in CTUs).

– Before it is processed, the current CTU checks whether the top-right CTU in the previous row has been completed. If not, the corresponding core waits and tries again.
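
The try-and-wait idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: one thread per CTU row stands in for one core, `encode_ctu` is a hypothetical placeholder for the real encoding work, and treating the last column as dependent on the CTU directly above is an assumption, since that column has no top-right neighbour.

```python
import threading
import time


def run_wavefront(W, H):
    """Process a W x H grid of CTUs with one thread per CTU row."""
    # One flag byte per CTU (W x H bytes in total): 1 means "encoded".
    done = [bytearray(W) for _ in range(H)]
    order = []                      # completion order, kept for inspection only
    order_lock = threading.Lock()

    def encode_ctu(row, col):
        # Hypothetical placeholder for the real CTU encoding work.
        with order_lock:
            order.append((row, col))

    def encode_row(row):
        for col in range(W):
            if row > 0:
                # Top-right neighbour in the previous row; the last column
                # falls back to the CTU directly above (an assumption).
                dep = min(col + 1, W - 1)
                while not done[row - 1][dep]:   # try ...
                    time.sleep(0.0001)          # ... wait, then try again
            encode_ctu(row, col)
            done[row][col] = 1      # signal the row below

    threads = [threading.Thread(target=encode_row, args=(r,)) for r in range(H)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


order = run_wavefront(8, 4)
```

Polling a per-CTU flag keeps the synchronization state tiny (one byte per CTU) at the cost of busy waiting; a condition variable per row would be a possible alternative on platforms where spinning is expensive.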

• Ping-pong ("ping-pang") storage is used to reduce the memory needed for side information.

• A data reuse structure is also used for storing the CABAC probabilities.

– The probabilities of the previous row are no longer needed once they have been used, so they can be overwritten by the newest probabilities. The data reuse structure reduces the probability storage by 88%.

• Based on the above methods, WPP is realized efficiently for a real-time HEVC encoder on a many-core platform.

Proposed Method(Cont.)

• Parallel scalability model of WPP

– When the encoding speed ceases to increase as more cores are added, the encoder has reached its Maximum Parallel Scalability (MPS).

• k : number of cores.

• n : number of CTU units (rows, Tiles or slices) in one frame.
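
The closed-form model from the slide is not reproduced in this transcript, but the saturation effect it describes can be observed with a small simulation. The sketch below is only an illustration under stated assumptions: every CTU costs one time unit, row `r` is assigned to core `r mod k`, and a CTU may start once its left neighbour and the top-right CTU of the previous row are finished. The function names are hypothetical.

```python
def wpp_finish_time(W, H, k):
    """Time units needed to process a W x H CTU grid with k cores under WPP."""
    finish = [[0] * W for _ in range(H)]
    for r in range(H):
        for c in range(W):
            ready = 0
            if c > 0:
                ready = finish[r][c - 1]          # left neighbour in the same row
            elif r >= k:
                ready = finish[r - k][W - 1]      # core is busy until its previous row ends
            if r > 0:
                # Wavefront dependency on the top-right CTU (clamped at the last column).
                ready = max(ready, finish[r - 1][min(c + 1, W - 1)])
            finish[r][c] = ready + 1
    return finish[H - 1][W - 1]


def wpp_speedup(W, H, k):
    """Speedup over a single core, which needs W * H time units."""
    return W * H / wpp_finish_time(W, H, k)


# Example grid: 30 x 17 CTUs (1920x1080 with 64x64 CTUs).
for k in (1, 8, 15, 17, 36):
    print(k, wpp_finish_time(30, 17, k), round(wpp_speedup(30, 17, k), 2))
```

In this simplified model the finish time stops improving once the wavefront is fully populated: 15 cores already reach the same 62 time units as 17 or 36 cores, which is the behaviour the MPS definition captures.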

Proposed Method(Cont.)

• α : remaining rows.

• u = ceil(H/k)

• v = (H−1) mod k

Proposed Method(Cont.)

• Improvement of parallel scalability for WPP

– Reduce CTU size

– Combine WPP with slice-level parallelism

– Combine WPP with frame-level parallelism

Proposed Method(Cont.)

• Reduce CTU size

– Reducing the CTU size is an efficient way to increase the number of CTU rows (H) and thereby improve the parallel scalability.
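
As a quick illustration of why a smaller CTU helps, the following sketch counts the CTU columns and rows of a picture for several CTU sizes. The helper name `ctu_grid` and the 1920×1080 example are assumptions chosen for illustration, not values taken from the slides.

```python
def ctu_grid(width_px, height_px, ctu_size):
    """Return (columns, rows) of CTUs; partial CTUs at the edges count as full ones."""
    cols = -(-width_px // ctu_size)    # integer ceiling division
    rows = -(-height_px // ctu_size)
    return cols, rows


for size in (64, 32, 16):
    cols, rows = ctu_grid(1920, 1080, size)
    print(f"CTU {size}x{size}: {cols} columns, {rows} rows")
```

Halving the CTU size roughly doubles the number of rows that WPP can process concurrently (17, 34 and 68 rows for this example).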

Proposed Method(Cont.)

– Although reducing the CTU size effectively increases the parallel scalability of WPP, it decreases the coding efficiency.

– Kim [6] reports a BD-rate loss of about 3.4% to 14.4% when the CTU size decreases from 32 × 32 to 16 × 16.

– A CTU size of 32 × 32 is preferable to balance parallelism against performance loss.

• [6] Kim et al., “Block partitioning structure in the HEVC standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, pp. 1649–1668, Dec. 2012.

Proposed Method(Cont.)

• Combine WPP with slice-level parallelism

– Slice-level parallelism, such as slices and Tiles, can break some dependencies among rows, so the parallel scalability can be enhanced when it is combined with WPP.

– Clare [7] implements two types of combination of Tile and WPP, which divide the frame side by side into two independent or dependent Tiles, each of which is wavefront processed.

• [7] G. Clare et al., “Wavefront parallel processing for HEVC encoding and decoding,” JCTVC-F0274, July 2011.

Proposed Method(Cont.)

– Combining 2 to 4 slices with WPP under a 32 × 32 CTU size brings promising parallel scalability while keeping the performance loss minor.

• m : number of slices or tiles.

• Hm = H/m.

• v' = (Hm−1) mod [floor(k/m)]
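
The per-slice quantities defined above can be evaluated with a short helper. This is only a sketch: the slide does not say how H/m is rounded when H is not divisible by m, so floor division is used here as an assumption, and the function name is hypothetical.

```python
def per_slice_params(H, k, m):
    """Return (Hm, cores_per_slice, v_prime) for m slices or tiles on k cores."""
    if m < 1 or k < m:
        raise ValueError("need at least one core per slice")
    Hm = H // m                    # assumption: floor when H is not divisible by m
    cores_per_slice = k // m       # floor(k/m)
    v_prime = (Hm - 1) % cores_per_slice
    return Hm, cores_per_slice, v_prime


# Example: 34 CTU rows (1080p with 32x32 CTUs) on 36 cores.
for m in (2, 4):
    print(m, per_slice_params(34, 36, m))
```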

Proposed Method(Cont.)

• Combine WPP with frame-level parallelism

– Two GOP structures, IPpP and IPpp, are introduced to improve parallelism, where I and P frames can be used as reference frames while p frames (denoting disposable frames) cannot be used as references.

– When a row has been encoded and no more tasks are available in the current picture, WPP combined with frame-level parallelism starts encoding the next 1 to 3 frames simultaneously.

Proposed Method(Cont.)

– It can be inferred that H − 2 cores are enough for encoding in parallel.

– The start time of the Nth picture can be deduced as NW + 2Nr + 1.

– The finishing moment of the Nth picture can be deduced as (N + 2)W + 2Nr + 2.

• r : maximum vertical search range.

• N : index of the picture (the Nth picture).
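
The two timing expressions can be evaluated directly. The sketch below simply transcribes them; the units are assumptions (time in CTU processing steps, W in CTUs, r in CTU rows), since the slide does not state them, and the example values W = 60 and r = 2 are hypothetical.

```python
def start_time(N, W, r):
    """Start time of the Nth picture: N*W + 2*N*r + 1."""
    return N * W + 2 * N * r + 1


def finish_time(N, W, r):
    """Finishing moment of the Nth picture: (N + 2)*W + 2*N*r + 2."""
    return (N + 2) * W + 2 * N * r + 2


W, r = 60, 2
for N in range(3):
    print(N, start_time(N, W, r), finish_time(N, W, r))
```

Under these expressions consecutive pictures start W + 2r time units apart, while each picture takes about 2W time units, so successive pictures overlap in time.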

Proposed Method(Cont.)

– The finishing moment of the Nth frame is (α + 2)W + 2αr + 2.

– (p+1)(H − r) cores are enough to attain its MPS.

• r : maximum vertical search range.

• p : number of disposable frames.

• α = ceil[ N/(p+1) ].
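
A small sketch of the expressions with disposable frames follows. As before, the units (CTU processing steps, W and H in CTUs, r in CTU rows) and the example values are assumptions, and the function names are hypothetical.

```python
import math


def finish_time_disposable(N, W, r, p):
    """Finishing moment of the Nth frame with p disposable frames per reference frame."""
    alpha = math.ceil(N / (p + 1))
    return (alpha + 2) * W + 2 * alpha * r + 2


def cores_for_mps(H, r, p):
    """Number of cores stated as sufficient to reach the MPS: (p + 1) * (H - r)."""
    return (p + 1) * (H - r)


W, H, r = 60, 34, 2
for p in (0, 1, 2):
    print(p, finish_time_disposable(8, W, r, p), cores_for_mps(H, r, p))
```

With p = 0 the expression reduces to (N + 2)W + 2Nr + 2, the finishing moment without disposable frames; larger p shortens the finishing moment of the Nth frame at the cost of more cores.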

Experimental Results

• Test sequences and encoding environment

– The encoder, named FHM10.0, is migrated from the HEVC reference software HM10.0.

– The input videos are standard test sequences of 100 frames each, and the motion search range is set to 64.

– The Main profile is selected, with the default encoding test conditions specified in [8].

– The experimental platform is based on the GX36, a member of the TILERA many-core processor family with 36 processing cores.

• [8] F. Bossen, “Common test conditions and software reference configurations,” JCTVC-I1100, Apr. 2012.

Experimental Results

• Parallel scalability analysis

Conclusion

• Several effective methods, namely a try-and-wait data interface, ping-pong ("ping-pang") storage and a data reuse structure, are presented to realize WPP in parallel on an HEVC encoder.

• Three effective methods are presented to improve the parallel scalability of WPP.

• Experimental results show that the proposed methods improve the maximum parallel scalability by more than 40% compared with WPP alone.