Entropy Slices for Parallel Entropy Coding
K. Misra, J. Zhao and A. Segall



Page 1: Entropy Slices for Parallel Entropy Coding

K. Misra, J. Zhao and A. Segall

Laboratories of America

Page 2: Entropy Slices

Introduction: Entropy Slice
  Introduce partitioning of slices into smaller “entropy” slices
  Entropy slice
    Reset context models
    Restrict definition for neighborhood
    Process identical to current slice by entropy decoder
  Key difference: reconstruction uses information from neighboring entropy slices

[Flowchart: decoder control flow for each picture/slice. Parse slice header, then test entropy_slice_flag.
  N: parse regular slice header, reset CABAC state, define neighbor info for CABAC & reconstruct, entropy decode slice data.
  Y: parse entropy slice header, reset CABAC state, define neighbor info for CABAC, entropy decode slice data, define neighbor info for reconstruct.
  Both branches end with: reconstruct slice.]
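The key difference named on this slide, a restricted neighborhood for entropy decoding but not for reconstruction, can be sketched as two availability tests. This is a minimal Python illustration and not the TMuC implementation: the `Block` record and both function names are hypothetical, and the rule that reconstruction stops at regular slice boundaries is an assumption carried over from H.264-style slices.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    slice_id: int          # regular slice containing the block (hypothetical field)
    entropy_slice_id: int  # entropy slice containing the block (hypothetical field)

def available_for_entropy_decoding(cur: Block, nb: Block) -> bool:
    # Context selection only sees neighbors inside the same entropy slice.
    return cur.slice_id == nb.slice_id and cur.entropy_slice_id == nb.entropy_slice_id

def available_for_reconstruction(cur: Block, nb: Block) -> bool:
    # Assumption: reconstruction may cross entropy slice boundaries,
    # but not regular slice boundaries.
    return cur.slice_id == nb.slice_id

cur = Block(slice_id=0, entropy_slice_id=1)
above = Block(slice_id=0, entropy_slice_id=0)
print(available_for_entropy_decoding(cur, above))  # False
print(available_for_reconstruction(cur, above))    # True
```

Because the first test never looks outside the current entropy slice, each entropy slice can be parsed without waiting for its neighbors.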

Page 3: Entropy Slices

We now introduce major advantages for the entropy slice concept

Advantage #1 - Parallelization
  Entropy slices do not depend on information outside of the entropy slice and can be decoded independently
  Allows for parallelization of entire entropy decoding loop, including context adaptation and bin coding

Advantage #2 - Generalization
  Entropy slices can be used for all entropy coding engines currently under study in the TMuC and TMuC software
  Moreover, we have software available for PIPE and CABAC V2V

[Figure labels: CABAC, PIPE, UVLC]

Page 4: Entropy Slices

Advantage #3 - No impact on single thread/core
  Parallelization capability does not come at the expense of single thread/core applications
  A single thread/core process may
    1. Decode all entropy slices prior to reconstruction
    OR
    2. Decode entropy slice and then reconstruct without neighbourhood reset
  This is friendly to any architecture

[Flowchart: same decoder control flow as on Page 2 (parse slice header, branch on entropy_slice_flag, reset CABAC state, define neighbor info, entropy decode slice data, reconstruct slice).]

Page 5: Entropy Slices

Advantage #4 - Easy Adaptation to Decoder Design
  Bit-stream can be partitioned into a large number of entropy slices with little overhead
    For example, we will show performance of 32 entropy slices for 1080p on next slide - this would translate to ~128 slices for 4k.
  Decoder can schedule N entropy decoders easily, where N is arbitrary
    One example: for 32 slices, architecture with parallelization of 4 (N=4) would assign 8 slices per decoder.
    Another example: for 32 slices, architecture with N=8 would assign 4 slices per decoder
  Additionally, for large resolutions (4k, 8k) possible to scale to 100s of decoders for GPU implementations

[Figure: parallel entropy decoding. For each picture: parse N slice/entropy slice headers, or until the start of the next picture; entropy decode the 1st, 2nd, …, mth slice data in parallel; reconstruct the m slices. N: desired degree of parallelism; m: available slices in current picture.]
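The two scheduling examples on this slide (32 slices with N=4 or N=8) can be reproduced with a short sketch. The round-robin policy and the function name `assign_slices` are assumptions made for illustration; the slides do not prescribe how a decoder distributes entropy slices.

```python
def assign_slices(num_slices: int, num_decoders: int) -> list[list[int]]:
    """Round-robin assignment of entropy slice indices to entropy decoders."""
    assignment = [[] for _ in range(num_decoders)]
    for s in range(num_slices):
        assignment[s % num_decoders].append(s)
    return assignment

for n in (4, 8):
    per_decoder = [len(slices) for slices in assign_slices(32, n)]
    print(n, per_decoder)
# 4 [8, 8, 8, 8]
# 8 [4, 4, 4, 4, 4, 4, 4, 4]
```

Any other policy works equally well, since each entropy slice can be entropy decoded independently of the others.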

Page 6: Entropy Slices

Advantage #5 - Coding Efficiency
  Insertion of Entropy Slices results in negligible impact on coding efficiency.
  For example, if we configure the encoder for a parallelization factor of 32, we get:

Intra
           Y BD-rate   U BD-rate   V BD-rate
Class A    0.4         0.3         0.3
Class B    0.3         0.2         0.2
Class C    0.3         0.1         0.2
Class D    0.2         0.2         0.1
Class E    0.2         0.1         0.0
All        0.3         0.2         0.2

Random access
           Y BD-rate   U BD-rate   V BD-rate
Class A    0.2         0.8         0.5
Class B    0.2         0.6         0.2
Class C    0.1         0.1         0.2
Class D    0.1         0.0         0.1
Class E
All        0.1         0.4         0.2

Low delay
           Y BD-rate   U BD-rate   V BD-rate
Class A
Class B    0.1         0.9         0.3
Class C    0.0         0.3         0.1
Class D    0.0         -0.1        0.2
Class E    0.1         0.6         -0.3
All        0.0         0.4         0.1

Enc Time[%] / Dec Time[%] values as extracted (the table each value belongs to is not recoverable): 111%, 103%, 103%, 101%, 103%, 105%

Page 7: Entropy Slices

Advantage #6 - Specification
  Entropy slices allow simple and direct specification of parallelization at the Profile and Level stage
  This is accomplished by:
    Specifying the maximum number of bins in an Entropy Slice
    Specifying the maximum number of Entropy Slices per picture
  Allows additional specification of PIPE/V2V configurations
    Maximum number of bins per bin coder in an Entropy Slice
  Additional advantage: straightforward to determine conformance at encoder

[Table: level limits, extended with a column for the maximum number of bins in an entropy slice]

Columns:
  Level: Level number
  MaxMBPS: Max macroblock processing rate (MB/s)
  MaxFS: Max frame size (MBs)
  MaxDPB: Max decoded picture buffer size (1024 bytes for 4:2:0)
  MaxBR: Max video bit rate (1000 bits/s, 1200 bits/s, cpbBrVclFactor bits/s, or cpbBrNalFactor bits/s)
  MaxCPB: Max CPB size (1000 bits, 1200 bits, cpbBrVclFactor bits, or cpbBrNalFactor bits)
  MaxVmvR: Vertical MV component range (luma frame samples)
  MinCR: Min compression ratio
  MaxBins: Max number of bins in entropy slice
  MaxMvsPer2Mb: Max number of motion vectors per two consecutive MBs

Level  MaxMBPS  MaxFS   MaxDPB    MaxBR    MaxCPB   MaxVmvR         MinCR  MaxBins  MaxMvsPer2Mb
1      1 485    99      148.5     64       175      [-64,+63.75]    2      -        -
1b     1 485    99      148.5     128      350      [-64,+63.75]    2      -        -
1.1    3 000    396     337.5     192      500      [-128,+127.75]  2      -        -
1.2    6 000    396     891.0     384      1 000    [-128,+127.75]  2      -        -
1.3    11 880   396     891.0     768      2 000    [-128,+127.75]  2      -        -
2      11 880   396     891.0     2 000    2 000    [-128,+127.75]  2      -        -
2.1    19 800   792     1 782.0   4 000    4 000    [-256,+255.75]  2      -        -
2.2    20 250   1 620   3 037.5   4 000    4 000    [-256,+255.75]  2      -        -
3      40 500   1 620   3 037.5   10 000   10 000   [-256,+255.75]  2      M3       32
3.1    108 000  3 600   6 750.0   14 000   14 000   [-512,+511.75]  4      M3.1     16
3.2    216 000  5 120   7 680.0   20 000   20 000   [-512,+511.75]  4      M3.2     16
4      245 760  8 192   12 288.0  20 000   25 000   [-512,+511.75]  4      M4       16
4.1    245 760  8 192   12 288.0  50 000   62 500   [-512,+511.75]  2      M4.1     16
4.2    522 240  8 704   13 056.0  50 000   62 500   [-512,+511.75]  2      M4.2     16
5      589 824  22 080  41 400.0  135 000  135 000  [-512,+511.75]  2      M5       16
5.1    983 040  36 864  69 120.0  240 000  240 000  [-512,+511.75]  2      M5.1     16

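The claim that conformance is straightforward to determine at the encoder can be illustrated with a small check against the two proposed limits. This is a sketch under stated assumptions: the `LevelLimits` record, the function name and the numeric limits are hypothetical, since the slides give the per-level bin limits only symbolically (M3 to M5.1).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LevelLimits:
    max_bins_per_entropy_slice: int       # hypothetical value standing in for e.g. M4
    max_entropy_slices_per_picture: int   # hypothetical value

def picture_conforms(bins_per_entropy_slice: list[int], limits: LevelLimits) -> bool:
    """Check one picture, given the bin count the encoder produced for each entropy slice."""
    if len(bins_per_entropy_slice) > limits.max_entropy_slices_per_picture:
        return False
    return all(b <= limits.max_bins_per_entropy_slice for b in bins_per_entropy_slice)

limits = LevelLimits(max_bins_per_entropy_slice=100_000, max_entropy_slices_per_picture=32)
print(picture_conforms([90_000] * 32, limits))   # True
print(picture_conforms([120_000] * 8, limits))   # False: one slice holds too many bins
```

An encoder counts bins as it writes them, so it can close the current entropy slice and start a new one before the limit is exceeded.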

Page 8: Entropy Slices

Syntax
  Slice header
    Indicate slice is “entropy slice”
    Send only information necessary for entropy decoding

slice_header( ) {                                             C   Descriptor
    first_lctb_in_slice                                       2   ue(v)
    entropy_slice_flag                                        2   u(1)
    if( !entropy_slice_flag ) {
    }
    else {
        if( entropy_coding_mode_flag && slice_type != I )
            cabac_init_idc                                    2   ue(v)
    }
}
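To make the descriptors concrete, the following sketch parses the fields listed in the syntax table from a byte string. It is an illustration rather than the TMuC parser: `BitReader` and `parse_entropy_slice_fields` are hypothetical names, the regular slice header branch is left out because the slide elides it, and `slice_is_intra` is assumed to be known from context because the entropy slice header shown does not carry `slice_type`.

```python
class BitReader:
    """Reads bits MSB-first from a byte string."""
    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current bit position

    def u(self, n: int) -> int:
        """u(n): unsigned integer using n bits."""
        value = 0
        for _ in range(n):
            byte = self.data[self.pos >> 3]
            value = (value << 1) | ((byte >> (7 - (self.pos & 7))) & 1)
            self.pos += 1
        return value

    def ue(self) -> int:
        """ue(v): unsigned Exp-Golomb code."""
        leading_zeros = 0
        while self.u(1) == 0:
            leading_zeros += 1
        return (1 << leading_zeros) - 1 + self.u(leading_zeros)

def parse_entropy_slice_fields(r: BitReader, entropy_coding_mode_flag: bool,
                               slice_is_intra: bool) -> dict:
    hdr = {"first_lctb_in_slice": r.ue(), "entropy_slice_flag": r.u(1)}
    if hdr["entropy_slice_flag"] and entropy_coding_mode_flag and not slice_is_intra:
        hdr["cabac_init_idc"] = r.ue()
    # A regular slice header (entropy_slice_flag == 0) is not parsed in this sketch.
    return hdr

# Bits: 00100 (first_lctb_in_slice = 3), 1 (entropy_slice_flag), 010 (cabac_init_idc = 1)
reader = BitReader(bytes([0b00100101, 0b00000000]))
print(parse_entropy_slice_fields(reader, entropy_coding_mode_flag=True, slice_is_intra=False))
# {'first_lctb_in_slice': 3, 'entropy_slice_flag': 1, 'cabac_init_idc': 1}
```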

Page 9: Conclusions

We have presented the concept of an “entropy slice” for the HEVC system

Advantages include:
1. Parallel entropy decoding (both context adaptation and/or bin coding)
2. Generalization to any entropy coding system under study
3. No impact on serial implementations
4. Easy adaptation to different parallelization factors at the decoder
5. Negligible impact on coding efficiency (<0.2%)
6. Direct path for specifying parallelization at the profile/level stage

Software is available

Page 10: Entropy Slices

In the last meeting, two topics were discussed
1. Size of entropy slice headers
2. Extension to potential architectures that do not decouple parsing and reconstruction

We address these in the next slides…

Page 11: Entropy Slices

Header Size
  Very small (as asserted previously)
  Quantitative
    2 bytes + NALU (1 byte) for 1080p
    Scales for resolutions due to first_lctb_in_slice

slice_header( ) {                                             C   Descriptor
    first_lctb_in_slice                                       2   ue(v)
    entropy_slice_flag                                        2   u(1)
    if( !entropy_slice_flag ) {
    }
    else {
        if( entropy_coding_mode_flag && slice_type != I )
            cabac_init_idc                                    2   ue(v)
    }
}
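Why the header size scales with resolution can be seen from the length of the ue(v) code for first_lctb_in_slice, which grows with the largest LCTB address in the picture. The sketch below computes a worst-case bit count for the fields in the table. Its assumptions are loud ones: the 64x64 LCTB size, raster LCTB addressing and the 0..2 range of cabac_init_idc (as in H.264) are not stated on the slide, and the result is an upper bound that is not expected to reproduce the slide's 2-byte figure exactly.

```python
def ue_bits(value: int) -> int:
    """Length in bits of the ue(v) Exp-Golomb code for a non-negative value."""
    return 2 * ((value + 1).bit_length() - 1) + 1

def lctb_count(width: int, height: int, lctb_size: int) -> int:
    """Number of LCTBs covering the picture (partial LCTBs at the edges count)."""
    per_row = (width + lctb_size - 1) // lctb_size
    rows = (height + lctb_size - 1) // lctb_size
    return per_row * rows

def worst_case_header_bits(width: int, height: int, lctb_size: int,
                           has_cabac_init_idc: bool) -> int:
    last_address = lctb_count(width, height, lctb_size) - 1
    bits = ue_bits(last_address) + 1   # first_lctb_in_slice + entropy_slice_flag
    if has_cabac_init_idc:
        bits += ue_bits(2)             # assumed largest cabac_init_idc value
    return bits

for name, (w, h) in {"1080p": (1920, 1080), "4k": (3840, 2160)}.items():
    print(name, worst_case_header_bits(w, h, lctb_size=64, has_cabac_init_idc=False))
```

Entropy slices that start near the top of the picture have small addresses and therefore shorter headers than this bound.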

Page 12: Entropy Slices

Extension to additional architectures
  At the previous meeting there was interest in extending the method to architectures that do not buffer symbols between parsing and reconstruction
  This anticipates “joint-wave-front” processing of both parsing and reconstruction loops
  We investigated this issue and concluded the following:
    1. In the current TMuC design, we observe that it is not possible to do wavefront processing of the parsing stage.
    2. If we configure the TMuC to support wavefront parsing, the extension of entropy slices is straightforward

Page 13: Entropy Slices

Our approach: provide additional entry-points without neighbor restriction

[Figure: “Entropy slice” entry-points, each marked “EC Init”]

EC Init: Use cabac_init_idc to initialize entropy coder

Page 14: Entropy Slices

[Figure: Entropy + Reconstruction steps: 16]

Page 15: Entropy Slices

Syntax
1. Signal that the bin coding engine will be reset at start of each LCU row
2. Allow signaling cabac_init_idc for the reset

coding_unit( x0, y0, currCodingUnitSize ) {                   C   Descriptor
    if (x0==0 && currCodingUnitSize==MaxCodingUnitSize &&
        lcu_row_cabac_init_idc_flag==true && lcu_id!=first_lcu_in_slice) {
        cabac_init_idc_present_flag                           1   u(1)
        if( cabac_init_idc_present_flag )
            cabac_init_idc                                    2   ue(v)
    }
    a regular coding unit …
}

slice_header( ) {                                             C   Descriptor
    entropy_slice_flag                                        2   u(1)
    if (entropy_slice_flag) {
        first_lcu_in_slice                                    2   ue(v)
        lcu_row_cabac_init_flag                               1   u(1)
        if( lcu_row_cabac_init_flag ){
            lcu_row_cabac_init_idc_flag                       1   u(1)
        }
        if( entropy_coding_mode_flag && slice_type != I) {
            cabac_init_idc                                    2   ue(v)
        }
    }
    else {
        lcu_row_cabac_init_flag                               1   u(1)
        if( lcu_row_cabac_init_flag ){
            lcu_row_cabac_init_idc_flag                       1   u(1)
        }
        a regular slice header ……..
    }
}
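The condition guarding cabac_init_idc_present_flag in coding_unit( ) selects the first LCU of every LCU row except the one that starts the slice. A small sketch makes that visible. The function name, the raster-scan mapping from lcu_id to x0, and the picture dimensions are assumptions chosen for illustration only.

```python
def row_init_syntax_present(x0: int, curr_cu_size: int, max_cu_size: int,
                            lcu_row_cabac_init_idc_flag: bool,
                            lcu_id: int, first_lcu_in_slice: int) -> bool:
    """Mirror of the condition in coding_unit() that guards cabac_init_idc_present_flag."""
    return (x0 == 0
            and curr_cu_size == max_cu_size
            and lcu_row_cabac_init_idc_flag
            and lcu_id != first_lcu_in_slice)

# Hypothetical picture: 4 LCUs wide, 3 LCU rows, 64x64 LCUs, slice starting at LCU 0.
WIDTH_IN_LCUS, NUM_LCUS, LCU_SIZE = 4, 12, 64
entry_lcus = [
    lcu_id for lcu_id in range(NUM_LCUS)
    if row_init_syntax_present((lcu_id % WIDTH_IN_LCUS) * LCU_SIZE, LCU_SIZE, LCU_SIZE,
                               True, lcu_id, first_lcu_in_slice=0)
]
print(entry_lcus)  # [4, 8]
```

LCU 0 is excluded because the slice header already carries the initialization for the start of the slice.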

Page 16: Entropy Slices

Performance

Max parallelism
  Maintain initial 32x parallelization
  Additionally: one entry point for every LCU row (17x for 1080p)
  RD performance: 0.5-1%

Intra
           Y BD-rate   U BD-rate   V BD-rate
Class A    0.5         0.4         0.5
Class B    0.5         0.4         0.4
Class C    0.6         0.5         0.6
Class D    0.6         0.5         0.6
Class E    0.5         0.5         0.6
All        0.5         0.5         0.5

Random access
           Y BD-rate   U BD-rate   V BD-rate
Class A    0.9         1.5         1.2
Class B    1.0         1.5         1.2
Class C    0.9         0.8         0.9
Class D    1.3         1.2         1.3
Class E
All        1.1         1.2         1.2

Enc Time[%] and Dec Time[%]: not available (#NUM! in the source)

4x parallelism
  Maintain initial 32x parallelism
  Additionally: four entry points in the ES (aligned with LCU rows; result 4x speedup)
  RD performance: 0.3%

Intra
           Y BD-rate   U BD-rate   V BD-rate
Class A    0.4         0.4         0.4
Class B    0.3         0.2         0.2
Class C    0.3         0.2         0.3
Class D    0.2         0.2         0.1
Class E    0.2         0.2         0.2
All        0.3         0.2         0.2

Random access
           Y BD-rate   U BD-rate   V BD-rate
Class A    0.4         1.2         1.0
Class B    not available (#VALUE! in the source)
Class C    0.2         0.3         0.2
Class D    0.1         0.0         0.1
Class E
All        not available (#VALUE! in the source)

Enc Time[%] and Dec Time[%]: not available (#NUM! in the source)
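The "17x for 1080p" figure for one entry point per LCU row is consistent with counting LCU rows for a 64x64 LCU. The LCU size is an assumption here (the slide does not state it), and `lcu_rows` is a hypothetical helper.

```python
def lcu_rows(height: int, lcu_size: int = 64) -> int:
    """Number of LCU rows, counting a partial row at the bottom of the picture."""
    return (height + lcu_size - 1) // lcu_size

print(lcu_rows(1080))  # 17
```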

Page 17: Entropy Slices

Conclusion
  Entropy slices well tested and flexible
    Demonstrated in multiple environments (JM, JMKTA, TMuC)
    Demonstrated with CABAC and CAV2V
    Friendly to serial and parallel architectures (including both decoupled and coupled parsing/reconstruction architectures)

From the last meeting: “The basic concept of desiring enhanced high-level parallelism of the entropy coding stage to be in the HEVC design is agreed.”

We propose
1. Adoption of the entropy slice technology into the TM
2. Evaluation of the “joint-wavefront” extension in a CE