A Hardware Accelerator IP for EBCOT Tier-1 Coding in JPEG2000 Standard

Tien-Wei Hsieh, Youn-Long Lin

Department of Computer Science

National Tsing Hua University

TAIWAN

2004/6/16

Abstract

Proposition
– 16-bit parallel context generator
– Stripe-skipping method
– 3-stage pipelined arithmetic encoder
– Renormalization strategy with forwarding method

Contribution
– We reduce the cycle count by 17% compared with the best-known design
– We come within 5% of the optimum


Outline

– Introduction
– Previous work
– Proposed architecture
– Experimental results
– Conclusion


JPEG2000 Image Coding

[Figure: coding flow. Original Image Data → Pre-process → Discrete Wavelet Transform (DWT) → Quantization → Block Coding (Tier-1 Coding) → Bit-stream Organization (Tier-2 Coding) → Compressed Image Data. Tier-1 and Tier-2 coding together form EBCOT (Embedded Block Coding with Optimized Truncation).]


EBCOT Tier-1 Time Consuming

Platform: Pentium 4 2.8GHz, 736MB RAM, Microsoft Windows XP, VC++

Reference software: JPEG 2000 JasPer 1.500.4

Test pattern: 512x512 gray image, 1 tile, 5/3 DWT, 3 decomposition levels, code-block size 64x64

Share of encoding time:

DWT            13.325 %
EBCOT Tier-1   71.625 %
EBCOT Tier-2    1.725 %
others         13.325 %
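As a rough bound on what accelerating Tier-1 alone can buy, Amdahl's law can be applied to this profile. This is a back-of-the-envelope estimate rather than a figure from the slides; it assumes that the 71.625 % share belongs to EBCOT Tier-1 and that all other stages keep their software run time. The symbol names are illustrative.

```latex
S_{\max} = \frac{1}{1 - f_{\text{Tier-1}}}
         = \frac{1}{1 - 0.71625}
         = \frac{1}{0.28375}
         \approx 3.52
```

Under these assumptions, even an infinitely fast Tier-1 coder would speed up the whole encoder by at most about 3.5 times.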


EBCOT Tier-1 Block Coding (Context-based adaptive binary arithmetic coding)

[Figure: the subbands (LL, LH, HL, HH) from DWT & Quantization are partitioned into code-blocks 1 to N. Block coding applies Context Formation (CF) and then the Arithmetic Encoder (AE), linked by context and decision, to each code-block and sends sub-bitstreams 1 to N to Tier-2.]


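Bit-Plane Division of Code-block

[Figure: each pixel of the code-block is drawn as a column of bits, one sign bit followed by the magnitude bits from MSB to LSB, and each bit position forms a bit-plane. A pixel is insignificant until its first 1 magnitude bit has been coded and significant afterwards. Example column: 1 0 0 1 1 0 0 0.]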
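The bit-plane division can be illustrated with a small sketch. It assumes a sign-magnitude representation with a fixed number of magnitude bits; the function names `bit_planes` and `first_significant_plane` are hypothetical and not taken from the slides.

```python
def bit_planes(coeff, n_bits):
    """Split a signed coefficient into a sign bit and magnitude bits, MSB first."""
    sign = 1 if coeff < 0 else 0
    magnitude = abs(coeff)
    bits = [(magnitude >> p) & 1 for p in range(n_bits - 1, -1, -1)]
    return sign, bits


def first_significant_plane(bits):
    """Index (0 = MSB) of the bit-plane in which the coefficient becomes significant."""
    return bits.index(1) if 1 in bits else None


sign, bits = bit_planes(-19, 7)
print(sign, bits, first_significant_plane(bits))
# prints: 1 [0, 0, 1, 0, 0, 1, 1] 2
```

In this example the coefficient is insignificant in the two most significant bit-planes and significant from the third one on.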

Scanning Each Bit-plane 3 Times

– 4 bits in a column
– N stripes in a bit-plane
– M columns in a stripe
– Scan nesting: pass > stripe > column > bit
– Code-block size is 4N x M
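The nesting pass > stripe > column > bit can be written as three loops per pass. The following is a minimal sketch of the scan order for one pass over a 4N x M code-block; the generator name `scan_order` is illustrative.

```python
def scan_order(n_stripes, m_columns):
    """Yield (row, col) positions in the order one coding pass visits them."""
    for stripe in range(n_stripes):        # N stripes in a bit-plane
        for col in range(m_columns):       # M columns in a stripe
            for bit in range(4):           # 4 bits in a column
                yield (4 * stripe + bit, col)


order = list(scan_order(n_stripes=2, m_columns=3))
print(order[:6])
# prints: [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1)]
```

Each of the three passes walks the bit-plane in this same order and codes only the bits that belong to it.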

Coding a Bit-plane

[Figure: an 8x8 example bit-plane with its significant bits marked, shown once per pass with the bits coded in that pass highlighted.]

0 0 1 1 0 1 0 1
1 1 0 0 0 1 0 0
1 0 0 0 1 1 1 0
0 0 1 1 0 0 1 0
0 1 0 0 0 1 1 0
0 1 1 0 0 0 1 1
1 1 0 0 1 0 0 0
1 0 0 0 0 0 1 1

Pass 1: insignificant bits with significant neighbors
Pass 2: significant bits
Pass 3: bits not coded in the previous two passes
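The three-pass membership rule can be sketched as follows. This is a simplified model, not the authors' hardware: it assumes 8-connected neighbors, treats samples outside the code-block as insignificant, and updates significance during Pass 1 in stripe scan order. The names `neighbors` and `classify_passes` are hypothetical.

```python
def neighbors(r, c, rows, cols):
    """Yield the in-bounds 8-connected neighbors of (r, c)."""
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols:
                yield r + dr, c + dc


def classify_passes(bits, significant):
    """Return, for each sample, the pass (1, 2 or 3) that codes it in this bit-plane.

    bits:        current bit-plane, rows of 0/1
    significant: significance state before this bit-plane, rows of booleans
    """
    rows, cols = len(bits), len(bits[0])
    sig = [row[:] for row in significant]        # working copy, updated in Pass 1
    passes = [[3] * cols for _ in range(rows)]   # default: Pass 3
    for top in range(0, rows, 4):                # stripe
        for c in range(cols):                    # column
            for r in range(top, min(top + 4, rows)):   # bit
                if significant[r][c]:
                    passes[r][c] = 2             # already significant: Pass 2
                elif any(sig[nr][nc] for nr, nc in neighbors(r, c, rows, cols)):
                    passes[r][c] = 1             # insignificant with significant neighbor
                    if bits[r][c]:
                        sig[r][c] = True         # becomes significant for later samples
    return passes


bits = [[1, 0], [1, 0], [0, 0], [0, 1]]
significant = [[True, False], [False, False], [False, False], [False, False]]
print(classify_passes(bits, significant))
# prints: [[2, 1], [1, 1], [1, 1], [3, 3]]
```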



Previous work

Context Formation
– Normal mode
  NTU: Skipping methods, Sample Skipping (SS) and Group-of-Column Skipping (GOCS)
  – Reduce 60% cycle count compared with straightforward method
  NCTU: Memory-saving algorithm
  – Reduce 4K bits memory space if the code-block size is 64x64
– Pass-parallel mode
  TKU: Pass-parallel context modeling
  – No cycle wasted
  – 0.1 ~ 0.2 dB image quality degradation

Arithmetic Encoder
– MQ coder (JBIG)
  Osaka University: 4-stage pipelined architecture


Proposed EBCOT Tier-1 Coder

[Figure: block diagram. Blocks: Address Generator, Code Block Memory, State Memory, Context Formation (CF), Compress & PISO, Arithmetic Encoder (AE). Signals: Pixel_in, two 16-bit buses, 40 (CX, D)s, (CX, D), Byte_out.]


Context Formation Unit

[Figure: block diagram of the proposed Tier-1 coder, repeated to introduce the Context Formation (CF) unit.]


Data Dependency for 16 Bits

 1  2  3  4
 5  6  7  8
 9 10 11 12
13 14 15 16

[Figure: timing chart over cycles 1 to 10 showing the delay of each of the 16 bits.]

16-Way Parallel Architecture

[Figure: 16-way parallel datapath with units numbered 1 to 16.]

Memory Scheme

[Figure: mapping of Stripe N and Stripe N+1 onto Memory A, Memory B, and Memory C, and the ORDER in which the three memories are used.]

Stripe Skipping

3 registers record the coding condition of all stripes in the 3 passes
– A stripe is skipped in Pass 1 if all bits in the stripe are significant
– A stripe is skipped in Pass 2 if all bits in the stripe are insignificant
– A stripe is skipped in Pass 3 if all bits in the stripe have been coded in Pass 1 or 2
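The three skip conditions can be sketched in software as follows. The layout of one bit per stripe in each of the three registers is an assumption made for illustration, as are the names `stripe_skip_flags` and `build_skip_registers`; the slides do not give the register format.

```python
def stripe_skip_flags(significant, coded_in_pass1):
    """Return (skip_pass1, skip_pass2, skip_pass3) for one stripe.

    significant:    per-bit significance before this bit-plane
    coded_in_pass1: per-bit flag, True if the bit was coded in Pass 1
    """
    skip1 = all(significant)                      # nothing left for Pass 1
    skip2 = not any(significant)                  # nothing to refine in Pass 2
    # A bit is coded in Pass 2 exactly when it was already significant.
    skip3 = all(s or p for s, p in zip(significant, coded_in_pass1))
    return skip1, skip2, skip3


def build_skip_registers(stripes):
    """Pack the flags of all stripes into three integers (assumed: bit i = stripe i)."""
    regs = [0, 0, 0]
    for i, (significant, coded_in_pass1) in enumerate(stripes):
        for p, skip in enumerate(stripe_skip_flags(significant, coded_in_pass1)):
            if skip:
                regs[p] |= 1 << i
    return regs


stripes = [
    ([True] * 8, [False] * 8),             # all significant
    ([False] * 8, [False] * 8),            # all insignificant, none coded in Pass 1
    ([False] * 8, [True] * 8),             # all insignificant, all coded in Pass 1
    ([True, False] * 4, [False, True] * 4) # mixed, every bit coded in Pass 1 or 2
]
print(build_skip_registers(stripes))
# prints: [1, 6, 13]
```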


Arithmetic Encoder

[Figure: block diagram of the proposed Tier-1 coder, repeated to introduce the Arithmetic Encoder (AE).]

Feedback Loop in AE Flow Chart

[Figure: AE flow chart. Context Decision → Context Table (index, mps) → Table Reading from the Probability Estimation Table (PET: Qe, NMPS, NLPS, SWITCH) → A Calculation and C Calculation → Renormalization → Byte. Index Updating and MPS Updating write back to the Context Table, which closes the feedback loop.]

Proposed Pipelined AE

[Figure: 3-stage pipeline (Stage 1, Stage 2, Stage 3). Blocks shown: Context Decision, Context Table, Table Reading, Modified Probability Estimation Table (MPET), Index Updating, MPS Updating, A Calculation, Bit Shifting, C Calculation, Renormalization, Byte.]

Fast Renormalization

[Figure: two renormalization flow charts. Both test "CT > A_shift ?".
– YES: C = C << A_shift, CT = CT – A_shift, then DONE.
– NO: C = C << CT, A_shift = A_shift – CT, then a byte is output.
One chart sets CT = 0, performs BYTEOUT, and repeats the test. The other adds a "Twice ?" decision that selects between BYTEOUT and BYTEOUT2.]
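The idea of shifting by several bits at once can be checked against bit-by-bit renormalization with a small model. This is a sketch under stated assumptions, not the authors' circuit: A is a 16-bit interval register with 0 < A < 0x8000 on entry, CT counts the bits left before the next byte, and `byteout` is a simplified stand-in that ignores carry propagation and bit stuffing. All function names are hypothetical.

```python
def byteout(c, out):
    """Simplified BYTEOUT: emit 8 bits of C and reload CT (no carry, no bit stuffing)."""
    out.append((c >> 19) & 0xFF)
    return c & 0x7FFFF, 8


def renorm_bitwise(a, c, ct, out):
    """Reference: shift A and C one bit per iteration until bit 15 of A is set."""
    while True:
        a <<= 1
        c <<= 1
        ct -= 1
        if ct == 0:
            c, ct = byteout(c, out)
        if a & 0x8000:
            return a, c, ct


def renorm_fast(a, c, ct, out):
    """Multi-bit version following the 'CT > A_shift ?' test of the flow chart."""
    a_shift = 16 - a.bit_length()      # number of shifts A needs
    a <<= a_shift
    while not ct > a_shift:            # NO branch: a byte boundary is reached
        c <<= ct
        a_shift -= ct
        c, ct = byteout(c, out)
    c <<= a_shift                      # YES branch: finish the remaining shifts
    ct -= a_shift
    return a, c, ct


out_ref, out_fast = [], []
print(renorm_bitwise(0x0001, 0x12345, 8, out_ref) == renorm_fast(0x0001, 0x12345, 8, out_fast))
# prints: True
```

In this model the fast version needs one loop iteration per output byte rather than one per shifted bit.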


Compress & Parallel In Serial Out

[Figure: block diagram of the proposed Tier-1 coder, repeated to introduce the Compress & PISO unit between Context Formation (CF) and the Arithmetic Encoder (AE).]


Interaction Between CF and AE

For example, CF generates 4, 2, 0, 0, 0, 2, 1, 2, 0, and 0 (CX, D) pairs respectively.

[Figure: timing diagram of Clock, CF_stall, and AE_stall for this sequence.]


Overlapping CF and AE

[Figure: two timing diagrams of Clock, CF_stall, and AE_stall for the sequence 4 2 0 0 0 2 1 2 0 0.]
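The benefit of overlapping can be estimated with a toy cycle model. The model is an assumption made for illustration, not the timing of the actual design: CF needs one cycle to form each batch, AE codes one (CX, D) pair per cycle, and there is a single buffer between them. The function names are hypothetical.

```python
def cycles_without_overlap(batches):
    """CF forms a batch in one cycle, then waits while AE codes its k pairs."""
    return sum(1 + k for k in batches)


def cycles_with_overlap(batches):
    """CF forms batch i+1 while AE codes batch i; each step lasts as long as the slower side."""
    if not batches:
        return 0
    total = 1                          # the first batch has nothing to overlap with
    for k in batches[:-1]:
        total += max(1, k)             # AE codes k pairs while CF forms the next batch
    return total + batches[-1]         # AE drains the last batch


batches = [4, 2, 0, 0, 0, 2, 1, 2, 0, 0]
print(cycles_without_overlap(batches), cycles_with_overlap(batches))
# prints: 21 16
```

In this model CF stalls whenever a batch holds more than one pair, and AE stalls whenever a batch is empty.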


AHB Interface

[Figure: block diagram of the IP core on the AHB. Blocks shown: Slave Interface, Slave Controller, Slave Transaction, Register Block, Master Interface, Master Controller, Tier-1 Encoder (Context Formation, Arithmetic Encoder), and Memory Block.]


Objective of Experiment

The objective of our experiment is to show that the design is
– Low power
– High performance
– AHB-compliant

Test pattern
– 512x512 gray images (airplane, baboon, lena, peppers)
– 1 tile
– 5/3 DWT
– 3 decomposition levels
– Code-block size 64x64


IP Qualification & Code Coverage

IP qualification (nLint)
– Compliant with RMM guidelines

Code coverage (Verification Navigator)

                     Our design   General expectancy
Statement coverage   97.9%        95%
Branch coverage      95.8%        95%
Toggle coverage      100%         95%
Path coverage        69.2%        50%

Design for testability (TetraMAX)

Total faults   Test patterns   Test coverage
77,200         439             99.99%


Synthesis and power analysis

Synthesis tool : Design Compiler (under WCCOM)

Power analysis tool : PrimePower

                       Our design
Technology library     TSMC .35
Area (gate count)      25,706 + 45kb memory
Max. Frequency (MHz)   43.48
Power (mW)             26.68


Composition of coding cycle

Simulation tool : ModelSim SE/PE 5.7e

(unit: 1,000,000 cycles)

           # of contexts (CX, D)s   # of BYTEOUTs   # of AE stalls   Total
Airplane   1.32                     0.14            0.12             1.58
Baboon     1.75                     0.22            0.13             2.1
Lena       1.46                     0.17            0.12             1.75
Peppers    1.35                     0.16            0.11             1.62


Cycle reduction

[Figure: bar chart of coding cycles for Airplane, Baboon, Lena, and Peppers (unit: 1,000,000 cycles), annotated "2% reduction by stripe-skipping" and "9% reduction by proposed renormalization". Bar labels: 0.078, 0.084, 0.069, 0.077 and 0.006, 0.005, 0.007, 0.005.]


Comparison

( Unit: 1,000,000 cycles )

           Our design   Column-based design   Lower bound
Airplane   1.41         1.76                  1.32
Baboon     1.83         2.1                   1.75
Lena       1.54         1.88                  1.46
Peppers    1.43         1.74                  1.35
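The headline figures of 17% and 5% can be cross-checked from the chart values. The sketch below sums each series over the four images, so the result does not depend on which bar belongs to which image; the assignment of the bar labels to the three series is an assumption read off the chart.

```python
# Cycle counts in millions, one entry per test image (series assignment assumed).
our_design = [1.41, 1.83, 1.54, 1.43]
column_based = [1.76, 2.10, 1.88, 1.74]
lower_bound = [1.32, 1.75, 1.46, 1.35]

reduction = 1 - sum(our_design) / sum(column_based)
gap = sum(our_design) / sum(lower_bound) - 1

print(f"reduction vs column-based design: {reduction:.1%}")
print(f"above the lower bound: {gap:.1%}")
# prints:
# reduction vs column-based design: 17.0%
# above the lower bound: 5.6%
```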


Platform Architecture – Altera Excalibur EPXA10DDR

[Figure: platform block diagram. The Embedded Stripe contains the ARM922T Processor, Interrupt Controller, WatchDog Timer, AHB1, AHB2, AHB1 to AHB2 Bridge, DPRAM 128KB, SRAM 256KB, SDRAM Controller, UART Controller, External Bus Interface, PLD to Stripe Bridge, and Stripe to PLD Bridge, with SDRAM 128MB and FLASH 32MB attached. The FPGA contains PLD Masters 1 to 3 and PLD Slaves 1 to 4.]


Platform-based SOC Design Flow

[Figure: design flow chart. Tool legend: ADS; SOPC Builder; Quartus II; ESS, ADS and ModelSim SE. Steps shown: System spec.; Profiling & HW/SW partition; Software spec.; HW spec. for each component; FPGA design; Stripe configuration; Software coding (Library, user-defined firmware, *.c files); BUS interface HDL coding; Device HDL coding; HDL files Interface.v, Accelerator.v, Component.v, Stripe.v; Integration (SOPC Builder); System PTF; SOPC generation; Excalibur.h, Stripe.h, Component SDK; Pin assignment & Hardware compilation (Quartus II); Compilation (ADS); RTL codes; HW/SW co-simulation; Hardware image; Software image; System building (Quartus II); System image; SOPC platform; Prototyping.]

FPGA Prototyping Result

Platform: Altera Excalibur EPXA10DDR, 25MHz

[Figure: stacked bar chart of encoding time in seconds (axis 0 to 120), split into Tier-1 and Others, for Airplane, Baboon, Lena, and Peppers under pure software and under the proposed accelerator. Bar labels: 100.98, 80.18, 81.47, 85.34 and 26.99, 31.2, 27.6, 27.78.]


Summary

Proposition
– 16-bit parallel context generator
– Stripe-skipping method
– 3-stage pipelined arithmetic encoder
– Renormalization strategy with forwarding method

Contribution
– We reduce the cycle count by 17% compared with the best-known design
– We come within 5% of the optimum


Future Work

– Ping-pong method for the “compress & PISO” to reduce the 5% coding cycles
– ASIC Integration


Demo on the SoC platform

Pure software (100 sec.)
– Configure the FPGA
– Load the original image to SDRAM
– Execute the JPEG2000 encoder
– Get the compressed image from SDRAM
– Record the time consumed

Proposed accelerator (50 sec.)
– Configure the FPGA
– Load the original image to SDRAM
– Execute the JPEG2000 encoder
– Get the compressed image from SDRAM
– Compare images and time consumed

Thank you!!
