

Microprocessors and Microsystems 38 (2014) 659–668

journal homepage: www.elsevier.com/locate/micpro

An efficient hardware implementation of MQ decoder of the JPEG2000

http://dx.doi.org/10.1016/j.micpro.2014.06.005
0141-9331/© 2014 Elsevier B.V. All rights reserved.

⇑ Corresponding author. E-mail addresses: [email protected] (L. Horrigue), Saidani_taoufiki2v@yahoo.com (T. Saidani), [email protected] (R. Ghodhbani), [email protected] (J. Dubois), [email protected] (J. Miteran), [email protected] (M. Atri).

Layla Horrigue a,⇑, Taoufik Saidani a, Refka Ghodhbani a, Julien Dubois b, Johel Miteran b, Mohamed Atri a

a Electronics and Micro-Electronics Laboratory, Faculty of Sciences, Monastir, Tunisia
b University of Burgundy, Laboratory Le2i, UMR CNRS 6063, 21000 Dijon, France

Article info

Article history: Available online 11 July 2014

Keywords: JPEG-2000, MQ-decoder, Implementation, FPGA

Abstract

JPEG2000 is an international standard for still images intended to overcome the shortcomings of the existing JPEG standard. Compared to JPEG image compression techniques, the JPEG2000 standard not only has better compression ratios but also offers some attractive features. As it is hard to meet the real-time requirements of image compression systems in software, it is necessary to implement the compression system in hardware. The MQ decoder of the JPEG2000 standard is an important bottleneck for real-time applications. In order to meet the real-time requirement, we propose in this paper a novel architecture for an MQ decoder with high throughput, which is comparable to that of other architectures and suitable for chip implementation. This architecture has been implemented in the VHDL hardware description language and synthesized using Xilinx's and Altera's design flows, ISE 13.1 and Quartus respectively. The implementation results show that the design operates at 439.5 MHz when implemented on Virtex-6, and the estimated frame rate at this frequency is 63.24 frames per second (FPS). On a Stratix III device, the design operates at 214.4 MHz and the hardware cost is very low. Hardware overhead is minimized to a great extent because the structure of the probability estimation table (PET) is replaced by a small PET ROM. The memory bits used in the architecture are reduced significantly. The use of a dedicated probability estimation table decreases the internal memory.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

The JPEG2000 standard is an image compression standard ratified by the Joint Photographic Experts Group (JPEG) in December 2000 [1]. JPEG2000 offers numerous advantages over the previous JPEG standard. The features supported by JPEG2000 include lossy and lossless compression, continuous-tone and bi-level compression, progressive transmission by pixel accuracy and resolution, region-of-interest coding, compressed-domain processing, and error resilience [2,3]. Such characteristics add to the functionality of a system that employs JPEG2000 as an image compression technique. The features and performance of JPEG2000 make this standard superior to JPEG. Yet the computational complexity of JPEG2000 is much higher than that of JPEG. This complexity is due to EBCOT, which is the most important algorithm employed in JPEG2000 [4]. The JPEG2000 decoder algorithm is profiled in [4,5], with both concluding that the entropy decoding operation uses the largest percentage of the processing time and has the largest memory requirement. Thus the entropy decoder is the best candidate for hardware acceleration. The MQ decoding procedure is intrinsically sequential and it is therefore very difficult to speed up this module. To improve its performance, MQ decoder architectures that are capable of providing a decision in one clock cycle have been proposed in [9,11,12,14]. However, such architectures require a large silicon area and operate at low speed. As a result, the throughput of such MQ decoders is not as high as expected. To avert this situation, a new methodology is proposed to implement the probability estimation table (PET), which reduces the memory requirement of the design. This paper presents a cost-effective high-speed MQ decoder architecture described in the Very High Speed Integrated Circuit Hardware Description Language (VHDL). The results of the proposed design are compared to various FPGA implementations.

After the introduction, Section 2 provides a brief overview of the JPEG2000 algorithm and the MQ decoder algorithm. Section 3 describes some MQ decoder architectures that have been proposed in the literature. The proposed architecture of the MQ decoder is presented in Section 4. In Section 5, experimental results are provided and compared with other architectures. Finally, conclusions are drawn in Section 6.


2. Standard JPEG-2000

2.1. The JPEG2000 decoding process

In this section the JPEG2000 decoder algorithm is described briefly. As shown in Fig. 1, the JPEG2000 decoding process is composed of a bit stream parser (tier-2), an entropy block decoder (tier-1), a de-quantization module, an inverse discrete wavelet transform module, a color transform module and a tile combiner [9].

2.1.1. Bit stream parser

The bit stream parser is the part of the JPEG2000 algorithm commonly known as Tier 2. The parser extracts the required data from the image headers. After all the required image information is obtained, the parser locates the compressed bit stream for each code block to pass through to the entropy block decoder. The code block is an array of coded wavelet coefficients that are processed by the arithmetic decoder.

2.1.2. The entropy block decoder

As shown in Fig. 1, the entropy block decoder is subdivided into two principal processing stages: the decoding passes, commonly known as embedded block coding with optimized truncation (EBCOT), and the binary arithmetic decoder, referred to as the MQ decoder. The EBCOT determines the significance of every bit in the image by using the decoded values and the surrounding bits [15]. When the EBCOT finds a bit to be significant, the pass provides the MQ decoder with a context CX for that bit based on the status of the surrounding bits. The EBCOT must then wait for the decision value returned by the MQ decoder before the surrounding bits can be processed. Tier 1 processes the compressed bit stream and outputs quantized wavelet coefficients, which make up the code block as shown in Fig. 1.

2.1.3. Inverse quantization

To increase the number of bits used to represent the coefficients output by the arithmetic decoder, the inverse quantization uses the step size extracted from the image header information. This module multiplies the coefficients by the step size. If the step size is equal to one, or if the lossless encoding mode is selected, the inverse quantization step is skipped.
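The inverse quantization rule described above can be illustrated with a minimal Python sketch. It assumes only what the text states (a multiplication by the step size, skipped when the step size is one or the lossless mode is selected); the function name `dequantize` and its parameters are hypothetical and not taken from the paper.

```python
def dequantize(coeffs, step_size, lossless=False):
    """Scale decoded coefficients by the step size, as described in the text.

    The step is skipped when the step size is 1 or the lossless mode is used.
    """
    if lossless or step_size == 1:
        return list(coeffs)  # coefficients pass through unchanged
    return [c * step_size for c in coeffs]


if __name__ == "__main__":
    print(dequantize([4, -2, 0, 7], 0.5))
    print(dequantize([4, -2, 0, 7], 0.5, lossless=True))
```

A production decoder may apply further refinements when reconstructing coefficients; the sketch only mirrors the multiplication described in this section.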

2.1.4. Inverse discrete wavelet transform

After the decoding and de-quantization stages, the inverse discrete wavelet transform (IDWT) converts the coefficients into raw image data. The inverse discrete wavelet transform applies a high-pass filter and a low-pass filter a number of times equal to the number of levels [16]. This number of discrete wavelet transform levels is found by the bit stream parser in the header.

Fig. 1. JPEG2000 decoder block diagram.

2.1.5. Color transform

The color transform is performed when the image has more than one color plane. When the wavelet coefficients are converted into pixels by the inverse discrete wavelet transform, the color transform converts these pixels from the YCbCr (luminance, blue chrominance, and red chrominance) color space to the RGB (red, green, and blue) color space [17].

2.1.6. Tile combiner

The JPEG2000 standard allows the original image to be broken into multiple tiles. The entropy decoding, inverse quantization, inverse DWT, and color transform are performed separately on each tile if multiple tiles are used in the encoder. When the image data is decoded, the tile combiner arranges the rows of each tile to align the image in raster order. The tile combiner is not invoked when only one tile is used.

2.2. Overview of MQ decoder

2.2.1. The binary arithmetic encoder

The compression technique adopted in the JPEG2000 standard is a statistical binary arithmetic coding, which is called the MQ coder and is based on the recursive probability subdivision of Elias coding [1,3,18,15]. Each binary decision is coded by subdividing the current probability interval into two sub-intervals. The code string then points to the lower bound of the probability interval.

The MQ coder [6,7] is used to map the input binary decision into a more probable symbol (MPS) or a less probable symbol (LPS), as shown in Fig. 2. Whether an MPS or an LPS is coded, the new interval is shorter than the original one. In order to solve the finite-precision problems that arise when the length of the probability interval falls below a certain minimum size, the interval must be renormalized to become greater than the minimum bound [15].

Fig. 2. MQ Coder procedure.

2.2.2. The binary arithmetic decoder

The binary arithmetic decoder in the JPEG2000 standard is commonly known as the MQ decoder. As with the encoder, the decoder works by adjusting an interval $[c_n, c_n + a_n) \subseteq [0, 1)$. The Multi Quotient (MQ) variant of the arithmetic coder and decoder is a special multiplier-free version, which uses finite-precision arithmetic and bit stuffing to handle carry propagation [19]. The MQ decoder uses a 16-bit register A to store the interval length and a 32-bit register C to store part of the interval lower bound. The top 16 bits of C represent the active region of the code word, $C_{active}$. The current probability estimate of the Least Probable Symbol (LPS) is denoted p; the other symbol is the Most Probable Symbol (MPS). The bits are decoded according to the following simplified equations, which do not show the conditional exchange that may reverse the decision x when p is close to ½.

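$$
x = \begin{cases} \mathrm{LPS} & \text{if } C_{active} < p \\ \mathrm{MPS} & \text{else} \end{cases} \tag{1}
$$

The A and C registers are updated according to the following equations:

$$
A = \begin{cases} p & \text{if } C_{active} < p \\ A - p & \text{else} \end{cases} \tag{2}
$$

$$
C = \begin{cases} C & \text{if } C_{active} < p \\ C - p & \text{else} \end{cases} \tag{3}
$$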

The register structures for the arithmetic decoder are given in Table 1:

• The A register is a 16-bit interval register that contains the value of the current interval.
• The C register is the code register containing the partial coded bits at every stage of decoding.
• The Chigh-register and Clow-register can be considered as one 32-bit C-register, in that renormalisation of C shifts a bit of new data from bit 15 of Clow to bit 0 of Chigh.
• The "a" bits represent the fractional bits in the current interval value (A-register) and the "x" bits are the fractional bits in the C register. The "s" bits represent the spacer bits, which provide useful constraints on carry-over, and the "b" bits indicate the bit positions from which the completed bytes of the data are removed from the C-register. The "c" bit is a carry bit.
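As an illustration, the simplified decision rule of Eqs. (1)-(3) can be written as a short Python sketch operating on the 16-bit A register and the 32-bit C register. The function name is hypothetical, and subtracting p from the upper half of C (the active region) is an assumption made here to match the register layout; like the equations, the sketch ignores conditional exchange and renormalization.

```python
def simplified_decode(a, c, p):
    """Apply Eqs. (1)-(3): return (symbol, new_a, new_c).

    a: 16-bit interval register, c: 32-bit code register,
    p: probability estimate compared against the active region of C.
    """
    c_active = (c >> 16) & 0xFFFF      # top 16 bits of C
    if c_active < p:
        return "LPS", p, c             # Eq. (1) LPS branch, A = p, C unchanged
    # MPS branch: shrink the interval and remove p from the active region
    return "MPS", a - p, c - (p << 16)


if __name__ == "__main__":
    print(simplified_decode(0x8000, 0x10000000, 0x5601))
    print(simplified_decode(0x8000, 0x60000000, 0x5601))
```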

The MQ decoder receives contexts from the decoding passes and uses each context to determine the probability estimate between the Least Probable Symbol (LPS) and the Most Probable Symbol (MPS). According to the probability estimate and the current state of the decoder, the decision bit is either the LPS or the MPS. The output bit is returned to the decoding passes to be placed in the correct location for the final decoded image and to be used in determining the significance of surrounding bits. As shown in Fig. 1, the context CX and the compressed data (Bitstream) are processed together to produce the output bit value, namely Decision. There are 19 contexts (integers 0–18) in the JPEG2000 standard, and each context is represented in 5 bits [1,8,9].

Table 1
The register structures for the MQ decoder.

| Register | Bit pattern (MSB to LSB) |
|---|---|
| C-register | 0000 cbbb bbbbbssss xxxxxxxx xxxxxxxx |
| A-register | 0000 0000 0000 0000 aaaaaaaa aaaaaaaa |
| Chigh-register | xxx xxxx xxxx xxxx 00000000 00000000 |
| Clow-register | 0000 0000 0000 0000 bbbb bbbb 00000000 |

2.2.3. MQ decoder's algorithm

As shown in Fig. 3, the MQ decoder's algorithm is initialized through the INITDEC procedure. Each context (CX) is read and passed on to the DECODE procedure until all contexts have been read. The DECODE procedure, which contains four procedures (LPS_EXCHANGE, MPS_EXCHANGE, RENORMED and BYTEIN), decodes the binary decision D and returns a value of either 0 or 1. The probability estimation table, which provides adaptive estimates of the probability for each context, is embedded in DECODE. When all contexts have been read, the compressed data has been decompressed. A more detailed description of the decoding algorithm is given in [1,8,9].

3. Existing architecture

Many architectures of the MQ decoder have been proposed in the literature to implement the JPEG2000 decoder in hardware description languages (HDL) for use in FPGAs [9,11,12,14].

The architecture proposed in [9] is composed of four stages: Load, Compute, Decide, and Renormalization. These stages are used to speed up the decoding. The first stage loads the index values of the context from the look-up tables. The Compute stage performs the arithmetic operations once the look-up table values are loaded. The Decide stage then uses the calculated outputs to update the internal registers, the decision bit, the RAM (I(CX)) and the RAM (MPS). The Renormalization stage is used when the next value of the A register is less than hex 8000. This architecture is implemented on a Virtex-2 XC2V6000-6 platform and operates at 157.2 MHz. Moreover, the throughput of this design is 9.4 MB/s.

The MQ decoder architecture proposed in [11] is composed of four stages. The first stage involves the load RAM MQ, which behaves as a normal RAM. The second stage, Compute, contains three subtractors and an Nbshift module, which calculates the number of shifts required for (A > hex 8000). This information is used in the Renorme stage. The multiplexers located before and after the Nbshift block distribute the calculation of the two possible shift values over the Compute and Decide stages. The two shift blocks of the Renorme stage shift the A register and the C register by the number provided by Nbshift. This architecture is able to provide a decision three clock cycles after receiving a context. On a Virtex-2 XC2V6000-6 platform, the implementation results show that the design operates at 108 MHz and occupies 1.6% of the memory.

A high-speed, area-efficient architecture for the MQ decoder is proposed in [12]. This architecture is subdivided into three main modules, namely "Compute and Decide", "Renorm" and "Bytein". The maximum operating speed of this architecture, 222.8 MHz, is observed on Virtex-5 devices, and the estimated frame rate is 32.02 FPS.

The MQ decoder design from [14] separates the logical and the mathematical operations from the logic in order to determine the values of the internal registers and the output value. As proposed in [14], the MQ decoder architecture is composed of two look-up tables (the context state table and the probability state table), a state machine module, an arithmetic module, a comparator module, and a controller module. This design utilizes no proprietary parts and is easily portable between the Xilinx and Altera architectures. This architecture is implemented on an Altera Stratix III SL150 and operates at 167.45 MHz.

Fig. 3. The flowchart of the MQ-decoder.

4. Proposed MQ decoder architecture

A simplified functional diagram of the MQ decoder is shown in Fig. 4. There are 19 possible context (CX) values generated by the bit plane coder (BPC) [1]. These CX values are fed to the MQ decoder, and each context has an associated entry in the Index Look-up Table (ILT). The ILT has two fields: the index (ICX) and the MPS value (i.e. S_mps). The ICX is a pointer to another table known as the probability estimation table (PET), whereas the S_mps value indicates whether the symbol '0' or '1' is treated as the MPS. The initial values for the ILT are defined in the JPEG2000 standard and are updated as decoding proceeds. The PET has 47 pre-defined states and each state has 4 fields: the probability estimate (Qe), the next state for a most probable symbol (NMPS), the next state for a least probable symbol (NLPS) and SWITCH. These entries are not updated during the decoding operation.

Two registers, A and C, are used during the decoding operation. Register A is the interval register, which represents the size of the current interval as required in the MQ coder [2]. It is initialized to 0x8000, which is equivalent to decimal 0.75. The code register C contains the partial codeword (bit stream) at any stage of decoding. Hereafter, bits C[31:16] are referred to as Chigh and bits C[15:0] as Clow. Register C is initialized with the first two compressed bytes as specified in the JPEG2000 standard. Register A is kept in the range 0.75 ≤ A < 1.5. The decoding process determines which sub-interval is pointed to by the compressed image data. To do this, the decoder subtracts any interval that the encoder has added to the code string after decoding each decision. Whenever the value in the A register falls below 0.75, the data in registers A and C are shifted left by one bit. This operation is called renormalization, and it is repeated as long as A < 0x8000. A more detailed description of the renormalization and decoding algorithm is given in the previous section.
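The renormalization loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the sketch only models the left shifts of A and C; it leaves out the bit counter CT and the byte input that the full procedure also handles.

```python
def renormalize(a, c):
    """Shift A and C left one bit at a time until A >= 0x8000.

    Returns the new A, the new C and the number of shifts performed.
    """
    if a <= 0:
        raise ValueError("interval register A must be positive")
    shifts = 0
    while a < 0x8000:
        a = (a << 1) & 0xFFFF          # A is a 16-bit register
        c = (c << 1) & 0xFFFFFFFF      # C is a 32-bit register
        shifts += 1
    return a, c, shifts


def interval_as_fraction(a):
    """Map the integer A to the decimal scale in which 0x8000 equals 0.75."""
    return a * 0.75 / 0x8000


if __name__ == "__main__":
    a, c, n = renormalize(0x29FF, 0x09FF0000)
    print(hex(a), hex(c), n, interval_as_fraction(a))
```

After renormalization the value of A lies again in the range 0.75 ≤ A < 1.5 on the decimal scale.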

4.1. PET ROM structure

To implement the proposed architecture, the probability estimation table (PET) used in the JPEG2000 standard is modified. There are 47 indexes in the PET for the contexts; for this reason we use a ROM to hold the PET. The depth of the PET ROM should be 47, with 6 bits as the address wires. At each index, the PET ROM provides not only the probability of the current LPS symbol, but also the next probable symbol's information.

Fig. 4. MQ decoder functional diagram.

In order to decrease the size of the PET ROM, we carefully check the content of the MQ-related data structures. For the symbol probability table Qe(47), each value of Qe is less than 0x8000, and the last two bits of Qe are constant, i.e., "01" in binary. A width of only 13 bits is therefore enough for each value of Qe.

For the symbol probability table, we observe that some values of Qe are the same for different indexes. For example, Qe(0) = Qe(6) = Qe(14) = Qe(46) = 0x5601.

Table 2 shows the detailed information for Qe(47). From this table, we count the number of different values and their occurrences in the table Qe(47). There are only 32 different values in the Qe list. Therefore the necessary size of the Qe table is only 32 × 13 bits. By adopting this Qe implementation technique, only 416 memory elements are necessary instead of 752.
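The counting argument above can be checked with a short Python sketch. The list `QE` is transcribed from Table 2 (value by PET index); the helper names `pack_qe` and `unpack_qe` are hypothetical and only illustrate how a 16-bit Qe value could be held in 13 bits by dropping the constant leading 0 and the constant trailing "01".

```python
# Qe values for PET indexes 0..46, transcribed from Table 2.
QE = [
    0x5601, 0x3401, 0x1801, 0x0AC1, 0x0521, 0x0221, 0x5601, 0x5401,
    0x4801, 0x3801, 0x3001, 0x2401, 0x1C01, 0x1601, 0x5601, 0x5401,
    0x5101, 0x4801, 0x3801, 0x3401, 0x3001, 0x2801, 0x2401, 0x2201,
    0x1C01, 0x1801, 0x1601, 0x1401, 0x1201, 0x1101, 0x0AC1, 0x09C1,
    0x08A1, 0x0521, 0x0441, 0x02A1, 0x0221, 0x0141, 0x0111, 0x0085,
    0x0049, 0x0025, 0x0015, 0x0009, 0x0005, 0x0001, 0x5601,
]


def pack_qe(qe):
    """Keep the 13 variable bits of a Qe value (bits 14..2)."""
    assert qe < 0x8000 and qe & 0b11 == 0b01
    return (qe >> 2) & 0x1FFF


def unpack_qe(packed):
    """Restore the 16-bit Qe value by re-appending the constant '01'."""
    return (packed << 2) | 0b01


distinct = sorted(set(QE))
full_bits = len(QE) * 16           # one 16-bit word per PET index
packed_bits = len(distinct) * 13   # one 13-bit word per distinct value

if __name__ == "__main__":
    print(len(QE), len(distinct), full_bits, packed_bits)
```

The sketch does not model how a PET index is mapped to one of the 32 stored values; that mapping is not described in this section.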

In addition, each NMPS entry in the PET is 6 bits wide. All these entries are realized using the following function (4):

$$
\mathrm{NMPS}[x] = \begin{cases} x + 1 & \text{if } x \in [0,4] \cup [6,12] \cup [14,44] \\ 38 & \text{if } x = 5 \\ 29 & \text{if } x = 13 \\ x & \text{if } x = 45, 46 \end{cases} \tag{4}
$$

All SWITCH entries in the PET ROM are realized using function (5):

$$
\mathrm{SWITCH}[x] = \begin{cases} 1 & \text{if } x = 0, 6, 14 \\ 0 & \text{else} \end{cases} \tag{5}
$$

Similarly, all NLPS entries in the PET are realized using function (6):

$$
\mathrm{NLPS}[x] = \begin{cases} 1 & \text{if } x = 0 \\ 29 & \text{if } x = 4 \\ 33 & \text{if } x = 5 \\ x & \text{if } x = 6, 46 \\ 3(x + 1) & \text{if } x \in [1,3] \\ 14 & \text{if } x \in [7,9] \cup [14,15] \\ x + 7 & \text{if } x \in [10,11] \\ x + 8 & \text{if } x \in [12,13] \\ x - 1 & \text{if } x \in [16,20] \\ x - 2 & \text{if } x \in [21,45] \end{cases} \tag{6}
$$
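Functions (4)-(6) translate directly into code. The following Python sketch is one possible transcription, with hypothetical function names; it is meant to show that the three PET fields can be computed from the index x instead of being stored, not to reproduce the authors' VHDL.

```python
def nmps(x):
    """Next index after an MPS, following function (4)."""
    if x == 5:
        return 38
    if x == 13:
        return 29
    if x in (45, 46):
        return x
    if 0 <= x <= 44:
        return x + 1
    raise ValueError("PET index out of range")


def switch(x):
    """MPS sense switch flag, following function (5)."""
    return 1 if x in (0, 6, 14) else 0


def nlps(x):
    """Next index after an LPS, following function (6)."""
    if x == 0:
        return 1
    if x == 4:
        return 29
    if x == 5:
        return 33
    if x in (6, 46):
        return x
    if 1 <= x <= 3:
        return 3 * (x + 1)
    if 7 <= x <= 9 or 14 <= x <= 15:
        return 14
    if 10 <= x <= 11:
        return x + 7
    if 12 <= x <= 13:
        return x + 8
    if 16 <= x <= 20:
        return x - 1
    if 21 <= x <= 45:
        return x - 2
    raise ValueError("PET index out of range")


if __name__ == "__main__":
    for x in range(47):
        print(x, nmps(x), nlps(x), switch(x))
```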

Table 2
The statistics of Qe table.

| Qe | Occurrence count | Position | Qe | Occurrence count | Position |
|---|---|---|---|---|---|
| 0x5601 | 4 | 0, 6, 14, 46 | 0x5101 | 1 | 16 |
| 0x3401 | 2 | 1, 19 | 0x2801 | 1 | 21 |
| 0x1801 | 2 | 2, 25 | 0x2201 | 1 | 23 |
| 0x0AC1 | 2 | 3, 30 | 0x1401 | 1 | 27 |
| 0x0521 | 2 | 4, 33 | 0x1201 | 1 | 28 |
| 0x0221 | 2 | 5, 36 | 0x1101 | 1 | 29 |
| 0x5401 | 2 | 7, 15 | 0x09C1 | 1 | 31 |
| 0x4801 | 2 | 8, 17 | 0x08A1 | 1 | 32 |
| 0x3801 | 2 | 9, 18 | 0x0441 | 1 | 34 |
| 0x3001 | 2 | 10, 20 | 0x02A1 | 1 | 35 |
| 0x2401 | 2 | 11, 22 | 0x0141 | 1 | 37 |
| 0x1C01 | 2 | 12, 24 | 0x0111 | 1 | 38 |
| 0x1601 | 2 | 13, 26 | 0x0085 | 1 | 39 |
| 0x0009 | 1 | 43 | 0x0049 | 1 | 40 |
| 0x0005 | 1 | 44 | 0x0025 | 1 | 41 |
| 0x0001 | 1 | 45 | 0x0015 | 1 | 42 |

By adopting all these changes, only 502 memory elements are necessary to hold the entire PET ROM data, which means that the remaining 861 memory elements are unnecessary. These optimization techniques therefore reduce the memory elements of the conventional implementation of the PET ROM by 63%.
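The figures quoted above are consistent with a conventional PET of 47 entries of 29 bits each. The following Python sketch reproduces the arithmetic; the 16-bit Qe and 6-bit NMPS widths come from the text, while the 6-bit NLPS and 1-bit SWITCH widths are assumptions chosen because they make the totals match.

```python
ENTRIES = 47
# Field widths in bits; the NLPS and SWITCH widths are assumed, not stated.
FIELD_BITS = {"Qe": 16, "NMPS": 6, "NLPS": 6, "SWITCH": 1}

conventional = ENTRIES * sum(FIELD_BITS.values())  # full table stored as-is
optimized = 502                                    # figure reported in the text
saved = conventional - optimized
reduction_percent = 100.0 * saved / conventional

if __name__ == "__main__":
    print(conventional, saved, round(reduction_percent, 1))
```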

4.2. Probability estimator

The whole architecture of our proposed MQ decoder is shown in Fig. 5. In order to provide a decision 4 clock cycles after receiving a context and Bitstream, the decoder goes through two different stages. The first stage is used for the prediction of the probability estimate and the second stage is used for decoding the bit value. In the first stage, we start by reading the context CX. The context CX value is then placed on the address bus of the ILT RAM (RAM I(CX) and RAM MPS(CX)), and the ICX and mps_D values are read. This I(CX) value is then supplied to the PET ROM (ROM_NMPS(I(CX)), ROM_NLPS(I(CX)), ROM_Switch(I(CX)) and ROM_Qe(I(CX))) and the entries corresponding to these locations are read. When renormalization is required, the value of NMPS or NLPS is copied into the RAM I(CX). Similarly, if the S_Switch signal is high, the S_MPS bit in the RAM_MPS is updated.

Fig. 5. Internal architecture of MQ decoder.
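The ILT update performed by the probability estimator can be sketched in Python as follows. This is a simplified software model, not the proposed hardware: `PET_EXCERPT` holds only the entries for indexes 0 and 1 (values taken from functions (4)-(6)), the all-zero initial ILT is a placeholder rather than the initial state defined by the standard, and all names are hypothetical.

```python
# Excerpt of the PET for indexes 0 and 1, from functions (4)-(6).
PET_EXCERPT = {
    0: {"nmps": 1, "nlps": 1, "switch": 1},
    1: {"nmps": 2, "nlps": 6, "switch": 0},
}


def new_ilt(num_contexts=19):
    """One (index, mps) pair per context; zeros are placeholder values."""
    return [(0, 0) for _ in range(num_contexts)]


def update_context(ilt, cx, took_lps_path, pet=PET_EXCERPT):
    """Update I(CX) and MPS(CX) when a renormalization is required."""
    index, mps = ilt[cx]
    entry = pet[index]
    if took_lps_path:
        if entry["switch"]:
            mps ^= 1                   # SWITCH high: flip the MPS sense
        index = entry["nlps"]          # copy NLPS into RAM I(CX)
    else:
        index = entry["nmps"]          # copy NMPS into RAM I(CX)
    ilt[cx] = (index, mps)
    return ilt[cx]


if __name__ == "__main__":
    ilt = new_ilt()
    print(update_context(ilt, 3, took_lps_path=True))
    print(update_context(ilt, 3, took_lps_path=False))
```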


An example to demonstrate the algorithmic links between the LPS_Exchange state and the Probability estimator stage in the current design is shown in Fig. 6.

4.3. The decoding block

According to the MQ decoder algorithm, we have established a state machine with eleven states. The conditions for the transitions between the different states of the block diagram are highlighted in the state machine diagram in Fig. 7.

• Pause: Depending on the value of RST and the value of go, the INITDEC state is executed.
• Initdec: This state is used to start the MQ decoder. The pointer to the compressed data BP (pointer to byte B) is initialized to BPST (the initial value of BP). The first byte of the compressed data is shifted into the low-order byte of Chigh, and a new byte is then read in. Finally, the register C is shifted by 7 bits, CT is decremented by 7, and the register A is initialized to 0x8000.
• READ: CX is read and passed on to DECODE. The probability estimation procedures, which provide adaptive estimates of the probability for each context, are embedded in the DECODE state.
• DECODE: The first step in the decoding state is to subdivide the current interval A by subtracting the probability of the least probable symbol Qe(I(CX)). Then the Chigh register is compared to the value of the probability estimate Qe for the current index I stored at CX (Qe(I(CX))). This test determines whether an MPS or an LPS is decoded. If the Chigh register is logically greater than or equal to Qe(I(CX)), the Chigh register is decremented by that amount. If A is then less than 0x8000, the MPS_Exchange state is performed; otherwise the MPS sense stored at CX is used to set the decoded decision D in the D_get_MPS state. If the Chigh register is logically less than Qe(I(CX)), the LPS_Exchange state is executed. A renormalisation is needed whenever the MPS_Exchange or the LPS_Exchange state has been executed.
• MPS_Exchange: As a first step in this state, the interval A is compared to the probability estimate Qe(I(CX)). If the LPS sub-interval is larger, the conditional exchange occurred and an LPS occurred. The probability update switches the MPS sense if SWITCH(I(CX)) is '1' and updates the index I(CX) from the next LPS index NLPS(I(CX)). If, however, the MPS sub-interval size A is not logically less than the LPS probability estimate Qe(I(CX)), an MPS occurred and the decision can be set from MPS(CX). The index I(CX) is then updated from the next MPS index NMPS(I(CX)).
• LPS_Exchange: The same logical comparison between the MPS sub-interval A and the LPS probability estimate Qe(I(CX)) is performed in the LPS_Exchange state. This comparison determines whether a conditional exchange occurred. On both paths the new sub-interval A is set to Qe(I(CX)). On the left path the conditional exchange occurred, so the decision and update are those of the MPS case. On the right path, the LPS decision and update are followed.
• Renormed: In the Renormed state a counter keeps track of the number of compressed bits in the Clow section of the C-register. When CT is equal to zero, a new byte is inserted into the Clow register in the BYTEIN state. Otherwise the Ren_1 state is executed.
• Bytein: This state is called from the RENORMED state when CT is equal to zero, as shown in Fig. 7. This state reads in one byte of data, compensating for any stuff bits following the 0xFF byte in the process. The Reg_C in Bytein is the concatenation of the Chigh and Clow registers. B is the byte pointed to by the compressed data buffer pointer BP. If B is not a 0xFF byte, BP is incremented and the new value of B is inserted into the high-order 8 bits of the Clow register. If B is equal to hex FF, B1 (the byte pointed to by BP + 1) is tested. When B1 exceeds hex 8F, Reg_C is updated by adding the data byte hex FF to the C register and the bit counter CT is set to 8.
• Ren_1: During this state, the registers Reg_A and Reg_C are shifted to the left, one bit at a time, and the counter CT is decremented by 1. If Reg_A is below hex 8000, the Renormed state is executed again. Otherwise the Finish state is executed.
• Finish: In this state, if nbpaire is equal to the total number of CX, we return to the Pause state; otherwise the READ state is executed again.

Fig. 6. An extract of the proposed VHDL pseudo code.
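The decision logic of the DECODE, MPS_Exchange and LPS_Exchange states can be summarized in a short Python sketch. It is a software model written from the state descriptions above, with a hypothetical function name and return convention; the renormalization and byte input that follow are only signalled through a flag, not performed.

```python
def decode_step(a, chigh, qe, mps):
    """One pass through DECODE and, if needed, an exchange state.

    Returns (decision, a, chigh, update, renorm) where update is
    "NMPS", "NLPS" or None and renorm tells whether Renormed follows.
    """
    a -= qe                            # subdivide the current interval
    if chigh < qe:
        # LPS_Exchange: A becomes Qe on both paths
        if a < qe:
            decision, update = mps, "NMPS"       # conditional exchange
        else:
            decision, update = 1 - mps, "NLPS"
        return decision, qe, chigh, update, True
    chigh -= qe
    if a < 0x8000:
        # MPS_Exchange
        if a < qe:
            decision, update = 1 - mps, "NLPS"   # conditional exchange
        else:
            decision, update = mps, "NMPS"
        return decision, a, chigh, update, True
    # D_get_MPS: no renormalization and no probability update
    return mps, a, chigh, None, False


if __name__ == "__main__":
    print(decode_step(0x8000, 0x6000, 0x5601, 0))
    print(decode_step(0x8000, 0x1000, 0x5601, 0))
    print(decode_step(0xF000, 0x0005, 0x0001, 1))
```

The SWITCH test that may flip the MPS sense applies on the "NLPS" update path and is left to the probability estimator.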

5. Implementation and results

5.1. Implementation

For comparison with future designs and for completeness, Table 3 displays the logic utilization and maximum clock frequency of the proposed MQ decoder architecture when targeting the three platforms Virtex-6, Virtex-5 and Virtex-4 using the default settings of Xilinx ISE 13.1. It has been estimated that on average 4.10 cycles are required to decode one decision. As shown in Table 3, the estimated frame rate at 439.58 MHz is 63.2 FPS, while at 322.5 MHz it is 46.4 FPS, for high-definition TV of 1920p.
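The relation between clock frequency, cycles per decision and frame rate can be explored with a small Python sketch. The decision rate follows directly from the reported 4.10 cycles per decision; the number of decisions per frame is not stated in the text and is only inferred here from the reported frequency and frame rate pairs, so it should be read as an illustration rather than the authors' workload model. All names are hypothetical.

```python
CYCLES_PER_DECISION = 4.10

# (maximum frequency in Hz, reported frame rate in FPS) per device, from Table 3.
REPORTED = {
    "XC6VLX75T": (439.58e6, 63.23),
    "XC5VLX50T": (395.64e6, 56.92),
    "XC4VLX15": (322.49e6, 46.40),
}


def decisions_per_second(clock_hz, cycles=CYCLES_PER_DECISION):
    """Average number of decoded decisions per second."""
    return clock_hz / cycles


def implied_decisions_per_frame(clock_hz, fps, cycles=CYCLES_PER_DECISION):
    """Decisions per frame implied by a reported frequency and frame rate."""
    return decisions_per_second(clock_hz, cycles) / fps


if __name__ == "__main__":
    for device, (clock_hz, fps) in REPORTED.items():
        rate = decisions_per_second(clock_hz)
        per_frame = implied_decisions_per_frame(clock_hz, fps)
        print(device, round(rate / 1e6, 1), round(per_frame))
```

The three devices give nearly the same implied workload per frame, which suggests that the reported frame rates scale linearly with the clock frequency.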

5.2. Comparison

To evaluate the performance of our architecture in terms of hardware requirements and operating frequency, a comparative study was made with other existing architectures on the same FPGA, the Virtex-2 XC2V6000-6. Table 4 presents a comparison of the proposed MQ decoder architecture with the existing architectures. From this comparison, we note that the operating frequency of the proposed MQ decoder is higher than that of all other architectures


Fig. 7. Proposed MQ decoder state machine.

Table 3. Synthesis results of the proposed architecture.

| Used FPGA | XC6VLX75T | XC5VLX50T | XC4VLX15 |
|---|---|---|---|
| Number of slice LUTs | 378/46560 | 414/28800 | 611/71680 |
| Number of slice registers | 258/93120 | 286/28800 | 318/35840 |
| Number of fully used LUT-FF pairs | 176/475 | 196/659 | 288/71680 |
| Max. frequency (MHz) | 439.58 | 395.64 | 322.49 |
| Frame rate supported (FPS) | 63.23 | 56.92 | 46.40 |

Table 4. Comparison with other MQ decoder architectures (FPGA used: Virtex-2 XC2V6000-6).

| Architecture | [13] | [12] | [9] | [14] | Proposed |
|---|---|---|---|---|---|
| Number of slice registers | 498 | 328 | 315 | 324 | 313 |
| Number of slice LUTs | 944 | 579 | 586 | 600 | 630 |
| Frequency (MHz) | 140.4 | 142 | 157.2 | 146.5 | 205.5 |
| Frame rate supported (FPS) | – | 20.41 | – | – | 29.56 |

666 L. Horrigue et al. / Microprocessors and Microsystems 38 (2014) 659–668

and it requires fewer hardware resources. By adopting the PET ROM implementation method, the hardware requirement is reduced significantly, as expected. Only 502 memory elements are required to implement it, thanks to the proposed optimization technique. From Table 4, it is clear that the memory requirement of the proposed architecture is reduced by 37.1% compared to that of the architecture presented in [13]. Therefore the estimated frame rate of the proposed MQ decoder


Table 5. Comparison with another MQ decoder architecture using Altera Stratix III and Stratix II.

| FPGA family | Stratix II EP2S60F102C4 | Stratix II EP2S60F102C4 | Stratix III SL150 |
|---|---|---|---|
| Architecture | [19] | Proposed | Proposed |
| Combinational ALUTs | 351/48352 (<1%) | 520/48352 (<1%) | 548/113600 (<1%) |
| Dedicated logic registers | 323/48352 (<1%) | 301/48352 (<1%) | 301/113600 (<1%) |
| Frequency | 100 MHz | 155.98 MHz | 214.41 MHz |

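Table 6. MQ decoder throughput.

| Used FPGA | Design | Number of bytes | Average number of clock cycles | Frequency (MHz) | Throughput (MB/s) |
|---|---|---|---|---|---|
| Virtex5 LX155 | Proposed | 4096 | 38.9661 | 396.196 | 41.97 |
| Virtex5 LX155 | [20] | N/A | N/A | 125 | 31.25 |
| Virtex2 XC2V6000 | Proposed | 4096 | 28.443 | 205.5 | 35.74 |
| Virtex2 XC2V6000 | [9] | 4096 | 68,538 | 157.2 | 9.4 |
| Virtex2 XC2V6000 | [13] | 4096 | 74,239 | 140.4 | 7.74 |
| Virtex2 XC2V6000 | [21] | 4096 | 32,768 | 45.6 | 5.7 |
| Virtex2 XC2V6000 | [22] | N/A | N/A | 127.97 | 5.3 |
| Stratix 3A | [18] | N/A | N/A | 73 | 4.5 |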


architecture is 1.44 times higher than that of the MQ decoder in [12].
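As a quick check, both ratios quoted above can be reproduced from the Table 4 entries. This is a sketch under the assumption that the 37.1% reduction refers to the slice-register counts of [13] and of the proposed design, and that the 1.44 factor is the ratio of the two reported frame rates; the text does not state which rows were used.

$$
\frac{498 - 313}{498} \approx 0.371, \qquad \frac{29.56}{20.41} \approx 1.448
$$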

For a careful comparison, we also implemented our MQ decoder architecture in VHDL using Altera FPGAs as target devices. Table 5 shows the final implementation results on the two FPGA platforms as reported by their default synthesis tool, i.e. the Quartus layout tool for Altera. From the above table we can see that our MQ decoder's frequency is much higher and our memory size is much smaller than those of the existing architecture [14]. Considering memory size and speed, we believe that our architecture is 1.28 times faster than the MQ decoder in [14]. Therefore, the proposed architecture uses a small area to achieve a high speed.

The theoretical throughput calculation is performed using the frequency obtained by targeting the Xilinx Virtex-2 XC2V6000-6 with Xilinx ISE 10.1, and the average number of clock cycles is determined by implementing the design with a counter in the state machine HDL. The throughputs of [21,18] are calculated, along with that of the design in [9], using Eq. (7). The throughput and the average number of clock cycles from [10,8,22,9] are copied from their publications and provided here for comparison. The average number of clock cycles for the design in [18] is given in its publication as eight cycles per byte of the code block. The designs in [21,10,9,22] and the proposed design all target the Xilinx Virtex-2 FPGA, while the design from [18] targets the Altera Stratix 3 FPGA and the design from [20] targets the Xilinx Virtex-5 LX155.

$$
\text{Throughput (MB/s)} = \frac{w_{CB} \times h_{CB} \times d_{CB} \times f(\text{MHz})}{l} \qquad (7)
$$

Eq. (7) gives the throughput of the binary arithmetic decoder in MB/s. wCB and hCB refer to the width and the height of the code block in pixels, and dCB refers to the depth of the pixels. f is the frequency of the design in MHz and l is the number of cycles required per code block. For the design comparison, the width and height of the code block are 64 × 64 and the depth of each pixel is one byte, so the total size of the data processed is 4096 bytes. Table 6 shows the data size, average number of cycles, frequency, and final throughput for all six designs. The proposed design requires 38,871 clock cycles per code block. From the FPGA synthesis results, the throughput of the proposed architecture can reach 35.745 MB/s and 41.97 MB/s when it is implemented on Virtex-2 and Virtex-5 LX155 respectively, which is higher than that of the architecture of G. Baruffa et al. [20].
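Eq. (7) is easy to evaluate directly. The sketch below applies it to three reference designs of Table 6 on Virtex-2, taking their cycle counts as whole numbers of cycles per 64 × 64 code block with one byte per pixel. The function name `throughput_mb_per_s` and the `rows` dictionary are illustrative, not part of the original design.

```python
def throughput_mb_per_s(width, height, depth_bytes, freq_mhz, cycles_per_block):
    """Eq. (7): bytes per code block times frequency (MHz), over cycles per code block."""
    return width * height * depth_bytes * freq_mhz / cycles_per_block


# (frequency in MHz, average clock cycles per code block) as listed in Table 6.
rows = {
    "[9]": (157.2, 68538),
    "[13]": (140.4, 74239),
    "[21]": (45.6, 32768),
}

results = {
    name: throughput_mb_per_s(64, 64, 1, freq, cycles)
    for name, (freq, cycles) in rows.items()
}
for name, mbps in results.items():
    print(f"{name}: {mbps:.2f} MB/s")
```

Because the frequency is expressed in MHz and the block size in bytes, the result comes out in MB/s without further scaling.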

6. Conclusion

In this paper, we have designed and implemented an efficient architecture of the MQ decoder of the JPEG2000 standard. This new architecture is based on a reduced probability estimation block and faster MQ decoding. The MQ decoder design was implemented in the VHDL hardware description language and synthesized for FPGA devices. The maximum operating speed, 439 MHz, is observed on the Virtex-6 device. Moreover, the memory requirement of the proposed architecture is reduced by 37.1% compared to the other existing architecture, and the maximum frequency of the MQ decoder is the highest of all the architectures mentioned above. According to the synthesis results, our architecture can reach 41.97 MB/s at its maximum frequency. The proposed architecture is easily portable between the Xilinx and Altera architectures and it is capable of decoding 63.2 frames/s of High Definition video (HD, 1920 × 1080 pixels), making it a good candidate for use as a high-speed real-time JPEG2000 decoder in various applications such as digital cinema, medical imaging, satellite imagery, etc.

References

[1] JPEG. JPEG 2000 Part 1 Final Draft International Standard Version 1.0. ISO(15444), April 2000.

[2] JPEG2000 Standard for Image Compression: Concepts, Algorithms and VLSI Architectures, Tinku Acharya & Ping-Sing Tsai, A John Wiley & Sons, Inc., Publication, 2005.

[3] ISO/IEC JTC 1/SC 29/WG 1, (ITU-T SG8) Coding of Still Pictures, JBIG (Joint Bilevel Image Experts Group), JBIG Committee, 16 July 1999.

[4] D. Taubman, High performance scalable image compression with EBCOT, IEEE Trans. Image Process. 9–7 (July) (2000) 1158–1170.

[5] JPEG2000 Part 1 020719 (Final Publication Draft), ISO/IEC/JTC1/SC29/WG1 N2678. Tech. Rep., 2002.

[6] K. Liu, Y. Zhou, Y. Song Li, J.F. Ma, A high performance MQ encoder architecture in JPEG2000, Integr. VLSI J. 43 (3) (2010) 305–317.

[7] Kishor Sarawadekar, Swapna Banerjee, VLSI design of memory-efficient, high-speed baseline MQ coder for JPEG 2000, Integr. VLSI J. 45 (2012) 1–8.

[8] Nandini Ramesh Kumar, Wei Xiang, Yafeng Wang, Two-symbol FPGA architecture for fast arithmetic encoding in JPEG 2000, J. Signal Process. Syst. 69 (2012) 213–224. Springer.

[9] David J. Lucking, Eric J. Balster, Kerry L. Hill, Frank A. Scarpino, FPGA implementation of the JPEG2000 binary arithmetic MQ decoder, J. Real Time Image Process. Syst. (July) (2011). Springer.

[10] M. Dyer, S. Nooshabadi, D. Taubman, Reduced latency arithmetic decoder for JPEG 2000 block decoding, in: Circuits and Systems, 2005. ISCAS 2005. IEEE Int. Symp. 32076_2097 (2005). http://dx.doi.org/10.1109/ISCAS.2005.1465027.

[11] Antonin Descampe & François-Olivier Devaux, «Etude et conception d'un décodeur hardware JPEG 2000 destiné au cinéma numérique», master's thesis, June 2003.


[12] Omkar C. Kulkarni, Kishor Sarawadekar, Swapna Banerjee, «VLSI Implementation of MQ Decoder in JPEG2000», in: Proceeding of the 2011 IEEE Students' Technology Symposium, Circuits and Systems for Video Technology, pp. 193–197, January 2011.

[13] A. Descampe, F.-O. Devaux, G. Rouvroy, J.-D. Legat, J.-J. Quisquater, B. Macq, A flexible hardware JPEG 2000 decoder for digital cinema, in: Circuits and Systems for Video Technology, IEEE Transactions on, vol. 16(11), November 2006, pp. 1397–1410.

[14] David J. Lucking, Eric J. Balster, Kerry L. Hill, Frank A. Scarpino, "FPGA implementation of the JPEG2000 binary arithmetic MQ decoder", Master's thesis, University of Dayton, May 2010.

[15] T. Saidani, M. Atri, L. Lekhriji, R. Tourki, An efficient hardware implementation of parallel EBCOT algorithm for JPEG 2000, J. Real-Time Image Process. (January) (2013) 1–12.

[16] Taoufik Saidani, M. Atri, Y. Said, D. Dia, R. Tourki, FPGA Real Time Acceleration for Discrete Wavelet Transform of the 5/3 Filter for JPEG2000 Standard, Int. J. Embedded Syst. Appl. (IJESA) 2 (1) (2012).

[17] Saidani Taoufik, M. Atri, D. Dia, R. Tourki, Using Xilinx system generator for real time hardware co-simulation of video processing system, in: Electronic Engineering and Computing Technology, Springer, 2010.

[18] Michael Dyer, Saeid Nooshabadi, David Taubman, Reduced latency arithmetic decoder for JPEG2000 block decoding, in: Circuits Systems, 2005. ISCAS 2005. IEEE International Symposium on, vol. (3), May 2005, pp: 2076–2079, 23–26.

[19] R. Xu et al., A high-performance JPEG2000 decoder based on FPGA according to DCI specification, in: 2010 Symposium on Photonics and Optoelectronic (SOPO).

[20] G. Baruffa et al., A reprogrammable computing platform for JPEG 2000 and H.264 SHD video coding, in: 2010 8th IEEE Workshop on Embedded Systems for Real-Time Multimedia.

[21] H.H. Chen, C.J. Lian, T.H. Chang, L.G. Chen, Analysis of EBCOT decoding algorithm and its VLSI implementation for JPEG 2000, in: Circuits and Systems 2002. IEEE Int. Symp., vol. 4, 2002, pp. 329–332.

[22] T. Zhu, J. Zhou, S. Liu, Design and implementation of JPEG 2000 arithmetic decoder based on Handel-C, in: Anti-counterfeiting, Security, and Identification in Communication, 2009. http://dx.doi.org/10.1109/ICASID.2009.5276988.

Layla Horrigue received her M.S. degree in Microelectronics from the Faculty of Science of Monastir, Tunisia in 2013. Her major research interests include VLSI and embedded systems in video compression.

Taoufik Saidani received his M.S. degree in Microelectronics from the Faculty of Science of Monastir, Tunisia in 2007. His major research interests include VLSI and embedded systems in video and image compression. His current research interests include digital signal processing and hardware–software co-design for rapid prototyping in telecommunications.

Refka Ghodhbani received her M.S. degree in Microelectronics from the Faculty of Science of Monastir, Tunisia in 2013. Her major research interests include circuit and system design and image compression with embedded block coding.

Julien Dubois has been an associate professor at the University of Burgundy since 2003. He is a member of the Laboratory Le2i (UMR CNRS 6063). His research interests include real-time implementation, smart cameras, hardware design based on data-flow modeling, motion estimation and image compression. In 2001, he received a PhD in Electronics from the University Jean Monnet of Saint Etienne (France) and joined EPFL in Lausanne (Switzerland) as a project leader to develop an FPGA-based co-processor for a new CMOS camera.

Johel Miteran received the PhD degree in image processing from the University of Burgundy, Dijon, France in 1994. Since 1996, he has been an assistant professor and since 2006 he has been a professor at Le2i, University of Burgundy. He is now engaged in research on classification algorithms, face recognition, access control problems and real-time implementation of these algorithms on software and hardware architectures.

Atri Mohamed, born in 1971, received his Ph.D. degree in Microelectronics from the Science Faculty of Monastir in 2001. He is currently a member of the Laboratory of Electronics and Microelectronics. His research includes circuit and system design, image processing, network communication, IPs and SoCs.