Hardware Optimized DCT/IDCT Implementation on Verilog HDL
In this report, I explore 4 implementations of a hardware-based pipelined DCT/IDCT in Verilog HDL. Conventional DCT/IDCT implementations suffer from the amount of hardware needed for storage and computation. This project is an attempt to optimize these requirements and to compare 4 implementations in order to identify the best design point for a hardware-based DCT/IDCT implementation. It has been observed that the Serial In implementation consumes about 6% less area than the 8 Parallel In implementation at a performance degradation of only about 4%.
Rahul Srikumar
ECE 734
Table of Contents
Motivation
Prior Work
The Discrete Cosine Transform
Introduction
Four Implementations
Serial In
2 Parallel In
4 Parallel In
8 Parallel In
Optimizations
Synthesis and Results
Conclusion
References
Motivation
The Discrete Cosine Transform (DCT) is one of the most important building blocks of the
image compression algorithms used in image processing applications. It involves a large
number of multiplications and additions and also has a large memory requirement. Several
algorithms have been proposed over the last couple of decades to reduce the number of
computations and the memory requirements of the DCT computation.
Any algorithm that can reduce the total number of additions, multiplications or memory
requirement would be of profound significance to the image processing domain.
Prior Work
There has been a lot of research both in industry and academia on how to efficiently
implement a fast DCT/IDCT hardware algorithm. Dae Won Kim et al. [1] proposed and
implemented a hardware Distributed Arithmetic (DA) method with radix-2 multibit coding
with minimum resource requirement by using transpose memory. Atitallah et al. [2]
compared the Loeffler and DA algorithms for implementing compression in H.264 and
MPEG. Martuza et al. [3] presented a hybrid architecture for IDCT computation based
on the symmetric structure of matrices and similarity in matrix operations. The
proposed architecture derives its inspiration from all of the above well-established examples.
The Discrete Cosine Transform
A discrete cosine transform (DCT) expresses a sequence of finitely many data points in
terms of a sum of cosine functions oscillating at different frequencies i.e. it transforms a
signal from a spatial representation into a frequency representation. In an image, most
of the energy will be concentrated in the lower frequencies, so if I transform an image
into its frequency components and discard the higher frequency coefficients, I can
reduce the amount of data needed to describe the image without sacrificing too much
image quality. This is why DCT is popularly used in several image compression
algorithms. The DCT function used in image processing consists of a sum of weighted
cosine functions at different frequencies.
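The N-point 1-D DCT of a sequence x(n), in its commonly used orthonormal form, is expressed as follows:
$$X(k) = c(k) \sum_{n=0}^{N-1} x(n) \cos\left[\frac{(2n+1)k\pi}{2N}\right], \quad k = 0, 1, \ldots, N-1 \tag{1}$$
$$c(k) = \sqrt{\frac{1}{N}}, \quad k = 0 \tag{2}$$
$$c(k) = \sqrt{\frac{2}{N}}, \quad k = 1, 2, \ldots, N-1 \tag{3}$$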
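The energy compaction argument can be checked numerically. The following is a minimal sketch, assuming the commonly used orthonormal DCT-II scaling (sqrt(1/N) for k = 0 and sqrt(2/N) otherwise); the function name dct_1d and the sample row are hypothetical and are not taken from my Verilog design.

```python
import math

def dct_1d(x):
    """Orthonormal N-point DCT-II of the sequence x (floating point model)."""
    n_pts = len(x)
    out = []
    for k in range(n_pts):
        scale = math.sqrt(1.0 / n_pts) if k == 0 else math.sqrt(2.0 / n_pts)
        acc = sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_pts))
                  for n in range(n_pts))
        out.append(scale * acc)
    return out

# Hypothetical row of 8 pixels forming a smooth ramp.
row = [10, 20, 30, 40, 50, 60, 70, 80]
coeffs = dct_1d(row)

# With orthonormal scaling the total energy is preserved, so the share of
# energy held by the lowest-frequency coefficients can be read off directly.
total_energy = sum(c * c for c in coeffs)
low_energy = coeffs[0] ** 2 + coeffs[1] ** 2
low_fraction = low_energy / total_energy

print([round(c, 2) for c in coeffs])
print("fraction of energy in the two lowest coefficients:", round(low_fraction, 4))
```

For a smooth row like this one, almost all of the energy sits in the first two coefficients, which is what makes discarding the higher frequency coefficients attractive.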
Since images are 2-D objects, a 2-D DCT is required to get all pixels transformed into
the frequency domain. This computation involves 2 major steps.
(i) Computing the 1-D DCT of the rows of the pixel matrix.
(ii) Computing the 1-D DCT of the columns of the pixel matrix by computing the DCT of
the transpose of the matrix obtained in (i).
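The 2-D DCT of an N x N image block x(m,n), with the same scaling as in the 1-D case, is expressed as follows:
$$X(u,v) = c(u)\,c(v) \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} x(m,n) \cos\left[\frac{(2m+1)u\pi}{2N}\right] \cos\left[\frac{(2n+1)v\pi}{2N}\right] \tag{4}$$
$$c(u) = \sqrt{\frac{1}{N}}, \quad u = 0 \tag{5}$$
$$c(u) = \sqrt{\frac{2}{N}}, \quad u = 1, 2, \ldots, N-1 \tag{6}$$
with c(v) defined in the same way.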
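The two-step row and column procedure can be sketched in software as follows. This is a floating point illustration only, assuming orthonormal DCT-II scaling; the helper names (dct_1d, transpose, dct_2d_row_column, dct_2d_direct) and the test block are hypothetical. It compares the row-column result against a direct double sum.

```python
import math

N = 8

def c(k):
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

def dct_1d(x):
    return [c(k) * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                       for n in range(N)) for k in range(N)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def dct_2d_row_column(block):
    rows_done = [dct_1d(r) for r in block]                  # step (i): rows
    cols_done = [dct_1d(r) for r in transpose(rows_done)]   # step (ii): columns
    return transpose(cols_done)

def dct_2d_direct(block):
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            acc = 0.0
            for m in range(N):
                for n in range(N):
                    acc += (block[m][n]
                            * math.cos((2 * m + 1) * u * math.pi / (2 * N))
                            * math.cos((2 * n + 1) * v * math.pi / (2 * N)))
            out[u][v] = c(u) * c(v) * acc
    return out

# Hypothetical 8x8 block of 8-bit pixel values.
block = [[(37 * m + 11 * n) % 256 for n in range(N)] for m in range(N)]
a = dct_2d_row_column(block)
b = dct_2d_direct(block)
max_err = max(abs(a[u][v] - b[u][v]) for u in range(N) for v in range(N))
print("largest difference between row-column and direct 2-D DCT:", max_err)
```

The row-column form needs 2 x 8 one-dimensional transforms per block instead of a 64-term sum for every output, which is why the hardware is built around a single 1-D datapath and an intermediate memory.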
Introduction
In my implementation, I explore four design points of my hardware implementation using
Verilog HDL and evaluate the area-performance trade-off. The design comprises four
modules per design point: one module for DCT computation, one module for IDCT
computation, one top module that instantiates both the DCT and IDCT modules, and a
test bench to test the entire design.
The core idea is to implement a fully pipelined architecture that takes in 8 inputs and
provides a single DCT output, which in turn is used to compute the IDCT. A 1D-DCT is
implemented on the input pixels first. The output of this stage, the so-called intermediate
value, is stored in a RAM. The 2nd 1D-DCT operation is done on this stored value to give the
final 2D-DCT output dct_2d. The inputs are 8 bits wide and the 2D-DCT outputs are 9 bits
wide. A 1D-IDCT is then implemented on the input DCT values. This intermediate value is
stored in a RAM. The 2nd 1D-IDCT operation is done on this stored value to give the
final 2D-IDCT output idct_2d. The inputs are 9 bits wide and the 2D-IDCT outputs are 8
bits wide. The nuances of the 4 design points are described in detail in the
sections that follow.
Four Implementations
Serial In
1st 1D section
The input signals are taken one pixel at a time in the order x00 to x07, x10 to x17 and
so on until x77. These inputs are fed into an 8-bit shift register. The outputs of the
shift register are registered by the divide-by-8 clock, which is the CLOCK signal divided
by 8. This enables us to register 8 pixels (one row) at a time. The pixels are paired
up in an adder/subtractor in the order xk0,xk7:xk1,xk6:xk2,xk5:xk3,xk4. The
adder/subtractor is tied to CLOCK. On every clock, the adder/subtractor module alternately
chooses addition and subtraction. This selection is done by the toggle flop. The output
of each adder/subtractor is fed into a multiplier whose other input is connected to stored
values in registers acting as memory. The outputs of the 4 multipliers are added at
every clock in the final adder. The output of the adder, z_out, is the sequence of 1D-DCT
values, given out in the order in which the inputs were read in.
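The pairing works because the cosine weights of xk(n) and xk(7-n) are equal for even output indices and opposite for odd output indices, so only 4 multiplications are needed per output. The following is a minimal floating point sketch of that idea, assuming orthonormal DCT-II coefficients; the names COEFF, dct_row_direct and dct_row_add_sub are hypothetical, and the sign and absolute value stage of the hardware is not modelled.

```python
import math

N = 8

def scale(k):
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

# COEFF[k][n]: coefficient for output k and input pair n (n = 0..3).
COEFF = [[scale(k) * math.cos((2 * n + 1) * k * math.pi / (2 * N))
          for n in range(N // 2)] for k in range(N)]

def dct_row_direct(x):
    """Reference 8-term sum: 8 multiplications per output."""
    return [scale(k) * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                           for n in range(N)) for k in range(N)]

def dct_row_add_sub(x):
    """Paired datapath: 4 adder/subtractors, 4 multipliers, 1 final adder."""
    out = []
    subtract = False  # models the toggle flop; False selects addition
    for k in range(N):  # one output per clock
        paired = [x[n] - x[N - 1 - n] if subtract else x[n] + x[N - 1 - n]
                  for n in range(N // 2)]
        products = [paired[n] * COEFF[k][n] for n in range(N // 2)]
        out.append(sum(products))
        subtract = not subtract  # alternate add and subtract every clock
    return out

row = [52, 55, 61, 66, 70, 61, 64, 73]  # hypothetical pixel row
direct = dct_row_direct(row)
folded = dct_row_add_sub(row)
max_diff = max(abs(a - b) for a, b in zip(direct, folded))
print("largest difference between the two datapaths:", max_diff)
```

Even outputs are produced on the addition clocks and odd outputs on the subtraction clocks, which matches a toggle that flips once per output.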
It takes 8 clocks to read in the first set of inputs, 1 clock to register the inputs, 1 clock
to do the add/sub, 1 clock to get the absolute value, 1 clock for multiplication and 2 clocks
for the final adder, for a total of 14 clocks to get the 1st z_out value. Every subsequent
clock gives out the next z_out value. So to get all the 64 values we need 14+63=77 clocks.
Storage/RAM section
The outputs z_out of the adder are stored in RAMs. Two RAMs are used so that data
writes can be continuous. The 1st valid input for RAM1 is available at the 15th clock,
so the RAM1 enable is active after 15 clocks. After this the write operation continues
for 64 clocks. At the 65th clock, since z_out is continuous, we get the next valid
z_out_00. This 2nd set of valid 1D-DCT coefficients is written into RAM2, which is
enabled at 15+64 clocks. So at the 65th clock, RAM1 goes into read mode for the next 64
clocks and RAM2 is in write mode. The 2 RAMs alternate between read and write
every 64 clock cycles.
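The alternation of the two RAMs can be modelled behaviourally as below. This is only a sketch: the class name PingPongRam is hypothetical, the pipeline fill delay before the first write is ignored, and reads use the same address order as writes to keep the example short.

```python
BLOCK = 64  # one 8x8 block of intermediate values

class PingPongRam:
    """Two banks: one is written while the other is read, swapped every 64 clocks."""

    def __init__(self):
        self.banks = [[None] * BLOCK, [None] * BLOCK]
        self.write_bank = 0
        self.count = 0

    def clock(self, value):
        """Write one value and return the value read from the other bank."""
        read_bank = 1 - self.write_bank
        read_value = self.banks[read_bank][self.count]
        self.banks[self.write_bank][self.count] = value
        self.count += 1
        if self.count == BLOCK:      # block complete: swap roles
            self.count = 0
            self.write_bank = read_bank
        return read_value

ram = PingPongRam()
# Three consecutive blocks of hypothetical z_out values, one value per clock.
reads = [ram.clock(v) for v in range(3 * BLOCK)]

print(reads[:4])              # nothing valid to read during the first block
print(reads[64:68])           # first block is read while the second is written
print(reads[128:132])         # second block is read while the third is written
```

Because a read and a write never target the same bank in the same clock, the first 1-D stage never has to stall while the second stage consumes the previous block.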
2nd 1D-DCT section
After the first 77 clocks, when RAM1 is full, the 2nd set of 1D calculations can start. The
second 1D implementation is the same as the 1st 1D implementation, with the inputs
now coming from either RAM1 or RAM2. The inputs are read in one column at a
time in the order z00 to z70, z01 to z71 and so on up to z77. The outputs from the adder in
the 2nd section are the 2D-DCT coefficients.
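Reading the RAM column by column is what performs the transpose between the two 1-D stages. A small sketch of one possible address mapping is shown below; the function names and the row-major storage layout are assumptions made for illustration and are not taken from the Verilog source.

```python
N = 8

def write_address(i):
    """Values arrive row by row, so the i-th value is stored at address i."""
    return i

def read_address(i):
    """The i-th value read walks down a column before moving to the next one."""
    col, row = divmod(i, N)
    return row * N + col

def label(addr):
    row, col = divmod(addr, N)
    return f"z{row}{col}"

read_order = [label(read_address(i)) for i in range(N * N)]
print(read_order[:N])        # first column
print(read_order[N:2 * N])   # second column
```

With this mapping the first eight reads return z00, z10, ..., z70 and the next eight return z01, z11, ..., z71, so the second 1-D stage sees one column per group of eight clocks.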
1st 1D-IDCT section
The input signals are taken one value at a time in the order x00 to x07, x10 to x17 and
so on up to x77. These inputs are fed into an 8-bit shift register. The outputs of the
shift register are registered at every 8th clock. This enables us to register 8 values
(one row) at a time. The values are fed into multipliers whose other inputs are connected
to stored values in registers which act as memory. The outputs of the 8 multipliers are
added at every CLOCK in the final adder. The output of the adder, z_out, is the sequence of
1D-IDCT values, given out in the order in which the inputs were read in. It takes 8 clocks to
read in the first set of inputs, 1 clock to get the absolute value of the input, 1 clock for
multiplication and 2 clocks for the final addition, which adds up to a total of 12 clocks to
get the 1st z_out value. Every subsequent clock gives out the next z_out value. So to get all
the 64 values we need 12+64=76 clocks.
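The 1D-IDCT stage described here is a plain 8-term multiply and accumulate per output, without the add/sub pairing used in the forward transform. A minimal floating point sketch, assuming orthonormal scaling and using the hypothetical names dct_1d and idct_1d, shows that it recovers the original row:

```python
import math

N = 8

def scale(k):
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

def dct_1d(x):
    return [scale(k) * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                           for n in range(N)) for k in range(N)]

def idct_1d(coeffs):
    """Each output sample is 8 multiplications followed by one 8-input sum."""
    out = []
    for n in range(N):
        products = [scale(k) * coeffs[k]
                    * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                    for k in range(N)]
        out.append(sum(products))
    return out

row = [52, 55, 61, 66, 70, 61, 64, 73]  # hypothetical pixel row
recovered = idct_1d(dct_1d(row))
max_err = max(abs(a - b) for a, b in zip(row, recovered))
print([round(v) for v in recovered])
print("largest round-trip error:", max_err)
```

The hardware works with fixed-width values (9-bit inputs and 8-bit outputs), so its round trip is only approximate; the sketch leaves quantization out.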
Storage / RAM section
The outputs z_out of the adder are stored in RAMs. Two RAMs are used so that data
writes can be continuous. The 1st valid input for RAM1 is available at the 12th clock,
so the RAM1 enable is active after 11 clocks. After this the write operation continues
for 64 clocks. At the 65th clock, since z_out is continuous, we get the next valid
z_out_00. This 2nd set of valid 1D-IDCT values is written into RAM2, which is
enabled at 12+64 clocks. So at the 65th clock, RAM1 goes into read mode for the next 64
clocks and RAM2 is in write mode. After this, read and write switch between the 2 RAMs
every 64 clocks.
2nd 1D-IDCT section
After the first 76 clocks, when RAM1 is full, the 2nd set of 1D calculations can start. The
second 1D implementation is the same as the 1st 1D implementation, with the inputs
now coming from either RAM1 or RAM2. The inputs are read in one column at a
time in the order z00 to z70, z01 to z71 and so on up to z77. The outputs from the adder in
the 2nd section are the 2D-IDCT values.
2 Parallel In
1st 1D section
The input signals are taken 2 pixels at a time in the order x00:x01, x02:x03 and so on
up to x06:x07. A divide-by-4 clock is used to clock in 4 sets of 2 pixels to get 8 pixels.
The pixels are paired up in an adder/subtractor in the order
xk0,xk7:xk1,xk6:xk2,xk5:xk3,xk4. The adder/subtractor is tied to CLOCK. On every
clock, the adder/subtractor module does 4 additions and 4 subtractions. The outputs of
the add/sub are fed into multipliers whose other inputs are connected to stored values in
registers which act as memory. The outputs of the 8 multipliers are added at every
CLOCK in the final adder. The output of the adder, z_out, is the sequence of 1D-DCT values,
given out in the order in which the inputs were read in.
The difference from the Serial In implementation is that it takes 4 clocks to register the
inputs and do sign extension, 1 clock to do add/sub, 1 clock to get the separate sign and
absolute value, 1 clock for multiplication and 2 clocks for the final adder, for a total of
9 clocks to get the 1st z_out value. Every subsequent clock gives out the next z_out value.
So to get all the 64 values we need 9+63=72 clocks.
The remaining portions of the DCT/IDCT computation process are similar to the Serial In
implementation.
4 Parallel In
The input signals are taken 4 pixels at a time in the order x00:x03 and x04:x07. A
divide-by-2 clock is used to clock in 2 sets of 4 pixels to get 8 pixels. The pixels are
paired up in an adder/subtractor in the order xk0,xk7:xk1,xk6:xk2,xk5:xk3,xk4. The
adder/subtractor is tied to CLOCK. On every clock, the adder/subtractor module does 4
additions and 4 subtractions. The outputs of the add/sub are fed into multipliers whose
other inputs are connected to stored values in registers which act as memory. The outputs
of the 8 multipliers are added at every CLOCK in the final adder. The output of the
adder, z_out, is the sequence of 1D-DCT values, given out in the order in which the inputs
were read in.
In this implementation, it takes 2 clocks to register the inputs and do sign extension,
1 clock to do add/sub, 1 clock to get the separate sign and absolute value, 1 clock for
multiplication and 2 clocks for the final adder, for a total of 7 clocks to get the 1st
z_out value. Every subsequent clock gives out the next z_out value. So to get all the 64
values we need 7+63=70 clocks.
The remaining portions of the DCT/IDCT computation process are similar to the Serial In
implementation.
8 Parallel In
The input signals are taken 8 pixels at a time in the order x00:x07. The pixels are
paired up in an adder/subtractor in the order xk0,xk7:xk1,xk6:xk2,xk5:xk3,xk4. The
adder/subtractor is tied to CLOCK. On every clock, the adder/subtractor module does
4 additions and 4 subtractions. The outputs of the add/sub are fed into multipliers whose
other inputs are connected to stored values in registers which act as memory. The outputs
of the 8 multipliers are added at every CLOCK in the final adder. The output of the
adder, z_out, is the sequence of 1D-DCT values, given out in the order in which the inputs
were read in.
In this implementation, it takes 1 clock to register the inputs and do sign extension,
1 clock to do add/sub, 1 clock to get the separate sign and absolute value, 1 clock for
multiplication and 2 clocks for the final adder, for a total of 6 clocks to get the 1st
z_out value. Every subsequent clock gives out the next z_out value. So to get all the 64
values we need 6+63=69 clocks.
The remaining portions of the DCT/IDCT computation process are similar to the Serial In
implementation.
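The 1D-DCT latency figures quoted for the four design points follow one pattern: the clocks needed to get a row of 8 pixels into the datapath, plus 5 clocks of datapath, plus 63 further clocks for the remaining outputs. The short sketch below reproduces that arithmetic; the dictionary and function names are hypothetical.

```python
# Clocks before a full row is registered, as stated for each design point.
INPUT_CLOCKS = {
    "Serial In": 8 + 1,      # 8 clocks to shift in a row, 1 clock to register it
    "2 Parallel In": 4,
    "4 Parallel In": 2,
    "8 Parallel In": 1,
}

# Stages common to all four: add/sub, sign + absolute value, multiply,
# and a final adder that takes 2 clocks.
DATAPATH_CLOCKS = 1 + 1 + 1 + 2
OUTPUTS_PER_BLOCK = 64

def cycles_to_1d_dct(design):
    """Return (clocks to first z_out, clocks to all 64 z_out values)."""
    first = INPUT_CLOCKS[design] + DATAPATH_CLOCKS
    return first, first + (OUTPUTS_PER_BLOCK - 1)

results = {design: cycles_to_1d_dct(design) for design in INPUT_CLOCKS}
for design, (first, total) in results.items():
    print(f"{design}: first output after {first} clocks, all 64 after {total}")
```

Widening the input from 1 to 8 pixels per clock only shortens the fill phase, so the total drops from 77 to 69 clocks rather than by a factor of 8.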
Optimizations
One of the optimizations I included is the use of 2 RAMs for storage. Each RAM can store 64
values. When the first 1D-DCT value is available, the first RAM goes into write mode and
remains in write mode for the next 63 clocks. Afterwards, it switches to read mode and
the second RAM goes into write mode. The next set of 1D-DCT coefficients is stored
in the second RAM while the first RAM's DCT values are used for the 2D-DCT computation.
As a result, the 2 RAMs alternate between read and write every 64 clocks. This helps
to achieve a fully pipelined design.
For DCT computation, 64 cosine coefficients need to be stored for an 8-point DCT. In
my design, another main optimization is to use only 8 registers that are loaded with 8
coefficients every clock cycle. These values change every clock cycle, providing the
multipliers with the appropriate DCT cosine coefficients. This effectively reduces the
coefficient register requirement to 1/8th of that of a conventional design that holds all
64 coefficients at once.
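The coefficient register idea can be illustrated with a small behavioural model in which only 8 coefficient values are held at any time and are reloaded every clock with the set for the next output index. The class name CoefficientRegisters and the helper coefficient_set are hypothetical, and how the hardware generates each set is not modelled here; orthonormal DCT-II coefficients are assumed.

```python
import math

N = 8

def scale(k):
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

def coefficient_set(k):
    """The 8 cosine coefficients needed to compute output k."""
    return [scale(k) * math.cos((2 * n + 1) * k * math.pi / (2 * N))
            for n in range(N)]

class CoefficientRegisters:
    """Holds only N coefficients; reloaded on every clock."""

    def __init__(self):
        self.k = 0
        self.regs = coefficient_set(0)

    def clock(self):
        current = self.regs
        self.k = (self.k + 1) % N
        self.regs = coefficient_set(self.k)  # load the set for the next output
        return current

def dct_row(x, coeff_regs):
    out = []
    for _ in range(N):                       # one output per clock
        regs = coeff_regs.clock()
        out.append(sum(x[n] * regs[n] for n in range(N)))
    return out

row = [52, 55, 61, 66, 70, 61, 64, 73]       # hypothetical pixel row
coeff_regs = CoefficientRegisters()
result = dct_row(row, coeff_regs)
reference = [sum(row[n] * coefficient_set(k)[n] for n in range(N)) for k in range(N)]
max_diff = max(abs(a - b) for a, b in zip(result, reference))
print("registers held at once:", len(coeff_regs.regs), "instead of", N * N)
print("largest difference from the reference:", max_diff)
```

After 8 clocks the register bank is back at the set for output 0, so the same 8 registers serve every row of the block.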
Synthesis and Results
Figure 1 shows the Modelsim Simulation results of the Serial In implementation of the
DCT computation process.
Figure 1: Modelsim simulation of serial in DCT computation
All four implementations were synthesized in Quartus targeting an Altera Cyclone IV FPGA.
Some of the results obtained from Quartus are shown in Figure 2.
Figure 2: Synthesis Summary of Serial In DCT implementation
Figure 3: Combinational blocks in 4 implementations (chart comparing 8 Parallel, 4 Parallel In, 2 Parallel In and Serial In; axis range 5600 to 6400)
Figure 4: Number of registers for 4 implementations (chart comparing 8 Parallel, 4 Parallel In, 2 Parallel In and Serial In; axis range 4520 to 4720)
Figure 5: Total Computation time for 4 implementations
S No.  Design Type    Registers  Combinational blocks  Pins  Cycles to 1D DCT  Cycles to 2D DCT  Cycles to 1D IDCT  Cycles to 2D IDCT
1      8 Parallel     4706       6390                  74    69                146               161                236
2      4 Parallel In  4706       6390                  42    70                147               162                237
3      2 Parallel In  4702       6380                  26    72                149               164                239
4      Serial In      4587       5869                  18    77                154               169                246
Table 1: Resource usage and number of cycles to compute various results at the 4 design points.
It can be noted from Figures 3, 4 and 5 that the total computation time of Serial In is 246
cycles against 236 cycles for 8 Parallel In, while the hardware requirement of the Serial In
implementation is noticeably lower (4587 registers and 5869 combinational blocks against
4706 and 6390).
(Figure 5 chart: Total Computation Time, plotted as cycles to 2D IDCT of an 8*8 block for 8 Parallel, 4 Parallel In, 2 Parallel In and Serial In; axis range 230 to 246)
Conclusion
It can be concluded that the Serial In implementation consumes about 6% less area than the
8 Parallel In implementation at a performance degradation of only about 4%. Hence, for
applications that are not performance critical and that call for low power and low area, the
Serial In implementation should be preferred over the other implementations.
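The 6% and 4% figures can be reproduced from Table 1 with a short calculation. In this sketch, area is taken to be the sum of registers and combinational blocks, which is an assumption made for illustration; the dictionary layout and variable names are hypothetical.

```python
# Values copied from Table 1.
TABLE = {
    "8 Parallel": {"registers": 4706, "combinational": 6390, "cycles_2d_idct": 236},
    "Serial In": {"registers": 4587, "combinational": 5869, "cycles_2d_idct": 246},
}

def area(design):
    """Assumed area proxy: registers plus combinational blocks."""
    entry = TABLE[design]
    return entry["registers"] + entry["combinational"]

area_saving = 1 - area("Serial In") / area("8 Parallel")
slowdown = (TABLE["Serial In"]["cycles_2d_idct"]
            / TABLE["8 Parallel"]["cycles_2d_idct"]) - 1

print(f"area saving of Serial In: {area_saving:.1%}")
print(f"extra cycles of Serial In: {slowdown:.1%}")
```

Under this assumption the saving works out to a little under 6% of the combined resource count, for a little over 4% more cycles to the 2D IDCT result.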
References
[1] Dae Won Kim, Taek Won Kwon, Jung Min Seo, Jae Kun Yu, Suk Kyu Lee, Jung Hee Suk,
Jun Rim Choi. A compatible DCT/IDCT architecture using hardwired distributed
arithmetic.
[2] A. Ben Atitallah, P. Kadionik, F. Ghozzi, P. Nouel, N. Masmoudi, Ph. Marchegay.
Optimization and implementation on FPGA of the DCT/IDCT algorithm.
[3] Muhammad Martuza, Carl McCrosky and Khan Wahid. A fast hybrid DCT
architecture supporting H.264, VC-1, MPEG-2, AVS and JPEG codecs.
[4] Taizo Suzuki and Masaaki Ikehara. Integer DCT Based on Direct-Lifting of
DCT-IDCT for Lossless-to-Lossy Image Coding.
[5] Hui-Cheng Hsu, Kun-Bin Lee, Nelson Yen-Chung Chang and Tian-Sheuan Chang.
Architecture Design of Shape-Adaptive Discrete Cosine Transform and Its Inverse
for MPEG-4 Video Coding.
[6] Kibum Suh, Kyung Yuk Min, Kyeounsoo Kim, Jong-Seog Koh, Jong-Wha Chong. A
design of DPCM hybrid coding loop using single 1-D DCT in MPEG-2 video encoder.