Proceedings of 2011 International Conference on Signal Processing, Communication, Computing and Networking Technologies (ICSCCN 2011)
AN OPTIMIZED ARCHITECTURE TO PERFORM IMAGE
COMPRESSION AND ENCRYPTION SIMULTANEOUSLY USING
MODIFIED DCT ALGORITHM
S V V Sateesh, VIT University, Vellore, TN, INDIA
R Sakthivel, VIT University, Vellore, TN, INDIA
K Nirosha, VIT University, Vellore, TN, INDIA
Harish M Kittur, VIT University, Vellore, TN, INDIA
Abstract:
Traditional fast Discrete Cosine Transform (DCT)/Inverse DCT (IDCT) algorithms have focused on reducing the arithmetic complexity. In this manuscript, we implement a new architecture for simultaneous image compression and encryption that is suitable for real-time applications. Contrary to traditional compression algorithms, only selected DCT outputs are calculated. For the encryption process, an LFSR generates a random number that is added to some of the DCT outputs. Both the DCT algorithm and the arithmetic operators used in it are optimized to achieve compression with fewer operators and a higher throughput. A High Performance Multiplier (HPM) is used for integer multiplications. Simulation results show a compression ratio of around 66% and a PSNR of about 24 dB. The throughput of this architecture is 624 M samples/s at a clock frequency of 78 MHz.
Index Terms- DCT, LFSR, HPM, Carry select
adder, JPEG, FPGA and ASIC.
1. INTRODUCTION
The discrete cosine transform (DCT) has been widely used in speech and image compression because it offers good energy compaction and low computational complexity. It has become an integral part of several standards, such as JPEG, MPEG-2, MPEG-4, CCITT Recommendation H.261 and H.263, and of HDTV applications [1]. Applications such as face recognition and detection need communication systems with a good security level and an acceptable transmission rate. However, for some applications the encryption and compression techniques cannot be deployed independently or in cascade without considering the impact of one technique on the other [2]. To solve this problem, we developed a new technique to simultaneously compress and encrypt images. The main idea of our approach consists, firstly, in multiplexing the spectra of different transformed images (to be compressed and encrypted) by a Discrete Cosine Transform (DCT) and, secondly, in implementing the proposed system on an FPGA. Consequently, special attention is given to the implementation of the DCT algorithm in the context of image compression, since the DCT is the heart of the proposed compression and encryption system. This paper focuses on reducing the number of multiplications, because general purpose multipliers are assumed to be the basic hardware elements for the computation of the DCT. Many different DCT architectures have been proposed to reduce the number of multipliers. Among these, the DCT algorithm proposed by Loeffler [3] has 11 multipliers and 29 adders, whereas the modified DCT has only 4 multipliers and 14 adders. These multipliers and adders are also optimized to realize a low power and fast DCT architecture: a High-Performance Multiplier (HPM) and a carry select adder are used as the multiplier and adder, respectively. The HPM reduction tree is completely regular.
978-1-61284-653-8/11/$26.00 ©2011 IEEE
This paper is organized as follows: previous work on the DCT is reviewed in section 2, and the proposed simultaneous compression and encryption system and the modified DCT architecture are described in section 3. Section 4 explains the HPM and the carry select adder. Implementation results are presented in section 5, and the last section concludes the paper.
2. RELATED WORK
The N-point DCT of N input samples x(0), ..., x(N-1) is defined as:

$$ X(n) = \sqrt{\frac{2}{N}}\, C(n) \sum_{k=0}^{N-1} x(k) \cos\frac{(2k+1)n\pi}{2N} \tag{1} $$
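As a reference point for the definition above, the direct evaluation of an N-point DCT can be sketched in Python. This is only an illustrative sketch: the function name `dct_direct` is hypothetical, and both the sqrt(2/N) scaling and the choice C(0) = 1/sqrt(2), C(n) = 1 otherwise are assumptions (the common orthonormal convention), since the text does not define C(n).

```python
import math

def dct_direct(x):
    """Direct N-point DCT of a sequence, evaluated term by term (O(N^2))."""
    n_points = len(x)
    scale = math.sqrt(2.0 / n_points)  # assumed orthonormal scaling
    out = []
    for n in range(n_points):
        # Assumed definition of C(n): 1/sqrt(2) for n = 0, 1 otherwise.
        c_n = 1.0 / math.sqrt(2.0) if n == 0 else 1.0
        acc = sum(
            x[k] * math.cos((2 * k + 1) * n * math.pi / (2 * n_points))
            for k in range(n_points)
        )
        out.append(scale * c_n * acc)
    return out

if __name__ == "__main__":
    block = [52, 55, 61, 66, 70, 61, 64, 73]  # arbitrary 8-pixel example
    print([round(v, 3) for v in dct_direct(block)])
```

Each output costs N multiplications in this form, which is the cost that fast algorithms such as Loeffler's are designed to avoid.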
In the literature, many fast DCT algorithms are reported. In [4], the authors show that the theoretical lower
limit of the 8-point DCT algorithm is 11 multiplications. Since the number of multiplications of Loeffler's algorithm [3] reaches this theoretical limit, we use it as the reference for this work. In [5], a realization based on the Loeffler algorithm is shown, and a low power design is obtained with it. The authors of [6] use the recursive DCT algorithm, and their design requires less area than conventional algorithms. They use Distributed Arithmetic (DA) multipliers and show that an N-point DCT can be obtained by computing N N/2-point inner products instead of N N-point inner products. In [7], a new DA architecture called NEDA is proposed, aimed at reducing the cost metrics of power and area while maintaining high speed and accuracy in digital signal processing (DSP) applications.
3. PROPOSED SYSTEM ARCHITECTURE
As the DCT is the heart of the proposed compression and encryption system, we mainly concentrate on optimizing the DCT architecture. Once this has been achieved, we turn to optimizing the arithmetic operators used in the DCT architecture. In this section, we first present the proposed compression and encryption system architecture and then the proposed DCT architecture for this system.
a. SYSTEM ARCHITECTURE
In this paper, we propose a new technique that carries out compression and simultaneous encryption using the Discrete Cosine Transform (DCT) and a random number generator, respectively. The main idea of our approach consists in multiplexing the spectra of different images, each transformed separately by a DCT. The choice of the DCT is justified by its use in many standards such as JPEG [8], MPEG [9] and ITU-T H.261 [10]. Moreover, we need fewer DCT coefficients than DFT coefficients to get a good approximation of a typical signal [11]. In fact, after applying the DCT, most of the signal information tends to be concentrated in a few low frequency components. Consequently, the higher frequency coefficients are small in magnitude and can be ignored in the compression and encryption process. The proposed compression and encryption system is shown in Fig. 1. On the left side, the pixels of an image to be compressed enter the system serially. To feed the DCT block, the image must be parallelized into blocks of 8 pixels. This operation can be done by a serial-to-parallel block
consisting of 8 flip-flops. Then, the DCT block is used to transform the input image. This DCT block generates only the lower frequency components of the transformed coefficients: by taking into account only the first and the second DCT outputs among 8, we can get a good approximation of the input pixels. Let the first DCT output be denoted Dctout1 and the second Dctout2. The values of Dctout1 are large compared to those of Dctout2, because Dctout1 is the sum of all 8 input pixels, whereas Dctout2 is built from differences of input pixels. In fact, as will be explained in the next section, the low value of Dctout2 is due to the spatial correlation between the 8 successive pixels of the input images. To ensure a good encryption level against any hacking attempt, we propose to add a positive random value to Dctout2 so that its values are close to those of Dctout1. The security key is sent separately as a private encryption key. Once the secure and compressed information safely reaches the authorized receiver, the image can be recovered by reversing the steps of the whole process, i.e. subtracting the security key from the received data and running an inverse DCT.
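The add-a-random-value step and its reversal at the receiver can be illustrated with a small Python sketch. Everything specific here is an assumption made for illustration: the paper does not give the LFSR width, taps or seeding, so the sketch uses an 8-bit Fibonacci LFSR with taps 8, 6, 5, 4 and treats the seed as the private key. The function names are hypothetical.

```python
def lfsr8_step(state):
    """Advance an 8-bit Fibonacci LFSR by one shift (taps 8, 6, 5, 4: assumed)."""
    feedback = ((state >> 7) ^ (state >> 5) ^ (state >> 4) ^ (state >> 3)) & 1
    return ((state << 1) & 0xFF) | feedback

def keystream(seed, count):
    """Return `count` successive LFSR states, starting from a non-zero 8-bit seed."""
    if not 0 < seed < 256:
        raise ValueError("seed must be a non-zero 8-bit value")
    values, state = [], seed
    for _ in range(count):
        values.append(state)  # always positive: a non-zero LFSR never reaches 0
        state = lfsr8_step(state)
    return values

def encrypt(dctout2_values, seed):
    """Add one positive pseudo-random value to each Dctout2 sample."""
    keys = keystream(seed, len(dctout2_values))
    return [v + k for v, k in zip(dctout2_values, keys)]

def decrypt(cipher_values, seed):
    """Regenerate the same key sequence from the seed and subtract it."""
    keys = keystream(seed, len(cipher_values))
    return [c - k for c, k in zip(cipher_values, keys)]

if __name__ == "__main__":
    samples = [-12, 3, 0, 27, -5]  # arbitrary small Dctout2 values
    protected = encrypt(samples, seed=0x5A)
    print(protected, decrypt(protected, seed=0x5A) == samples)
```

A receiver that knows the seed recovers the original values exactly, because the subtraction undoes the addition sample by sample.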
b. MODIFIED DCT ARCHITECTURE
The modified DCT architecture is inspired by the Loeffler DCT model, so the two share similarities. The DCT outputs are calculated in four stages, each with a different number of adders and subtractors. The first stage has four adders and four subtractors, the second stage has two adders and four multipliers, the third stage has three adders and the fourth stage has only one adder. The optimization details of these blocks are explained in the next section.
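As a quick consistency check, the per-stage counts above can be tallied against the totals quoted elsewhere in the paper (4 multipliers and 14 adders), counting subtractors together with adders:

$$ \underbrace{(4 + 4)}_{\text{stage 1}} + \underbrace{2}_{\text{stage 2}} + \underbrace{3}_{\text{stage 3}} + \underbrace{1}_{\text{stage 4}} = 14 \ \text{adders/subtractors}, \qquad \underbrace{4}_{\text{stage 2}} = 4 \ \text{multipliers} $$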
In traditional DCT algorithms, 8 pixels are given as input and 8 DCT outputs are generated. The modified DCT circuit also accepts 8 pixels per clock cycle but generates only 2 outputs, against 8 outputs in the original Loeffler algorithm. The architecture is therefore changed to calculate only the necessary DCT coefficients, Dctout1 and Dctout2. It should be noted that the DCT alone does not compress the image, because the transform is almost lossless. Usually, quantization and encoding follow the DCT step to achieve compression. Quantization makes use of the fact that higher frequency components
are less important than low frequency components. It allows varying levels of image compression and quality through the selection of specific quantization matrices. Quality levels ranging from 1 to 100 can thus be selected, where 1 gives the poorest image quality and highest compression, while 100 gives the best quality and lowest compression. The encoder creates a fixed- or variable-length code to represent the quantizer's output and maps the output in accordance with the code. In most cases, a variable-length code is used. An entropy encoder further compresses the quantized values to provide more efficient compression. The most important types of entropy encoders used in lossy image compression are the arithmetic encoder, the Huffman encoder and the run-length encoder. The quantization and encoding steps therefore increase computational time and latency.
Fig 1 Compression and Encryption System (block diagram: input image pixels, serial-to-parallel converter, modified DCT with optimized arithmetic operators, LFSR random number generator, parallel-to-serial converter, transmission, decryption, output image pixels)

The proposed compression system does not require quantization and encoding stages; the modified DCT architecture itself performs the required compression. To achieve this, the DCT architecture is changed as follows. First, only the calculations needed to obtain Dctout1 and Dctout2 are performed. Thus, we save 5 multipliers, 2 adders and 2 subtractors compared to the Loeffler architecture. Moreover, in stages 2, 3 and 4 only adders are needed to obtain the required outputs, so 6 additional subtractors are saved. The calculations made in each stage are given below. Let the 8 input pixels be In1, In2, In3, In4, In5, In6, In7 and In8, and let the output DCT values be Dctout1 and Dctout2.
Stage 1
X1 = In1 + In8; Y1 = In1 - In8;
X2 = In2 + In7; Y2 = In2 - In7;
X3 = In3 + In6; Y3 = In3 - In6;
X4 = In4 + In5; Y4 = In4 - In5;
Stage 2
X5 = X1 + X2;
X6 = X3 + X4;
Z1 = Y4 * (cos(3π/16) + sin(3π/16));
Z2 = Y3 * (cos(3π/16) - sin(3π/16));
Z3 = Y1 * (cos(π/16) + sin(π/16));
Z4 = Y2 * (cos(π/16) - sin(π/16));
Stage 3
Dctout1 = X5 + X6;
X8 = Z1 + Z4;
X9 = Z2 + Z3;
Stage 4
Dctout2 = X8 + X9;
When we combine all the equations, we get the following two DCT output equations:
Dctout1 = In1 + In2 + In3 + In4 + In5 + In6 + In7 + In8   (2)
Dctout2 = Y2 * (cos(π/16) - sin(π/16))
        + Y3 * (cos(3π/16) - sin(3π/16))
        + Y4 * (cos(3π/16) + sin(3π/16))
        + Y1 * (cos(π/16) + sin(π/16))   (3)
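The four stages can be transcribed almost line by line into Python, which makes it easy to check that they collapse to Eqs. (2) and (3). This is a behavioral sketch only: the function and constant names are hypothetical, and the coefficients are kept in floating point, whereas the hardware uses integer multipliers whose fixed-point scaling is not modeled here.

```python
import math

# Stage-2 multiplier constants (floating point; fixed-point scaling not modeled).
K_Y1 = math.cos(math.pi / 16) + math.sin(math.pi / 16)
K_Y2 = math.cos(math.pi / 16) - math.sin(math.pi / 16)
K_Y3 = math.cos(3 * math.pi / 16) - math.sin(3 * math.pi / 16)
K_Y4 = math.cos(3 * math.pi / 16) + math.sin(3 * math.pi / 16)

def modified_dct(pixels):
    """Return (Dctout1, Dctout2) for one block of 8 pixels."""
    if len(pixels) != 8:
        raise ValueError("expected a block of 8 pixels")
    in1, in2, in3, in4, in5, in6, in7, in8 = pixels

    # Stage 1: four adders and four subtractors.
    x1, y1 = in1 + in8, in1 - in8
    x2, y2 = in2 + in7, in2 - in7
    x3, y3 = in3 + in6, in3 - in6
    x4, y4 = in4 + in5, in4 - in5

    # Stage 2: two adders and four constant multipliers.
    x5 = x1 + x2
    x6 = x3 + x4
    z1 = y4 * K_Y4
    z2 = y3 * K_Y3
    z3 = y1 * K_Y1
    z4 = y2 * K_Y2

    # Stage 3: three adders.
    dctout1 = x5 + x6
    x8 = z1 + z4
    x9 = z2 + z3

    # Stage 4: one adder.
    dctout2 = x8 + x9
    return dctout1, dctout2

if __name__ == "__main__":
    print(modified_dct([52, 55, 61, 66, 70, 61, 64, 73]))  # arbitrary example block
```

For a perfectly flat block all differences Y1 to Y4 vanish, so Dctout2 is zero, which illustrates why spatially correlated pixels lead to small Dctout2 values.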
According to the Dctout2 equation, four multipliers are used. The original DCT architecture requires 2 adders, 2 subtractors and 8 multipliers to compute the outputs, and Loeffler reduces this to 6 multipliers and 6 adders. In this work, Dctout2 is calculated using only 4 multipliers, which saves 6 adders and 2 multipliers. With these three optimization levels, i.e. stages 2, 3 and 4, the modified DCT architecture requires 4 multipliers and 14 adders to compute relevant and representative outputs for image compression, against the 11 multipliers and 29 adders of the algorithm proposed by Loeffler [3].
As mentioned earlier, the values of Dctout1 are large compared to those of Dctout2. At the input of the proposed method, the pixels are encoded as signed 9-bit values. At the output, Dctout1 contains the major part of the information, so it is encoded using 12 bits. For the encoding of Dctout2, we can take into account the spatial correlation of images. To bring the Dctout2 values close to those of Dctout1, we add a random number generated by the LFSR, which also encrypts the Dctout2 coefficients. Finally, Dctout2 is encoded using 12 bits. When the hardware is replicated for four images, the compression ratio is given by
$$ R = \left(1 - \frac{12\ \text{bits}}{4 \times 9\ \text{bits}}\right) \times 100\% = 66.66\% \tag{4} $$
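The ratio in Eq. (4) is plain arithmetic and can be checked in a few lines of Python. The helper name `compression_ratio` is hypothetical. The second call is our own cross-check, not a formula from the paper: it assumes one 8-pixel block of 9-bit inputs (72 bits) is replaced by two 12-bit outputs (24 bits), which gives the same ratio.

```python
def compression_ratio(bits_out, bits_in):
    """Percentage of bits saved: (1 - bits_out / bits_in) * 100."""
    if bits_in <= 0:
        raise ValueError("bits_in must be positive")
    return (1.0 - bits_out / bits_in) * 100.0

if __name__ == "__main__":
    print(compression_ratio(12, 4 * 9))      # Eq. (4) as written
    print(compression_ratio(2 * 12, 8 * 9))  # per-block cross-check (assumption)
```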
4. HPM AND CARRY SELECT ADDER
The modified DCT architecture consists of adder, subtractor and multiplier blocks. These blocks have the structure shown in Fig. 2. The adder used in this architecture is a carry select adder, and the multiplier is a High-Performance Multiplier (HPM).
Fig 2 Adder and Multiplier
a. HPM
In any integer multiplication, three stages of operations are performed. In the first stage, the partial
product matrix is formed. In the second stage, this partial product matrix is reduced to a height of two. In the final stage, these two rows are combined using a carry-propagating adder. The HPM is a reduction tree that is completely regular in structure; it is easy to place and route and has a logarithmic logic depth. The reduction tree can be explained with the encircling scheme shown in Fig. 3. Each dot represents a partial product bit, and at each step the height of the tree is reduced by one. Finally, when only two rows of partial products remain, a carry-propagating adder is used to get the result.
The High-Performance Multiplier (HPM) reduction tree has an O(log N) delay dependence on the word length N. The connectivity of the adding cells in the reduction tree is regular, so routing becomes trivial; this can be exploited in any type of design method, whether fully automatic, custom, or somewhere in between. In contrast to other logarithmic multipliers, such as Wallace, the design effort for a custom-made HPM multiplier is very limited; in fact, it is as low as for a textbook array multiplier. The predictable wiring resulting from the regularity of the HPM tree enables both systematic sizing of the logic circuitry and systematic wire spacing engineering so as to minimize total multiplier delay.
Fig 3 HPM reduction tree (dot diagram of the partial product matrix, MSB on the left, LSB on the right)
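The three stages of integer multiplication described above can be sketched at bit level in Python. Note the limits of this sketch: it uses a generic greedy column compression built from full adders only, and it does not reproduce the regular interconnection pattern that distinguishes the HPM tree. The function name, the unsigned operands and the default 8-bit width are assumptions for illustration.

```python
def multiply_by_reduction(a, b, width=8):
    """Multiply two unsigned `width`-bit integers via partial-product reduction."""
    if not (0 <= a < 1 << width and 0 <= b < 1 << width):
        raise ValueError("operands must fit in `width` bits")

    # Stage 1: partial product matrix, one list of bits per column (weight 2**column).
    columns = [[] for _ in range(2 * width)]
    for i in range(width):
        for j in range(width):
            columns[i + j].append((a >> i) & (b >> j) & 1)

    # Stage 2: compress with full adders until no column is taller than two.
    while any(len(col) > 2 for col in columns):
        reduced = [[] for _ in range(len(columns) + 1)]
        for c, col in enumerate(columns):
            bits = list(col)
            while len(bits) >= 3:
                x, y, z = bits.pop(), bits.pop(), bits.pop()
                reduced[c].append(x ^ y ^ z)                         # sum stays in column c
                reduced[c + 1].append((x & y) | (x & z) | (y & z))   # carry moves to column c + 1
            reduced[c].extend(bits)  # leftover bits pass through unchanged
        columns = reduced

    # Stage 3: the two remaining rows are summed by a carry-propagating adder.
    row0 = sum(col[0] << c for c, col in enumerate(columns) if len(col) >= 1)
    row1 = sum(col[1] << c for c, col in enumerate(columns) if len(col) >= 2)
    return row0 + row1

if __name__ == "__main__":
    print(multiply_by_reduction(13, 11), 13 * 11)
```

Each full adder preserves the weighted sum of the bits it consumes, so the value of the matrix is unchanged by stage 2 and the final addition yields the exact product.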
b. Carry select adder
Carry select adders are relatively fast compared to ripple carry adders. The schematic is shown in Fig. 4. The bits are divided into a small number of blocks, and each block is computed using a ripple carry adder. For each block, the computation is done in parallel for both an input carry of 1 and an input carry of 0. Multiplexers controlled by the carry bits then choose the right output. The number of logic levels in this adder is around sqrt(n), where n is the number of bits.
Fig 4 Carry select adder Architecture (each 4-bit block is added twice, once with carry 0 and once with carry 1, giving a carry bit plus a 4-bit result in each case)
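A behavioral Python sketch of the carry select principle is given below. The 4-bit block size follows the labels of Fig. 4, while the 12-bit default width and the function names are assumptions made for illustration. In hardware the two candidate sums of every block are computed at the same time; the sequential loop here only models the selection logic.

```python
def ripple_add(a, b, carry_in, width):
    """Ripple-carry addition of two `width`-bit values; returns (sum, carry_out)."""
    total, carry = 0, carry_in
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total, carry

def carry_select_add(a, b, width=12, block=4):
    """Add two unsigned `width`-bit values block by block; returns (sum, carry_out)."""
    if width % block != 0:
        raise ValueError("width must be a multiple of the block size")
    mask = (1 << block) - 1
    result, carry = 0, 0
    for lo in range(0, width, block):
        a_blk, b_blk = (a >> lo) & mask, (b >> lo) & mask
        # Both candidates exist before the incoming carry is known.
        sum0, cout0 = ripple_add(a_blk, b_blk, 0, block)
        sum1, cout1 = ripple_add(a_blk, b_blk, 1, block)
        # Multiplexer: the incoming carry selects the matching candidate.
        blk_sum, carry = (sum1, cout1) if carry else (sum0, cout0)
        result |= blk_sum << lo
    return result, carry

if __name__ == "__main__":
    print(carry_select_add(1500, 2700))
```

Only the multiplexer selection depends on the previous block, which is why the critical path is shorter than in a single long ripple carry chain.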
5. RESULTS
Initially, the proposed system is simulated in Matlab. This step is important to validate the algorithm structure before its implementation on FPGA or ASIC. As the description language, we use Verilog HDL, as it is easier to use and gives better visibility of hardware details than the alternatives. Further, Verilog HDL allows targeting different devices such as ASICs and FPGAs. Simulation results of the Matlab model are shown in Fig. 5. The PSNR value for the images is 24 dB.
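For readers who want to reproduce a PSNR figure of this kind, a minimal Python sketch of the usual definition, 10 log10(MAX^2 / MSE), follows. The paper does not state which peak value it uses, so the 8-bit peak of 255 is an assumption, as is the function name.

```python
import math

def psnr(original, reconstructed, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

if __name__ == "__main__":
    reference = [100, 120, 140, 160]
    degraded = [p + 16 for p in reference]  # uniform error of 16 gray levels
    print(round(psnr(reference, degraded), 2))
```

Under this assumption, a PSNR of about 24 dB corresponds to an RMS error of roughly 16 gray levels on 8-bit images.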
Fig 5a Input images
Fig 5b Output images

The original Loeffler DCT architecture, the modified DCT architecture [12] and the modified DCT architecture with optimized operators have been implemented on the same kind of FPGA, namely a Virtex-5 xc5vlx330t. To illustrate the differences in hardware consumption, the FPGA implementation results are presented in Table 1. Table 2 compares the compression and encryption system with and without optimized operators. This comparison shows that the modified DCT architecture with optimized operators reduces the area consumption (slices and Look-Up Tables, LUTs). Furthermore, the throughput of this compression system is higher than in previous works.
Table 1 Synthesis results
Characteristics                          Slice registers   Slice LUT   Fully used LUT   Throughput (MS/s)
Loeffler                                 507               1293        316              191.867
Modified DCT                             247               492         162              206.423
Modified DCT with optimized operators    23                432         0                624
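The throughput of the optimized design is consistent with the figures given in the abstract, assuming throughput is counted in input pixels per second and that 8 pixels are accepted per clock cycle at 78 MHz:

$$ \text{throughput} = 8\ \frac{\text{samples}}{\text{cycle}} \times 78 \times 10^{6}\ \frac{\text{cycles}}{\text{s}} = 624 \times 10^{6}\ \frac{\text{samples}}{\text{s}} = 624\ \text{MS/s} $$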
An ASIC implementation of the proposed compression and encryption system has been done using Cadence tools. The RTL code has been synthesized in RTL Compiler using SAGE-Modeler TSMC 0.18 µm technology libraries at 1.8 V to get the timing, area and power results, which are tabulated in Table 3. The backend of the design has been done in SoC Encounter, and the final chip layout is shown in Fig. 6.
Table 2 Synthesis results
Characteristics                                   Slice registers   Slice LUT   Fully used LUT   Throughput (MS/s)
Compression system without optimized operators    1536              2058        955              206.423
Compression system with optimized operators       367               1085        251              624
Table 3 Synthesis results
Results                                         Timing (MHz)   Power (mW)   Area (µm²)
Proposed compression and encryption system      83.626         9.88         70490
Fig 6 Chip layout
6. CONCLUSION
In this paper, a new method of simultaneous compression and encryption is presented. The modified DCT algorithm is an optimized model in terms of the number of arithmetic operators: it needs only 4 multipliers and 14 adders. The arithmetic operators used in the DCT model are also optimized in order to increase the throughput and decrease the power consumption. The FPGA implementation of the whole method shows improvement in terms of throughput, area and power consumption compared to existing methods. An ASIC implementation of the whole method, from RTL to GDSII, has been done.
REFERENCES
[1] T.H. Chen, A cost-effective 8*8 2-D IDCT core processor with folded architecture, IEEE Transactions on Consumer Electronics, Vol. 45, No. 2, pp. 333-339, 1999.
[2] A. Alfalou and C. Brosseau, Optical image compression and encryption methods, OSA Advances in Optics and Photonics, Vol. 1, pp. 589-636, 2009.
[3] C. Loeffler, A. Ligtenberg and G.S. Moschytz, Practical fast 1-D DCT algorithms with 11 multiplications, IEEE ICASSP, pp. 988-991, May 1989.
[4] P. Duhamel and H. H'mida, New 2^n DCT algorithms suitable for VLSI implementation, IEEE ICASSP, pp. 1805-1808, November 1987.
[5] C.Y. Pai, W.E. Lynch and A.J. Al-Khalili, Low-power data-dependent 8x8 DCT/IDCT for video compression, IEE Proceedings - Vision, Image and Signal Processing, Vol. 150, pp. 245-254, August 2003.
[6] S. Yu and E.E. Swartzlander Jr., DCT implementation with distributed arithmetic, IEEE Transactions on Computers, Vol. 50, No. 9, pp. 985-991, September 2001.
[7] A. Shams, A. Chidanandan, W. Pan and M.A. Bayoumi, NEDA: A low-power high-performance DCT architecture, IEEE Transactions on Signal Processing, Vol. 54, No. 3, pp. 955-964, 2006.
[8] ISO/IEC JTC1/SC2/WG8, JPEG-8-R8, JPEG technical specification, 1990.
[9] ISO/IEC JTC1/SC2/WG11, MPEG 90/176, Coding of moving picture and associated audio, 1990.
[10] ISO/IEC DIS 10918-1, Digital compression and coding of continuous-tone still images, 1992.
[11] J.F. Blinn, What's the deal with the DCT?, IEEE Computer Graphics and Applications, pp. 78-83, July 1993.
[12] M. Jridi and A. Alfalou, A VLSI Implementation of a New Simultaneous Image Compression and Encryption Method, IEEE, 2010.