
Proceedings of 2011 International Conference on Signal Processing, Communication, Computing and Networking Technologies (ICSCCN 2011)

AN OPTIMIZED ARCHITECTURE TO PERFORM IMAGE COMPRESSION AND ENCRYPTION SIMULTANEOUSLY USING MODIFIED DCT ALGORITHM

S V V Sateesh, VIT University, Vellore, TN, INDIA, [email protected]

R Sakthivel, VIT University, Vellore, TN, INDIA, [email protected]

K Nirosha, VIT University, Vellore, TN, INDIA, [email protected]

Harish M Kittur, VIT University, Vellore, TN, INDIA, [email protected]

Abstract:

Traditional fast Discrete Cosine Transform (DCT)/Inverse DCT (IDCT) algorithms have focused on reducing the arithmetic complexity. In this manuscript, we implement a new architecture for a simultaneous image compression and encryption technique suitable for real-time applications. Here, contrary to traditional compression algorithms, only specific DCT outputs are calculated. For the encryption process, an LFSR is used to generate a random number, which is added to some DCT outputs. Both the DCT algorithm and the arithmetic operators used in it are optimized in order to realize compression with reduced operator requirements and a higher throughput. A High Performance Multiplier (HPM) is used for integer multiplications. Simulation results show a compression ratio of around 66% and a PSNR of about 24 dB. The throughput of this architecture is 624 M samples/s at a clock frequency of 78 MHz.

Index Terms- DCT, LFSR, HPM, Carry select adder, JPEG, FPGA and ASIC.

1. INTRODUCTION

The discrete cosine transform (DCT) has been widely used in speech and image compression because it offers good energy compaction and low computational complexity. It has become an integral part of several standards, such as JPEG, MPEG-2, MPEG-4, CCITT Recommendation H.261 and H.263, and of HDTV applications [1]. Applications such as face recognition and detection require communication systems with a good security level and an acceptable transmission rate. However, for some applications the encryption and compression techniques cannot be deployed independently or in cascade without considering the impact of one technique on the other [2]. To solve this problem, we developed a new technique to compress and encrypt images simultaneously. The main idea of our approach consists, firstly, in multiplexing the spectra of different images (to be compressed and encrypted) transformed by a Discrete Cosine Transform (DCT) and, secondly, in implementing the proposed system in an FPGA. Consequently, special attention is given to the implementation of the DCT algorithm in the context of image compression. In fact, the DCT is the heart of the proposed compression and encryption system. This paper focuses on reducing the number of multiplications, because general-purpose multipliers are assumed to be the basic hardware elements for the computation of the DCT. Many different DCT architectures have been proposed to reduce the number of multipliers. Among these, the DCT algorithm proposed by Loeffler [3] has 11 multipliers and 29 adders. In comparison, the modified DCT has only 4 multipliers and 14 adders. These multipliers and adders are also optimized to realize a low-power and fast DCT architecture. A High-Performance Multiplier (HPM) and a carry select adder are used as the multipliers and adders, respectively. The HPM reduction tree is completely regular.

978-1-61284-653-8/11/$26.00 ©2011 IEEE

This paper is organized as follows: previous work on the DCT is reviewed in section 2, and the proposed simultaneous compression and encryption system and the modified DCT architecture are described in section 3. Section 4 explains the HPM and the carry select adder. Implementation results are presented in section 5, and the last section concludes the paper.

2. RELATED WORK

The N-point DCT of N input samples x(0), ..., x(N-1) is defined as:

$$X(n) = \frac{2}{N}\, C(n) \sum_{k=0}^{N-1} x(k) \cos\frac{(2k+1)\,n\pi}{2N} \qquad (1)$$
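As a reference point for the fast algorithms discussed in this section, Eq. (1) can be evaluated directly. The sketch below is a minimal floating-point illustration, not the paper's hardware: the function name `dct_1d` is hypothetical, and the normalization C(0) = 1/sqrt(2), C(n) = 1 otherwise is an assumption, since the text does not define C(n).

```python
import math

def dct_1d(x):
    """Direct N-point DCT following Eq. (1); needs on the order of N*N multiplications."""
    n_points = len(x)
    out = []
    for n in range(n_points):
        # Assumed normalization (not stated in the text):
        # C(0) = 1/sqrt(2), C(n) = 1 for n > 0.
        c = 1.0 / math.sqrt(2.0) if n == 0 else 1.0
        acc = sum(
            x[k] * math.cos((2 * k + 1) * n * math.pi / (2 * n_points))
            for k in range(n_points)
        )
        out.append((2.0 / n_points) * c * acc)
    return out

# A constant block has all of its energy in the first coefficient.
coeffs = dct_1d([1.0] * 8)
```

For a constant block every coefficient except the first is zero up to rounding, which illustrates the energy compaction the DCT is used for.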

Many fast DCT algorithms are reported in the literature. In [4], the authors show that the theoretical lower limit for an 8-point DCT algorithm is 11 multiplications. Since the number of multiplications in Loeffler's algorithm [3] reaches this theoretical limit, we use it as the reference for this work. In [5], a realization based on the Loeffler algorithm is shown, and a low-power design is obtained with it. The authors of [6] use a recursive DCT algorithm, and their design requires less area than conventional algorithms. They use Distributed Arithmetic (DA) multipliers and show that an N-point DCT can be obtained by computing N N/2-point inner products instead of N N-point inner products. In [7], a new DA architecture called NEDA is proposed, aimed at reducing the cost metrics of power and area while maintaining high speed and accuracy in digital signal processing (DSP) applications.

3. PROPOSED SYSTEM ARCHITECTURE

As the DCT is the heart of the proposed compression and encryption system, we mainly concentrate on optimizing the DCT architecture. Once this has been achieved, we turn to optimizing the arithmetic operators used in the DCT architecture. In this section, we first present the architecture of the proposed compression and encryption system and then the DCT architecture designed for it.

a. SYSTEM ARCHITECTURE

In this paper, we propose a new technique that carries out compression and encryption simultaneously, using the Discrete Cosine Transform (DCT) and a random number generator respectively. The main idea of our approach consists in multiplexing the spectra of different images, each transformed separately by a DCT. The choice of the DCT is justified by its use in many standards such as JPEG [8], MPEG [9] and ITU-T H.261 [10]. Moreover, fewer DCT coefficients than DFT coefficients are needed to get a good approximation of a typical signal [11]. In fact, after applying the DCT, most of the signal information tends to be concentrated in a few low-frequency components. Consequently, the higher-frequency coefficients are small in magnitude and can be ignored in the compression and encryption process. The proposed compression and encryption system is shown in Fig 1. On the left side, the pixels of the image to be compressed enter the system serially. To feed the DCT block, the image must be parallelized into blocks of 8 pixels. This operation is done by a serial-to-parallel block consisting of 8 flip-flops. The DCT block then transforms the input image. It generates only the lower-frequency components of the transformed coefficients: by taking into account only the first and the second DCT outputs among the 8, we can get a good approximation of the input pixels. Let the first DCT output be denoted Dctout1 and the second Dctout2. The data values of Dctout1 are high compared to those of Dctout2, because Dctout1 is the sum of all 8 input pixels whereas Dctout2 is formed from differences of input pixels. In fact, as explained in the next section, the low value of Dctout2 is due to the spatial correlation between 8 successive pixels of the input image. To ensure a good encryption level against any hacking attempt, we propose to add a positive random value to Dctout2 so that its data values are close to those of Dctout1. The security key is sent separately as a private encryption key. Once the secured and compressed information safely reaches the authorized receiver, the image can be extracted easily by reversing the steps of the whole process, that is, by subtracting the security key from the received image data and running an inverse DCT.
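The encryption step described above (add an LFSR-generated positive value to Dctout2, subtract it again at the receiver) can be sketched as follows. This is only an illustration under stated assumptions: the paper does not give the LFSR width, feedback taps, seed or key width, so the 16-bit Fibonacci LFSR with taps 16, 14, 13, 11, the seed `0xACE1` and the 10-bit key mask below are all hypothetical choices, as are the function names.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR by one step (assumed taps: 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def keystream(seed, count, mask=0x3FF):
    """Return `count` non-negative key values; `mask` (assumed) limits the key width."""
    state = seed
    keys = []
    for _ in range(count):
        state = lfsr16_step(state)
        keys.append(state & mask)
    return keys

def encrypt(dctout2_values, seed):
    """Add one positive pseudo-random value to each Dctout2 coefficient."""
    keys = keystream(seed, len(dctout2_values))
    return [v + k for v, k in zip(dctout2_values, keys)]

def decrypt(received_values, seed):
    """Regenerate the same key sequence from the seed and subtract it."""
    keys = keystream(seed, len(received_values))
    return [v - k for v, k in zip(received_values, keys)]

seed = 0xACE1                      # hypothetical private key (must be non-zero)
plain = [-37, 12, 0, 85, -4]       # example Dctout2 values, made up for illustration
cipher = encrypt(plain, seed)
recovered = decrypt(cipher, seed)
```

Because the receiver regenerates the same sequence from the seed, only the seed has to be shared as the private key in this sketch.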

b. MODIFIED DCT ARCHITECTURE

The modified DCT architecture is inspired by the Loeffler DCT model, so the two share similarities. The DCT outputs are calculated in four stages, each with a different number of adders and subtractors. The first stage has four adders and four subtractors, the second stage has two adders and four multipliers, the third stage has three adders and the fourth stage has only one adder. The optimization details of these blocks are explained in the next section.

In traditional DCT algorithms, 8 pixels are given as input and 8 DCT outputs are generated. The modified DCT circuit, however, accepts 8 pixels per clock cycle and generates only 2 outputs, against 8 outputs in the original Loeffler algorithm. The architecture must therefore be changed to calculate only the necessary DCT coefficients, Dctout1 and Dctout2. It should be noted that the DCT alone does not compress the image, because it is almost lossless. Usually, quantization and encoding follow the DCT step to achieve compression. Quantization makes use of the fact that higher-frequency components are less important than low-frequency components. It allows varying levels of image compression and quality through the selection of specific quantization matrices. Thus quality levels ranging from 1 to 100 can be selected, where 1 gives the poorest image quality and highest compression, while 100 gives the best quality and lowest compression. The encoder creates a fixed- or variable-length code to represent the quantizer's output and maps the output in accordance with the code. In most cases, a variable-length code is used. An entropy encoder further compresses the quantized values to provide more efficient compression. The most important types of entropy encoders used in lossy image compression techniques are the arithmetic encoder, the Huffman encoder and the run-length encoder. The quantization and encoding steps therefore increase computational time and latency.

The proposed compression system does not require quantization and encoding stages. The modified DCT architecture itself performs the required compression. To achieve this, the DCT architecture is changed as follows. First, only the calculations needed to obtain Dctout1 and Dctout2 are made. Thus, we can economize 5 multipliers, 2 adders and 2 subtractors compared to the Loeffler architecture. We can also notice that in stages 2, 3 and 4 only adders are used to get the required outputs. Consequently, 6 additional subtractors can be saved.

Fig 1 Compression and Encryption System

The following are the calculations made in each stage. Let the 8 input pixels be In1, In2, In3, In4, In5, In6, In7 and In8, and let the output DCT values be Dctout1 and Dctout2.

Stage 1

X1 = In1 + In8; Y1 = In1 - In8;
X2 = In2 + In7; Y2 = In2 - In7;
X3 = In3 + In6; Y3 = In3 - In6;
X4 = In4 + In5; Y4 = In4 - In5;

Stage 2

X5 = X1 + X2;
X6 = X3 + X4;
Z1 = Y4 * (cos(3π/16) + sin(3π/16));
Z2 = Y3 * (cos(3π/16) - sin(3π/16));
Z3 = Y1 * (cos(π/16) + sin(π/16));
Z4 = Y2 * (cos(π/16) - sin(π/16));

Stage 3

Dctout1 = X5 + X6;
X8 = Z1 + Z4;
X9 = Z2 + Z3;

Stage 4

Dctout2 = X8 + X9;

When we combine all the equations, we get the following two DCT output equations:

Dctout1 = (In1 + In2 + In3 + In4 + In5 + In6 + In7 + In8)    (2)

Dctout2 = Y2 * (cos(π/16) - sin(π/16))
        + Y3 * (cos(3π/16) - sin(3π/16))
        + Y4 * (cos(3π/16) + sin(3π/16))
        + Y1 * (cos(π/16) + sin(π/16))    (3)
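The four stages can be traced in software to check that they reproduce the combined equations (2) and (3). The sketch below is a floating-point model only; the hardware works on fixed-point integers, and the function name `modified_dct` and the sample pixel block are assumptions made for illustration.

```python
import math

def modified_dct(pixels):
    """Return (Dctout1, Dctout2) for one block of 8 pixels, stage by stage."""
    in1, in2, in3, in4, in5, in6, in7, in8 = pixels
    c1, s1 = math.cos(math.pi / 16), math.sin(math.pi / 16)
    c3, s3 = math.cos(3 * math.pi / 16), math.sin(3 * math.pi / 16)

    # Stage 1: four adders and four subtractors.
    x1, x2, x3, x4 = in1 + in8, in2 + in7, in3 + in6, in4 + in5
    y1, y2, y3, y4 = in1 - in8, in2 - in7, in3 - in6, in4 - in5

    # Stage 2: two adders and four constant multipliers.
    x5 = x1 + x2
    x6 = x3 + x4
    z1 = y4 * (c3 + s3)
    z2 = y3 * (c3 - s3)
    z3 = y1 * (c1 + s1)
    z4 = y2 * (c1 - s1)

    # Stage 3: three adders.
    dctout1 = x5 + x6
    x8 = z1 + z4
    x9 = z2 + z3

    # Stage 4: one adder.
    dctout2 = x8 + x9
    return dctout1, dctout2

block = [52, 55, 61, 66, 70, 61, 64, 73]   # made-up example pixels
dctout1, dctout2 = modified_dct(block)
```

Counting the operations in the comments gives 14 adders/subtractors and 4 multipliers, matching the operator count stated for the modified architecture. For a block of identical pixels all differences Y1..Y4 are zero, so Dctout2 is zero, which is the spatial-correlation effect mentioned earlier.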



According to the Dctout2 equation, the number of multipliers used is four. In fact, the original DCT architecture requires 2 adders, 2 subtractors and 8 multipliers to compute the outputs. Loeffler reduces the number of arithmetic operators to 6 multipliers and 6 adders. In this work, Dctout2 can be calculated using only 4 multipliers. In this way, we economize 6 adders and 2 multipliers. Using these three optimization levels, i.e. stages 2, 3 and 4, the modified DCT architecture requires 4 multipliers and 14 adders to compute relevant and representative data outputs for image compression, against the 11 multipliers and 29 adders proposed by Loeffler [3].

As mentioned earlier in this paper, the values of Dctout1 are large compared to those of Dctout2. On the input side of the proposed method, the pixels are encoded as signed 9-bit values. On the output side, Dctout1 contains the major part of the information, so it is encoded using 12 bits. To encode Dctout2, we can take into account the spatial correlation of images. To bring the Dctout2 values close to those of Dctout1, we add a random number generated by an LFSR, which also encrypts the Dctout2 coefficients. Finally, Dctout2 is also encoded using 12 bits. When the hardware is replicated for four images, the compression ratio is given by

$$R = \left(1 - \frac{12\ \text{bits}}{4 \times 9\ \text{bits}}\right) \times 100\% = 66.66\% \qquad (4)$$

4. HPM AND CARRY SELECT ADDER

The modified DCT architecture is built from adder, subtractor and multiplier blocks. These blocks have the structure shown in Fig. 2. The adder used in this architecture is a carry select adder and the multiplier is a High Performance Multiplier (HPM).

Fig 2 Adder and Multiplier

a. HPM

In any integer multiplication, three stages of operations are performed. In the first stage, the partial product matrix is formed. In the second stage, this partial product matrix is reduced to a height of two. In the final stage, these two rows are combined using a carry propagating adder. The HPM is a reduction tree that is completely regular in structure. The HPM reduction tree is easy to place and route and has a logarithmic logic depth. The reduction tree can be explained with the encircling scheme shown in Fig. 3. Each dot represents a partial product, and at each step the height of the tree is reduced by one. Finally, when two rows of partial products remain, a carry propagating adder is used to get the result.

The High-Performance Multiplier (HPM) reduction tree has an O(log N) delay dependence on the word length N. The connectivity of the adding cells in the reduction tree is regular, so that routing becomes trivial; this can be exploited in any type of design method, whether fully automatic, custom, or somewhere in between. In contrast to other logarithmic multipliers, such as Wallace, the design effort for a custom-made HPM multiplier is very limited; in fact it is as low as for a textbook array multiplier. The predictable wiring resulting from the regularity of the HPM tree enables both systematic sizing of logic circuitry and systematic wire spacing engineering so as to minimize the total multiplier delay.
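The three multiplication stages (partial product matrix, reduction to two rows, final carry-propagating addition) can be modelled bit by bit. The sketch below is a generic column-compression model built from full adders; it does not reproduce the specific regular wiring of the HPM tree, and the function name and the default 8-bit unsigned operand width are assumptions.

```python
def multiply_reduction_tree(a, b, width=8):
    """Unsigned multiply via partial products, column reduction, and a final add."""
    # Stage 1: partial product matrix. Bit (i, j) has weight 2**(i + j).
    columns = [[] for _ in range(2 * width)]
    for i in range(width):
        for j in range(width):
            columns[i + j].append(((a >> i) & 1) & ((b >> j) & 1))

    # Stage 2: compress every column with full adders until no column
    # holds more than two bits (generic scheme, not the HPM connectivity).
    while max(len(col) for col in columns) > 2:
        reduced = [[] for _ in range(len(columns) + 1)]
        for weight, col in enumerate(columns):
            bits = list(col)
            while len(bits) >= 3:
                x, y, z = bits.pop(), bits.pop(), bits.pop()
                reduced[weight].append(x ^ y ^ z)                            # sum bit
                reduced[weight + 1].append((x & y) | (x & z) | (y & z))      # carry bit
            reduced[weight].extend(bits)                                     # pass leftovers through
        columns = reduced

    # Stage 3: the two remaining rows go to a carry-propagating adder.
    row0 = sum(col[0] << w for w, col in enumerate(columns) if len(col) > 0)
    row1 = sum(col[1] << w for w, col in enumerate(columns) if len(col) > 1)
    return row0 + row1

product = multiply_reduction_tree(13, 11)
```

Each full adder replaces three bits of one weight by a sum bit of the same weight and a carry bit of the next weight, so the total value of the matrix is preserved at every reduction step.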

Fig 3 HPM reduction tree

b. Carry select adder

Carry select adders are relatively fast compared to ripple carry adders. The schematic is shown in Fig. 4. The bits are split into a small number of blocks, and each block is computed using a ripple carry adder. For each block, the computation is done in parallel for both an input carry of 1 and an input carry of 0. Multiplexers then use the carry bits to select the correct output. The number of logic levels in this adder is around sqrt(n), where n is the number of bits.
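A bit-level software model makes the select step concrete. This is a behavioural sketch only: the 12-bit width and 4-bit block size are assumptions chosen to match the 12-bit outputs and the 4-bit blocks of Fig. 4, and the function names are hypothetical.

```python
def ripple_block(a, b, carry, size):
    """Ripple carry addition of two `size`-bit values; returns (sum, carry_out)."""
    total = 0
    for i in range(size):
        x = (a >> i) & 1
        y = (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total, carry

def carry_select_add(a, b, width=12, block=4):
    """Add two unsigned `width`-bit values block by block; returns (sum, carry_out)."""
    result = 0
    carry = 0
    for start in range(0, width, block):
        size = min(block, width - start)
        mask = (1 << size) - 1
        a_blk = (a >> start) & mask
        b_blk = (b >> start) & mask
        # In hardware both candidates are computed in parallel.
        sum0, carry0 = ripple_block(a_blk, b_blk, 0, size)
        sum1, carry1 = ripple_block(a_blk, b_blk, 1, size)
        # Multiplexer: the incoming carry selects the correct candidate.
        blk_sum, carry = (sum1, carry1) if carry else (sum0, carry0)
        result |= blk_sum << start
    return result, carry

total, carry_out = carry_select_add(0x7FF, 0x001)
```

The speed advantage comes from the fact that each block's two candidates are ready before the real carry arrives, so only the multiplexer delay is added per block.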

Fig 4 Carry select adder Architecture (each block is added twice, once with carry 1 and once with carry 0, giving a carry bit plus a 4-bit result)

5. RESULTS

Initially, the proposed system is simulated in Matlab. This step is very important to validate the algorithm structure before implementation in an FPGA or ASIC. As the description language we chose Verilog HDL, since it is easier and gives better visibility of hardware details when compared to others. Further, Verilog HDL allows different target devices, such as ASICs and FPGAs, to be used. Simulation results of the Matlab model are shown in Fig 5. The PSNR value for the images is 24 dB.
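For reference, the PSNR figure can be computed from an original and a reconstructed image with the standard definition, PSNR = 10 log10(peak^2 / MSE). The sketch below assumes 8-bit pixels (peak value 255) and flat lists of pixel values; the paper does not state which peak value or image layout its Matlab model uses, and the function name is hypothetical.

```python
import math

def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if not original or len(original) != len(reconstructed):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)

value = psnr([10, 20, 30, 40], [11, 21, 31, 41])  # every pixel is off by one
```

Higher values mean a closer reconstruction, so the metric trades off directly against the compression ratio of Eq. (4).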

Fig 5a Input images

Fig 5b Output images

The original Loeffler DCT architecture, the modified DCT architecture [12] and the modified DCT architecture with optimized operators have been implemented on the same kind of FPGA, namely a Virtex-5 xc5vlx330t. To illustrate the differences in hardware consumption, the FPGA implementation results are presented in Table 1. Table 2 compares the compression and encryption system with and without optimized operators. From this comparison we can see that the modified DCT architecture with optimized operators reduces the area consumption (slices and Look Up Tables, LUTs). Furthermore, the throughput of this compression system has increased compared to the previous works.

Table 1 Synthesis results

Characteristics                          Slice registers   Slice LUT   Fully used LUT   Throughput (MS/s)
Loeffler                                 507               1293        316              191.867
Modified DCT                             247               492         162              206.423
Modified DCT with optimized operators    23                432         0                624

An ASIC implementation of the proposed compression and encryption system has been done using Cadence tools. The RTL code has been synthesized in RTL Compiler using SAGE-Modeler TSMC 0.18 µm technology libraries at 1.8 V to get the timing, area and power results. The results are tabulated in Table 3. The backend of the design has been done in SoC Encounter, and the final chip layout is shown in Fig. 6.

Table 2 Synthesis results

Characteristics                                   Slice registers   Slice LUT   Fully used LUT   Throughput (MS/s)
Compression system without optimized operators    1536              2058        955              206.423
Compression system with optimized operators       367               1085        251              624

Table 3 Synthesis Results

Results                                       Timing (MHz)   Power (mW)   Area (µm²)
Proposed compression and encryption system    83.626         9.88         70490

Fig 6 Chip layout

6. CONCLUSION

In this paper, a new method of simultaneous compression and encryption is presented. The modified DCT algorithm is an optimized model in terms of the number of arithmetic operators. It needs only 4 multipliers and 14 adders. The arithmetic operators used in the DCT model are also optimized in order to increase the throughput and decrease the power consumption. The FPGA implementation of the whole method shows improvement in terms of throughput, area and power consumption when compared to existing methods. An ASIC implementation of the whole method, from RTL to GDSII, has been done.

REFERENCES

[1] T. H. Chen, A cost-effective 8*8 2-D IDCT core processor with folded architecture, IEEE Trans. Consum. Electron., Vol. 45, No. 2, pp. 333-339, 1999.

[2] A. Alfalou and C. Brosseau, Image Optical Compression and Encryption Methods, OSA: Advances in Optics and Photonics, Vol. 1, pp. 589-636, 2009.

[3] C. Loeffler, A. Ligtenberg and G. S. Moschytz, Practical fast 1-D DCT algorithm with 11 multiplications, IEEE, ICASSP, pp. 988-991, May 1989.

[4] P. Duhamel and H. H'mida, New 2^n DCT algorithm suitable for VLSI implementation, IEEE, ICASSP, pp. 1805-1808, November 1987.

[5] C. Y. Pai, W. E. Lynch and A. J. Al-Khalili, Low power data-dependent 8x8 DCT/IDCT for video compression, IEE Proceedings: Vision, Image and Signal Processing, Vol. 150, pp. 245-254, August 2003.

[6] S. Yu and E. E. Swartzlander Jr, DCT implementation with distributed arithmetic, IEEE Transactions on Computers, Vol. 50, No. 9, pp. 985-991, September 2001.

[7] A. Shams, A. Chidanandan, W. Pan and M. A. Bayoumi, NEDA: A low-power high-performance DCT architecture, IEEE Transactions on Signal Processing, Vol. 54, No. 3, pp. 955-964, 2006.

[8] ISO/IEC JTC1/SC2/WG8, JPEG-8-R8, JPEG technical specification, 1990.

[9] ISO/IEC JTC1/SC2/WG11, MPEG 90/176, Coding of moving picture and associated audio, 1990.

[10] ISO/IEC DIS 10918-1, Digital compression and coding of continuous-tone still image, 1992.

[11] J. F. Blinn, What's the deal with the DCT?, IEEE Computer Graphics and Applications, pp. 78-83, July 1993.

[12] Maher Jridi and Ayman Alfalou, A VLSI Implementation of a New Simultaneous Image Compression and Encryption Method, IEEE, 2010.
