b eng final year project presentation

Parallel architecture for image compressionParallel architecture for image compressionIntroduction | Algorithm | Architecture | Results | Conclusion | Q&A

A parallel architecture for image compressionA parallel architecture for image compression

(2004)(2004)

An FYP presentation by:An FYP presentation by:

Jesu Joseph, Shibu MenonJesu Joseph, Shibu Menon


IntroductionIntroduction (Jesu Joseph)

AlgorithmAlgorithm (Shibu Menon)

ArchitectureArchitecture (Shibu Menon)

ResultsResults (Jesu Joseph)

ConclusionConclusion (Jesu Joseph)


IntroductionIntroduction


Why image compression?Why image compression?

• Low color displaysLow color displays

• Real time video compression and streamingReal time video compression and streaming

• Save storageSave storage

• Applications include: Applications include:

Digital camera displaysDigital camera displays

Device to device video streamingDevice to device video streaming

Devices with limited storageDevices with limited storage


Artificial neural networksArtificial neural networks

• Inspired by the human brainInspired by the human brain

• Human learning:Human learning:

1.1. Time dependentTime dependent

2.2. Quality of brain dependentQuality of brain dependent

3.3. Complexity of input dependentComplexity of input dependent

• PC processing: PC processing:

INPUT >> SOFTWARE >> HARDWARE >> OUTPUTINPUT >> SOFTWARE >> HARDWARE >> OUTPUT


Artificial neural networksArtificial neural networks

Self learningSelf learning

Self improvementSelf improvement

Self correctionSelf correction

Ease of upgradeEase of upgrade


Our ProjectOur Project

• Use neural network techniques to design and Use neural network techniques to design and implement a stand-alone chip for image compressionimplement a stand-alone chip for image compression

• Self learningSelf learning

• Self improvementSelf improvement

• Ease of upgradeEase of upgrade



Stage 1 - Algorithm:Stage 1 - Algorithm:

• Studied the theoretical algorithmStudied the theoretical algorithm

• Optimize the algorithm for real time performanceOptimize the algorithm for real time performance

• Optimize the algorithm for ease of implementationOptimize the algorithm for ease of implementation

Stage 2 - Architecture:Stage 2 - Architecture:

• Block level architecture designBlock level architecture design

• Component level hardware designComponent level hardware design

• Hardware coding (Verilog)Hardware coding (Verilog)

• Design generationDesign generation



Stage 3 - Testing:Stage 3 - Testing:

• Simulation of individual modules with test-benchesSimulation of individual modules with test-benches

• Data verification of individual modulesData verification of individual modules

• Result verificationResult verification

Stage 4 – Implementation:Stage 4 – Implementation:

• SynthesisSynthesis

• FPGA testingFPGA testing


Basics of image compressionBasics of image compression

(R,G,B) = (20,48,206)10

= (14,30,CE)16

= (00001110, 00110000, 11001110)2

Normal format

• 24 bits per pixel

• 16,777,216 (16 million) possible colors

• 512x512 image size= 512 x 512 x 3≈ 800kB

16 Color format

• Bits per pixel in 16 color format = 4 bits

•Size of the image in 16 color format= 512 x 512 x .5≈ 130kB

• About 6:1 image size compression and bandwidth improvement


Our designOur design

00000000 (00,00,00)(00,00,00)

00010001 (10,10,A3)(10,10,A3)

00100010 (39,0A, 9D)(39,0A, 9D)

00110011 (40, 68, 90)(40, 68, 90)

…… ……

11111111 (FF, FF, FF)(FF, FF, FF)

Color chooser

Encoder

About 170 different colors

16 colors, each represented by a 4-bit number

Compressed image


Our designOur design

Input pixels

Learn the image

Create the codebook

Improve the code book

Encode the image

00000000 (00,00,00)(00,00,00)

00010001 (10,10,A3)(10,10,A3)

00100010 (39,0A, 9D)(39,0A, 9D)

00110011 (40, 68, 90)(40, 68, 90)

…… ……

11111111 (FF, FF, FF)(FF, FF, FF)

Code book

Compressed image

Decode the image


AlgorithmAlgorithm

ALGORITHMALGORITHM NORMAL ALGORITHMNORMAL ALGORITHM

STEPSSTEPS MODIFICATIONS MADEMODIFICATIONS MADE

WHY THESE WHY THESE MODIFICATIONSMODIFICATIONS

MAIN FUNCTIONSMAIN FUNCTIONS WHAT THE ALGORITHM WHAT THE ALGORITHM

DOESDOES ALGORITHM STEPSALGORITHM STEPS

8-BIT8-BIT 7-BIT7-BIT

ADVANTAGE OF 7-BIT ADVANTAGE OF 7-BIT ALGORITHMALGORITHM

ALGORITHMALGORITHM

KOHONEN ALGORITHMKOHONEN ALGORITHM

MODIFICATIONSMODIFICATIONS

7-BIT ALGORITHM7-BIT ALGORITHM

8-BIT ALGORITHM8-BIT ALGORITHM

FUNCTIONSFUNCTIONS


OVERALL ALGORITHMOVERALL ALGORITHM

Neuron 1Neuron 1 Weight 1Weight 1

Neuron 2Neuron 2 Weight2Weight2

…….... ……....

Neuron 16Neuron 16 Weight 16Weight 16

Kohonen Algorithm

Address 1Address 1 Data 1Data 1


…….... ……....


IMAGE

Learning

Encoding

EncoderCompressed

Image

Pixel by pixel encoding

LUT

+LUTDECOMPRESSION

MSB PLANE


KOHONEN ALGORITHMKOHONEN ALGORITHM

AssumedAssumed Neuron Weights (w)Neuron Weights (w)

Denotes position of Denotes position of neuron in 3D space.neuron in 3D space.

Serial presentation of Serial presentation of training vectors (x)training vectors (x)

Time dependent learning Time dependent learning rate [rate [αα(t)](t)]

Learning Count - tLearning Count - t 3-D space3-D space

Represents R,G and BRepresents R,G and B

RED

GREEN

BLUE

w1

w2

Neuron1

Neuron 2

x

Input Training Vector


STEPSSTEPS

STEP 1: Find closest Neuron (neuron c)

||X(t) – Wc(t)|| = min{||X(t)-Wi(t)||}

STEP 2: Update Weight of the winning Neuron and the Neurons in the topological neighborhood.

Wi(t+1) = Wi(t) + αα(t).{X(t) – W(t)}(t).{X(t) – W(t)}

For i For i ЄЄ N Ncc(t) (t) Neighborhood Neighborhood

Iterate STEP1 and STEP2


ALGORITHM MODIFICATIONALGORITHM MODIFICATION

Why change ? Computational Expense Hardware Complexity Efficiency – 7-bit implementation

Avoid Multiplication and recursive logic blocks

Tradeoff Time EfficiencyComplexity

Modifications discussed where relevant


STEP 2:

Each Neuron calculates its weight difference from the input vector.

Manhattan Distance Σ |Wij – X|;

i1-N(for ith neuron), j r, g or b

STEP 3:

Winning Neuron with minimum Manhattan Distance chosen.

This neuron denoted Winner.

STEP 4:

Neurons in the topological neighborhood chosen

This neuron denoted Neighbor.

MODIFICATIONS

Initialization based on Gray scale initialization (r = g = b)Manhattan Distance used instead of Euclidean Distance. It denotes the absolute distance of the neuron from input vector.Minimum Distance neuron can be chosen using binary/recursive searching

Usual Kohonen Algorithm : Function f(t) => e.g. d = d0 ( 1- t/T )

Modified : Expanding Sphere.

MODIFIED ALGORITHMMODIFIED ALGORITHM

STEP 1

Training Vector X (Xr, Xg, Xb) input to all neurons (N).

Neurons initialized with weight vectors w (gray scale initialization).

Neighbor

STEP5 :

Update Neuron weights

Wi(t+1) = Wi(t) + αα(t).{X(t) – W(t)}(t).{X(t) – W(t)}

Learning rate α Є {1/2, 1/4, 1/8, 1/16…}

Repeat for the next input vector

Stop after fixed number of iterations


7-Bit Algorithm7-Bit Algorithm


(0000_0001)

(1111_1111)

(000_0001)

7-BIT vs. 8-BIT ALGORITHM7-BIT vs. 8-BIT ALGORITHM

8-Bit Algorithm8-Bit Algorithm 7-Bit Algorithm7-Bit Algorithm

• Neuron Weight components Neuron Weight components are 8-bits each (i.e. R= 8, G= 8 are 8-bits each (i.e. R= 8, G= 8 and B =8)and B =8)•Input Vector components are Input Vector components are 7-bits each.7-bits each.•Image Reconstruction a simple Image Reconstruction a simple matter of looking up the pixel matter of looking up the pixel values from the lookup tablevalues from the lookup table•Requires storage of only the Requires storage of only the LUT.LUT.

Neuron Weight Components Neuron Weight Components are 7-bits eachare 7-bits eachInput Vector components are Input Vector components are 7-bits each (need for 7-bits each (need for conversion).conversion).Image reconstruction complex Image reconstruction complex and involves looking up the and involves looking up the MSB planeMSB planeRequires storage of the MSB Requires storage of the MSB plane and LUTplane and LUT


ARCHITECTUREARCHITECTURE

SYSTEM ARCHITECTURESYSTEM ARCHITECTURE

ARHITECTUREARHITECTURE

NETWORK STRUCTURENETWORK STRUCTURE

NEURON STRUCTURENEURON STRUCTURE GLOBAL CONTROLLERGLOBAL CONTROLLER

ARCHITECTURAL NOVELTYARCHITECTURAL NOVELTY

ALGORITHM – ARCHITECTURE TRANSLATIONALGORITHM – ARCHITECTURE TRANSLATIONNeuron 15

Neuron 13

Neuron 11

Neuron 9

Neuron 7

Neuron 5

Neuron 3

Neuron 1

Neuron 14

Neuron 12

Neuron 10

Neuron 8

Neuron 6

Neuron 4

Neuron 2

Neuron 16

GLOBAL CONTROLLERGLOBAL CONTROLLER

NEURON 1 NEURON 2

NEURON 3

NEURON 5

NEURON 7

NEURON 11

NEURON 13

NEURON 9

NEURON 15 NEURON 16

NEURON 14

NEURON 12

NEURON 10

NEURON 8

NEURON 6

NEURON 4

BROADCAST ARCHITECTURE

Central Controller broadcasts control signals to the neuron.

Neurons have the ability to take control of the global bus.

Arbitration eliminates contention for central bus.

Hardware efficient since rich interconnection would mean infeasible number of i/o pins.

Expansion of neural network simplified.

Only one feedback signal from the network to the controller.

NEURON STRUCTURE

All Arithmetic computations use the same block “ARITHMETIC_UNIT”

Variable Learning rate implemented by using Shift register.


Step 6: Winner Updating• Learning Rate for winner based on value in the frequency count register.

• Red value updated based on:

WR = WR + (T2RC * T2R α(f)) α(f) is the learning rate as a function of frequency.

•This step repeated for Green and Blue Values.

•Only neuron with Wnr flag set updates these values.

•Learning rate implemented by using a shift register

•Eg. α = 0.875 1/2+ 1/4+ 1/8

Step 2: Weights Initialization

• 2 cycles per neuron : address broadcast and then the data.

GC:

Cycle 1: Address in the DATA [8:0] with add_cyc. The particular neuron is selected.

Cycle 2: The WR1(=Wg1 = Wg2 ) value on DATA [8:0] together with MEM_ADD [2:0]. The data is read by the RAM.

Step 1: FC initialization

• Global Controller fills in “FC” for all neurons.

• Frequency counter – starts from threshold value and is decremented for the winning neuron. Neuron disabled when FC=0.

GC: ini_freq asserted with the data in DATA[8:0]

Initialization:

Add WR WG WB

0 00 00 001 08 08 082 10 10 10… … … …15 78 78 78

Step 3: Manhattan Distance calculation

• Arithmetic Unit of each neuron calculates |WR-XRi| + | WG-XGi| + |WB-XB|i| and stores it in register T2

GC:

Step 1: WR address on RAM_ADD [2:0] with mem_rd asserted.

Step 2: First input’s red value on DATA [8:0] and assert st_cal.

Step 3: When first posedge(S0) is detected, read T2R into the memory and T2RC into the flag register. Repeat Step1&2 with the green value.

Step 4: When the second posedge(S0) is detected, read T2G into the memory and T2GC into the flag register. Repeat Step 1&2 with the blue value.

Step 5: When the third posedge(S0) is detected, read T2B into the memory and T2BC into the flag register.

Step 6: When the next negedge(S0) is detected, read T2 into the memory using data_en.

Step 4: Find the minimum Manhattan distance

• Any number less than 512 can be guessed in 10 steps, by using binary searching. The “<“ or “>” relation between the guessed number and the original number is used.

• Neuron with least Distance has Wnr flag set.

GC:

Step 1: T2 value is written to MEM_OUT[8:0]

Step 2: The number 512 is broadcasted to the negative terminal of the differentiator.

Step 3: If high S1 is detected, the least Manhattan distance lies below 512.

Step 4: Steps 1 to 3 are repeated with various values (256, 128…) depending on S1 after each stage.

Step 5: At the 10th cycle, min(T2) is on DATA[8:0]. Wnr Flag is loaded. The winning neuron(s) will have Wnr flag set to 1.

Step 5: FC update and winner weight broadcast

• Frequency counter is decremented and the winning neuron takes control of the bus.

• T2 is calculated in each neuron.

GC:

Step 1: dec_freq is asserted and the winning neuron decrements its frequency counter.

Step 2: nrn_ctrl is asserted with the WR MEM_ADD[2:0]. The winning neuron broadcasts its WR value on DATA[8:0]. All neurons, including the winning neuron, put their WR value on MEM_OUT [8:0].

Step 3: GC asserts st_cal. Step 2 is repeated with WG and WB.

Step 4: The accumulated value is written to T2 using data_en, MEM_ADD[2:0] and mem_en.

Step 7: Neighbor Determination and Updating

• All neurons have T1 value stored.

• GC broadcasts Neighbor size values and neurons with T1<neighbor size fall in the neighborhood.

•These neurons have their Nbr flag set.

•Neighbor neurons are updated using

Wr = Wr + (T2RC)(T2R/16)

The Neighbor sise increases progressively

ALGORITHM TO ARCHITECTUREALGORITHM TO ARCHITECTURE

Neuronaddress

wR(7) T1(9)wG(7) wB (7) T2(9) T2R(7) T2G(7) T2B(7) T2RC(1) T2GC(1) T2BC(1) FC (18)

Distance from the winning neuron

Distance from the input pixel

Weight vectors Frequency counter

Registers used in the architecture

Step 8: Iterate for next input pixel

• Stop after fixe number of iterations.


ARCHITECTURAL NOVELTIESARCHITECTURAL NOVELTIES

Implementation of 7-bit learning:7-bit learning, mapping of pixels to an octal space and encoding the MSB plane with the image,

are new theoretical ideas that we implemented on hardware.This mode should theoretically create images with a better quality than those encoded using 8-bit

mode. This is because the neurons are more closely packed in a smaller space, hence creating a better response on the structure from each pixel.

Implementation of the 8-bit and the 7-bit Learning Algorithm:The same hardware can process an image in both 7-bit and 8-bit modes. A single push-button

switch on the FPGA board sets the mode for a cycle. This is useful because certain images give a better output on 7-bit mode than on 8-bit or vice versa and they can be compared for later studies.

This is done, keeping in mind the need for future upgrading of the functionalities. A module can be added to the design that calculates mean-square error for both 7-bit and 8-bit images and the better one can be selected.


ARCHITECTURAL NOVELTIESARCHITECTURAL NOVELTIESIntegration of encoding hardware to the learning hardware:The integration of encoding and learning hardware ensures faster compression and reduces the hardware

overhead.This is done, keeping in mind the future practical application of the hardware for real-time video

compression, rather than just for stand-alone images.

Implementation of the variable learning rate:Variable learning rate (using learning rates of 1/2,1/4,…etc.) is a novel feature of this arhitecture. This

ensures better updating of neighbors based on its distance from the winning neuron, rather than a fixed updating. The neighbors are updated based on 5 ranges of distances from the winner and the updating distances calculated through theoretical calculations.



Implementation of the learning rate depending on the frequency count:The frequency count value is calculated so that all neurons get equal chance of being the winner. At

the same time, the algorithm ensures that a neuron that has been a winner the most number of the time doesn’t get updated as much as the ones that are not as lucky. This is not seen in any other similar algorithms.

Implementation of the Neighbor updating:Neighbor-updating together with the winner-updating is another novel feature of our algorithm.

This makes the design complicated, but the output quality is considerably improved compared to other architecture.



Hardware Features :• Absolutely no redundant hardware.• All computations done using the same hardware blocks. (Shift Register and Arithmetic

Unit).• All control features divested in the Global Controller.



ResultsResults

Parallel architecture for image compressionParallel architecture for image compression

Testing strategyTesting strategy

Data verificationData verification Result verificationResult verification

Introduction | Algorithm | Architecture | Results | Conclusion | Q&A


Testing strategy – data verificationTesting strategy – data verification

Data verificationData verification

• Modelsim SE 5.5a Modelsim SE 5.5a

• 4 random input pixels, 1 loop, 4 random input pixels, 1 loop, 16 neurons, 7/8 bits16 neurons, 7/8 bits

• Signals for each module Signals for each module viewed & verifiedviewed & verified

• Major verifications:Major verifications:• Correct winner selectionCorrect winner selection

• GC state transitionGC state transition

• Correct neighbor/winner updateCorrect neighbor/winner update

• Correct number of encoded/decoded Correct number of encoded/decoded pixelspixels

ModelsimModelsim

TEST.VTEST.V

TOP.V TOP.V (Top-level synthesizable module)(Top-level synthesizable module)

NEURON_ARRAYNEURON_ARRAY

GLOBAL_CONTROLLERGLOBAL_CONTROLLER

ENCODERENCODER

TEST_INPUTTEST_INPUT



Testing strategyTesting strategy

Data verificationData verification Result verificationResult verification



Testing strategy – Result verificationTesting strategy – Result verification

TOP.V TOP.V (Top-level synthesizable module)(Top-level synthesizable module)

NEURON_ARRAYNEURON_ARRAY

GLOBAL_CONTROLLERGLOBAL_CONTROLLER

ENCODERENCODER

C ProgramC Program

Configuration fileConfiguration file

Decode.vDecode.v

Encoded_image.datEncoded_image.dat

Decoded_image.datDecoded_image.dat

Output.tiffOutput.tiff

input.tiffinput.tiff

Config.datConfig.dat

Result verificationResult verification




Result verificationResult verification


1.1. 16 neurons16 neurons 1 loop1 loop 7 bit7 bit














Mean square error

= Σ [(Ro-Rc)2+ (Go-Gc)2+(Bo-Bc)2]

2963.2441292963.244129 2780.9329032780.932903



Screen capturesScreen captures



SynthesisSynthesis


Technology LibrariesTechnology Libraries

Verilog CodeVerilog Code ConstraintsConstraintsSynthesis tool

Prototype Model

Schematic optimized net-listSchematic optimized net-list

In-signal file Out-signal file

Xilinx ISE Series 4.1iXilinx ISE Series 4.1i


SynthesisSynthesis


==================Chip top-optimized==================

Summary Information:--------------------Type: Optimized implementationSource: top, up to dateStatus: 0 errors, 0 warnings, 0 messagesExport: exported after last optimizationChip create time: 0.000000sChip optimize time: 598.734000sFSM synthesis: ONEHOT

Target Information:-------------------Vendor: XilinxFamily: VIRTEXDevice: V800HQ240Speed: -4

Chip Parameters:----------------Optimize for: SpeedOptimization effort: LowFrequency: 50 MHzIs module: NoKeep io pads: NoNumber of flip-flops: 3129Number of latches: 0


FPGA ImplementationFPGA Implementation


• Used the following:

Board – XESS Corporation XSVFPGA – XILINX VIRTEX XCV V800HQ240CPLD - XILINX XC95108 CPLDMEMORY - Winbond AS7C4096 – 2 512K x 16 bit banks

1 on-board Dip switch2 push buttons9 bar graph LEDs




Upload configuration files and image to the on-board memory

Upload the FPGA bit file to the CPLD

BAR LED 1 glows - FPGA is configured

Press Push Button 1 (START) to start the learning process

BAR LED 2 glows – 2 loops completed




BAR LED 6 glows – Encoding completed

Download the image and convert it to tiff format


ConclusionConclusion


ConclusionConclusion

1. 7-bit process better than 8-bit process1. 7-bit process better than 8-bit process


2. Suitable for real-time encoding and streaming of video images2. Suitable for real-time encoding and streaming of video images

(About 12 seconds at 5MHz)(About 12 seconds at 5MHz)

3. Use of frequency count register gives better images3. Use of frequency count register gives better images

4. More the loops, better the image (8-bit, beyond 5 loops). Similar 4. More the loops, better the image (8-bit, beyond 5 loops). Similar to human learningto human learning


RecommendationRecommendation

1. Algorithm can be modified to improve learning time1. Algorithm can be modified to improve learning time


2. Real time video compression with 2 parallel learning chips2. Real time video compression with 2 parallel learning chips

3. Both 7-bit and 8-bit in the same hardware3. Both 7-bit and 8-bit in the same hardware

4. MSB plane compression4. MSB plane compression


Q & AQ & A

b eng final year project presentation

Documents