flexible partial reconfiguration based design architecture
Post on 16-Oct-2021
6 Views
Preview:
TRANSCRIPT
Flexible Partial Reconfiguration based Design Architecture for Dataflow Computation
Mihir Shah
Advisor: Benjamin Carrion Schaefer
1
Thesis Index
2
SR. NO TITLE
1 Thesis Motivation
2 Thesis Contribution
5 Proposed Design Methodology
6 Design Implementations โ Spatial & Partial Reconfiguration based
7 Comparative Study & Analysis
8 Conclusion & Future Works
3
Thesis Motivation
DCT Quantz RLE Huffman
Inp
ut
Stre
am
Ou
tpu
t St
ream
Dataflow Computing (DC) specific model of computation
Target application described as Data Flow Graph (DFG)
Used Extensively in High Frequency Trading, Image, Signal processing based applications
JPEG Encoder as Dataflow Process
Node
Node: Processing Element (PE), Execution
kernel/accelerator/
Links : FIFO (first-in first-out) queue or buffer
4
Thesis Motivation
4
Partial Reconfiguration
Allows modification of an operating FPGA design by loading a partial configuration file,
usually a partial BIT file
Time Multiplex several Processing Elements in a dataflow computation process
๐ท๐ฌ๐
๐ท๐ฌ๐
๐ท๐ฌ๐
Container
FPGA Fabric
Reconfigurable Partition Areaโฆโฆ
Partial
Reconfiguration
Controllers
Config
Port
Matrix Multiplication
55
Thesis Motivation
5
Partial Reconfiguration
Allows modification of an operating FPGA design by loading a partial configuration file,
usually a partial BIT file
Time Multiplex several Processing Elements in a dataflow computation process
๐ท๐ฌ๐
Container
FPGA Fabric
Reconfigurable Partition Areaโฆโฆ
Partial
Reconfiguration
Controllers
Config
Port
Matrix Multiplication
๐ท๐ฌ๐
๐ท๐ฌ๐
๐ท๐ฌ๐
666
Thesis Motivation
6
Partial Reconfiguration
Allows modification of an operating FPGA design by loading a partial configuration file,
usually a partial BIT file
Time Multiplex several Processing Elements in a dataflow computation process
Container
FPGA Fabric
๐ท๐ฌ๐โฆโฆPartial
Reconfiguration
Controllers
Config
Port
Matrix Multiplication
๐ท๐ฌ๐
๐ท๐ฌ๐
๐ท๐ฌ๐
๐ท๐ฌ๐
7777
Thesis Motivation
7
Partial Reconfiguration
Allows modification of an operating FPGA design by loading a partial configuration file,
usually a partial BIT file
Time Multiplex several Processing Elements in a dataflow computation process
Container
FPGA Fabric
๐ท๐ฌ๐โฆโฆPartial
Reconfiguration
Controllers
Config
Port
Matrix Multiplication
๐ท๐ฌ๐
๐ท๐ฌ๐
๐ท๐ฌ๐
88
Thesis Motivation
๐ท๐ฌ๐ ๐ท๐ฌ๐ ๐ท๐ฌ๐ ๐ท๐ฌ๐โฆโฆโฆโฆโฆ. Spatially placed FPGA Design
Problem ? Area Utilization HighDataflow Process on FPGA Fabric
๐ท๐ฌ๐
FPGA Fabric
Reconfigurable Partition Area ARM
Processor
DDR3 Memory
๐ท๐ฌ๐
๐ท๐ฌ๐
โฆโฆ
Container
PR based design using External Off chip DDR
memory as FIFOSolves: Area Utilization Issue
Problem ? Runtime & Latency Hit !!
PRDDR design method
Static design method
99
Thesis Motivation
๐ท๐ฌ๐
FPGA Fabric
Reconfigurable Partition Area
๐ท๐ฌ๐
๐ท๐ฌ๐
โฆโฆBRAM Memory
THUS WE PROPOSE - PR based design using Internal On chip BRAM memory as FIFO for dataflow process
Improves Runtime & Latency compared to PRDDR
with reduced FPGA resource savings
PRBRAM design method
Container
1. Semi-automatic design methodology for dataflow computation
Fixed overlay Static architecture - PRBRAM
Support for partial re-configurability
Input is behavior description language for HLS
2. Prototyped on Xilinx Zynq FPGA
JPEG Encoder given in SystemC
Three Testcase images
3. Extensive experimental results
Measure hardware running time vs. area characteristics of static and PR-based methods.
Comparative study between PRBRAM and PRDDR methods
Hardware Running Time vs. Varying Size of Pblock or Reconfigurable Partitions
10
Thesis Contribution
11
JPEG Encoder: A Dataflow Process for Comparative Study
11
DPCM
RLC
Entropy Coding
HeaderTables
Data
CodingTables
QuantโฆTables
DCTpixel(i, j)
8 x 8
DCT(i, j)
8 x 8
QuantizationDCTq(i, j)
Original_Image.bmp 512 X 512 pixels
Size: 258 KB
Compressed_Image.jpg 512 X 512 pixels
Size: 36 KBCompression Ratio : 7.17:1
Run-Length Zig-Zag Scan
DC
AC
JPEG File Format
PE1 : DCT
PE2: Quantization
PE3: RunLengthEncoding
PE4: Huffman Encoding
12
Using Xilinx SDKApp.cpp +
๐บ๐๐๐๐๐๐๐๐๐๐.bit = BOOT. bin
Overview of the Proposed Design Methodology
12
PRBRAM Static Architecture
Stage 1: Behavioral Algorithm Description to RTL Generation
1313
Using Xilinx SDKApp.cpp +
๐บ๐๐๐๐๐๐๐๐๐๐.bit = BOOT. bin
13
PRBRAM Static Architecture
Stage 1: Behavioral Algorithm Description to RTL Generation
Stage 2: Validation and Creation of Custom IPs
Overview of the Proposed Design Methodology
141414
Using Xilinx SDKApp.cpp +
๐บ๐๐๐๐๐๐๐๐๐๐.bit = BOOT. bin
14
PRBRAM Static Architecture
Stage 1: Behavioral Algorithm Description to RTL Generation
Stage 2: Validation and Creation of Custom IPs
Stage 3: TCL Automated Floorplan for PR Designs
Overview of the Proposed Design Methodology
15151515
Using Xilinx SDKApp.cpp +
๐บ๐๐๐๐๐๐๐๐๐๐.bit = BOOT. bin
15
PRBRAM Static Architecture
Stage 1: Behavioral Algorithm Description to RTL Generation
Stage 2: Validation and Creation of Custom IPs
Stage 3: TCL Automated Floorplan for PR Designs
Stage 4: Deploying the Binaries on Zynq-7000: PRBRAM Static
Architecture
Overview of the Proposed Design Methodology
16
Stage 1: Behavioral Description of Dataflow to RTL Generation
Key-points when describing dataflow application using BDL:
Uniformity in the number, direction and data-widths of I/O all the PEโs
Control interface signals - done, reset and start : Close Loop Feedback when Context Switching
16
๐ท๐ฌ๐. ๐ ๐ท๐ฌ๐. ๐ ๐ท๐ฌ๐. ๐ ๐ท๐ฌ๐. ๐โฆโฆโฆโฆโฆ.
Behavioral Description of Dataflow Algorithm using SystemC/C++/C
High Level Synthesis
RTL GENERATION
Raw Input Data
Processed Output Data
โฆโฆโฆโฆโฆ.
โฆโฆโฆโฆโฆ.๐ท๐ฌ๐. ๐, ๐ท๐ฌ๐. ๐, ๐ท๐ฌ๐. ๐, ๐ท๐ฌ๐. ๐
Vivado HLS (Xilinx),
Stratus (Cadence),
CyberWorkBench(NEC),
Catapult(Mentor)
17
Stage 2: Validation and Creation of Custom IPs
17
Logical Synthesis & Simulation
Packaging Design using Vivado IP Packager
Creating System Level Design with Xilinx IP
Integrator
Validating the IP design using Xilinx SDK
Structural RTL to
Xilinx Primitives
Writing Test
Benches
Ensure Target
Platform Function
& Timing Violations
Step.1 Step.2 Step.3 Step.4
Design Re-usability
Xilinx Supports
AMBA AXI
I/O instance port-
map to internal
slave registers
Helps to Write
Software Code for
the IP
Compare results
with Simulation
Block Design with
ZYNQ7 PS
Create Top-Level
Wrapper
Export to SDK to
create BSP &
drivers
18
Stage 2: Validation and Creation of Custom IPs
18
PE0.v, PE1.v . . . . PEk.v PE0 PE1 PEk
Post Logical Synthesis
Functional Simulation
Match
?
. . . .
Hardware Results
Integrating IP to ARM
Dedicated IPsRTL
Goal: Ensures correct memory maps in IPs
Validates Control Signals & Functionality of the Module before integrating into tighter PR Flow
19
Stage 3: TCL Automated Floorplan for PR Designs
19
1. Synthesized Static & RMs separately
2. Create physical constraints to create
pblock RP
3. Set property HD.RECONFGIURABLE
on RP
4. Implement static + one RM โ complete
design
5. Save the DCP after routing
6. Remove the RM from RP and save
static DCP
7. Lock static placement & routing
8. Add New RM to the static design &
Save the DCP
All RMs covered ?
No
9. Run pr_verifyYes
10. Create bitstreams for each configuration
1) fplan.xdc
2) PEk.v
3) Static.v
1) partial binaries (.bin)
2) blanking.bit
**Processing Element (PE) will be referred as, Reconfigurable Module(RM) in PR Designs
20
Stage 4: Deploying the Binaries on Zynq-7000
20
BOOT.bin = first stage image for PL side + User application
software
After power-on reset, the Boot ROM determines the boot mode (SD flash memory) and the encryption status (non-secure)
Load First Stage Boot Loader (FSBL) into on-chip
RAM (OCM).
Releases CPU control to the FSBL which in turn
configures the PL with the full Staticblank.bit via
PCAP (Processor Control Access Port)
SD Card
2121
Stage 4: Deploying the Binaries on Zynq-7000
21
Partial bitstreams are loaded into DDR memory from SD
card to maximize throughput during configuration
The Raw Image data which is the input to the dataflow
process in JPEG Encoder is also transferred to the DDR
Memory
After this step, the Reconfigurable Modules can
be loaded into the Reconfigurable Partition to start
computation.
2222
Stage 4: Deploying the Binaries on Zynq-7000
Load the BRAM Memory with Raw Data from DDR
Mux Sel=0
232323
Stage 4: Deploying the Binaries on Zynq-7000
PE0.bin
PCAP fetches partial bitfile (PE0.bin) from ddr3 memory to load into configuration port
24242424
Stage 4: Deploying the Binaries on Zynq-7000
PE0.bin
Read the Raw Image data from BRAM Memory and Input it to PE0.bin
Start/Done BRAM Read Sel Mux=1
2525252525
Stage 4: Deploying the Binaries on Zynq-7000
PE0.bin
Enable โstart computationโ signal from ARM for PE0.bin
262626262626
Stage 4: Deploying the Binaries on Zynq-7000
PE0.bin
Read โdone computationโ signal to ARM for PE0.bin
27272727272727
Stage 4: Deploying the Binaries on Zynq-7000
PE0.bin
Write the output generated by PE0.bin to BRAM memory
Start/Done BRAM Write
282828
Stage 4: Deploying the Binaries on Zynq-7000
PEk.bin
The process to Load RM, Read BRAM, Compute & Write BRAM continues till the terminal PE โฆ Finally, PCAP fetches partial bitfile (PEk.bin) from ddr3 memory to load into configuration port
292929
Stage 4: Deploying the Binaries on Zynq-7000
Load the Results generated by PEk.bin to SD Card from BRAM Memory
Mux Sel=0
30
JPEG Encoder PRBRAM Design Implementation
Utilizing 91.42 % of BRAM memory to store intermediate results
Total memory requirements (1048.576 Kbytes) > Available BRAM memory (140 blocks X 36Kb = 630 Kbytes)
Minimum number of partial reconfigurations = 8 (4 (Reconfigurable Modules) * 2(Dividing factor))
30
2048 blocks
4096 8 X 8 pixel
blocks
Divided the image dataset into 2 & Reload BRAM
2048 blocks
3131
PRBRAM Design Implementation โ Experimental Result
32
๐ป๐๐๐๐๐๐๐ = ๐ป๐๐๐๐โ๐๐๐๐๐๐๐๐๐ + ๐ป๐๐๐๐๐๐๐๐ + ๐ป๐๐๐ โ ๐ต๐๐๐
Tjpeg-computing: actual computing time it takes for processing all the inputs of each reconfigurable module
Tbin : time it takes to partially configure the bitstream
Nbin : number of times reconfiguration occurs
Toverhead : time it takes to load the partial binaries and raw image data from SD card to DDR memory
The values Tbin = 0.1975 s, Tjpeg-computing = 2.994 s and Toverhead = 1.675 s are obtained experimentally
PRBRAM Design Implementation โ Experimental Results II
Cases of no. of reconfigurations (1) 32 (2) 64
(3) 128 (4) 256 (5) 512
32
33
RTBRAM values for varying RPBitsize
PRBRAM Design Implementation โ Experimental Results III
33
The size of the pblock or reconfigurable partition affects Tbin
There is a linear relationship
The Hardware running time for reconfigurable architectures is significantly impacted by this.
34
PRBRAM Design Implementation โ Experimental Results IV
[Case 1: RPBitsize = 1598.896 KB, Case 2: RPBitsize = 1306.272 KB, Case 3: RPBitsize = 786.664 KB]
34
4.57
3.86706
3.30284
2
2.5
3
3.5
4
4.5
5
0 1 2 3 4
FPG
A_R
UN
TIM
E (s
)
CASE
FPGA_RUNTIME vs RP_BITSIZE :No. of Reconfiguration = 8
9.305 7.71959
5.62322
4
5
6
7
8
9
10
0 1 2 3 4
FPG
A_R
UN
TIM
E (s
)
CASE
FPGA_RUNTIME vs RP_BITSIZE:No. of Reconfiguration = 32
28.167
23.13036
14.9039
0
5
10
15
20
25
30
0 1 2 3 4
FPG
A_R
UN
TIM
E (s
)
CASE
FPGA_RUNTIME vs RP_BITSIZE:No. of Reconfiguration = 128
103.619
83.11808
52.02699
0
20
40
60
80
100
120
0 1 2 3 4
FPG
A_R
UN
TIM
E (s
)
CASE
FPGA_RUNTIME vs RP_BITSIZE:No. of Reconfiguration = 512
35
Objective:
Prove area utilization efficacy of PRBRAM
Objective:
Prove PRBRAM is runtime and latency efficient compared to PRDDR
JPEG Encoder PRDDR MethodJPEG Encoder Spatial Method
35
DCT
Quantization
IP containing
Reconfigurable
Partition
36
Calculating Area Utilization
๐ด๐ก๐๐ก๐๐ = ๐ด๐ ๐ก๐๐ก๐๐ + ๐ด๐๐ฆ๐๐๐๐๐ ๐๐๐ฅ ๐๐ธ0, ๐๐ธ1, ๐๐ธ2, โฆ . ๐๐ธ๐
36
๐ด๐ก๐๐ก๐๐ = ๐๐ธ0, ๐๐ธ1, ๐๐ธ2, โฆ . ๐๐ธ๐
PR Based Designs
Static Designs
*PE: Process Element
RunLength Encoding PE
Astatic of PRBRAM is high due to additional IPs in the overlay architecture
Max. BRAM Utilization
24759 (Spatial)
18985 (PR_BRAM)
16442 (PR_DDR)
14000
15000
16000
17000
18000
19000
20000
21000
22000
23000
24000
25000
26000
1.5 1.6 1.7 1.8 1.9 2 2.1 2.2 2.3
AR
EA_F
F
FPGA_RUNTIME (s)
FPGA_RUNTIME vs AREA_FF
37
I. COMPARATIVE STUDY โ AREA UTILIZATION vs FPGA RUNTIME
37
14997 (Spatial)
12373 (PR_BRAM)
10811 (PR_DDR)
8000
9000
10000
11000
12000
13000
14000
15000
16000
1.2 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8 3 3.2 3.4 3.6 3.8 4 4.2 4.4 4.6 4.8 5 5.2 5.4 5.6
AR
EA_L
UT
FPGA_RUNTIME (s)
FPGA_RUNTIME vs AREA_LUT
PRBRAM design method requires slightly more area compared to PRDDR due additional IPs - ARM-FPGA
Control Bus, ARM-Side BRAM Control, MUX and Block RAM Memory modules.
Comparing spatial design implementation, the utilization of PRBRAM is significantly low, which is as expected.
3838
1.53
1.163
0.947 0.88 0.832
2.0291.854
1.767 1.722 1.701
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
2.25
0 16 32 48 64 80 96 112 128 144 160 176 192 208 224 240 256 272
LATE
NC
Y (
s)
NUMBER_RECONFIGURATIONS
LATENCY COMPARISON
LATENCY_BRAM LATENCY_DDR
7.80710.96317.249
29.824
54.975
10.19416.915
30.344
57.211
110.937
05
101520253035404550556065707580859095
100105110115120
0 16 32 48 64 80 96 112 128 144 160 176 192 208 224 240 256 272
FPG
A_R
UN
TIM
E (s
)
NUMBER_RECONFIGURATIONS
FPGA_RUNTIME COMPARISON
RUNTIME_BRAM RUNTIME_DDR
Experiment with unequal RPBitsize
RPBitsize = 3416.088 KB for PRDDR
RPBitsize = 1598.896 KB for PRBRAM
Non-Linear Relationship in Runtime between the
graphs due to Tbin * Nbin not constant
II. COMPARATIVE STUDY โ RUNNING TIME vs LATENCY
39
PRBRAM & PRDDR Static.dcp Floorplan View โ Equal Pblock Sizes
3939
Partition Pins
BRAM Blocks
Higher utilization of
logic elements due to
additional IPs.
4040
II. COMPARATIVE STUDY โ RUNNING TIME vs LATENCY
Experiment with equal RPBitsize = 1306.272 KB for both PR implementations.
Average improvement in runtime is 0.529s
Runtime varies linearly because Nbin * Tbin is constant
40
1.106429
0.874237
0.758173 0.700145
1.322754
0.982405
0.812247
0.727149
0.650.7
0.750.8
0.850.9
0.951
1.051.1
1.151.2
1.251.3
1.351.4
0 10 20 30 40 50 60 70
LATE
NC
Y (
s)
NUMBER_RECONFIGURATIONS
LATENCY COMPARISON
LATENCY_BRAM LATENCY_DDR
2.2128593.496951
6.065391
11.20233
2.685509
3.969622
6.537978
11.67439
22.5
33.5
44.5
55.5
66.5
77.5
88.5
99.510
10.511
11.512
0 8 16 24 32 40 48 56 64 72
FPG
A_R
UN
TIM
E (s
)
NUMBER_RECONFIGURATIONS
FPGA_RUNTIME COMPARISON
RUNTIME_BRAM RUNTIME_DDR
41
CONCLUSION
41
Improvement in average hardware running (PRBRAM ) of 0.529s vs. PRDDR
Implementation with the proposed Architecture PRBRAM is area efficient compared to spatial
implementation with
LUT area savings up to 21.20 % & FF area savings up to 30.41 % for 1306.272 KB
These %โs are including the additional resources utilized by proposed static architecture
Implemented JPEG Encoder on Zynq -7000
For three testcase images-Lena, Peppers and Goldhill
Using Spatial, PRDDR & PRBRAM for Comparative Study & Analysis
Novel design methodology for dataflow computation with proposed PRBRAM overlay static architecture
Including TCL based automated floorplanning + User software application algorithms
42
FUTURE WORKS
Sophisticated Partial Reconfiguration Controllers
Minimize the time required for reconfiguring
Enhanced parallelism of operations in hardware accelerators/Processing Elements due
to saved resources in reconfigurable architecture.
In extremely data-intensive applications, exploring performance impact on PRBRAM
BRAM + Distributed RAM to deal with limitations of on-chip memory
42
43
THANK YOU
43
44
APPENDIX
45
Experimental Results โ Spatial, PRBRAM & PRDDR
Table IV: JPEG Encoder Results for lena.bmp with PRDDR and PRBRAM Design Implementation
Table III: JPEG Encoder Results with Spatial Design Implementation
45
4646
Partial Reconfiguration โ Full vs Partial Bitstream
4646
Figure.2 Configuration Process & Contents of (a) Full and
(b) Partial bitstreams
(a) (b)
46
4747
Partial Reconfiguration โ Method to configure Partial Bitstreams
Internal Configuration Access Port (ICAP) :
User configuration solutions
Requires ICAP controller + Logic to drive the ICAP interface
JTAG Port :
Quick Testing or Debug
Driven using iMPACT or ChipScope Analyzer
Processor Configuration Port (PCAP) :
Configuration mechanism for all Zynq-7000 designs.
Figure.3 (a) ICAP (b) PCAP (c) JTAG
(a) (b) (c)
47
48
JPEG Encoder Hardware Accelerators : DCT & Quantization
Discrete Cosine Transform (DCT):
Converts spatial domain to frequency domain
๐ซ๐ช๐ป ๐, ๐ =๐
๐๐ช ๐ ๐ช ๐
๐=๐
๐
๐=๐
๐
๐๐๐๐๐ ๐, ๐ ๐๐จ๐ฌ๐๐ + ๐ ๐๐ซ
๐๐๐๐จ๐ฌ
๐๐ + ๐ ๐๐ซ
๐๐๐ ,๐๐ก๐๐ซ๐, ๐ช ๐ =
๐
๐๐๐ ๐ = ๐ & ๐ช ๐ = ๐ ๐๐๐๐๐๐๐๐๐
Quantization:
Dividing transformed image DCT matrix by quantization matrix used and rounding off
Aims at reducing most of the less important high frequency DCT coefficients to zero
๐ซ๐ช๐ป๐ธ ๐, ๐ = ๐น๐๐๐๐ ๐ซ๐ช๐ป(๐, ๐)
๐ธ(๐, ๐)(๐)
DCT Quantz RLE Huffman
Inp
ut
Stre
am
Ou
tpu
t St
ream
48
49
JPEG Encoder Hardware Accelerators : RunLength Encoding
DCT Quantz RLE Huffman
Inp
ut
Stre
am
Ou
tpu
t St
ream
Zig Zag Scan
Encoding the difference between current and the previous 8 X 8 block
Encodes a series of zeros as a (skip, value) pair
Differential Pulse Code Modulation (DPCM)
RunLength on AC Components (RLC)
49
50
JPEG Encoder Hardware Accelerators : Entropy Coding
DCT Quantz RLE Huffman
Inp
ut
Stre
am
Ou
tpu
t St
ream
SIZE Value Code
0 0 ---
1 -1,1 0,1
2 -3, -2, 2,3 00,01,10,11
3 -7,โฆ, -4, 4,โฆ, 7 000,โฆ, 011, 100,โฆ111
4 -15,โฆ, -8, 8,โฆ, 15
0000,โฆ, 0111, 1000,โฆ, 1111
. .
. .
11 -2047,โฆ, -1024, 1024,โฆ 2047
โฆ
DC Components
DC components are differentially coded as
(SIZE, Value)
Code for a Value is derived from the
Size_and_Value Table (Table.1)
Code for a SIZE is derived from Table.2
Example: If a DC component is 40 and the previous DC component is 48. The difference is -8. Huffman coded as: 1010111
0111: The value for representing โ8
(Size_and_Value table)
101: The size from the same table reads
4, which corresponds to 101 from Table.2
Table.1 Size_and_Value
SIZE Code
Length
Huffman Code
0 2 00
1 3 010
2 3 011
3 3 100
4 3 101
5 3 110
6 4 1110
7 5 11110
8 6 111110
9 7 1111110
10 8 11111110
11 9 111111110
Table.2 Huffman Table for DC component SIZE field
50
5151
JPEG Encoder Hardware Accelerators : Entropy Coding
DCT Quantz RLE Huffman
Inp
ut
Stre
am
Ou
tpu
t St
ream
AC Components: Coded as (S1,S2 pairs)
S1: (RunLength/SIZE) , where RunLength : length of the
consecutive zero values [0..15] & SIZE : No. of bits needed to code
the next nonzero AC componentโs value
S2: (Value), where Value is the value of the AC component from
Table.1
Zig-Zag order -> 12,10, 1, -7 2 0s, -4, 56 zeros
12: read as zero 0s,12: (0/4)12 10111100
1011: The code for (0/4 from Table.3)
1100: The code for 12 from the Table.1
56 0s: (0,0) 1010 (Rest of the components are zeros therefore we simply put the EOB to signify this fact)
Run/
SIZE
Code
Length
Code
0/0 4 1010
0/1 2 00
0/2 2 01
0/3 3 100
0/4 4 1011
0/5 5 11010
0/6 7 1111000
0/7 8 11111000
0/8 10 1111110110
0/9 16 1111111110000010
0/A 16 1111111110000011
Run/
SIZE
Code
Length
Code
1/1 4 1100
1/2 5 11011
1/3 7 1111001
1/4 9 111110110
1/5 11 11111110110
1/6 16 1111111110000100
1/7 16 1111111110000101
1/8 16 1111111110000110
1/9 16 1111111110000111
1/A 16 1111111110001000
โฆ 15/A More Such rows
Table.3 Huffman Table for AC component SIZE field
40 12 0 0 0 0 0 010 -7 -4 0 0 0 0 01 0 0 0 0 0 0 00 0 0 0 0 0 0 00 0 0 0 0 0 0 00 0 0 0 0 0 0 00 0 0 0 0 0 0 00 0 0 0 0 0 0 0
Figure.9 Example of 8 X 8 block after quantization
51
52
PRDDR Design Implementation โ Floorplan View Post P & R
Figure.24 Design Checkpoints after performing Place & Route (a) Staticddr.dcp (b) DCTddr.dcp (c) Quantizationddr.dcp (d) RLEddr.dcp (e) Huffmanddr.dcp for RPBitsize = 1306.272 KB case
52
53
PRDDR Design Implementation โ Experimental Results II
Table.13 JPEG Encoder Results for lena.bmp with PRDDR Design Implementation
Additional experiment results were tabulated testcase images of peppers.bmp & goldhill.bmp:
Table.14 peppers.bmp Table.15 goldhill.bmp
53
54
SPATIAL Design Implementation โ Experimental Results
Table.11 JPEG Encoder results with Spatial Design Implementation
Figure.22 SSIM Maps (a) Lena (b) Goldhill (c) Peppers
(a) (b) (c)
54
55
SPATIAL Design Implementation - System Implementation and Setup
Using the Vivado IP Integrator, custom IPs of JPEG Encoder are connected
using AXI4 spatially.
Clock frequency : 50 MHz
Additional Hardware Resources :
SD Card and DDR3 connected to the external interfaces on the
Processing Side (PS) of the Zynq FPGA for storing data
Figure.21 Floorplan View
Table.10 Utilization Report
55
56
PRDDR Design Implementation - System Implementation and Setup
2 Micron DDR3 128 Megabit x 16 memory components
creating a 32-bit interface, totaling 512 MB.
The DDR3 is connected to the hard memory controller in
the Processor Subsystem (PS).
DDR3 memory is referenced using pointers in user software
application
address mappings provided in the systems.hdf file
In JPEG Encoder Implemented,
Max. No of I/O: RLE RM block &
Max. data-widths of I/O : Huffman Encoding RM block
Figure.23 PRDDR design methodology Block Diagram for dataflow computation
56
Table.12 Utilization Report Post P & R
5757
Partial Reconfiguration - Ideology & Benefits
Reduced Resource and Power Consumption:
Integrating the design into a lower FPGA IC count.
Power savings due to Reduction in off-chip communication.
Performance Improvements and Flexibility:
Computation capacity of the system adapted at run time
Additional resources for speeding up the operation of the kernel
More number of kernels to perform the operation in parallel.
Improved Fault Tolerance and Dependability:
Safety critical systems - aerospace & defense industries.
Self Adapting Hardware Designs:
Adapt to changing operating and environmental conditions based on AI &
learning.
Figure.1 Basic Ideology of PR
Partial Reconfiguration DefinitionโAllows the modification of an operating FPGA
by downloading Bitfile
58
PRBRAM Design Implementation โ Experimental Results I
Table.7 JPEG Encoder Results for lena.bmp with PRBRAM Design Implementation
Table.8 JPEG Encoder Results for goldhill.bmp and peppers.bmp with PRBRAM Design Implementation
Figure.16 PRBRAM results for lena.bmp testcase with RPBitsize = 1306.272 KB
58
Runtime and Latency have inverse relationship
Latency and Throughput have linear relationship
top related