A Data-Driven Approach for Pipelining Sequences of Data-Dependent Loops
João M. P. Cardoso
Portugal
ITIV, University of Karlsruhe, July 2, 2007

Motivation
Many applications have sequences of tasks
• E.g., in image and video processing algorithms
Contemporary FPGAs
• Plenty of room to accommodate highly specialized complex architectures
• Time to creatively “use available resources” rather than simply “save resources”

Motivation
Computing Stages
• Sequentially
[Figure: Task A, Task B, and Task C executed one after another along the time axis]
• Concurrently
[Figure: Task A, Task B, and Task C overlapped along the time axis]

Outline
Objective
Loop Pipelining
Producer/Consumer Computing Stages
Pipelining Sequences of Loops
Inter-Stage Communication
Experimental Setup and Results
Related Work
Conclusions
Future Work

Objectives
To speed up applications with multiple data-dependent stages
• Each stage is seen as a set of nested loops
How?
• Pipelining those sequences of data-dependent stages using fine-grain synchronization schemes
• Taking advantage of field-custom computing structures (FPGAs)

Loop Pipelining
Attempts to overlap loop iterations
Significant speedups are achieved
But how to pipeline sequences of loops?
[Figure: iterations I1 to I4 executed back to back versus overlapped along the time axis]
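
As a rough illustration of why overlapping iterations pays off, the sketch below counts cycles for a loop executed iteration by iteration and for the same loop pipelined. The parameter names (`depth` for the latency of one iteration, `ii` for the interval between the starts of consecutive iterations) are assumptions made for this example and do not come from the slides.

```c
#include <stdint.h>

/* Cycles to run n iterations back to back, each taking `depth` cycles. */
uint64_t sequential_cycles(uint64_t n, uint64_t depth) {
    return n * depth;
}

/* Cycles when a new iteration starts every `ii` cycles:
   the first iteration takes `depth` cycles, each later one adds `ii`. */
uint64_t pipelined_cycles(uint64_t n, uint64_t depth, uint64_t ii) {
    if (n == 0) {
        return 0;
    }
    return depth + (n - 1) * ii;
}
```

With four iterations of four cycles each, `sequential_cycles(4, 4)` gives 16 cycles, while `pipelined_cycles(4, 4, 1)` gives 7.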

Computing Stages
Sequentially
Producer: A[0], A[1], A[2], ...
Consumer: A[0], A[1], A[2], ...

Computing Stages
Concurrently
• Ordered producer/consumer pairs
• Send/receive
Producer: A[0], A[1], A[2], ...
Consumer: A[0], A[1], A[2], ...
[Figure: FIFO with N stages between producer and consumer, holding A[0], A[1], A[2], A[3], ...]
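
For the ordered case, a software analogue of the send/receive FIFO could look like the sketch below. It is only an illustration: the names `fifo_t`, `fifo_send`, `fifo_receive`, and the depth `FIFO_STAGES` are hypothetical, and a hardware FIFO would stall the producer or consumer where this code returns `false`.

```c
#include <stdbool.h>
#include <stddef.h>

#define FIFO_STAGES 4  /* hypothetical N, chosen only for the example */

typedef struct {
    int    slot[FIFO_STAGES];
    size_t head;   /* index of the next element to receive */
    size_t count;  /* number of occupied stages */
} fifo_t;

void fifo_init(fifo_t *f) {
    f->head = 0;
    f->count = 0;
}

/* Producer side: returns false when the FIFO is full (producer must wait). */
bool fifo_send(fifo_t *f, int value) {
    if (f->count == FIFO_STAGES) {
        return false;
    }
    f->slot[(f->head + f->count) % FIFO_STAGES] = value;
    f->count++;
    return true;
}

/* Consumer side: returns false when the FIFO is empty (consumer must wait). */
bool fifo_receive(fifo_t *f, int *value) {
    if (f->count == 0) {
        return false;
    }
    *value = f->slot[f->head];
    f->head = (f->head + 1) % FIFO_STAGES;
    f->count--;
    return true;
}
```

Values come out in exactly the order they went in, which is why this structure only fits producer/consumer pairs that agree on the ordering.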

Computing Stages
Concurrently
• Unordered producer/consumer pairs
• Empty/full table
Producer: A[1], A[5], A[3], ...
Consumer: A[3], A[1], A[5], ...
[Figure: empty/full bit column next to the data column; entries 1 and 5 are marked full and hold A[1] and A[5], all other entries are empty]
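
The empty/full table for unordered pairs can be sketched in software as a data array with one flag per element. The names `ef_table_t`, `ef_write`, `ef_read`, and the size `TABLE_SIZE` are hypothetical; the flag is not cleared on a read here, mirroring the single-read case where each element is consumed once.

```c
#include <stdbool.h>

#define TABLE_SIZE 8  /* hypothetical number of elements */

typedef struct {
    int  data[TABLE_SIZE];
    bool full[TABLE_SIZE];  /* one empty/full bit per data element */
} ef_table_t;

void ef_init(ef_table_t *t) {
    for (unsigned i = 0; i < TABLE_SIZE; i++) {
        t->full[i] = false;
    }
}

/* Producer: store the value, then flag the element as available. */
void ef_write(ef_table_t *t, unsigned idx, int value) {
    t->data[idx] = value;
    t->full[idx] = true;
}

/* Consumer: succeeds only if the element was already produced. */
bool ef_read(const ef_table_t *t, unsigned idx, int *value) {
    if (!t->full[idx]) {
        return false;  /* not yet produced: consumer must retry */
    }
    *value = t->data[idx];
    return true;
}
```

With the orders shown on the slide, a consumer asking for A[3] first keeps retrying until the producer, which writes A[1] and A[5] first, finally writes A[3].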

Main Idea
FDCT
[Figure: Loops 1 and 2 followed by Loop 3, controlled by a global FSM, with data input, intermediate data, and data output; the intermediate data array is an 8x8 block (columns 0 to 7, rows starting at 0, 8, 16, 24, 32, 40, 48, 56); on the time axis, execution of Loops 1, 2 is followed by execution of Loop 3]

Main Idea
FDCT
• Out-of-order producer/consumer pairs
• How to overlap computing stages?
[Figure: two views of the 8x8 intermediate data array (columns 0 to 7, rows starting at 0, 8, 16, 24, 32, 40, 48, 56), one for the producer access order and one for the consumer access order]

Main Idea
Pipelined FDCT
[Figure: Loops 1 and 2 (FSM 1) and Loop 3 (FSM 2) connected through the intermediate data (dual-port RAM) and a dual-port 1-bit table (empty/full), with data input and data output; the time axis shows execution of Loops 1, 2 and execution of Loop 3; the intermediate data array is the 8x8 block with columns 0 to 7 and rows starting at 0, 8, 16, 24, 32, 40, 48, 56]

Main Idea
[Figure: Task A and Task B connected through three memory blocks]

Possible Scenarios
Single write, single read
• Accepted without code changes
Single write, multiple reads
• Accepted without code changes (by using an N-bit table)
Multiple writes, single read
• Need code transformations
Multiple writes, multiple reads
• Need code transformations

Inter-Stage Communication
Responsible for:
• Communicating data between pipelined stages
• Flagging data availability
Solutions
• Perfect associative memory
  • Cost too high
• Memory for data plus 1-bit table (each cell represents full/empty information)
  • Size of the data set to communicate
• Decrease size using a hash-based solution
[Figure: empty/full bit column next to the data column; entries 1 and 5 are marked full and hold A[1] and A[5]]

Inter-Stage Communication
Memory plus 1-bit table

Producer (Loops 1 and 2):

    …
    boolean tab[SIZE] = {0, 0, …, 0};
    …
    for (i = 0; i < num_fdcts; i++) {        // Loop 1
        for (j = 0; j < N; j++) {            // Loop 2
            // loads
            // computations
            // stores: write the value, then flag it as full
            tmp[48+i_1] = F6 >> 13; tab[48+i_1] = true;
            tmp[56+i_1] = F7 >> 13; tab[56+i_1] = true;
            i_1++;
        }
        i_1 += 56;
    }

Consumer (Loop 3):

    i_1 = 0;
    for (i = 0; i < N*num_fdcts; i++) {      // Loop 3
        // each load repeats until its empty/full bit is set
        L1: f0 = tmp[i_1];   if (!tab[i_1])   goto L1;
        L2: f1 = tmp[1+i_1]; if (!tab[1+i_1]) goto L2;
        // remaining loads
        // computations …
        // stores
        i_1 += 8;
    }

[Figure: img feeds Loop 1 and Loop 2 (FSM 1), which write the dual-port memory tmp and the dual-port 1-bit table tab; Loop 3 (FSM 2) reads both and writes dct_o; data connections and address connections are drawn separately]

Inter-Stage Communication
Hash-based solution:

Producer (Loops 1 and 2):

    …
    boolean tab[SIZE] = {0, 0, …, 0};
    …
    for (i = 0; i < num_fdcts; i++) {        // Loop 1
        for (j = 0; j < N; j++) {            // Loop 2
            // loads
            // computations
            // stores: every address goes through the hash function H
            tmp[H(48+i_1)] = F6 >> 13; tab[H(48+i_1)] = true;
            tmp[H(56+i_1)] = F7 >> 13; tab[H(56+i_1)] = true;
            i_1++;
        }
        i_1 += 56;
    }

Consumer (Loop 3):

    i_1 = 0;
    for (i = 0; i < N*num_fdcts; i++) {      // Loop 3
        // each load repeats until its empty/full bit is set
        L1: f0 = tmp[H(i_1)];   if (!tab[H(i_1)])   goto L1;
        L2: f1 = tmp[H(1+i_1)]; if (!tab[H(1+i_1)]) goto L2;
        // remaining loads
        // computations …
        // stores
        i_1 += 8;
    }

[Figure: img feeds Loop 1 and Loop 2 (FSM 1), which write the dual-port memory tmp and the empty/full table tab; Loop 3 (FSM 2) reads both and writes dct_o; each address connection passes through an H block]

Inter-Stage Communication
Hash-based solution
• We did not want to include additional delays in the load/store operations
• Use H(k) = k MOD m
• When m is a power of two (2^N), H(k) can be implemented by just using the log2(m) least significant bits of k to address the cache (translates to simple interconnections)
[Figure: producer and consumer addresses pass through H blocks to index a small data plus empty/full buffer in which two entries are marked full, holding A[5] and A[1]]
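
The low-bits shortcut for H(k) = k MOD m can be checked with a few lines of C. This is a minimal sketch that assumes m is a power of two; the function name `hash_low_bits` is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* H(k) = k MOD m for m a power of two: keep only the log2(m) low bits of k.
   In hardware this is plain wiring of the low address bits. */
uint32_t hash_low_bits(uint32_t k, uint32_t m) {
    assert(m != 0 && (m & (m - 1)) == 0);  /* m must be a power of two */
    return k & (m - 1);
}
```

For example, with m = 8 the address 13 (binary 1101) maps to entry 5 (binary 101), the same result as 13 MOD 8.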

Inter-Stage Communication
Hash-based solution: H(k) = k MOD m
Single read (L = 1): R = 1; update: "= 0"
Operations: (a) write, (b) read, (c) empty/full update
[Figure: communication component with blocks M, T, and R and width labels L and N; data_in and address_in on the write side, address_out and data_out on the read side, both addresses passing through H; hit/miss output; paths annotated with (a), (b), (c)]

Inter-Stage Communication
Hash-based solution: H(k) = k MOD m
Multiple reads (L > 1): R = 11...1 (L ones); update: ">>="
Operations: (a) write, (b) read, (c) empty/full update
[Figure: communication component with blocks M, T, and R and width labels L and N; data_in and address_in on the write side, address_out and data_out on the read side, both addresses passing through H; hit/miss output; paths annotated with (a), (b), (c)]
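
One plausible reading of the R and update annotations is that a write loads the table entry with L one-bits and each successful read shifts the entry right by one, so the entry becomes empty after L reads (L = 1 gives the single-read case). The sketch below follows that reading; it is an assumption, as are the names `hbuf_t`, `hbuf_write`, `hbuf_read`, the size `M_SIZE`, and the choice to reject a write to a cell that is still full.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define M_SIZE 8u  /* hypothetical buffer size m (power of two) */

typedef struct {
    int      data[M_SIZE];  /* data memory */
    uint32_t t[M_SIZE];     /* empty/full entries: remaining reads as 1-bits */
    uint32_t r;             /* reset pattern R: L one-bits */
} hbuf_t;

void hbuf_init(hbuf_t *b, unsigned reads_per_value) {
    assert(reads_per_value >= 1 && reads_per_value <= 32);
    b->r = (reads_per_value == 32) ? 0xFFFFFFFFu
                                   : ((1u << reads_per_value) - 1u);
    for (unsigned i = 0; i < M_SIZE; i++) {
        b->t[i] = 0;
    }
}

/* (a) write: fails if the hashed cell still holds unread data. */
bool hbuf_write(hbuf_t *b, uint32_t addr, int value) {
    uint32_t h = addr & (M_SIZE - 1u);  /* H(k) = k MOD m */
    if (b->t[h] != 0) {
        return false;
    }
    b->data[h] = value;
    b->t[h] = b->r;
    return true;
}

/* (b) read and (c) empty/full update: a hit consumes one of the L reads. */
bool hbuf_read(hbuf_t *b, uint32_t addr, int *value) {
    uint32_t h = addr & (M_SIZE - 1u);
    if (b->t[h] == 0) {
        return false;  /* miss */
    }
    *value = b->data[h];
    b->t[h] >>= 1;
    return true;
}
```

Note that the table stores no address tag, so two live addresses that hash to the same cell cannot be told apart; the buffer has to be large enough that this never happens.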

Buffer size calculation
By monitoring the behavior of the communication component
For each read and write
• Determine the size of the buffer needed to avoid collisions
Done during RTL simulation
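
The slides do not show how the monitoring is implemented. As an illustration only, the sketch below replays a recorded sequence of writes and reads and returns the smallest power-of-two buffer size for which H(k) = k MOD m never maps a write onto a cell that is still full. The trace format `event_t`, the limit `MAX_M`, and the assumptions that each value is read once and that the trace contains only reads that hit are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_M 1024u  /* hypothetical upper limit on the buffer size */

typedef struct {
    bool     is_write;  /* true: producer store, false: consumer load (hit) */
    uint32_t addr;      /* original (unhashed) address */
} event_t;

/* Replays the trace for one candidate size m (power of two). */
static bool collision_free(const event_t *trace, size_t n, uint32_t m) {
    bool full[MAX_M];
    memset(full, 0, sizeof full);
    for (size_t i = 0; i < n; i++) {
        uint32_t h = trace[i].addr & (m - 1u);
        if (trace[i].is_write) {
            if (full[h]) {
                return false;  /* would overwrite data not yet consumed */
            }
            full[h] = true;
        } else {
            full[h] = false;   /* single read frees the cell */
        }
    }
    return true;
}

/* Smallest power-of-two size without collisions, or 0 if none up to MAX_M. */
uint32_t min_buffer_size(const event_t *trace, size_t n) {
    for (uint32_t m = 1; m <= MAX_M; m <<= 1) {
        if (collision_free(trace, n, m)) {
            return m;
        }
    }
    return 0;
}
```

A trace in which the producer runs far ahead of the consumer, or in which live addresses share their low bits, pushes the result up, which matches the intuition that slowing the producer can shrink the buffer.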

Experimental Setup
Compilation flow
• Uses our previous work on compiling algorithms in a Java subset to FPGAs
[Figure: compilation flow with the blocks Java Code with directives; Front-End (includes compilation to JVM); Java bytecodes; Nau; Library (FUs); FU Models (HDL); FU Models (Java); Estimators; Control Units (XML); Datapath Units (XML); RTG (XML); XSL Transformers; Logic Synthesis and Place and Route (vendor-specific); Specific Reconfigurable Hardware (FPGA)]

Experimental Setup
Simulation back-end
[Figure: datapath.xml, fsm.xml, and rtg.xml are processed by XSLTs (to dotty, to hds, to java, to vhdl), producing datapath.hds, fsm.java, and rtg.java; fsm.class and rtg.class run in HADES together with the Library of Operators (JAVA) and the I/O data (RAMs and stimulus); an ANT build file drives the flow]

Experimental Results
Benchmarks

| Algorithm | # Stages | # Loops | Description |
|---|---|---|---|
| fdct | 2 {s1,s2} | 3 | Fast DCT (Discrete Cosine Transform) |
| fwt2D | 4 {s1,s2,s3,s4} | 8 | Forward Haar Wavelet |
| RGB2gray + histogram | 2 {s1,s2} | 2 | Transforms an RGB image to a gray image with 256 levels and determines the histogram of the gray image |
| Smooth + sobel, 3 versions: (a), (b), (c) | 2 {s1,s2} | 6 | Smooth image operation based on 3×3 windows, with the resulting image being the input to the Sobel edge detector. (a): original code; (b): two innermost loops of the smooth algorithm fully unrolled (scalar replacement of the array with coefficients); (c): the same as (b) plus elimination of redundant array references in the original code of sobel. |

Experimental Results
FDCT (speed-up achieved by Pipelining Sequences of Loops)
[Figure: speed-up (axis from 1.00 to 2.00) versus number of 8x8 blocks (1, 2, 3, 4, 5, 6, 7, 8, 16, 32, 40, 48, 56, 64, 128, 256, 512, 1024)]

Experimental Results

| Algorithm | Input data size | Stages | #cc w/o PSL | Speed-up upper bound | #cc w/ PSL | Speed-up |
|---|---|---|---|---|---|---|
| fdct | 800×600 | (s1,s2) | 3,930,005 | 2.02 | 1,830,215 | 2.02 |
| | | (s1) | 1,950,003 | | | |
| | | (s2) | 1,920,003 | | | |
| fwt2D | 512×512 | (s1,s2,s3,s4) | 4,724,745 | 2.00 | 3,664,917 | 1.29 |
| | | (s1,s2) | 2,362,373 | | | |
| | | (s3,s4) | 2,362,373 | | | |
| RGB2gray + histogram | 800×600 | (s1,s2) | 6,720,025 | 1.75 | 3,840,007 | 1.75 |
| | | (s1) | 2,880,015 | | | |
| | | (s2) | 3,840,015 | | | |
| Smooth + sobel (a) | 800×600 | (s1,s2) | 49,634,009 | 1.51 | 32,929,489 | 1.51 |
| | | (s1) | 32,929,473 | | | |
| | | (s2) | 16,606,951 | | | |
| Smooth + sobel (b) | 800×600 | (s1,s2) | 30,068,645 | 1.81 | 16,640,509 | 1.81 |
| | | (s1) | 13,364,109 | | | |
| | | (s2) | 16,606,951 | | | |
| Smooth + sobel (c) | 800×600 | (s1,s2) | 25,773,809 | 1.92 | 13,364,117 | 1.92 |
| | | (s1) | 13,364,109 | | | |
| | | (s2) | 11,862,791 | | | |
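
The slides do not define how the upper bound is computed. One reading that agrees with the RGB2gray + histogram row is that a pipelined run can be no shorter than its slowest stage, so the bound is the cycle count without PSL divided by the cycle count of the slowest stage. The sketch below encodes that assumption; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed upper bound: total cycles without pipelining divided by the
   cycles of the slowest stage (perfect overlap of all other stages). */
double speedup_upper_bound(uint64_t total_cycles,
                           const uint64_t *stage_cycles, size_t n_stages) {
    uint64_t slowest = 0;
    for (size_t i = 0; i < n_stages; i++) {
        if (stage_cycles[i] > slowest) {
            slowest = stage_cycles[i];
        }
    }
    return slowest ? (double)total_cycles / (double)slowest : 0.0;
}

/* Measured speed-up: cycles without PSL over cycles with PSL. */
double speedup(uint64_t cycles_without, uint64_t cycles_with) {
    return cycles_with ? (double)cycles_without / (double)cycles_with : 0.0;
}
```

For RGB2gray + histogram, the stage counts 2,880,015 and 3,840,015 with a total of 6,720,025 give a bound of about 1.75, and 6,720,025 over 3,840,007 gives a measured speed-up of about 1.75 as well.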

Experimental Results
What happens with buffer sizes?
[Figure: bar chart on a logarithmic axis (1 to 1000000) for fdct, fwt2D, RGB2gray + histogram (a), and smooth + sobel (a), comparing table size (no hash function), buffer size used (simple hash function), and buffer minimum size (perfect hash)]

Experimental Results
Adjust latency of tasks in order to balance pipeline stages:
• Slow down tasks with higher latency
• Optimize slower tasks in order to reduce their latency
Slowing down producer tasks usually reduces the size of the inter-stage buffers

Experimental Results
Buffer sizes
[Figure: bar chart on a logarithmic axis (1 to 1000000) for smooth + sobel (a), (b), (c) and RGB2gray + histogram (a), (b), (c), comparing table size (no hash function), buffer size used (simple hash function), and buffer minimum size (perfect hash); annotations for smooth + sobel: original, optimizations in the producer, + optimizations in the consumer; annotations for RGB2gray + histogram: original, +1 cycle per iteration of the producer, +2 cycles per iteration of the producer]

Experimental Results
Buffer sizes
[Figure: bar chart for smooth + sobel (a), (b), (c), RGB2gray + histogram (a), (b), (c), fwt2D, and fdct, showing overhead related to optimal size and reduction related to original, with a percentage axis (0.0% to 60.0%) and a logarithmic axis]

Experimental Results
Resources and Frequency (Spartan-3 400)
[Figure: FPGA resources (# FFs, # 4-LUTs, # Slices; logarithmic axis from 1 to 10000) and normalized frequency (axis from 0.0 to 1.2) for fdct, smooth + sobel, RGB2gray + histogram, and fwt2D, each in its original, -hash, and -table versions]

Related Work
Previous approach (Ziegler et al.)
• Coarse-grained communication and synchronization scheme
• FIFOs are used to communicate data between pipelining stages
• Width of FIFO stages dependent on producer/consumer ordering
• Less applicable
[Figure: three producer/consumer orderings over time with the FIFO each requires. Producer A[0] A[1] A[2] A[3] ... with consumer A[0] A[1] A[2] A[3] ...: stages of one element. Producer A[0] A[1] A[2] A[3] ... with consumer A[1] A[0] A[3] A[2] ...: stages of two elements (A[0] A[1], then A[2] A[3]). Producer A[0] A[1] A[2] A[3] A[4] A[5] ... with consumer A[0] A[3] A[1] A[4] A[2] A[5] ...: stages of five elements (A[0] to A[4], then A[5] to A[9])]

Conclusions
We presented a scheme to accelerate applications by pipelining sequences of loops
• I.e., before the end of a stage (set of nested loops), a subsequent stage (set of nested loops) can start executing based on data already produced
A data-driven scheme is used, based on empty/full tables
• A scheme to reduce the size of the memory buffers for inter-stage pipelining (using a simple hash function)
Depending on the consumer/producer ordering, speedups close to the theoretical ones are achieved
• As if stages were executed concurrently and independently

Future Work
Research other hash functions
Study slowdown effects
Apply the technique in the context of multi-core systems
[Figure: Processor Core A and Processor Core B, each with a memory, connected through the hash-based communication component (blocks M, T, R, and H; data_in, address_in, address_out, data_out, hit/miss; paths annotated with (a), (b), (c))]

Acknowledgments
Work partially funded by
• CHIADO - Compilation of High-Level Computationally Intensive Algorithms to Dynamically Reconfigurable COmputing Systems
• Portuguese Foundation for Science and Technology (FCT), POSI and FEDER, POSI/CHS/48018/2002
Based on the work done by Rui Rodrigues
In collaboration with Pedro C. Diniz

Buffer Monitor
FDCT
[Figure: buffer size in elements (axis from 0 to 60) and store, load (hit), and load (miss) events versus clock cycles (0 to 300)]

Buffer Monitor
fwt2D
[Figure: buffer size in elements (axis from 0 to 1.2) and load (miss), load (hit), and store events versus clock cycles (0 to 100)]

Buffer Monitor
RGB2gray + histogram
[Figure: buffer size in elements (axis from 0 to 12) and store, load (miss), and load (hit) events versus clock cycles (0 to 360)]

Buffer Monitor
RGB2gray + histogram (modified)
[Figure: buffer size in elements (axis from 0 to 6) and store, load (miss), and load (hit) events versus clock cycles (0 to 378)]

Buffer Monitor
Smooth + Sobel a)
[Figure: buffer size in elements (axis from 0 to 30) and store, load (miss), and load (hit) events versus clock cycles (0 to 2373)]

Buffer Monitor
Smooth + Sobel a)
[Figure: buffer size in elements (axis from 0 to 14) and store, load (miss), and load (hit) events versus clock cycles (1 to 2394)]