Vector FPGA Acceleration of 1-D DWT Computations using Sparse Matrix Skeletons
Sidharth Maheshwari, Gourav Modi, Siddhartha, Nachiket Kapre
School of Computer Science and Engineering
Nanyang Technological University
Matrix-Form 1-D DWT
• Formulation: C = TM · X, where TM = ∏_i T_i
• The TM matrix is highly sparse
Ø Large number of multiply-by-zero operations
Ø Large memory footprint consisting of zeroes
• Goals:
Ø SIMD-friendly operations on non-zero values only
Ø Customized DMA routines for efficient bandwidth utilization
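The matrix formulation above can be illustrated with a small sketch. It assumes the Haar wavelet, a signal length of 64 and three decomposition levels, all chosen only to keep the example short; none of these choices comes from the poster, and the helper names (`haar_level_matrix`, `transform_matrix`, `zero_fraction`) are hypothetical. The sketch multiplies the per-level matrices into a single TM and reports how much of TM is zero.

```python
import math

def haar_level_matrix(n, m):
    """n x n matrix: one Haar analysis step on the first m entries,
    identity on the remaining n - m entries (assumed example wavelet)."""
    s = 1.0 / math.sqrt(2.0)
    T = [[0.0] * n for _ in range(n)]
    half = m // 2
    for r in range(half):
        T[r][2 * r] = s              # low-pass row: scaled sum of a pair
        T[r][2 * r + 1] = s
        T[half + r][2 * r] = s       # high-pass row: scaled difference
        T[half + r][2 * r + 1] = -s
    for r in range(m, n):
        T[r][r] = 1.0                # earlier detail coefficients pass through
    return T

def matmul(A, B):
    n = len(A)
    return [[sum(A[r][k] * B[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def transform_matrix(n, levels):
    """TM as the product of the per-level matrices T_i."""
    TM = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    m = n
    for _ in range(levels):
        TM = matmul(haar_level_matrix(n, m), TM)
        m //= 2                      # next level works on the approximation half
    return TM

def zero_fraction(A):
    total = sum(len(row) for row in A)
    zeros = sum(1 for row in A for v in row if v == 0.0)
    return zeros / total

TM = transform_matrix(64, 3)
C = matvec(TM, [1.0] * 64)           # C = TM · X for a constant test signal
print(f"zero fraction of TM: {zero_fraction(TM):.4f}")
```

A dense product C = TM · X touches every entry of TM, so most of the multiplications in this example are by zero, which is the waste the goals above target.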
Results - Speedup
[Figure: speedup (axis 0 to 60) of MXP-DE2, MXP-DE4 and MXP-ZedBoard over baseline CPUs (Raspberry Pi, Zedboard, BeagleBone Black); 𝑁 = 2-., 𝐿 = 6 and 𝑘 = 3]
Summary
• We propose a Modified Matrix-Form scheme to unlock inherent parallelism in 1-D DWT
• We exploit the sparsity pattern in TM to reduce complexity from O(n^2) to O(n) using:
Ø Skeletons to avoid wasteful multiply-by-zero operations
Ø Rearrangement of input samples
• Speedups of 12-103x over the state-of-the-art built-in signal library in Octave (dwt function)
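The two ideas in the summary can be sketched for a single DWT level. This is only one plausible reading, not the authors' skeleton format: the sketch assumes periodic signal extension and the 4-tap Daubechies filter, and the names `dwt_level_dense` and `dwt_level_skeleton` are hypothetical. The dense version builds the full level matrix and multiplies it, which costs O(n^2); the second version stores only the filter taps and first splits the input into even and odd samples, so each tap becomes one scale-and-accumulate over a contiguous vector, for O(n) work overall.

```python
import math

# Assumed example filter: 4-tap Daubechies low-pass H and its mirror high-pass G.
_r3, _d = math.sqrt(3.0), 4.0 * math.sqrt(2.0)
H = [(1 + _r3) / _d, (3 + _r3) / _d, (3 - _r3) / _d, (1 - _r3) / _d]
G = [(-1) ** j * H[len(H) - 1 - j] for j in range(len(H))]

def dwt_level_dense(x, h, g):
    """Reference: dense n x n level matrix times x, zeros included (O(n^2))."""
    n = len(x)
    half = n // 2
    T = [[0.0] * n for _ in range(n)]
    for i in range(half):
        for j in range(len(h)):
            c = (2 * i + j) % n          # periodic wrap-around (assumption)
            T[i][c] += h[j]
            T[half + i][c] += g[j]
    return [sum(T[r][c] * x[c] for c in range(n)) for r in range(n)]

def dwt_level_skeleton(x, h, g):
    """Non-zero taps only, applied to rearranged input (O(len(h) * n))."""
    n = len(x)
    half = n // 2
    even, odd = x[0::2], x[1::2]         # rearrangement of input samples
    approx = [0.0] * half
    detail = [0.0] * half
    for j in range(len(h)):
        src = even if j % 2 == 0 else odd
        shift = (j // 2) % half
        rotated = src[shift:] + src[:shift]   # contiguous, stride-1 operand
        for i in range(half):                 # maps to one vector multiply-accumulate
            approx[i] += h[j] * rotated[i]
            detail[i] += g[j] * rotated[i]
    return approx + detail

x = [float((i * i) % 7) for i in range(16)]
dense = dwt_level_dense(x, H, G)
sparse = dwt_level_skeleton(x, H, G)
print(max(abs(p - q) for p, q in zip(dense, sparse)))
```

The inner loop over `i` has no data dependence between iterations, which is what makes this form a natural fit for vector lanes.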
[Figure panels: Matrix-form 1-D DWT; Sparse Matrix Skeletons]
Experimental Setup
• CPU
- Optimized OpenBLAS routines in Octave and C (compiled with -O3)
- Performance measured using PAPI v5.4.3
- 32-bit ARMv7 on BeagleBone Black and Zedboard, and ARMv6 on Raspberry Pi
• CPU + MXP
- Customized DMA routines for data transfer between host and MXP
- 16-32 vector lanes
- 64-128 KB scratchpad memory
- Performance measured using MXP Timing API
- Altera DE2/DE4 and Zedboard
Results - Throughput
[Figure: energy (mJ) versus throughput (GOps/s) for ARM (BeagleBone), ARM (Raspberry Pi), ARM (Zedboard), MXP-DE2, MXP-DE4 and MXP-Zed; 𝑁 = 2-., 𝐿 = 6 and 𝑘 = 3]
CHALLENGES:
• Large volume of data
• Strict real-time processing constraints
• High accuracy demands
• Energy constraints, especially in embedded systems