A Multi-Processor System on Chip Architecture for Real Time Remote Sensing Data Processing
Presenter: Dr. Alejandro Castillo Atoche
School of Engineering, Autonomous University of Yucatan, Merida, Mexico.
IGARSS’11, 2011/07/25
Outline
Introduction
Previous Work
MPSoC via the HW/SW Co-design
Case Study: RBR Algorithms
Algorithm Analysis
Network on Chip (NoC)-based Accelerator Integration in a Co-design scheme
New Perspective: Network of FPGA-VLSI architectures
Hardware Implementation Results
Performance Analysis
Conclusions
Introduction: Radar Imagery, Facts
The starting point of this proposal for geospatial remote sensing (RS) imagery is to solve the ill-conditioned inverse problem of spatial spectrum pattern (SSP) estimation under model uncertainties via the Bayesian minimum risk (BMR) estimation strategy.
In previous works, alternative MPSoC proposals have been developed, but without systolic array techniques or Network on a Chip structures.
Introduction: HW implementation, Facts
Why Multiprocessor System on a Chip? Because MPSoCs are single-chip multiprocessors designed for real-time signal processing applications.
Why Network on a Chip accelerators? Networks-on-chip (NoCs) are multiprocessor interconnection networks designed to achieve real-time signal processing (SP). They avoid bottlenecks in HW/SW co-designs.
MOTIVATION
To conceptualize and efficiently implement an architecture that combines parallel computing and systolic array mapping techniques in a novel network on a chip (NoC) accelerator scheme via the HW/SW co-design paradigm.
CONTRIBUTIONS:
First, a high-speed robust Bayesian regularization (RBR) hardware accelerator is designed for the real-time enhancement of large-scale geospatial imagery.
Second, High Performance Computing techniques are applied in an efficient architecture based on a Network on a Chip (NoC).
Algorithmic ref. Implementation
Method        RSF                      RBR
SNR [dB]      15      20      25       15      20      25
IOSNR [dB]    10.15   15.32   20.25    6.15    10.62   13.04
PIOSNR (%)    81.37   86.62   85.24    95.18   90.29   98.24
MSE           0.16    0.46    0.57     0.03    0.29    0.34
Partitioning Stage
[Figure: partitioning of the algorithm between the embedded processor and two coprocessors inside the FPGA.]
Embedded Processor: data acquisition of $u^{(j)}$; parameters (the R and S symbols are not recoverable from the extraction); pre-computed SW stage: $F$, $\Omega$.
Coprocessor 1, Robust SS vector: $V = \{F u u^{+} F^{+}\}_{diag}$
Coprocessor 2, RBR estimator: $\hat{b}_{RBR} = b_0 + \Omega V$
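As a software reference for the two coprocessor tasks in the partitioning stage, the following is a minimal pure-Python sketch, not the authors' implementation. It reads the superscript + as the conjugate transpose and assumes a plus sign between $b_0$ and $\Omega V$, since the operator is not legible in the slide. The helper names `matvec`, `robust_ss_vector`, and `rbr_estimate`, and the toy 2 x 2 data, are hypothetical.

```python
def matvec(M, x):
    """Matrix-vector product M x for a matrix stored as a list of rows."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def robust_ss_vector(F, u):
    """V = {F u u^+ F^+}_diag, with ^+ read as the conjugate transpose.

    The i-th diagonal entry of F u u^+ F^+ equals |(F u)_i|^2, so the
    n x n product does not have to be formed in this software model.
    """
    y = [complex(v) for v in matvec(F, u)]
    return [v.real ** 2 + v.imag ** 2 for v in y]


def rbr_estimate(b0, Omega, V):
    """b_RBR = b0 + Omega V (the '+' sign is an assumption of this sketch)."""
    return [b + w for b, w in zip(b0, matvec(Omega, V))]


# Toy 2 x 2 data, chosen only for illustration.
F = [[1, 1j], [0, 2]]
u = [1 + 1j, 1]
V = robust_ss_vector(F, u)
b_rbr = rbr_estimate([0.5, 0.5], [[1, 0], [0, 0.5]], V)
print(V, b_rbr)
```

The shortcut through $|(F u)_i|^2$ is a property of the formula itself. The hardware described in the slides instead evaluates the product in three array stages.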
NoC oriented structure of the proposed coprocessors
(a) Robust SS vector coprocessor
[Figure: data path from the embedded processor to the Input FIFO (data input: u, F), then through three fixed-size processor arrays separated by FIFOs and memory buffers, each with its own tiled control: PA1 (m = 32) computes $F u$, PA2 (m = 4) computes $F u u^{+}$, and PA3 (m = 32) computes $\{F u u^{+} F^{+}\}_{diag}$. The Output FIFO (data output: V) returns the result to the embedded processor.]
(b) RBR estimator coprocessor
[Figure: data path from the embedded processor to the Input FIFO (data input: Ω, V), then through the fixed-size processor array PA4 (m = 32) with tiled control, a FIFO, and a memory buffer, which computes $\Omega V$. Adding $b_0$, also supplied by the embedded processor, gives $\hat{b}_{RBR} = b_0 + \Omega V$, which the Output FIFO returns to the embedded processor.]
Aggregation of parallel computing techniques
Application
for (tile = 0; tile < L; tile++) {            /* one pass per tile */
  for (i = 0; i < m; i++) {
    for (j = 0; j < n; j++) {
      for (k = 0; k < r; k++) {
        a(i,j,k) = a(i,j-1,k);                /* a propagates along j */
        b(i,j,k) = b(i-1,j,k);                /* b propagates along i */
        c(i,j,k) = c(i,j,k-1) + a(i,j,k)*b(i,j,k);   /* c accumulates along k */
      }
    }
  }
}
[Figure: 3-D Dependence Graph (DG) of the loop nest and its projection onto a signal flow graph (SFG). The input elements A[0,0] ... A[2,2] and B[0,0] ... B[2,2] enter the array as skewed streams. Each PE contains one multiplier (mul) and one adder (add): it receives a[i,j-1] and b[i-1,j], passes on a[i,j] and b[i,j], and updates c[i,j].]
Linear Schedule: set of parallel and uniformly spaced hyperplanes.
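To make the linear schedule concrete, here is a small pure-Python sketch that evaluates the recurrence a(i,j,k), b(i,j,k), c(i,j,k) one hyperplane at a time. The choice of hyperplanes i + j + k = t is an assumption of this sketch (the slide does not state the schedule vector), and the function name `systolic_matmul` is hypothetical. Every value read at step t was produced at step t - 1, so a missing dictionary key would reveal an invalid schedule.

```python
def systolic_matmul(A, B):
    """Evaluate c(i,j,k) = c(i,j,k-1) + a(i,j,k)*b(i,j,k) hyperplane by hyperplane.

    A is m x r, B is r x n. Returns (C, steps), where steps is the number
    of hyperplanes i + j + k = t that were visited.
    """
    m, r, n = len(A), len(B), len(B[0])
    a, b, c = {}, {}, {}
    steps = 0
    for t in range(m + n + r - 2):            # t = 0 .. (m-1)+(n-1)+(r-1)
        for i in range(m):
            for j in range(n):
                k = t - i - j
                if not 0 <= k < r:
                    continue                  # point (i,j,k) is not on this hyperplane
                # a travels along j, b travels along i; boundary cells read the inputs.
                a[i, j, k] = A[i][k] if j == 0 else a[i, j - 1, k]
                b[i, j, k] = B[k][j] if i == 0 else b[i - 1, j, k]
                prev = 0 if k == 0 else c[i, j, k - 1]
                c[i, j, k] = prev + a[i, j, k] * b[i, j, k]
        steps += 1
    C = [[c[i, j, r - 1] for j in range(n)] for i in range(m)]
    return C, steps


C, steps = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, steps)
```

All points on one hyperplane are independent of each other, which is what allows the PEs of the array to work on them in parallel.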
Tiling technique
[Figure, shown as a sequence of animation frames: a large-scale real-world image is partitioned into tiles, for example (1,2) and (2,1). Each tile in turn is streamed through FIFOs and I/O ports into a fixed-size systolic array of PEs, so the same array sweeps the whole image tile by tile.]
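The tiling idea can be sketched in software as a fixed-size block that sweeps a larger matrix. This is a minimal sketch assuming a square m x m block and a matrix-vector product as the workload. The function name `tiled_matvec` and the test sizes are hypothetical, and the sketch models neither the FIFOs nor the timing.

```python
def tiled_matvec(F, u, m):
    """Compute F u by sweeping a fixed m x m block over F, tile by tile.

    Returns (y, tiles), where tiles counts the passes of the fixed-size block.
    """
    n_rows, n_cols = len(F), len(u)
    y = [0] * n_rows
    tiles = 0
    for r0 in range(0, n_rows, m):                # tile row origin
        for c0 in range(0, n_cols, m):            # tile column origin
            # One pass of the fixed-size block; edge tiles may be smaller.
            for i in range(r0, min(r0 + m, n_rows)):
                for j in range(c0, min(c0 + m, n_cols)):
                    y[i] += F[i][j] * u[j]
            tiles += 1
    return y, tiles


F = [[5 * i + j for j in range(5)] for i in range(5)]
u = [1, 2, 3, 4, 5]
y, tiles = tiled_matvec(F, u, 2)
print(y, tiles)
```

Partial sums for one output row are accumulated across the tiles of that row, which is the role a memory buffer between passes would play in hardware.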
Fixed-Sized NoC-PAs-based Robust SS vector co-processor
Stage 1 (PA1, fixed-size, m = 32): $V^{TEMP\_1} = F u$, with sizes $(n \times 1) = (n \times n)(n \times 1)$.
[Figure: the degraded large-scale RS image u(1), u(2), u(3), ..., u(n) and the matrix F come from the embedded processor through FIFO buffers (width label: 32) governed by a global control with full and en flags. The elements of F enter the array data-skewed, with zeros padding the staggered rows, while the elements of u pass through one-step delays. Each PE multiplies two [32,23] operands, truncates the [64,46] product to [32,23], and adds it to the running value, taking V^{TEMP_1} from step m-1 to step m under local and tiled control. D: one-step delay. T: truncate.]
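The labels [32,23], [64,46], and T (truncate) in the Stage 1 figure suggest a fixed-point multiply-accumulate inside each PE. The sketch below assumes that [w,f] means a w-bit word with f fractional bits, which the slide does not state. The names `to_fixed`, `to_float`, `fixed_mul`, and `fixed_mac` are hypothetical, and overflow of the 32-bit word is not modeled.

```python
FRAC_BITS = 23   # assumed number of fractional bits in the [32,23] format


def to_fixed(x):
    """Quantize a real number to an integer with FRAC_BITS fractional bits."""
    return int(round(x * (1 << FRAC_BITS)))


def to_float(q):
    """Convert a fixed-point integer back to a Python float."""
    return q / (1 << FRAC_BITS)


def fixed_mul(qa, qb):
    """Multiply two [32,23] values and truncate the result.

    The exact product carries 46 fractional bits ([64,46]); shifting right
    by FRAC_BITS drops the low-order bits and returns to 23 fractional bits.
    """
    full = qa * qb               # [64,46] intermediate product
    return full >> FRAC_BITS     # truncate (floor) back to [32,23]


def fixed_mac(acc, qa, qb):
    """One PE step: accumulate the truncated product."""
    return acc + fixed_mul(qa, qb)


acc = fixed_mac(0, to_fixed(1.5), to_fixed(0.25))
acc = fixed_mac(acc, to_fixed(-2.0), to_fixed(0.5))
print(to_float(acc))
```

Truncation keeps the word length constant from PE to PE, at the cost of a small downward rounding error per product.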
Stage 2 (PA2, fixed-size, m = 4): $V^{TEMP\_2} = V^{TEMP\_1} u^{+}$, with sizes $(n \times n) = (n \times 1)(1 \times n)$, which gives $F u u^{+}$.
[Figure: a two-dimensional grid of PEs with one-step delays (D). The elements of V^{TEMP_1} and the elements of u are fed into the grid; each PE multiplies two [32,23] operands and truncates (T) the [64,46] product to [32,23], producing one element of V^{TEMP_2}.]
Stage 3 (PA3, fixed-size, m = 32): $V = \mathrm{diag}\{V^{TEMP\_2} F^{+}\}$, where $V$ has size $(n \times 1)$ and $V^{TEMP\_2}$ has size $(n \times n)$. This completes $V = \{F u u^{+} F^{+}\}_{diag}$.
[Figure: the elements of F and of V^{TEMP_2} enter a column of PEs as skewed streams padded with zeros, under tiled control, and the running value goes from V(m-1) to V(m). The robust SS vector V, with elements v_{1,1}, v_{1,2}, ..., v_{n,n}, is written to a FIFO buffer (width label: 32) governed by a global control with a full flag and is returned to the embedded processor.]
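The three stages of the robust SS vector co-processor can be checked against each other with a short software model. This is a sketch under the assumption that the superscript + denotes the conjugate transpose. The function names `stage1`, `stage2`, and `stage3` and the toy data are hypothetical, and the code mirrors only the arithmetic of the stages, not the array timing.

```python
def stage1(F, u):
    """Stage 1: TEMP_1 = F u, an n x 1 vector."""
    return [sum(f * x for f, x in zip(row, u)) for row in F]


def stage2(temp1, u):
    """Stage 2: TEMP_2 = TEMP_1 u^+, an n x n outer product."""
    return [[t * x.conjugate() for x in u] for t in temp1]


def stage3(temp2, F):
    """Stage 3: V = diag{TEMP_2 F^+}.

    Entry i of the diagonal is sum_k TEMP_2[i][k] * conj(F[i][k]),
    so only row i of TEMP_2 and row i of F are needed.
    """
    return [sum(t * f.conjugate() for t, f in zip(t_row, f_row))
            for t_row, f_row in zip(temp2, F)]


# Toy 2 x 2 data, chosen only for illustration.
F = [[1, 1j], [0, 2]]
u = [1 + 1j, 1]
V = stage3(stage2(stage1(F, u), u), F)
print(V)
```

For this toy input the pipeline returns the same values as the squared magnitudes of F u, which is a quick consistency check on the stage equations.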
Fixed-Sized NoC-PAs-based RBR estimator co-processor
PA4 (fixed-size): $\hat{b}_{RBR} = b_0 + \Omega V$.
[Figure: Ω, V, and b_0 come from the embedded processor through FIFO buffers (width label: 32), each governed by a global control with full and en flags. The elements of Ω enter the array data-skewed, with zeros padding the staggered rows, while the elements of V pass through one-step delays. Each PE multiplies two [32,23] operands, truncates the [64,46] product to [32,23], and adds it to the running estimate, which goes from step m-1 to step m under local and tiled control. The reconstructed RS image, with elements \hat{b}_{RBR}(1), \hat{b}_{RBR}(2), \hat{b}_{RBR}(3), ..., \hat{b}_{RBR}(n), is written to an output FIFO buffer and returned to the embedded processor. D: one-step delay. T: truncate.]
New Perspective: VLSI-FPGA Platforms
A novel VLSI-FPGA platform represents a new perspective for the real-time processing of newer RS applications.
[Figure: VLSI-FPGA platform. The FPGA hosts the control system, the embedded processor, the spatial-temporal reorder block, the buffer memory, and a FIFO. The VLSI co-processor hosts a bit-level MPPA architecture that applies the robustified reconstruction operator to the image data u and F. Input: degraded large-scale RS image $u^{(j)}$. Output: reconstructed RS image $\hat{b}_{RFS}^{(j)}$.]
Performance Analysis: FPGA
Synthesis metrics    Robust SS vector    RBR estimator
Slices               8158                3289
DSP48                144                 32
LUTs                 7539                2278
Flip-Flops           6304                2788

Implementation                            RBR processing time (seconds)
Evaluated PC-Oriented Implementation      19.7
Proposed Efficient RBR architecture       1.26
Conclusions
The implementation results show that the proposed NoC-PA-oriented architecture drastically reduces the overall processing time of the RBR algorithm. The presented architecture is efficiently implemented in MPSoC mode instead of employing systems based on traditional DSPs or PC-cluster platforms.
The implementation of the RBR algorithm using the proposed architecture takes only 1.26 seconds for the large-scale RS image reconstruction, in contrast to the 19.7 seconds required by the C++ implementation. Thus, the achieved processing time is approximately 16 times shorter than that of the conventional C++ PC-based implementation.
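The speedup quoted in the conclusions follows directly from the two reported processing times. The snippet below is only a check of that arithmetic; the variable names are hypothetical.

```python
t_pc = 19.7      # seconds, PC-oriented C++ implementation (from the slides)
t_mpsoc = 1.26   # seconds, proposed architecture (from the slides)

speedup = t_pc / t_mpsoc
print(f"speedup = {speedup:.1f}x")
```

The ratio comes out slightly below 16, which the slides round up to "approximately 16 times".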
Recent Selected Journal Papers
A. Castillo Atoche, D. Torres, Yuriy V. Shkvarko, “Towards Real Time Implementation of Reconstructive Signal Processing Algorithms Using Systolic Arrays Coprocessors”, JOURNAL OF SYSTEMS ARCHITECTURE (JSA), Edit. ELSEVIER, Volume 56, Issue 8, August 2010, Pages 327-339, ISSN: 1383-7621, doi:10.1016/j.sysarc.2010.05.004. JCR.
A. Castillo Atoche, D. Torres, Yuriy V. Shkvarko, “Descriptive Regularization-Based Hardware/Software Co-Design for Real-Time Enhanced Imaging in Uncertain Remote Sensing Environment”, EURASIP JOURNAL ON ADVANCES IN SIGNAL PROCESSING (JASP), Edit. HINDAWI, Volume 2010, 31 pages, 2010. ISSN: 1687-6172, e-ISSN: 1687-6180, doi:10.1155/ASP. JCR.
Yuriy V. Shkvarko, A. Castillo Atoche, D. Torres, “Near Real Time Enhancement of Geospatial Imagery via Systolic Implementation of Neural Network-Adapted Convex Regularization Techniques”, JOURNAL OF PATTERN RECOGNITION LETTERS, Edit. ELSEVIER, 2011. JCR. In Press
Thanks for your attention.
Dr. Alejandro Castillo Atoche
Email: [email protected]