A Multi-Processor System on Chip Architecture for Real Time Remote Sensing Data Processing

School of Engineering, Autonomous University of Yucatan, Merida, Mexico.

2011/07/25

Presenter: Dr. Alejandro Castillo Atoche

A Multi-Processor System on Chip Architecture for Real Time Remote Sensing Data Processing

IGARSS’11


Outline

Introduction
Previous Work
MPSoC via the HW/SW Co-design
Case Study: RBR Algorithms
Algorithm Analysis
Network on Chip (NoC)-based Accelerator Integration in a Co-design Scheme
New Perspective: Network of FPGA-VLSI Architectures
Hardware Implementation Results
Performance Analysis
Conclusions


Introduction: Radar Imagery, Facts

The initial problem addressed in this proposal for geospatial remote sensing (RS) imagery is to solve the ill-conditioned inverse problem of spatial spectrum pattern (SSP) estimation under model uncertainties via the Bayesian minimum risk (BMR) estimation strategy.

In previous works, alternative MPSoC designs have been developed, but without systolic array techniques or Network on a Chip structures.



Introduction: HW implementation, Facts

Why Multiprocessor System on a Chip? Because MPSoCs are single-chip multiprocessors designed for real-time signal processing applications.

Why Network on a Chip accelerators? Networks-on-chip (NoCs) are multiprocessor interconnection networks designed to achieve real-time signal processing (SP). They avoid bottlenecks in HW/SW co-designs.



MOTIVATION

To conceptualize and efficiently implement an architecture that combines parallel computing and systolic array mapping techniques in a novel network on a chip (NoC) accelerator scheme via the HW/SW co-design paradigm.



CONTRIBUTIONS:

First, a high-speed robust Bayesian regularization (RBR) hardware accelerator is designed for the real-time enhancement of large-scale geospatial imagery.

Second, an efficient architecture based on a Network on a Chip (NoC) that employs high-performance computing techniques is developed.



Algorithmic ref. Implementation




Method        RSF                      RBR
SNR [dB]      15      20      25       15      20      25

IOSNR [dB]    10.15   15.32   20.25    6.15    10.62   13.04
PIOSNR (%)    81.37   86.62   85.24    95.18   90.29   98.24
MSE           0.16    0.46    0.57     0.03    0.29    0.34

Partitioning Stage


[Figure: HW/SW partitioning on the FPGA. The embedded processor handles data acquisition, the parameters, and the pre-computed SW stage (F, Ω). Coprocessor 1 computes the robust SS vector V = diag{F u u^+ F^+} from the input u^(j). Coprocessor 2 computes the RBR estimator b̂_RBR = b_0 + Ω V.]

NoC oriented structure of the proposed coprocessors


(a) Robust SS vector

[Figure: data path of the robust SS vector coprocessor. From the embedded processor, an input FIFO delivers the input data (u, F). Tiled-control stages then compute F u, F u u^+ and diag{F u u^+ F^+}, linked by FIFOs and memory buffers (labels: fixed-size PA1, m = 32; fixed-size PA2, m = 4). An output FIFO returns the output data V to the embedded processor.]

NoC oriented structure of the proposed coprocessors


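(b) RBR estimator

[Figure: data path of the RBR estimator coprocessor. From the embedded processor, an input FIFO delivers the input data (Ω, V). A tiled-control stage computes Ω V, buffered by a FIFO and a memory buffer. b_0 is then added to Ω V, and an output FIFO returns the output data b̂_RBR.]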

Aggregation of parallel computing techniques


Application

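for (tile = 0; tile < L; tile++) {                    /* one pass per tile */
  for (i = 0; i < m; i++) {
    for (j = 0; j < n; j++) {
      for (k = 0; k < r; k++) {
        a(i,j,k) = a(i,j-1,k);                        /* propagate a along j */
        b(i,j,k) = b(i-1,j,k);                        /* propagate b along i */
        c(i,j,k) = c(i,j,k-1) + a(i,j,k)*b(i,j,k);    /* accumulate along k */
      }
    }
  }
}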
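As a rough illustration of the loop nest above, the following Python sketch evaluates the single-assignment recurrence for one tile and checks that c(i,j,r-1) equals the ordinary matrix product. The boundary convention (a enters at j = -1 as A[i][k], b enters at i = -1 as B[k][j], c starts at 0) and the names `recurrence_matmul`, `A` and `B` are assumptions made for this example; the slides do not state them.

```python
def recurrence_matmul(A, B):
    """Evaluate the propagate/accumulate recurrence for C = A * B.

    A is m x r and B is r x n (lists of lists). The boundary values are
    an assumption: a enters at j = -1, b at i = -1, and c starts at 0.
    """
    m, r, n = len(A), len(B), len(B[0])
    a, b, c = {}, {}, {}
    for i in range(m):
        for j in range(n):
            for k in range(r):
                # a travels along j and b travels along i (pure propagation)
                a[i, j, k] = a[i, j - 1, k] if j > 0 else A[i][k]
                b[i, j, k] = b[i - 1, j, k] if i > 0 else B[k][j]
                # c accumulates along k
                prev = c[i, j, k - 1] if k > 0 else 0
                c[i, j, k] = prev + a[i, j, k] * b[i, j, k]
    # The finished sums sit on the last k-plane
    return [[c[i, j, r - 1] for j in range(n)] for i in range(m)]


A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8, 9], [10, 11, 12]]
C = recurrence_matmul(A, B)
```

Every variable is written exactly once, which is what makes the dependences between neighbouring index points explicit and lets the nest be mapped onto an array of processing elements.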

[Figure: 3-D dependence graph (DG) of the loop nest and its signal flow graph (SFG) projection. The skewed streams A[0,0] ... A[2,2] and B[0,0] ... B[2,2] enter the array. Each processing element contains one multiplier (mul) and one adder (add): it receives a[i,j-1] and b[i-1,j], forwards them as a[i,j] and b[i,j], and updates c[i,j].]

Linear schedule: a set of parallel and uniformly spaced hyperplanes.
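One common choice of linear schedule for this recurrence is the vector s = (1, 1, 1), which places index point (i, j, k) on the hyperplane t = i + j + k. That choice is an assumption for illustration, since the slides do not name the schedule vector. The sketch groups the index points by hyperplane and checks that every dependence is served by an earlier time step.

```python
from collections import defaultdict

SCHEDULE = (1, 1, 1)  # assumed schedule vector s
# Dependence vectors of the recurrence: a along j, b along i, c along k
DEPENDENCES = [(0, 1, 0), (1, 0, 0), (0, 0, 1)]


def time_step(point, s=SCHEDULE):
    """Hyperplane index t = s . point for one node of the dependence graph."""
    return sum(si * pi for si, pi in zip(s, point))


def hyperplanes(m, n, r):
    """Group all index points of an m x n x r nest by their time step."""
    planes = defaultdict(list)
    for i in range(m):
        for j in range(n):
            for k in range(r):
                planes[time_step((i, j, k))].append((i, j, k))
    return dict(planes)


def is_causal(s=SCHEDULE):
    """A linear schedule is valid when s . d > 0 for every dependence d."""
    return all(time_step(d, s) > 0 for d in DEPENDENCES)


planes = hyperplanes(3, 3, 3)
```

Under this assumed schedule the 27 operations of a 3 x 3 x 3 nest fall on 7 hyperplanes, and all points on one hyperplane can execute in the same time step.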

Tiling technique


[Figure: a large-scale real-world image is partitioned into tiles that pass through I/O ports and FIFOs into a fixed-size systolic array of processing elements (PEs).]

Tiling technique


[Figure: tile-by-tile mapping of the large-scale real-world image onto the fixed-size systolic array. Successive tiles, for example (1,2) and (2,1), are routed through the I/O ports and FIFOs to the same grid of PEs.]
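The tiling idea can be sketched in software as follows: a matrix-vector product that is too large for the array is cut into m x m tiles, and each tile is processed by the same fixed-size routine, which stands in for the systolic array. The tile size, the zero padding of edge tiles and the function names are assumptions for this sketch and are not taken from the slides.

```python
def fixed_size_matvec(tile, vec):
    """Stand-in for the fixed-size array: a tile times a vector of equal width."""
    return [sum(t * v for t, v in zip(row, vec)) for row in tile]


def tiled_matvec(F, u, m):
    """Compute F u by streaming m x m tiles through fixed_size_matvec."""
    n_rows, n_cols = len(F), len(u)
    result = [0] * n_rows
    for r0 in range(0, n_rows, m):        # tile row
        for c0 in range(0, n_cols, m):    # tile column
            # Cut one tile, zero-padding where it overhangs the matrix edge
            tile = [[F[r][c] if r < n_rows and c < n_cols else 0
                     for c in range(c0, c0 + m)]
                    for r in range(r0, r0 + m)]
            vec = [u[c] if c < n_cols else 0 for c in range(c0, c0 + m)]
            partial = fixed_size_matvec(tile, vec)
            # Accumulate the partial sums that belong to this tile row
            for offset, value in enumerate(partial):
                if r0 + offset < n_rows:
                    result[r0 + offset] += value
    return result


F = [[1, 2, 3, 4, 5],
     [6, 7, 8, 9, 10],
     [11, 12, 13, 14, 15],
     [16, 17, 18, 19, 20],
     [21, 22, 23, 24, 25]]
u = [1, 0, -1, 2, 1]
v = tiled_matvec(F, u, m=2)
```

The result does not depend on the tile size, so the array dimension can be chosen from the available hardware resources rather than from the image size.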

Fixed-Sized NoC-PAs-based Robust SS vector co-processor


Stage 1 (PA1): V_TEMP_1 = F u, with F of size (n × n) and u, V_TEMP_1 of size (n × 1).

[Figure: Stage 1 processor array. The degraded large-scale RS image supplies the vectors u^(1), u^(2), u^(3), ..., u^(n). The elements of u and the skewed, zero-padded elements of F arrive from the embedded processor through 32-bit FIFO buffers (signals: full, en) under global, local and tiled control. Each PE multiplies two [32,23] operands into a [64,46] product, truncates it to [32,23], and updates V_TEMP_1(m-1) to V_TEMP_1(m). D: one-step delay. T: truncate.]
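The [32,23] and [64,46] labels in the Stage 1 figure can be read as fixed-point formats. Assuming [w,f] means a w-bit signed word with f fractional bits (the slides do not define the notation), multiplying two [32,23] operands yields a [64,46] product, and the truncation block T drops the 23 least significant bits to return to [32,23]. A minimal sketch under that assumption, with hypothetical helper names:

```python
FRAC_BITS = 23  # assumed fractional bits of the [32,23] format
WORD_BITS = 32  # assumed total word length


def to_fixed(x):
    """Quantize a real number to an integer holding 23 fractional bits."""
    return int(round(x * (1 << FRAC_BITS)))


def to_float(q, frac_bits=FRAC_BITS):
    """Convert a fixed-point integer back to a Python float."""
    return q / (1 << frac_bits)


def wrap32(q):
    """Keep the low 32 bits and reinterpret them as a signed word."""
    q &= (1 << WORD_BITS) - 1
    return q - (1 << WORD_BITS) if q >= (1 << (WORD_BITS - 1)) else q


def mac(acc, f, u):
    """One PE step: acc + truncate(f * u), with acc, f and u in [32,23]."""
    product = f * u                   # 46 fractional bits, i.e. [64,46]
    truncated = product >> FRAC_BITS  # T: drop the 23 LSBs (rounds toward -inf)
    return wrap32(acc + truncated)


acc = 0
for f, u in [(0.5, 1.25), (-2.0, 0.75), (3.0, 0.125)]:
    acc = mac(acc, to_fixed(f), to_fixed(u))
result = to_float(acc)
```

The operands here are exactly representable, so the accumulated value matches the real-valued sum; in general each truncation introduces an error of less than one least significant bit.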



Stage 2 (PA2): V_TEMP_2 = V_TEMP_1 u^+, with V_TEMP_1 of size (n × 1), u^+ of size (1 × n) and V_TEMP_2 of size (n × n). This yields F u u^+.

[Figure: Stage 2 processor array, fixed-size PA2 with m = 4. The elements of V_TEMP_1 and of u enter the grid of PEs through chains of one-step delays (D). Each [64,46] product of two [32,23] operands is truncated (T) to [32,23] and stored as an element of V_TEMP_2.]



Stage 3 (PA3): V = diag{V_TEMP_2 F^+}, that is V = diag{F u u^+ F^+}, with V_TEMP_2 of size (n × n) and V of size (n × 1).

[Figure: Stage 3 processor array, fixed-size PA3 with m = 32. Skewed, zero-padded streams of F and V_TEMP_2 enter the PEs under tiled control, and each PE updates V(m-1) to V(m). The elements v_{1,1}, v_{1,2}, ..., v_{n,n} leave through a 32-bit FIFO buffer (signal: full) under global control as the robust SS vector V.]
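Taken together, the three stages compute V = diag{F u u^+ F^+}. The sketch below follows the same stage order with complex Python numbers, reading ^+ as the conjugate transpose (an assumption based on common notation), and compares the result with the equivalent shortcut V_i = |(F u)_i|^2. The function names are illustrative only.

```python
def stage1(F, u):
    """TEMP_1 = F u  (n x n times n x 1)."""
    return [sum(f * x for f, x in zip(row, u)) for row in F]


def stage2(temp1, u):
    """TEMP_2 = TEMP_1 u^+  (outer product, n x n)."""
    return [[t * x.conjugate() for x in u] for t in temp1]


def stage3(temp2, F):
    """V = diag{TEMP_2 F^+}: only the diagonal entries are formed."""
    n = len(F)
    return [sum(temp2[i][k] * F[i][k].conjugate() for k in range(n))
            for i in range(n)]


def robust_ss_vector(F, u):
    """Chain the three stages in the order used by the co-processor."""
    return stage3(stage2(stage1(F, u), u), F)


F = [[1 + 1j, 2 + 0j], [0 - 1j, 1 + 0j]]
u = [1 + 0j, 0 + 1j]
V = robust_ss_vector(F, u)
# Shortcut for comparison: the diagonal equals the squared magnitude of F u
shortcut = [abs(t) ** 2 for t in stage1(F, u)]
```

The shortcut shows that the diagonal alone needs far fewer operations than the full n x n product; the three-stage form is kept here because it mirrors the processor-array pipeline described on the slides.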

Fixed-Sized NoC-PAs-based RBR estimator co-processor


RBR estimator: b̂_RBR = b_0 + Ω V

[Figure: RBR estimator co-processor. V, Ω and b_0 arrive from the embedded processor through 32-bit FIFO buffers (signals: full, en) under global, local and tiled control. A fixed-size PA (m = 64) multiplies the skewed, zero-padded elements of Ω by the elements of V; each [64,46] product of two [32,23] operands is truncated to [32,23] and updates b̂_TEMP_1(m-1) to b̂_TEMP_1(m). The output vectors b̂_RBR^(1), b̂_RBR^(2), b̂_RBR^(3), ..., b̂_RBR^(n) form the reconstructed RS image. D: one-step delay. T: truncate.]
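The estimator b̂_RBR = b_0 + Ω V is a matrix-vector product followed by an offset. A minimal sketch, assuming real-valued Ω, V and b_0 stored as plain Python lists and a running accumulation over the columns of Ω (the variable names and the column-wise order are assumptions, not details confirmed by the slides):

```python
def rbr_estimate(b0, omega, V):
    """Return b0 + omega V, accumulating one column of omega per step."""
    b = list(b0)                    # start the accumulator at b_0 (copy)
    for col, v in enumerate(V):     # one step per element of V
        for row in range(len(b)):
            b[row] += omega[row][col] * v
    return b


b0 = [1.0, 2.0, 3.0]
omega = [[0.5, 0.0, 0.0],
         [0.0, 2.0, 1.0],
         [1.0, 1.0, 1.0]]
V = [2.0, 1.0, 4.0]
b_rbr = rbr_estimate(b0, omega, V)
```

Starting the accumulator at b_0 folds the offset into the multiply-accumulate loop, so no separate addition pass is needed.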

New Perspective: VLSI-FPGA Platforms


A novel VLSI-FPGA platform represents a new perspective for the real-time processing of newer RS applications.

[Figure: VLSI-FPGA platform. The FPGA hosts the control system, the embedded processor, the spatial-temporal reorder unit, buffer memory and a FIFO. The VLSI co-processor implements a bit-level MPPA architecture. Image data (u, F) from the degraded large-scale RS image pass through the robustified reconstruction operator to produce the reconstructed RS image.]

Performance Analysis: FPGA


Synthesis metrics by HW co-processor:

Metric        Robust SS vector    RBR estimator
Slices        8158                3289
DSP48         144                 32
LUTs          7539                2278
Flip-Flops    6304                2788

Performance Analysis: FPGA


Implementation                           RBR processing time (seconds)
Evaluated PC-oriented implementation     19.7
Proposed efficient RBR architecture      1.26

Conclusions

The implementation results show that the proposed NoC-PA-oriented architecture drastically reduces the overall processing time of the RBR algorithm. The presented architecture is implemented efficiently in MPSoC mode, instead of relying on systems based on traditional DSPs or PC-cluster platforms.

The implementation of the RBR algorithm using the proposed architecture takes only 1.26 seconds for the large-scale RS image reconstruction, in contrast to the 19.7 seconds required by the conventional C++ PC-based implementation. The achieved processing time is therefore approximately 16 times shorter (19.7 / 1.26 ≈ 15.6).



Recent Selected Journal Papers


A. Castillo Atoche, D. Torres, Yuriy V. Shkvarko, “Towards Real Time Implementation of Reconstructive Signal Processing Algorithms Using Systolic Arrays Coprocessors”, JOURNAL OF SYSTEMS ARCHITECTURE (JSA), Edit. ELSEVIER, Volume 56, Issue 8, August 2010, Pages 327-339, ISSN: 1383-7621, doi:10.1016/j.sysarc.2010.05.004. JCR.

A. Castillo Atoche, D. Torres, Yuriy V. Shkvarko, “Descriptive Regularization-Based Hardware/Software Co-Design for Real-Time Enhanced Imaging in Uncertain Remote Sensing Environment”, EURASIP JOURNAL ON ADVANCES IN SIGNAL PROCESSING (JASP), Edit. HINDAWI, Volume 2010, 31 pages, 2010. ISSN: 1687-6172, e-ISSN: 1687-6180, doi:10.1155/ASP. JCR.

Yuriy V. Shkvarko, A. Castillo Atoche, D. Torres, “Near Real Time Enhancement of Geospatial Imagery via Systolic Implementation of Neural Network-Adapted Convex Regularization Techniques”, JOURNAL OF PATTERN RECOGNITION LETTERS, Edit. ELSEVIER, 2011. JCR. In Press


Thanks for your attention.


Dr. Alejandro Castillo Atoche
Email: [email protected]
