A Multi-Processor System on Chip Architecture for Real Time Remote Sensing Data Processing

School of Engineering, Autonomous University of Yucatan, Merida, Mexico.

2011/07/25

Presenter: Dr. Alejandro Castillo Atoche

IGARSS’11

Outline

Introduction
Previous Work
MPSoC via the HW/SW Co-design
Case Study: RBR Algorithms
Algorithm Analysis
Network on Chip (NoC)-based Accelerator Integration in a Co-design scheme
New Perspective: Network of FPGA-VLSI architectures
Hardware Implementation Results
Performance Analysis
Conclusions

Introduction: Radar Imagery, Facts

The initial problem addressed in this proposal for geospatial RS imagery is to solve the ill-conditioned inverse spatial spectrum pattern (SSP) estimation problem with model uncertainties via the Bayesian minimum risk (BMR) estimation strategy.

In previous work, alternative MPSoC proposals have been developed, but without systolic array techniques or Network on a Chip structures.

Introduction: HW implementation, Facts

Why Multiprocessor System on a Chip? Because MPSoCs are single-chip multiprocessors designed for real-time signal processing applications.

Why Network on a Chip Accelerators? Networks-on-chip (NoCs) are multiprocessor interconnection networks designed to achieve real-time SP. They avoid bottlenecks in HW/SW co-designs.

MOTIVATION

To efficiently conceptualize and implement an architecture that aggregates parallel computing and systolic array mapping techniques in a novel network on a chip (NoC) accelerator scheme via the HW/SW co-design paradigm.

CONTRIBUTIONS:

First, a high-speed robust Bayesian regularization hardware accelerator is designed for the real-time enhancement of large-scale geospatial imagery.

Second, High Performance Computing techniques are applied in an efficient architecture based on a Network on a Chip (NoC).

Algorithmic ref. Implementation

Method        RSF      RSF      RSF      RBR      RBR      RBR
SNR [dB]      15       20       25       15       20       25
Metrics
[dB]          10.15    15.32    20.25    6.15     10.62    13.04
PIOSNR (%)    81.37    86.62    85.24    95.18    90.29    98.24
MSE           0.16     0.46     0.57     0.03     0.29     0.34

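Partitioning Stage

[Figure: HW/SW partitioning between the Embedded Processor and two coprocessors.]

Embedded Processor: data acquisition, parameters, and the pre-computed SW stage.

Coprocessor 1 (Robust SS vector): $V = \mathrm{diag}\{F u u^{+} F^{+}\}$

Coprocessor 2 (RBR estimator): $\hat{b}_{RBR} = b_0 + \Omega V$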

NoC oriented structure of the proposed coprocessors

(a) Robust SS vector

[Figure: data input (u, F) from the Embedded Processor; Input FIFO; tiled control; Memory Buffers; fixed-size PA1 (m_1 = 32) and fixed-size PA2 (m_2 = 4); Output FIFO; data output V to the Embedded Processor.]

(b) RBR estimator

[Figure: data input (Ω, V, b_0) from the Embedded Processor; Input FIFO; tiled control; Memory Buffer; Output FIFO; data output $\hat{b}_{RBR}$.]

Aggregation of parallel computing techniques

Application

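for (tile = 0; tile < L; tile++) {
  for (i = 0; i < m; i++) {
    for (j = 0; j < n; j++) {
      for (k = 0; k < r; k++) {
        a(i,j,k) = a(i,j-1,k);                      /* a propagates along j */
        b(i,j,k) = b(i-1,j,k);                      /* b propagates along i */
        c(i,j,k) = c(i,j,k-1) + a(i,j,k)*b(i,j,k);  /* c accumulates along k */
      }
    }
  }
}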
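The recurrences in the loop nest can be checked with a short sketch. It is written in Python for brevity and handles a single tile. The function name `matmul_recurrence` and the boundary conditions (operands read at j = 0 and i = 0, accumulator starting at zero) are assumptions, since the slide does not state them.

```python
def matmul_recurrence(A, B):
    """Multiply A (m x r) by B (r x n) using the single-assignment recurrences."""
    m, r, n = len(A), len(B), len(B[0])
    # a[i][j][k], b[i][j][k], c[i][j][k] mirror the variables of the loop nest.
    a = [[[0] * r for _ in range(n)] for _ in range(m)]
    b = [[[0] * r for _ in range(n)] for _ in range(m)]
    c = [[[0] * r for _ in range(n)] for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for k in range(r):
                # Assumed boundaries: edge cells read the operands directly,
                # interior cells receive them from a neighbouring cell.
                a[i][j][k] = A[i][k] if j == 0 else a[i][j - 1][k]
                b[i][j][k] = B[k][j] if i == 0 else b[i - 1][j][k]
                prev = 0 if k == 0 else c[i][j][k - 1]
                c[i][j][k] = prev + a[i][j][k] * b[i][j][k]
    # The last k-slice of c holds the finished product.
    return [[c[i][j][r - 1] for j in range(n)] for i in range(m)]
```

Under these assumptions, `matmul_recurrence([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns the ordinary matrix product, which suggests the three recurrences describe a matrix multiplication in which every value is passed only between neighbouring index points.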

[Figure: 3 × 3 systolic array fed with skewed operands. Elements A[0,0] to A[2,2] and B[0,0] to B[2,2] enter along anti-diagonals, one wavefront per time step.]

Linear Schedule: set of parallel and uniformly spaced hyperplanes.

SFG Projection

3-D Dependence Graph (DG)

[Figure: processing element with inputs a[i,j-1] and b[i-1,j], outputs a[i,j] and b[i,j], and accumulator c[i,j].]
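A linear schedule and a projection can be illustrated with a small Python sketch. The schedule vector s = (1, 1, 1) and the projection direction p = (0, 0, 1) are assumptions chosen for illustration, not values given on the slide. The dependence vectors are read off the recurrences a(i,j-1,k), b(i-1,j,k) and c(i,j,k-1).

```python
from itertools import product

# Dependence vectors of the 3-D dependence graph (from the recurrences).
DEPENDENCIES = {"a": (0, 1, 0), "b": (1, 0, 0), "c": (0, 0, 1)}


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def is_valid_schedule(s, p):
    """A schedule s is causal if every dependence advances time (s.d > 0);
    it is compatible with projection p if s.p != 0."""
    causal = all(dot(s, d) > 0 for d in DEPENDENCIES.values())
    return causal and dot(s, p) != 0


def map_index_space(m, n, r, s=(1, 1, 1)):
    """Group index points by time step t = s.(i, j, k).
    Projecting along k assigns point (i, j, k) to processing element (i, j)."""
    steps = {}
    for i, j, k in product(range(m), range(n), range(r)):
        steps.setdefault(dot(s, (i, j, k)), []).append(((i, j), k))
    return steps
```

With these assumed vectors, each hyperplane i + j + k = t is one time step, a 3 × 3 × 3 index space needs 7 steps, and no processing element is used twice within a step.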

Tiling technique

[Figure: a large-scale real-world problem is partitioned into tiles, each mapped onto a fixed-size systolic array of processing elements (PEs) with I/O ports and FIFO buffers between tiles.]
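The idea of running a large problem on a fixed-size array can be sketched in Python for a matrix-vector product. The function name `matvec_on_fixed_array` and the pass counter are hypothetical. The sketch only models the order of the work, not the hardware timing or the FIFO behaviour.

```python
def matvec_on_fixed_array(F, u, m):
    """Compute F @ u when the array handles at most an m x m tile per pass."""
    n_rows, n_cols = len(F), len(u)
    y = [0] * n_rows
    passes = 0
    for row0 in range(0, n_rows, m):          # tile row
        for col0 in range(0, n_cols, m):      # tile column
            passes += 1
            for i in range(row0, min(row0 + m, n_rows)):
                # The partial sum y[i] re-enters the array for each tile
                # column, which is the role a FIFO buffer could play.
                for j in range(col0, min(col0 + m, n_cols)):
                    y[i] += F[i][j] * u[j]
    return y, passes
```

For a 5 × 5 matrix and m = 2 the sketch needs 9 passes (3 tile rows by 3 tile columns) and returns the same vector as a direct product.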

Fixed-Sized NoC-PAs-based Robust SS vector co-processor

Stage 1 (PA1): $V_{TEMP\_1} = F u$, with dimensions $(n \times 1) = (n \times n)(n \times 1)$.

[Figure: Stage 1 data path. Columns u(1), u(2), u(3), ... of the degraded large-scale RS image (elements u_{1,1}, u_{1,2}, ..., u_{n,n}) arrive from the Embedded Processor through a FIFO Buffer, and the elements of F enter the array as skewed data. Signals carry the labels [32,23] and [64,46]. D: one step delay. T: truncate. Control blocks: globalControl (full, en), localControl, tiledControl.]
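The labels [32,23] and [64,46] next to the truncate block (T) are consistent with a fixed-point multiply followed by truncation. The Python sketch below assumes that [W, F] means a W-bit word with F fractional bits. That reading is an assumption, as are the helper names, and the 32-bit word length is not enforced.

```python
FRAC_BITS = 23  # assumed fractional bits of the [32,23] format


def to_fixed(x):
    """Quantize a real number to an integer with 23 fractional bits."""
    return int(round(x * (1 << FRAC_BITS)))


def to_float(q):
    """Convert a 23-fractional-bit integer back to a real number."""
    return q / (1 << FRAC_BITS)


def multiply_and_truncate(qa, qb):
    """Multiply two [32,23] values and truncate the [64,46] product."""
    wide = qa * qb              # the product carries 46 fractional bits
    return wide >> FRAC_BITS    # drop the 23 low-order bits (rounds toward -inf)
```

Values that are exactly representable, such as 1.5 and 2.25, multiply without error. Other values pick up an error on the order of 2^-23 per multiplication.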

Stage 2 (PA2, fixed size $m_2 = 4$): $V_{TEMP\_2} = V_{TEMP\_1} u^{+}$, with dimensions $(n \times n) = (n \times 1)(1 \times n)$.

[Figure: Stage 2 data path. The elements of V_TEMP_1 and the elements u_{1,1}, u_{1,2}, u_{1,3}, ... are fed to a row of PEs, which output the elements of V_TEMP_2 (n × n). Signal labels: [32,23] and [64,46].]

Stage 3 (PA3, fixed size $m_3 = 32$): $V = \mathrm{diag}\{V_{TEMP\_2} F^{+}\}$, so that overall $V = \mathrm{diag}\{F u u^{+} F^{+}\}$.

[Figure: Stage 3 data path. The elements of V_TEMP_2 and of F enter the array under tiledControl. The outputs v_{1,1}, v_{1,2}, ..., v_{n,n} leave through a FIFO Buffer under globalControl and form the Robust SS Vector.]
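The three stages can be followed end to end in a plain Python sketch. It assumes that the superscript + denotes the conjugate transpose and that F is square. The function names are hypothetical, and nothing here models the processor arrays themselves.

```python
def conj_transpose(M):
    """Return the conjugate transpose of a matrix given as a list of rows."""
    return [[M[j][i].conjugate() for j in range(len(M))] for i in range(len(M[0]))]


def robust_ss_vector(F, u):
    """Three-stage evaluation of V = diag{F u u^+ F^+} for square F."""
    n = len(u)
    # Stage 1: TEMP_1 = F u, an (n x 1) vector.
    temp1 = [sum(F[i][k] * u[k] for k in range(n)) for i in range(n)]
    # Stage 2: TEMP_2 = TEMP_1 u^+, an (n x n) outer product.
    temp2 = [[temp1[i] * u[j].conjugate() for j in range(n)] for i in range(n)]
    # Stage 3: V = diag{TEMP_2 F^+}; only the diagonal entries are computed.
    Fh = conj_transpose(F)
    return [sum(temp2[i][k] * Fh[k][i] for k in range(n)) for i in range(n)]
```

A useful check on the sketch is that each entry of V equals the squared magnitude of the matching entry of F u. For F = [[1, 1j], [2, 0]] and u = [1+1j, 2], F u = [1+3j, 2+2j], so V = [10, 8].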

Fixed-Sized NoC-PAs-based RBR estimator co-processor

$\hat{b}_{RBR} = b_0 + \Omega V$

[Figure: RBR estimator data path. The elements of V, Ω and b_0 arrive from the Embedded Processor through FIFO Buffers and enter a fixed-size PA (m = 64). Signals carry the labels [32,23] and [64,46]. D: one step delay. T: truncate. The output elements of $\hat{b}_{RBR}$ leave through a FIFO Buffer and are assembled into the columns $\hat{b}_{RBR}(1)$, $\hat{b}_{RBR}(2)$, $\hat{b}_{RBR}(3)$, ... of the reconstructed RS image. Control blocks: globalControl (full, en), localControl, tiledControl.]

New Perspective: VLSI-FPGA Platforms

A novel VLSI-FPGA platform represents a new perspective for real-time processing of newer RS applications.

VLSI-FPGA Platform

[Figure: block diagram of the VLSI-FPGA platform. Blocks: Control System, Embedded Processor, Spatial-temporal reorder, Buffer Memory, and a VLSI Co-processor containing the Bit-Level MPPA Architecture that applies the Robustified Reconstruction operator to the image data u. Input: degraded large-scale RS image. Output: reconstructed RS image.]

Performance Analysis: FPGA

Synthesis metrics     Robust SS vector    RBR estimator
Slices                8158                3289
DSP48                 144                 32
LUTs                  7539                2278
Flip-Flops            6304                2788

Performance Analysis: FPGA

Implementation                          Processing time (seconds)
Evaluated PC-Oriented Implementation    19.7
Proposed Efficient RBR architecture     1.26

Conclusions

The implementation results show that the proposed NoC-PA-oriented architecture drastically reduces the overall processing time of the RBR algorithm. The presented architecture is efficiently implemented in MPSoC mode, instead of employing systems based on traditional DSPs or PC-cluster platforms.

The implementation of the RBR algorithm using the proposed architecture takes only 1.26 seconds for the large-scale RS image reconstruction, in contrast to the 19.7 seconds required by the C++ implementation. The achieved processing time is thus approximately 16 times shorter than that of the conventional C++ PC-based implementation.
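The quoted factor follows directly from the two reported times. A minimal Python check, with a hypothetical helper name, is:

```python
def speedup(reference_s, accelerated_s):
    """Ratio of the reference time to the accelerated time."""
    return reference_s / accelerated_s


pc_time_s = 19.7    # C++ PC-based implementation (reported)
noc_time_s = 1.26   # proposed NoC-PA architecture (reported)

print(f"speedup: {speedup(pc_time_s, noc_time_s):.1f}x")
print(f"time reduction: {100 * (1 - noc_time_s / pc_time_s):.1f}%")
```

This prints a speedup of 15.6x, which rounds to the factor of about 16 stated above, and a time reduction of 93.6%.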

Recent Selected Journal Papers

A. Castillo Atoche, D. Torres, Yuriy V. Shkvarko, “Towards Real Time Implementation of Reconstructive Signal Processing Algorithms Using Systolic Arrays Coprocessors”, Journal of Systems Architecture (JSA), Elsevier, Volume 56, Issue 8, August 2010, Pages 327-339, ISSN: 1383-7621, doi:10.1016/j.sysarc.2010.05.004. JCR.

A. Castillo Atoche, D. Torres, Yuriy V. Shkvarko, “Descriptive Regularization-Based Hardware/Software Co-Design for Real-Time Enhanced Imaging in Uncertain Remote Sensing Environment”, EURASIP Journal on Advances in Signal Processing (JASP), Hindawi, Volume 2010, 31 pages, 2010. ISSN: 1687-6172, e-ISSN: 1687-6180, doi:10.1155/ASP. JCR.

Yuriy V. Shkvarko, A. Castillo Atoche, D. Torres, “Near Real Time Enhancement of Geospatial Imagery via Systolic Implementation of Neural Network-Adapted Convex Regularization Techniques”, Journal of Pattern Recognition Letters, Elsevier, 2011. JCR. In press.

Thanks for your attention.

Dr. Alejandro Castillo Atoche
Email: acastill@uady.mx
