synthetic aperture radar on low power multi-core · pdf fileoutlines •basics of synthetic...

Synthetic Aperture Radar on Low Power Multi-Core DSP

Dan Wang, Murtaza Ali

Texas Instruments

Outlines

• Basics of Synthetic Aperture Radar (SAR)

• TMS320C6678 Architecture

• SAR System Implementation on DSP

−Modulization

−Data flow

• Implementation Profiling

−Module profiling

−Single core vs. multi-core

−Comparison with alternative platforms in literature: GPU, FPGA

• Conclusion

2

SAR Geometry

3

• Use one antenna in time-

multiplex

• Use Doppler shift to obtain

fine azimuth resolution

• Two dimensions

– Range (cross-track, fast time)

Line-of-sight distance from

radar to target

– Azimuth (along-track, slow time)

Parallel to radar motion track

http://www.radart

utorial.eu/20.airb

orne/ab07.en.htm

l

SAR Algorithm

• SAR types

– Airborne or spaceborne

– Strip SAR, spot SAR, etc.

– Platforms: CPU, GPU, FPGA, etc

• Diverse algorithms

– Range-azimuth algorithm

– Chirp-scaling algorithm

– Ω-k algorithm

• Range-azimuth algorithm

– Achieve block processing efficiency

– Separability of processing in two directions

– Limited for low squint case

4

Strip mapping Spot mapping

Multicore DSP (TMS320C6678): Functional Diagram

Multicore Navigator

Tera

Net

C66x

DSP

C66x

DSP

L1L1 L2L2

C66x

DSP

C66x

DSP

L1L1 L2L2

C66x

DSP

C66x

DSP

L1L1 L2L2

C66x

DSP

C66x

DSP

L1L1 L2L2

C66x

DSP

C66x

DSP

L1L1 L2L2

C66x

DSP

C66x

DSP

L1L1 L2L2

C66x

DSP

C66x

DSP

L1L1 L2L2

C66x

DSP

C66x

DSP

L1L1 L2L2

8 x CorePac

SRIO

x4

SRIO

x4PCIe

x2

PCIe

x2EMIF

16

EMIF

16

TSIP

2x

TSIP

2xI2C

SPI

I2C

SPI UARTUART

Peripherals & IO

GbE

Switch

GbE

Switch

SGMIISGMIISGMIISGMII

IP Interfaces

CryptoCrypto

Packet

Accelerator

Packet

Accelerator

Network

CoProcessors

Power ManagementPower Management

DebugDebug

Multicore Shared Memory Controller

(MSMC)


(MSMC)

Shared Memory 4MBShared Memory 4MB

DDR3-

64b

DDR3-

64b

EDMAEDMA

SysMonSysMon

System Elements

Memory Subsystem

• Multicore KeyStone architecture SoC

• Fixed/Floating corePac

– 8 CorePac @ 1.25 GHz

– 4.0 MB Shared L2

– Performance: 320GMAC, 160GFLOPs,

60GDFLOPs

– Power ~10W@1GHz

• Navigator

– Queue Manager, Packet DMA

• Multicore shared memory controller

– Low latency, high bandwidth memory

access

• 3-port GigE switch (Layer 2)

• PCIe gen-2, 2-lanes

• SRIO gen-2, 4-lanes

• HyperLink

– Support connection to other keystone

devices providing resource scalability

– Provide a 50Gbps chip-level

interconnect

Hyper

Link

50

C66x – Core Architecture

• VLIW architecture

– Can issue 8 instructions per

cycle

• 2 data paths

– 4 units per data path

– L, S, D, M

– Access cross data path

• 64 registers (32 bit)

– 32 per data path

– Can be arranged in dual (64 bit)

or quad (128 bit) registers

• Single Instruction Multiple

Data (SIMD)

– Dual or quad multiplies (64 or

128 bits)

Data Path A Data Path B

Code

Composer

Studio

Multicore Navigator

Tera

Net

C66X

DSP

C66X

DSP

L1L1 L2L2

C66X

DSP

C66X

DSP

L1L1 L2L2

C66X

DSP

C66X

DSP

L1L1 L2L2

C66X

DSP

C66X

DSP

L1L1 L2L2

C66X

DSP

C66X

DSP

L1L1 L2L2

C66X

DSP

C66X

DSP

L1L1 L2L2

C66X

DSP

C66X

DSP

L1L1 L2L2

C66X

DSP

C66X

DSP

L1L1 L2L2

8 x CorePac

SRIO

x4

SRIO

x4PCIe

x2

PCIe

x2EMIF

16

EMIF

16

TSIP

x2

TSIP

x2I2C

SPI

I2C

SPI UARTUART

Peripherals & IO

GbE

Switch

GbE

Switch

SGMIISGMIISGMIISGMII

IP Interfaces

CryptoCrypto

Packet

Accelerator

Packet

Accelerator

Network

CoProcessors

Power ManagementPower Management

DebugDebug


(MSMC)


(MSMC)

Shared Memory 4MBShared Memory 4MB

DDR3-

64b

DDR3-

64b

EDMAEDMA

SysMonSysMon

System Elements

Memory Subsystem

HyperL

ink

Multicore Software Development KitMulticore Software Development Kit

Operating System

SYS/BIOS

Application Binary InterfaceOpenMP ABI

OpenMP Programming Layer

Customer Applications

Compiler

Directives

OpenMP

Library

Environment

Variables

Parallel Thread Interface

Master Thread Master Thread

Slave Thread

Slave Thread

Slave Thread

Slave Thread

Slave Thread

Slave Thread

Multicore Performance Single Core Simplicity

8

•4 c6678

•512 Gflops

•4GByte DDR3

•54W

•8 c6678

•>1 Teraflop

•16 GByte DDR3

•110 W

•1 c6678

•160 Gflops

•1GByte DDR3

•10W

Identified targets

range

azimuth

range(fast time)

azimuth(slow time)

Range-azimuth Algorithm

• Main steps

9

Range

Compression

RCM

Correction

Azimuth

CompressionTranspose

azimuth range

RCM Correction

• Cause

− Instantaneous range change

leads to variation in the range

delay that could be larger

than a range sample space

• Range-Doppler correction

− Interpolation in range-Doppler

domain

10

Range time

Azim

uth

tim

e

Point target signal in memory

Hyperbolic

curve

Range time

Azim

uth

tim

e

Range time

Azim

uth

fre

quen

cy

Range/azimuth

domainRange/Doppler

domain

Range time

Azim

uth

fre

quen

cy

Range/Doppler domain after

correction

Range-azimuth Algorithm Implementation

11

DAR ……

…

Geometry

model……

…

)(RKa

)(Rfcη

autofocus

Data Flow

12

… …4k

in_B

out_B

L2

4kreftwiddle

MSMC

FFT4k

in_BL2

MUL

IFFT

DDR3

Range

Compression

in_B

L2

out_B

DDR3

DDR3

Range

Azim

uth

Ra

ng

e

AzimuthTranspose

Data Flow

13

RCMC and Azimuth

Compression

… …

in_B

L2

reftwiddle

MSMC

FFT

MULIFFT

DDR3 … …

out_B

L2

Sinc coeff.

table

in_B

L2

out_B

RCMC

… …

DDR3

DDR3

in_B

L2

out_B

L2

… …DDR3

Muiticore Mapping

• Allocate memory for each

core

• OpenMP for multiple

threads running

simultaneously

• DMA read from DDR3

• Local processing

• DMA write to DDR3

14

Implementation Profiling

• Setups

– TMDXEVM6678L

– Compiler: v7.4.0.A12012

– OpenMP: v1.0.0.34_eng

– DSPLIB:v 3.1.0.0, (FFT, complex transpose)

– L1D cache 32K, L1P cache 32K

– L2 cache: 128K

– Data size: 4096*4096, complex single precision

15

Module Profiling

• Five modules

16

Range

Compression

RCM

Correction

Azimuth

CompressionTranspose Azimuth FFT

Module Profiling

17

Range comp.

Azimuth FFT

Azimuth comp.RCMC

Transpose

Tim

e:

ms

DDR3 time

• DDR3 transfer time:19 ms

• Transpose saturate around : 34 ms

56 %

7.9×

7.9×4.0×

4.5×

2.9×

Total Time

18

1 core

8 cores= 5.6

0.25 s

1.4 s

Tim

e:

ms

DDR3 time =94 ms

94 ms

250 ms

=37.6%

Time Breakdown

19

29%

14%

18%

27%

13%

Tim

e:

ms

Comparison

• GPU benchmark: Nvidia Tesla C1060 (Bisceglie’10)

– Core clock= 1.296GHz; Processor core #=240; memory= 4GB @

800MHz

– Testing algorithm: Range-azimuth algorithm, FFT size 4096

• FPGA: Xilinx VIRTEX-5 (Pfitzner’11)

• Comparison

20

ns/pixel

DSP GPU FPGA

23.0

14.9

53.3

Range-azimuth

only

Range, samples

Azim

uth

, sam

ple

s

Azimuth compression w ith RCMC

900 1000 1100 1200 1300 1400 1500 1600 1700 1800

7100

7200

7300

7400

7500

7600

7700

7800

7900

8000

8100

Range, samples

Azim

uth

, sam

ple

s

Raw SAR image

500 1000 1500 2000 2500 3000 3500 4000

500

1000

1500

2000

2500

3000

3500

4000

Image Example

21

Range, samples

Azim

uth

, sam

ple

s

Azimuth compression w ith RCMC

500 1000 1500 2000 2500 3000 3500 4000

3500

4000

4500

5000

5500

6000

6500

7000

7500

8000

RADARSAT: 38km*23km

Conclusions

• Contributions

– Implementation of range-azimuth algorithm for SAR on

TMS320C6678

– Parallel processing on multiple DSP cores

– Provide benchmark results for a typical 4k by 4k SAR image

– Achieve real-time performance with 4 frames/sec

• Next steps

– Extra modules

• Doppler parameter estimation, geometry information

• Auto focusing, computational intensive

– Larger size FFT (>4K)

– Evaluation with 4 DSPs connected to server with PCIe

22

23

Thanks &

Questions

synthetic aperture radar on low power multi-core · pdf fileoutlines •basics of synthetic...

Documents