synthetic aperture radar on low power multi-core · pdf fileoutlines •basics of synthetic...
TRANSCRIPT
Synthetic Aperture Radar on Low Power Multi-Core DSP
Dan Wang, Murtaza Ali
Texas Instruments
Outlines
• Basics of Synthetic Aperture Radar (SAR)
• TMS320C6678 Architecture
• SAR System Implementation on DSP
−Modulization
−Data flow
• Implementation Profiling
−Module profiling
−Single core vs. multi-core
−Comparison with alternative platforms in literature: GPU, FPGA
• Conclusion
2
SAR Geometry
3
• Use one antenna in time-
multiplex
• Use Doppler shift to obtain
fine azimuth resolution
• Two dimensions
– Range (cross-track, fast time)
Line-of-sight distance from
radar to target
– Azimuth (along-track, slow time)
Parallel to radar motion track
http://www.radart
utorial.eu/20.airb
orne/ab07.en.htm
l
SAR Algorithm
• SAR types
– Airborne or spaceborne
– Strip SAR, spot SAR, etc.
– Platforms: CPU, GPU, FPGA, etc
• Diverse algorithms
– Range-azimuth algorithm
– Chirp-scaling algorithm
– Ω-k algorithm
• Range-azimuth algorithm
– Achieve block processing efficiency
– Separability of processing in two directions
– Limited for low squint case
4
Strip mapping Spot mapping
Multicore DSP (TMS320C6678): Functional Diagram
Multicore Navigator
Tera
Net
C66x
DSP
C66x
DSP
L1L1 L2L2
C66x
DSP
C66x
DSP
L1L1 L2L2
C66x
DSP
C66x
DSP
L1L1 L2L2
C66x
DSP
C66x
DSP
L1L1 L2L2
C66x
DSP
C66x
DSP
L1L1 L2L2
C66x
DSP
C66x
DSP
L1L1 L2L2
C66x
DSP
C66x
DSP
L1L1 L2L2
C66x
DSP
C66x
DSP
L1L1 L2L2
8 x CorePac
SRIO
x4
SRIO
x4PCIe
x2
PCIe
x2EMIF
16
EMIF
16
TSIP
2x
TSIP
2xI2C
SPI
I2C
SPI UARTUART
Peripherals & IO
GbE
Switch
GbE
Switch
SGMIISGMIISGMIISGMII
IP Interfaces
CryptoCrypto
Packet
Accelerator
Packet
Accelerator
Network
CoProcessors
Power ManagementPower Management
DebugDebug
Multicore Shared Memory Controller
(MSMC)
Multicore Shared Memory Controller
(MSMC)
Shared Memory 4MBShared Memory 4MB
DDR3-
64b
DDR3-
64b
EDMAEDMA
SysMonSysMon
System Elements
Memory Subsystem
• Multicore KeyStone architecture SoC
• Fixed/Floating corePac
– 8 CorePac @ 1.25 GHz
– 4.0 MB Shared L2
– Performance: 320GMAC, 160GFLOPs,
60GDFLOPs
– Power ~10W@1GHz
• Navigator
– Queue Manager, Packet DMA
• Multicore shared memory controller
– Low latency, high bandwidth memory
access
• 3-port GigE switch (Layer 2)
• PCIe gen-2, 2-lanes
• SRIO gen-2, 4-lanes
• HyperLink
– Support connection to other keystone
devices providing resource scalability
– Provide a 50Gbps chip-level
interconnect
Hyper
Link
50
C66x – Core Architecture
• VLIW architecture
– Can issue 8 instructions per
cycle
• 2 data paths
– 4 units per data path
– L, S, D, M
– Access cross data path
• 64 registers (32 bit)
– 32 per data path
– Can be arranged in dual (64 bit)
or quad (128 bit) registers
• Single Instruction Multiple
Data (SIMD)
– Dual or quad multiplies (64 or
128 bits)
Data Path A Data Path B
Code
Composer
Studio
Multicore Navigator
Tera
Net
C66X
DSP
C66X
DSP
L1L1 L2L2
C66X
DSP
C66X
DSP
L1L1 L2L2
C66X
DSP
C66X
DSP
L1L1 L2L2
C66X
DSP
C66X
DSP
L1L1 L2L2
C66X
DSP
C66X
DSP
L1L1 L2L2
C66X
DSP
C66X
DSP
L1L1 L2L2
C66X
DSP
C66X
DSP
L1L1 L2L2
C66X
DSP
C66X
DSP
L1L1 L2L2
8 x CorePac
SRIO
x4
SRIO
x4PCIe
x2
PCIe
x2EMIF
16
EMIF
16
TSIP
x2
TSIP
x2I2C
SPI
I2C
SPI UARTUART
Peripherals & IO
GbE
Switch
GbE
Switch
SGMIISGMIISGMIISGMII
IP Interfaces
CryptoCrypto
Packet
Accelerator
Packet
Accelerator
Network
CoProcessors
Power ManagementPower Management
DebugDebug
Multicore Shared Memory Controller
(MSMC)
Multicore Shared Memory Controller
(MSMC)
Shared Memory 4MBShared Memory 4MB
DDR3-
64b
DDR3-
64b
EDMAEDMA
SysMonSysMon
System Elements
Memory Subsystem
HyperL
ink
Multicore Software Development KitMulticore Software Development Kit
Operating System
SYS/BIOS
Application Binary InterfaceOpenMP ABI
OpenMP Programming Layer
Customer Applications
Compiler
Directives
OpenMP
Library
Environment
Variables
Parallel Thread Interface
Master Thread Master Thread
Slave Thread
Slave Thread
Slave Thread
Slave Thread
Slave Thread
Slave Thread
Multicore Performance Single Core Simplicity
8
•4 c6678
•512 Gflops
•4GByte DDR3
•54W
•8 c6678
•>1 Teraflop
•16 GByte DDR3
•110 W
•1 c6678
•160 Gflops
•1GByte DDR3
•10W
Identified targets
range
azimuth
range(fast time)
azimuth(slow time)
Range-azimuth Algorithm
• Main steps
9
Range
Compression
RCM
Correction
Azimuth
CompressionTranspose
azimuth range
RCM Correction
• Cause
− Instantaneous range change
leads to variation in the range
delay that could be larger
than a range sample space
• Range-Doppler correction
− Interpolation in range-Doppler
domain
10
Range time
Azim
uth
tim
e
Point target signal in memory
Hyperbolic
curve
Range time
Azim
uth
tim
e
Range time
Azim
uth
fre
quen
cy
Range/azimuth
domainRange/Doppler
domain
Range time
Azim
uth
fre
quen
cy
Range/Doppler domain after
correction
Range-azimuth Algorithm Implementation
11
DAR ……
…
Geometry
model……
…
)(RKa
)(Rfcη
autofocus
Data Flow
12
… …4k
in_B
out_B
L2
4kreftwiddle
MSMC
FFT4k
in_BL2
MUL
IFFT
DDR3
Range
Compression
in_B
L2
out_B
DDR3
DDR3
Range
Azim
uth
Ra
ng
e
AzimuthTranspose
Data Flow
13
RCMC and Azimuth
Compression
… …
in_B
L2
reftwiddle
MSMC
FFT
MULIFFT
DDR3 … …
out_B
L2
Sinc coeff.
table
in_B
L2
out_B
RCMC
… …
DDR3
DDR3
in_B
L2
out_B
L2
… …DDR3
Muiticore Mapping
• Allocate memory for each
core
• OpenMP for multiple
threads running
simultaneously
• DMA read from DDR3
• Local processing
• DMA write to DDR3
14
Implementation Profiling
• Setups
– TMDXEVM6678L
– Compiler: v7.4.0.A12012
– OpenMP: v1.0.0.34_eng
– DSPLIB:v 3.1.0.0, (FFT, complex transpose)
– L1D cache 32K, L1P cache 32K
– L2 cache: 128K
– Data size: 4096*4096, complex single precision
15
Module Profiling
• Five modules
16
Range
Compression
RCM
Correction
Azimuth
CompressionTranspose Azimuth FFT
Module Profiling
17
Range comp.
Azimuth FFT
Azimuth comp.RCMC
Transpose
Tim
e:
ms
DDR3 time
• DDR3 transfer time:19 ms
• Transpose saturate around : 34 ms
56 %
7.9×
7.9×4.0×
4.5×
2.9×
Total Time
18
1 core
8 cores= 5.6
0.25 s
1.4 s
Tim
e:
ms
DDR3 time =94 ms
94 ms
250 ms
=37.6%
Time Breakdown
19
29%
14%
18%
27%
13%
Tim
e:
ms
Comparison
• GPU benchmark: Nvidia Tesla C1060 (Bisceglie’10)
– Core clock= 1.296GHz; Processor core #=240; memory= 4GB @
800MHz
– Testing algorithm: Range-azimuth algorithm, FFT size 4096
• FPGA: Xilinx VIRTEX-5 (Pfitzner’11)
• Comparison
20
ns/pixel
DSP GPU FPGA
23.0
14.9
53.3
Range-azimuth
only
Range, samples
Azim
uth
, sam
ple
s
Azimuth compression w ith RCMC
900 1000 1100 1200 1300 1400 1500 1600 1700 1800
7100
7200
7300
7400
7500
7600
7700
7800
7900
8000
8100
Range, samples
Azim
uth
, sam
ple
s
Raw SAR image
500 1000 1500 2000 2500 3000 3500 4000
500
1000
1500
2000
2500
3000
3500
4000
Image Example
21
Range, samples
Azim
uth
, sam
ple
s
Azimuth compression w ith RCMC
500 1000 1500 2000 2500 3000 3500 4000
3500
4000
4500
5000
5500
6000
6500
7000
7500
8000
RADARSAT: 38km*23km
Conclusions
• Contributions
– Implementation of range-azimuth algorithm for SAR on
TMS320C6678
– Parallel processing on multiple DSP cores
– Provide benchmark results for a typical 4k by 4k SAR image
– Achieve real-time performance with 4 frames/sec
• Next steps
– Extra modules
• Doppler parameter estimation, geometry information
• Auto focusing, computational intensive
– Larger size FFT (>4K)
– Evaluation with 4 DSPs connected to server with PCIe
22
23
Thanks &
Questions