
Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP

by

John K. Antonio
Department of Computer Science
College of Engineering
Texas Tech University

antonio@ttu.edu

First Annual Review
June 23, 1998

Outline

• Program Overview and Introduction (Quad Chart)

• Program Management Status

• Recent Accomplishments

• Status of Deliverable Checklist

Configuring Combined GPP/DSP/FPGA Systems for Minimal SWAP

Texas Tech University: John K. Antonio

Applications
• SAR
• STAP

Requirements
• Throughput
• SWAP

• Combined Technology
• Minimal SWAP Configuration
• Mixed-Mode Operation
• Demonstration

New Ideas
• Systematic determination of minimal SWAP configuration based on proven mathematical programming techniques
• Optimal configuration based on automatic "tuning" of system design parameters
  - number and types of cards used
  - data mapping and communication schemes
  - place and route schemes
• Novel computing techniques based on characteristics of GPP/DSP/FPGA system

Schedule (Jun 97 Start, Jun 98, Jun 99, Dec 99 End)
• Develop optimal configuration techniques
• Construction and integration of GPP/DSP/FPGA system
• Implement and test optimal configurations on GPP/DSP/FPGA system
• Develop practical design methods based on SAR and STAP applications
• Demonstrate advantages of combining technologies

Impact
• Embedded systems requirements for the 21st century can be satisfied with the combined use of GPP, DSP, and FPGA technologies
• Demonstrate use of FPGA boards as co-processors for embedded multiprocessor GPP and DSP systems
• Demonstrate systematic approaches to optimally configure GPP/DSP/FPGA systems for minimal SWAP for embedded applications


Personnel (Program Management Status)

• John K. Antonio, Principal Investigator

• Ph.D., EE, Texas A&M Univ. (1989)

• Currently Assoc. Prof. of CS, Texas Tech Univ.

• Over 65 publications in HPC and related areas

• PI or co-PI of 17 contracts/grants totaling over $2.1M

• Jeff Muehring, Research Assistant, Ph.D. student

Optimal GPP/DSP/FPGA Configuration Techniques for SAR

Intern at IBM/Houston, 1/98 to 6/98

• Jack West, Research Assistant, Ph.D. student

Optimal Mapping, Scheduling, and Configuration Techniques for STAP; Network Simulator

• Nikhil Gupta, Research Assistant, M.S. student

Algorithms for STAP Weight Calculation Mapping Inner Product Computations onto FPGAs

Graduating July 1998

• Tim Osmulski, Research Assistant, M.S. student

Power Prediction Simulator for FPGAs

Graduated May 1998

• Brian Veale, Research Assistant, M.S. student

Calibration of FPGA power prediction model; Implementation of STAP core on GPP/FPGA

New RA as of May 1998

• New Student, Research Assistant, M.S. student

Implementation of SAR core on GPP/FPGA

To be hired September 1998

Contacts, Partners, Vendors, and Other Communications (Program Management Status)

DARPA: José Muñoz

Rome Lab: Ralph Kohler

MIT Lincoln Lab: David Martinez, Jim Ward

MITRE: Richard Games

Northrop Grumman: Marc Campbell

Synplicity, Inc.: Madelyn Miller

Xilinx: Jason Feinsmith

Annapolis Micro Systems: Jenny Donaldson, Bill Hulbert, Paul Kowalewski

ISI: Milissa Benincasa, David Coker

Mercury Computer: Thomas Einstein, Ed Holstien, Craig Lund, Dave Toms

Equipment Status (Program Management Status)

Mercury
• 20 Slot Hybrid Chassis with SPARC 5V
• Solaris 2.5 with C Compiler
• 9U VME RACE Board
• SHARC Daughtercard (2 CNs, 8 MB/CN)
• SHARC Daughtercard (2 CNs, 16 MB/CN)
• SHARC Daughtercard (2 CNs, 16 MB/CN)
• MC/OS, Cross Assembler, Toolkit
• PowerPC Daughtercard (2 CNs, 16 MB/CN)

Annapolis Micro Systems
• PCI WILDONE Card (1 Xilinx 4028EX-3)
• VME WILDFIRE Array Card (16 Xilinx 4028EX-3s)

Other Vendors
• ModelSim Simulation Software (Model Technology, Inc.)
• Synplify Synthesis Software (Synplicity, Inc.)
• Xilinx Foundation Software (Xilinx, Inc.)

√√√

Schedule of Milestones

Timeline: June 1997, Dec. 1997, June 1998, Dec. 1998, June 1999, Dec. 1999

• Design STAP Iterative Weight Solver for FPGA
• Inter-GPP/DSP Comm. Simulator for STAP
• Optimal GPP/DSP Config. for SAR
• GPP/DSP/FPGA Platform Construction and Independent Testing of GPP/DSP and FPGA Subsystems
• Implement STAP Iterative Weight Solver on FPGA
• Optimal GPP/DSP Config. for STAP
• Implement SAR Linear Filtering on FPGA
• Optimal GPP/DSP/FPGA Config. for SAR/STAP
• GPP/DSP and FPGA Subsystem Integration and Testing
• Optimal GPP/DSP/FPGA Config. for SAR
• Demonstrate Combined SAR/STAP on GPP/DSP/FPGA Platform
• Implement SAR on GPP/DSP
• Design SAR Linear Filtering for FPGA
• Implement STAP on GPP/DSP
• Implement SAR on GPP/DSP/FPGA Platform
• Optimal GPP/DSP/FPGA Config. for STAP
• Implement STAP on GPP/DSP/FPGA Platform
• Develop FPGA Power Consumption Simulator
• Test FPGA Power Consumption Simulator

Key (each with Research/Design and Implement/Test phases)
• GPP/DSP Sub-System
• FPGA Sub-System
• GPP/DSP/FPGA System

FY 97 and FY 98 Budgets (Program Management Status)

                 FY 97       FY 98       FY 98       FY 98
                 Approved    Approved    Required*   "Deficit"
Personnel          22,066      56,710      84,517      27,807
Fringes             7,575      18,871      25,723       6,852
Consulting              0           0      15,000      15,000
Expenses              260       3,321       4,500       1,179
Travel                  0       4,500       4,500           0
Equipment          74,000      55,608      85,088      29,480
Indirect Cost      13,634      39,198      62,623      23,425
Total             116,644     178,208     281,951     103,743

*Required to maintain 30 month completion date (i.e., 12/31/99).

FY 99 and FY 00 Budgets (Program Management Status)

                 FY 99       FY 00       Project Total
Personnel         138,536      52,401         297,520
Fringes            39,911      14,404          87,614
Consulting         25,000      10,000          50,000
Expenses            7,078       3,000          14,838
Travel             12,000       5,000          20,500
Equipment          59,892           0         217,670
Indirect Cost     104,587      39,858         221,121
Total             387,004     124,664         909,262


Recent Accomplishments

• Network Communication Time Simulator for Parallel STAP

• FPGA Inner-Product Co-Processor Designs for STAP Weight Solver

• Power Prediction Simulator for the Xilinx 4000-Series FPGA

Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)

• Space-Time Adaptive Processing (STAP) Basics

• Mercury RACE Multicomputer

• Parallelization Approach for STAP

• RACE Network Simulator

• Preliminary Numerical Studies

• Conclusions

J. Ward, “Space-Time Adaptive Processing for Airborne Radar,” Technical Report 1015, MIT Lincoln Laboratory, Lexington, MA, 1994.

M. F. Skalabrin and T. H. Einstein, “STAP Processing on a Multicomputer: Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor Communication,” Proc. Adaptive Sensor Array Processing (ASAP) Workshop, March 1996.

The RACE Multicomputer, Hardware Theory of Operation: Processors, I/O Interface, and the RACEway Interconnect, Volume I, ver. 1.3.

T. H. Einstein, “Mercury Computer Systems’ Modular Heterogeneous RACE Multicomputer,” Proc. 6th Heterogeneous Comp. Workshop, April 1997, pp. 60-71.

B. C. Kuszmaul, “The RACE Network Architecture,” Proc. 9th Int’l Parallel Processing Symp., April 1995, pp. 508-513.

G. Booch, I. Jacobson, and J. Rumbaugh, “The Unified Modeling Language for Object Oriented Development,” Documentation Set Version 1.1, September 1997.

Related STAP and RACE References

SPACE-TIME ADAPTIVE PROCESSING

1. Space-Time Adaptive Processing (STAP) refers to a class of signal processing methods that operate on data collected from a set of sensors over a given time interval.

2. STAP simultaneously combines the signals received from an antenna array (spatial domain) and multiple pulse repetition periods (time domain).

3. STAP provides improved detection of smaller targets in the presence of ground clutter (overland and littoral environments) and hostile interference (electronic counter measures and jamming).

STAP PROCESSING FLOW

[Figure: STAP processing flow. Block labels: Input Data, Pulse Compress, Rotate, Doppler Filter, QR Decomposition, Steering Vectors, Weights, Beamform, Beam Outputs. Data cube axes shown: Pulses, Channels.]


SOME RACE NETWORK FEATURES

1. 40 MHz clock, 32-bit data paths, 2048-byte circuit-switched packets.

2. Contention resolved using priorities.
   a. User-programmable message priority
   b. Hardware priority assigned at each crossbar along a path (based on complex connection rules)

3. A packet with higher priority preempts (suspends) a lower priority packet (active or inactive) to gain control of a crossbar port.
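The preemption rule in item 3 can be sketched as a small model of a single crossbar port. This is an illustrative sketch only, not the simulator's code: the names `Packet` and `CrossbarPort`, the convention that a larger number means higher priority, and the choice to resume the highest-priority suspended packet on release are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    name: str
    priority: int  # assumption: larger value means higher priority


class CrossbarPort:
    """Hypothetical model of one crossbar port held by at most one packet."""

    def __init__(self) -> None:
        self.holder: Optional[Packet] = None
        self.suspended: List[Packet] = []

    def request(self, packet: Packet) -> bool:
        """Try to acquire the port; return True if the packet now holds it."""
        if self.holder is None:
            self.holder = packet
            return True
        if packet.priority > self.holder.priority:
            # A higher-priority packet preempts (suspends) the current holder.
            self.suspended.append(self.holder)
            self.holder = packet
            return True
        return False  # equal or lower priority: the request is blocked

    def release(self) -> None:
        """Free the port; resume the highest-priority suspended packet, if any."""
        self.holder = None
        if self.suspended:
            self.suspended.sort(key=lambda p: p.priority)
            self.holder = self.suspended.pop()


port = CrossbarPort()
got_low = port.request(Packet("low", 2))
got_high = port.request(Packet("high", 5))
holder_during = port.holder.name
port.release()
holder_after = port.holder.name
```

In this sketch the low-priority packet is suspended rather than dropped, so it regains the port once the high-priority packet releases it.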

RACE NETWORK INTERCONNECT FAT-TREE TOPOLOGY

[Figure: fat tree of 6-port crossbars connecting compute nodes (CNs); a message path is marked from the message source to the message destination.]

STANDARD CROSSBAR PRIORITY ARBITRATION ALGORITHM TABLE

Hardware
Priority   Entry Port   Exit Port     Entry Port   Exit Port     Entry Port   Exit Port
7          F            A,B,C,D,E     F            A,B,C,D,E     F            A,B,C,D
6          E            F             E            F             A,B,C,D*     A,B,C,D*
5          A,B,C,D      F             A,B,C,D      F             A,B,C,D      F
4          E            A,B,C,D       E            A,B,C,D       -            -
3          *A,B,C,D     *A,B,C,D,E    A,B,C,D*     A,B,C,D*      -            -
2          -            -             A,B,C,D      E             -            -
1          -            -             -            -             -            -

Each Entry Port/Exit Port pair corresponds to one Transaction Status: Not Yet Active; Active, Port E Involved; Active, Port E Not Involved.

* - Peer Kill Rules Apply

SHARC COMPUTE NODE

[Figure: SHARC compute node. Block labels: two SHARC Processors, CN ASIC (RACEway Interface, RACEway Mapping, DMA Controller, Performance Metering, 3-Way Data Switches, ECC Logic, OS Support Hardware), DRAM.]


SUB-CUBE BAR PARTITIONING METHODOLOGY

1. Partition STAP data cube over a 2-D process set.

2. Process the contiguous dimension.

3. Re-partition the data cube before processing the next dimension.

4. Rotate the newly distributed data to make the next dimension sequential in memory.

5. Repeat steps 1 through 4 before each processing phase.
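A minimal sketch of the re-partitioning step, counting how many samples each process must send when the whole dimension changes from range to pulses. The helper names (`block_bounds`, `repartition_messages`) are hypothetical, and the mapping is an assumption: process-set rows split the channel dimension, while columns split pulses before the exchange and range after it. The cube size (R:200, P:22, C:16) and the 3 x 4 process set are taken from parameters that appear elsewhere in these slides.

```python
def block_bounds(n, parts):
    """Split range(n) into `parts` contiguous, nearly equal [start, stop) blocks."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def repartition_messages(R, P, C, rows, cols):
    """Return {(src, dst): sample_count} for a range-whole -> pulse-whole exchange.

    Assumption: process (i, j) owns channel block i throughout; it owns pulse
    block j (all ranges) before the exchange and range block j (all pulses) after.
    """
    chan = block_bounds(C, rows)
    pulse = block_bounds(P, cols)
    rng = block_bounds(R, cols)
    msgs = {}
    for i in range(rows):
        n_chan = chan[i][1] - chan[i][0]
        for j in range(cols):          # source column
            n_pulse = pulse[j][1] - pulse[j][0]
            for k in range(cols):      # destination column
                if k == j:
                    continue           # this sub-block stays local
                n_rng = rng[k][1] - rng[k][0]
                msgs[((i, j), (i, k))] = n_chan * n_pulse * n_rng
    return msgs


msgs = repartition_messages(R=200, P=22, C=16, rows=3, cols=4)
same_row = all(src[0] == dst[0] for src, dst in msgs)
total_moved = sum(msgs.values())
```

Under this mapping every message stays within one row of the process set, and only the sub-block a process already holds in its new range block avoids the network.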

STAP DATA CUBE PARTITIONING EXAMPLES

[Figure: Pulse Compression partitioning with range dimension whole, on a 3 x 4 process set (processes 1 through 12); cube axes Pulses and Range.]

[Figure: Doppler Filtering partitioning with pulses dimension whole, on a 3 x 4 process set (processes 1 through 12); cube axes Pulses and Range.]

STAP DATA CUBE REPARTITIONING

• Re-Partitioning involves exchanging data with the next whole dimension.

• Interprocessor Communication is required between processors in the same row.

[Figure: 3 x 4 process set (processes 1 through 12) shown twice: once with the Range dimension contiguous and once with the Pulse dimension contiguous.]

STAP DATA CUBE REPARTITIONING

Data Re-distribution Mapping

[Figure: Required Data Transfers between the Pulse Compression and Doppler Filtering partitionings (processes 1 through 12; axes Pulses and Range), alongside the Network Interconnection Configuration: a 6-port crossbar with four CNs.]


RESEARCH OBJECTIVES for SIMULATOR

1. Design and implement a network simulator that models the effect that data mapping and scheduling have on the performance of a STAP algorithm.

2. Key features of the network simulator include:
   a. Developed and implemented in an object-oriented (OO) paradigm.
   b. Implemented using a sub-cube bar partitioning scheme.
   c. Models both sub-cube bar mapping strategies and communication scheduling during both phases of data re-partitioning.
   d. Completely generic.

UML NETWORK CLASS DIAGRAM

[Figure: classes Network, Clock, Crossbar, Routing Table, File Output, Random Scan, Data Cube, Process Set; association label "Gets Data From".]

UML CROSSBAR CLASS DIAGRAM

[Figure: classes Crossbar, Link, Compute Node, Message Queue, Packet Stack, Message, Packet.]

UML DATA CLASS DIAGRAM

[Figure: abstract class Data with inheriting classes Message and Packet; related classes Header, RouteList, Route; multiplicity 0..*.]

NETWORK CLASS DETAILS

Network Methods
• Dynamic network construction
• Dynamic routing table creation
• Dynamic CN and CE message traffic generation
• Simulates packet traffic

Compute Node
• Processor information
• Outgoing and received message queues
• Outgoing and received packet stack

Random Scan
• Generates pseudo-random CN scan ordering

Clock
• Based on network clock frequency (factor of 5)
• Data transfer rate equates to effective network bandwidth

[Diagram labels: Crossbar, Link.]

CROSSBAR CLASS DETAILS

Crossbar Methods
• Implements hardware priority arbitration (top-level algorithm and standard algorithm)
• Query port status
• Routes packets to next location
• Allocates and frees internal port connections and connected link objects
• Transmits packet data

Crossbar
• Two parent port connections
• Four child port connections
• Internal switch connections
• Four CN connections for terminal crossbars

Link
• Connects crossbar objects
• Link status: occupied or free

COMPUTE NODE CLASS DETAILS

Compute Node Methods:
• Manages outgoing and received message queues
• Manages outgoing and received packet stack
• Explodes the top outgoing message into packets of size 2048 or less
• Handles DMA chaining of packets
• Establishes path through network and transmits packet data

• PACKETS ARE SELF-ROUTING

[Figure: the top message of the Outgoing Message Queue (Message 1, Message 2, Message 3) is exploded (EXPLODE) onto the Packet Stack (Packet 1, Packet 2, Packet 3, Packet 4).]
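The "explode" step can be illustrated with a short sketch that splits one message length into packet lengths of at most 2048 bytes. The function name `explode` and the representation of a message as a byte count are assumptions for illustration, not the simulator's interface.

```python
PACKET_BYTES = 2048  # maximum packet size from the slides


def explode(message_bytes, packet_bytes=PACKET_BYTES):
    """Split a message of `message_bytes` bytes into packet sizes <= packet_bytes."""
    if message_bytes < 0:
        raise ValueError("message size must be non-negative")
    full, rest = divmod(message_bytes, packet_bytes)
    # `full` maximum-size packets, plus one shorter packet for any remainder.
    return [packet_bytes] * full + ([rest] if rest else [])


packets = explode(5000)
```

For a 5000-byte message this gives two full packets and one 904-byte packet.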

SIMULATOR UML SEQUENCE DIAGRAM

[Figure: lifelines User (actor), Data Cube, Process Set, Network, Crossbar, CN, Clock. Inputs shown: R:200, P:22, C:16; CEs:48; X:6, Y:8; Routing:F; CN Traffic; Phase 1; DMA:Y. Messages: Message Matrices; X, Y, Mapping Matrices; Build Messages; Increment Simulation (Pass 1, Pass 2; iterative process); Connection/Data Transfer; Clean Up; Messages Time. Result shown: Simulation Time = 2 ms.]

COMPUTE NODE UML STATECHART: Simulation PASS 1

[Figure: Simulation Subsystem (Simulate Pass 1) and Compute Node Subsystem. States and actions: Message Queue Status, Packet Stack Status, Explode Top Message, Pop Top Packet, Current Packet, Generate Error Code. Transition labels: Empty, Empty - Done, Not Empty, Success, Error, Packet Found, No Packet.]

COMPUTE NODE UML STATECHART: Simulation PASS 2

[Figure: Simulation Subsystem (Simulate Pass 2) and Compute Node Subsystem. State: Current Packet. Transition labels: Packet Found, No Packet.]

PACKET UML STATECHART: Simulation Pass 1 and Pass 2

[Figure: Simulation Pass Subsystem. Packet states: Start Up, Ready, Active, Blocked, Suspended, Waiting for Kill, Completed.]


PROCESS SET PERFORMANCE METRIC

Communication Phase 1

[Plots, x-axis Time (ms):
• Process Set - Phase 1 (CN:8, R:800, P:32, C:22, Routing:E); series CN 8 (6x4), CN 8 (4x6)
• Process Set - Phase 1 (CN:16, R:200, P:22, C:16, Routing:F); series CN 16 (12x4), CN 16 (8x6), CN 16 (4x12)
• Process Set - Phase 1 (CN:12, R:200, P:22, C:16, Routing:F); series CN 12 (12x3), CN 12 (9x4), CN 12 (6x6), CN 12 (4x9); a second set of panels shows CN 12 (3x12), CN 12 (12x3), CN 12 (4x9)
• Process Set - Phase 1 (CN, R:200, P:22, C:16, Routing:F); series CN 12 (3x12), CN 16 (4x12)]

MESSAGE TRAFFIC PERFORMANCE METRIC

[Plots, x-axis Time (ms); series CN Traffic, CE Traffic:
• Message Traffic - Phase 1 (CN:16, X:12, Y:4, R:400, P:22, C:16, Routing:EF)
• Message Traffic - Phase 1 (CN:12, X:6, Y:6, R:800, P:32, C:22, Routing:E)]

DMA CHAINING PERFORMANCE METRIC

[Plots, x-axis Time (ms); series Chaining, No Chaining:
• DMA Chaining - Phase 1 (CE:24, X:8, Y:3, R:200, P:22, C:16, Routing:F)]

ADAPTIVE ROUTING PERFORMANCE METRIC

[Plots, x-axis Time (ms); series Adaptive E, Adaptive F, Adaptive E/F:
• Adaptive Routing - Phase 1 (CN:16, X:8, Y:6, R:800, P:32, C:22)
• Adaptive Routing - Phase 2 (CN:16, X:8, Y:6, R:800, P:32, C:22)]

CONCLUSIONS

1. Designed and implemented a platform-independent simulator.

2. Simulator demonstrates that the Process Set, the CN or CE Message Traffic, the DMA chaining, the adaptive routing, and the scheduling of the messages affect performance.

3. Allows users to experiment with possible current and future configurations.

4. Communication pattern implemented for STAP but may be used for other applications with a phased communication pattern.

FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers

(Recent Accomplishments)

• Overview of STAP Weight Calculation

• Two Candidate STAP Weight Solvers: QR Versus CG

• Two FPGA Inner-Product Circuit Designs

• Numerical Accuracy Studies

References for STAP Weight Solver and FPGA Design

J. Ward, “Space-Time Adaptive Processing for Airborne Radar,” Technical Report 1015, MIT Lincoln Laboratory, Lexington, MA, 1994.

K. C. Cain, J. A. Torres, and R. T. Williams, (R. A. Games, Project Leader), “RT_STAP: Real-Time Space-Time Adaptive Processing Benchmark,” MITRE Technical Report MTR 96B0000021, Feb. 1997.

MCARM Data Files, Rome Laboratory, (http://sunrise.oc.rl.af.mil).

D. G. Luenberger, Linear and Nonlinear Programming, Addison-Wesley, Reading, MA, 1984.

WildOne Hardware Reference Manual, Number 11927-0000, Revision 0.1, Annapolis Micro Systems, Inc., MD, 1997.

Typical STAP Processing Flow

[Figure: block labels Input Data, Pulse Compress, Data Cube, Doppler Filter, Data Cube, Steering Vector, Weight Computation, Weight Application, Threshold Detection, Target Decision.]

Principle Behind STAP

[Figure: STAP CPI Data Cube with dimensions PRI (32-128), Channels (24), and Range (625-2500); other labels: pulses, Doppler, range, 8%, Covariance Matrix.]

• Range gates are divided into non-overlapping blocks having a fixed number of range gates

• These blocks are referred to as the Range Segments

Number of Range Segments = L/Lr

[Figure: a Range Segment of Lr range gates across the Channels dimension.]

Space-Time Adaptive Processing

• Fully Adaptive STAP
  • Works with data on all M Doppler bins and all N channels
  • Computes and applies a separate adaptive weight to every element and Doppler bin
  • The weight vector is of size MN for each range gate

• Characteristics of Fully Adaptive STAP
  • Requires solving a large system of linear equations
  • Size of the linear system grows with array size (the number of channels) and number of pulses
  • Example: for each instance, if M = 32 and N = 24, then complexity ≈ (MN)^3 = 452,984,832

• Implementation of fully adaptive STAP is impractical
  • Complexity of each instance is O((MN)^3)
  • Product MN being several hundreds puts it beyond current capabilities in real-time computing
  • Instances of the problem must be solved for each range segment

• Problem is broken down into a number of smaller, more manageable adaptive problems

• STAP applied to these lower dimension problems

• Partially Adaptive STAP

• The partially adaptive STAP works with data on
  • All N channels
  • And a few adjacent Doppler bins, denoted as K

• Complexity is reduced to O(M(KN)^3), for K << M

Example: for each instance, if M = 32, N = 24 and K = 3, then complexity ≈ M(KN)^3 = 11,943,936
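The two complexity estimates can be checked with a few lines of arithmetic. This sketch only evaluates the expressions (MN)^3 and M(KN)^3 for the slide's example values; the function names are hypothetical.

```python
def fully_adaptive_ops(M, N):
    """Approximate operation count (MN)^3 for one fully adaptive instance."""
    return (M * N) ** 3


def partially_adaptive_ops(M, N, K):
    """Approximate operation count M(KN)^3 for one partially adaptive instance."""
    return M * (K * N) ** 3


full = fully_adaptive_ops(32, 24)            # (32 * 24)^3 = 768^3
partial = partially_adaptive_ops(32, 24, 3)  # 32 * (3 * 24)^3 = 32 * 72^3
ratio = full / partial                       # algebraically M^2 / K^3
```

For M = 32, N = 24, K = 3 the ratio is M^2/K^3 = 1024/27, so the partially adaptive formulation needs roughly 38 times fewer operations per instance.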

Kth-Order Doppler Factored STAP

• Effective partially adaptive STAP technique

• The architecture consists of
  • Doppler processing across all pulse repetition intervals
  • Adaptive filtering across all channels and K adjacent Doppler bins

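Kth-Order Doppler Factored STAP

Covariance matrix estimate for the kth Doppler bin and bth range segment (K = 3):

\[
\hat{\psi}(k,b) = \frac{1}{L_r} \sum_{r=(b-1)L_r+1}^{bL_r} x(k,r)\, x^H(k,r),
\qquad x(k,r):\ 3N \times 1, \quad \hat{\psi}(k,b):\ 3N \times 3N
\]

[Figure: Data matrix needed for calculating covariance matrix for kth Doppler bin and bth range segment using Kth-order Doppler factored STAP with K = 3; Doppler bins (k - 1), k, (k + 1) and the bth range segment are marked.]

Matrix-Based Derivation of \( \hat{\psi}(k,b) \)

\[
\hat{\psi}(k,b) = \frac{1}{L_r} \sum_{r=(b-1)L_r+1}^{bL_r} x(k,r)\, x^H(k,r)
= \frac{1}{L_r}\, X(k,b)\, X^H(k,b),
\qquad X(k,b):\ 3N \times L_r
\]

The Weight Equation (with \( \psi(k,b) \) taken as the estimate above):

\[
\psi(k,b)\, w(k,b) = s
\]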
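A minimal sketch of the covariance estimate for one range segment: it averages the outer products x x^H over the Lr snapshots of segment b. The function names and the tiny 2-element snapshots are illustrative assumptions; real snapshots would have 3N complex entries for K = 3.

```python
def segment(snapshots_by_range, b, Lr):
    """Snapshots for range gates (b-1)*Lr+1 .. b*Lr (1-based), as a list slice."""
    return snapshots_by_range[(b - 1) * Lr : b * Lr]


def sample_covariance(snapshots):
    """Return (1/L) * sum over snapshots of x x^H, with each x a list of complex values."""
    L = len(snapshots)
    n = len(snapshots[0])
    psi = [[0j] * n for _ in range(n)]
    for x in snapshots:
        for i in range(n):
            for j in range(n):
                # (x x^H)[i][j] = x_i * conj(x_j)
                psi[i][j] += x[i] * x[j].conjugate()
    return [[v / L for v in row] for row in psi]


# Hypothetical data: four range gates, two-element snapshots.
data = [[1 + 1j, 2 + 0j], [0 + 1j, 1 - 1j], [2 + 0j, 0 + 0j], [1 + 0j, 1 + 0j]]
psi = sample_covariance(segment(data, b=2, Lr=2))
```

The result is Hermitian by construction, which is what allows the triangular-factor and conjugate-gradient solvers discussed next to be applied.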

Methods for STAP Weight Calculation

• Two approaches to solve the weight equation

• QR-decomposition method (direct)

• Conjugate Gradient method (iterative)

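STAP Weight Calculation

Using QR-decomposition Method to solve the Weight Equation: \( \psi w = s \)

\[ \psi(k,b)\, w(k,b) = s \]

\[ \frac{1}{L_r}\, X(k,b)\, X^H(k,b)\, w(k,b) = s \]

where \( X(k,b) \) is the data matrix for the kth Doppler bin and bth range segment.

Take QR decomposition: \( X^H(k,b) = QR \)

\[ \frac{1}{L_r}\, R^H Q^H Q R\, w(k,b) = s \]

\[ \frac{1}{L_r}\, R^H R\, w(k,b) = s \]

Note that \( R = \begin{bmatrix} R_1 \\ 0 \end{bmatrix} \) with \( R_1 \) upper triangular, so \( R^H R = R_1^H R_1 \):

\[ R_1^H R_1\, w(k,b) = L_r\, s \]

Let \( y = R_1\, w(k,b) \).

Solve \( R_1^H y = L_r\, s \) for \( y \) using forward elimination.

Solve \( R_1\, w(k,b) = y \) for \( w(k,b) \) using backward substitution.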
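The QR-based solve can be sketched in plain Python under simplifying assumptions: the data are real-valued (STAP data are complex), the data matrix has full row rank, and the triangular factor is obtained by modified Gram-Schmidt rather than by whatever factorization the project used. The names `qr_upper` and `solve_weights` are hypothetical.

```python
import math


def qr_upper(A):
    """Upper-triangular factor R (n x n) of an m x n matrix A, via modified Gram-Schmidt."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = math.sqrt(sum(v * v for v in cols[j]))
        q = [v / R[j][j] for v in cols[j]]
        for k in range(j + 1, n):
            R[j][k] = sum(a * b for a, b in zip(q, cols[k]))
            cols[k] = [b - R[j][k] * a for a, b in zip(q, cols[k])]
    return R


def solve_weights(X, s):
    """Solve (1/L) X X^T w = s using the triangular factor of X^T (real-valued sketch)."""
    n, L = len(X), len(X[0])
    Xt = [[X[i][r] for i in range(n)] for r in range(L)]  # L x n
    R = qr_upper(Xt)
    # Forward elimination: R^T y = L s (R^T is lower triangular).
    y = [0.0] * n
    for i in range(n):
        y[i] = (L * s[i] - sum(R[k][i] * y[k] for k in range(i))) / R[i][i]
    # Backward substitution: R w = y.
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (y[i] - sum(R[i][k] * w[k] for k in range(i + 1, n))) / R[i][i]
    return w


# Hypothetical 3 x 4 data matrix and steering vector.
X = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 1.0, 3.0], [2.0, 0.0, 1.0, 1.0]]
s = [1.0, 0.0, 1.0]
w = solve_weights(X, s)
L = len(X[0])
psi = [[sum(X[i][r] * X[j][r] for r in range(L)) / L for j in range(3)] for i in range(3)]
residual = max(abs(sum(psi[i][j] * w[j] for j in range(3)) - s[i]) for i in range(3))
```

The covariance matrix is formed here only to check the residual; the solve itself works directly from the data matrix, which is the point of the QR approach.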

STAP Weight Calculation

Using Conjugate Gradient Method to solve the Weight Equation: \( \psi w = s \)

Initialization: choose \( w_0 \), set \( g_0 = \psi w_0 - s \), \( d_0 = -g_0 \)

Iteration: [update equations appear as an image on the slide]
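For reference, a textbook conjugate gradient iteration in the form given by Luenberger (cited in the reference list above) is sketched below for a real symmetric positive definite matrix. It is not taken from the slides: the real-valued data, the zero starting vector, and the stopping rule on the gradient norm (`tol`) are assumptions.

```python
def matvec(A, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(psi, s, tol=1e-10, max_iter=None):
    """Solve psi w = s for symmetric positive definite psi; returns (w, iterations)."""
    n = len(s)
    w = [0.0] * n                                   # w0 (assumed zero start)
    g = [a - b for a, b in zip(matvec(psi, w), s)]  # g0 = psi w0 - s
    d = [-v for v in g]                             # d0 = -g0
    iters = 0
    for _ in range(max_iter or n):
        if dot(g, g) ** 0.5 <= tol:
            break
        pd = matvec(psi, d)
        dpd = dot(d, pd)
        alpha = -dot(g, d) / dpd                    # step length along d
        w = [wi + alpha * di for wi, di in zip(w, d)]
        g = [a - b for a, b in zip(matvec(psi, w), s)]
        beta = dot(g, pd) / dpd                     # keeps directions psi-conjugate
        d = [-gi + beta * di for gi, di in zip(g, d)]
        iters += 1
    return w, iters


psi = [[4.0, 1.0], [1.0, 3.0]]
s = [1.0, 2.0]
w, iters = conjugate_gradient(psi, s)
```

In exact arithmetic the method reaches the solution in at most n iterations; loosening `tol` trades accuracy for fewer iterations.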

Preliminary Numerical Studies

[Plots (two slides): results versus Tolerance (10^-8 to 10^-1) for Lr = 250 and Lr = 125, including the Relative Error of the computed weight vector w.]

Motivation for FPGA Inner-Product Co-Processors

• Inner-products are a core calculation for both CG- and QR-based STAP weight solvers

• Computations are highly numeric and regular

• Opportunities to exploit reduced precision arithmetic

• Control flow of CG and QR best implemented on GPP or DSP - Inner product calculations can be offloaded to available FPGA resources

Overview of WildOne Architecture

[Figure: block labels PCI Bus, Dual Port Mem Controller 0, Dual Port Mem Controller 1, Processing Element, Fifo0, Fifo1, SIMD Connector, External I/O Connector.]

FPGA Inner Product Co-Processor: Design 1

• Multiply-Accumulate Pipe
• Reads two operands per cycle
• Performs two operations per cycle
• Performs exponent normalization prior to accumulation
• 2 N-vectors reduced to a constant number of partial sums

[Figure labels: Host Processor, Interconnection, Buffer, operands a and b, sign of a, 1's comp/register, Normalizing unit, Sign + 16 bit mantissa, Output Register.]

FPGA Inner Product Co-Processor: Design 2

• Multiply-Add Reduction Pipe
• Reads four operands per cycle
• Performs three operations per cycle
• No normalization required
• 2 N-vectors reduced to N/2 partial sums

• Basic Tradeoff: First design has lower throughput, but can perform more work

[Figure labels: Host Processor, Interconnection, Buffer, Data for First Multiplier, Data for Second Multiplier, Sign a, Sign b, 1's comp/register, Sign + 16 bit mantissa, Unit clocked.]
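The difference between the two reduction styles can be sketched in software. This is a functional illustration only, with plain integer arithmetic in place of the block floating-point hardware; the function names and the `lanes` parameter (how many partial sums Design 1 keeps) are assumptions.

```python
def design1_mac(a, b, lanes=1):
    """Multiply-accumulate: two operands per step, a fixed number of partial sums."""
    partial = [0] * lanes
    for i, (x, y) in enumerate(zip(a, b)):
        partial[i % lanes] += x * y  # one multiply and one accumulate per step
    return partial


def design2_multiply_add(a, b):
    """Multiply-add reduction: four operands per step, N/2 partial sums (N even)."""
    if len(a) != len(b) or len(a) % 2:
        raise ValueError("vectors must have the same even length")
    # Two multiplies and one add per step; no running accumulation.
    return [a[i] * b[i] + a[i + 1] * b[i + 1] for i in range(0, len(a), 2)]


a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
p1 = design1_mac(a, b)
p2 = design2_multiply_add(a, b)
inner_product = sum(p2)  # the remaining additions are left to the host in this sketch
```

Design 1 hands back an almost finished result, while Design 2 leaves N/2 partial sums to be added elsewhere, which mirrors the tradeoff stated on the slide.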

UML Description of Basic Co-Processor Design

[Figure: Inner-Product Co-Processor containing a Block Floating Point Unit.]

UML Description of Block Floating Point Unit

[Figure: Block Floating Point Unit composed of Multiplying Unit, Complementor, Normalizing Unit, Accumulator.]

UML Description of Multiplying Unit

[Figure: Multiplying Unit composed of Register, 4-Bit Adder, Multiply Stage1.]

UML Description of Accumulator Unit

[Figure: Accumulator composed of 4-Bit Adder, Register, 3-Bit Adder.]

UML Description of Normalizing Unit

[Figure: Normalizing Unit composed of Subtractor, Register, Magnitude Comparator.]

Sequence Diagram for Interactions between Host and FPGA Board

Lifelines: Host Program, Wild-One

MESSAGES
• Open board
• Program the board with the image
• Interrupt for Exponent
• Exponent written to FIFO
• Interrupt for Mantissa Vectors
• Mantissa Vectors written to the FIFO
• Processing done, answer in FIFO/Memory
• Read back the answer
• Close the board

Statechart Diagram for Interactions between Host and FPGA Board

[Figure: System and Processing Sub-System. States: Get Exponent, Wait for Exponent Int, Read Exponent, Write Exponent, Get Mantissa, Wait for Mantissa Int, Read Mantissa, Write Mantissa, Multiply-and-add/accumulate, Write Back, Wait for Answer Int, Read Back Answer. Signals: Int Req, Int Ack, Req = 1, Req = 0, Ack = 1, Ack = 0, Done = 1.]

Circuit Activity Diagram: Design 1

[Figure: activities Read Two Operands, Multiply, Accumulate, Feedback Sum, Increment Count, Compare Count, Write to Memory, Set Done flag. Guards: [Count = Threshold], [Count ≠ Threshold].]

Compare Count [Count = Threshold]

Read Two Operands

Multiply

[Count ≠ Threshold]
