
Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP

by John K. Antonio
Department of Computer Science, College of Engineering
Texas Tech University
[email protected]

First Annual Review, June 23, 1998

Outline

• Program Overview and Introduction (Quad Chart)

• Program Management Status

• Recent Accomplishments

• Status of Deliverable Checklist

Configuring Combined GPP/DSP/FPGA Systems for Minimal SWAP
Texas Tech University: John K. Antonio

Applications
• SAR
• STAP

Requirements
• Throughput
• SWAP

• Combined Technology
• Minimal SWAP Configuration
• Mixed-Mode Operation
• Demonstration

New Ideas
• Systematic determination of minimal SWAP configuration based on proven mathematical programming techniques
• Optimal configuration based on automatic "tuning" of system design parameters
  - number and types of cards used
  - data mapping and communication schemes
  - place and route schemes
• Novel computing techniques based on characteristics of GPP/DSP/FPGA system

Schedule (Jun 97 start, Jun 98, Jun 99, Dec 99 end)
• Develop optimal configuration techniques
• Construction and integration of GPP/DSP/FPGA system
• Implement and test optimal configurations on GPP/DSP/FPGA system
• Develop practical design methods based on SAR and STAP applications
• Demonstrate advantages of combining technologies

Impact
• Embedded systems requirements for the 21st century can be satisfied with the combined use of GPP, DSP, and FPGA technologies
• Demonstrate use of FPGA boards as co-processors for embedded multiprocessor GPP and DSP systems
• Demonstrate systematic approaches to optimally configure GPP/DSP/FPGA systems for minimal SWAP for embedded applications
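As a rough illustration of what "tuning" the number and types of cards could look like, the sketch below searches card counts exhaustively and keeps the cheapest configuration that meets a throughput requirement. It is only a toy under stated assumptions: the per-card throughput and SWAP figures, the single scalar SWAP cost, and the function name `minimal_swap_configuration` are all hypothetical, and brute-force search stands in for the mathematical programming techniques the chart refers to.

```python
from itertools import product

# Hypothetical per-card figures (arbitrary units); none come from the review.
CARDS = {
    "GPP": {"throughput": 1.0, "swap": 3.0},
    "DSP": {"throughput": 2.0, "swap": 4.0},
    "FPGA": {"throughput": 5.0, "swap": 6.0},
}


def minimal_swap_configuration(required_throughput, max_cards_per_type=8):
    """Return (swap_cost, card_counts) for the cheapest feasible configuration.

    Every combination of 0..max_cards_per_type cards per type is examined;
    combinations that fall short of the required throughput are skipped.
    Returns None if no combination is feasible.
    """
    names = list(CARDS)
    best = None
    for counts in product(range(max_cards_per_type + 1), repeat=len(names)):
        throughput = sum(n * CARDS[k]["throughput"] for n, k in zip(counts, names))
        if throughput < required_throughput:
            continue
        swap = sum(n * CARDS[k]["swap"] for n, k in zip(counts, names))
        if best is None or swap < best[0]:
            best = (swap, dict(zip(names, counts)))
    return best


if __name__ == "__main__":
    print(minimal_swap_configuration(12))
```

With these made-up figures the search mixes card types rather than using only the most efficient one, which is the kind of trade-off a systematic method is meant to expose.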


Personnel (Program Management Status)

• John K. Antonio, Principal Investigator
  - Ph.D., EE, Texas A&M Univ. (1989)
  - Currently Assoc. Prof. of CS, Texas Tech Univ.
  - Over 65 publications in HPC and related areas
  - PI or co-PI of 17 contracts/grants totaling over $2.1M

• Jeff Muehring, Research Assistant, Ph.D. student
  - Optimal GPP/DSP/FPGA Configuration Techniques for SAR
  - Intern at IBM/Houston, 1/98 to 6/98

• Jack West, Research Assistant, Ph.D. student
  - Optimal Mapping, Scheduling, and Configuration Techniques for STAP; Network Simulator

• Nikhil Gupta, Research Assistant, M.S. student
  - Algorithms for STAP Weight Calculation; Mapping Inner Product Computations onto FPGAs
  - Graduating July 1998

• Tim Osmulski, Research Assistant, M.S. student
  - Power Prediction Simulator for FPGAs
  - Graduated May 1998

• Brian Veale, Research Assistant, M.S. student
  - Calibration of FPGA power prediction model; Implementation of STAP core on GPP/FPGA
  - New RA as of May 1998

• New Student, Research Assistant, M.S. student
  - Implementation of SAR core on GPP/FPGA
  - To be hired September 1998

Contacts, Partners, Vendors, and Other Communications (Program Management Status)

• DARPA: José Muñoz
• Rome Lab: Ralph Kohler
• MIT Lincoln Lab: David Martinez, Jim Ward
• MITRE: Richard Games
• Northrop Grumman: Marc Campbell
• Synplicity, Inc.: Madelyn Miller
• Xilinx: Jason Feinsmith
• Annapolis Micro Systems: Jenny Donaldson, Bill Hulbert, Paul Kowalewski
• ISI: Milissa Benincasa, David Coker
• Mercury Computer: Thomas Einstein, Ed Holstien, Craig Lund, Dave Toms

Equipment Status (Program Management Status)

Mercury
• 20 Slot Hybrid Chassis with SPARC 5V
• Solaris 2.5 with C Compiler
• 9U VME RACE Board
• SHARC Daughtercard (2 CNs, 8 MB/CN)
• SHARC Daughtercard (2 CNs, 16 MB/CN)
• SHARC Daughtercard (2 CNs, 16 MB/CN)
• MC/OS, Cross Assembler, Toolkit
• PowerPC Daughtercard (2 CNs, 16 MB/CN)

Annapolis Micro Systems
• PCI WILDONE Card (1 Xilinx 4028EX-3)
• VME WILDFIRE Array Card (16 Xilinx 4028EX-3s)

Other Vendors
• ModelSim Simulation Software (Model Technology, Inc.)
• Synplify Synthesis Software (Synplicity, Inc.)
• Xilinx Foundation Software (Xilinx, Inc.)

√√√

√√√

Schedule of Milestones

Timeline: June 1997, Dec. 1997, June 1998, Dec. 1998, June 1999, Dec. 1999

[Gantt chart. Tasks shown:]
• Design STAP Iterative Weight Solver for FPGA
• Inter-GPP/DSP Comm. Simulator for STAP
• Optimal GPP/DSP Config. for SAR
• GPP/DSP/FPGA Platform Construction and Independent Testing of GPP/DSP and FPGA Subsystems
• Implement STAP Iterative Weight Solver on FPGA
• Optimal GPP/DSP Config. for STAP
• Implement SAR Linear Filtering on FPGA
• Optimal GPP/DSP/FPGA Config. for SAR/STAP
• GPP/DSP and FPGA Subsystem Integration and Testing
• Optimal GPP/DSP/FPGA Config. for SAR
• Demonstrate Combined SAR/STAP on GPP/DSP/FPGA Platform
• Implement SAR on GPP/DSP
• Design SAR Linear Filtering for FPGA
• Implement STAP on GPP/DSP
• Implement SAR on GPP/DSP/FPGA Platform
• Optimal GPP/DSP/FPGA Config. for STAP
• Implement STAP on GPP/DSP/FPGA Platform
• Develop FPGA Power Consumption Simulator
• Test FPGA Power Consumption Simulator

Key: GPP/DSP Sub-System, FPGA Sub-System, and GPP/DSP/FPGA System, each with Research/Design and Implement/Test phases.

FY 97 and FY 98 Budgets (Program Management Status)

                 FY 97       FY 98       FY 98       FY 98
                 Approved    Approved    Required*   “Deficit”
Personnel        22,066      56,710      84,517      27,807
Fringes           7,575      18,871      25,723       6,852
Consulting            0           0      15,000      15,000
Expenses            260       3,321       4,500       1,179
Travel                0       4,500       4,500           0
Equipment        74,000      55,608      85,088      29,480
Indirect Cost    13,634      39,198      62,623      23,425
Total           116,644     178,208     281,951     103,743

*Required to maintain 30 month completion date (i.e., 12/31/99).

FY 99 and FY 00 Budgets (Program Management Status)

                 FY 99       FY 00       Project Total
Personnel       138,536      52,401      297,520
Fringes          39,911      14,404       87,614
Consulting       25,000      10,000       50,000
Expenses          7,078       3,000       14,838
Travel           12,000       5,000       20,500
Equipment        59,892           0      217,670
Indirect Cost   104,587      39,858      221,121
Total           387,004     124,664      909,262


Recent Accomplishments

• Network Communication Time Simulator for Parallel STAP

• FPGA Inner-Product Co-Processor Designs for STAP Weight Solver

• Power Prediction Simulator for the Xilinx 4000-Series FPGA

Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)

• Space-Time Adaptive Processing (STAP) Basics

• Mercury RACE Multicomputer

• Parallelization Approach for STAP

• RACE Network Simulator

• Preliminary Numerical Studies

• Conclusions

Related STAP and RACE References

J. Ward, “Space-Time Adaptive Processing for Airborne Radar,” Technical Report 1015, MIT Lincoln Laboratory, Lexington, MA, 1994.

M. F. Skalabrin and T. H. Einstein, “STAP Processing on a Multicomputer: Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor Communication,” Proc. Adaptive Sensor Array Processing (ASAP) Workshop, March 1996.

The RACE Multicomputer, Hardware Theory of Operation: Processors, I/O Interface, and the RACEway Interconnect, Volume I, ver. 1.3.

T. H. Einstein, “Mercury Computer Systems’ Modular Heterogeneous RACE Multicomputer,” Proc. 6th Heterogeneous Comp. Workshop, April 1997, pp. 60-71.

B. C. Kuszmaul, “The RACE Network Architecture,” Proc. 9th Int’l Parallel Processing Symp., April 1995, pp. 508-513.

G. Booch, I. Jacobson, and J. Rumbaugh, “The Unified Modeling Language for Object Oriented Development,” Documentation Set Version 1.1, September 1997.

SPACE-TIME ADAPTIVE PROCESSING

1. Space-Time Adaptive Processing (STAP) refers to a class of signal processing methods that operate on data collected from a set of sensors over a given time interval.

2. STAP simultaneously combines the signals received from an antenna array (spatial domain) and multiple pulse repetition periods (time domain).

3. STAP provides improved detection of smaller targets in the presence of ground clutter (overland and littoral environments) and hostile interference (electronic counter measures and jamming).

STAP PROCESSING FLOW

[Figure: STAP processing flow. Labeled blocks: Input Data, Pulse Compress, Rotate, Doppler Filter, Rotate, QR Decomposition, Steering Vectors, Weights, Beamform, Beam Outputs. The data cube between blocks has axes Channels, Pulses, and Range.]

Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)

• Mercury RACE Multicomputer

SOME RACE NETWORK FEATURES

1. 40 MHz clock, 32-bit data paths, 2048-byte circuit-switched packets.

2. Contention resolved using priorities.
   a. User-programmable message priority
   b. Hardware priority assigned at each crossbar along a path (based on complex connection rules)

3. A packet with higher priority preempts (suspends) a lower priority packet (active or inactive) to gain control of a crossbar port.
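The clock rate, path width, and packet size listed above imply a peak link rate and a best-case packet transfer time. The sketch below works through that arithmetic. It assumes one 32-bit word moves per clock cycle and ignores path setup, contention, and preemption, so the results are upper bounds on performance rather than measured values; the function names are illustrative only.

```python
CLOCK_HZ = 40_000_000        # 40 MHz clock (from the slide)
PATH_WIDTH_BYTES = 32 // 8   # 32-bit data path (from the slide)
PACKET_BYTES = 2048          # maximum packet size (from the slide)


def peak_bandwidth_bytes_per_s(clock_hz=CLOCK_HZ, width_bytes=PATH_WIDTH_BYTES):
    """Peak rate, assuming one full-width word is transferred every cycle."""
    return clock_hz * width_bytes


def packet_transfer_cycles(packet_bytes=PACKET_BYTES, width_bytes=PATH_WIDTH_BYTES):
    """Clock cycles needed to move one packet, rounded up to whole words."""
    return -(-packet_bytes // width_bytes)


def packet_transfer_seconds(packet_bytes=PACKET_BYTES, clock_hz=CLOCK_HZ,
                            width_bytes=PATH_WIDTH_BYTES):
    """Best-case time on the wire for one packet (no setup or contention)."""
    return packet_transfer_cycles(packet_bytes, width_bytes) / clock_hz


if __name__ == "__main__":
    print(peak_bandwidth_bytes_per_s() / 1e6, "MB/s peak")
    print(packet_transfer_cycles(), "cycles per full packet")
    print(packet_transfer_seconds() * 1e6, "microseconds per full packet")
```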

RACE NETWORK INTERCONNECT: FAT-TREE TOPOLOGY

[Figure: fat tree built from 6-port crossbars with compute nodes (CNs) at the leaves. A message path is highlighted from a message source CN through the crossbars to a message destination CN.]
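To make the fat-tree path idea concrete, the sketch below counts how many crossbars a message passes through between two CNs. It assumes each crossbar has four child ports (as the crossbar class details in this deck list), that CNs are numbered consecutively from 0 along the leaves, and that a message climbs only as far as the lowest common ancestor. The choice between the two parent ports at each level is not modeled, and the function names are hypothetical.

```python
def levels_to_common_ancestor(src_cn, dst_cn, fanout=4):
    """Number of tree levels a message must climb before it can turn downward."""
    if src_cn == dst_cn:
        raise ValueError("source and destination CN must differ")
    if src_cn < 0 or dst_cn < 0:
        raise ValueError("CN indices must be non-negative")
    level = 0
    # Integer division by the fan-out maps a node to its parent's index.
    while src_cn != dst_cn:
        src_cn //= fanout
        dst_cn //= fanout
        level += 1
    return level


def crossbars_on_path(src_cn, dst_cn, fanout=4):
    """Crossbars traversed: `level` on the way up, `level - 1` more going down."""
    return 2 * levels_to_common_ancestor(src_cn, dst_cn, fanout) - 1


if __name__ == "__main__":
    for pair in [(0, 1), (0, 4), (0, 16)]:
        print(pair, crossbars_on_path(*pair))
```

CNs that share a leaf crossbar need a single crossbar; each additional level adds two more, which is one reason the mapping of processes to CNs can affect communication time.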

STANDARD CROSSBAR PRIORITY ARBITRATION ALGORITHM TABLE

Hardware
Priority   Entry Port   Exit Port     Entry Port   Exit Port     Entry Port   Exit Port
7          F            A,B,C,D,E     F            A,B,C,D,E     F            A,B,C,D
6          E            F             E            F             A,B,C,D*     A,B,C,D*
5          A,B,C,D      F             A,B,C,D      F             A,B,C,D      F
4          E            A,B,C,D       E            A,B,C,D       -            -
3          *A,B,C,D     *A,B,C,D,E    A,B,C,D*     A,B,C,D*      -            -
2          -            -             A,B,C,D      E             -            -
1          -            -             -            -             -            -

Transaction Status column labels: Active; Port E Involved; Not Yet Active; Port E Not Involved

* - Peer Kill Rules Apply

SHARC COMPUTE NODE

[Figure: block diagram of a SHARC compute node. Labeled blocks: RACEway Interface, SHARC Processor, ECC Logic, DRAM, Performance Metering, DMA Controller, 3-Way Data Switch, RACEway Mapping Logic, OS Support Hardware, CN ASIC.]

Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)

• Parallelization Approach for STAP

SUB-CUBE BAR PARTITIONING METHODOLOGY

1. Partition STAP data cube over a 2-D process set.

2. Process the contiguous dimension.

3. Re-partition the data cube before processing the next dimension.

4. Rotate the newly distributed data to make the next dimension sequential in memory.

5. Repeat steps 1 through 4 before each processing phase.
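Step 1 of the methodology can be sketched as follows: one dimension of the cube is kept whole on every process, and the other two are split across the rows and columns of the process set. The sketch assumes balanced contiguous blocks, and assumes the first remaining axis (in the order channels, pulses, range) goes to rows and the second to columns; the deck does not specify either detail, and the function names are illustrative.

```python
AXES = ("channels", "pulses", "range")


def split_evenly(length, parts):
    """Split range(length) into `parts` contiguous (start, stop) blocks.

    Block sizes differ by at most one; the earlier blocks take the extras.
    """
    base, extra = divmod(length, parts)
    blocks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        blocks.append((start, start + size))
        start += size
    return blocks


def partition_cube(shape, whole_axis, process_set):
    """Map each (row, col) process to its sub-cube bar.

    shape: dict with keys "channels", "pulses", "range".
    whole_axis: the axis every process holds in full.
    process_set: (rows, cols) of the 2-D process set.
    """
    rows, cols = process_set
    row_axis, col_axis = [a for a in AXES if a != whole_axis]
    row_blocks = split_evenly(shape[row_axis], rows)
    col_blocks = split_evenly(shape[col_axis], cols)
    return {
        (r, c): {
            row_axis: row_blocks[r],
            col_axis: col_blocks[c],
            whole_axis: (0, shape[whole_axis]),
        }
        for r in range(rows)
        for c in range(cols)
    }


if __name__ == "__main__":
    cube = {"channels": 16, "pulses": 22, "range": 200}
    bars = partition_cube(cube, whole_axis="range", process_set=(3, 4))
    print(bars[(0, 0)])
```

Calling `partition_cube` again with a different `whole_axis` gives the target layout for the re-partitioning in step 3.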

STAP DATA CUBE PARTITIONING EXAMPLES

[Figure: two partitionings of the data cube (axes Channels, Pulses, Range) over a 3 x 4 process set with processes numbered 1 to 12. Pulse compression partitioning keeps the range dimension whole. Doppler filtering partitioning keeps the pulses dimension whole.]

STAP DATA CUBE REPARTITIONING

• Re-partitioning involves exchanging data with the next whole dimension.

• Interprocessor communication is required between processors in the same row.

[Figure: the data cube (axes Channels, Pulses, Range) distributed over a 3 x 4 process set (processes 1 to 12), first with the range dimension contiguous and then with the pulse dimension contiguous.]
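The same-row communication pattern can be listed directly. The sketch below assumes the process set is numbered row by row starting at 1, as in the 3 x 4 example, and that every process exchanges data with every other process in its row during re-partitioning. The function name is illustrative, not taken from the simulator.

```python
def row_peers(process_id, rows, cols):
    """Return the other processes in the same row of a rows x cols process set.

    Processes are numbered 1..rows*cols in row-major order.
    """
    if not 1 <= process_id <= rows * cols:
        raise ValueError("process_id outside the process set")
    row = (process_id - 1) // cols
    first_in_row = row * cols + 1
    return [p for p in range(first_in_row, first_in_row + cols) if p != process_id]


if __name__ == "__main__":
    for pid in (1, 7, 12):
        print(pid, "exchanges data with", row_peers(pid, rows=3, cols=4))
```

Under these assumptions each process sends `cols - 1` messages per re-partitioning phase, so the shape of the process set changes both the number and the size of the messages.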

STAP DATA CUBE REPARTITIONING: Data Re-distribution Mapping

[Figure: required data transfers between the pulse compression partitioning and the Doppler filtering partitioning (processes 1 to 12; axes Channels, Pulses, Range), shown alongside the network interconnection configuration: CNs attached to a 6-port crossbar, over which the interprocessor communication (IPC) takes place.]

Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)

• RACE Network Simulator

RESEARCH OBJECTIVES for SIMULATOR

1. Design and implement a network simulator that models the effect that data mapping and scheduling have on the performance of a STAP algorithm.

2. Key features of the network simulator include:
   a. Developed and implemented in an object-oriented (OO) paradigm.
   b. Implemented using a sub-cube bar partitioning scheme.
   c. Models both sub-cube bar mapping strategies and communication scheduling during both phases of data re-partitioning.
   d. Completely generic.

UML NETWORK CLASS DIAGRAM

[Figure: UML class diagram. Classes: Network, Clock, Crossbar, Routing Table, File Output, Random Scan, Data Cube, Process Set. Associations carry multiplicities of 1 and 1..*; one association is labeled "Gets Data From".]

UML CROSSBAR CLASS DIAGRAM

[Figure: UML class diagram. Classes: Crossbar, Link, Compute Node, Message Queue, Packet Stack, Message, Packet. Multiplicities shown on the associations include 1, 2, 0..*, "2,6", and "0,4".]

UML DATA CLASS DIAGRAM

[Figure: UML class diagram. Classes: Data, Message, Packet, Header, RouteList, Route. The legend marks an abstract class and inheritance; multiplicities shown include 1 and 1..*.]

NETWORK CLASS DETAILS

Network Methods
• Dynamic Network Construction
• Dynamic Routing Table Creation
• Dynamic CN and CE Message Traffic Generation
• Simulates Packet Traffic

Clock
• Based on Network Clock Frequency (factor of 5)
• Data Transfer Rate Equates to Effective Network Bandwidth

Random Scan
• Generates Pseudo-Random CN Scan Ordering

Compute Node
• Processor Information
• Outgoing and Received Message Queues
• Outgoing and Received Packet Stack

Also shown: Crossbar and Link objects.
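A pseudo-random CN scan ordering of the kind the Random Scan class generates could be produced as sketched below. The deck does not say how the ordering is generated or why; the seeded shuffle and the function name `random_scan_order` are assumptions made for illustration. A plausible motivation is that visiting CNs in a different order on each pass keeps any one CN from always being serviced first.

```python
import random


def random_scan_order(num_cns, seed):
    """Return the CN indices 0..num_cns-1 in a reproducible pseudo-random order."""
    rng = random.Random(seed)   # private generator, so results repeat per seed
    order = list(range(num_cns))
    rng.shuffle(order)
    return order


if __name__ == "__main__":
    print(random_scan_order(16, seed=1))
```

Seeding the generator keeps simulation runs repeatable, which helps when comparing mappings under otherwise identical conditions.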

CROSSBAR CLASS DETAILS

Crossbar
• Two Parent Port Connections
• Four Child Port Connections
• Internal Switch Connections
• Four CN Connections for Terminal Crossbars

Crossbar Methods
• Implements Hardware Priority Arbitration
  - TOP-LEVEL ALGORITHM
  - STANDARD ALGORITHM
• Query Port Status
• Routes Packets to Next Location
• Allocates and Frees Internal Port Connections and Connected Link Objects
• Transmits Packet Data

Link
• Connects Crossbar Objects
• Link Status: Occupied or Free

COMPUTE NODE CLASS DETAILS

Compute Node
• Processor Information
• Outgoing and Received Message Queues
• Outgoing and Received Packet Stack
• PACKETS ARE SELF-ROUTING

Compute Node Methods
• Manages Outgoing and Received Message Queues
• Manages Outgoing and Received Packet Stack
• Explodes the Top Outgoing Message into Packets of Size 2048 or Less
• Handles DMA Chaining of Packets
• Establishes Path Through Network and Transmits Packet Data

[Figure: the top message of the Outgoing Message Queue (Message 1, Message 2, Message 3, ...) is exploded (EXPLODE) into the Packet Stack (Packet 1, Packet 2, Packet 3, Packet 4, ...).]
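The "explode" step can be sketched as splitting a message of a given size into packet sizes no larger than 2048 bytes. This is a minimal illustration that tracks sizes only; headers, route lists, and DMA chaining are not modeled, and the function name is an assumption rather than the simulator's actual method.

```python
def explode(message_bytes, max_packet_bytes=2048):
    """Split a message into a list of packet sizes, each <= max_packet_bytes.

    All packets are full-size except possibly the last one.
    """
    if message_bytes < 0:
        raise ValueError("message size cannot be negative")
    if max_packet_bytes <= 0:
        raise ValueError("packet size limit must be positive")
    sizes = []
    remaining = message_bytes
    while remaining > 0:
        size = min(remaining, max_packet_bytes)
        sizes.append(size)
        remaining -= size
    return sizes


if __name__ == "__main__":
    print(explode(5000))
```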

SIMULATOR UML SEQUENCE DIAGRAM

[Figure: UML sequence diagram. Participants: User (actor), Data Cube, Process Set, Network, Crossbar, CN, Clock. Example inputs: R:200, P:22, C:16; CEs:48; X:6, Y:8; Routing:F; CN Traffic; Phase 1; DMA:Y. Interactions shown: X, Y, Mapping Matrices; Message Matrices; Build Messages; Pass 1; Pass 2; Connection/Data Transfer; Increment Simulation Clock; Clean Up; Messages; Time. The simulation passes are marked as an iterative process (*). Example result: Simulation Time = 2 ms.]
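The example inputs above pair CEs:48 with a process set of X:6, Y:8. The candidate process-set shapes for a given number of CEs are simply its factor pairs, which the sketch below enumerates. Treating every factor pair as a valid shape is an assumption; the simulator may restrict the choices further.

```python
def process_set_shapes(num_ces):
    """Return every (X, Y) with X * Y == num_ces, ordered by increasing X."""
    if num_ces <= 0:
        raise ValueError("number of CEs must be positive")
    return [(x, num_ces // x) for x in range(1, num_ces + 1) if num_ces % x == 0]


if __name__ == "__main__":
    for shape in process_set_shapes(48):
        print(shape)
```

Each shape could then be passed to the simulator in turn to compare communication times across mappings.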

COMPUTE NODE UML STATECHART: Simulation PASS 1

[Figure: UML statechart spanning the Simulation Subsystem and the Compute Node Subsystem. States: Simulate Pass 1, Message Queue Status, Explode Top Message, Packet Stack Status, Pop Top Packet, Current Packet, Generate Error Code. Transition labels: Empty, Empty - Done, Not Empty, Success, Error, Packet Found, No Packet.]

COMPUTE NODE UML STATECHART: Simulation PASS 2

[Figure: UML statechart spanning the Simulation Subsystem and the Compute Node Subsystem. States: Simulate Pass 2, Current Packet. Transition labels: Packet Found, No Packet.]

PACKET UML STATECHART: Simulation Pass 1 and Pass 2

[Figure: UML statechart for the Simulation Pass Subsystem. Packet states: Start Up, Ready, Active, Blocked, Suspended, Waiting for Kill, Completed. Transitions are marked as belonging to Pass 1 or Pass 2.]

Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)

• Preliminary Numerical Studies

PROCESS SET PERFORMANCE METRIC
Communication Phase 1

[Figure: Process Set - Phase 1 (CN:8, R:800, P:32, C:22, Routing:E). Count (0 to 60) versus Time (7 to 11 ms). Series: CN 8 (6x4), CN 8 (4x6).]

PROCESS SET PERFORMANCE METRIC
Communication Phase 2

[Figure: Process Set - Phase 2 (CN:8, R:800, P:32, C:22, Routing:E). Count (0 to 9) versus Time (28 to 40 ms). Series: CN 8 (6x4), CN 8 (4x6).]

PROCESS SET PERFORMANCE METRIC
Communication Phase 1

[Figure: Process Set - Phase 1 (CN:16, R:200, P:22, C:16, Routing:F). Count (0 to 20) versus Time (0.7 to 1.3 ms). Series: CN 16 (12x4), CN 16 (8x6), CN 16 (4x12).]

PROCESS SET PERFORMANCE METRIC
Communication Phase 2

[Figure: Process Set - Phase 2 (CN:16, R:200, P:22, C:16, Routing:F). Count (0 to 14) versus Time (2.5 to 5.5 ms). Series: CN 16 (12x4), CN 16 (8x6), CN 16 (4x12).]

PROCESS SET PERFORMANCE METRIC
Communication Phase 1

[Figure: Process Set - Phase 1 (CN:12, R:200, P:22, C:16, Routing:F). Count (0 to 50) versus Time (0.5 to 2 ms). Series: CN 12 (12x3), CN 12 (9x4), CN 12 (6x6), CN 12 (4x9).]

PROCESS SET PERFORMANCE METRIC
Communication Phase 2

[Figure: Process Set - Phase 2 (CN:12, R:200, P:22, C:16, Routing:F). Count (0 to 10) versus Time (3 to 6 ms). Series: CN 12 (12x3), CN 12 (9x4), CN 12 (6x6), CN 12 (4x9).]

PROCESS SET PERFORMANCE METRIC
Communication Phase 1

[Figure: Process Set - Phase 1 (CN:12, R:200, P:22, C:16, Routing:F). Count (0 to 50) versus Time (0 to 2 ms). Series: CN 12 (3x12), CN 12 (12x3), CN 12 (4x9).]

PROCESS SET PERFORMANCE METRIC
Communication Phase 2

[Figure: Process Set - Phase 2 (CN:12, R:200, P:22, C:16, Routing:F). Count (0 to 14) versus Time (2.5 to 6.5 ms). Series: CN 12 (3x12), CN 12 (12x3), CN 12 (4x9).]

PROCESS SET PERFORMANCE METRIC
Communication Phase 1

[Figure: Process Set - Phase 1 (CN, R:200, P:22, C:16, Routing:F). Count (0 to 60) versus Time (0 to 1 ms). Series: CN 12 (3x12), CN 16 (4x12).]

Page 52: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Process Set Performance Metric

Communication Phase 2

[Figure: "Process Set - Phase 2 (CN, R:200, P:22, C:16, Routing:F)". x-axis: Time (ms), 2.6 to 3.8. y-axis: Count, 0 to 14. Series: CN 12 (3x12), CN 16 (4x12).]

Page 53: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Message Traffic Performance Metric

Communication Phase 1

[Figure: "Message Traffic - Phase 1 (CN:16, X:12, Y:4, R:400, P:22, C:16, Routing:EF)". x-axis: Time (ms), 2 to 2.5. y-axis: Count, 0 to 9. Series: CN Traffic, CE Traffic.]

Page 54: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Message Traffic Performance Metric

Communication Phase 2

[Figure: "Message Traffic - Phase 2 (CN:16, X:12, Y:4, R:400, P:22, C:16, Routing:EF)". x-axis: Time (ms), 10 to 25. y-axis: Count, 0 to 8. Series: CN Traffic, CE Traffic.]

Page 55: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Message Traffic Performance Metric

Communication Phase 1

[Figure: "Message Traffic - Phase 1 (CN:16, X:6, Y:8, R:400, P:22, C:16, Routing:EF)". x-axis: Time (ms), 0.85 to 0.854. y-axis: Count, 0 to 60. Series: CN Traffic, CE Traffic.]

Page 56: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Message Traffic Performance Metric

Communication Phase 2

[Figure: "Message Traffic - Phase 2 (CN:16, X:6, Y:8, R:400, P:22, C:16, Routing:EF)". x-axis: Time (ms), 10 to 25. y-axis: Count, 0 to 8. Series: CN Traffic, CE Traffic.]

Page 57: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Message Traffic Performance Metric

Communication Phase 1

[Figure: "Message Traffic - Phase 1 (CN:12, X:6, Y:6, R:800, P:32, C:22, Routing:E)". x-axis: Time (ms), 4.95 to 4.952. y-axis: Count, 0 to 60. Series: CN Traffic, CE Traffic.]

Page 58: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Message Traffic Performance Metric

Communication Phase 2

[Figure: "Message Traffic - Phase 2 (CN:12, X:6, Y:6, R:800, P:32, C:22, Routing:E)". x-axis: Time (ms), 43 to 51. y-axis: Count, 0 to 10. Series: CN Traffic, CE Traffic.]

Page 59: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Message Traffic Performance Metric

Communication Phase 1

[Figure: "Message Traffic - Phase 1 (CN:12, X:9, Y:4, R:800, P:32, C:22, Routing:E)". x-axis: Time (ms), 17 to 23. y-axis: Count, 0 to 7. Series: CN Traffic, CE Traffic.]

Page 60: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Message Traffic Performance Metric

Communication Phase 2

[Figure: "Message Traffic - Phase 2 (CN:12, X:9, Y:4, R:800, P:32, C:22, Routing:E)". x-axis: Time (ms), 41 to 49. y-axis: Count, 0 to 7. Series: CN Traffic, CE Traffic.]

Page 61: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

DMA Chaining Performance Metric

Communication Phase 1

[Figure: "DMA Chaining - Phase 1 (CE:24, X:8, Y:3, R:200, P:22, C:16, Routing:F)". x-axis: Time (ms), 1.7 to 2.1. y-axis: Count, 0 to 10. Series: Chaining, No Chaining.]

Page 62: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

DMA Chaining Performance Metric

Communication Phase 2

[Figure: "DMA Chaining - Phase 2 (CE:24, X:8, Y:3, R:200, P:22, C:16, Routing:F)". x-axis: Time (ms), 2.5 to 3.5. y-axis: Count, 0 to 9. Series: Chaining, No Chaining.]

Page 63: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

DMA Chaining Performance Metric

Communication Phase 1

[Figure: "DMA Chaining - Phase 1 (CE:24, X:8, Y:3, R:400, P:22, C:16, Routing:F)". x-axis: Time (ms), 3.4 to 4.1. y-axis: Count, 0 to 9. Series: Chaining, No Chaining.]

Page 64: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

DMA Chaining Performance Metric

Communication Phase 2

[Figure: "DMA Chaining - Phase 2 (CE:24, X:8, Y:3, R:400, P:22, C:16, Routing:F)". x-axis: Time (ms), 5.2 to 6.7. y-axis: Count, 0 to 9. Series: Chaining, No Chaining.]

Page 65: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

DMA Chaining Performance Metric

Communication Phase 1

[Figure: "DMA Chaining - Phase 1 (CE:24, X:8, Y:3, R:800, P:32, C:22, Routing:F)". x-axis: Time (ms), 14 to 22. y-axis: Count, 0 to 9. Series: Chaining, No Chaining.]

Page 66: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

DMA Chaining Performance Metric

Communication Phase 2

[Figure: "DMA Chaining - Phase 2 (CE:24, X:8, Y:3, R:800, P:32, C:22, Routing:F)". x-axis: Time (ms), 21 to 27. y-axis: Count, 0 to 8. Series: Chaining, No Chaining.]

Page 67: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Adaptive Routing Performance Metric

Communication Phase 1

[Figure: "Adaptive Routing - Phase 1 (CN:16, X:8, Y:6, R:800, P:32, C:22)". x-axis: Time (ms), 7 to 13. y-axis: Count, 0 to 9. Series: Adaptive E, Adaptive F, Adaptive E/F.]

Page 68: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Adaptive Routing Performance Metric

Communication Phase 2

[Figure: "Adaptive Routing - Phase 2 (CN:16, X:8, Y:6, R:800, P:32, C:22)". x-axis: Time (ms), 26 to 46. y-axis: Count, 0 to 9. Series: Adaptive E, Adaptive F, Adaptive E/F.]

Page 69: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Adaptive Routing Performance Metric

Communication Phase 1

[Figure: "Adaptive Routing - Phase 1 (CN:16, X:8, Y:6, R:400, P:22, C:16)". x-axis: Time (ms), 1.5 to 3.5. y-axis: Count, 0 to 12. Series: Adaptive E, Adaptive F, Adaptive E/F.]

Page 70: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Adaptive Routing Performance Metric

Communication Phase 2

[Figure: "Adaptive Routing - Phase 2 (CN:16, X:8, Y:6, R:400, P:22, C:16)". x-axis: Time (ms), 7 to 13. y-axis: Count, 0 to 10. Series: Adaptive E, Adaptive F, Adaptive E/F.]

Page 71: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)

• Space-Time Adaptive Processing (STAP) Basics

• Mercury RACE Multicomputer

• Parallelization Approach for STAP

• RACE Network Simulator

• Preliminary Numerical Studies

• Conclusions

Page 72: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Conclusions

1. Designed and implemented a platform-independent simulator.

2. Simulator demonstrates that the Process Set, the CN or CE Message Traffic, the DMA chaining, the adaptive routing, and the scheduling of the messages affect performance.

3. Allows users to experiment with possible current and future configurations.

4. Communication pattern is implemented for STAP but may be used for other applications with a phased communication pattern.

Page 73: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Recent Accomplishments

• Network Communication Time Simulator for Parallel STAP

• FPGA Inner-Product Co-Processor Designs for STAP Weight Solver

• Power Prediction Simulator for the Xilinx 4000-Series FPGA

Page 74: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers

(Recent Accomplishments)

• Overview of STAP Weight Calculation

• Two Candidate STAP Weight Solvers: QR Versus CG

• Two FPGA Inner-Product Circuit Designs

• Numerical Accuracy Studies

Page 75: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

References for STAP Weight Solver and FPGA Design

J. Ward, “Space-Time Adaptive Processing for Airborne Radar,” Technical Report 1015, MIT Lincoln Laboratory, Lexington, MA, 1994.

K. C. Cain, J. A. Torres, and R. T. Williams, (R. A. Games, Project Leader), “RT_STAP: Real-Time Space-Time Adaptive Processing Benchmark,” MITRE Technical Report MTR 96B0000021, Feb. 1997.

MCARM Data Files, Rome Laboratory, (http://sunrise.oc.rl.af.mil).

D. G. Luenberger, Linear and Nonlinear Programming, Addison-Wesley, Reading, MA, 1984.

WildOne Hardware Reference Manual, Number 11927-0000, Revision 0.1, Annapolis Micro Systems, Inc., MD, 1997.

Page 76: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Typical STAP Processing Flow

[Figure: block diagram. Block labels: Input Data, Pulse Compress, Data Cube, Doppler Filter, Data Cube, Covariance Matrix, Weight Computation, Steering Vector, Weight Application, Threshold Detection, Target Decision. Data cube axis labels: pulses, range, Doppler. Percentage labels: 8%, 91.5%, 0.5%.]

Page 77: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

STAP CPI Data Cube

[Figure: data cube with three axes: PRI, indexed 1 to M (32-128); Channels, indexed 1 to N (24); Range, indexed 1 to L (625-2500).]

Page 78: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Principle Behind STAP

• Range gates are divided into non-overlapping blocks having a fixed number of range gates

• These blocks are referred to as the Range Segments

[Figure: CPI data cube with axes PRI (1 to M), Channels (1 to N), and range (1 to L), with one Range Segment of Lr range gates marked.]

Number of Range Segments = L/Lr
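The partitioning described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `range_segments` is hypothetical, and the values L = 2500 and Lr = 250 are picked from ranges that appear elsewhere in the slides, not from a stated configuration.

```python
def range_segments(num_range_gates, segment_len):
    """Split range gates 0..L-1 into non-overlapping blocks of Lr gates each.

    Hypothetical helper: assumes L is an exact multiple of Lr, so that the
    number of range segments is L / Lr as stated on the slide.
    """
    if segment_len <= 0 or num_range_gates % segment_len != 0:
        raise ValueError("L must be a positive multiple of Lr")
    count = num_range_gates // segment_len
    return [range(b * segment_len, (b + 1) * segment_len) for b in range(count)]


L, Lr = 2500, 250          # illustrative values only
segments = range_segments(L, Lr)
print(len(segments), segments[0], segments[-1])
```

Each returned block is one Range Segment; a separate covariance estimate and weight solve is then needed per segment.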

Page 79: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Space-Time Adaptive Processing

• Fully Adaptive STAP
  - Works with data on all M Doppler bins and all N channels
  - Computes and applies a separate adaptive weight to every element and Doppler bin
  - The weight vector is of size MN for each range gate.

Page 80: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Space-Time Adaptive Processing

• Characteristics of Fully Adaptive STAP
  - Requires solving a large system of linear equations
  - Size of the linear system grows with
    - Array size (the number of channels)
    - Number of pulses

Example: for each instance, if M = 32 and N = 24, then complexity ≈ (MN)^3 = 452,984,832

• Implementation of fully adaptive STAP is impractical
  - Complexity of each instance is O((MN)^3)
  - Product MN being several hundreds puts it beyond current capabilities in real-time computing
  - Instances of the problem must be solved for each range segment

Page 81: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Space-Time Adaptive Processing

• Partially Adaptive STAP
  - Problem is broken down into a number of smaller, more manageable adaptive problems
  - STAP applied to these lower dimension problems

Page 82: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Space-Time Adaptive Processing

• The partially adaptive STAP works with data on
  - All N channels
  - And a few adjacent Doppler bins, denoted as K

• Complexity is reduced to O(M(KN)^3), for K << M

Example: for each instance, if M = 32, N = 24 and K = 3, then complexity ≈ M(KN)^3 = 11,943,936
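The two complexity examples can be checked with a short calculation. The function names below are hypothetical, and the counts are the order-of-magnitude expressions (MN)^3 and M(KN)^3 from the slides, not measured flop counts.

```python
def fully_adaptive_cost(m, n):
    """Order-of-magnitude cost (MN)^3 of one fully adaptive instance."""
    return (m * n) ** 3


def partially_adaptive_cost(m, n, k):
    """Order-of-magnitude cost M(KN)^3 of one partially adaptive instance."""
    return m * (k * n) ** 3


M, N, K = 32, 24, 3
full = fully_adaptive_cost(M, N)
partial = partially_adaptive_cost(M, N, K)
reduction = full / partial      # algebraically equal to M**2 / K**3
print(full, partial, reduction)
```

With M = 32, N = 24, and K = 3, (MN)^3 evaluates to 452,984,832 exactly, and the reduction factor M^2/K^3 is about 38.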

Page 83: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Space-Time Adaptive Processing

Kth-Order Doppler Factored STAP

• Effective partially adaptive STAP technique

• The architecture consists of
  - Doppler processing across all pulse repetition intervals
  - Adaptive filtering across all channels and K adjacent Doppler bins

Page 84: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Kth-Order Doppler Factored STAP

\[ x(k,r) : \hat{N} \times 1 = 3N \times 1 \]

\[ \psi(k,b) = \frac{1}{L_r} \sum_{r=(b-1)L_r+1}^{bL_r} x(k,r)\, x^H(k,r) \]

[Figure: data cube slice spanning the bth Range Segment (with L_r cells), N Channels, and Doppler bins (k - 1), k, (k + 1).]

Data matrix needed for calculating covariance matrix for kth Doppler Bin and bth Range Segment using Kth-Order Doppler Factored STAP with K = 3

Page 85: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Matrix-Based Derivation of \psi(k,b)

\[ X(k,b) : \hat{N} \times L_r = 3N \times L_r \]

\[ \psi(k,b) = \frac{1}{L_r} \sum_{r=(b-1)L_r+1}^{bL_r} x(k,r)\, x^H(k,r) = \frac{1}{L_r} X(k,b)\, X^H(k,b) \]

The Weight Equation:

\[ \psi(k,b)\, w(k,b) = s \]
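As a small numerical check of the matrix form of the covariance estimate, the sketch below compares the sum of outer products with the product X X^H / L_r. It assumes NumPy; the function names and the random test data are hypothetical and are not taken from the project's code or data.

```python
import numpy as np


def covariance_from_sum(X):
    """psi = (1/L_r) * sum over range cells r of x(k,r) x(k,r)^H."""
    n_hat, l_r = X.shape
    psi = np.zeros((n_hat, n_hat), dtype=complex)
    for r in range(l_r):
        x = X[:, r:r + 1]                 # one column: an (n_hat x 1) snapshot
        psi += x @ x.conj().T
    return psi / l_r


def covariance_from_matrix(X):
    """psi = (1/L_r) * X X^H, the matrix form of the same estimate."""
    return (X @ X.conj().T) / X.shape[1]


rng = np.random.default_rng(0)
N, K, L_r = 4, 3, 16                      # small illustrative sizes
X = rng.standard_normal((K * N, L_r)) + 1j * rng.standard_normal((K * N, L_r))
psi_sum = covariance_from_sum(X)
psi_mat = covariance_from_matrix(X)
print(np.allclose(psi_sum, psi_mat))
```

Both forms give a Hermitian matrix of size KN by KN; the matrix form is the one that feeds the QR-based solver.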

Page 86: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers

(Recent Accomplishments)

• Overview of STAP Weight Calculation

• Two Candidate STAP Weight Solvers: QR Versus CG

• Two FPGA Inner-Product Circuit Designs

• Numerical Accuracy Studies

Page 87: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Methods for STAP Weight Calculation

• Two approaches to solve the weight equation

• QR-decomposition method (direct)

• Conjugate Gradient method (iterative)

Page 88: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

STAP Weight Calculation

Using QR-decomposition Method to solve the Weight Equation: \psi w = s

\[ \psi(k,b)\, w(k,b) = s \]

\[ \frac{1}{L_r} X(k,b)\, X^H(k,b)\, w(k,b) = s \]

Take QR Decomposition: \( X^T(k,b) = QR \)

\[ \frac{1}{L_r} R^T Q^T Q^* R^*\, w(k,b) = s \]

\[ \frac{1}{L_r} R^T R^*\, w(k,b) = s \]

Note that \( R^T = [R_1^T \; 0] \)

\[ R_1^T R_1^*\, w(k,b) = L_r s \]

Let \( R_1^*\, w(k,b) = p \)

Solve \( R_1^T p = L_r s \) for p using forward elimination

Solve \( R_1^*\, w(k,b) = p \) for w(k,b) using backward substitution
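The QR-based steps (factor X^T, then one forward elimination and one backward substitution) can be sketched as follows. This is a minimal illustration assuming NumPy, a full-rank data matrix with L_r at least as large as the number of rows, and hypothetical function names; it is not the project's implementation.

```python
import numpy as np


def forward_elimination(lower, rhs):
    """Solve lower @ p = rhs for a lower-triangular matrix."""
    n = lower.shape[0]
    p = np.zeros(n, dtype=complex)
    for i in range(n):
        p[i] = (rhs[i] - np.dot(lower[i, :i], p[:i])) / lower[i, i]
    return p


def backward_substitution(upper, rhs):
    """Solve upper @ w = rhs for an upper-triangular matrix."""
    n = upper.shape[0]
    w = np.zeros(n, dtype=complex)
    for i in range(n - 1, -1, -1):
        w[i] = (rhs[i] - np.dot(upper[i, i + 1:], w[i + 1:])) / upper[i, i]
    return w


def qr_weight_solve(X, s):
    """Solve (1/L_r) X X^H w = s through the QR factors of X^T."""
    l_r = X.shape[1]
    _, r1 = np.linalg.qr(X.T)                  # reduced QR: X^T = Q R_1
    p = forward_elimination(r1.T, l_r * s)     # R_1^T p = L_r s
    return backward_substitution(r1.conj(), p) # R_1^* w = p


rng = np.random.default_rng(1)
X = rng.standard_normal((12, 48)) + 1j * rng.standard_normal((12, 48))
s = np.ones(12, dtype=complex)                 # placeholder steering vector
w_qr = qr_weight_solve(X, s)
psi = (X @ X.conj().T) / X.shape[1]
print(np.allclose(psi @ w_qr, s))
```

Note that the covariance matrix is never formed inside the solver; only the triangular factor R_1 of the data matrix is used.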

Page 89: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

STAP Weight Calculation

Using Conjugate Gradient Method to solve the Weight Equation: \psi w = s

Initialization

Choose \( w_0 \), set \( d_0 = -g_0 = s - \psi w_0 \)

Iteration

\[ w_{i+1} = w_i - \frac{g_i^T d_i}{d_i^T \psi d_i}\, d_i \]

\[ g_{i+1} = \psi w_{i+1} - s \]

\[ d_{i+1} = -g_{i+1} + \frac{g_{i+1}^T \psi d_i}{d_i^T \psi d_i}\, d_i \]
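The conjugate gradient iteration can be sketched as below. This assumes NumPy and a Hermitian positive definite covariance matrix; because the test data is complex, the sketch uses conjugate inner products (`np.vdot`) where the slide writes transposes, which is an assumption of this example. The stopping rule (residual norm relative to the norm of s) and the function name are also hypothetical.

```python
import numpy as np


def cg_weight_solve(psi, s, tol=1e-10, max_iter=1000):
    """Conjugate gradient iteration for psi w = s (psi Hermitian positive definite)."""
    w = np.zeros_like(s, dtype=complex)        # w_0 = 0
    g = psi @ w - s                            # g_0 = psi w_0 - s
    d = -g                                     # d_0 = -g_0
    s_norm = np.linalg.norm(s)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol * s_norm:  # assumed stopping rule
            break
        psi_d = psi @ d
        curvature = np.vdot(d, psi_d).real     # d^H psi d, real for Hermitian psi
        w = w - (np.vdot(d, g) / curvature) * d
        g = psi @ w - s
        d = -g + (np.vdot(psi_d, g) / curvature) * d
    return w


rng = np.random.default_rng(2)
A = rng.standard_normal((8, 32)) + 1j * rng.standard_normal((8, 32))
psi = (A @ A.conj().T) / 32                    # synthetic covariance estimate
s = np.ones(8, dtype=complex)                  # placeholder steering vector
w_cg = cg_weight_solve(psi, s)
print(np.linalg.norm(psi @ w_cg - s))
```

Loosening `tol` ends the iteration earlier, trading accuracy of the weight vector for fewer matrix-vector products, which is the tradeoff examined in the numerical studies.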

Page 90: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Preliminary Numerical Studies

[Figure: two plots of Relative Error versus Tolerance on logarithmic axes, one for Lr = 250 and one for Lr = 125. Tolerance axis: 10^-8 to 10^-1. Relative Error axis: 10^-7 to 10^-1 (Lr = 250) and 10^-9 to 10^0 (Lr = 125).]

\[ \text{Relative Error} = \frac{\| w_{qr} - w_{cg} \|}{\| w_{qr} \|} \]

Page 91: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Preliminary Numerical Studies

[Figure: two plots of Flop Count (10^8 to 10^10) versus Tolerance (10^-8 to 10^-1), one for Lr = 125 and one for Lr = 250. Series in each plot: CG, QR.]

Page 92: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers

(Recent Accomplishments)

• Overview of STAP Weight Calculation

• Two Candidate STAP Weight Solvers: QR Versus CG

• Two FPGA Inner-Product Circuit Designs

• Numerical Accuracy Studies

Page 93: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Motivation for FPGA Inner-Product Co-Processors

• Inner-products are a core calculation for both CG- and QR-based STAP weight solvers

• Computations are highly numeric and regular

• Opportunities to exploit reduced precision arithmetic

• Control flow of CG and QR best implemented on GPP or DSP - Inner product calculations can be offloaded to available FPGA resources

Page 94: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Overview of WildOne Architecture

[Figure: block diagram. Labels: PCI Bus, Dual Port Mem Controller 0, Dual Port Mem Controller 1, Processing Element 0, Processing Element 1, FIFO 0, FIFO 1, SIMD Connector, External I/O Connector.]

Page 95: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

FPGA Inner Product Co-Processor: Design 1

• Multiply-Accumulate Pipe
  - Reads two operands per cycle
  - Performs two operations per cycle
  - Performs exponent normalization prior to accumulation
  - 2 N-vectors reduced to a constant number of partial sums

[Figure: circuit diagram. Labels: Host Processor, Interconnection Bus, FPGA Board, Buffer (two), inputs a and b, multiplier (X), sign of a, sign of b, 1's comp/register, Normalizing unit, Sign + 16 bit mantissa, adder (+), Output Register.]

Page 96: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

FPGA Inner Product Co-Processor: Design 2

• Multiply-Add Reduction Pipe
  - Reads four operands per cycle
  - Performs three operations per cycle
  - No normalization required
  - 2 N-vectors reduced to N/2 partial sums

• Basic Tradeoff: First design has lower throughput, but can perform more work

[Figure: circuit diagram. Labels: Host Processor, Interconnection Bus, FPGA Board, Buffer (two), Data for First Multiplier, Data for Second Multiplier, multipliers (X, X), Sign a, Sign b, 1's comp/register, adder (+), Sign + 16 bit mantissa, 2 ff, Unit clocked here.]
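The difference between the two pipes can be illustrated in software. The sketch below models only the arithmetic pattern, not the circuits: the function names are hypothetical, ordinary Python numbers stand in for the block floating point format, and the round-robin assignment of products to partial sums in Design 1 is an assumption made for illustration.

```python
def design1_multiply_accumulate(a, b, num_partial_sums=1):
    """Multiply-accumulate: one product per step, folded into a fixed set of sums."""
    partial = [0.0] * num_partial_sums
    for i, (x, y) in enumerate(zip(a, b)):
        partial[i % num_partial_sums] += x * y   # two operations: multiply, accumulate
    return partial


def design2_multiply_add(a, b):
    """Multiply-add reduction: two products added per step, N/2 partial sums out."""
    if len(a) != len(b) or len(a) % 2 != 0:
        raise ValueError("vectors must have equal, even length")
    # Four operands in, three operations (two multiplies, one add) per step.
    return [a[i] * b[i] + a[i + 1] * b[i + 1] for i in range(0, len(a), 2)]


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [1.0] * 8
sums1 = design1_multiply_accumulate(a, b, num_partial_sums=2)
sums2 = design2_multiply_add(a, b)
print(sums1, sums2, sum(sums1), sum(sums2))
```

In this model Design 1 leaves a constant number of values for the host to add regardless of N, while Design 2 leaves N/2 values, so more of the reduction remains on the host.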

Page 97: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

UML Description of Basic Co-Processor Design

[Figure: UML class diagram. Classes: Inner-Product Co-Processor, Block Floating Point Unit. Multiplicity labels: 1, 1.]

Page 98: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

UML Description of Block Floating Point Unit

[Figure: UML class diagram. Classes: Block Floating Point Unit, Multiplying Unit, Complementor, Normalizing Unit, Accumulator. All multiplicity labels: 1.]

Page 99: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

UML Description of Multiplying Unit

[Figure: UML class diagram. Classes: Multiplying Unit, Register, 4-Bit Adder, Multiply Stage. Multiplicity labels as extracted: 1, 1, 132, 4, 8.]

Page 100: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

UML Description of Accumulator Unit

[Figure: UML class diagram. Classes: Accumulator, 4-Bit Adder, Register, 3-Bit Adder. Multiplicity labels as extracted: 1, 5, 1, 1, 124.]

Page 101: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

UML Description of Normalizing Unit

[Figure: UML class diagram. Classes: Normalizing Unit, Subtractor, Register, Magnitude Comparator. Multiplicity labels: *, 1, 1, 1, 1, 1.]

Page 102: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Sequence Diagram for Interactions between Host and FPGA Board

Participants: Host Program, Wild-One. Messages, in time order:

1. Open board

2. Program the board with the image

3. Interrupt for Exponent

4. Exponent written to FIFO

5. Interrupt for Mantissa Vectors

6. Mantissa Vectors written to the FIFO

7. Processing done; answer in FIFO/Memory

8. Read back the answer

9. Close the board

Page 103: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Statechart Diagram for Interactions between Host and FPGA Board

[Figure: statechart. Partitions: Host System, FPGA Board, Processing Sub-System. State labels: Get Exponent, Write Exponent, Wait for Exponent Int, Read Exponent, Get Mantissa, Write Mantissa, Wait for Mantissa Int, Read Mantissa, Multiply-and-add/accumulate, Write Back, Wait for Answer Int, Read Back Answer. Transition labels: Int Req, Int Ack, Req = 1, Req = 0, Ack = 1, Ack = 0, Done = 1.]

Page 104: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Circuit Activity Diagram: Design 1

[Figure: activity diagram. Activities: Read Two Operands, Multiply, Accumulate, Feedback Sum, Increment Count, Compare Count, Write to Memory, Set Done flag. Guards: [Count = Threshold], [Count ≠ Threshold].]

Page 105: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Circuit Activity Diagram: Design 2

[Figure: activity diagram. Activities: Read Two Operands, Multiply, Read Next Two Operands, Multiply, Add, Write to Memory, Increment Count, Compare Count, Set Done flag. Guards: [Count = Threshold], [Count ≠ Threshold].]

Page 106: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers

(Recent Accomplishments)

• Overview of STAP Weight Calculation

• Two Candidate STAP Weight Solvers: QR Versus CG

• Two FPGA Inner-Product Circuit Designs

• Numerical Accuracy Studies

Page 107: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Setup for Numerical Accuracy Studies

• Randomly generated, 512-element test vectors processed by both designs

• Range of vectors’ data values controlled to study the effect of dynamic range on accuracy

• Output of each circuit compared to corresponding results calculated on host (using IEEE 32-bit floating point arithmetic)

• Accuracy metric is the ratio of each obtained value to the corresponding IEEE floating point value
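The accuracy metric can be illustrated with a software model of a block floating point inner product. Everything in this sketch is an assumption made for illustration: one shared exponent per vector, signed 16-bit mantissas obtained by truncation, exact integer accumulation, and a Python double-precision reference instead of the IEEE 32-bit host arithmetic used in the study. It does not model either FPGA design.

```python
import math
import random

MANTISSA_BITS = 16   # assumed mantissa width (sign handled separately)


def to_block_floating_point(values):
    """Return (shared exponent, integer mantissas) for a vector."""
    largest = max(abs(v) for v in values)
    if largest == 0.0:
        return 0, [0] * len(values)
    _, exponent = math.frexp(largest)            # largest = f * 2**exponent, 0.5 <= f < 1
    scale = 2.0 ** (MANTISSA_BITS - exponent)
    return exponent, [int(v * scale) for v in values]   # int() truncates toward zero


def block_inner_product(a, b):
    """Inner product computed on the shared-exponent integer representation."""
    exp_a, man_a = to_block_floating_point(a)
    exp_b, man_b = to_block_floating_point(b)
    acc = sum(x * y for x, y in zip(man_a, man_b))      # exact integer arithmetic
    return acc * 2.0 ** (exp_a + exp_b - 2 * MANTISSA_BITS)


def accuracy_ratio(a, b):
    """Ratio of the block floating point result to the floating point reference."""
    reference = sum(x * y for x, y in zip(a, b))
    return block_inner_product(a, b) / reference


random.seed(0)
a = [random.uniform(0.001, 1.0) for _ in range(512)]
b = [random.uniform(0.001, 1.0) for _ in range(512)]
ratio = accuracy_ratio(a, b)
print(ratio)
```

In this model a single large element raises the shared exponent for the whole vector, so small elements keep fewer significant mantissa bits; that is one way a wide dynamic range can lower the ratio.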

Page 108: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Zero Order of Magnitude Experiment

[Figure: four histograms, each with y-axis Frequency.
Data Histogram: x-axis 0.001 to 1.000, Frequency 0 to 50.
Exponent Histogram: x-axis 114 to 134, Frequency 0 to 600.
Accuracy Histogram, Design 2: x-axis 0.9984 to 1.0000, Frequency 0 to 180.
Accuracy Histogram, Design 1: x-axis 0.999855 to 0.99989, Frequency 0 to 8.]

Page 109: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Two Orders of Magnitude Experiment

[Figure: four histograms, each with y-axis Frequency.
Data Histogram: x-axis 0 to 110, Frequency 0 to 50.
Exponent Histogram: x-axis 119 to 145, Frequency 0 to 500.
Accuracy Histogram, Design 1: x-axis 0.999893 to 0.999927, Frequency 0 to 7.
Accuracy Histogram, Design 2: x-axis 0.99399 to 0.99999, Frequency 0 to 250.]

Page 110: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Four Orders of Magnitude Experiment

[Figure: four histograms, each with y-axis Frequency.
Data Value Histogram: x-axis 0 to 10985, Frequency 0 to 50.
Exponent Histogram: x-axis 119 to 145, Frequency 0 to 450.
Accuracy Histogram, Design 1: x-axis 0.999889 to 0.99993, Frequency 0 to 9.
Accuracy Histogram, Design 2: x-axis 0.467 to 1.000, Frequency 0 to 300.]

Page 111: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor

Five Orders of Magnitude Experiment

[Figure: four histogram panels, each with Frequency on the vertical axis. Accuracy Histogram, Design 1 (axis labeled 0.999912 to 0.999998); Data Value Histogram (data values 0 to 103008); Exponent Histogram (exponents 119 to 143); Accuracy Histogram, Design 2 (accuracy 0.00000 to 0.99998).]

“Outlier” Experiment

[Figure: four histogram panels, each with Frequency on the vertical axis, with the outlier marked on the plots. Accuracy Histogram, Design 2 (accuracy 0.00 to 0.92); Exponent Histogram (exponents 114 to 138); Data Value Histogram (data values 0.0009 to 1000.0000); Accuracy Histogram, Design 1 (accuracy 0.593067 to 0.78369).]

Conclusions

• CG weight solver provides a tradeoff between accuracy and required FLOPs (compared to QR weight solver)

• Tradeoff between two FPGA designs: Design 1 (Mult & Accum) has lower peak throughput, but can perform more total work than Design 2

• Block floating point provides acceptable accuracy for uniformly distributed data over reasonable dynamic ranges

• Block floating point accuracy breaks down when there are a few large outliers in the data set
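
The outlier effect in the last bullet can be illustrated with a small sketch. It assumes one common form of block floating point, in which every value in a block shares a single exponent set by the largest magnitude. The function name, the 8-bit mantissa width, and the data are hypothetical and are not the FPGA designs evaluated in the experiments.

```python
import math

def block_quantize(values, mantissa_bits=8):
    """Quantize a block of floats to signed mantissas that share one exponent."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return [0.0 for _ in values]
    exponent = math.frexp(peak)[1]                   # peak = m * 2**exponent, 0.5 <= m < 1
    step = 2.0 ** (exponent - (mantissa_bits - 1))   # value of one mantissa LSB
    limit = 2 ** (mantissa_bits - 1) - 1             # largest signed mantissa
    quantized = []
    for v in values:
        mantissa = max(-limit, min(limit, round(v / step)))
        quantized.append(mantissa * step)
    return quantized

uniform = [1.0, 50.0, 100.0]
with_outlier = uniform + [1.0e6]
print(block_quantize(uniform))       # [1.0, 50.0, 100.0]
print(block_quantize(with_outlier))  # [0.0, 0.0, 0.0, 999424.0]
```

In this sketch the single large value sets the shared exponent, so one mantissa step becomes 8192 and the three small values round to zero.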


Recent Accomplishments

• Network Communication Time Simulator for Parallel STAP

• FPGA Inner-Product Co-Processor Designs for STAP Weight Solver

• Power Prediction Simulator for the Xilinx 4000-Series FPGA


Power Prediction Simulator for the Xilinx 4000-Series FPGA

(Recent Accomplishments)

• CMOS Power Consumption and Past Research

• Design and Implementation of the Power Prediction Simulator

• Preliminary Experimental Results

• Conclusions and Current Work, Demo


References for FPGA Power Prediction

K. P. Parker and E. J. McCluskey, “Probabilistic Treatment of General Combinatorial Networks,” IEEE Trans. Computers, Vol. C-24, June 1975, pp. 668-670.

Kaushik Roy and Sharat Prasad, “Circuit Activity Based Logic Synthesis for Low Power Reliable Operations,” IEEE Trans. VLSI Systems, Vol. 1, No. 4, Dec. 1993, pp.

Kaushik Roy, “Power Dissipation Driven FPGA Place and Route under Timing Constraints,” School of Electrical and Computer Engineering, Purdue University.

“XC4000 Series Field Programmable Gate Arrays,” Xilinx, Inc., September 18, 1996.

Power Dissipation in CMOS

• Leakage Current

• Dynamic Capacitance Charging Current

• Transient Current

Labels on the diagram: “Most important for CMOS”; “Dependent on clock frequency”; “Dependent on signal activity” (appears twice).

Power Equations

Equivalent model of a transistor’s gate...

$v_c(t) = V\left(1 - e^{-t/RC}\right)$

$v_R(t) = V e^{-t/RC}$

$p_R(t) = \frac{V^2 e^{-2t/RC}}{R}$

$p_{avg} = \frac{1}{\tau}\int_0^{\tau} \frac{V^2 e^{-2t/RC}}{R}\,dt = \frac{CV^2}{2\tau}\int_0^{\tau} \frac{2}{RC}\,e^{-2t/RC}\,dt$

$p_{avg} = \frac{CV^2}{2\tau}\left[-e^{-2t/RC}\right]_0^{\tau} \approx \frac{CV^2}{2\tau}$
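
The approximation in the last step can be checked numerically. The sketch below compares the closed-form average power $\frac{CV^2}{2\tau}\left(1 - e^{-2\tau/RC}\right)$ with $\frac{CV^2}{2\tau}$. The function names and component values are hypothetical, chosen only so that the period $\tau$ is much longer than $RC$.

```python
import math

def avg_power_exact(c, v, r, tau):
    """Average power dissipated in R over one period tau while C charges to V."""
    return (c * v**2) / (2 * tau) * (1 - math.exp(-2 * tau / (r * c)))

def avg_power_approx(c, v, tau):
    """Approximation that holds when tau is much larger than RC."""
    return (c * v**2) / (2 * tau)

# Hypothetical values (not from the slides): 10 pF, 5 V, 1 kOhm, 100 ns period.
C, V, R, TAU = 10e-12, 5.0, 1e3, 100e-9
exact = avg_power_exact(C, V, R, TAU)
approx = avg_power_approx(C, V, TAU)
print(f"exact = {exact:.6e} W, approx = {approx:.6e} W")
```

With these values $\tau = 10\,RC$, so the neglected exponential term is about $e^{-20}$ and the two results agree to better than one part in a million. The resistance drops out of the approximation entirely.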

Probabilistic Modeling

p(s): the probability that signal s attains a logical value of true at any given clock cycle.

A(s): the probability that signal s transitions at any given clock cycle.

p(clock) = 0.50, A(clock) = 1.0

p(x1) = 0.88, A(x1) = 0.10

p(x2) = 0.29, A(x2) = 0.17

p(x3) = 0.69, A(x3) = 0.27

Probabilistic Modeling

[Figure: logic circuit with inputs x1, x2, x3 and output y, shown with a waveform for each signal.]

x1(t): p=0.88, A=0.10

x2(t): p=0.29, A=0.17

x3(t): p=0.69, A=0.27

x1x2(t): p=0.83, A=0.17

x1x2x3(t): p=0.10, A=0.13

Calculation of average power:

$P_{avg} = \frac{1}{2} \sum_{g \in \text{all gates}} C_g V^2 A_g$
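
As a rough illustration of the average power sum, the sketch below evaluates $\frac{1}{2}\sum_g C_g V^2 A_g$ for the five activities listed on this slide. The node capacitance (1 pF each), the 5 V supply, and the optional scaling by clock frequency are assumptions made here for illustration and are not values from the slides.

```python
def average_power(gates, vdd, clock_hz=None):
    """Sum 0.5 * C_g * V^2 * A_g over all gates.

    gates: iterable of (capacitance_farads, activity) pairs.
    If clock_hz is given, the per-cycle sum is scaled by the clock
    frequency (an assumption added here, mirroring a 1/tau factor).
    """
    per_cycle = 0.5 * sum(c * vdd**2 * a for c, a in gates)
    return per_cycle if clock_hz is None else per_cycle * clock_hz

# Activities are the five A values on the slide; capacitances are hypothetical.
activities = [0.10, 0.17, 0.27, 0.17, 0.13]
gates = [(1e-12, a) for a in activities]  # assume 1 pF per node
per_cycle = average_power(gates, vdd=5.0)
at_10_mhz = average_power(gates, vdd=5.0, clock_hz=10e6)
print(per_cycle, at_10_mhz)
```

Signals with higher activity contribute proportionally more to the total, which is why the activity of every internal signal has to be known before the sum can be evaluated.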

Probabilistic Equations

Signal probability transformations...*

$p(y) = \sum_{i=1}^{k} P(\pi_i = 1)$, where $f = \sum_{i=1}^{k} \pi_i$

Signal activity transformations...†

$A(y) = \sum_{\forall X} P(X) \cdot \Big( \sum_{i=1}^{n} \big[ f(X) \oplus f(X; z_i) \big] P(z_i x_i) \prod_{j \ne i} \big[ 1 - P(z_j x_j) \big]$
$\quad + \frac{1}{2} \sum_{i,j=1,\ i \ne j}^{n} \big[ f(X) \oplus f(X; z_i, z_j) \big] P(z_i x_i) P(z_j x_j) \prod_{k \notin \{i,j\}} \big[ 1 - P(z_k x_k) \big]$
$\quad + \frac{1}{3} \sum_{i,j,k=1,\ i \ne j \ne k}^{n} \big[ f(X) \oplus f(X; z_i, z_j, z_k) \big] P(z_i x_i) P(z_j x_j) P(z_k x_k) \prod_{l \notin \{i,j,k\}} \big[ 1 - P(z_l x_l) \big] + \cdots \Big)$

* Probabilistic Treatment of General Combinatorial Networks

† Estimation of Circuit Activity Considering Signal Correlations and Simultaneous Switching


FPGA Design

FPGA internal structure design...

[Figure: FPGA internal structure, with elements labeled CLB, IOB, and BUF.]


Routing Fabric Design

Example routings...

Xilinx 4000 series routing fabric is very intricate.

Xilinx synthesis tools use shortest path routing where possible.

The distance the signal travels is the metric considered in this model.

Signal Design

• Symbolic Probability

• Numeric Probability

• Numeric Activity

• Signal Reference

• Manhattan Distance

[Figure: two CLBs with a local signal (L) and a remote signal (R).]
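
A minimal sketch of the Manhattan distance attribute follows. It assumes that each CLB is identified by a (row, column) position on a grid and that a signal counts as local when its source and sink share a CLB. Both conventions, and the function names, are hypothetical and are not taken from the simulator.

```python
def manhattan_distance(src, dst):
    """Manhattan distance between two (row, column) CLB positions."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def is_local(src, dst):
    """Treat a signal as local when source and sink share a CLB (assumption)."""
    return manhattan_distance(src, dst) == 0

print(manhattan_distance((2, 3), (5, 1)))  # 5
print(is_local((4, 4), (4, 4)))            # True
```

A distance of this kind is cheap to compute from placement alone, which suits a model in which the distance a signal travels is the routing metric.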

Iteration Example

[Figure: six LUTs connected through an interconnection block, with two connections labeled 4.]

Iteration Example

[Figure: the six-LUT example with each signal marked R (remote) or L (local).]

Probabilistic Feedback Example

[Figure: feedback circuit with inputs a, b, e (probabilities p_a, p_b, p_e) and internal signals d and c (probabilities p_d, p_c), where d = a + bc and c = d e.]

• Feedback Circuits Require Symbolic Iteration of Probability Expressions

• Assume p_a, p_b, p_e are known; then p_d and p_c are determined using iteration

Iteration 1:

p_d = p_a

p_c = p_a p_e

Iteration 2:

p_d = p_a + p_a p_b p_e

p_c = (p_a p_e + p_a p_b p_e) p_e = p_a p_e

Iteration 3:

p_d = p_a + p_a p_b p_e

p_c = p_a p_e



Experimental Results

Probabilistic signals are correctly propagated through combinational and sequential logic.

Configurations making use of feedback converge for all test cases.

Probabilistic modeling is more than an order of magnitude faster than time-domain modeling techniques.

Convergence of Probabilistic Signals

[Figure: Probability Convergence. % Convergence (0 to 120) versus Iterations (0 to 10) for the test cases Adder4, FIFO, PipeAdder, and Mult32.]

All test cases converged in the following manner:

Steep Slope: Signals not involved with feedback rapidly propagated through the FPGA.

Plateau: Signals dependent on feedback converge slowly.


Symbolic Term Explosion

Mixing 12 signals this way...

…gives 6 signals with at most 4 terms.

Mixing 12 signals this way...

…gives 1 signal with at most 4096 terms.
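
Both counts are consistent with a bound of 2^n symbolic terms for a signal that mixes n input signals. The sketch below works through that arithmetic under this assumption; the function name is hypothetical.

```python
def max_terms(num_inputs):
    """Assumed upper bound on symbolic terms for one signal mixing num_inputs signals."""
    return 2 ** num_inputs

pairwise = [max_terms(2) for _ in range(12 // 2)]  # six signals, each mixing 2 inputs
single = max_terms(12)                             # one signal mixing all 12 inputs
print(len(pairwise), max(pairwise), single)        # 6 4 4096
```

Keeping fan-in small therefore keeps each symbolic expression short, while funnelling many signals into one expression grows the term count exponentially.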


Power Measurements

• Heat Measurements

• Developed hardware instrumentation to measure surface temperature of FPGA

• Thermistor attached to FPGA with heat conductive epoxy

• Instrumentation accurate to within 0.1 degrees F

Frequency Response of the FPGA

• The FPGA consumes more power as its clock frequency rises.

• The simulator gives 125 mW + 43.6 mW/MHz for this situation.

[Figure: Surface Temperature versus Frequency. Temperature (F), 120 to 180, versus Frequency (MHz), 0 to 50.]
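
The simulator's linear result can be evaluated at a few clock frequencies as a quick sanity check. The sketch below uses the two coefficients quoted above; the function and parameter names are hypothetical, and the frequencies are simply sample points within the plotted 0 to 50 MHz range.

```python
def predicted_power_mw(freq_mhz, offset_mw=125.0, slope_mw_per_mhz=43.6):
    """Linear power model: offset plus a term proportional to clock frequency."""
    return offset_mw + slope_mw_per_mhz * freq_mhz

for f in (0, 10, 25, 50):
    print(f"{f:>2} MHz -> {predicted_power_mw(f):.1f} mW")
```

Under this model the predicted power rises from 125 mW at 0 MHz to 2305 mW at 50 MHz.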



Conclusions and Current Work

• Designed and Implemented power prediction simulator for Xilinx 4000 series FPGAs.

• Inputs to simulator:
– Place & Route bit stream (from Xilinx Tool)
– Activity and Probability factors for pin signals

• Simulator calculates probabilities and activities for all internal signals

• Tool outputs power consumption of FPGA chip

• Currently calibrating/tuning simulator using both heat and DC current measurement cross-calibration methods


Deliverables

• Prototype VME-Based GPP/DSP/FPGA platform
– 20 Slot Chassis with SPARC 5V Host
– 9U VME RACE Board
– 2 SHARC Daughtercards: 12 SHARCs, 48MB
– 2 PowerPC Daughtercards: 4 PowerPCs, 64MB
– VME WILDFIRE Array Card (16 Xilinx 4028EX-3s)

• FPGA Power Prediction Simulator
– Simulator Input: Probabilistic Input Data Characteristics; FPGA configuration data file
– Simulator Output: Power Prediction to within 10% relative accuracy (expected)
– Will demonstrate fidelity across different applications and even different implementations of the same design
– Will operate at interactive speeds
– Completely Portable Java Implementation

• Network Simulator for Parallel STAP
– Network Feature Inputs: number and types of switching elements; interconnection scheme; number and types of processors at each network port, etc.
– Data Mapping Input: Data layout across the processors for each phase of processing
– Data Ordering Input: Order in which data items at each network port are to be transmitted
– Simulator Output: Number of network cycles required for all phases of STAP communication
– Relative accuracy of simulator 10% (expected)
– Will operate at interactive speeds
– Completely Portable Java Implementation

• Linear Filtering Implementation on FPGA
– Investigation of different data formats and arithmetic approaches for FPGA calculations
– Demonstrate performance improvement (throughput and/or power) over GPP/DSP implementation

• STAP Weight Equation Solver on GPP/DSP/FPGA System
– Investigation of different data formats and arithmetic approaches for FPGA calculations
– Demonstrate performance improvement (throughput and/or power) over GPP/DSP implementation

• Optimal configuration techniques for executing SAR on GPP/DSP/FPGA system
– Based on optimally balancing memory and processor utilization, selection of most appropriate data formats and arithmetic techniques, etc.
– Will utilize the FPGA power prediction simulator
– Will optimally integrate most appropriate FPGA circuit implementations and GPP/DSP algorithms
– Optimization techniques based on proven mathematical programming methods
– Will demonstrate 2 to 10 times power savings over nominal configurations of GPP/DSP systems

• Optimal configuration techniques for executing STAP on GPP/DSP/FPGA system
– Techniques based on optimal data layout to minimize latency through interconnection network, optimal combined use of processors and FPGAs for intensive weight calculation, will include desired numerical accuracy as an input parameter
– Will utilize the FPGA power prediction simulator and the network simulator for parallel STAP
– Will demonstrate 2 to 10 times power savings over nominal configurations of GPP/DSP systems
– Optimization techniques based on proven mathematical programming methods

• Optimal configuration techniques for SAR and STAP on GPP/DSP/FPGA system
– Will generalize the SAR-only and STAP-only configuration techniques
– Will consider how to best configure the GPP/DSP/FPGA to simultaneously satisfy both the SAR and STAP requirements and minimize power consumption
– Will demonstrate 2 to 10 times power savings over nominal configurations of GPP/DSP systems
– Optimization techniques based on proven mathematical programming methods