TRANSCRIPT
Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP

by John K. Antonio
Department of Computer Science
College of Engineering, Texas Tech University

First Annual Review, June 23, 1998
Outline
• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Recent Accomplishments
• Status of Deliverable Checklist
Configuring Combined GPP/DSP/FPGA Systems for Minimal SWAP

Applications
• SAR
• STAP

Requirements
• Throughput
• SWAP

• Combined Technology
• Minimal SWAP Configuration
• Mixed-Mode Operation
• Demonstration
Texas Tech University: John K. Antonio
New Ideas
• Systematic determination of minimal SWAP configuration based on proven mathematical programming techniques
• Optimal configuration based on automatic "tuning" of system design parameters
  - number and types of cards used
  - data mapping and communication schemes
  - place and route schemes
• Novel computing techniques based on characteristics of GPP/DSP/FPGA system
Schedule (start Jun 97; milestones Jun 98, Jun 99; end Dec 99)
• Develop optimal configuration techniques
• Construction and integration of GPP/DSP/FPGA system
• Implement and test optimal configurations on GPP/DSP/FPGA system
• Develop practical design methods based on SAR and STAP applications
• Demonstrate advantages of combining technologies
Impact
• Embedded systems requirements for the 21st century can be satisfied with the combined use of GPP, DSP, and FPGA technologies
• Demonstrate use of FPGA boards as co-processors for embedded multiprocessor GPP and DSP systems
• Demonstrate systematic approaches to optimally configure GPP/DSP/FPGA systems for minimal SWAP for embedded applications
Outline
• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Recent Accomplishments
• Status of Deliverable Checklist
Personnel (Program Management Status)
• John K. Antonio, Principal Investigator
• Ph.D., EE, Texas A&M Univ. (1989)
• Currently Assoc. Prof. of CS, Texas Tech Univ.
• Over 65 publications in HPC and related areas
• PI or co-PI of 17 contracts/grants
totaling over $2.1M
• Jeff Muehring, Research Assistant, Ph.D. student
Optimal GPP/DSP/FPGA Configuration Techniques for SAR
Intern at IBM/Houston, 1/98 to 6/98
• Jack West, Research Assistant, Ph.D. student
Optimal Mapping, Scheduling, and Configuration Techniques for STAP; Network Simulator
Personnel (Program Management Status)
• Nikhil Gupta, Research Assistant, M.S. student
Algorithms for STAP Weight Calculation Mapping Inner Product Computations onto FPGAs
Graduating July 1998
• Tim Osmulski, Research Assistant, M.S. student
Power Prediction Simulator for FPGAs
Graduated May 1998
Personnel (Program Management Status)
• Brian Veale, Research Assistant, M.S. student
Calibration of FPGA power prediction model; Implementation of STAP core on GPP/FPGA
New RA as of May 1998
• New Student, Research Assistant, M.S. student
Implementation of SAR core on GPP/FPGA
To be hired September 1998
Contacts, Partners, Vendors, and Other Communications (Program Management Status)

José Muñoz, DARPA
Ralph Kohler, Rome Lab
MIT Lincoln Lab: David Martinez, Jim Ward
MITRE: Richard Games
Northrop Grumman: Marc Campbell
Synplicity, Inc.: Madelyn Miller
Xilinx: Jason Feinsmith
Annapolis Micro Systems: Jenny Donaldson, Bill Hulbert, Paul Kowalewski
ISI: Milissa Benincasa, David Coker
Mercury Computer: Thomas Einstein, Ed Holstien, Craig Lund, Dave Toms
Equipment Status (Program Management Status)

Mercury:
• 20-Slot Hybrid Chassis with SPARC 5V
• Solaris 2.5 with C Compiler
• 9U VME RACE Board
• SHARC Daughtercard (2 CNs, 8 MB/CN)
• SHARC Daughtercard (2 CNs, 16 MB/CN)
• SHARC Daughtercard (2 CNs, 16 MB/CN)
• MC/OS, Cross Assembler, Toolkit
• PowerPC Daughtercard (2 CNs, 16 MB/CN)

Annapolis Micro Systems:
• PCI WILDONE Card (1 Xilinx 4028EX-3)
• VME WILDFIRE Array Card (16 Xilinx 4028EX-3s)

Other Vendors:
• ModelSim Simulation Software (Model Technology, Inc.)
• Synplify Synthesis Software (Synplicity, Inc.)
• Xilinx Foundation Software (Xilinx, Inc.)
Schedule of Milestones

[Gantt chart spanning June 1997 through Dec. 1999, with each task shown in its Research/Design and Implement/Test phases:]

GPP/DSP Sub-System
• Inter-GPP/DSP Communication Simulator for STAP
• Optimal GPP/DSP Configuration for SAR
• Optimal GPP/DSP Configuration for STAP
• Implement SAR on GPP/DSP
• Implement STAP on GPP/DSP

FPGA Sub-System
• Design STAP Iterative Weight Solver for FPGA
• Implement STAP Iterative Weight Solver on FPGA
• Design SAR Linear Filtering for FPGA
• Implement SAR Linear Filtering on FPGA
• Develop FPGA Power Consumption Simulator
• Test FPGA Power Consumption Simulator

GPP/DSP/FPGA System
• GPP/DSP/FPGA Platform Construction and Independent Testing of GPP/DSP and FPGA Subsystems
• GPP/DSP and FPGA Subsystem Integration and Testing
• Optimal GPP/DSP/FPGA Configuration for SAR
• Optimal GPP/DSP/FPGA Configuration for STAP
• Optimal GPP/DSP/FPGA Configuration for SAR/STAP
• Implement SAR on GPP/DSP/FPGA Platform
• Implement STAP on GPP/DSP/FPGA Platform
• Demonstrate Combined SAR/STAP on GPP/DSP/FPGA Platform
FY 97 and FY 98 Budgets (Program Management Status)

                 FY 97      FY 98      FY 98       FY 98
                 Approved   Approved   Required*   "Deficit"
Personnel         22,066     56,710     84,517      27,807
Fringes            7,575     18,871     25,723       6,852
Consulting             0          0     15,000      15,000
Expenses             260      3,321      4,500       1,179
Travel                 0      4,500      4,500           0
Equipment         74,000     55,608     85,088      29,480
Indirect Cost     13,634     39,198     62,623      23,425
Total            116,644    178,208    281,951     103,743

*Required to maintain 30-month completion date (i.e., 12/31/99).
FY 99 and FY 00 Budgets (Program Management Status)

                 FY 99      FY 00      Project Total
Personnel        138,536     52,401    297,520
Fringes           39,911     14,404     87,614
Consulting        25,000     10,000     50,000
Expenses           7,078      3,000     14,838
Travel            12,000      5,000     20,500
Equipment         59,892          0    217,670
Indirect Cost    104,587     39,858    221,121
Total            387,004    124,664    909,262
Outline
• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Recent Accomplishments
• Status of Deliverable Checklist
Recent Accomplishments
• Network Communication Time Simulator for Parallel STAP
• FPGA Inner-Product Co-Processor Designs for STAP Weight Solver
• Power Prediction Simulator for the Xilinx 4000-Series FPGA
Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)
• Space-Time Adaptive Processing (STAP) Basics
• Mercury RACE Multicomputer
• Parallelization Approach for STAP
• RACE Network Simulator
• Preliminary Numerical Studies
• Conclusions
Related STAP and RACE References

J. Ward, "Space-Time Adaptive Processing for Airborne Radar," Technical Report 1015, MIT Lincoln Laboratory, Lexington, MA, 1994.

M. F. Skalabrin and T. H. Einstein, "STAP Processing on a Multicomputer: Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor Communication," Proc. Adaptive Sensor Array Processing (ASAP) Workshop, March 1996.

The RACE Multicomputer, Hardware Theory of Operation: Processors, I/O Interface, and the RACEway Interconnect, Volume I, ver. 1.3.

T. H. Einstein, "Mercury Computer Systems' Modular Heterogeneous RACE Multicomputer," Proc. 6th Heterogeneous Computing Workshop, April 1997, pp. 60-71.

B. C. Kuszmaul, "The RACE Network Architecture," Proc. 9th Int'l Parallel Processing Symposium, April 1995, pp. 508-513.

G. Booch, I. Jacobson, and J. Rumbaugh, "The Unified Modeling Language for Object-Oriented Development," Documentation Set Version 1.1, September 1997.
SPACE-TIME ADAPTIVE PROCESSING
1. Space-Time Adaptive Processing (STAP) refers to a class of signal processing methods that operate on data collected from a set of sensors over a given time interval.
2. STAP simultaneously combines the signals received from an antenna array (spatial domain) and multiple pulse repetition periods (time domain).
3. STAP provides improved detection of smaller targets in the presence of ground clutter (overland and littoral environments) and hostile interference (electronic counter measures and jamming).
STAP PROCESSING FLOW

[Figure: The input data cube (channels × pulses × range) is pulse-compressed and rotated, Doppler-filtered, and rotated again; QR decomposition applied with the steering vectors produces the weights, which are used to beamform the data cube into the beam outputs.]
• Mercury RACE Multicomputer
• Space-Time Adaptive Processing (STAP) Basics
• Parallelization Approach for STAP
• RACE Network Simulator
• Preliminary Numerical Studies
• Conclusions
Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)
SOME RACE NETWORK FEATURES

1. 40 MHz clock, 32-bit data paths, 2048-byte circuit-switched packets.
2. Contention resolved using priorities.
   a. User-programmable message priority
   b. Hardware priority assigned at each crossbar along a path (based on complex connection rules)
3. A packet with higher priority preempts (suspends) a lower-priority packet (active or inactive) to gain control of a crossbar port.
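The preemption rule in item 3 can be sketched in a few lines of Python. This is an illustrative model only: the `Port`/`Packet` classes and the single scalar priority are assumptions, not the actual RACEway arbitration logic.

```python
# Illustrative sketch of priority-based preemption at a crossbar port.
# A higher-priority packet suspends the packet currently holding the port.

class Packet:
    def __init__(self, name, priority):
        self.name, self.priority = name, priority

class Port:
    def __init__(self):
        self.holder = None      # packet currently holding this port
        self.suspended = []     # preempted packets, resumed later

    def request(self, packet):
        """Grant the port, preempting a lower-priority holder if needed."""
        if self.holder is None:
            self.holder = packet
            return True
        if packet.priority > self.holder.priority:
            self.suspended.append(self.holder)   # suspend current holder
            self.holder = packet
            return True
        return False                             # request blocked

port = Port()
port.request(Packet("low", priority=1))
granted = port.request(Packet("high", priority=7))   # preempts "low"
```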
RACE NETWORK INTERCONNECT FAT-TREE TOPOLOGY

[Figure: Fat-tree of 6-port crossbars with compute nodes (CNs) at the leaves; a highlighted message path runs from the source CN up through the crossbar levels and back down to the destination CN.]
STANDARD CROSSBAR PRIORITY ARBITRATION ALGORITHM TABLE

                      Transaction Status
Hardware   Active, Port E Involved    Not Yet Active             Port E Not Involved
Priority   Entry Port   Exit Port     Entry Port   Exit Port     Entry Port   Exit Port
7          F            A,B,C,D,E     F            A,B,C,D,E     F            A,B,C,D
6          E            F             E            F             A,B,C,D*     A,B,C,D*
5          A,B,C,D      F             A,B,C,D      F             A,B,C,D      F
4          E            A,B,C,D       E            A,B,C,D       -            -
3          *A,B,C,D     *A,B,C,D,E    A,B,C,D*     A,B,C,D*      -            -
2          -            -             A,B,C,D      E             -            -
1          -            -             -            -             -            -

* Peer Kill Rules Apply
SHARC COMPUTE NODE

[Figure: CN ASIC block diagram. Two SHARC processors connect through the compute node ASIC, which contains ECC logic, DRAM, performance metering, a DMA controller, a 3-way data switch, RACEway mapping logic, OS support hardware, and the RACEway interface.]
• Parallelization Approach for STAP
• Space-Time Adaptive Processing (STAP) Basics
• Mercury RACE Multicomputer
• RACE Network Simulator
• Preliminary Numerical Studies
• Conclusions
Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)
1. Partition STAP data cube over a 2-D process set.
2. Process the contiguous dimension.
3. Re-partition the data cube before processing the next dimension.
4. Rotate the newly distributed data to make the next dimension sequential in memory.
5. Repeat steps 1 through 4 before each processing phase.
SUB-CUBE BAR PARTITIONING METHODOLOGY
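The partitioning step above can be sketched with NumPy. The toy cube dimensions and the 3 × 4 process-set shape follow the examples on the later slides; the function name is an illustrative assumption.

```python
import numpy as np

def sub_cube_bars(cube, rows, cols, axes):
    """Partition a 3-D data cube into rows*cols sub-cube bars.
    The dimension not listed in `axes` stays whole on every process."""
    parts = []
    for r_blk in np.array_split(cube, rows, axis=axes[0]):
        for c_blk in np.array_split(r_blk, cols, axis=axes[1]):
            parts.append(c_blk)
    return parts

# Toy cube: (channels, pulses, range) = (6, 8, 12)
cube = np.arange(6 * 8 * 12).reshape(6, 8, 12)

# Pulse compression: range dimension whole; split channels x pulses 3 x 4.
pc = sub_cube_bars(cube, 3, 4, axes=(0, 1))

# Doppler filtering: pulses dimension whole; split channels x range 3 x 4.
df = sub_cube_bars(cube, 3, 4, axes=(0, 2))
```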
STAP DATA CUBE PARTITIONING EXAMPLES

[Figure: Two partitionings of the channels × pulses × range data cube over a 3 × 4 process set (processes 1-12): pulse-compression partitioning with the range dimension whole, and Doppler-filtering partitioning with the pulses dimension whole.]
STAP DATA CUBE REPARTITIONING
Required Data Transfers

[Figure: The 3 × 4 process set shown with the range dimension contiguous (pulse-compression layout) and with the pulse dimension contiguous (Doppler-filtering layout), with the required data transfers between them.]

• Re-partitioning involves exchanging data with the next whole dimension.
• Interprocessor communication is required between processors in the same row.
STAP DATA CUBE REPARTITIONING
Data Re-distribution Mapping

[Figure: Network interconnection configuration for the re-distribution. CNs attached to 6-port crossbars hold the sub-cube bars of the pulse-compression partitioning; interprocessor communication (IPC) moves data among the CNs in each row of the process set to produce the Doppler-filtering partitioning.]
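The row-wise exchange described above can be sketched as follows; the assumption (following the slides) is that each process trades one block with every other process in its row of the process set, and the message bookkeeping here is purely illustrative.

```python
# Sketch of the re-partitioning exchange: when the whole dimension changes
# (e.g., range-whole -> pulses-whole), each process keeps one block of its
# sub-cube bar and sends the rest to the other processes in its row.

def row_exchange_messages(rows, cols):
    """List the (src, dst) transfers for one re-partitioning phase."""
    msgs = []
    for r in range(rows):
        row_procs = [r * cols + c for c in range(cols)]
        for src in row_procs:
            for dst in row_procs:
                if src != dst:
                    msgs.append((src, dst))
    return msgs

msgs = row_exchange_messages(3, 4)   # 3 x 4 process set
```

Each of the 12 processes sends to its 3 row peers, so 36 transfers total; no message crosses rows.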
• RACE Network Simulator
• Space-Time Adaptive Processing (STAP) Basics
• Mercury RACE Multicomputer
• Parallelization Approach for STAP
• Preliminary Numerical Studies
• Conclusions
Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)
RESEARCH OBJECTIVES for SIMULATOR

1. Design and implement a network simulator that models the effect that data mapping and scheduling have on the performance of a STAP algorithm.
2. Key features of the network simulator include:
   a. Developed and implemented in an OO paradigm.
   b. Implemented using a sub-cube bar partitioning scheme.
   c. Models both sub-cube bar mapping strategies and communication scheduling during both phases of data re-partitioning.
   d. Completely generic.
UML NETWORK CLASS DIAGRAM

[Figure: The Network class aggregates the Clock, Crossbar (1..*), Routing Table, File Output, Random Scan, Data Cube, and Process Set classes; the Routing Table gets data from the Process Set.]
UML CROSSBAR CLASS DIAGRAM

[Figure: A Crossbar connects to Links (2 or 6 per crossbar) and, for terminal crossbars, to Compute Nodes (0 or 4). Each Compute Node holds a Message Queue of Message objects and a Packet Stack of Packet objects.]
UML DATA CLASS DIAGRAM

[Figure: Data is an abstract class; Message and Packet inherit from it. A Packet carries a Header and a Route List (1..*) of Route objects.]
NETWORK CLASS DETAILS

Compute Node: processor information; outgoing and received message queues; outgoing and received packet stacks.

Random Scan: generates pseudo-random CN scan ordering.

Clock: based on network clock frequency (factor of 5); data transfer rate equates to effective network bandwidth.

Network methods: dynamic network construction; dynamic routing table creation; dynamic CN and CE message traffic generation; simulates packet traffic.
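The clock and bandwidth figures quoted earlier (40 MHz clock, 32-bit data path) imply a simple transfer-time model with a 160 MB/s peak rate. This is a back-of-the-envelope sketch; the efficiency parameter is an assumption added for illustration, not a simulator feature.

```python
# Bandwidth model implied by the RACE network parameters:
# a 40 MHz clock moving 32 bits (4 bytes) per cycle.

CLOCK_HZ = 40e6          # RACEway clock
BYTES_PER_CYCLE = 4      # 32-bit data path

def transfer_time_ms(nbytes, efficiency=1.0):
    """Ideal time in ms to move nbytes at the given link efficiency."""
    rate = CLOCK_HZ * BYTES_PER_CYCLE * efficiency   # bytes per second
    return nbytes / rate * 1e3

peak_mb_s = CLOCK_HZ * BYTES_PER_CYCLE / 1e6         # peak MB/s
t = transfer_time_ms(2048)                           # one max-size packet
```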
CROSSBAR CLASS DETAILS

Crossbar: two parent port connections; four child port connections; internal switch connections; four CN connections for terminal crossbars.

Link: connects crossbar objects; link status: occupied or free.

Crossbar methods: implements hardware priority arbitration (top-level and standard algorithms); queries port status; routes packets to the next location; allocates and frees internal port connections and connected Link objects; transmits packet data.
COMPUTE NODE CLASS DETAILS

Compute Node: processor information; outgoing and received message queues; outgoing and received packet stacks.

Compute Node methods: manages the outgoing and received message queues; manages the outgoing and received packet stacks; explodes the top outgoing message into packets of size 2048 bytes or less; handles DMA chaining of packets; establishes a path through the network and transmits packet data.

• Packets are self-routing.

[Figure: The top message of the Outgoing Message Queue (Message 1, 2, 3, ...) is "exploded" onto the Packet Stack as Packets 1-4.]
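The "explode" step described above can be sketched directly: the top outgoing message is split into packets of at most 2048 bytes. The function name and list-of-sizes representation are illustrative assumptions, not the simulator's actual API.

```python
# Sketch of exploding a message into packets of 2048 bytes or less.

PACKET_SIZE = 2048

def explode(message_nbytes):
    """Split a message into packet sizes no larger than PACKET_SIZE."""
    packets = []
    remaining = message_nbytes
    while remaining > 0:
        packets.append(min(PACKET_SIZE, remaining))
        remaining -= packets[-1]
    return packets

packets = explode(5000)   # -> [2048, 2048, 904]
```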
SIMULATOR UML SEQUENCE DIAGRAM

[Figure: Sequence diagram for one simulation run. The User actor supplies the data cube (e.g., R:200, P:22, C:16), the process set (CEs:48, X:6, Y:8), the routing scheme, CN traffic, and the Phase 1 DMA option. The Network builds the messages and the X, Y, and mapping matrices, then iteratively increments the simulation Clock and performs the connection/data-transfer passes (Pass 1 and Pass 2) against the Crossbars, Data Cube, Process Set, and CNs, cleans up the message matrices, and reports the messages and simulation time (e.g., Simulation Time = 2 ms).]
COMPUTE NODE UML STATECHART: Simulation PASS 1

[Figure: In the compute node subsystem, PASS 1 checks the packet stack status: if it is not empty, the top packet is popped to become the current packet; if it is empty, the message queue status is checked and the top message is exploded into packets (or the node is done when both are empty). The simulation subsystem proceeds on success (packet found or no packet) or generates an error code on error.]
COMPUTE NODE UML STATECHART: Simulation PASS 2

[Figure: PASS 2 retrieves each compute node's current packet and simulates it (PacketFound), or skips the node when there is no packet.]
PACKET UML STATECHART: Simulation Pass 1 and Pass 2

[Figure: Packet life cycle in the simulation pass subsystem: Start Up, then Ready and Active, with transitions among Blocked, Suspended, Waiting for Kill, and Completed across Pass 1 and Pass 2.]
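The packet life cycle in the statechart can be sketched as a small transition table. The specific transitions listed are an illustrative reading of the diagram (the diagram itself does not label its events), so treat this as a sketch, not the simulator's exact state machine.

```python
# Minimal sketch of the packet life cycle: Ready -> Active, preemption
# between Active and Suspended, blocking on busy links, and completion.

TRANSITIONS = {
    ("Ready", "grant"): "Active",
    ("Active", "preempt"): "Suspended",
    ("Suspended", "resume"): "Active",
    ("Active", "block"): "Blocked",
    ("Blocked", "grant"): "Active",
    ("Active", "done"): "Completed",
}

def step(state, event):
    """Advance the packet state machine; unknown transitions are errors."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} on {event}")

s = "Ready"
for e in ["grant", "preempt", "resume", "done"]:
    s = step(s, e)
```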
• Preliminary Numerical Studies
• Space-Time Adaptive Processing (STAP) Basics
• Mercury RACE Multicomputer
• Parallelization Approach for STAP
• RACE Network Simulator
• Conclusions
Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)
PROCESS SET PERFORMANCE METRIC

[Histograms of simulated communication time (Time (ms) vs. Count) comparing alternative process-set shapes:]

• Phase 1 (CN:8, R:800, P:32, C:22, Routing:E); series: CN 8 (6x4), CN 8 (4x6)
• Phase 2 (CN:8, R:800, P:32, C:22, Routing:E); series: CN 8 (6x4), CN 8 (4x6)
• Phase 1 (CN:16, R:200, P:22, C:16, Routing:F); series: CN 16 (12x4), CN 16 (8x6), CN 16 (4x12)
• Phase 2 (CN:16, R:200, P:22, C:16, Routing:F); series: CN 16 (12x4), CN 16 (8x6), CN 16 (4x12)
• Phase 1 (CN:12, R:200, P:22, C:16, Routing:F); series: CN 12 (12x3), CN 12 (9x4), CN 12 (6x6), CN 12 (4x9)
• Phase 2 (CN:12, R:200, P:22, C:16, Routing:F); series: CN 12 (12x3), CN 12 (9x4), CN 12 (6x6), CN 12 (4x9)
• Phase 1 (CN:12, R:200, P:22, C:16, Routing:F); series: CN 12 (3x12), CN 12 (12x3), CN 12 (4x9)
• Phase 2 (CN:12, R:200, P:22, C:16, Routing:F); series: CN 12 (3x12), CN 12 (12x3), CN 12 (4x9)
• Phase 1 (CN, R:200, P:22, C:16, Routing:F); series: CN 12 (3x12), CN 16 (4x12)
• Phase 2 (CN, R:200, P:22, C:16, Routing:F); series: CN 12 (3x12), CN 16 (4x12)
MESSAGE TRAFFIC PERFORMANCE METRIC

[Histograms of simulated communication time (Time (ms) vs. Count) comparing CN traffic and CE traffic:]

• Phase 1 (CN:16, X:12, Y:4, R:400, P:22, C:16, Routing:EF)
• Phase 2 (CN:16, X:12, Y:4, R:400, P:22, C:16, Routing:EF)
• Phase 1 (CN:16, X:6, Y:8, R:400, P:22, C:16, Routing:EF)
• Phase 2 (CN:16, X:6, Y:8, R:400, P:22, C:16, Routing:EF)
• Phase 1 (CN:12, X:6, Y:6, R:800, P:32, C:22, Routing:E)
• Phase 2 (CN:12, X:6, Y:6, R:800, P:32, C:22, Routing:E)
• Phase 1 (CN:12, X:9, Y:4, R:800, P:32, C:22, Routing:E)
• Phase 2 (CN:12, X:9, Y:4, R:800, P:32, C:22, Routing:E)
DMA CHAINING PERFORMANCE METRIC

[Histograms of simulated communication time (Time (ms) vs. Count) comparing DMA chaining versus no chaining:]

• Phase 1 (CE:24, X:8, Y:3, R:200, P:22, C:16, Routing:F)
• Phase 2 (CE:24, X:8, Y:3, R:200, P:22, C:16, Routing:F)
• Phase 1 (CE:24, X:8, Y:3, R:400, P:22, C:16, Routing:F)
• Phase 2 (CE:24, X:8, Y:3, R:400, P:22, C:16, Routing:F)
• Phase 1 (CE:24, X:8, Y:3, R:800, P:32, C:22, Routing:F)
• Phase 2 (CE:24, X:8, Y:3, R:800, P:32, C:22, Routing:F)
ADAPTIVE ROUTING PERFORMANCE METRIC

[Histograms of simulated communication time (Time (ms) vs. Count) comparing Adaptive E, Adaptive F, and Adaptive E/F routing:]

• Phase 1 (CN:16, X:8, Y:6, R:800, P:32, C:22)
• Phase 2 (CN:16, X:8, Y:6, R:800, P:32, C:22)
• Phase 1 (CN:16, X:8, Y:6, R:400, P:22, C:16)
• Phase 2 (CN:16, X:8, Y:6, R:400, P:22, C:16)
• Space-Time Adaptive Processing (STAP) Basics
• Mercury RACE Multicomputer
• Parallelization Approach for STAP
• RACE Network Simulator
• Preliminary Numerical Studies
• Conclusions
Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)
CONCLUSIONS

1. Designed and implemented a platform-independent simulator.
2. The simulator demonstrates that the process set, the CN and CE message traffic, DMA chaining, adaptive routing, and the scheduling of messages all affect performance.
3. Allows users to experiment with possible current and future configurations.
4. The communication pattern is implemented for STAP but may be used for other applications with a phased communication pattern.
Recent Accomplishments
• Network Communication Time Simulator for Parallel STAP
• FPGA Inner-Product Co-Processor Designs for STAP Weight Solver
• Power Prediction Simulator for the Xilinx 4000-Series FPGA
FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
(Recent Accomplishments)
• Overview of STAP Weight Calculation
• Two Candidate STAP Weight Solvers: QR Versus CG
• Two FPGA Inner-Product Circuit Designs
• Numerical Accuracy Studies
References for STAP Weight Solver and FPGA Design
J. Ward, “Space-Time Adaptive Processing for Airborne Radar,” Technical Report 1015, MIT Lincoln Laboratory, Lexington, MA, 1994.
K. C. Cain, J. A. Torres, and R. T. Williams, (R. A. Games, Project Leader), “RT_STAP: Real-Time Space-Time Adaptive Processing Benchmark,” MITRE Technical Report MTR 96B0000021, Feb. 1997.
MCARM Data Files, Rome Laboratory, (http://sunrise.oc.rl.af.mil).
D. G. Luenberger, Linear and Nonlinear Programming, Addison-Wesley, Reading, MA, 1984.
WildOne Hardware Reference Manual, Number 11927-0000, Revision 0.1, Annapolis Micro Systems, Inc., MD, 1997.
Typical STAP Processing Flow

[Figure: Input data is pulse-compressed into the data cube, Doppler-filtered, then fed both to weight computation (using the steering vector and covariance matrix) and to weight application, followed by threshold detection and the target decision. Annotated computational loads: 8%, 91.5%, and 0.5%, with weight computation dominating.]
Principle Behind STAP

[Figure: STAP CPI data cube with dimensions PRI (M = 32-128 pulses), Channels (N = 24), and Range (L = 625-2500 range gates).]
Range Segment

• Range gates are divided into non-overlapping blocks having a fixed number of range gates.
• These blocks are referred to as range segments.

[Figure: The M × N × L data cube with the L range gates divided into segments of Lr gates each.]

Number of Range Segments = L / Lr
Space-Time Adaptive Processing

• Fully Adaptive STAP
  • Works with data on all M Doppler bins and all N channels
  • Computes and applies a separate adaptive weight to every element and Doppler bin
  • The weight vector is of size MN for each range gate.
Space-Time Adaptive Processing

• Characteristics of Fully Adaptive STAP
  • Requires solving a large system of linear equations
  • Size of the linear system grows with
    • Array size (the number of channels)
    • Number of pulses

Example: for each instance, if M = 32 and N = 24, then complexity ≈ (MN)^3 = 452,984,832

• Implementation of fully adaptive STAP is impractical
  • Complexity of each instance is O((MN)^3)
  • The product MN being several hundred puts it beyond current capabilities in real-time computing
  • Instances of the problem must be solved for each range segment
Space-Time Adaptive Processing

• Partially Adaptive STAP
• Problem is broken down into a number of smaller, more manageable adaptive problems
• STAP applied to these lower-dimension problems
• The partially adaptive STAP works with data on
  • All N channels
  • And a few adjacent Doppler bins, denoted as K
• Complexity is reduced to O(M(KN)³), for K << M

Example: for each instance, if M = 32, N = 24, and K = 3, then complexity ≈ M(KN)³ = 11,943,936
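The reduction claimed above can be checked with a quick calculation, a sketch using the slide's example values (the counts are the (MN)³ and M(KN)³ estimates, not exact flop counts):

```python
# Complexity of fully adaptive versus Kth-order Doppler factored STAP,
# using the example values from the slides.
M, N, K = 32, 24, 3

fully_adaptive = (M * N) ** 3           # one solve of an MN x MN system
partially_adaptive = M * (K * N) ** 3   # M solves of KN x KN systems

print(fully_adaptive)       # 452984832
print(partially_adaptive)   # 11943936
```

The factored approach cuts the operation count by roughly a factor of 38 for these dimensions.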
Space-Time Adaptive Processing

• Kth-Order Doppler Factored STAP: an effective partially adaptive STAP technique
• The architecture consists of
  • Doppler processing across all pulse repetition intervals
  • Adaptive filtering across
    • all channels and
    • K adjacent Doppler bins
Kth-Order Doppler Factored STAP

x(k,r): 3N × 1

ψ(k,b) = R̂(k,b) = (1/Lr) Σ_{r=(b−1)Lr+1}^{b·Lr} x(k,r) x(k,r)^H
[Figure: Data matrix needed for calculating the covariance matrix for the kth Doppler bin and bth Range Segment using Kth-Order Doppler Factored STAP with K = 3; the bth range segment (with Lr cells) spans N channels and Doppler bins (k − 1), k, (k + 1)]
Matrix-Based Derivation of ψ(k,b)

X(k,b): 3N × Lr

ψ(k,b) = (1/Lr) X(k,b) X(k,b)^H = (1/Lr) Σ_{r=(b−1)Lr+1}^{b·Lr} x(k,r) x(k,r)^H

The Weight Equation: ψ(k,b) w(k,b) = s
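The two forms of ψ(k,b) above, one matrix product versus a sum of outer products, can be checked numerically; a sketch with the slide's dimensions (K = 3, N = 24) and random data:

```python
import numpy as np

# psi(k,b) formed two ways: as (1/Lr) X X^H and as the average of
# outer products x(k,r) x(k,r)^H over the range segment.
rng = np.random.default_rng(0)
N, Lr = 24, 125                       # channels, range gates per segment
X = rng.normal(size=(3 * N, Lr)) + 1j * rng.normal(size=(3 * N, Lr))

psi_mat = (X @ X.conj().T) / Lr
psi_sum = sum(np.outer(X[:, r], X[:, r].conj()) for r in range(Lr)) / Lr

print(np.allclose(psi_mat, psi_sum))  # True
```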
FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
(Recent Accomplishments)
• Overview of STAP Weight Calculation
• Two Candidate STAP Weight Solvers: QR Versus CG
• Two FPGA Inner-Product Circuit Designs
• Numerical Accuracy Studies
Methods for STAP Weight Calculation
• Two approaches to solve the weight equation
• QR-decomposition method (direct)
• Conjugate Gradient method (iterative)
STAP Weight Calculation

Using QR-decomposition Method to solve the Weight Equation ψw = s:

ψ(k,b) w(k,b) = s

(1/Lr) X(k,b) X^H(k,b) w(k,b) = s

Take QR Decomposition: X^T(k,b) = QR

(1/Lr) R^T Q^T Q^* R^* w(k,b) = s    [Note that Q^T Q^* = I]

(1/Lr) R^T R^* w(k,b) = s, so R^T R^* w(k,b) = Lr s

Let R^* w(k,b) = p

Solve for p using forward elimination: R^T p = Lr s

Solve for w(k,b) using backward substitution: R^* w(k,b) = p
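The pair of triangular solves above can be sketched and checked against the weight equation directly (hypothetical sizes; NumPy's dense QR stands in for the board's factorization):

```python
import numpy as np

# QR-based weight solve: with X^T = Q R, X X^H = R^T R^*, so
# psi w = s becomes R^T p = Lr s (forward) and R^* w = p (backward).
rng = np.random.default_rng(1)
n, Lr = 12, 100
X = rng.normal(size=(n, Lr)) + 1j * rng.normal(size=(n, Lr))
s = rng.normal(size=n) + 1j * rng.normal(size=n)

Q, R = np.linalg.qr(X.T)            # reduced QR of the Lr x n matrix
p = np.linalg.solve(R.T, Lr * s)    # forward elimination (R^T lower triangular)
w = np.linalg.solve(R.conj(), p)    # backward substitution (R^* upper triangular)

psi = (X @ X.conj().T) / Lr
print(np.allclose(psi @ w, s))      # True
```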
STAP Weight Calculation

Using Conjugate Gradient Method to solve the Weight Equation ψw = s:

Initialization: Choose w_0; set d_0 = s − ψ w_0

Iteration:

g_i = ψ w_i − s
α_i = −(g_i^T d_i) / (d_i^T ψ d_i)
w_{i+1} = w_i + α_i d_i
g_{i+1} = ψ w_{i+1} − s
β_i = (g_{i+1}^T ψ d_i) / (d_i^T ψ d_i)
d_{i+1} = −g_{i+1} + β_i d_i
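The conjugate gradient iteration above can be sketched on a small real symmetric positive definite test system (complex Hermitian ψ would use conjugate transposes in the inner products; sizes are illustrative):

```python
import numpy as np

# Conjugate gradient for psi w = s, following the iteration on the slide.
rng = np.random.default_rng(2)
n = 20
A = rng.normal(size=(n, n))
psi = A @ A.T + n * np.eye(n)        # symmetric positive definite
s = rng.normal(size=n)

w = np.zeros(n)
d = s - psi @ w                      # d0 = s - psi w0
for _ in range(n):
    g = psi @ w - s                  # gradient of the quadratic form
    if np.linalg.norm(g) < 1e-12:
        break
    pd = psi @ d
    alpha = -(g @ d) / (d @ pd)
    w = w + alpha * d
    g_next = psi @ w - s
    beta = (g_next @ pd) / (d @ pd)
    d = -g_next + beta * d

print(np.allclose(psi @ w, s))       # True
```

In exact arithmetic CG terminates in at most n steps; in practice an early stopping tolerance trades accuracy for flops, which is the QR-versus-CG tradeoff studied next.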
Preliminary Numerical Studies

[Plots: Relative Error versus CG Tolerance (10^-8 to 10^-1) for Lr = 125 and Lr = 250]

Relative Error = ||w_qr − w_cg|| / ||w_qr||
Preliminary Numerical Studies

[Plots: Flop Count (10^8 to 10^10) versus CG Tolerance (10^-8 to 10^-1) for Lr = 125 and Lr = 250, comparing CG and QR]
FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
(Recent Accomplishments)
• Overview of STAP Weight Calculation
• Two Candidate STAP Weight Solvers: QR Versus CG
• Two FPGA Inner-Product Circuit Designs
• Numerical Accuracy Studies
Motivation for FPGA Inner-Product Co-Processors
• Inner-products are a core calculation for both CG- and QR-based STAP weight solvers
• Computations are highly numeric and regular
• Opportunities to exploit reduced precision arithmetic
• Control flow of CG and QR best implemented on GPP or DSP; inner-product calculations can be offloaded to available FPGA resources
Overview of WildOne Architecture

[Block diagram: PCI bus connected to Dual-Port Memory Controllers 0 and 1 and FIFOs 0 and 1, which feed Processing Elements 0 and 1; a SIMD connector and an external I/O connector link the processing elements]
FPGA Inner Product Co-Processor: Design 1

[Circuit diagram: operands a and b pass through a multiplier and a 1's complement/register into a normalizing unit and an adder (sign + 16-bit mantissa), ending in an output register; buffers connect the FPGA board to the host processor over the interconnection bus]

• Multiply-Accumulate Pipe
  • Reads two operands per cycle
  • Performs two operations per cycle
  • Performs exponent normalization prior to accumulation
• 2 N-vectors reduced to a constant number of partial sums
FPGA Inner Product Co-Processor: Design 2

[Circuit diagram: two multipliers, each fed two operands, pass results through 1's complement/registers into an adder (sign + 16-bit mantissa); data for the first and second multipliers is buffered between the FPGA board and the host processor over the interconnection bus]

• Multiply-Add Reduction Pipe
  • Reads four operands per cycle
  • Performs three operations per cycle
  • No normalization required
  • 2 N-vectors reduced to N/2 partial sums
• Basic Tradeoff: First design has lower throughput, but can perform more work
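The two pipes reduce vectors differently; a plain-Python model of the dataflow (this models only the reduction pattern, not the circuits or their block floating point formats):

```python
# Design 1: multiply-accumulate pipe, two operands per cycle, one running sum.
# Design 2: multiply-add reduction pipe, four operands per cycle, N/2 partial
# sums that the host (or further passes) must finish reducing.
a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]

acc = 0.0                            # Design 1 output: one partial sum
for x, y in zip(a, b):
    acc += x * y

partials = [a[i] * b[i] + a[i + 1] * b[i + 1]   # Design 2 output: N/2 sums
            for i in range(0, len(a), 2)]

print(acc)        # 70.0
print(partials)   # [17.0, 53.0]
```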
[UML class diagrams:
• Basic Co-Processor Design: the Inner-Product Co-Processor contains a Block Floating Point Unit
• Block Floating Point Unit: composed of a Multiplying Unit, Complementor, Normalizing Unit, and Accumulator
• Multiplying Unit: a Multiply Stage with Register and 4-Bit Adder
• Accumulator Unit: a 4-Bit Adder, 3-Bit Adder, and Register
• Normalizing Unit: a Subtractor, Register, and Magnitude Comparator]
Sequence Diagram for Interactions between Host and FPGA Board

[Sequence diagram; messages between Host Program and Wild-One over time: open board; program the board with the image; interrupt for exponent; exponent written to FIFO; interrupt for mantissa vectors; mantissa vectors written to the FIFO; processing done, answer in FIFO/memory; read back the answer; close the board]
Statechart Diagram for Interactions between Host and FPGA Board

[Statechart; Host System: Write Exponent → Wait for Exponent Int → Write Mantissa → Wait for Mantissa Int → Wait for Answer Int → Read Back Answer; FPGA Board (Processing Sub-System): Get Exponent → Read Exponent → Get Mantissa → Read Mantissa → Multiply-and-add/accumulate → Write Back; transitions driven by Req/Ack handshake signals and a Done flag]
Circuit Activity Diagram: Design 1

[Activity diagram: Read Two Operands → Multiply → Accumulate → Feedback Sum, Increment Count → Compare Count; while Count ≠ Threshold, repeat; when Count = Threshold, Write to Memory and Set Done flag]
Circuit Activity Diagram: Design 2

[Activity diagram: Read Two Operands → Multiply → Read Next Two Operands → Multiply → Add → Increment Count → Compare Count; while Count ≠ Threshold, repeat; when Count = Threshold, Write to Memory and Set Done flag]
FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
(Recent Accomplishments)
• Overview of STAP Weight Calculation
• Two Candidate STAP Weight Solvers: QR Versus CG
• Two FPGA Inner-Product Circuit Designs
• Numerical Accuracy Studies
Setup for Numerical Accuracy Studies
• Randomly generated, 512-element test vectors processed by both designs
• Range of vectors’ data values controlled to study effect dynamic range has on accuracy
• Output of each circuit compared to corresponding results calculated on host (using IEEE 32-bit floating point arithmetic)
• Accuracy metric is ratio of obtained values to corresponding IEEE floating point value
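As an illustration of the metric (not the circuits themselves), a block floating point inner product can be emulated with a shared scale and sign + 16-bit mantissas and compared against IEEE arithmetic:

```python
import numpy as np

# Emulated block floating point: all elements share one scale (a block
# exponent) and keep 16-bit signed mantissas; accuracy is the ratio of
# the quantized inner product to the full-precision result.
rng = np.random.default_rng(3)
v1 = rng.uniform(0.0, 1.0, 512)
v2 = rng.uniform(0.0, 1.0, 512)

def block_fp(v, bits=16):
    scale = (2 ** (bits - 1) - 1) / np.max(np.abs(v))
    return np.round(v * scale) / scale

accuracy = np.dot(block_fp(v1), block_fp(v2)) / np.dot(v1, v2)
print(accuracy)   # close to 1.0 for this narrow dynamic range
```

Widening the dynamic range of the inputs forces small elements to lose mantissa bits under the shared scale, which is the effect the experiments below measure.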
Zero Order of Magnitude Experiment

[Histograms: data values uniformly distributed over roughly 0.001-1.000; exponents spanning about 114-134; accuracy for Design 1 clustered between 0.999855 and 0.99989, and for Design 2 between 0.9984 and 1.0000]
Two Orders of Magnitude Experiment

[Histograms: data values spanning roughly 0-110; exponents about 119-145; accuracy for Design 1 clustered between 0.999893 and 0.999927, and for Design 2 between 0.99399 and 0.99999]
Four Orders of Magnitude Experiment

[Histograms: data values spanning roughly 0-11,000; exponents about 119-145; accuracy for Design 1 remains clustered between 0.999889 and 0.99993, while Design 2 accuracy spreads from about 0.467 to 1.000]
Five Orders of Magnitude Experiment

[Histograms: data values spanning roughly 0-103,000; exponents about 119-143; accuracy for Design 1 remains clustered between 0.999912 and 0.999998, while Design 2 accuracy spreads over nearly the full 0.0-1.0 range]
"Outlier" Experiment

[Histograms: data values concentrated near zero with a few outliers up to 1000; exponents about 114-138; accuracy for Design 1 spreads from about 0.593 to 0.784, and for Design 2 from 0.00 to 0.92]
Conclusions
• CG weight solver provides tradeoff between accuracy and required FLOPs(compared to QR weight solver)
• Tradeoff between two FPGA designs: Design 1 (Mult & Accum) has lower peak throughput, but can perform more total work than Design 2
• Block floating point provides acceptable accuracy for uniformly distributed data over reasonable dynamic ranges
• Block floating point accuracy breaks down when there are a few large outliers in the data set
Recent Accomplishments
• Network Communication Time Simulator for Parallel STAP
• FPGA Inner-Product Co-Processor Designs for STAP Weight Solver
• Power Prediction Simulator for the Xilinx 4000-Series FPGA
Power Prediction Simulator for the Xilinx 4000-Series FPGA
(Recent Accomplishments)
• CMOS Power Consumption and Past Research
• Design and Implementation of the Power Prediction Simulator
• Preliminary Experimental Results
• Conclusions and Current Work, Demo
References for FPGA Power Prediction
K. P. Parker and E. J. McCluskey, “Probabilistic Treatment of General Combinatorial Networks,” IEEE Trans. Computers, Vol. C-24, June 1975, pp. 668-670.
Kaushik Roy and Sharat Prasad, “Circuit Activity Based LogicSynthesis for Low Power Reliable Operations,” IEEE Trans. VLSI Systems, Vol. 1, No. 4, Dec.1993, pp.
Kaushik Roy, “Power Dissipation Driven FPGA Place and Route under Timing Constraints,” School of Electrical and Computer Engineering, Purdue University.
“XC4000 Series Field Programmable Gate Arrays,” Xilinx, Inc., September 18, 1996.
Power Dissipation in CMOS

[Diagram; three components of power dissipation:
• Leakage Current
• Transient Current: dependent on signal activity
• Dynamic Capacitance Charging Current: most important for CMOS; dependent on clock frequency and signal activity]
Power Equations

Equivalent model of a transistor's gate...

v_C(t) = V(1 − e^(−t/RC))

v_R(t) = V e^(−t/RC)

p_R(t) = (V²/R) e^(−2t/RC)

p_avg = (1/τ) ∫₀^τ (V²/R) e^(−2t/RC) dt = (CV²/2τ)(1 − e^(−2τ/RC)) ≈ CV²/2τ
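The averaging step can be checked numerically; a sketch with illustrative R, C, V, and τ values (not measured gate parameters):

```python
import math

# Average the resistor power p_R(t) = (V^2/R) e^(-2t/RC) over one period
# tau and compare with the closed form (C V^2 / 2 tau)(1 - e^(-2 tau/RC)).
V, R, C = 3.3, 1.0e3, 1.0e-12        # hypothetical gate model values
tau = 1.0e-8                          # clock period

n = 100_000
dt = tau / n
p_num = sum((V * V / R) * math.exp(-2 * (i + 0.5) * dt / (R * C))
            for i in range(n)) * dt / tau        # midpoint-rule integral
p_closed = (C * V * V / (2 * tau)) * (1 - math.exp(-2 * tau / (R * C)))

print(abs(p_num - p_closed) / p_closed < 1e-6)   # True
```

For τ much larger than RC the exponential term vanishes and the average collapses to CV²/2τ, the familiar capacitor-charging energy per cycle.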
Probabilistic Modeling

p(s): the probability that signal s attains a logical value of true at any given clock cycle.

A(s): the probability that signal s transitions at any given clock cycle.

[Example values: p(clock) = 0.50, A(clock) = 1.0; p(x1) = 0.88, A(x1) = 0.10; p(x2) = 0.29, A(x2) = 0.17; p(x3) = 0.69, A(x3) = 0.27]
Probabilistic Modeling

[Figure: example logic network with inputs x1(t), x2(t), x3(t), internal signals x1x2(t) and x1x2x3(t), and output y; signals annotated p = 0.88, A = 0.10; p = 0.29, A = 0.17; p = 0.69, A = 0.27; p = 0.83, A = 0.17; p = 0.10, A = 0.13]

Calculation of average power:

P_avg = (1/2) V² Σ_{g ∈ all gates} C_g A_g
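A toy instance of these quantities (the capacitances and the simplified AND-gate activity rule are illustrative assumptions, not the cited transformations, which also account for correlations and simultaneous switching):

```python
# Propagate probability and activity through a 2-input AND gate with
# independent inputs, then evaluate P_avg = 0.5 V^2 * sum(C_g * A_g).
p_x1, A_x1 = 0.88, 0.10
p_x2, A_x2 = 0.29, 0.17

p_y = p_x1 * p_x2                    # P(x1 AND x2) for independent inputs
A_y = p_x1 * A_x2 + p_x2 * A_x1      # first-order activity estimate

V = 3.3
gates = [(1.0e-12, A_x1), (1.2e-12, A_x2), (0.8e-12, A_y)]  # (C_g, A_g)
P_avg = 0.5 * V * V * sum(c * a for c, a in gates)

print(round(p_y, 4))   # 0.2552
print(round(A_y, 4))   # 0.1786
```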
Probabilistic Equations

Signal probability transformations...*

[Equation: p(y) expressed as a sum, over the product terms of f, of products of the input signal probabilities]

Signal activity transformations...†

[Equation: A(y) expressed as a probability-weighted sum over single, pairwise, and three-way simultaneous input transitions, each weighted by the probability that the transition changes f(X)]

* Probabilistic Treatment of General Combinatorial Networks
† Estimation of Circuit Activity Considering Signal Correlations and Simultaneous Switching
Power Prediction Simulator for the Xilinx 4000-Series FPGA
(Recent Accomplishments)
• CMOS Power Consumption and Past Research
• Design and Implementation of the Power Prediction Simulator
• Preliminary Experimental Results
• Conclusions and Current Work, Demo
FPGA Design

[Figure: FPGA internal structure; CLBs, IOBs, and buffers]

Routing Fabric Design

[Figure: example routings]

Xilinx 4000 series routing fabric is very intricate. Xilinx synthesis tools use shortest-path routing where possible. The distance the signal travels is the metric considered in this model.
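The distance metric can be sketched directly (grid coordinates are hypothetical CLB positions):

```python
# Manhattan distance between source and destination CLBs, used as the
# routing-cost metric in this model.
def manhattan(src, dst):
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

print(manhattan((0, 0), (3, 2)))   # 5
```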
Signal Design

Attributes tracked per signal: Symbolic Probability, Numeric Probability, Numeric Activity, Signal Reference, Manhattan Distance

[Figure: local (L) versus remote (R) signals between CLBs]
Iteration Example

[Figures: a 4 × 4 grid of LUTs and interconnection; signals labeled local (L) or remote (R) as probabilities propagate through the fabric across iterations]
Probabilistic Feedback Example
ab
d
ec pe
pa
pb
• Feedback Circuits Require Symbolic Iteration of Probability Expressions
• Assume pa , pb , pe are known; then pd and pc are determined using iteration
pd
d = a + bc
dc
pc
c = d e
Iteration 1:
pd = pa
pc = pa pe
Iteration 2:
pd = pa + pa pb pe
pc = (pa pe + pa pb pe) pe = pa pe
Iteration 3:
pd = pa + pa pb pe
pc = pa pe
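The same fixed point can be approached numerically; a sketch that iterates the probability expressions under an independence assumption (it converges near, but not exactly to, the symbolic values, because independence ignores the reconvergent correlation that the symbolic absorption captures):

```python
# Iterate p_d = P(a OR bc), p_c = P(d AND e) from p_c = 0 until the
# values stop changing (inputs treated as independent).
p_a, p_b, p_e = 0.88, 0.29, 0.69

p_c = 0.0
for _ in range(100):
    p_d = p_a + p_b * p_c - p_a * p_b * p_c   # OR of independent events
    p_c_next = p_d * p_e                       # AND of independent events
    if abs(p_c_next - p_c) < 1e-12:
        break
    p_c = p_c_next

print(round(p_d, 4), round(p_c, 4))
```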
Power Prediction Simulator for the Xilinx 4000-Series FPGA
(Recent Accomplishments)
• CMOS Power Consumption and Past Research
• Design and Implementation of the Power Prediction Simulator
• Preliminary Experimental Results
• Conclusions and Current Work, Demo
Experimental Results
Probabilistic signals are correctly propagated through combinational and sequential logic.
Configurations making use of feedback converge for all test cases.
Probabilistic modeling is more than an order of magnitude faster than time-domain modeling techniques.
Convergence of Probabilistic Signals

[Plot: Probability Convergence; % convergence versus iteration (0-10) for test cases Adder4, FIFO, PipeAdder, and Mult32]

All test cases converged in the following manner:
Steep Slope: Signals not involved with feedback rapidly propagated through the FPGA.
Plateau: Signals dependent on feedback converge slowly.
Symbolic Term Explosion

[Figures: two ways of mixing 12 signals; one structure yields 6 signals with at most 4 terms each, while the other yields 1 signal with at most 4096 terms]
Power Measurements
• Heat Measurements
• Developed hardware instrumentation to measure surface temperature of FPGA
• Thermistor attached to FPGA with heat conductive epoxy
• Instrumentation accurate to within 0.1 degrees F
Frequency Response of the FPGA

• The FPGA consumes more power as its clock frequency rises.
• The simulator gives 125 mW + 43.6 mW/MHz for this situation.

[Plot: Surface Temperature (F) versus Frequency (0-50 MHz), temperature rising from about 120 °F to 180 °F]
Power Prediction Simulator for the Xilinx 4000-Series FPGA
(Recent Accomplishments)
• CMOS Power Consumption and Past Research
• Design and Implementation of the Power Prediction Simulator
• Preliminary Experimental Results
• Conclusions and Current Work, Demo
Conclusions and Current Work
• Designed and Implemented power prediction simulator for Xilinx 4000 series FPGAs.
• Inputs to simulator:• Place & Route bit stream (from Xilinx Tool)• Activity and Probability factors for pin signals
• Simulator calculates probabilities and activities for all internal signals
• Tool outputs power consumption of FPGA chip
• Currently calibrating/tuning simulator using both heat and DC current measurement cross-calibration methods
OutlineOutline
• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Recent Accomplishments
• Status of Deliverable Checklist
Deliverables
• Prototype VME-Based GPP/DSP/FPGA platform– 20 Slot Chassis with SPARC 5V Host– 9U VME RACE Board– 2 SHARC Daughtercards:12 SHARCs, 48MB – 2 PowerPC Daughtercards: 4 PowerPCs, 64MB– VME WILDFIRE Array Card (16 Xilinx 4028EX-3s)
√ √ √
Deliverables
• FPGA Power Prediction Simulator– Simulator Input: Probabilistic Input Data
Characteristics; FPGA configuration data file– Simulator Output: Power Prediction to within 10%
relative accuracy (expected)– Will demonstrate fidelity across different applications
and even different implementations of the same design– Will operate at interactive speeds – Completely Portable Java Implementation
√ √ √ √ √
Deliverables
• Network Simulator for Parallel STAP– Network Feature Inputs: number and types of
switching elements; interconnection scheme; number and types of processors at each network port, etc.
– Data Mapping Input: Data layout across the processors for each phase of processing
– Data Ordering Input: Order in which data items at each network port are to be transmitted
– Simulator Output: Number of network cycles required for all phases of STAP communication
– Relative accuracy of simulator 10% (expected)– Will operate at interactive speeds – Completely Portable Java Implementation
√ √ √ √ √ √ √
Deliverables
• Linear Filtering Implementation on FPGA– Investigation of different data formats and arithmetic
approaches for FPGA calculations– Demonstrate performance improvement (throughput
and/or power) over GPP/DSP implementation
• STAP Weight Equation Solver on GPP/DSP/FPGA System– Investigation of different data formats and arithmetic
approaches for FPGA calculations– Demonstrate performance improvement (throughput
and/or power) over GPP/DSP implementation
√ √ √
Deliverables
• Optimal configuration techniques for executing SAR on GPP/DSP/FPGA system– Based on optimally balancing memory and processor
utilization, selection of most appropriate data formats and arithmetic techniques, etc.
– Will utilize the FPGA power prediction simulator– Will optimally integrate most appropriate FPGA
circuit implementations and GPP/DSP algorithms– Optimization techniques based on proven
mathematical programming methods– Will demonstrate 2 to 10 times power savings over
nominal configurations of GPP/DSP systems
√ √ √
Deliverables
• Optimal configuration techniques for executing STAP on GPP/DSP/FPGA system– Techniques based on optimal data layout to minimize
latency through interconnection network, optimal combined use of processors and FPGAs for intensive weight calculation, will include desired numerical accuracy as an input parameter
– Will utilize the FPGA power prediction simulator and the network simulator for parallel STAP
– Will demonstrate 2 to 10 times power savings over nominal configurations of GPP/DSP systems
– Optimization techniques based on proven mathematical programming methods
√ √
Deliverables
• Optimal configuration techniques for SAR and STAP on GPP/DSP/FPGA system– Will generalize the SAR-only and STAP-only
configuration techniques– Will consider how to best configure the
GPP/DSP/FPGA to simultaneously satisfy both the SAR and STAP requirements and minimize power consumption
– Will demonstrate 2 to 10 times power savings over nominal configurations of GPP/DSP systems
– Optimization techniques based on proven mathematical programming methods