TRANSCRIPT
Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP

by John K. Antonio
Department of Computer Science
College of Engineering, Texas Tech University

First Annual Review, June 23, 1998
Outline
• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Recent Accomplishments
• Status of Deliverable Checklist
Configuring Combined GPP/DSP/FPGA Systems for Minimal SWAP

Applications
• SAR
• STAP

Requirements
• Throughput
• SWAP

• Combined Technology
• Minimal SWAP Configuration
• Mixed-Mode Operation
• Demonstration
Texas Tech University: John K. Antonio
New Ideas
• Systematic determination of minimal SWAP configuration based on proven mathematical programming techniques
• Optimal configuration based on automatic "tuning" of system design parameters
  - number and types of cards used
  - data mapping and communication schemes
  - place and route schemes
• Novel computing techniques based on characteristics of GPP/DSP/FPGA system
Schedule (start Jun 97; milestones Jun 98, Jun 99; end Dec 99)
• Develop optimal configuration techniques
• Construction and integration of GPP/DSP/FPGA system
• Implement and test optimal configurations on GPP/DSP/FPGA system
• Develop practical design methods based on SAR and STAP applications
• Demonstrate advantages of combining technologies
Impact
• Embedded systems requirements for the 21st century can be satisfied with the combined use of GPP, DSP, and FPGA technologies
• Demonstrate use of FPGA boards as co-processors for embedded multiprocessor GPP and DSP systems
• Demonstrate systematic approaches to optimally configure GPP/DSP/FPGA systems for minimal SWAP for embedded applications
Outline
• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Recent Accomplishments
• Status of Deliverable Checklist
Personnel (Program Management Status)
• John K. Antonio, Principal Investigator
• Ph.D., EE, Texas A&M Univ. (1989)
• Currently Assoc. Prof. of CS, Texas Tech Univ.
• Over 65 publications in HPC and related areas
• PI or co-PI of 17 contracts/grants
totaling over $2.1M
• Jeff Muehring, Research Assistant, Ph.D. student
Optimal GPP/DSP/FPGA Configuration Techniques for SAR
Intern at IBM/Houston, 1/98 to 6/98
• Jack West, Research Assistant, Ph.D. student
Optimal Mapping, Scheduling, and Configuration Techniques for STAP; Network Simulator
Personnel (Program Management Status)
• Nikhil Gupta, Research Assistant, M.S. student
Algorithms for STAP Weight Calculation Mapping Inner Product Computations onto FPGAs
Graduating July 1998
• Tim Osmulski, Research Assistant, M.S. student
Power Prediction Simulator for FPGAs
Graduated May 1998
Personnel (Program Management Status)
• Brian Veale, Research Assistant, M.S. student
Calibration of FPGA power prediction model; Implementation of STAP core on GPP/FPGA
New RA as of May 1998
• New Student, Research Assistant, M.S. student
Implementation of SAR core on GPP/FPGA
To be hired September 1998
Contacts, Partners, Vendors, and Other Communications (Program Management Status)

José Muñoz, DARPA
Ralph Kohler, Rome Lab
MIT Lincoln Lab: David Martinez, Jim Ward
MITRE: Richard Games
Northrop Grumman: Marc Campbell
Synplicity, Inc.: Madelyn Miller
Xilinx: Jason Feinsmith
Annapolis Micro Systems: Jenny Donaldson, Bill Hulbert, Paul Kowalewski
ISI: Milissa Benincasa, David Coker
Mercury Computer: Thomas Einstein, Ed Holstien, Craig Lund, Dave Toms
Equipment Status (Program Management Status)

Mercury:
• 20-Slot Hybrid Chassis with SPARC 5V
• Solaris 2.5 with C Compiler
• 9U VME RACE Board
• SHARC Daughtercard (2 CNs, 8 MB/CN)
• SHARC Daughtercard (2 CNs, 16 MB/CN)
• SHARC Daughtercard (2 CNs, 16 MB/CN)
• MC/OS, Cross Assembler, Toolkit
• PowerPC Daughtercard (2 CNs, 16 MB/CN)

Annapolis Micro Systems:
• PCI WILDONE Card (1 Xilinx 4028EX-3)
• VME WILDFIRE Array Card (16 Xilinx 4028EX-3s)

Other Vendors:
• ModelSim Simulation Software (Model Technology, Inc.)
• Synplify Synthesis Software (Synplicity, Inc.)
• Xilinx Foundation Software (Xilinx, Inc.)
Schedule of Milestones

[Gantt chart spanning June 1997 through Dec. 1999, with each task shown in its Research/Design and Implement/Test phases:]

GPP/DSP Sub-System
• Inter-GPP/DSP Communication Simulator for STAP
• Optimal GPP/DSP Configuration for SAR
• Optimal GPP/DSP Configuration for STAP
• Implement SAR on GPP/DSP
• Implement STAP on GPP/DSP

FPGA Sub-System
• Design STAP Iterative Weight Solver for FPGA
• Implement STAP Iterative Weight Solver on FPGA
• Design SAR Linear Filtering for FPGA
• Implement SAR Linear Filtering on FPGA
• Develop FPGA Power Consumption Simulator
• Test FPGA Power Consumption Simulator

GPP/DSP/FPGA System
• GPP/DSP/FPGA Platform Construction and Independent Testing of GPP/DSP and FPGA Subsystems
• GPP/DSP and FPGA Subsystem Integration and Testing
• Optimal GPP/DSP/FPGA Configuration for SAR
• Optimal GPP/DSP/FPGA Configuration for STAP
• Optimal GPP/DSP/FPGA Configuration for SAR/STAP
• Implement SAR on GPP/DSP/FPGA Platform
• Implement STAP on GPP/DSP/FPGA Platform
• Demonstrate Combined SAR/STAP on GPP/DSP/FPGA Platform
FY 97 and FY 98 Budgets (Program Management Status)

                 FY 97      FY 98      FY 98       FY 98
                 Approved   Approved   Required*   "Deficit"
Personnel         22,066     56,710     84,517      27,807
Fringes            7,575     18,871     25,723       6,852
Consulting             0          0     15,000      15,000
Expenses             260      3,321      4,500       1,179
Travel                 0      4,500      4,500           0
Equipment         74,000     55,608     85,088      29,480
Indirect Cost     13,634     39,198     62,623      23,425
Total            116,644    178,208    281,951     103,743

*Required to maintain 30-month completion date (i.e., 12/31/99).
FY 99 and FY 00 Budgets (Program Management Status)

                 FY 99      FY 00      Project Total
Personnel        138,536     52,401    297,520
Fringes           39,911     14,404     87,614
Consulting        25,000     10,000     50,000
Expenses           7,078      3,000     14,838
Travel            12,000      5,000     20,500
Equipment         59,892          0    217,670
Indirect Cost    104,587     39,858    221,121
Total            387,004    124,664    909,262
Outline
• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Recent Accomplishments
• Status of Deliverable Checklist
Recent Accomplishments
• Network Communication Time Simulator for Parallel STAP
• FPGA Inner-Product Co-Processor Designs for STAP Weight Solver
• Power Prediction Simulator for the Xilinx 4000-Series FPGA
Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)
• Space-Time Adaptive Processing (STAP) Basics
• Mercury RACE Multicomputer
• Parallelization Approach for STAP
• RACE Network Simulator
• Preliminary Numerical Studies
• Conclusions
Related STAP and RACE References

J. Ward, "Space-Time Adaptive Processing for Airborne Radar," Technical Report 1015, MIT Lincoln Laboratory, Lexington, MA, 1994.

M. F. Skalabrin and T. H. Einstein, "STAP Processing on a Multicomputer: Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor Communication," Proc. Adaptive Sensor Array Processing (ASAP) Workshop, March 1996.

The RACE Multicomputer, Hardware Theory of Operation: Processors, I/O Interface, and the RACEway Interconnect, Volume I, ver. 1.3.

T. H. Einstein, "Mercury Computer Systems' Modular Heterogeneous RACE Multicomputer," Proc. 6th Heterogeneous Computing Workshop, April 1997, pp. 60-71.

B. C. Kuszmaul, "The RACE Network Architecture," Proc. 9th Int'l Parallel Processing Symposium, April 1995, pp. 508-513.

G. Booch, I. Jacobson, and J. Rumbaugh, "The Unified Modeling Language for Object-Oriented Development," Documentation Set Version 1.1, September 1997.
SPACE-TIME ADAPTIVE PROCESSING
1. Space-Time Adaptive Processing (STAP) refers to a class of signal processing methods that operate on data collected from a set of sensors over a given time interval.
2. STAP simultaneously combines the signals received from an antenna array (spatial domain) and multiple pulse repetition periods (time domain).
3. STAP provides improved detection of smaller targets in the presence of ground clutter (overland and littoral environments) and hostile interference (electronic counter measures and jamming).
STAP PROCESSING FLOW

[Figure: The input data cube (channels × pulses × range) is pulse-compressed and rotated, Doppler-filtered, and rotated again; QR decomposition applied with the steering vectors produces the weights, which are used to beamform the data cube into the beam outputs.]
• Mercury RACE Multicomputer
• Space-Time Adaptive Processing (STAP) Basics
• Parallelization Approach for STAP
• RACE Network Simulator
• Preliminary Numerical Studies
• Conclusions
Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)
SOME RACE NETWORK FEATURES

1. 40 MHz clock, 32-bit data paths, 2048-byte circuit-switched packets.
2. Contention resolved using priorities.
   a. User-programmable message priority
   b. Hardware priority assigned at each crossbar along a path (based on complex connection rules)
3. A packet with higher priority preempts (suspends) a lower-priority packet (active or inactive) to gain control of a crossbar port.
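The preemption rule in item 3 can be sketched in a few lines of Python. This is an illustrative model only: the `Port`/`Packet` classes and the single scalar priority are assumptions, not the actual RACEway arbitration logic.

```python
# Illustrative sketch of priority-based preemption at a crossbar port.
# A higher-priority packet suspends the packet currently holding the port.

class Packet:
    def __init__(self, name, priority):
        self.name, self.priority = name, priority

class Port:
    def __init__(self):
        self.holder = None      # packet currently holding this port
        self.suspended = []     # preempted packets, resumed later

    def request(self, packet):
        """Grant the port, preempting a lower-priority holder if needed."""
        if self.holder is None:
            self.holder = packet
            return True
        if packet.priority > self.holder.priority:
            self.suspended.append(self.holder)   # suspend current holder
            self.holder = packet
            return True
        return False                             # request blocked

port = Port()
port.request(Packet("low", priority=1))
granted = port.request(Packet("high", priority=7))   # preempts "low"
```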
RACE NETWORK INTERCONNECT FAT-TREE TOPOLOGY

[Figure: Fat-tree of 6-port crossbars with compute nodes (CNs) at the leaves; a highlighted message path runs from the source CN up through the crossbar levels and back down to the destination CN.]
STANDARD CROSSBAR PRIORITY ARBITRATION ALGORITHM TABLE

                      Transaction Status
Hardware   Active, Port E Involved    Not Yet Active             Port E Not Involved
Priority   Entry Port   Exit Port     Entry Port   Exit Port     Entry Port   Exit Port
7          F            A,B,C,D,E     F            A,B,C,D,E     F            A,B,C,D
6          E            F             E            F             A,B,C,D*     A,B,C,D*
5          A,B,C,D      F             A,B,C,D      F             A,B,C,D      F
4          E            A,B,C,D       E            A,B,C,D       -            -
3          *A,B,C,D     *A,B,C,D,E    A,B,C,D*     A,B,C,D*      -            -
2          -            -             A,B,C,D      E             -            -
1          -            -             -            -             -            -

* Peer Kill Rules Apply
SHARC COMPUTE NODE

[Figure: CN ASIC block diagram. Two SHARC processors connect through the compute node ASIC, which contains ECC logic, DRAM, performance metering, a DMA controller, a 3-way data switch, RACEway mapping logic, OS support hardware, and the RACEway interface.]
• Parallelization Approach for STAP
• Space-Time Adaptive Processing (STAP) Basics
• Mercury RACE Multicomputer
• RACE Network Simulator
• Preliminary Numerical Studies
• Conclusions
Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)
1. Partition STAP data cube over a 2-D process set.
2. Process the contiguous dimension.
3. Re-partition the data cube before processing the next dimension.
4. Rotate the newly distributed data to make the next dimension sequential in memory.
5. Repeat steps 1 through 4 before each processing phase.
SUB-CUBE BAR PARTITIONING METHODOLOGY
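The partitioning step above can be sketched with NumPy. The toy cube dimensions and the 3 × 4 process-set shape follow the examples on the later slides; the function name is an illustrative assumption.

```python
import numpy as np

def sub_cube_bars(cube, rows, cols, axes):
    """Partition a 3-D data cube into rows*cols sub-cube bars.
    The dimension not listed in `axes` stays whole on every process."""
    parts = []
    for r_blk in np.array_split(cube, rows, axis=axes[0]):
        for c_blk in np.array_split(r_blk, cols, axis=axes[1]):
            parts.append(c_blk)
    return parts

# Toy cube: (channels, pulses, range) = (6, 8, 12)
cube = np.arange(6 * 8 * 12).reshape(6, 8, 12)

# Pulse compression: range dimension whole; split channels x pulses 3 x 4.
pc = sub_cube_bars(cube, 3, 4, axes=(0, 1))

# Doppler filtering: pulses dimension whole; split channels x range 3 x 4.
df = sub_cube_bars(cube, 3, 4, axes=(0, 2))
```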
STAP DATA CUBE PARTITIONING EXAMPLES

[Figure: Two partitionings of the channels × pulses × range data cube over a 3 × 4 process set (processes 1-12): pulse-compression partitioning with the range dimension whole, and Doppler-filtering partitioning with the pulses dimension whole.]
STAP DATA CUBE REPARTITIONING
Required Data Transfers

[Figure: The 3 × 4 process set shown with the range dimension contiguous (pulse-compression layout) and with the pulse dimension contiguous (Doppler-filtering layout), with the required data transfers between them.]

• Re-partitioning involves exchanging data with the next whole dimension.
• Interprocessor communication is required between processors in the same row.
STAP DATA CUBE REPARTITIONING
Data Re-distribution Mapping

[Figure: Network interconnection configuration for the re-distribution. CNs attached to 6-port crossbars hold the sub-cube bars of the pulse-compression partitioning; interprocessor communication (IPC) moves data among the CNs in each row of the process set to produce the Doppler-filtering partitioning.]
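The row-wise exchange described above can be sketched as follows; the assumption (following the slides) is that each process trades one block with every other process in its row of the process set, and the message bookkeeping here is purely illustrative.

```python
# Sketch of the re-partitioning exchange: when the whole dimension changes
# (e.g., range-whole -> pulses-whole), each process keeps one block of its
# sub-cube bar and sends the rest to the other processes in its row.

def row_exchange_messages(rows, cols):
    """List the (src, dst) transfers for one re-partitioning phase."""
    msgs = []
    for r in range(rows):
        row_procs = [r * cols + c for c in range(cols)]
        for src in row_procs:
            for dst in row_procs:
                if src != dst:
                    msgs.append((src, dst))
    return msgs

msgs = row_exchange_messages(3, 4)   # 3 x 4 process set
```

Each of the 12 processes sends to its 3 row peers, so 36 transfers total; no message crosses rows.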
• RACE Network Simulator
• Space-Time Adaptive Processing (STAP) Basics
• Mercury RACE Multicomputer
• Parallelization Approach for STAP
• Preliminary Numerical Studies
• Conclusions
Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)
RESEARCH OBJECTIVES for SIMULATOR

1. Design and implement a network simulator that models the effect that data mapping and scheduling have on the performance of a STAP algorithm.
2. Key features of the network simulator include:
   a. Developed and implemented in an OO paradigm.
   b. Implemented using a sub-cube bar partitioning scheme.
   c. Models both sub-cube bar mapping strategies and communication scheduling during both phases of data re-partitioning.
   d. Completely generic.
UML NETWORK CLASS DIAGRAM

[Figure: The Network class aggregates the Clock, Crossbar (1..*), Routing Table, File Output, Random Scan, Data Cube, and Process Set classes; the Routing Table gets data from the Process Set.]
UML CROSSBAR CLASS DIAGRAM

[Figure: A Crossbar connects to Links (2 or 6 per crossbar) and, for terminal crossbars, to Compute Nodes (0 or 4). Each Compute Node holds a Message Queue of Message objects and a Packet Stack of Packet objects.]
UML DATA CLASS DIAGRAM

[Figure: Data is an abstract class; Message and Packet inherit from it. A Packet carries a Header and a Route List (1..*) of Route objects.]
NETWORK CLASS DETAILS

Compute Node: processor information; outgoing and received message queues; outgoing and received packet stacks.

Random Scan: generates pseudo-random CN scan ordering.

Clock: based on network clock frequency (factor of 5); data transfer rate equates to effective network bandwidth.

Network methods: dynamic network construction; dynamic routing table creation; dynamic CN and CE message traffic generation; simulates packet traffic.
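The clock and bandwidth figures quoted earlier (40 MHz clock, 32-bit data path) imply a simple transfer-time model with a 160 MB/s peak rate. This is a back-of-the-envelope sketch; the efficiency parameter is an assumption added for illustration, not a simulator feature.

```python
# Bandwidth model implied by the RACE network parameters:
# a 40 MHz clock moving 32 bits (4 bytes) per cycle.

CLOCK_HZ = 40e6          # RACEway clock
BYTES_PER_CYCLE = 4      # 32-bit data path

def transfer_time_ms(nbytes, efficiency=1.0):
    """Ideal time in ms to move nbytes at the given link efficiency."""
    rate = CLOCK_HZ * BYTES_PER_CYCLE * efficiency   # bytes per second
    return nbytes / rate * 1e3

peak_mb_s = CLOCK_HZ * BYTES_PER_CYCLE / 1e6         # peak MB/s
t = transfer_time_ms(2048)                           # one max-size packet
```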
CROSSBAR CLASS DETAILS

Crossbar: two parent port connections; four child port connections; internal switch connections; four CN connections for terminal crossbars.

Link: connects crossbar objects; link status: occupied or free.

Crossbar methods: implements hardware priority arbitration (top-level and standard algorithms); queries port status; routes packets to the next location; allocates and frees internal port connections and connected Link objects; transmits packet data.
COMPUTE NODE CLASS DETAILS

Compute Node: processor information; outgoing and received message queues; outgoing and received packet stacks.

Compute Node methods: manages the outgoing and received message queues; manages the outgoing and received packet stacks; explodes the top outgoing message into packets of size 2048 bytes or less; handles DMA chaining of packets; establishes a path through the network and transmits packet data.

• Packets are self-routing.

[Figure: The top message of the Outgoing Message Queue (Message 1, 2, 3, ...) is "exploded" onto the Packet Stack as Packets 1-4.]
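The "explode" step described above can be sketched directly: the top outgoing message is split into packets of at most 2048 bytes. The function name and list-of-sizes representation are illustrative assumptions, not the simulator's actual API.

```python
# Sketch of exploding a message into packets of 2048 bytes or less.

PACKET_SIZE = 2048

def explode(message_nbytes):
    """Split a message into packet sizes no larger than PACKET_SIZE."""
    packets = []
    remaining = message_nbytes
    while remaining > 0:
        packets.append(min(PACKET_SIZE, remaining))
        remaining -= packets[-1]
    return packets

packets = explode(5000)   # -> [2048, 2048, 904]
```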
SIMULATOR UML SEQUENCE DIAGRAM

[Figure: Sequence diagram for one simulation run. The User actor supplies the data cube (e.g., R:200, P:22, C:16), the process set (CEs:48, X:6, Y:8), the routing scheme, CN traffic, and the Phase 1 DMA option. The Network builds the messages and the X, Y, and mapping matrices, then iteratively increments the simulation Clock and performs the connection/data-transfer passes (Pass 1 and Pass 2) against the Crossbars, Data Cube, Process Set, and CNs, cleans up the message matrices, and reports the messages and simulation time (e.g., Simulation Time = 2 ms).]
COMPUTE NODE UML STATECHART: Simulation PASS 1

[Figure: In the compute node subsystem, PASS 1 checks the packet stack status: if it is not empty, the top packet is popped to become the current packet; if it is empty, the message queue status is checked and the top message is exploded into packets (or the node is done when both are empty). The simulation subsystem proceeds on success (packet found or no packet) or generates an error code on error.]
COMPUTE NODE UML STATECHART: Simulation PASS 2

[Figure: PASS 2 retrieves each compute node's current packet and simulates it (PacketFound), or skips the node when there is no packet.]
PACKET UML STATECHART: Simulation Pass 1 and Pass 2

[Figure: Packet life cycle in the simulation pass subsystem: Start Up, then Ready and Active, with transitions among Blocked, Suspended, Waiting for Kill, and Completed across Pass 1 and Pass 2.]
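The packet life cycle in the statechart can be sketched as a small transition table. The specific transitions listed are an illustrative reading of the diagram (the diagram itself does not label its events), so treat this as a sketch, not the simulator's exact state machine.

```python
# Minimal sketch of the packet life cycle: Ready -> Active, preemption
# between Active and Suspended, blocking on busy links, and completion.

TRANSITIONS = {
    ("Ready", "grant"): "Active",
    ("Active", "preempt"): "Suspended",
    ("Suspended", "resume"): "Active",
    ("Active", "block"): "Blocked",
    ("Blocked", "grant"): "Active",
    ("Active", "done"): "Completed",
}

def step(state, event):
    """Advance the packet state machine; unknown transitions are errors."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} on {event}")

s = "Ready"
for e in ["grant", "preempt", "resume", "done"]:
    s = step(s, e)
```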
• Preliminary Numerical Studies
• Space-Time Adaptive Processing (STAP) Basics
• Mercury RACE Multicomputer
• Parallelization Approach for STAP
• RACE Network Simulator
• Conclusions
Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)
PROCESS SET PERFORMANCE METRIC

[Histograms of simulated communication time (Time (ms) vs. Count) comparing alternative process-set shapes:]

• Phase 1 (CN:8, R:800, P:32, C:22, Routing:E); series: CN 8 (6x4), CN 8 (4x6)
• Phase 2 (CN:8, R:800, P:32, C:22, Routing:E); series: CN 8 (6x4), CN 8 (4x6)
• Phase 1 (CN:16, R:200, P:22, C:16, Routing:F); series: CN 16 (12x4), CN 16 (8x6), CN 16 (4x12)
• Phase 2 (CN:16, R:200, P:22, C:16, Routing:F); series: CN 16 (12x4), CN 16 (8x6), CN 16 (4x12)
• Phase 1 (CN:12, R:200, P:22, C:16, Routing:F); series: CN 12 (12x3), CN 12 (9x4), CN 12 (6x6), CN 12 (4x9)
• Phase 2 (CN:12, R:200, P:22, C:16, Routing:F); series: CN 12 (12x3), CN 12 (9x4), CN 12 (6x6), CN 12 (4x9)
• Phase 1 (CN:12, R:200, P:22, C:16, Routing:F); series: CN 12 (3x12), CN 12 (12x3), CN 12 (4x9)
• Phase 2 (CN:12, R:200, P:22, C:16, Routing:F); series: CN 12 (3x12), CN 12 (12x3), CN 12 (4x9)
• Phase 1 (CN, R:200, P:22, C:16, Routing:F); series: CN 12 (3x12), CN 16 (4x12)
• Phase 2 (CN, R:200, P:22, C:16, Routing:F); series: CN 12 (3x12), CN 16 (4x12)
MESSAGE TRAFFIC PERFORMANCE METRIC

[Histograms of simulated communication time (Time (ms) vs. Count) comparing CN traffic and CE traffic:]

• Phase 1 (CN:16, X:12, Y:4, R:400, P:22, C:16, Routing:EF)
• Phase 2 (CN:16, X:12, Y:4, R:400, P:22, C:16, Routing:EF)
• Phase 1 (CN:16, X:6, Y:8, R:400, P:22, C:16, Routing:EF)
• Phase 2 (CN:16, X:6, Y:8, R:400, P:22, C:16, Routing:EF)
• Phase 1 (CN:12, X:6, Y:6, R:800, P:32, C:22, Routing:E)
• Phase 2 (CN:12, X:6, Y:6, R:800, P:32, C:22, Routing:E)
• Phase 1 (CN:12, X:9, Y:4, R:800, P:32, C:22, Routing:E)
• Phase 2 (CN:12, X:9, Y:4, R:800, P:32, C:22, Routing:E)
DMA CHAINING PERFORMANCE METRIC

[Histograms of simulated communication time (Time (ms) vs. Count) comparing DMA chaining versus no chaining:]

• Phase 1 (CE:24, X:8, Y:3, R:200, P:22, C:16, Routing:F)
• Phase 2 (CE:24, X:8, Y:3, R:200, P:22, C:16, Routing:F)
• Phase 1 (CE:24, X:8, Y:3, R:400, P:22, C:16, Routing:F)
• Phase 2 (CE:24, X:8, Y:3, R:400, P:22, C:16, Routing:F)
• Phase 1 (CE:24, X:8, Y:3, R:800, P:32, C:22, Routing:F)
• Phase 2 (CE:24, X:8, Y:3, R:800, P:32, C:22, Routing:F)
ADAPTIVE ROUTING PERFORMANCE METRIC

[Histograms of simulated communication time (Time (ms) vs. Count) comparing Adaptive E, Adaptive F, and Adaptive E/F routing:]

• Phase 1 (CN:16, X:8, Y:6, R:800, P:32, C:22)
• Phase 2 (CN:16, X:8, Y:6, R:800, P:32, C:22)
• Phase 1 (CN:16, X:8, Y:6, R:400, P:22, C:16)
• Phase 2 (CN:16, X:8, Y:6, R:400, P:22, C:16)
• Space-Time Adaptive Processing (STAP) Basics
• Mercury RACE Multicomputer
• Parallelization Approach for STAP
• RACE Network Simulator
• Preliminary Numerical Studies
• Conclusions
Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)
CONCLUSIONS

1. Designed and implemented a platform-independent simulator.
2. The simulator demonstrates that the process set, the CN and CE message traffic, DMA chaining, adaptive routing, and the scheduling of messages all affect performance.
3. Allows users to experiment with possible current and future configurations.
4. The communication pattern is implemented for STAP but may be used for other applications with a phased communication pattern.
Recent Accomplishments
• Network Communication Time Simulator for Parallel STAP
• FPGA Inner-Product Co-Processor Designs for STAP Weight Solver
• Power Prediction Simulator for the Xilinx 4000-Series FPGA
FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
(Recent Accomplishments)
• Overview of STAP Weight Calculation
• Two Candidate STAP Weight Solvers: QR Versus CG
• Two FPGA Inner-Product Circuit Designs
• Numerical Accuracy Studies
References for STAP Weight Solver and FPGA Design
J. Ward, “Space-Time Adaptive Processing for Airborne Radar,” Technical Report 1015, MIT Lincoln Laboratory, Lexington, MA, 1994.
K. C. Cain, J. A. Torres, and R. T. Williams, (R. A. Games, Project Leader), “RT_STAP: Real-Time Space-Time Adaptive Processing Benchmark,” MITRE Technical Report MTR 96B0000021, Feb. 1997.
MCARM Data Files, Rome Laboratory, (http://sunrise.oc.rl.af.mil).
D. G. Luenberger, Linear and Nonlinear Programming, Addison-Wesley, Reading, MA, 1984.
WildOne Hardware Reference Manual, Number 11927-0000, Revision 0.1, Annapolis Micro Systems, Inc., MD, 1997.
Typical STAP Processing Flow

[Figure: Input data is pulse-compressed into the data cube, Doppler-filtered, then fed both to weight computation (using the steering vector and covariance matrix) and to weight application, followed by threshold detection and the target decision. Annotated computational loads: 8%, 91.5%, and 0.5%, with weight computation dominating.]
Principle Behind STAP

[Figure: STAP CPI data cube with dimensions PRI (M = 32-128 pulses), Channels (N = 24), and Range (L = 625-2500 range gates).]
Range Segment

• Range gates are divided into non-overlapping blocks having a fixed number of range gates.
• These blocks are referred to as range segments.

[Figure: The M × N × L data cube with the L range gates divided into segments of Lr gates each.]

Number of Range Segments = L / Lr
Space-Time Adaptive Processing

• Fully Adaptive STAP
  • Works with data on all M Doppler bins and all N channels
  • Computes and applies a separate adaptive weight to every element and Doppler bin
  • The weight vector is of size MN for each range gate.
Space-Time Adaptive Processing

• Characteristics of Fully Adaptive STAP
  • Requires solving a large system of linear equations
  • Size of the linear system grows with
    • Array size (the number of channels)
    • Number of pulses

Example: for each instance, if M = 32 and N = 24, then complexity ≈ (MN)^3 = 452,984,832

• Implementation of fully adaptive STAP is impractical
  • Complexity of each instance is O((MN)^3)
  • The product MN being several hundred puts it beyond current capabilities in real-time computing
  • Instances of the problem must be solved for each range segment
Space-Time Adaptive Processing

• Partially Adaptive STAP
• Problem is broken down into a number of smaller, more manageable adaptive problems
• STAP applied to these lower-dimension problems
• The partially adaptive STAP works with data on
  • All N channels
  • And a few adjacent Doppler bins, denoted as K
• Complexity is reduced to O(M(KN)³), for K << M

Example: for each instance, if M = 32, N = 24, and K = 3, then complexity ≈ M(KN)³ = 11,943,936
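The reduction claimed above can be checked with a quick calculation, a sketch using the slide's example values (the counts are the (MN)³ and M(KN)³ estimates, not exact flop counts):

```python
# Complexity of fully adaptive versus Kth-order Doppler factored STAP,
# using the example values from the slides.
M, N, K = 32, 24, 3

fully_adaptive = (M * N) ** 3           # one solve of an MN x MN system
partially_adaptive = M * (K * N) ** 3   # M solves of KN x KN systems

print(fully_adaptive)       # 452984832
print(partially_adaptive)   # 11943936
```

The factored approach cuts the operation count by roughly a factor of 38 for these dimensions.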
Space-Time Adaptive Processing

• Kth-Order Doppler Factored STAP: an effective partially adaptive STAP technique
• The architecture consists of
  • Doppler processing across all pulse repetition intervals
  • Adaptive filtering across
    • all channels and
    • K adjacent Doppler bins
Kth-Order Doppler Factored STAP

x(k,r): 3N × 1

ψ(k,b) = R̂(k,b) = (1/Lr) Σ_{r=(b−1)Lr+1}^{b·Lr} x(k,r) x(k,r)^H
[Figure: Data matrix needed for calculating the covariance matrix for the kth Doppler bin and bth Range Segment using Kth-Order Doppler Factored STAP with K = 3; the bth range segment (with Lr cells) spans N channels and Doppler bins (k − 1), k, (k + 1)]
Matrix-Based Derivation of ψ(k,b)

X(k,b): 3N × Lr

ψ(k,b) = (1/Lr) X(k,b) X(k,b)^H = (1/Lr) Σ_{r=(b−1)Lr+1}^{b·Lr} x(k,r) x(k,r)^H

The Weight Equation: ψ(k,b) w(k,b) = s
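The two forms of ψ(k,b) above, one matrix product versus a sum of outer products, can be checked numerically; a sketch with the slide's dimensions (K = 3, N = 24) and random data:

```python
import numpy as np

# psi(k,b) formed two ways: as (1/Lr) X X^H and as the average of
# outer products x(k,r) x(k,r)^H over the range segment.
rng = np.random.default_rng(0)
N, Lr = 24, 125                       # channels, range gates per segment
X = rng.normal(size=(3 * N, Lr)) + 1j * rng.normal(size=(3 * N, Lr))

psi_mat = (X @ X.conj().T) / Lr
psi_sum = sum(np.outer(X[:, r], X[:, r].conj()) for r in range(Lr)) / Lr

print(np.allclose(psi_mat, psi_sum))  # True
```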
FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
(Recent Accomplishments)
• Overview of STAP Weight Calculation
• Two Candidate STAP Weight Solvers: QR Versus CG
• Two FPGA Inner-Product Circuit Designs
• Numerical Accuracy Studies
Methods for STAP Weight Calculation
• Two approaches to solve the weight equation
• QR-decomposition method (direct)
• Conjugate Gradient method (iterative)
STAP Weight Calculation

Using QR-decomposition Method to solve the Weight Equation ψw = s:

ψ(k,b) w(k,b) = s

(1/Lr) X(k,b) X^H(k,b) w(k,b) = s

Take QR Decomposition: X^T(k,b) = QR

(1/Lr) R^T Q^T Q^* R^* w(k,b) = s    [Note that Q^T Q^* = I]

(1/Lr) R^T R^* w(k,b) = s, so R^T R^* w(k,b) = Lr s

Let R^* w(k,b) = p

Solve for p using forward elimination: R^T p = Lr s

Solve for w(k,b) using backward substitution: R^* w(k,b) = p
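The pair of triangular solves above can be sketched and checked against the weight equation directly (hypothetical sizes; NumPy's dense QR stands in for the board's factorization):

```python
import numpy as np

# QR-based weight solve: with X^T = Q R, X X^H = R^T R^*, so
# psi w = s becomes R^T p = Lr s (forward) and R^* w = p (backward).
rng = np.random.default_rng(1)
n, Lr = 12, 100
X = rng.normal(size=(n, Lr)) + 1j * rng.normal(size=(n, Lr))
s = rng.normal(size=n) + 1j * rng.normal(size=n)

Q, R = np.linalg.qr(X.T)            # reduced QR of the Lr x n matrix
p = np.linalg.solve(R.T, Lr * s)    # forward elimination (R^T lower triangular)
w = np.linalg.solve(R.conj(), p)    # backward substitution (R^* upper triangular)

psi = (X @ X.conj().T) / Lr
print(np.allclose(psi @ w, s))      # True
```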
STAP Weight Calculation

Using Conjugate Gradient Method to solve the Weight Equation ψw = s:

Initialization: Choose w_0; set d_0 = s − ψ w_0

Iteration:

g_i = ψ w_i − s
α_i = −(g_i^T d_i) / (d_i^T ψ d_i)
w_{i+1} = w_i + α_i d_i
g_{i+1} = ψ w_{i+1} − s
β_i = (g_{i+1}^T ψ d_i) / (d_i^T ψ d_i)
d_{i+1} = −g_{i+1} + β_i d_i
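The conjugate gradient iteration above can be sketched on a small real symmetric positive definite test system (complex Hermitian ψ would use conjugate transposes in the inner products; sizes are illustrative):

```python
import numpy as np

# Conjugate gradient for psi w = s, following the iteration on the slide.
rng = np.random.default_rng(2)
n = 20
A = rng.normal(size=(n, n))
psi = A @ A.T + n * np.eye(n)        # symmetric positive definite
s = rng.normal(size=n)

w = np.zeros(n)
d = s - psi @ w                      # d0 = s - psi w0
for _ in range(n):
    g = psi @ w - s                  # gradient of the quadratic form
    if np.linalg.norm(g) < 1e-12:
        break
    pd = psi @ d
    alpha = -(g @ d) / (d @ pd)
    w = w + alpha * d
    g_next = psi @ w - s
    beta = (g_next @ pd) / (d @ pd)
    d = -g_next + beta * d

print(np.allclose(psi @ w, s))       # True
```

In exact arithmetic CG terminates in at most n steps; in practice an early stopping tolerance trades accuracy for flops, which is the QR-versus-CG tradeoff studied next.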
Preliminary Numerical Studies

[Plots: Relative Error versus CG Tolerance (10^-8 to 10^-1) for Lr = 125 and Lr = 250]

Relative Error = ||w_qr − w_cg|| / ||w_qr||
Preliminary Numerical Studies

[Plots: Flop Count (10^8 to 10^10) versus CG Tolerance (10^-8 to 10^-1) for Lr = 125 and Lr = 250, comparing CG and QR]
FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
(Recent Accomplishments)
• Overview of STAP Weight Calculation
• Two Candidate STAP Weight Solvers: QR Versus CG
• Two FPGA Inner-Product Circuit Designs
• Numerical Accuracy Studies
Motivation for FPGA Inner-Product Co-Processors
• Inner-products are a core calculation for both CG- and QR-based STAP weight solvers
• Computations are highly numeric and regular
• Opportunities to exploit reduced precision arithmetic
• Control flow of CG and QR best implemented on GPP or DSP; inner-product calculations can be offloaded to available FPGA resources
Overview of WildOne Architecture

[Block diagram: PCI bus connected to Dual-Port Memory Controllers 0 and 1 and FIFOs 0 and 1, which feed Processing Elements 0 and 1; a SIMD connector and an external I/O connector link the processing elements]
FPGA Inner Product Co-Processor: Design 1

[Circuit diagram: operands a and b pass through a multiplier and a 1's complement/register into a normalizing unit and an adder (sign + 16-bit mantissa), ending in an output register; buffers connect the FPGA board to the host processor over the interconnection bus]

• Multiply-Accumulate Pipe
  • Reads two operands per cycle
  • Performs two operations per cycle
  • Performs exponent normalization prior to accumulation
• 2 N-vectors reduced to a constant number of partial sums
FPGA Inner Product Co-Processor: Design 2

[Circuit diagram: two multipliers, each fed two operands, pass results through 1's complement/registers into an adder (sign + 16-bit mantissa); data for the first and second multipliers is buffered between the FPGA board and the host processor over the interconnection bus]

• Multiply-Add Reduction Pipe
  • Reads four operands per cycle
  • Performs three operations per cycle
  • No normalization required
  • 2 N-vectors reduced to N/2 partial sums
• Basic Tradeoff: First design has lower throughput, but can perform more work
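The two pipes reduce vectors differently; a plain-Python model of the dataflow (this models only the reduction pattern, not the circuits or their block floating point formats):

```python
# Design 1: multiply-accumulate pipe, two operands per cycle, one running sum.
# Design 2: multiply-add reduction pipe, four operands per cycle, N/2 partial
# sums that the host (or further passes) must finish reducing.
a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]

acc = 0.0                            # Design 1 output: one partial sum
for x, y in zip(a, b):
    acc += x * y

partials = [a[i] * b[i] + a[i + 1] * b[i + 1]   # Design 2 output: N/2 sums
            for i in range(0, len(a), 2)]

print(acc)        # 70.0
print(partials)   # [17.0, 53.0]
```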
[UML class diagrams:
• Basic Co-Processor Design: the Inner-Product Co-Processor contains a Block Floating Point Unit
• Block Floating Point Unit: composed of a Multiplying Unit, Complementor, Normalizing Unit, and Accumulator
• Multiplying Unit: a Multiply Stage with Register and 4-Bit Adder
• Accumulator Unit: a 4-Bit Adder, 3-Bit Adder, and Register
• Normalizing Unit: a Subtractor, Register, and Magnitude Comparator]
Sequence Diagram for Interactions between Host and FPGA Board

[Sequence diagram; messages between Host Program and Wild-One over time: open board; program the board with the image; interrupt for exponent; exponent written to FIFO; interrupt for mantissa vectors; mantissa vectors written to the FIFO; processing done, answer in FIFO/memory; read back the answer; close the board]
Statechart Diagram for Interactions between Host and FPGA Board

[Statechart; Host System: Write Exponent → Wait for Exponent Int → Write Mantissa → Wait for Mantissa Int → Wait for Answer Int → Read Back Answer; FPGA Board (Processing Sub-System): Get Exponent → Read Exponent → Get Mantissa → Read Mantissa → Multiply-and-add/accumulate → Write Back; transitions driven by Req/Ack handshake signals and a Done flag]
Circuit Activity Diagram: Design 1

[Activity diagram: Read Two Operands → Multiply → Accumulate → Feedback Sum, Increment Count → Compare Count; while Count ≠ Threshold, repeat; when Count = Threshold, Write to Memory and Set Done flag]
Circuit Activity Diagram: Design 2

[Activity diagram: Read Two Operands → Multiply → Read Next Two Operands → Multiply → Add → Increment Count → Compare Count; while Count ≠ Threshold, repeat; when Count = Threshold, Write to Memory and Set Done flag]
FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
(Recent Accomplishments)
• Overview of STAP Weight Calculation
• Two Candidate STAP Weight Solvers: QR Versus CG
• Two FPGA Inner-Product Circuit Designs
• Numerical Accuracy Studies
Setup for Numerical Accuracy Studies
• Randomly generated, 512-element test vectors processed by both designs
• Range of vectors’ data values controlled to study effect dynamic range has on accuracy
• Output of each circuit compared to corresponding results calculated on host (using IEEE 32-bit floating point arithmetic)
• Accuracy metric is ratio of obtained values to corresponding IEEE floating point value
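As an illustration of the metric (not the circuits themselves), a block floating point inner product can be emulated with a shared scale and sign + 16-bit mantissas and compared against IEEE arithmetic:

```python
import numpy as np

# Emulated block floating point: all elements share one scale (a block
# exponent) and keep 16-bit signed mantissas; accuracy is the ratio of
# the quantized inner product to the full-precision result.
rng = np.random.default_rng(3)
v1 = rng.uniform(0.0, 1.0, 512)
v2 = rng.uniform(0.0, 1.0, 512)

def block_fp(v, bits=16):
    scale = (2 ** (bits - 1) - 1) / np.max(np.abs(v))
    return np.round(v * scale) / scale

accuracy = np.dot(block_fp(v1), block_fp(v2)) / np.dot(v1, v2)
print(accuracy)   # close to 1.0 for this narrow dynamic range
```

Widening the dynamic range of the inputs forces small elements to lose mantissa bits under the shared scale, which is the effect the experiments below measure.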
Zero Order of Magnitude Experiment

[Histograms: data values uniformly distributed over roughly 0.001-1.000; exponents spanning about 114-134; accuracy for Design 1 clustered between 0.999855 and 0.99989, and for Design 2 between 0.9984 and 1.0000]
Two Orders of Magnitude Experiment

[Histograms: data values spanning roughly 0-110; exponents about 119-145; accuracy for Design 1 clustered between 0.999893 and 0.999927, and for Design 2 between 0.99399 and 0.99999]
Four Orders of Magnitude Experiment

[Histograms: data values spanning roughly 0-11,000; exponents about 119-145; accuracy for Design 1 remains clustered between 0.999889 and 0.99993, while Design 2 accuracy spreads from about 0.467 to 1.000]
Five Orders of Magnitude Experiment

[Histograms: data values spanning roughly 0-103,000; exponents about 119-143; accuracy for Design 1 remains clustered between 0.999912 and 0.999998, while Design 2 accuracy spreads over nearly the full 0.0-1.0 range]
"Outlier" Experiment

[Histograms: data values concentrated near zero with a few outliers up to 1000; exponents about 114-138; accuracy for Design 1 spreads from about 0.593 to 0.784, and for Design 2 from 0.00 to 0.92]
Conclusions
• CG weight solver provides tradeoff between accuracy and required FLOPs(compared to QR weight solver)
• Tradeoff between two FPGA designs: Design 1 (Mult & Accum) has lower peak throughput, but can perform more total work than Design 2
• Block floating point provides acceptable accuracy for uniformly distributed data over reasonable dynamic ranges
• Block floating point accuracy breaks down when there are a few large outliers in the data set
Recent Accomplishments
• Network Communication Time Simulator for Parallel STAP
• FPGA Inner-Product Co-Processor Designs for STAP Weight Solver
• Power Prediction Simulator for the Xilinx 4000-Series FPGA
Power Prediction Simulator for the Xilinx 4000-Series FPGA
(Recent Accomplishments)
• CMOS Power Consumption and Past Research
• Design and Implementation of the Power Prediction Simulator
• Preliminary Experimental Results
• Conclusions and Current Work, Demo
References for FPGA Power Prediction
K. P. Parker and E. J. McCluskey, “Probabilistic Treatment of General Combinatorial Networks,” IEEE Trans. Computers, Vol. C-24, June 1975, pp. 668-670.
Kaushik Roy and Sharat Prasad, “Circuit Activity Based LogicSynthesis for Low Power Reliable Operations,” IEEE Trans. VLSI Systems, Vol. 1, No. 4, Dec.1993, pp.
Kaushik Roy, “Power Dissipation Driven FPGA Place and Route under Timing Constraints,” School of Electrical and Computer Engineering, Purdue University.
“XC4000 Series Field Programmable Gate Arrays,” Xilinx, Inc., September 18, 1996.
Power Dissipation in CMOS

[Diagram; three components of power dissipation:
• Leakage Current
• Transient Current: dependent on signal activity
• Dynamic Capacitance Charging Current: most important for CMOS; dependent on clock frequency and signal activity]
Power Equations

Equivalent model of a transistor's gate...

v_C(t) = V(1 − e^(−t/RC))

v_R(t) = V e^(−t/RC)

p_R(t) = (V²/R) e^(−2t/RC)

p_avg = (1/τ) ∫₀^τ (V²/R) e^(−2t/RC) dt = (CV²/2τ)(1 − e^(−2τ/RC)) ≈ CV²/2τ
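The averaging step can be checked numerically; a sketch with illustrative R, C, V, and τ values (not measured gate parameters):

```python
import math

# Average the resistor power p_R(t) = (V^2/R) e^(-2t/RC) over one period
# tau and compare with the closed form (C V^2 / 2 tau)(1 - e^(-2 tau/RC)).
V, R, C = 3.3, 1.0e3, 1.0e-12        # hypothetical gate model values
tau = 1.0e-8                          # clock period

n = 100_000
dt = tau / n
p_num = sum((V * V / R) * math.exp(-2 * (i + 0.5) * dt / (R * C))
            for i in range(n)) * dt / tau        # midpoint-rule integral
p_closed = (C * V * V / (2 * tau)) * (1 - math.exp(-2 * tau / (R * C)))

print(abs(p_num - p_closed) / p_closed < 1e-6)   # True
```

For τ much larger than RC the exponential term vanishes and the average collapses to CV²/2τ, the familiar capacitor-charging energy per cycle.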
Probabilistic Modeling

p(s): the probability that signal s attains a logical value of true at any given clock cycle.

A(s): the probability that signal s transitions at any given clock cycle.

[Example values: p(clock) = 0.50, A(clock) = 1.0; p(x1) = 0.88, A(x1) = 0.10; p(x2) = 0.29, A(x2) = 0.17; p(x3) = 0.69, A(x3) = 0.27]
Probabilistic Modeling

[Figure: example logic network with inputs x1(t), x2(t), x3(t), internal signals x1x2(t) and x1x2x3(t), and output y; signals annotated p = 0.88, A = 0.10; p = 0.29, A = 0.17; p = 0.69, A = 0.27; p = 0.83, A = 0.17; p = 0.10, A = 0.13]

Calculation of average power:

P_avg = (1/2) V² Σ_{g ∈ all gates} C_g A_g
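A toy instance of these quantities (the capacitances and the simplified AND-gate activity rule are illustrative assumptions, not the cited transformations, which also account for correlations and simultaneous switching):

```python
# Propagate probability and activity through a 2-input AND gate with
# independent inputs, then evaluate P_avg = 0.5 V^2 * sum(C_g * A_g).
p_x1, A_x1 = 0.88, 0.10
p_x2, A_x2 = 0.29, 0.17

p_y = p_x1 * p_x2                    # P(x1 AND x2) for independent inputs
A_y = p_x1 * A_x2 + p_x2 * A_x1      # first-order activity estimate

V = 3.3
gates = [(1.0e-12, A_x1), (1.2e-12, A_x2), (0.8e-12, A_y)]  # (C_g, A_g)
P_avg = 0.5 * V * V * sum(c * a for c, a in gates)

print(round(p_y, 4))   # 0.2552
print(round(A_y, 4))   # 0.1786
```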
Probabilistic Equations

Signal probability transformations...*

[Equation: p(y) expressed as a sum, over the product terms of f, of products of the input signal probabilities]

Signal activity transformations...†

[Equation: A(y) expressed as a probability-weighted sum over single, pairwise, and three-way simultaneous input transitions, each weighted by the probability that the transition changes f(X)]

* Probabilistic Treatment of General Combinatorial Networks
† Estimation of Circuit Activity Considering Signal Correlations and Simultaneous Switching
Power Prediction Simulator for the Xilinx 4000-Series FPGA
(Recent Accomplishments)
• CMOS Power Consumption and Past Research
• Design and Implementation of the Power Prediction Simulator
• Preliminary Experimental Results
• Conclusions and Current Work, Demo
FPGA Design

[Figure: FPGA internal structure; CLBs, IOBs, and buffers]

Routing Fabric Design

[Figure: example routings]

Xilinx 4000 series routing fabric is very intricate. Xilinx synthesis tools use shortest-path routing where possible. The distance the signal travels is the metric considered in this model.
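The distance metric can be sketched directly (grid coordinates are hypothetical CLB positions):

```python
# Manhattan distance between source and destination CLBs, used as the
# routing-cost metric in this model.
def manhattan(src, dst):
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

print(manhattan((0, 0), (3, 2)))   # 5
```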
Signal Design

Attributes tracked per signal: Symbolic Probability, Numeric Probability, Numeric Activity, Signal Reference, Manhattan Distance

[Figure: local (L) versus remote (R) signals between CLBs]
Iteration Example

[Figures: a 4 × 4 grid of LUTs and interconnection; signals labeled local (L) or remote (R) as probabilities propagate through the fabric across iterations]
Probabilistic Feedback Example
ab
d
ec pe
pa
pb
• Feedback Circuits Require Symbolic Iteration of Probability Expressions
• Assume pa , pb , pe are known; then pd and pc are determined using iteration
pd
d = a + bc
dc
pc
c = d e
Iteration 1:
pd = pa
pc = pa pe
Iteration 2:
pd = pa + pa pb pe
pc = (pa pe + pa pb pe) pe = pa pe
Iteration 3:
pd = pa + pa pb pe
pc = pa pe
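The same fixed point can be approached numerically; a sketch that iterates the probability expressions under an independence assumption (it converges near, but not exactly to, the symbolic values, because independence ignores the reconvergent correlation that the symbolic absorption captures):

```python
# Iterate p_d = P(a OR bc), p_c = P(d AND e) from p_c = 0 until the
# values stop changing (inputs treated as independent).
p_a, p_b, p_e = 0.88, 0.29, 0.69

p_c = 0.0
for _ in range(100):
    p_d = p_a + p_b * p_c - p_a * p_b * p_c   # OR of independent events
    p_c_next = p_d * p_e                       # AND of independent events
    if abs(p_c_next - p_c) < 1e-12:
        break
    p_c = p_c_next

print(round(p_d, 4), round(p_c, 4))
```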
Power Prediction Simulator for the Xilinx 4000-Series FPGA
(Recent Accomplishments)
• CMOS Power Consumption and Past Research
• Design and Implementation of the Power Prediction Simulator
• Preliminary Experimental Results
• Conclusions and Current Work, Demo
Experimental Results
Probabilistic signals are correctly propagated through combinational and sequential logic.
Configurations making use of feedback converge for all test cases.
Probabilistic modeling is more than an order of magnitude faster than time-domain modeling techniques.
Convergence of Probabilistic Signals

[Plot: Probability Convergence; % convergence versus iteration (0-10) for test cases Adder4, FIFO, PipeAdder, and Mult32]

All test cases converged in the following manner:
Steep Slope: Signals not involved with feedback rapidly propagated through the FPGA.
Plateau: Signals dependent on feedback converge slowly.
Symbolic Term Explosion

[Figures: two ways of mixing 12 signals; one structure yields 6 signals with at most 4 terms each, while the other yields 1 signal with at most 4096 terms]
Power Measurements
• Heat Measurements
• Developed hardware instrumentation to measure surface temperature of FPGA
• Thermistor attached to FPGA with heat conductive epoxy
• Instrumentation accurate to within 0.1 degrees F
Frequency Response of the FPGA

• The FPGA consumes more power as its clock frequency rises.
• The simulator gives 125 mW + 43.6 mW/MHz for this situation.

[Plot: Surface Temperature (F) versus Frequency (0-50 MHz), temperature rising from about 120 °F to 180 °F]
Power Prediction Simulator for the Xilinx 4000-Series FPGA
(Recent Accomplishments)
• CMOS Power Consumption and Past Research
• Design and Implementation of the Power Prediction Simulator
• Preliminary Experimental Results
• Conclusions and Current Work, Demo
Conclusions and Current Work
• Designed and Implemented power prediction simulator for Xilinx 4000 series FPGAs.
• Inputs to simulator:• Place & Route bit stream (from Xilinx Tool)• Activity and Probability factors for pin signals
• Simulator calculates probabilities and activities for all internal signals
• Tool outputs power consumption of FPGA chip
• Currently calibrating/tuning simulator using both heat and DC current measurement cross-calibration methods
OutlineOutline
• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Recent Accomplishments
• Status of Deliverable Checklist
Deliverables
• Prototype VME-Based GPP/DSP/FPGA platform– 20 Slot Chassis with SPARC 5V Host– 9U VME RACE Board– 2 SHARC Daughtercards:12 SHARCs, 48MB – 2 PowerPC Daughtercards: 4 PowerPCs, 64MB– VME WILDFIRE Array Card (16 Xilinx 4028EX-3s)
√ √ √
Deliverables
• FPGA Power Prediction Simulator– Simulator Input: Probabilistic Input Data
Characteristics; FPGA configuration data file– Simulator Output: Power Prediction to within 10%
relative accuracy (expected)– Will demonstrate fidelity across different applications
and even different implementations of the same design– Will operate at interactive speeds – Completely Portable Java Implementation
√ √ √ √ √
Deliverables
• Network Simulator for Parallel STAP– Network Feature Inputs: number and types of
switching elements; interconnection scheme; number and types of processors at each network port, etc.
– Data Mapping Input: Data layout across the processors for each phase of processing
– Data Ordering Input: Order in which data items at each network port are to be transmitted
– Simulator Output: Number of network cycles required for all phases of STAP communication
– Relative accuracy of simulator 10% (expected)– Will operate at interactive speeds – Completely Portable Java Implementation
√ √ √ √ √ √ √
Deliverables
• Linear Filtering Implementation on FPGA– Investigation of different data formats and arithmetic
approaches for FPGA calculations– Demonstrate performance improvement (throughput
and/or power) over GPP/DSP implementation
• STAP Weight Equation Solver on GPP/DSP/FPGA System– Investigation of different data formats and arithmetic
approaches for FPGA calculations– Demonstrate performance improvement (throughput
and/or power) over GPP/DSP implementation
√ √ √
Deliverables
• Optimal configuration techniques for executing SAR on GPP/DSP/FPGA system– Based on optimally balancing memory and processor
utilization, selection of most appropriate data formats and arithmetic techniques, etc.
– Will utilize the FPGA power prediction simulator– Will optimally integrate most appropriate FPGA
circuit implementations and GPP/DSP algorithms– Optimization techniques based on proven
mathematical programming methods– Will demonstrate 2 to 10 times power savings over
nominal configurations of GPP/DSP systems
√ √ √
Deliverables
• Optimal configuration techniques for executing STAP on GPP/DSP/FPGA system– Techniques based on optimal data layout to minimize
latency through interconnection network, optimal combined use of processors and FPGAs for intensive weight calculation, will include desired numerical accuracy as an input parameter
– Will utilize the FPGA power prediction simulator and the network simulator for parallel STAP
– Will demonstrate 2 to 10 times power savings over nominal configurations of GPP/DSP systems
– Optimization techniques based on proven mathematical programming methods
√ √
Deliverables
• Optimal configuration techniques for SAR and STAP on GPP/DSP/FPGA system– Will generalize the SAR-only and STAP-only
configuration techniques– Will consider how to best configure the
GPP/DSP/FPGA to simultaneously satisfy both the SAR and STAP requirements and minimize power consumption
– Will demonstrate 2 to 10 times power savings over nominal configurations of GPP/DSP systems
– Optimization techniques based on proven mathematical programming methods