Indexing Stream Register Files
Nuwan Jayasena
10/8/2002
04/19/23 NSJ 2
Indexing Stream Register Files
• Motivation
• Architecture overview
• Usage examples
• Language and compiler issues
• Implementation issues
04/19/23 NSJ 3
Stream Memory Hierarchy
• Roughly order of magnitude increase in BW at each level
• Maximize data reuse at each level
• Focus on Stream Register File (SRF) for this talk
[Figure: memory hierarchy with levels Memory Sys, Stream RF, Local Registers, Compute Units; data labels Streams, Records, Local variables]
04/19/23 NSJ 4
SRF Data Reuse
• Current SRF only supports in-order reuse
• Indexed access to SRF allows reordered reuse
[Figure: in-order reuse (temporal or producer-consumer locality) vs. reordered reuse (app-dependent reordering, data-dependent reordering)]
04/19/23 NSJ 5
SRF-Memory Stream Transfers
• Types of stream transfers
– Compulsory: application I/O
– Capacity: due to SRF capacity pressure
– Reordering: re-ordering of data already in SRF
• SRF indexing…
– Eliminates most reordering transfers
– Reduces data replication in SRF
  • Eliminates some capacity transfers
04/19/23 NSJ 6
Architecture Overview
• High-level view of SRF indexing implementation
• Mostly to highlight capabilities and limitations of SRF indexing
• More detailed view of hardware and mechanisms later
04/19/23 NSJ 7
Current Stream Processor Arch.
• N “lanes”, each with an SRF bank and a compute cluster
• Cross-lane communication via inter-cluster switch
[Figure: SRF Bank 0, SRF Bank 1, …, SRF Bank N; stream buffers; Cluster 0, Cluster 1, …, Cluster N; inter-cluster switch]
04/19/23 NSJ 8
In-lane SRF Indexing
• Each cluster can index into its own bank of the SRF
• Address queue between cluster and SRF bank
• Sequence of steps for indexed read:
– Cluster places index in address queue
– Bank read using index
– Result placed in stream buffer
– Cluster reads data from stream buffer
(+) High-bandwidth indexed accesses
(+) Few changes to existing architecture
(–) Only 1/N of data structure visible within each cluster
[Figure: SRF Bank X and Cluster X]
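The four-step indexed read listed above can be sketched as a small software model. This is only an illustration: the class name `Lane`, its method names, and the one-read-per-call bank behavior are assumptions made for the sketch, not part of the proposed hardware.

```python
from collections import deque


class Lane:
    """Hypothetical model of one lane: an SRF bank plus its two queues."""

    def __init__(self, bank_words):
        self.bank = list(bank_words)   # this lane's 1/N slice of the SRF
        self.address_queue = deque()   # indices travelling cluster -> bank
        self.stream_buffer = deque()   # data travelling bank -> cluster

    def issue_index(self, index):
        # Step 1: cluster places index in address queue.
        self.address_queue.append(index)

    def bank_cycle(self):
        # Steps 2 and 3: bank read using the oldest queued index,
        # result placed in the stream buffer.
        if self.address_queue:
            index = self.address_queue.popleft()
            self.stream_buffer.append(self.bank[index])

    def read_data(self):
        # Step 4: cluster reads data from the stream buffer.
        return self.stream_buffer.popleft()


lane = Lane([10, 11, 12, 13])
for i in (3, 0, 2):
    lane.issue_index(i)
for _ in range(3):
    lane.bank_cycle()
result = [lane.read_data() for _ in range(3)]
print(result)
```

Data comes back in the order the indices were issued, which is what lets the cluster treat the indexed read like any other stream read.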
04/19/23 NSJ 9
Cross-lane SRF Indexing
• Any cluster can access any SRF location
• Adds interconnect between clusters for address communication
• Data return takes place over existing inter-cluster network
[Figure: SRF address switch; SRF Bank 0, SRF Bank 1, …, SRF Bank 7; Cluster 0, Cluster 1, …, Cluster 7; inter-cluster switch]
04/19/23 NSJ 10
Cross-lane SRF Indexing (Contd.)
• Sequence of steps
– Clusters place indices in their own index queues
– Indices broadcast on address switch
– Arbitrate to resolve bank conflicts
– Access SRF banks and return data via inter-cluster network
– Write data into requesting clusters’ stream buffers
– Clusters read data from stream buffers
(+) Entire data structure visible to all clusters
(–) Low bandwidth (1 word/cycle/cluster peak)
(–) Extra hardware for cross-lane index issue
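The effect of bank-conflict arbitration on cross-lane bandwidth can be sketched with a toy cycle counter. Two things here are assumptions for illustration only: addresses map to banks in contiguous blocks (`address // words_per_bank`), and conflicts are resolved by fixed priority in cluster order. The slides do not specify either policy.

```python
def cross_lane_cycles(addresses, words_per_bank):
    """Cycles needed to serve one index per cluster, one word per bank per cycle.

    addresses: global SRF word addresses, one per requesting cluster.
    Returns (cycle_count, schedule), schedule entries being (cycle, cluster, bank).
    """
    pending = list(enumerate(addresses))
    cycle = 0
    schedule = []
    while pending:
        busy_banks = set()
        retry = []
        for cluster, address in pending:
            bank = address // words_per_bank   # assumed block mapping
            if bank in busy_banks:
                retry.append((cluster, address))   # lost arbitration this cycle
            else:
                busy_banks.add(bank)
                schedule.append((cycle, cluster, bank))
        pending = retry
        cycle += 1
    return cycle, schedule


no_conflict = cross_lane_cycles([0, 8, 16, 24], words_per_bank=8)[0]
all_conflict = cross_lane_cycles([0, 1, 2, 3], words_per_bank=8)[0]
print(no_conflict, all_conflict)
```

When every cluster targets a different bank the requests complete in one cycle; when all target the same bank they serialize, which is why the 1 word/cycle/cluster figure is a peak rather than a sustained rate.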
04/19/23 NSJ 11
Usage Examples
• Application-specific uses
– Efficient access to application data structures
• System-level uses
– Hide hardware limitations
04/19/23 NSJ 12
Multidimensional Data w/o SRF Indexing
• 90° rotation (“corner-turn”) between accesses along different dimensions
[Figure: timeline across Memory, SRF, and Clusters showing Compute, Rotate, Compute phases over time]
04/19/23 NSJ 13
Multidimensional Data w/ SRF Indexing
• Accesses along 2nd dimension can typically use in-lane indexing
• Eliminates data reordering through memory, which reduces reordering stream transfers to/from the memory system
[Figure: timeline across Memory, SRF, and Clusters showing two back-to-back Compute phases]
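As a rough illustration of access along the second dimension, assume a tile of the array sits row-major in one lane's SRF bank (the layout is an assumption of this sketch; the slides do not fix one). Walking a column is then a strided sequence of in-lane indices, with no rotation pass through memory. The helper name `column_indices` is hypothetical.

```python
def column_indices(col, rows, row_length):
    """In-lane indices that walk down one column of a row-major tile."""
    return [r * row_length + col for r in range(rows)]


# A 3x4 tile stored row-major in one bank:
# [[0, 1, 2, 3],
#  [4, 5, 6, 7],
#  [8, 9, 10, 11]]
bank = list(range(12))

row_1 = bank[4:8]                                            # sequential access
col_1 = [bank[i] for i in column_indices(1, rows=3, row_length=4)]  # indexed access
print(row_1, col_1)
```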
04/19/23 NSJ 14
Regular Grid Stencils w/o SRF Indexing
• Each row is a different stream, all streams consumed at same rate
• Values from adjacent columns communicated among neighbor lanes
• 3 streams for 2D grid with 1-wide stencil
• Many streams for higher dimension grids and/or wider stencils
– Number of streams currently limited by hardware resources
04/19/23 NSJ 15
Regular Grid Stencils w/ SRF Indexing
• Primary stream consumed sequentially
• Accesses within vertical planes use in-lane indexing
• Values from adjacent vertical planes communicated among neighbor lanes (same as unindexed case)
• Reduces number of streams needed
– May reduce reordering and/or redundant transfers
04/19/23 NSJ 16
Arbitrary Stencils w/o SRF Indexing
• Repeated accesses to the same node lead to data replication in SRF
[Figure: timeline across Memory, SRF, and Clusters showing Index Gen, Lookup, and Compute phases]
04/19/23 NSJ 17
Arbitrary Stencils w/ SRF Indexing
• Cross-lane indexing supports arbitrary access pattern
• Eliminates data replication in SRF
– May reduce capacity stream transfers
– Increases strip size
• Reduces redundant transfers from memory system
[Figure: timeline across Memory, SRF, and Clusters showing a single Compute phase]
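The replication saving can be made concrete by counting SRF words for a small irregular stencil. This sketch counts node data only and ignores any space taken by the indices themselves; the function name and the example neighbor lists are made up for illustration.

```python
def srf_node_words(neighbor_lists, words_per_node=1):
    """SRF words of node data needed with and without indexing.

    neighbor_lists: for each output point, the node ids its stencil reads.
    Without indexing, every reference gets its own copy in the gathered stream.
    With indexing, each distinct node is stored once and fetched by index.
    """
    references = sum(len(nbrs) for nbrs in neighbor_lists)
    unique_nodes = {node for nbrs in neighbor_lists for node in nbrs}
    replicated = references * words_per_node
    indexed = len(unique_nodes) * words_per_node
    return replicated, indexed


# Three output points, each reading three of four shared nodes.
replicated, indexed = srf_node_words([[0, 1, 2], [1, 2, 3], [2, 3, 0]])
print(replicated, indexed)
```

The same SRF capacity therefore holds a larger strip when nodes are stored once, which is the "increases strip size" point above.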
04/19/23 NSJ 18
Sub-stream Extraction w/o SRF Indexing
• Splitting records requires a pass through memory or passing useless data through clusters
• Same for selecting subset of records
[Figure: timeline across Memory, SRF, and Clusters showing Compute, Extract, Compute phases]
04/19/23 NSJ 19
Sub-stream Extraction w/ SRF Indexing
• In-lane indexing to select words from records
• Selecting subset of records may require cross-lane indexing to preserve ordering
[Figure: timeline across Memory, SRF, and Clusters showing two back-to-back Compute phases]
04/19/23 NSJ 20
Virtual Streams
• Current SRF has hard limit on number of streams used by a kernel
– Imposed by hardware constraints
– Exceeding limit requires merging streams, splitting kernels or other workarounds
• Indexing into SRF provides a mechanism to access any number of sequences
– Essentially multiplexes multiple logical streams onto one hardware stream
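One way to picture the multiplexing is a per-stream base offset and read position kept in software, with every access going through a single indexed region. The class `VirtualStreams`, its fields, and the back-to-back packing of the logical streams are all assumptions of this sketch, not a described compiler mechanism.

```python
class VirtualStreams:
    """Hypothetical sketch: several logical streams over one indexed SRF region."""

    def __init__(self, srf_region, lengths):
        self.srf = srf_region
        self.lengths = list(lengths)
        self.base = []
        offset = 0
        for n in self.lengths:          # pack logical streams back to back
            self.base.append(offset)
            offset += n
        self.pos = [0] * len(self.lengths)

    def read(self, stream_id):
        """Sequential read from one logical stream, issued as an indexed access."""
        if self.pos[stream_id] >= self.lengths[stream_id]:
            raise IndexError("logical stream exhausted")
        index = self.base[stream_id] + self.pos[stream_id]
        self.pos[stream_id] += 1
        return self.srf[index]


region = ["a0", "a1", "b0", "b1", "b2", "c0"]
streams = VirtualStreams(region, lengths=[2, 3, 1])
order = [streams.read(1), streams.read(0), streams.read(1), streams.read(2)]
print(order)
```

Each logical stream still advances in order, but the kernel can interleave reads across any number of them while using one hardware stream.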
04/19/23 NSJ 21
Other Uses
• Space allocation for variable length streams
– Current SRF requires space allocation for worst-case stream size for variable length streams
– Indexing can be used to allocate for the common case and degrade gracefully on overflow
• Spill local variables from kernels
– Reduce register pressure for large kernels
• Etc.
04/19/23 NSJ 22
Summary of Benefits
• Reduce memory system bandwidth demands
– Most reordering transfers and some capacity transfers
• Reduce SRF capacity pressure by eliminating replication
– Increases strip sizes
• Collapsing/eliminating index generation and/or reordering steps at stream level potentially shortens software pipeline length
– Increases strip size
• Flexible stream control
– More streams per kernel than hardware supports
– Efficient SRF allocation for variable length streams
04/19/23 NSJ 23
Language & Compiler Issues
• System-level issues should clearly be handled by compiler/scheduler
– Virtual streams
– SRF allocation for variable length streams
– Register spilling etc.
• How many of the application-level uses can be inferred by compiler?
– Substream extraction, regular stencils etc. can be inferred w/o programmer help?
– Multi-dimensional data structures, irregular stencils etc. need programmer help?
• If so, what should the API be?
04/19/23 NSJ 24
Implementation Issues
• Hiding indexed SRF access latency
• Merging scratchpad and SRF
• SRF access arbitration
• Memory array implementation
04/19/23 NSJ 25
Hiding SRF Access Delay
• Kernels are statically scheduled
• SRF access by streams is dynamically arbitrated
– Allows optimal run-time allocation of SRF BW to cluster and memory streams
– Address generation for sequential streams can run arbitrarily ahead to hide arbitration delay
• Indexed accesses are treated much like another stream for arbitration purposes
– In order to hide arbitration and access delay for reads, SRF indices must be issued early and data read a few cycles later
– Breaks indexed accesses into two distinct ops at machine level
04/19/23 NSJ 26
Hiding SRF Access Delay (Contd.)
• Split read operation example:
• Address/data separation is not critical for writes
User pseudocode:
Kernel XYZ(…, idx_istream<int> S1, …) {
  int a, b, R, S;
  loop(…) {
    Independent_ops;
    a = addr_compute1();
    S1[a] >> R;
    b = addr_compute2();
    S1[b] >> S;
    Use(R, S);
  }
}
Post-compile pseudocode:
loop(…) {
  a = addr_compute1();
  S1.index(a);
  b = addr_compute2();
  S1.index(b);
  Independent_ops;
  S1 >> R;
  S1 >> S;
  Use(R, S);
}
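The benefit of splitting an indexed read into an early index issue and a later data read can be sketched with a toy timing model. The class `IndexedInputStream`, the fixed 4-cycle latency, and the cycle costs charged to each operation are assumptions chosen for illustration; real arbitration delay is variable.

```python
from collections import deque


class IndexedInputStream:
    """Toy model: index() starts an access, read() returns data in issue order."""

    def __init__(self, srf_data, latency):
        self.srf = srf_data
        self.latency = latency          # assumed fixed access + arbitration delay
        self.in_flight = deque()        # (cycle when data is ready, value)
        self.cycle = 0

    def tick(self, n=1):
        self.cycle += n                 # time spent on other work

    def index(self, addr):
        self.in_flight.append((self.cycle + self.latency, self.srf[addr]))

    def read(self):
        ready, value = self.in_flight.popleft()
        stall = max(0, ready - self.cycle)
        self.cycle += stall             # wait if the data has not arrived yet
        return value, stall


def run(split):
    s = IndexedInputStream([5, 6, 7, 8], latency=4)
    if split:                           # post-compile order: indices first
        s.index(2); s.tick()
        s.index(0); s.tick()
        s.tick(3)                       # Independent_ops overlap the access
        r, stall_r = s.read()
        q, stall_q = s.read()
    else:                               # naive order: read right after index
        s.tick(3)                       # Independent_ops
        s.index(2); r, stall_r = s.read()
        s.index(0); q, stall_q = s.read()
    return (r, q), stall_r + stall_q


print(run(split=True), run(split=False))
```

Both orderings return the same values, but in this model only the split ordering hides the access delay behind the independent operations.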
04/19/23 NSJ 27
Merging Scratchpad w/ Indexable SRF
• Data structures in SRF are typically read-only or write-only
• Scratchpad needs to support read/write data
– Pending writes are matched against new reads, and multiple writes to the same location are collapsed
• Special high-priority reads that preempt all other SRF accesses and complete within a fixed latency
– Reads are performed immediately after matching with pending writes (if no match found) to avoid ordering problems
• Must sustain at least the current scratchpad bandwidth: one read and one write every cycle
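The matching of reads against pending writes, and the collapsing of repeated writes, can be sketched as a small write buffer in front of the array. The class `Scratchpad`, its method names, and the choice to drain one write at a time are assumptions of this sketch rather than the proposed circuit.

```python
class Scratchpad:
    """Hypothetical sketch of a pending-write buffer in front of an SRF bank."""

    def __init__(self, size):
        self.array = [0] * size
        self.pending = {}               # address -> newest value not yet written

    def write(self, addr, value):
        # A second write to the same address overwrites the pending entry,
        # so multiple writes to one location collapse into one.
        self.pending[addr] = value

    def read(self, addr):
        # New reads are matched against pending writes first.
        if addr in self.pending:
            return self.pending[addr]
        # No match: the read goes to the array immediately.
        return self.array[addr]

    def drain_one(self):
        # Retire one pending write into the array when bandwidth allows.
        if self.pending:
            addr = next(iter(self.pending))
            self.array[addr] = self.pending.pop(addr)


sp = Scratchpad(4)
sp.write(1, 10)
sp.write(1, 20)                         # collapses with the earlier write
pending_count = len(sp.pending)
forwarded = sp.read(1)                  # served from the pending write
untouched = sp.read(2)                  # served from the array
sp.drain_one()
after_drain = sp.array[1]
print(pending_count, forwarded, untouched, after_drain)
```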
04/19/23 NSJ 28
SRF Memory Array Implementation
• SSS SRF:
– 64K words total, 4K words per cluster
• Non-indexable bank can be implemented as a single 512x512 bit macro
• Indexing requires some form of banking to sustain a few words/cycle of bandwidth for scratchpad + SRF accesses
04/19/23 NSJ 29
SRF Memory Array Implementation (Contd.)
• Non-indexed SRF bank
• 512x512 macro
• 4x4 array of blocks assuming 128x128 blocks
• 2:1 column decode to sustain 4 words/cycle peak BW
• Key is to support word-granularity indexed access w/o losing implementation and power efficiency for wide sequential reads
[Figure: SRAM array with row decoder (Row Dec.), column decoder (Col. Dec.), and Rd/Wr circuits]
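The array numbers on these slides can be cross-checked with a few lines of arithmetic. The slides do not state the word width; the 64-bit word used below is an assumption, chosen because it is the width at which 4K words per cluster fills a 512x512 bit macro exactly.

```python
WORD_BITS = 64                 # assumption: word width is not given on the slides

total_words = 64 * 1024        # 64K words total
words_per_bank = 4 * 1024      # 4K words per cluster
clusters = total_words // words_per_bank

macro_rows, macro_cols = 512, 512
bank_bits = words_per_bank * WORD_BITS
fits_macro = bank_bits == macro_rows * macro_cols

block_side = 128
blocks_per_side = macro_rows // block_side      # 4, giving a 4x4 array of blocks

column_decode = 2                                # 2:1 column decode
bits_per_access = macro_cols // column_decode    # half of a 512-bit row
words_per_cycle = bits_per_access // WORD_BITS   # peak sequential bandwidth

print(clusters, fits_macro, blocks_per_side, words_per_cycle)
```

Under that assumption the figures agree with one another: 16 banks, a 4x4 array of 128x128 blocks per bank, and 4 words per cycle from a 256-bit access.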
04/19/23 NSJ 30
SRF Memory Array Implementation (Contd.)
• Option 1: Multiple narrow columns
• One word per cycle per bank
• All accesses are one word wide
– Best BW utilization for mixed indexed and stream accesses
• High area overhead due to replicated row decoders
• No replication of column decoders and rd/wr circuits
• Power in SRAM array(s) comparable to non-banked memory
[Figure: multiple narrow columns (Col.), each with its own row decoder (Row Dec.), and Col. Rd/Wr circuits]
04/19/23 NSJ 31
SRF Memory Array Implementation (Contd.)
• Option 2: Multiple banks along rows of blocks
• Leverage hierarchical bitlines with additional muxing
• With appropriate data interleaving, mux area fairly small
• Low area overhead
• Low power for wide accesses only
• BW utilization may be suboptimal for mixed stream and indexed accesses
[Figure: rows of blocks, each row with a Mux, and Rd/Wr circuits]