Indexing Stream Register Files

Nuwan Jayasena, 10/8/2002

04/19/23 NSJ 2

Indexing Stream Register Files

• Motivation

• Architecture overview

• Usage examples

• Language and compiler issues

• Implementation issues


Stream Memory Hierarchy

• Roughly order of magnitude increase in BW at each level

• Maximize data reuse at each level

• Focus on Stream Register File (SRF) for this talk

[Figure: memory hierarchy. Levels: Memory system, Stream RF, Local registers, Compute units. Data labels: Streams, Records, Local variables.]

SRF Data Reuse

• Current SRF only supports in-order reuse

• Indexed access to SRF allows reordered reuse

[Figure: in-order reuse (temporal or producer-consumer locality) vs. reordered reuse (app-dependent reordering, data-dependent reordering)]

SRF-Memory Stream Transfers

• Types of stream transfers
– Compulsory: application I/O
– Capacity: due to SRF capacity pressure
– Reordering: reordering of data already in SRF

• SRF indexing…
– Eliminates most reordering transfers
– Reduces data replication in SRF
    • Eliminates some capacity transfers

Architecture Overview

• High-level view of SRF indexing implementation

• Mostly to highlight capabilities and limitations of SRF indexing

• More detailed view of hardware and mechanisms later


Current Stream Processor Arch.

• N “lanes”, each with an SRF bank and a compute cluster

• Cross-lane communication via inter-cluster switch

[Figure: SRF banks 0 to N, stream buffers, Clusters 0 to N, inter-cluster switch]

In-lane SRF Indexing

• Each cluster can index into its own bank of the SRF

• Address queue between cluster and SRF bank

• Sequence of steps for indexed read:
– Cluster places index in address queue
– Bank read using index
– Result placed in stream buffer
– Cluster reads data from stream buffer

(+) High bandwidth indexed accesses
(+) Few changes to existing architecture
(–) Only 1/N of data structure visible within each cluster

[Figure: SRF Bank X and Cluster X]
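
The four-step indexed read can be sketched as a toy Python model of one lane. The class and method names are hypothetical (they are not from the talk), and timing, arbitration and queue depths are not modeled; the sketch only shows the order in which data moves.

```python
from collections import deque

class InLaneBank:
    """Toy model of one lane: SRF bank, address queue, stream buffer."""

    def __init__(self, words):
        self.bank = list(words)        # this lane's 1/N slice of the SRF
        self.addr_queue = deque()      # indices issued by the cluster
        self.stream_buffer = deque()   # data waiting for the cluster

    def issue_index(self, index):
        # Step 1: cluster places an index in the address queue.
        self.addr_queue.append(index)

    def service_one(self):
        # Steps 2-3: bank is read with the oldest index and the
        # result is placed in the stream buffer.
        if self.addr_queue:
            index = self.addr_queue.popleft()
            self.stream_buffer.append(self.bank[index])

    def read_data(self):
        # Step 4: cluster reads data from the stream buffer (issue order).
        return self.stream_buffer.popleft()

lane = InLaneBank([10, 11, 12, 13])
for i in (3, 0, 2):
    lane.issue_index(i)
for _ in range(3):
    lane.service_one()
result = [lane.read_data() for _ in range(3)]
print(result)  # [13, 10, 12]
```

Data comes back in the order the indices were issued, which is what lets the cluster treat the returned words as an ordinary stream.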

Cross-lane SRF Indexing

• Any cluster can access any SRF location

• Adds interconnect between clusters for address communication

• Data return takes place over existing inter-cluster network

[Figure: SRF address switch, SRF banks 0 to 7, Clusters 0 to 7, inter-cluster switch]

Cross-lane SRF Indexing (Contd.)

• Sequence of steps
– Clusters place indices in their own index queues
– Indices broadcast on address switch
– Arbitrate to resolve bank conflicts
– Access SRF banks and return data via inter-cluster network
– Write data into requesting clusters’ stream buffers
– Clusters read data from stream buffers

(+) Entire data structure visible to all clusters

(–) Low bandwidth (1 word/cycle/cluster peak)

(–) Extra hardware for cross-lane index issue
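
A rough Python sketch of the cross-lane sequence, showing how bank conflicts cost extra cycles. Several things here are assumptions for illustration and are not stated in the talk: global address `a` is assumed to live in bank `a % N` at row `a // N`, each bank serves one request per cycle, and the arbiter uses fixed priority (lowest cluster number wins). The function name is hypothetical.

```python
from collections import deque

def cross_lane_read(srf_banks, requests):
    """srf_banks[b][row] holds global address row * N + b (assumed layout).
    requests[c] lists the global addresses cluster c wants, in order.
    Returns (data delivered to each cluster, cycles used)."""
    n = len(srf_banks)
    index_queues = [deque(r) for r in requests]
    stream_buffers = [[] for _ in requests]
    cycles = 0
    while any(index_queues):
        cycles += 1
        granted_banks = set()
        # Assumed arbitration: fixed priority, lowest cluster first.
        for c, queue in enumerate(index_queues):
            if not queue:
                continue
            bank = queue[0] % n
            if bank in granted_banks:
                continue  # bank conflict: this cluster retries next cycle
            granted_banks.add(bank)
            addr = queue.popleft()
            stream_buffers[c].append(srf_banks[bank][addr // n])
    return stream_buffers, cycles

# Two banks; the word at global address a has value 100 + a.
banks = [[100 + row * 2 + b for row in range(4)] for b in range(2)]
data, cycles = cross_lane_read(banks, [[0, 3], [2, 5]])
print(data, cycles)  # [[100, 103], [102, 105]] 3
```

Without the conflict on bank 0 in the first cycle, the four reads would finish in two cycles instead of three.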

Usage Examples

• Application-specific uses
– Efficient access to application data structures

• System-level uses
– Hide hardware limitations

Multidimensional Data w/o SRF Indexing

• 90º rotation (“corner-turn”) between accesses along different dimensions

[Figure: timeline with Memory, SRF, and Clusters rows; steps over time: Compute, Rotate, Compute]

Multidimensional Data w/ SRF Indexing

• Accesses along 2nd dimension can typically use in-lane indexing

• Eliminates data reordering through memory
– Reduces reordering stream transfers to/from memory system

[Figure: timeline with Memory, SRF, and Clusters rows; steps over time: Compute, Compute]
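
One way to see why second-dimension accesses can stay in-lane is the Python sketch below. It assumes a layout the talk does not spell out: a row-major 2D array striped across lanes one word at a time (word `w` in lane `w % N_LANES`), with a row length that is a multiple of the lane count. Under those assumptions, every element of a column lands in the same lane.

```python
N_LANES = 4
ROWS, COLS = 3, 8  # assumption: COLS is a multiple of N_LANES

def lane_of(row, col):
    # Lane holding element (row, col) under word-interleaved striping.
    return (row * COLS + col) % N_LANES

def local_index(row, col):
    # Index of element (row, col) within its lane's SRF bank.
    return (row * COLS + col) // N_LANES

col = 5
lanes = {lane_of(r, col) for r in range(ROWS)}
column_indices = [local_index(r, col) for r in range(ROWS)]
print(lanes)           # {1}
print(column_indices)  # [1, 3, 5]
```

The whole column sits in lane 1, so that cluster can walk it with in-lane indices at a fixed stride and no corner-turn through memory is needed.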

Regular Grid Stencils w/o SRF Indexing

• Each row is a different stream, all streams consumed at same rate

• Values from adjacent columns communicated among neighbor lanes

• 3 streams for 2D grid with 1-wide stencil

• Many streams for higher dimension grids and/or wider stencils
– Number of streams currently limited by hardware resources

Regular Grid Stencils w/ SRF Indexing

• Primary stream consumed sequentially

• Accesses within vertical planes use in-lane indexing

• Values from adjacent vertical planes communicated among neighbor lanes (same as unindexed case)

• Reduces number of streams needed
– May reduce reordering and/or redundant transfers

Arbitrary Stencils w/o SRF Indexing

• Repeated accesses to the same node lead to data replication in SRF

[Figure: timeline with Memory, SRF, and Clusters rows; steps over time: Index Gen, Lookup, Compute]

Arbitrary Stencils w/ SRF Indexing

• Cross-lane indexing supports arbitrary access pattern

• Eliminates data replication in SRF
– May reduce capacity stream transfers
– Increases strip size

• Reduces redundant transfers from memory system

[Figure: timeline with Memory, SRF, and Clusters rows; steps over time: Compute]

Sub-stream Extraction w/o SRF Indexing

• Splitting records requires a pass through memory or passing useless data through clusters

• Same for selecting a subset of records

[Figure: timeline with Memory, SRF, and Clusters rows; steps over time: Compute, Extract, Compute]

Sub-stream Extraction w/ SRF Indexing

• In-lane indexing to select words from records

• Selecting subset of records may require cross-lane indexing to preserve ordering

[Figure: timeline with Memory, SRF, and Clusters rows; steps over time: Compute, Compute]
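
A small Python sketch of the index sequences involved in sub-stream extraction within a single bank. The record size and helper names are hypothetical; cross-lane ordering for record subsets is not modeled here.

```python
RECORD_WORDS = 3  # assumed record size, for illustration only

def field_indices(num_records, field):
    # Indices that pick one word (field) out of every record.
    return [r * RECORD_WORDS + field for r in range(num_records)]

def record_indices(selected_records):
    # Indices that pick whole records from a subset, in order.
    return [r * RECORD_WORDS + w
            for r in selected_records
            for w in range(RECORD_WORDS)]

# One SRF bank holding three 3-word records.
bank = [0, 1, 2, 10, 11, 12, 20, 21, 22]

field_1 = [bank[i] for i in field_indices(3, 1)]
subset = [bank[i] for i in record_indices([0, 2])]
print(field_1)  # [1, 11, 21]
print(subset)   # [0, 1, 2, 20, 21, 22]
```

In both cases the unwanted words are never read, so they neither pass through the clusters nor require a separate extraction pass through memory.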

Virtual Streams

• Current SRF has hard limit on number of streams used by a kernel
– Imposed by hardware constraints
– Exceeding limit requires merging streams, splitting kernels or other workarounds

• Indexing into SRF provides a mechanism to access any number of sequences
– Essentially multiplexes multiple logical streams onto one hardware stream
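
A minimal Python sketch of the multiplexing idea: several logical streams are packed back to back in one indexed region, and a per-stream cursor turns each sequential read into an index. The class name and layout are hypothetical, not the talk's design.

```python
class VirtualStreams:
    """Several logical input streams multiplexed onto one indexed region."""

    def __init__(self, srf_region, lengths):
        self.srf = srf_region
        self.base = []
        offset = 0
        for n in lengths:           # pack logical streams back to back
            self.base.append(offset)
            offset += n
        self.limit = [b + n for b, n in zip(self.base, lengths)]
        self.cursor = list(self.base)  # next index for each logical stream

    def read(self, stream_id):
        # A sequential read on a logical stream becomes an indexed access.
        if self.cursor[stream_id] >= self.limit[stream_id]:
            raise IndexError("end of logical stream")
        value = self.srf[self.cursor[stream_id]]
        self.cursor[stream_id] += 1
        return value

vs = VirtualStreams([1, 2, 3, 10, 20, 100], lengths=[3, 2, 1])
order = [vs.read(0), vs.read(2), vs.read(1), vs.read(0)]
print(order)  # [1, 100, 10, 2]
```

The number of logical streams is bounded only by the bookkeeping state, not by the number of hardware streams.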

Other Uses

• Space allocation for variable length streams
– Current SRF requires space allocation for worst case stream size for variable length streams
– Indexing can be used to allocate for the common case and degrade gracefully on overflow

• Spill local variables from kernels
– Reduce register pressure for large kernels

• Etc.

Summary of Benefits

• Reduce memory system bandwidth demands
– Most reordering transfers and some capacity transfers

• Reduce SRF capacity pressure by eliminating replication
– Increases strip sizes

• Collapsing/eliminating index generation and/or reordering steps at stream level potentially shortens software pipeline length
– Increases strip size

• Flexible stream control
– More streams per kernel than hardware supports
– Efficient SRF allocation for variable length streams

Language & Compiler Issues

• System-level issues should clearly be handled by compiler/scheduler
– Virtual streams
– SRF allocation for variable length streams
– Register spilling etc.

• How many of the application-level uses can be inferred by the compiler?
– Substream extraction, regular stencils etc. can be inferred w/o programmer help?
– Multi-dimensional data structures, irregular stencils etc. need programmer help?

• If so, what should the API be?

Implementation Issues

• Hiding indexed SRF access latency

• Merging scratchpad and SRF

• SRF access arbitration

• Memory array implementation


Hiding SRF Access Delay

• Kernels are statically scheduled

• SRF access by streams is dynamically arbitrated
– Allows optimal run-time allocation of SRF BW to cluster and memory streams
– Address generation for sequential streams can run arbitrarily ahead to hide arbitration delay

• Indexed accesses are treated much like another stream for arbitration purposes
– In order to hide arbitration and access delay for reads, SRF indices must be issued early and data read a few cycles later
– Breaks indexed accesses into two distinct ops at machine level

Hiding SRF Access Delay (Contd.)

• Split read operation example:

User pseudocode:

    Kernel XYZ(…, idx_istream<int> S1, …) {
        int a, b, R, S;
        loop(…) {
            Independent_ops;
            a = addr_compute1();
            S1[a] >> R;
            b = addr_compute2();
            S1[b] >> S;
            Use(R, S);
        }
    }

Post-compile pseudocode:

    loop(…) {
        a = addr_compute1();
        S1.index(a);         // issue first index early
        b = addr_compute2();
        S1.index(b);         // issue second index early
        Independent_ops;     // overlaps with SRF arbitration and access delay
        S1 >> R;             // data read a few cycles after index issue
        S1 >> S;
        Use(R, S);
    }

• Address/data separation is not critical for writes
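
The benefit of splitting a read into an index op and a data op can be illustrated with a toy Python model. Everything here is an assumption for illustration: the class, the fixed 3-cycle latency, and the ability to issue two indices in the same cycle are not from the talk.

```python
from collections import deque

class IndexedInputStream:
    """Toy indexed stream with a fixed access latency (assumed, in cycles)."""

    def __init__(self, data, latency=3):
        self.data = data
        self.latency = latency
        self.in_flight = deque()  # (cycle when ready, value), in issue order
        self.now = 0

    def tick(self, cycles=1):
        self.now += cycles  # time spent on independent ops

    def index(self, addr):
        # Index op: start the access, do not wait for the data.
        self.in_flight.append((self.now + self.latency, self.data[addr]))

    def read(self):
        # Data op: stall only if the oldest access is not ready yet.
        ready_at, value = self.in_flight.popleft()
        stall = max(0, ready_at - self.now)
        self.now += stall
        return value, stall

def unsplit_schedule():
    s = IndexedInputStream([5, 6, 7, 8])
    s.index(1); r, stall_r = s.read()   # read immediately after index
    s.index(2); v, stall_v = s.read()
    s.tick(3)                           # independent ops run afterwards
    return (r, v), stall_r + stall_v

def split_schedule():
    s = IndexedInputStream([5, 6, 7, 8])
    s.index(1); s.index(2)              # both indices issued early
    s.tick(3)                           # independent ops hide the latency
    r, stall_r = s.read()
    v, stall_v = s.read()
    return (r, v), stall_r + stall_v

print(unsplit_schedule())  # ((6, 7), 6)
print(split_schedule())    # ((6, 7), 0)
```

Both schedules return the same values; only the split schedule overlaps the access delay with independent work.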

Merging Scratchpad w/ Indexable SRF

• Data structures in SRF are typically read-only or write-only

• Scratchpad needs to support read/write data
– Pending writes are matched against new reads and multiple writes to same location are collapsed

• Special high-priority reads that preempt all other SRF accesses and complete within a fixed latency
– Reads are performed immediately after matching with pending writes (if no match found) to avoid ordering problems

• Must sustain at least the current scratchpad bandwidth: one read and one write every cycle
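
The write-matching behavior can be sketched in Python as a pending-write buffer in front of the array. This is a toy model under stated assumptions: the class and method names are hypothetical, and priorities, latencies and buffer capacity are not modeled.

```python
class ScratchpadSRF:
    """Toy read/write region with a pending-write buffer."""

    def __init__(self, size):
        self.array = [0] * size
        self.pending = {}       # addr -> most recent value not yet written
        self.array_writes = 0   # writes that actually reached the array

    def write(self, addr, value):
        # A second write to the same location replaces (collapses) the first.
        self.pending[addr] = value

    def read(self, addr):
        # Match the read against pending writes first; on a miss,
        # read the array itself.
        if addr in self.pending:
            return self.pending[addr]
        return self.array[addr]

    def drain_one(self):
        # Retire the oldest pending write to the array.
        if self.pending:
            addr = next(iter(self.pending))
            self.array[addr] = self.pending.pop(addr)
            self.array_writes += 1

sp = ScratchpadSRF(8)
sp.write(2, 7)
sp.write(2, 9)           # collapses with the earlier write to address 2
hit = sp.read(2)         # matched against the pending write
miss = sp.read(3)        # no match, served from the array
sp.drain_one()
print(hit, miss, sp.array[2], sp.array_writes)  # 9 0 9 1
```

Two writes to address 2 cost a single array write, and the read between them still observes the latest value.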

SRF Memory Array Implementation

• SSS SRF:
– 64K words total, 4K words per cluster

• Non-indexable bank can be implemented as a single 512x512 bit macro

• Indexing requires some form of banking to sustain a few words/cycle bandwidth for scratchpad + SRF accesses

SRF Memory Array Implementation (Contd.)

• Non-indexed SRF bank

• 512x512 macro

• 4x4 array of blocks assuming 128x128 blocks

• 2:1 column decode to sustain 4 words/cycle peak BW

[Figure: SRAM array with row decoder, column decoder, and read/write circuits]

• Key is to support word granularity indexed access w/o losing implementation and power efficiency for wide sequential reads

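
The non-indexed bank figures are mutually consistent, as the short Python check below shows. The word width is not stated on the slides; 64 bits is inferred here from 4K words filling a 512x512 bit macro, so treat it as a derived assumption.

```python
MACRO_ROWS = 512
MACRO_COLS = 512
WORDS_PER_BANK = 4 * 1024   # 4K words per cluster
BLOCK_SIDE = 128            # 128x128 blocks
COLUMN_DECODE = 2           # 2:1 column decode

# Inferred, not stated: bits per word if 4K words fill the macro exactly.
bits_per_word = (MACRO_ROWS * MACRO_COLS) // WORDS_PER_BANK

# The macro tiles into a square array of blocks.
blocks_per_side = MACRO_ROWS // BLOCK_SIDE

# One row access delivers half of the 512 columns after 2:1 decode.
bits_per_access = MACRO_COLS // COLUMN_DECODE
words_per_cycle = bits_per_access // bits_per_word

print(bits_per_word)    # 64
print(blocks_per_side)  # 4
print(words_per_cycle)  # 4
```

The result matches the 4x4 block array and the 4 words/cycle peak bandwidth quoted above.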


• Option 1: Multiple narrow columns

• One word per cycle per bank

• All accesses are one word wide
– Best BW utilization for mixed indexed and stream accesses

• High area overhead due to replicated row decoders

• No replication of column decoders and rd/wr circuits

• Power in SRAM array(s) comparable to non-banked memory

[Figure: four narrow columns, each with its own row decoder, above shared column decode and read/write circuits]


• Option 2: Multiple banks along rows of blocks

• Leverage hierarchical bitlines with additional muxing

• With appropriate data interleaving, mux area fairly small

• Low area overhead

• Low power for wide accesses only

• BW utilization may be suboptimal for mixed stream and indexed accesses

[Figure: four rows of blocks, each with its own mux, above shared read/write circuits]
