1
Chapter 2
Dataflow Processors
2
Dataflow processors
Recall from basic processor pipelining: hazards limit performance.
– Structural hazards
– Data hazards due to
• true dependences, or
• name (false) dependences: anti and output dependences
– Control hazards

Name dependences can be removed by:
– compiler (register) renaming
– renaming hardware → advanced superscalars
– single-assignment rule → dataflow computers

Data hazards due to true dependences and control hazards can be avoided if succeeding instructions in the pipeline stem from different contexts → dataflow computers, multithreaded processors
3
Dataflow vs. control-flow
Von Neumann or control-flow computing model:
– a program is a series of addressable instructions, each of which either
• specifies an operation along with the memory locations of its operands, or
• specifies an (un)conditional transfer of control to some other instruction.
– Essentially: the next instruction to be executed depends on what happened during the execution of the current instruction.
– The next instruction to be executed is pointed to and triggered by the PC.
– The instruction is executed even if some of its operands are not available yet (e.g. uninitialized).

Dataflow model: the execution is driven only by the availability of operands!
– no PC and no global updateable store
– the two features of the von Neumann model that become bottlenecks in exploiting parallelism are missing
4
Dataflow model of computation
Enabling rule: an instruction is enabled (i.e. executable) if all operands are available.
– Von Neumann model: an instruction is enabled if it is pointed to by the PC.

The computational rule, or firing rule, specifies when an enabled instruction is actually executed.

Basic instruction firing rule: an instruction is fired (i.e. executed) when it becomes enabled.
– The effect of firing an instruction is the consumption of its input data (operands) and the generation of output data (results).
– Where are the structural hazards? Answer: ignored in traditional dataflow literature!
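As a minimal sketch, the enabling and firing rules can be modeled like this (class and method names are invented for illustration; this is not any real dataflow machine):

```python
class Node:
    """One dataflow node with operand slots for its input arcs."""
    def __init__(self, op, num_inputs):
        self.op = op
        self.operands = [None] * num_inputs  # empty operand slots

    def receive(self, port, value):
        self.operands[port] = value          # a token arrives on one input arc

    def enabled(self):
        # Enabling rule: all operands are available.
        return all(v is not None for v in self.operands)

    def fire(self):
        # Firing consumes the input data and generates the output data.
        assert self.enabled()
        result = self.op(*self.operands)
        self.operands = [None] * len(self.operands)  # inputs are consumed
        return result

add = Node(lambda a, b: a + b, 2)
add.receive(0, 3)
assert not add.enabled()   # only one operand has arrived
add.receive(1, 4)
assert add.enabled()       # all operands available -> enabled
result = add.fire()        # result == 7, operand slots emptied
```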
5
Dataflow languages
Main characteristic – the single-assignment rule: a variable may appear on the left-hand side of an assignment only once within the area of the program in which it is active.

Examples: VAL, Id, LUCID

A dataflow program is compiled into a dataflow graph: a directed graph consisting of named nodes, which represent instructions, and arcs, which represent data dependences among instructions.
– The dataflow graph is similar to a dependence graph used in the intermediate representations of compilers.

During the execution of the program, data propagate along the arcs in data packets, called tokens. This flow of tokens enables some of the nodes (instructions) and fires them.
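Token-driven execution of such a graph can be sketched as follows; the graph encoding (`graph`, `slots`, `send`) is illustrative, not taken from any dataflow language, and the example mirrors the z = sqrt(x * y) graph used later in these slides:

```python
import math

# Two-node dataflow graph: mul feeds its result token to sqrt, port 0.
graph = {
    "mul":  {"op": lambda a, b: a * b, "arity": 2, "dest": [("sqrt", 0)]},
    "sqrt": {"op": math.sqrt, "arity": 1, "dest": []},
}
slots = {name: {} for name in graph}   # tokens waiting at each node
results = {}

def send(node, port, value):
    """Deliver a token; when all input arcs carry a token, the node fires."""
    slots[node][port] = value
    if len(slots[node]) == graph[node]["arity"]:      # node enabled
        args = [slots[node][p] for p in sorted(slots[node])]
        slots[node].clear()                           # input tokens consumed
        out = graph[node]["op"](*args)                # node fired
        results[node] = out
        for dest, dport in graph[node]["dest"]:
            send(dest, dport, out)                    # result token propagates

send("mul", 0, 2.0)
send("mul", 1, 8.0)
print(results["sqrt"])   # 4.0
```

Note that no program counter appears anywhere: the arrival of the second token on "mul" is what triggers execution and, transitively, the firing of "sqrt".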
6
Dataflow architectures - Overview
Pure dataflow computers:
– static,
– dynamic,
– and the explicit token store architecture.

Hybrid dataflow computers:
– augmenting the dataflow computation model with control-flow mechanisms, such as
• the RISC approach,
• complex machine operations,
• multi-threading,
• large-grain computation,
• etc.
7
Pure dataflow

A dataflow computer executes a program by receiving, processing and sending out tokens, each containing some data and a tag.

Dependences between instructions are translated into tag matching and tag transformation.

Processing starts when a set of matched tokens arrives at the execution unit. The instruction, which has to be fetched from the instruction store (according to the tag information), contains information about
– what to do with the data
– and how to transform the tags.

The matching unit and the execution unit are connected by an asynchronous pipeline, with queues added between the stages.

Some form of associative memory is required to support token matching:
– a real memory with associative access,
– a simulated memory based on hashing,
– or a direct matched memory.
8
Static dataflow
A dataflow graph is represented as a collection of activity templates, each containing:
– the opcode of the represented instruction,
– operand slots for holding operand values,
– and destination address fields, referring to the operand slots in subsequent activity templates that need to receive the result value.
Each token consists only of a value and a destination address.
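A hedged sketch of activity templates and the minimal value-plus-destination tokens of static dataflow (all field names are invented; real machines pack these fields into fixed hardware formats):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ActivityTemplate:
    opcode: str                                    # the represented instruction
    operands: List[Optional[float]] = field(
        default_factory=lambda: [None, None])      # operand slots
    destinations: List[Tuple[int, int]] = field(
        default_factory=list)                      # (template index, slot) pairs

@dataclass
class Token:
    """Static dataflow: a token is only a value plus a destination address."""
    value: float
    dest_template: int
    dest_slot: int

program = [
    ActivityTemplate("*", destinations=[(1, 0)]),  # result goes to sqrt, slot 0
    ActivityTemplate("sqrt", operands=[None]),
]

def deliver(tok: Token) -> None:
    # The destination address names an operand slot in a subsequent template.
    program[tok.dest_template].operands[tok.dest_slot] = tok.value

deliver(Token(3.0, 0, 0))
deliver(Token(12.0, 0, 1))
```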
9
Dataflow graph and activity template
[Figure: dataflow graph computing z = sqrt(x * y) with nodes n_i (*) and n_j (sqrt), showing data tokens on data arcs and acknowledge signals on acknowledgement arcs, together with the corresponding activity templates.]
10
Acknowledgement signals
Notice that different tokens destined for the same destination cannot be distinguished.

The static dataflow approach therefore allows at most one token on any one arc.

The basic firing rule is extended as follows: an enabled node is fired if there is no token on any of its output arcs.

The restriction is implemented by acknowledge signals (additional tokens), traveling along additional arcs from consuming to producing nodes.

Using acknowledgement signals, the firing rule can be changed back to its original form: a node is fired at the moment when it becomes enabled.

Again: structural hazards are ignored, assuming unlimited resources!
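The one-token-per-arc restriction and the extended firing rule can be sketched like this (an illustrative model with invented names; here `Arc.take` plays the role of the acknowledge signal travelling back from the consumer):

```python
class Arc:
    def __init__(self):
        self.token = None          # static dataflow: at most one token per arc

    def put(self, value):
        assert self.token is None, "a second token on an arc is forbidden"
        self.token = value

    def take(self):
        # Consuming the token frees the arc -- the acknowledgement.
        value, self.token = self.token, None
        return value

class Node:
    def __init__(self, op, inputs, output):
        self.op, self.inputs, self.output = op, inputs, output

    def can_fire(self):
        # Extended firing rule: enabled AND no token on the output arc.
        return (all(a.token is not None for a in self.inputs)
                and self.output.token is None)

    def fire(self):
        self.output.put(self.op(*[a.take() for a in self.inputs]))

a, b, out = Arc(), Arc(), Arc()
add = Node(lambda x, y: x + y, [a, b], out)
a.put(1.0); b.put(2.0)
out.put(9.0)               # a stale result still occupies the output arc
assert not add.can_fire()  # blocked until the consumer acknowledges
out.take()                 # consumer takes the token (implicit ack)
assert add.can_fire()
add.fire()
```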
11
MIT Static Dataflow Machine
[Figure: MIT Static Dataflow Machine. PEs are connected by a Communication Network. Each Processing Element contains an Activity Store, Instruction Queue, Fetch Unit, Update Unit, Operation Unit(s), and SU/RU units for local communication to/from the Communication Network.]
12
Deficiencies of static dataflow
Consecutive iterations of a loop can only be pipelined.

Due to acknowledgment tokens, the token traffic is doubled.

Lack of support for programming constructs that are essential for modern programming languages:
– no procedure calls,
– no recursion.

Advantage: a simple model.
13
Dynamic dataflow
Each loop iteration or subprogram invocation should be able to execute in parallel as a separate instance of a reentrant subgraph.

The replication is only conceptual. Each token carries a tag:
– the address of the instruction for which the particular data value is destined
– and context information.

Each arc can be viewed as a bag that may contain an arbitrary number of tokens with different tags.

The enabling and firing rule is now: a node is enabled and fired as soon as tokens with identical tags are present on all input arcs.

Structural hazards are ignored!
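The tag-matching rule can be sketched as follows; the names and the dict-based token store are illustrative (in the spirit of the hashing techniques real machines use), and tokens from different iterations carry different tags and therefore never match each other:

```python
from collections import defaultdict

waiting = defaultdict(dict)   # (instruction, tag) -> {port: value}

def arrive(instr, tag, port, value, arity=2):
    """Buffer a tagged token; return the matched operand set once tokens
    with identical tags are present on all input arcs."""
    slot = waiting[(instr, tag)]
    slot[port] = value
    if len(slot) == arity:            # all input arcs carry a matching token
        del waiting[(instr, tag)]     # matched set leaves the waiting store
        return [slot[p] for p in sorted(slot)]
    return None

# Tokens from two loop iterations (initiation numbers 1 and 2) interleave:
assert arrive("add", ("c", 1), 0, 10) is None   # iteration 1, left operand
assert arrive("add", ("c", 2), 0, 30) is None   # iteration 2, left operand
assert arrive("add", ("c", 1), 1, 5) == [10, 5]  # iteration 1 enabled
assert arrive("add", ("c", 2), 1, 7) == [30, 7]  # iteration 2 enabled
```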
14
The U-interpreter (U = unraveling)
Each token consists of an activity name and data.
– The activity name comprises the tag.

The tag has
– an instruction address n,
– a context field c that uniquely identifies the context in which the instruction is to be invoked,
– and an initiation number i that identifies the loop iteration in which this activity occurs.

Note that c is itself an activity name.

Since the destination instruction may require more than one input, each token also carries the number of its destination port p. We represent a token by ⟨c.i.n, data⟩_p.
15
The U-interpreter
If the node ni performs a dyadic function f, and if the port p of nj is the destination of ni, then we have:
[Figure: before execution, node n_i (computing f) holds the input tokens ⟨c.i.n_i, x⟩_1 and ⟨c.i.n_i, y⟩_2; after execution, the token ⟨c.i.n_j, f(x,y)⟩_p lies on the arc to n_j.]

in: { ⟨c.i.n_i, x⟩_1, ⟨c.i.n_i, y⟩_2 }
out: { ⟨c.i.n_j, f(x,y)⟩_p }
16
MERGE and SWITCH nodes
[Figure: execution of MERGE and SWITCH nodes. A MERGE passes the token from the selected (T or F) input to its output; a SWITCH routes its data token X to the T or the F output arc according to the boolean control token.]
17
Branch Implementations
[Figure: two branch implementations for applying f or g to x under predicate b. Left - branch evaluation: node n_i evaluates the predicate P and a SWITCH routes x to f (T) or g (F). Right - speculative branch evaluation: a COPY node sends x to both f and g in parallel, and a CHOOSE node n_k selects one of the results according to P.]
18
Basic loop implementation
[Figure: basic loop implementation. The loop body f sits between an L and a D node on entry, a SWITCH controlled by predicate P on each iteration, and D-1 and L-1 nodes on exit producing new x.]

L: initiation, new loop context
D: increments loop iteration number
D-1: resets loop iteration number to 1
L-1: restores original context
19
Function application
[Figure: function application. At the call site, APPLY (n_i) passes argument a and context q through an A node into the function body between BEGIN (n_begin) and END (n_end); an A-1 node delivers the result to the successor n_j.]

A: create new context
BEGIN: replicate tokens for each fork
END: return results, unstack return address
A-1: replicate output for successors
20
I-structures (I = incremental)
Problem: the single-assignment rule and complex data structures.
– Each update of a data structure consumes the structure and the value, producing a new data structure.
– This is awkward or even impossible to implement.

Solution: the concept of the I-structure:
– a data repository obeying the single-assignment rule
– each element of the I-structure may be written only once, but it may be read any number of times

The basic idea is to associate with each element status bits and a queue of deferred reads.
21
I-structures
The status of each element of the I-structure can be:
– present: the element can be read but not written,
– absent: a read request has to be deferred, but a write operation into this element is allowed,
– waiting: at least one read request for the element has been deferred.
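One I-structure with these three states and a deferred-read queue can be sketched as follows (class and method names are invented for illustration; deferred reads are modeled as queued callbacks):

```python
class IStructure:
    ABSENT, PRESENT, WAITING = range(3)

    def __init__(self, size):
        self.status = [self.ABSENT] * size
        self.value = [None] * size
        self.deferred = [[] for _ in range(size)]  # queued read continuations

    def read(self, i, consumer):
        if self.status[i] == self.PRESENT:
            consumer(self.value[i])            # may be read any number of times
        else:
            self.deferred[i].append(consumer)  # defer until the write arrives
            self.status[i] = self.WAITING

    def write(self, i, v):
        # Single-assignment rule: writing a present element is an error.
        assert self.status[i] != self.PRESENT, "element already written"
        self.value[i] = v
        for consumer in self.deferred[i]:      # satisfy all deferred reads
            consumer(v)
        self.deferred[i].clear()
        self.status[i] = self.PRESENT

xs = IStructure(4)
got = []
xs.read(0, got.append)   # read before write: deferred, element -> waiting
xs.write(0, 42)          # the write releases the deferred read
xs.read(0, got.append)   # subsequent reads succeed immediately
print(got)               # [42, 42]
```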
22
MIT Tagged-Token Dataflow Architecture
[Figure: MIT Tagged-Token Dataflow Architecture. PEs and I-Structure Storage units are connected by a Communication Network. Each Processing Element contains a Token Queue, a Wait-Match Unit & Waiting Token Store, an Instruction Fetch Unit with Program Store & Constant Store, an ALU & Form Tag stage, a Form Token Unit, and SU/RU units for local communication to/from the Communication Network.]
23
Manchester Dataflow Machine
[Figure: Manchester Dataflow Machine. A Host, PEs, and Structure Storage units are connected by Switches. Each Processing Element is a circular pipeline of Token Queue, Matching Unit, Instruction Store, and a Processing Unit with multiple ALUs, between its input and output.]
24
Advantages and deficiencies of dynamic dataflow
Major advantage: better performance (compared with static dataflow), because multiple tokens are allowed on each arc, thereby unfolding more parallelism.

Problems:
– efficient implementation of the matching unit that collects tokens with matching tags:
• Associative memory would be ideal.
• Unfortunately, it is not cost-effective, since the amount of memory needed to store tokens waiting for a match tends to be very large.
• All existing machines use some form of hashing technique.
– bad single-thread performance (when not enough workload is present)
– dyadic instructions lead to pipeline bubbles when the first operand tokens arrive
– no instruction locality → no use of registers
25
Explicit Token Store (ETS) approach
Target: efficient implementation of token matching.
Basic idea: allocate a separate frame in a frame memory for each active loop iteration or subprogram invocation.
A frame consists of slots; each slot holds an operand that is used in the corresponding activity.
Access to slots is direct (i.e. through offsets relative to the frame pointer) → no associative search is needed.
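Direct, offset-based matching in a frame slot with a presence bit might be sketched as follows (the frame layout and names are illustrative; the operand values echo the figure on the next slide):

```python
# A frame memory of slots, each with a presence bit and a value field.
frame_memory = [{"present": False, "value": None} for _ in range(1024)]

def ets_match(fp, offset, value):
    """Return both operands when the partner token has already arrived;
    otherwise store this token's value in its frame slot and wait."""
    slot = frame_memory[fp + offset]    # direct access: FP + operand offset
    if not slot["present"]:
        slot["present"], slot["value"] = True, value
        return None                     # first operand: token stored, no fire
    partner = slot["value"]
    slot["present"], slot["value"] = False, None   # slot freed for reuse
    return (partner, value)             # both operands: instruction enabled

assert ets_match(fp=128, offset=2, value=3.01) is None       # first token
assert ets_match(fp=128, offset=2, value=2.34) == (3.01, 2.34)
```

No associative search occurs anywhere: the frame pointer and offset compute the slot address directly, which is the whole point of the ETS approach.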
26
Explicit token store
[Figure: explicit token store. A token ⟨FP, IP, 3.01⟩ addresses the Frame Memory directly: the frame pointer FP plus the operand offset of the instruction (here FP + 2) selects a slot consisting of a presence bit and a value (here 2.34). Each Instruction Memory entry (op-codes such as *, +, -, sqrt) holds the offset in the activation frame and left/right destinations.]
27
Monsoon, an explicit token store machine
[Figure: Monsoon. PEs and I-Structure Storage units are connected by a Multistage Packet Switching Network. Each Processing Element implements the pipeline stages Instruction Fetch (from Instruction Memory), Effective Address Generation, Presence Bit Operation, Frame Operation (on the Frame Memory), ALU, and Form Token, with a User Queue and a System Queue for tokens, connected to/from the network.]
28
Monsoon, an explicit token store machine

Each PE uses an eight-stage pipeline:
– Instruction fetch precedes token matching (in contrast to dynamic dataflow processors with associative matching units)!
– Token matching 1 – effective address generation: the explicit token address is computed from the frame address and operand offset.
– Token matching 2 – presence bit operation: a presence bit is accessed to find out if the first operand of a dyadic operation has already arrived
• not arrived → the presence bit is set and the current token is stored into the frame slot of the frame memory
• arrived → the presence bit is reset and the operand can be retrieved from the slot of the frame memory in the next stage
– Token matching 3 – frame operation stage: operand storing or retrieving.
– The next three stages are execution stages, in the course of which the next tag is also computed concurrently.
– The eighth stage, form-token, forms one or two new tokens that are sent to the network, stored in a user token queue or a system token queue, or directly recirculated to the instruction fetch stage of the pipeline.
29
Monsoon prototype
16 prototypes existed at the beginning of the 1990s!

Processing element:
– 10 MHz clock
– 56 kW instruction memory (32 bits wide)
– 256 kW frame memory (word + 3 presence bits; word size: 64 bits data + 8 bits tag)
– two 32 k token queues (system, user)

I-structure storage:
– 4 MW (word + 3 presence bits)
– 5 M requests/s

Network:
– multistage, pipelined
– Packet Routing Chips (PaRC, 4 × 4 crossbar)
– 4 M tokens/s/link (100 MB/s)
30
Dataflow processors - Hybrids
Poor sequential code performance of dynamic dataflow computers. Why?
– An instruction of the same thread is issued to the dataflow pipeline only after the completion of its predecessor instruction.
– In the case of an 8-stage pipeline, instructions of the same thread can be issued at most every eight cycles.
– Low workload: the utilization of the dataflow processor drops to one eighth of its maximum performance.

Another drawback: the overhead associated with token matching.
– Before a dyadic instruction is issued to the execution stage, two result tokens have to be present.
– The first token is stored in the waiting-matching store, thereby introducing a bubble in the execution stage(s) of the dataflow processor pipeline.
– Measured pipeline bubbles on Monsoon: up to 28.75 %.

No use of registers possible!
31
Augmenting dataflow with control-flow
Solution: combine dataflow with control-flow mechanisms.
– threaded dataflow,
– large-grain dataflow,
– dataflow with complex machine operations,
– further hybrids.
32
Threaded dataflow
Threaded dataflow: the dataflow principle is modified so that instructions of certain instruction streams are processed in succeeding machine cycles.
A subgraph that exhibits a low degree of parallelism is transformed into a sequential thread.
The thread of instructions is issued consecutively by the matching unit without matching further tokens except for the first instruction of the thread.
Threaded dataflow covers– the repeat-on-input technique used in Epsilon-1 and Epsilon-2 processors, – the strongly connected arc model of EM-4, and– the direct recycling of tokens in Monsoon.
33
Threaded dataflow (continued)
Data passed between instructions of the same thread is stored in registers instead of written back to memory.
Registers may be referenced by any succeeding instruction in the thread.
– Single-thread performance is improved.
– The total number of tokens needed to schedule program instructions is reduced, which in turn saves hardware resources.
– Pipeline bubbles are avoided for dyadic instructions within a thread.
Two threaded dataflow execution techniques can be distinguished:– direct token recycling (Monsoon),– consecutive execution of the instructions of a single thread (Epsilon & EM).
34
Direct token recycling of Monsoon
Cycle-by-cycle instruction interleaving of threads, similar to multithreaded von Neumann computers!

8 register sets can be used by 8 different threads.

Dyadic instructions within a thread (except for the start instruction!) refer to at least one register → they need only a single token to be enabled.

A result token of a particular thread is recycled as soon as possible in the 8-stage pipeline, i.e., every 8th cycle the next instruction of a thread is fired and executed.

This implies that at least 8 threads must be active for full pipeline utilization.

Threads and fine-grain dataflow instructions can be mixed in the pipeline.
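A toy utilization model of this recycling scheme (not the real Monsoon hardware; the scheduling loop is an illustrative simplification) shows that one thread reaches only 1/8 of peak, while 8 active threads fill the pipeline:

```python
DEPTH = 8   # pipeline stages; a recycled token re-enters after DEPTH cycles

def utilization(active_threads, cycles=80):
    """Fraction of cycles in which the fetch stage is busy, when each
    thread can re-issue only every DEPTH cycles (direct token recycling)."""
    busy = 0
    next_ready = [0] * active_threads   # cycle at which each thread may issue
    for c in range(cycles):
        for t in range(active_threads):
            if next_ready[t] <= c:
                busy += 1
                next_ready[t] = c + DEPTH   # token recirculates after 8 stages
                break                       # one instruction fetched per cycle
    return busy / cycles

print(utilization(1))   # 0.125 -> a single thread uses 1/8 of the pipeline
print(utilization(8))   # 1.0   -> 8 threads give full pipeline utilization
```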
35
Epsilon and EM-4
Instructions of a thread are executed consecutively.
The circular pipeline of fine-grain dataflow is retained.
The matching unit is enhanced with a mechanism that, after firing the first instruction of a thread, delays matching of further tokens in favor of consecutively issuing all instructions of the started thread.
Problem: implementation of an efficient synchronization mechanism
36
Large-grain (coarse-grain) dataflow
A dataflow graph is enhanced to contain fine-grain (pure) dataflow nodes and macro dataflow nodes.
– A macro dataflow node contains a sequential block of instructions.
A macro dataflow node is activated in the dataflow manner, but its instruction sequence is executed in the von Neumann style!
Off-the-shelf microprocessors can be used to support the execution stage.
Large-grain dataflow machines typically decouple the matching stage (sometimes called signal stage, synchronization stage, etc.) from the execution stage by use of FIFO-buffers.
Pipeline bubbles are avoided by the decoupling and FIFO-buffering.
37
Dataflow with complex machine operations
Use of complex machine instructions, e.g. vector instructions:
– ability to exploit parallelism at the subinstruction level
– instructions can be implemented by pipeline techniques as in vector computers
– the use of a complex machine operation may spare several nested loops

Structured data is referenced in blocks rather than element-wise and can be supplied in a burst mode.
38
Dataflow with complex machine operations combined with LGDF
Often, FIFO-buffers are used to decouple the firing stage and the execution stage:
– this bridges different execution times within a mixed stream of simple and complex instructions.
– Major difference to pure dataflow: tokens do not carry data (except for the values true or false).
– Data is only moved and transformed within the execution stage.
– Applied in: the Decoupled Graph/Computation Architecture, the Stollmann Dataflow Machine, and the ASTOR architecture.
– These architectures combine complex machine instructions with large-grain dataflow.
39
Augmenting dataflow with control-flow
40
Lessons learned from dataflow
Superscalar microprocessors display an out-of-order dynamic execution that is referred to as local dataflow or micro dataflow.
Colwell and Steck 1995, in the first paper on the PentiumPro: “The flow of the Intel Architecture instructions is predicted and these instructions are decoded into micro-operations (µops), or series of µops, and these µops are register-renamed, placed into an out-of-order speculative pool of pending operations, executed in dataflow order (when operands are ready), and retired to permanent machine state in source program order.”
State-of-the-art microprocessors typically provide 32 (MIPS R10000), 40 (Intel PentiumPro) or 56 (HP PA-8000) instruction slots in the instruction window or reorder buffer.
Each instruction is ready to be executed as soon as all operands are available.
41
Comparing dataflow computers with superscalar microprocessors

Superscalar microprocessors are von Neumann based:
– a (sequential) thread of instructions as input
– → not enough fine-grained parallelism to feed the multiple functional units
– → speculation

The dataflow approach resolves any threads of control into separate instructions that are ready to execute as soon as all required operands become available.

The fine-grained parallelism generated by the dataflow principle is far larger than the parallelism available to microprocessors.

However, locality is lost → no caching, no registers.
42
Lessons learned from dataflow (Pipeline issues)
Microprocessors: Data and control dependences potentially cause pipeline hazards that are handled by complex forwarding logic.
Dataflow: Due to the continuous context switches, pipeline hazards are avoided; disadvantage: poor single thread performance.
Microprocessors: Antidependences and output dependences are removed by register renaming that maps the architectural registers to the physical registers.

Thereby the microprocessor internally generates an instruction stream that satisfies the single-assignment rule of dataflow.
The main difference between the dependence graphs of dataflow and the code sequence in the instruction window of a microprocessor: branch prediction and speculative execution.
Microprocessors: Rolling back execution after a wrongly predicted path is costly in terms of processor cycles.
43
Lessons learned from dataflow (Continued)
Dataflow: The idea of branch prediction and speculative execution has never been evaluated in the dataflow environment.
Dataflow was considered to produce an abundance of parallelism, whereas speculation yields only speculative parallelism, which is inferior to real parallelism.
Microprocessors: Due to the single thread of control, a high degree of data and instruction locality is present in the machine code.
Microprocessors: This locality allows a storage hierarchy that keeps the instructions and data potentially executed in the next cycles close to the executing processor.
Dataflow: Due to the lack of locality in a dataflow graph, a storage hierarchy is difficult to apply.
44
Lessons learned from dataflow (Continued)
Microprocessors: The operand matching of executable instructions in the instruction window is restricted to a part of the instruction sequence.
Because of the serial program order, the instructions in this window are likely to become executable soon. The matching hardware can be restricted to a small number of slots.
Dataflow: The number of tokens waiting for a match can be very high, so a large waiting-matching store is required.
Dataflow: Due to the lack of locality, the likelihood of the arrival of a matching token is difficult to estimate, so caching the tokens to be matched soon is difficult.
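A waiting-matching store can be sketched as a hash keyed by (tag, destination) — a deliberate simplification of real token stores, with invented names: the first operand of a dyadic instruction waits, and the arrival of its partner triggers the firing.

```python
# Simplified waiting-matching store for dyadic instructions: a token
# carries a tag (its context) and a destination instruction; the
# instruction fires when both partner tokens with the same key have
# arrived.
class MatchingStore:
    def __init__(self):
        self.waiting = {}               # (tag, dest) -> stored operand

    def arrive(self, tag, dest, value):
        key = (tag, dest)
        if key in self.waiting:         # partner already waiting: match
            return (dest, self.waiting.pop(key), value)
        self.waiting[key] = value       # otherwise wait for the partner
        return None                     # nothing fires yet

store = MatchingStore()
print(store.arrive("ctx0", "add1", 2))   # None: first operand waits
print(store.arrive("ctx0", "add1", 3))   # ('add1', 2, 3): add1 fires
```

The size of `waiting` corresponds to the waiting-matching store the slide describes: with many active contexts it can grow very large.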
45
Lessons learned from dataflow (Memory latency)
Microprocessors: An unsolved problem is the memory latency caused by cache misses.
Example: SGI Origin 2000:
– latencies are 11 processor cycles for an L1 cache miss,
– 60 cycles for an L2 cache miss,
– and up to 180 cycles for a remote memory access.
– In principle, these latencies should be multiplied by the superscalar issue width.
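The last point can be made concrete with a back-of-the-envelope calculation, using the Origin 2000 latencies above and an assumed (illustrative) 4-wide issue:

```python
# Back-of-the-envelope: issue slots wasted during one miss, assuming
# a 4-wide superscalar that could otherwise issue every cycle.
def issue_slots_lost(miss_cycles, issue_width=4):
    """Slots that go unused while the pipeline waits out the miss."""
    return miss_cycles * issue_width

for event, cycles in [("L1 miss", 11), ("L2 miss", 60), ("remote access", 180)]:
    print(f"{event}: up to {issue_slots_lost(cycles)} issue slots lost")
# L1 miss: up to 44 issue slots lost
# L2 miss: up to 240 issue slots lost
# remote access: up to 720 issue slots lost
```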
Microprocessors: Only a small part of the memory latency can be hidden by out-of-order execution, write buffer, cache preload hardware, lockup free caches, and a pipelined system bus.
As a result, microprocessors often idle and are unable to exploit the high degree of internal parallelism provided by a wide superscalar approach.
Dataflow: The rapid context switching avoids idling by switching execution to another context.
46
Lessons learned from dataflow (Continued)
Microprocessors: Finding enough fine-grain parallelism to fully exploit the processor will be the main problem for future superscalars.
Solution: enlarge the instruction window to several hundred instruction slots; this has two drawbacks:
– Most of the instructions in the window will be speculatively assigned, with a very deep speculation level (today's depth is normally four at maximum), so most instruction execution will be speculative. The principal problem arises from the single instruction stream that feeds the instruction window.
– If the instruction window is enlarged, updating the instruction states in the slots and matching executable instructions require more complex hardware logic in the issue stage of the pipeline, thus limiting the cycle rate.
47
Lessons learned from dataflow (Continued)
Solutions:
– decoupling of the instruction window with respect to different instruction classes,
– partitioning of the issue stage into several pipeline stages,
– and alternative instruction window organizations.

Alternative instruction window organization: the dependence-based microprocessor:
– The instruction window is organized as multiple FIFOs.
– Only the instructions at the heads of the FIFO buffers can be issued to the execution units in the next cycle.
– The total parallelism in the instruction window is restricted in favor of a less costly issue logic that does not slow down the processor cycle rate.
– Thereby the potential fine-grained parallelism is limited, somewhat similar to the threaded dataflow approach.
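The FIFO organization can be sketched as follows (a simplification with invented data structures: each dependence chain is assumed to be pre-steered into its own FIFO, and only the FIFO heads compete for issue each cycle):

```python
from collections import deque

# Dependence-based window sketch: one FIFO per dependence chain.
# Per cycle, only the instruction at each FIFO head may issue, and
# only if its source operands have already been produced.
def issue_cycle(fifos, done):
    """Issue every FIFO head whose operands are in `done`."""
    issued = []
    for f in fifos:
        if f and all(src in done for src in f[0][1]):
            dst, _ = f.popleft()
            done.add(dst)
            issued.append(dst)
    return issued

fifos = [deque([("r3", ["r1", "r2"]), ("r5", ["r3", "r4"])]),  # chain 1
         deque([("r7", ["r6"])])]                              # chain 2
done = {"r1", "r2", "r4", "r6"}
print(issue_cycle(fifos, done))   # cycle 1: the two heads, r3 and r7
print(issue_cycle(fifos, done))   # cycle 2: r5, now that r3 is done
```

Matching is restricted to one slot per FIFO, which is what makes the issue logic cheap at the cost of some fine-grained parallelism.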
48
Lessons learned from dataflow (alternative instruction window organizations)
Look at dataflow matching-store implementations. Look into dataflow solutions such as threaded dataflow (e.g. the repeat-on-input technique or the strongly-connected-arcs model).

The repeat-on-input strategy issues compiler-generated code sequences serially (in an otherwise fine-grained dataflow computer). Transferred to the local dataflow in an instruction window:
– an issue string might be used;
– a series of data-dependent instructions is generated by the compiler and issued serially after the issue of the leading instruction.

However, the high number of speculative instructions in the instruction window remains.