Topics on Compilers – Spring Semester 2011
Christine Wagner – 2011/06/08
Introduction
Modulo Scheduling Challenges
Core Concepts
Implementation
Experimental Results
Conclusion
2011/06/08 2Edge-centric Scheduling
Embedded computing systems in today’s portable devices
demand high performance and energy efficiency
Traditional application specific hardware: ASICs
Different functionalities on a single device (voice/data
communication, high definition video, digital photography)
High non-recurring costs for designing ASICs
Programmable hardware solutions
Coarse-Grained Reconfigurable Architectures (CGRAs)
• offer high computation throughput, scalability, low cost and energy efficiency
• consist of an array of functional units (FUs) and register files, often organized as a two-dimensional grid
• need a compiler that efficiently maps compute-intensive loops onto the array and exploits all available resources
Challenge: sparse connectivity and distributed register files
Values must be explicitly routed between producing and
consuming operations
No dedicated routing resources
Each FU serves either as a compute resource or as a routing resource
Approach of this paper: edge-centric modulo scheduling (EMS)
Modulo Scheduling exposes parallelism by overlapping
successive iterations of a loop
Goal: Find a valid schedule with minimal initiation interval (II)
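The benefit of overlapping iterations can be illustrated with a small calculation. This is a minimal sketch: the function names and the numbers (100 iterations, an 8-cycle loop body, II = 2) are illustrative assumptions, not values from the paper.

```python
def sequential_cycles(iterations: int, schedule_length: int) -> int:
    """Cycles needed when each iteration starts after the previous one ends."""
    return iterations * schedule_length


def modulo_scheduled_cycles(iterations: int, schedule_length: int, ii: int) -> int:
    """Cycles needed when a new iteration starts every `ii` cycles.

    The last iteration starts at cycle (iterations - 1) * ii and still needs
    schedule_length cycles to finish.
    """
    if iterations == 0:
        return 0
    return (iterations - 1) * ii + schedule_length


# Illustrative numbers only.
print(sequential_cycles(100, 8))           # 800
print(modulo_scheduled_cycles(100, 8, 2))  # 206
```

For long-running loops the total time is dominated by the II, which is why the scheduler tries to minimize it.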
Factors that complicate CGRA scheduling:
1. Explicit routing
VLIW: routing is implicitly guaranteed by storing intermediate
values in a multi-ported, centralized register file
CGRA: sparse connectivity and distributed register files
2. Intelligent routing
FUs are used for both computation and routing
Scheduling can easily fail due to poor routing choices
Routing resources should be used sparingly
3. Heterogeneous nodes
Inexpensive and expensive nodes
Avoid scheduling inexpensive operations on expensive
nodes
4. Modulo constraint
Resources are used periodically because the loop kernel
repeats every II cycles
Routability cannot be guaranteed by extending the
schedule
Scheduling can easily fail because of previously scheduled
operations
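A common way to model the modulo constraint is a modulo reservation table, in which a slot is identified by the FU and the cycle taken modulo II. The class below is a hypothetical helper written for illustration, not code from the paper. It shows why delaying an operation by a multiple of II does not resolve a conflict.

```python
class ModuloReservationTable:
    """Tracks which (FU, cycle mod II) slots are taken. Hypothetical helper."""

    def __init__(self, ii: int):
        self.ii = ii
        self.taken = {}  # (fu, cycle % ii) -> operation name

    def is_free(self, fu: str, cycle: int) -> bool:
        return (fu, cycle % self.ii) not in self.taken

    def reserve(self, fu: str, cycle: int, op: str) -> bool:
        """Reserve the slot if it is free; return False on a conflict."""
        key = (fu, cycle % self.ii)
        if key in self.taken:
            return False
        self.taken[key] = op
        return True


mrt = ModuloReservationTable(ii=2)
print(mrt.reserve("FU0", 0, "a"))  # True
print(mrt.reserve("FU0", 4, "b"))  # False: cycle 4 maps to the same row as cycle 0
print(mrt.reserve("FU0", 5, "b"))  # True
```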
CGRA scheduling consists of two tasks:
• Placement of operations into computation slots (FU and time)
• Routing of operands
Node-centric scheduling:
• Operations are placed first, then the routing is done
• Slots are visited one by one until a solution is found
• Scheduler does not consider routing information when placing operations
• Consequences: unnecessary visits to empty slots and redundant routing calls
Example:
Assumption: C can only be placed in (4,2) and (2,4)
(3,1): only remaining memory access slot
Difficult to find the right slot for placing an operation
Edge-centric scheduling:
• Operation placement integrated into the routing function
• Scheduler starts by routing the edge instead of placing the operation up front
• When an empty slot is found, the scheduler places the operation temporarily and checks whether other edges connected to the consumer exist
• If so, those edges are routed recursively
• If this routing fails, routing resumes from the current slot, not from the starting slot
Only one routing call is required
Costs are assigned to slots to avoid wasting expensive nodes
Shorter compile time and better schedules
Final schedule formed by calling a routing function for each
edge of the DFG
Order in which the router visits each slot determined by a
routing cost assigned to each slot
Two main objectives when routing a single edge:
• Minimizing number of routing resources used
• Proactively avoiding routing failure: avoid using resources that will block
future routes and reserve slots for expensive operations
Recurrence edges:
Edges in a recurrence cycle
Schedule them ahead of other operations, especially when II
is close to the length of the recurrence
Edges with the highest priority
Simple edges:
Outgoing edge of an operation that has only one consumer
High-fanout edges:
Outgoing edge of an operation with multiple consumers
Priority to simple edges over high-fanout edges
Non-critical and critical edges:
Multiple disjoint paths between two nodes
in the DFG
Dependencies between edges in different paths
Edges on critical path are scheduled first
Example:
Recurrence cycle (5, 6, 8) scheduled first, then 0
[Figure: example DFG with the non-critical path and the critical path marked]
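The edge categories above can be turned into a sort key. The following is a minimal sketch: the `Edge` fields and the order in which the criteria break ties are assumptions made for illustration, since the slides only state the individual priorities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    in_recurrence: bool = False    # edge is part of a recurrence cycle
    fanout: int = 1                # number of consumers of the producer
    on_critical_path: bool = False


def priority_key(edge: Edge):
    """Smaller tuples sort first. The tie-break order is an assumption."""
    return (
        not edge.in_recurrence,      # recurrence edges first
        edge.fanout > 1,             # simple edges before high-fanout edges
        not edge.on_critical_path,   # critical edges before non-critical ones
    )


edges = [
    Edge("a", "b", fanout=3),
    Edge("c", "d", on_critical_path=True),
    Edge("e", "f", in_recurrence=True),
    Edge("g", "h"),
]
ordered = sorted(edges, key=priority_key)
print([(e.src, e.dst) for e in ordered])
# [('e', 'f'), ('c', 'd'), ('g', 'h'), ('a', 'b')]
```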
Generation of reduced DFG
• Conversion of DFG into reduced form by collapsing nodes
• Operation is collapsible if inexpensive and has only one producer and
one consumer
• Remove node and draw edge from producer to consumer
• New edge annotated with number of collapsed nodes
Clustering of reduced DFG by ignoring high-fanout edges
Prioritize edges
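The collapsing step can be sketched as follows. The edge representation (a list of `(producer, consumer, collapsed_count)` tuples) and the set of inexpensive nodes passed in by the caller are assumptions for this example; the paper's data structures may differ.

```python
def reduce_dfg(edges, inexpensive):
    """Collapse inexpensive nodes that have one producer and one consumer.

    edges: list of (producer, consumer, collapsed_count) tuples.
    inexpensive: set of node names regarded as cheap (an assumed input).
    """
    edges = list(edges)
    for node in sorted(inexpensive):
        incoming = [e for e in edges if e[1] == node]
        outgoing = [e for e in edges if e[0] == node]
        if len(incoming) != 1 or len(outgoing) != 1:
            continue  # not collapsible
        producer, _, n_in = incoming[0]
        _, consumer, n_out = outgoing[0]
        if producer == node:
            continue  # self-loop, leave it untouched
        edges.remove(incoming[0])
        edges.remove(outgoing[0])
        # The new edge remembers how many nodes it replaces.
        edges.append((producer, consumer, n_in + n_out + 1))
    return edges


chain = [("a", "x", 0), ("x", "y", 0), ("y", "b", 0)]
print(reduce_dfg(chain, inexpensive={"x", "y"}))  # [('a', 'b', 2)]
```

The annotation on the new edge is what later allows the router to reserve enough FU slots for expanding the collapsed nodes.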
Operation scheduling by calling either placement or routing
function
• Placement function only called if target operation has no placed
producers or consumers
• Routing function: decides which edge to route first
• Decision based on factors such as schedule time, state-changeability of producers or consumers, and the number of available routing options
• Routing proceeds either forward or backward
Routing cost calculation
• Routing cost for each available slot
• Used by router to determine the order in which to explore slots
• Three primary components:
1. Static cost: fixed cost assigned to each slot
2. Affinity cost: given to two operations that have common consumers;
based on a slot’s distance from the already placed producers
3. Probability cost: probability of a slot to be required in the future
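How the three components might be combined into one number per slot is sketched below. The weighted sum, the weights, the Manhattan distance and the assumption that cheaper slots are explored first are all illustrative choices; the slides do not specify the actual formula. For simplicity a slot is identified only by its grid position here.

```python
def manhattan(a, b):
    """Grid distance between two (row, column) positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def routing_cost(slot, static_cost, placed_producers, future_use_probability,
                 w_affinity=1.0, w_probability=4.0):
    """Combine the three components into one number (weights are made up)."""
    affinity = sum(manhattan(slot, p) for p in placed_producers)
    return (static_cost[slot]
            + w_affinity * affinity
            + w_probability * future_use_probability.get(slot, 0.0))


static_cost = {(0, 0): 1, (0, 1): 1, (1, 1): 5}  # (1, 1) plays an expensive FU
producers = [(0, 0)]
probability = {(0, 1): 0.5}  # (0, 1) is likely to be needed later

order = sorted(static_cost,
               key=lambda s: routing_cost(s, static_cost, producers, probability))
print(order)  # [(0, 0), (0, 1), (1, 1)]
```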
Finding a target
• After updating all routing costs, router starts finding a path from the
source to the target operation
• Router visits neighboring slots in order of their assigned costs
• When routing collapsed edges, the path goes through at least as many
FUs as the number of collapsed nodes, so that they can be expanded
later without problems
• After a slot is found, the scheduler checks for other edges connected to the
target and recurses to route those edges
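The search for a target slot can be sketched as a cost-ordered search over (FU, time) slots. This is a simplified illustration under several assumptions: every hop takes one cycle, the function and parameter names are hypothetical, paths are limited to fewer than II cycles so that a path cannot collide with itself, and the recursion into further edges of the target is left out.

```python
import heapq


def route_and_place(start, neighbors, can_host, occupied, slot_cost, ii,
                    min_route_slots=0):
    """Find a path from the producer's slot to a slot for the consumer.

    start: (fu, time) of the placed producer.
    neighbors: dict fu -> FUs reachable in one cycle (list fu itself if a
               value may stay in place).
    can_host: set of FUs able to execute the consumer operation.
    occupied: set of (fu, time % ii) slots that are already taken.
    slot_cost: dict fu -> non-negative cost of using a slot on that FU.
    min_route_slots: minimum number of slots between producer and consumer,
                     for example the number of collapsed nodes on the edge.
    """
    start_fu, start_time = start
    heap = [(0, start_time, start_fu, [start])]
    seen = set()
    while heap:
        cost, time, fu, path = heapq.heappop(heap)
        if (fu, time) in seen:
            continue
        seen.add((fu, time))
        routed_through = time - start_time - 1
        if (fu, time) != start and fu in can_host and routed_through >= min_route_slots:
            return path  # the consumer is placed in the last slot of the path
        if time + 1 >= start_time + ii:
            continue  # sketch-only limit on the path length
        for nxt in neighbors[fu]:
            slot = (nxt, time + 1)
            if slot in seen or (nxt, (time + 1) % ii) in occupied:
                continue
            heapq.heappush(heap, (cost + slot_cost[nxt], time + 1, nxt, path + [slot]))
    return None


neighbors = {"FU0": ["FU0", "FU1"], "FU1": ["FU0", "FU1", "FU2"], "FU2": ["FU1", "FU2"]}
cost = {"FU0": 1, "FU1": 1, "FU2": 3}
path = route_and_place(("FU0", 0), neighbors, can_host={"FU2"}, occupied=set(),
                       slot_cost=cost, ii=4)
print(path)  # [('FU0', 0), ('FU1', 1), ('FU2', 2)]
```

In the example, FU1 is used as a routing resource because only FU2 can execute the consumer. Passing `min_route_slots=2` forces a longer path, which mirrors the requirement for collapsed edges.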
After finding a legal schedule, collapsed nodes are expanded
onto the found FU slots
Generation of configuration memories for each component
(e.g. control bits)
If scheduling fails, scheduler increases II and repeats
scheduling
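The retry loop around the scheduler can be sketched as follows. `try_schedule` is a hypothetical callback standing in for the real scheduler, and `toy_scheduler` is a stand-in that only checks whether enough (FU, cycle) slots exist; neither is taken from the paper.

```python
def schedule_with_increasing_ii(try_schedule, start_ii=1, max_ii=32):
    """Return (ii, schedule) for the first II at which try_schedule succeeds."""
    for ii in range(start_ii, max_ii + 1):
        schedule = try_schedule(ii)
        if schedule is not None:
            return ii, schedule
    raise RuntimeError(f"no schedule found up to II={max_ii}")


def toy_scheduler(ii, num_ops=10, num_fus=4):
    """Stand-in scheduler: succeeds once ii * num_fus slots hold all operations."""
    if num_ops > ii * num_fus:
        return None
    # Map each operation to an (FU index, cycle) pair.
    return {op: (op % num_fus, op // num_fus) for op in range(num_ops)}


ii, schedule = schedule_with_increasing_ii(toy_scheduler)
print(ii)  # 3
```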
Benchmarks: media applications from embedded domain
(H.264 encoder, 3D graphics, AAC decoder, MP3 decoder)
CGRA architecture: 4x4 heterogeneous array with 4 MEM and 6
MULT FUs, a central RF, and a local RF for each FU
Loops with varying size mapped onto different configurations
Comparison with traditional, node-centric and simulated
annealing based modulo scheduling
Performance improvement of 25% over traditional modulo
scheduling
10-13% higher performance and 27-46% shorter compile time
compared to node-centric scheduling
Simulated annealing is the most effective strategy, but its high
performance comes at the cost of slow compile time (EMS compiles 18x faster)
EMS (edge-centric modulo scheduling) achieves performance competitive with
simulated annealing
Edge-centric modulo scheduling for CGRAs
Focus on the routing process, with operation placement as a by-product
Performance improvement of 25% over traditional modulo scheduling
Reduced compilation time (18x faster than simulated annealing)
Performance heavily depends on characteristics of loop structure and underlying CGRA architecture
Thank you for listening!
Please feel free to ask questions!
Park, H., Fan, K., Mahlke, S., Oh, T., Kim, H., Kim, H.: Edge-centric Modulo
Scheduling for Coarse-Grained Reconfigurable Architectures. Proceedings
of PACT ’08, ACM New York, pp. 166–176.