TRANSCRIPT
A Study on Intelligent Run-time Resource Management Techniques
for Large Tiled Multi-Core Architectures
Dong-In Kang, Jinwoo Suh, Janice O. McMahon, and Stephen P. Crago
University of Southern California, Information Sciences Institute
Oct. 25th, 2008

This effort was sponsored by the Defense Advanced Research Projects Agency (DARPA) through the Dept. of Interior National Business Center (NBC), under grant number NBCH1050022. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Defense Advanced Research Projects Agency (DARPA), Dept. of Interior, NBC, or the U.S. Government.
Outline
- Motivation and System Framework
- Simulation Model
- Applications and Results: Mapping Performance
- Future Work and Conclusion
Motivation
Rising processor density and multi-core architectures
- Tilera TILE64 processor
- Intel 80-core Teraflops prototype processor
- In the future: 32x32 tile arrays or larger

Dynamic applications
- With large-scale architectures, multiple applications will share the tile array; resources must be arbitrated between applications
- Requirements are unpredictable and change in response to the environment
- Need for dynamic mapping, allocation/deallocation, and resource management

Need for more intelligent run-time systems
- Better resource allocation for efficiency
- New system capabilities to alleviate programming burden
- Dynamic reactivity for changing scenarios
- Complex optimization space
Research Goal: Intelligent Run-Time Resource Allocation
Application of knowledge-based and learning techniques to the problem of dynamic resource allocation in a tiled architecture
- Planning and optimization of resource usage given application constraints
- Knowledge derived from dynamic performance profiling, modeling, and prediction
- Run-time learning of resource allocation trade-offs

Proof-of-concept demonstration: simulation of an application workload to compare the performance of manual and automatic resource allocations
- Define a representative application
- Develop a scalable simulation
- Define the resource mapping problem and apply an intelligent technique

Current work: definition of the problem, construction of the simulator, and a preliminary study to quantify performance effects
System Framework
[Diagram: system framework]
- Application-level Task Composition: generates tasks that meet application goals in the current application environment; outputs a task graph and performance requirements
- Machine-level Resource Allocation: task assignment to maximize machine performance on the processing environment (research focus; prototype experiments in machine-level resource allocation run in the simulation environment)
- Knowledge-base: built up through performance monitoring and profiling
ISI Simulator for Multi-Core Architecture
Event-driven simulator
- Written in SystemC
- Based on the ARMn multiprocessor simulator from Princeton University

Generic processor module
- Proportional-share resource scheduler
- Memory latency is implicit in computation time
- Parameterized specification: CPU speed, CPU-network interface speed, etc.

Network
- One network, a 2-D mesh of tiles; each tile pairs a processing element (PE) with a 5x5 crossbar switch
- Packet-switching model with store-and-forward routing
- Parameterized specification: speed, buffer size, packet overhead, etc.

Profile information collected:
- Per link: data amount, blocking time
- Per packet: latency, blocking time
- Per task: event log, blocking time
- Per CPU: utilization
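As a rough illustration of the store-and-forward routing model described on this slide, the sketch below computes per-packet latency in a 2-D mesh. This is a minimal model, assuming uniform link bandwidth and a fixed per-hop packet overhead; the function and parameter names are illustrative, not the simulator's.

```python
def hop_count(src, dst):
    """Manhattan distance between two tiles (x, y) in a 2-D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def store_and_forward_latency(packet_bits, src, dst, link_bw, overhead):
    """Under store-and-forward routing, each switch must receive the
    whole packet before forwarding it, so zero-load latency scales with
    hops * (serialization time + per-hop overhead)."""
    per_hop = packet_bits / link_bw + overhead
    return hop_count(src, dst) * per_hop

# Example: a 512-bit packet over 4 hops on 1 bit/cycle links with
# 2 cycles of per-hop overhead takes 4 * (512 + 2) = 2056 cycles.
latency = store_and_forward_latency(512, (0, 0), (2, 2), 1.0, 2.0)
```

Blocking time, which the simulator profiles per link and per packet, would add on top of this zero-load figure when buffers contend.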
Example Application: Radar Resource Management using IRT
- Drive IRT spreadsheets and morphing scenarios with parameters from a notional radar resource manager
- GMTI task graph is the input to the simulator
- FAT requirements are modeled with statistically varying processor loads and message traffic
[Figure: GMTI task graph. Pipeline stages and per-stage compute loads: Subband Analysis; Time Delay & Equalization (146 MFLOPs); Adaptive Beamforming (42 MFLOPs); Pulse Compression (112 MFLOPs); Doppler Filtering (32.4 MFLOPs); STAP (47 MFLOPs); Subband Synthesis; Data Combination; Target Detection (4.7 MFLOPs); 3D Grouping; Parameter Estimation (189 MFLOPs). The graph splits into Nsub parallel subband chains, and parts of it are data independent. Per-stage parallelism, communication, and task counts (P, C, T) are expressed in terms of Nsrg, Nch, Npri, Nbm, Ndop, Ncnb, Nrg, and Nstage, e.g. P = C = T = (Nsrg x Nch x Npri) for Time Delay & Equalization.]
Example: GMTI Application Mapping
[Figure: the GMTI task graph mapped onto the tile array in two ways: (a) Clustered, (b) Fragmented. Interference from other applications is modeled by a random traffic generator.]
Experiment parameters:
- GMTI resource requirement: 128 tiles or 868 tiles
- Free tiles: fragmented or clustered
- Mapping technique: random or heuristic
(16 cases in total)
128-Tile GMTI Performance

With the lower GMTI allocation (128 tiles), end-to-end latency is more sensitive to background network load when resources are fragmented.

[Chart: end-to-end latency (128-core GMTI) vs. background random network traffic (0%, 2.5%, 5%, 7.5%, 10%) for four cases: non-fragmented + heuristic mapping, non-fragmented + random mapping, fragmented + heuristic mapping, fragmented + random mapping; the curves group by fragmented vs. clustered resources. Latency axis: 0 to 8000.]
868-Tile GMTI Performance

With the higher GMTI allocation (868 tiles), latency is less sensitive to background network load and more sensitive to the task mapping algorithm than to fragmentation of the resources.

[Chart: end-to-end latency (868-core GMTI) vs. background random network traffic (0%, 2.5%, 5%, 7.5%, 10%) for the same four fragmentation/mapping cases; the curves group by heuristic vs. random mapping. Latency axis: 0 to 10000.]
Run-time Improvement

Start with a random mapping

Knowledge built through run-time profiling:
- Communication amount at each network port
- Waiting time per flit at each network port
- Point-to-point communication between pairs of tasks
- Estimated flow graph
- Routing path

Two remapping algorithms:
- Hot-spot remover
- Genetic algorithm

Network-oriented knowledge:
- Per-flit waiting time and the set of point-to-point communications through each port
- X-Y routing is assumed
- Only the top 100 hot-spots are kept; the flow graph is not derived (room for future improvement)
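Because X-Y routing is assumed, the links that a point-to-point communication traverses can be enumerated deterministically, which is what lets profiled traffic be attributed to specific mesh ports. A minimal sketch of that routing function (the name is illustrative):

```python
def xy_route(src, dst):
    """Return the ordered list of tiles a packet visits under X-Y
    (dimension-ordered) routing: travel fully along x first, then y."""
    path = [src]
    x, y = src
    step = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step
        path.append((x, y))
    step = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step
        path.append((x, y))
    return path
```

Consecutive tile pairs in the returned path are the links a flow loads, so summing each flow's traffic over its path yields the per-port communication amounts in the knowledge store.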
Knowledge Usage

Hot-spot remover algorithm:
- Relocate a communication going through a hot-spot
- Knowledge is used to estimate whether the relocation would create another hot-spot
- If there is no corresponding information in the knowledge store, simply use the aggregate network load going through the hot-spot
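The hot-spot remover's decision step might be sketched as follows, under the assumptions on this slide: per-link loads come from run-time profiling, and a link missing from the knowledge store falls back to its aggregate load. The function and parameter names are hypothetical, not the authors' code.

```python
def estimated_load_after_move(link_loads, moved_traffic, new_links, known):
    """Estimate the worst per-link load if a communication carrying
    `moved_traffic` is rerouted onto `new_links`. `known` holds profiled
    per-link knowledge; links absent from it are estimated from the
    aggregate loads in `link_loads` alone."""
    worst = 0.0
    for link in new_links:
        base = known.get(link, link_loads.get(link, 0.0))
        worst = max(worst, base + moved_traffic)
    return worst
```

A relocation would then be accepted only if this estimate stays below the load on the hot-spot being relieved, so the move does not simply shift the hot-spot elsewhere.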
Genetic algorithm:
- The previous mapping is included in the initial population
- The fitness function is the inverse of the estimated end-to-end latency
- End-to-end latency is estimated heuristically using the knowledge store
- Algorithm: build initial population; calculate fitness values; cross over; mutate; repeat until a preset condition is met; choose the mapping with the highest fitness
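The genetic-algorithm steps above can be sketched as a short loop. This is a minimal sketch under stated assumptions, not the authors' implementation: a mapping is a dict from tasks to distinct tiles, the previous mapping seeds the population, and `estimate_latency` stands in for the knowledge-based end-to-end latency estimator (any positive-valued function works here).

```python
import random

def ga_map(tasks, tiles, prev_mapping, estimate_latency,
           pop_size=20, generations=50, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)

    def random_mapping():
        # assign each task a distinct tile at random
        return dict(zip(tasks, rng.sample(tiles, len(tasks))))

    def fitness(m):
        # fitness is the inverse of the estimated end-to-end latency
        return 1.0 / estimate_latency(m)

    def crossover(a, b):
        # inherit each task's tile from a random parent, repairing duplicates
        child, used = {}, set()
        for t in tasks:
            tile = a[t] if rng.random() < 0.5 else b[t]
            if tile in used:
                tile = next(x for x in tiles if x not in used)
            child[t] = tile
            used.add(tile)
        return child

    def mutate(m):
        # occasionally swap the tiles of two tasks
        if rng.random() < mutation_rate:
            t1, t2 = rng.sample(tasks, 2)
            m[t1], m[t2] = m[t2], m[t1]
        return m

    # previous mapping is included in the initial population
    pop = [dict(prev_mapping)] + [random_mapping() for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]          # keep the fitter half
        children = [mutate(crossover(*rng.sample(parents, 2)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)               # highest-fitness mapping
```

Because the previous mapping is in the initial population and the fitter half always survives, the returned mapping is never worse (by the estimator) than the one the run started from.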
Results (128-core GMTI)

[Charts: end-to-end latency vs. iteration count (1 to 49) for 128-core GMTI under four optimizers: hot-spot remover, genetic algorithm, hot-spot remover with knowledge, and genetic algorithm with knowledge, each at background traffic levels of 0%, 2.5%, 5%, 7.5%, and 10%. Latency axis: roughly 2000 to 6000.]
Results (868-core GMTI)

[Charts: end-to-end latency vs. iteration count (1 to 49) for 868-core GMTI under the same four optimizers: hot-spot remover, genetic algorithm, hot-spot remover with knowledge, and genetic algorithm with knowledge, each at background traffic levels of 0%, 2.5%, 5%, 7.5%, and 10%. Latency axis: roughly 2000 to 6000.]
Future Work
Run-time profiling techniques
- Low overhead
- Detecting network load imbalance
- Detecting communication patterns and dependencies
- Detecting processor load imbalance

Gradual morphing technique
- Intelligent partial remapping of the application (hot-spot removal, load balancing)
- Adaptive changes to the application's degree of parallelism, adapting to the status of free resources

Run-time support for gradual morphing
- Task migration
- Dynamic changes to the application's degree of parallelism

Dynamic adaptation of code
- Expression of dynamic code changes at run-time
- Dynamic code generation
Conclusion
Large multi-tiled architectures will require intelligent run-time systems for resource allocation and dynamic mapping
- Future applications will be complex and will need to respond to unexpected events
- Future large-scale multi-tiled architectures will require application sharing and arbitration of resources

In this research, we have explored the effect of dynamic application variations on multi-tiled performance
- Dynamic IRT with representative mode changes and load variations causes resource fragmentation over time
- Application mappings must be done carefully in order to provide robust behavior

Future work will involve applying knowledge-based and machine-learning algorithms to profile-based dynamic resource management