TRANSCRIPT
A Study on Intelligent Run-time Resource Management Techniques
for Large Tiled Multi-Core Architectures
Dong-In Kang, Jinwoo Suh, Janice O. McMahon, and Stephen P. Crago
University of Southern California, Information Sciences Institute
Oct. 25th, 2008

This effort was sponsored by the Defense Advanced Research Projects Agency (DARPA) through the Dept. of Interior National Business Center (NBC), under grant number NBCH1050022. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Defense Advanced Research Projects Agency (DARPA), Dept. of Interior, NBC, or the U.S. Government.
Outline
- Motivation and System Framework
- Simulation Model
- Applications and Results: Mapping Performance
- Future Work and Conclusion
Motivation
Rising processor density and multi-core architectures
- Tilera TILE64 processor
- Intel 80-core Teraflops prototype processor
- In the future: 32x32 tile arrays or larger

Dynamic applications
- With large-scale architectures, multiple applications will share the tile array; resources must be arbitrated between applications
- Requirements are unpredictable and change in response to the environment
- Need for dynamic mapping, allocation/deallocation, and resource management

Need for more intelligent run-time systems
- Better resource allocation for efficiency
- New system capabilities to alleviate programming burden
- Dynamic reactivity for changing scenarios
- Complex optimization space
Research Goal: Intelligent Run-Time Resource Allocation
Application of knowledge-based and learning techniques to the problem of dynamic resource allocation in a tiled architecture
- Planning and optimization of resource usage given application constraints
- Knowledge derived from dynamic performance profiling, modeling, and prediction
- Run-time learning of resource allocation trade-offs

Proof-of-concept demonstration: simulation of an application workload to compare the performance of manual and automatic resource allocations
- Define a representative application
- Develop a scalable simulation
- Define the resource mapping problem and apply an intelligent technique

Current work: definition of the problem, construction of the simulator, and a preliminary study to quantify performance effects
System Framework
[Diagram: system framework]
- Application-level Task Composition: generates tasks that meet application goals in the current application environment; outputs a task graph and performance requirements
- Machine-level Resource Allocation: task assignment to maximize machine performance on the processing environment (research focus; prototype experiments in machine-level resource allocation run in the simulation environment)
- Knowledge-base: built up through performance monitoring and profiling
ISI Simulator for Multi-Core Architecture
Event-driven simulator
- Written in SystemC
- Based on the ARMn multiprocessor simulator from Princeton University

Generic processor module
- Proportional-share resource scheduler
- Memory latency is implicit in computation time
- Parameterized specification: CPU speed, CPU-network interface speed, etc.

Network
- One network, a 2-D mesh of tiles; each tile pairs a processing element (PE) with a 5x5 crossbar switch
- Packet-switching model with store-and-forward routing
- Parameterized specification: speed, buffer size, packet overhead, etc.

Profile information collected:
- Per link: data amount, blocking time
- Per packet: latency, blocking time
- Per task: event log, blocking time
- Per CPU: utilization
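As a rough illustration of the store-and-forward routing model described on this slide, the sketch below computes per-packet latency in a 2-D mesh. This is a minimal model, assuming uniform link bandwidth and a fixed per-hop packet overhead; the function and parameter names are illustrative, not the simulator's.

```python
def hop_count(src, dst):
    """Manhattan distance between two tiles (x, y) in a 2-D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def store_and_forward_latency(packet_bits, src, dst, link_bw, overhead):
    """Under store-and-forward routing, each switch must receive the
    whole packet before forwarding it, so zero-load latency scales with
    hops * (serialization time + per-hop overhead)."""
    per_hop = packet_bits / link_bw + overhead
    return hop_count(src, dst) * per_hop

# Example: a 512-bit packet over 4 hops on 1 bit/cycle links with
# 2 cycles of per-hop overhead takes 4 * (512 + 2) = 2056 cycles.
latency = store_and_forward_latency(512, (0, 0), (2, 2), 1.0, 2.0)
```

Blocking time, which the simulator profiles per link and per packet, would add on top of this zero-load figure when buffers contend.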
Example Application: Radar Resource Management using IRT
- Drive IRT spreadsheets and morphing scenarios with parameters from a notional radar resource manager
- GMTI task graph is the input to the simulator
- FAT requirements are modeled with statistically varying processor loads and message traffic
[Figure: GMTI task graph. Pipeline stages and per-stage compute loads: Subband Analysis; Time Delay & Equalization (146 MFLOPs); Adaptive Beamforming (42 MFLOPs); Pulse Compression (112 MFLOPs); Doppler Filtering (32.4 MFLOPs); STAP (47 MFLOPs); Subband Synthesis; Data Combination; Target Detection (4.7 MFLOPs); 3D Grouping; Parameter Estimation (189 MFLOPs). The graph splits into Nsub parallel subband chains, and parts of it are data independent. Per-stage parallelism, communication, and task counts (P, C, T) are expressed in terms of Nsrg, Nch, Npri, Nbm, Ndop, Ncnb, Nrg, and Nstage, e.g. P = C = T = (Nsrg x Nch x Npri) for Time Delay & Equalization.]
Example: GMTI Application Mapping
[Figure: the GMTI task graph mapped onto the tile array in two ways: (a) Clustered, (b) Fragmented. Interference from other applications is modeled by a random traffic generator.]
Experiment parameters:
- GMTI resource requirement: 128 tiles or 868 tiles
- Free tiles: fragmented or clustered
- Mapping technique: random or heuristic
(16 cases in total)
128-Tile GMTI Performance

With the lower GMTI allocation (128 tiles), end-to-end latency is more sensitive to background network load when resources are fragmented.

[Chart: end-to-end latency (128-core GMTI) vs. background random network traffic (0%, 2.5%, 5%, 7.5%, 10%) for four cases: non-fragmented + heuristic mapping, non-fragmented + random mapping, fragmented + heuristic mapping, fragmented + random mapping; the curves group by fragmented vs. clustered resources. Latency axis: 0 to 8000.]
868-Tile GMTI Performance

With the higher GMTI allocation (868 tiles), latency is less sensitive to background network load and more sensitive to the task mapping algorithm than to fragmentation of the resources.

[Chart: end-to-end latency (868-core GMTI) vs. background random network traffic (0%, 2.5%, 5%, 7.5%, 10%) for the same four fragmentation/mapping cases; the curves group by heuristic vs. random mapping. Latency axis: 0 to 10000.]
Run-time Improvement

Start with a random mapping

Knowledge built through run-time profiling:
- Communication amount at each network port
- Waiting time per flit at each network port
- Point-to-point communication between pairs of tasks
- Estimated flow graph
- Routing path

Two remapping algorithms:
- Hot-spot remover
- Genetic algorithm

Network-oriented knowledge:
- Per-flit waiting time and the set of point-to-point communications through each port
- X-Y routing is assumed
- Only the top 100 hot-spots are kept; the flow graph is not derived (room for future improvement)
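Because X-Y routing is assumed, the links that a point-to-point communication traverses can be enumerated deterministically, which is what lets profiled traffic be attributed to specific mesh ports. A minimal sketch of that routing function (the name is illustrative):

```python
def xy_route(src, dst):
    """Return the ordered list of tiles a packet visits under X-Y
    (dimension-ordered) routing: travel fully along x first, then y."""
    path = [src]
    x, y = src
    step = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step
        path.append((x, y))
    step = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step
        path.append((x, y))
    return path
```

Consecutive tile pairs in the returned path are the links a flow loads, so summing each flow's traffic over its path yields the per-port communication amounts in the knowledge store.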
Knowledge Usage

Hot-spot remover algorithm:
- Relocate a communication going through a hot-spot
- Knowledge is used to estimate whether the relocation would create another hot-spot
- If there is no corresponding information in the knowledge store, simply use the aggregate network load going through the hot-spot
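The hot-spot remover's decision step might be sketched as follows, under the assumptions on this slide: per-link loads come from run-time profiling, and a link missing from the knowledge store falls back to its aggregate load. The function and parameter names are hypothetical, not the authors' code.

```python
def estimated_load_after_move(link_loads, moved_traffic, new_links, known):
    """Estimate the worst per-link load if a communication carrying
    `moved_traffic` is rerouted onto `new_links`. `known` holds profiled
    per-link knowledge; links absent from it are estimated from the
    aggregate loads in `link_loads` alone."""
    worst = 0.0
    for link in new_links:
        base = known.get(link, link_loads.get(link, 0.0))
        worst = max(worst, base + moved_traffic)
    return worst
```

A relocation would then be accepted only if this estimate stays below the load on the hot-spot being relieved, so the move does not simply shift the hot-spot elsewhere.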
Genetic algorithm:
- The previous mapping is included in the initial population
- The fitness function is the inverse of the estimated end-to-end latency
- End-to-end latency is estimated heuristically using the knowledge store
- Algorithm: build initial population; calculate fitness values; cross over; mutate; repeat until a preset condition is met; choose the mapping with the highest fitness
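The genetic-algorithm steps above can be sketched as a short loop. This is a minimal sketch under stated assumptions, not the authors' implementation: a mapping is a dict from tasks to distinct tiles, the previous mapping seeds the population, and `estimate_latency` stands in for the knowledge-based end-to-end latency estimator (any positive-valued function works here).

```python
import random

def ga_map(tasks, tiles, prev_mapping, estimate_latency,
           pop_size=20, generations=50, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)

    def random_mapping():
        # assign each task a distinct tile at random
        return dict(zip(tasks, rng.sample(tiles, len(tasks))))

    def fitness(m):
        # fitness is the inverse of the estimated end-to-end latency
        return 1.0 / estimate_latency(m)

    def crossover(a, b):
        # inherit each task's tile from a random parent, repairing duplicates
        child, used = {}, set()
        for t in tasks:
            tile = a[t] if rng.random() < 0.5 else b[t]
            if tile in used:
                tile = next(x for x in tiles if x not in used)
            child[t] = tile
            used.add(tile)
        return child

    def mutate(m):
        # occasionally swap the tiles of two tasks
        if rng.random() < mutation_rate:
            t1, t2 = rng.sample(tasks, 2)
            m[t1], m[t2] = m[t2], m[t1]
        return m

    # previous mapping is included in the initial population
    pop = [dict(prev_mapping)] + [random_mapping() for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]          # keep the fitter half
        children = [mutate(crossover(*rng.sample(parents, 2)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)               # highest-fitness mapping
```

Because the previous mapping is in the initial population and the fitter half always survives, the returned mapping is never worse (by the estimator) than the one the run started from.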
Results (128-core GMTI)

[Charts: end-to-end latency vs. iteration count (1 to 49) for 128-core GMTI under four optimizers: hot-spot remover, genetic algorithm, hot-spot remover with knowledge, and genetic algorithm with knowledge, each at background traffic levels of 0%, 2.5%, 5%, 7.5%, and 10%. Latency axis: roughly 2000 to 6000.]
Results (868-core GMTI)

[Charts: end-to-end latency vs. iteration count (1 to 49) for 868-core GMTI under the same four optimizers: hot-spot remover, genetic algorithm, hot-spot remover with knowledge, and genetic algorithm with knowledge, each at background traffic levels of 0%, 2.5%, 5%, 7.5%, and 10%. Latency axis: roughly 2000 to 6000.]
Future Work
Run-time profiling techniques
- Low overhead
- Detecting network load imbalance
- Detecting communication patterns and dependencies
- Detecting processor load imbalance

Gradual morphing technique
- Intelligent partial remapping of the application (hot-spot removal, load balancing)
- Adaptive changes to the application's degree of parallelism, adapting to the status of free resources

Run-time support for gradual morphing
- Task migration
- Dynamic changes to the application's degree of parallelism

Dynamic adaptation of code
- Expression of dynamic code changes at run-time
- Dynamic code generation
Conclusion
Large multi-tiled architectures will require intelligent run-time systems for resource allocation and dynamic mapping
- Future applications will be complex and will need to respond to unexpected events
- Future large-scale multi-tiled architectures will require application sharing and arbitration of resources

In this research, we have explored the effect of dynamic application variations on multi-tiled performance
- Dynamic IRT with representative mode changes and load variations causes resource fragmentation over time
- Application mappings must be done carefully in order to provide robust behavior

Future work will involve applying knowledge-based and machine-learning algorithms to profile-based dynamic resource management