shared cache aware task mapping for wcrt minimization · 2013. 4. 24. · t4 t5 . previous task ......
TRANSCRIPT
-
Shared Cache Aware Task Mapping for WCRT Minimization
Huping Ding & Tulika Mitra School of Computing, National University of Singapore
Yun Liang Center for Energy-efficient Computing and Applications, School of EECS, Peking University
-
Multi-core Processor in Embedded Systems
Embedded systems adopt multi-core processors due to performance, thermal and power constraints
Embedded systems have real-time constraints, e.g., time deadline of tasks
Shared Resources, like shared cache, make the timing difficult to estimate, e.g. Worst-Case Response Time (WCRT) analysis
Core 1 Core 2 Core n ….
Shared Resources
-
Worst-Case Response Time
Tasks can be mapped to any core in a multi-core systems
The maximal latest finish time among all these tasks is the Worst-Case Response Time (WCRT) of the application
An important metric for schedulability analysis
Our purpose is to minimize WCRT in multi-core systems with shared cache via task mapping
A multitasking application
t1
t2 t3
t4 t5
-
Previous Task Mapping Approaches
They usually focus on balancing the workload
They are agnostic to the shared cache behavior – State-of-the-art works on shared cache usually fix task
mapping before shared L2 cache analysis
-
Motivating Example
adpcm
ndes minver
crc compress
Task graph Configuration: 2-core, private L1 cache: 256 bytes, shared L2 cache: 2KB w/o L2 cache modeling: Assume cache miss (hit) for each L2 access in the worst (best) case
0
2
4
6
8
10
12
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
WC
RT
(mill
ion
cyc
les)
Task mappings
w/o L2 cache modeling with L2 cache modeling
It is imperative to design a shared cache aware task mapping solution!
-
Our Architecture
A private L1 cache for each core
All cores share a L2 cache
Tasks execute in a non-preemptive fashion
Shared L2 cache
CPU
L1 cache
Core1 CPU
L1 cache
Core 2 CPU
L1 cache
Core n
…
Main memory
-
Private Cache Systems
Usually only the intra-task analysis is required to estimate the WCET (Worst-Case Execution Time)
No inter-core cache conflicts
Cache
Off-chip
Main Memory
CPU
-
Shared Cache Multi-core Systems
Inter-core cache conflicts
– Lead to extra L2 cache misses
This makes the timing analysis complex
m1 m2
…
…
young old
Access m2
Hit
LRU replacement policy
Shared L2 cache
Core 2 Core 1
Task T
m1 m2
…
…
young old
Access m2
Miss
LRU replacement policy
Shared L2 cache
Core 2 Core 1
Task T
m1’ m1
Access m1’
Task T’
-
Task Interference
Task lifetime : [Earliest Ready, Latest Finish]
Interfering tasks : Two tasks mapped to different cores with lifetime overlapped
Two tasks with dependence between them can never interfere with each other
Time
t1
t2 t3
t4
Interfering tasks: t2 and t3, t2 and t4
Core 1 Core 2 t1
t2
t3
t4
-
Current Efforts on Shared Cache
They focus on modeling cache and inter-core cache conflicts (Yan and Zhang RTAS 2008, Hardy and Puaut RTSS 2008 and Liang et al. RTS 2012)
Task mapping is fixed (known) before analysis, and agnostic to the shared cache conflicts
-
Impact of Task Mapping
Task interference – Two tasks can interfere with each other only when they are
mapped to different cores
Workload balance among cores – Workload balance influences the total execution time of a
multitasking application, thus impacts its WCRT
Task interference
Inter-core cache conflict
Execution time of task
WCRT
-
Our Aims
We target the task mapping problem in multi-core systems with shared cache, in order to optimize the WCRT
We not only consider workload balance, but also minimize inter-core cache conflicts
-
Challenges
Exhaustive enumeration of all mappings is impossible as # of cores and # of tasks increase
Significant complexity due to inter-dependency between task mapping and task execution time
Task mapping
Task execution time
Task interference
Cache conflict
Workload balancing
-
Our Approach
An approach based on ILP (integer linear programming) formulation
Intra-task cache
analysis
Task mapping modeling
WCET computation
modeling
Shared cache
modeling
WCRT computation
modeling
ILP problem Objective: minimize WCRT
ILP solver Result: a task mapping
Accurate WCRT analysis framework
Final WCRT
-
Intra-task Analysis
Must analysis – Captures the memory blocks that are guaranteed in the
cache
May analysis – Captures the memory blocks that are never in the cache
Memory block access classification – Always hit
– Always miss
– Non-classified
-
Task Mapping Modeling
Suppose there are N tasks and M cores in the multi-core system. 𝑴𝑨𝑷𝒊𝑘 is a 0-1 decision variable to indicate if task i is mapped to core k. Then for any task i (0
-
Shared Cache Modeling
a
b
m
c
d
a
b
m
Cache state before considering cache conflicts
d
e
a
b
Possible cache states after considering cache conflicts
(𝑎𝑔𝑒𝑚+ 𝑐𝑜𝑛𝑓𝑙𝑖𝑐𝑡𝑗𝑠
𝑗 ∈ 𝑖𝑛𝑡𝑓(𝑖)
) ≥ 𝐴
{m, a, b, c, d, e} are mapped to the same set s in shared L2 cache. m, a, b and c are from task i, while d and e are tasks u and v respectively
Hit
Miss
young
old
LRU replacement policy
-
WCET and WCRT Computation
WCET – Intra-task analysis and shared cache modeling
WCRT – WCRT is the summation of WCET value on the critical path
-
Approximations in ILP
Interfering tasks – Two tasks i and j have no dependence and are mapped to
different cores
WCET = Original WCET + Cache conflict penalty – Intra-task cache analysis Original WCET
– Shared cache modeling Cache conflict penalty
Why approximations ? – Make the problem easy to solve
Estimate accurate WCRT with the mapping released by ILP formulation by a WCRT analysis framework
𝑖𝑛𝑡𝑒𝑟𝑓𝑒𝑟𝑒𝑖𝑗 = 𝐷𝐶𝑖𝑗
-
Iterative Analysis Framework
Yun Liang, Huping Ding, Tulika Mitra, Abhik Roychoudhury, Yan Li, Vivy Suhendra Timing Analysis of Concurrent Programs Running on Shared Cache Multi-Cores. In Real-Time System Journal 2012
L1 cache analysis
L1 cache analysis
Filter Filter
L2 cache analysis
L2 cache analysis
L2 cache Conflict analysis
WCRT analysis
Interference changes?
yes no
Estimated WCRT
Core 1 Core 2
Initial task interference
Modified task interference
-
Experimental Evaluation
Real-world benchmark: DEBIE benchmark
Synthetic benchmarks: MRTC benchmark suite
Comparison among different approaches
Enumerate all Task mappings
Evaluate WCRT with L2 cache modeling
Select the best mapping
ILP formulation
Solve ILP problem
Iterative analysis framework
Enumerate all Task mappings
Evaluate WCRT w/o L2 cache modeling
Task mapping released
Select the best mapping
Optimal mapping Our mapping Mapping w/o L2 cache modeling
-
DEBIE Benchmark Results
0
200
400
600
800
1000
1200
1400
1600
1800
2000
L1:1KB L2:16KB
WC
RT
(mill
ion
cyc
les)
Configuration: 4-core
Optimal
Our approach
w/o L2 cache modeling
-
Synthetic Benchmarks Results
0.000.200.400.600.801.001.201.401.601.802.00
Taskgraph 1
Taskgraph 2
Taskgraph 3
Taskgraph 4
Taskgraph 5
Taskgraph 6
Taskgraph 7
Taskgraph 8
Taskgraph 9
Re
lati
ve W
CR
T
Configurations: 4-core, L1:512B, L2:4KB
Optimal Our approach w/o L2 cache modeling
-
Runtime for Synthetic Benchmarks
Task graph # of tasks Our approach (seconds)
Optimal (seconds)
1 6 6.59 3.98
2 7 33.37 1.42
3 8 8.77 4.93
4 9 4.22 15.05
5 10 17.20 67.54
6 11 9.27 269.29
7 12 396.33 1089.35
8 13 480.62 3,883.00
9 14 539.31 22,794.00
-
Conclusions
Tasking mapping has a great impact on WCRT
An ILP formulation approach that is aware of the shared cache conflict
Our approach not only considers the workload balance but also minimizes inter-core cache conflict
Significant reduction in WCRT compared to traditional approach agnostic to shared cache conflicts
-
Q & A