shared cache aware task mapping for wcrt minimization · 2013. 4. 24. · t4 t5 . previous task ......

Shared Cache Aware Task Mapping for WCRT Minimization

Huping Ding & Tulika Mitra School of Computing, National University of Singapore

Yun Liang Center for Energy-efficient Computing and Applications, School of EECS, Peking University

Multi-core Processor in Embedded Systems

Embedded systems adopt multi-core processors due to performance, thermal and power constraints

Embedded systems have real-time constraints, e.g., task deadlines

Shared resources, such as a shared cache, make timing hard to estimate and complicate analyses such as Worst-Case Response Time (WCRT) analysis

[Figure: cores 1 to n connected to shared resources]

Worst-Case Response Time

Tasks can be mapped to any core in a multi-core system

The maximal latest finish time among all these tasks is the Worst-Case Response Time (WCRT) of the application

An important metric for schedulability analysis

Our purpose is to minimize WCRT in multi-core systems with shared cache via task mapping

[Figure: a multitasking application shown as a task graph of tasks t1 to t5]

Previous Task Mapping Approaches

They usually focus on balancing the workload

They are agnostic to the shared cache behavior

– State-of-the-art work on shared caches usually fixes the task mapping before the shared L2 cache analysis

Motivating Example

[Figure: task graph with tasks adpcm, ndes, minver, crc and compress]

Configuration: 2-core, private L1 cache: 256 bytes, shared L2 cache: 2KB

w/o L2 cache modeling: assume a cache miss (hit) for each L2 access in the worst (best) case

[Chart: WCRT (million cycles, axis 0 to 12) for task mappings 1 to 16, comparing w/o L2 cache modeling against with L2 cache modeling]

It is imperative to design a shared cache aware task mapping solution!

Our Architecture

A private L1 cache for each core

All cores share an L2 cache

Tasks execute in a non-preemptive fashion

[Figure: cores 1 to n, each with a CPU and a private L1 cache, sharing one L2 cache in front of main memory]

Private Cache Systems

Usually only the intra-task analysis is required to estimate the WCET (Worst-Case Execution Time)

No inter-core cache conflicts

[Figure: CPU with a private cache in front of off-chip main memory]

Shared Cache Multi-core Systems

Inter-core cache conflicts

– Lead to extra L2 cache misses

This makes the timing analysis complex

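[Figure: one set of the shared L2 cache under the LRU replacement policy, blocks ordered young to old; Task T and Task T' run on different cores

Without Task T': the set holds m1, m2; Access m2 by Task T is a Hit

With Task T': Access m1' by Task T' changes the set to m1', m1; Access m2 by Task T is a Miss]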
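The hit-turned-miss scenario above can be replayed with a small LRU model. This is a minimal sketch, assuming a single 2-way set and made-up access traces; the class and function names are illustrative and not part of the original work.

```python
class LRUSet:
    """One cache set with LRU replacement; index 0 holds the youngest block."""

    def __init__(self, associativity):
        self.associativity = associativity
        self.blocks = []

    def access(self, block):
        """Access a block; return True on a hit and False on a miss."""
        hit = block in self.blocks
        if hit:
            self.blocks.remove(block)
        elif len(self.blocks) == self.associativity:
            self.blocks.pop()  # evict the oldest block
        self.blocks.insert(0, block)  # accessed block becomes the youngest
        return hit


def run(trace, associativity=2):
    """Replay a trace on a fresh set and return the hit/miss outcome per access."""
    cache = LRUSet(associativity)
    return [cache.access(block) for block in trace]


# Task T alone: m2 is still cached when it is accessed again.
alone = run(["m2", "m1", "m2"])
# Task T' on another core touches m1' in between and evicts m2.
shared = run(["m2", "m1", "m1'", "m2"])
print(alone[-1], shared[-1])
```

The only difference between the two traces is the single access from the other core, which is enough to turn the final access to m2 from a hit into a miss.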

Task Interference

Task lifetime : [Earliest Ready, Latest Finish]

Interfering tasks : Two tasks mapped to different cores whose lifetimes overlap

Two tasks with dependence between them can never interfere with each other

[Figure: tasks t1 to t4 placed on Core 1 and Core 2 along a time axis]

Interfering tasks: t2 and t3, t2 and t4
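The interference rule (different cores, overlapping lifetimes, no dependence) can be written down directly. The sketch below is illustrative only: the lifetimes, the core assignment and the dependence edges are assumed values chosen so that the result matches the pairs named on the slide, and the strict overlap test is also an assumption.

```python
from itertools import combinations


def reachable(deps, src, dst):
    """True if dst can be reached from src along dependence edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(deps.get(node, []))
    return False


def interfering_pairs(lifetime, core, deps):
    """Return task pairs on different cores with overlapping lifetimes and no dependence."""
    pairs = []
    for a, b in combinations(sorted(lifetime), 2):
        if core[a] == core[b]:
            continue  # same core: never interfere
        if reachable(deps, a, b) or reachable(deps, b, a):
            continue  # dependent tasks never run concurrently
        (start_a, end_a), (start_b, end_b) = lifetime[a], lifetime[b]
        if start_a < end_b and start_b < end_a:
            pairs.append((a, b))
    return pairs


# Hypothetical data: [earliest ready, latest finish] per task.
lifetime = {"t1": (0, 10), "t2": (10, 40), "t3": (10, 25), "t4": (25, 45)}
core = {"t1": 1, "t2": 2, "t3": 1, "t4": 1}
deps = {"t1": ["t2", "t3"], "t3": ["t4"]}
pairs = interfering_pairs(lifetime, core, deps)
print(pairs)
```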

Current Efforts on Shared Cache

They focus on modeling cache and inter-core cache conflicts (Yan and Zhang RTAS 2008, Hardy and Puaut RTSS 2008 and Liang et al. RTS 2012)

Task mapping is fixed (known) before analysis, and agnostic to the shared cache conflicts

Impact of Task Mapping

Task interference – Two tasks can interfere with each other only when they are mapped to different cores

Workload balance among cores – Workload balance influences the total execution time of a multitasking application, and thus impacts its WCRT

[Figure: chain of influence: Task interference -> Inter-core cache conflict -> Execution time of task -> WCRT]

Our Aims

We target the task mapping problem in multi-core systems with shared cache, in order to optimize the WCRT

We not only consider workload balance, but also minimize inter-core cache conflicts

Challenges

Exhaustive enumeration of all mappings becomes infeasible as the number of cores and the number of tasks increase
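To give a feel for the growth, the count of candidate mappings can be computed directly. This sketch assumes every task may go to any core and that core permutations are counted as distinct mappings; the function name is illustrative.

```python
def mapping_count(num_tasks, num_cores):
    """Number of ways to assign each task to one of the cores."""
    return num_cores ** num_tasks


# With 4 cores, going from 6 to 14 tasks multiplies the search space by 4**8.
small = mapping_count(6, 4)
large = mapping_count(14, 4)
print(small, large)
```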

Significant complexity due to inter-dependency between task mapping and task execution time

[Figure: inter-dependency among task mapping, task execution time, task interference, cache conflict and workload balancing]

Our Approach

An approach based on ILP (integer linear programming) formulation

[Figure: flow of the approach. Intra-task cache analysis, task mapping modeling, WCET computation modeling, shared cache modeling and WCRT computation modeling feed the ILP problem (objective: minimize WCRT); the ILP solver returns a task mapping; the accurate WCRT analysis framework then produces the final WCRT]

Intra-task Analysis

Must analysis – Captures the memory blocks that are guaranteed to be in the cache

May analysis – Captures the memory blocks that may be in the cache; a block outside the may state is never in the cache

Memory block access classification

– Always hit

– Always miss

– Non-classified
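The three classes follow from the must and may results in a mechanical way. The sketch below assumes the analysis results are available as plain sets of memory blocks at the program point of the access; the set contents and the function name are hypothetical.

```python
def classify(block, must_state, may_state):
    """Classify one memory access from the must and may cache states."""
    if block in must_state:
        return "always hit"  # guaranteed to be cached
    if block not in may_state:
        return "always miss"  # cannot be cached on any path
    return "non-classified"  # cached on some paths only


# Hypothetical analysis results at one program point.
must_state = {"a"}
may_state = {"a", "b"}
labels = {blk: classify(blk, must_state, may_state) for blk in ("a", "b", "c")}
print(labels)
```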

Task Mapping Modeling

Suppose there are N tasks and M cores in the multi-core system. 𝑴𝑨𝑷𝒊𝑘 is a 0-1 decision variable to indicate if task i is mapped to core k. Then for any task i (0
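The sentence above is cut off in the source, so the constraint it introduces is not recoverable here. A common way to state an assignment constraint of this kind, given purely as an assumption and with k ranging over the M cores, is that each task is placed on exactly one core:

```latex
\sum_{k} MAP_i^k = 1, \qquad MAP_i^k \in \{0, 1\}
```

The placement of the indices i and k on MAP is also a guess from the garbled notation.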

Shared Cache Modeling

[Figure: one 4-way set of the shared L2 cache, blocks listed from young to old

Cache state before considering cache conflicts: a, b, m, c

Possible cache states after considering cache conflicts: d, a, b, m (access to m is a Hit) or d, e, a, b (access to m is a Miss)]

{m, a, b, c, d, e} are mapped to the same set s in the shared L2 cache. m, a, b and c are from task i, while d and e are from tasks u and v respectively

$age_m + \sum_{j \in intf(i)} conflict_j^s \geq A$
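The condition on this slide can be checked numerically against the pictured cache states. The sketch below rests on several assumptions: A is the associativity, the age of a block is counted from 0 at the youngest position, each interfering task contributes the number of its blocks mapped to set s, and the inequality marks the case where m is evicted.

```python
def l2_access_is_miss(age_m, conflicts, associativity):
    """True if the block's age plus all conflicting blocks reaches the associativity."""
    return age_m + sum(conflicts.values()) >= associativity


# State a, b, m, c in a 4-way set: m sits at age 2 when counting from 0.
age_m = 2
only_u = l2_access_is_miss(age_m, {"u": 1}, associativity=4)
u_and_v = l2_access_is_miss(age_m, {"u": 1, "v": 1}, associativity=4)
print(only_u, u_and_v)
```

With only task u interfering, m is shifted one position and stays in the set; with both u and v interfering, m is pushed out, matching the two states shown.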


WCET and WCRT Computation

WCET – Obtained from the intra-task analysis and the shared cache modeling

WCRT – The sum of the WCET values of the tasks on the critical path
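The critical-path view of WCRT amounts to a longest-path computation over the task graph. The sketch below is a simplification under stated assumptions: it follows dependence edges only, ignores the serialization of tasks that share a core, and uses a hypothetical five-task graph with made-up WCET values.

```python
from functools import lru_cache


def wcrt(wcet, succs):
    """Longest WCET-weighted path through an acyclic task graph."""

    @lru_cache(maxsize=None)
    def longest_from(task):
        tail = max((longest_from(s) for s in succs.get(task, ())), default=0)
        return wcet[task] + tail

    return max(longest_from(task) for task in wcet)


# Hypothetical task graph and WCET values (arbitrary time units).
wcet = {"t1": 2, "t2": 3, "t3": 5, "t4": 4, "t5": 1}
succs = {"t1": ("t2", "t3"), "t2": ("t4",), "t3": ("t5",)}
result = wcrt(wcet, succs)
print(result)
```

Here the path t1, t2, t4 (2 + 3 + 4) is longer than t1, t3, t5 (2 + 5 + 1), so it determines the result.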

Approximations in ILP

Interfering tasks – Two tasks i and j interfere if they have no dependence and are mapped to different cores: $interfere_{ij} = DC_{ij}$

WCET = Original WCET + Cache conflict penalty

– Intra-task cache analysis gives the original WCET

– Shared cache modeling gives the cache conflict penalty

Why approximations? – They make the problem easy to solve

The accurate WCRT for the mapping produced by the ILP formulation is then estimated by a WCRT analysis framework

Iterative Analysis Framework

Yun Liang, Huping Ding, Tulika Mitra, Abhik Roychoudhury, Yan Li, Vivy Suhendra. Timing Analysis of Concurrent Programs Running on Shared Cache Multi-Cores. Real-Time Systems Journal, 2012

[Figure: iterative analysis flow. For each of Core 1 and Core 2: L1 cache analysis, Filter, L2 cache analysis. The results, together with the initial task interference, enter L2 cache conflict analysis and then WCRT analysis. If the interference changes (yes), the modified task interference is fed back; otherwise (no), the estimated WCRT is reported]
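The loop in this flow is a fixed-point iteration over the set of interfering task pairs. The sketch below shows only that control structure; `analyze` stands in for the whole cache conflict and WCRT analysis, and `toy_analyze` with its numbers is a made-up placeholder, not the framework from the cited paper.

```python
def iterate_wcrt(initial_interference, analyze, max_rounds=100):
    """Repeat the analysis until the interference set stops changing."""
    interference = frozenset(initial_interference)
    for _ in range(max_rounds):
        wcrt, updated = analyze(interference)
        updated = frozenset(updated)
        if updated == interference:
            return wcrt, interference  # stable: report the estimate
        interference = updated  # feed the modified interference back
    raise RuntimeError("interference did not stabilize")


def toy_analyze(interference):
    """Placeholder analysis: refined lifetimes rule out the pair (t2, t4)."""
    refined = {pair for pair in interference if pair != ("t2", "t4")}
    return 100 + 10 * len(interference), refined


estimate, final_pairs = iterate_wcrt({("t2", "t3"), ("t2", "t4")}, toy_analyze)
print(estimate, sorted(final_pairs))
```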

Experimental Evaluation

Real-world benchmark: DEBIE benchmark

Synthetic benchmarks: MRTC benchmark suite

Comparison among different approaches

Optimal mapping: enumerate all task mappings, evaluate WCRT with L2 cache modeling, select the best mapping

Our mapping: ILP formulation, solve the ILP problem, task mapping released, iterative analysis framework

Mapping w/o L2 cache modeling: enumerate all task mappings, evaluate WCRT w/o L2 cache modeling, select the best mapping
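The two enumeration-based baselines share the same skeleton and differ only in how a mapping is evaluated. The sketch below is a toy version: `toy_wcrt`, the loads and the fixed conflict penalty are hypothetical stand-ins for a real WCRT analysis with L2 cache modeling.

```python
from itertools import product


def best_mapping(tasks, num_cores, evaluate_wcrt):
    """Try every task-to-core assignment and keep the one with the lowest WCRT."""
    best = None
    for cores in product(range(num_cores), repeat=len(tasks)):
        mapping = dict(zip(tasks, cores))
        cost = evaluate_wcrt(mapping)
        if best is None or cost < best[0]:
            best = (cost, mapping)
    return best


# Hypothetical workload: per-task load and one pair that conflicts in the shared cache.
load = {"a": 4, "b": 3, "c": 3}
conflicting = [("a", "b")]


def toy_wcrt(mapping):
    """Heaviest core load plus a penalty for conflicting tasks on different cores."""
    per_core = {}
    for task, core in mapping.items():
        per_core[core] = per_core.get(core, 0) + load[task]
    penalty = sum(5 for x, y in conflicting if mapping[x] != mapping[y])
    return max(per_core.values()) + penalty


cost, mapping = best_mapping(["a", "b", "c"], 2, toy_wcrt)
print(cost, mapping)
```

In this toy setup the best-balanced split (a alone, b and c together) is not selected, because separating a from b triggers the conflict penalty; keeping a and b on one core gives the lower cost.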

DEBIE Benchmark Results

[Chart: WCRT (million cycles, axis 0 to 2000) for DEBIE with L1: 1KB, L2: 16KB, comparing Optimal, Our approach and w/o L2 cache modeling]

Configuration: 4-core

Synthetic Benchmarks Results

[Chart: relative WCRT (axis 0.00 to 2.00) for Taskgraph 1 to Taskgraph 9, comparing Optimal, Our approach and w/o L2 cache modeling]

Configurations: 4-core, L1: 512B, L2: 4KB

Runtime for Synthetic Benchmarks

Task graph   # of tasks   Our approach (seconds)   Optimal (seconds)
1            6            6.59                     3.98
2            7            33.37                    1.42
3            8            8.77                     4.93
4            9            4.22                     15.05
5            10           17.20                    67.54
6            11           9.27                     269.29
7            12           396.33                   1,089.35
8            13           480.62                   3,883.00
9            14           539.31                   22,794.00

Conclusions

Task mapping has a great impact on WCRT

An ILP formulation approach that is aware of the shared cache conflict

Our approach not only considers the workload balance but also minimizes inter-core cache conflict

Significant reduction in WCRT compared to the traditional approach, which is agnostic to shared cache conflicts
