
Shouqing Hao

Institute of Computing Technology, Chinese Academy of Sciences

[email protected]

Processes Scheduling on Heterogeneous Multi-core Architecture with Hardware Support

Contents

Introduction

Hardware support for LLC-miss latency

LA-ACMP scheduling algorithm

Evaluation and analysis

Introduction

Heter-CMP: Heterogeneous Chip Multi-Processor

−Composed of some big cores and some small cores

Big cores: large area, high power, high performance

• Suited to CPU-bound programs, serial programs, ……

Small cores: small area, low power, low performance

• Suited to memory-bound programs, parallel programs, ……

−Advantage

Make good use of chip resources

Reduce power and performance waste

−Challenge

Identify applications’ behaviors during execution

Schedule the proper programs to the proper cores

Hardware Support (1)

Identify programs’ behaviors

−Last level cache (LLC) miss latency

An LLC miss causes a memory access

• Memory accesses induce high latency

• They reduce a program’s execution efficiency

• The program cannot make full use of the core’s performance

Schedule rules

• Programs with high LLC miss latency should be scheduled to small cores

• Programs with low LLC miss latency should be scheduled to big cores


Identify programs’ behaviors

−Last level cache (LLC) miss latency

Mechanism

• LLC miss delay is the period between a miss request and its miss response

– Two cases: un-overlapped and overlapped

• Record LLC miss latency for each core, with hardware support

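[Figure: timing of memory-access requests and responses. Un-overlapped misses: each request is answered before the next is issued, so delay = t1 + t2 + ... Overlapped misses: request2 is issued before the first response arrives, so intervals t1 and t2 overlap and are shown as a single delay.]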
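The two cases in the figure can be sketched in software as follows. This is only an illustrative model, not the hardware mechanism itself: it assumes that overlapped misses are counted once (the union of the request-to-response intervals), which is one reading of the overlapped timing diagram. The function name `llc_miss_delay` and the sample timestamps are hypothetical.

```python
def llc_miss_delay(intervals):
    """Total time during which at least one LLC miss is outstanding.

    intervals: list of (request_time, response_time) pairs for one core.
    Assumption: overlapping misses are counted once (union of intervals).
    """
    total = 0
    window_start = window_end = None
    for start, end in sorted(intervals):
        if window_end is None or start > window_end:
            # No overlap with the open window: close it and open a new one.
            if window_end is not None:
                total += window_end - window_start
            window_start, window_end = start, end
        else:
            # Overlapping miss: extend the open window.
            window_end = max(window_end, end)
    if window_end is not None:
        total += window_end - window_start
    return total


# Un-overlapped: delay = t1 + t2
unoverlapped = llc_miss_delay([(0, 100), (150, 230)])
# Overlapped: the second request is issued before the first response
overlapped = llc_miss_delay([(0, 100), (40, 130)])
```

With these sample timestamps the un-overlapped case sums both intervals, while the overlapped case yields one merged interval from the first request to the last response.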


−Implemented on Godson-3A

Record LLC miss requests and responses for each core, with hardware support

[Figure: L1 CACHE sends L1 miss requests to L2 CACHE and receives L1 miss responses; L2 CACHE sends mem access requests to the DDR controller and receives mem access responses.]

Request counters: LLC_miss_request_times_core0, LLC_miss_request_times_core1, LLC_miss_request_times_core2, LLC_miss_request_times_core3

Response counters: LLC_miss_response_times_core0, LLC_miss_response_times_core1, LLC_miss_response_times_core2, LLC_miss_response_times_core3


[Figure: per-core LLC miss counter logic. When mem_req_valid is set, request_id (0 to 3) selects which of the counters L2_miss_req_0 ... L2_miss_req_3 is incremented. When mem_res_valid is set, response_id (0 to 3) selects which of the counters L2_miss_res_0 ... L2_miss_res_3 is incremented. For each core i, when the request and response counters are equal, L2_miss_ok_i <= 1.]
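A behavioral sketch of this counter logic, written in Python rather than hardware description language. The signal names follow the figure (mem_req_valid, request_id, mem_res_valid, response_id); the class `LlcMissCounters` and its `clock` method are hypothetical names used only for illustration, and the one-call-per-cycle model is an assumption.

```python
class LlcMissCounters:
    """Software model of per-core LLC miss request/response counters."""

    def __init__(self, num_cores=4):
        self.req = [0] * num_cores  # models L2_miss_req_0 .. L2_miss_req_3
        self.res = [0] * num_cores  # models L2_miss_res_0 .. L2_miss_res_3

    def clock(self, mem_req_valid=False, request_id=0,
              mem_res_valid=False, response_id=0):
        """Model one cycle: bump the counter selected by each valid id."""
        if mem_req_valid:
            self.req[request_id] += 1
        if mem_res_valid:
            self.res[response_id] += 1

    def miss_ok(self, core):
        """True when every request from this core has been answered."""
        return self.req[core] == self.res[core]


counters = LlcMissCounters()
counters.clock(mem_req_valid=True, request_id=1)   # core 1 misses in LLC
pending = not counters.miss_ok(1)                  # a miss is outstanding
counters.clock(mem_res_valid=True, response_id=1)  # memory responds
done = counters.miss_ok(1)                         # counters equal again
```

While a core's two counters differ, at least one of its misses is still outstanding; that is the period a latency counter would accumulate.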

LA-ACMP Scheduling Algorithm (1)

LA-ACMP: Latency-Aware Asymmetric CMP

−Identify heterogeneity of cores

Based on Linux kernel 2.6.18

Calculate the BogoMIPS value of each core to evaluate each core’s performance

−Workload assignment balance

Using the Scaled Load method

• L = N/P: each core’s scaled load

– N: number of workloads in the queue

– P: processor’s performance

• If Lmax − Lmin <= 1, the workload assignment is balanced
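A minimal sketch of the Scaled Load check above. The performance values are hypothetical normalized numbers (a fast core rated 2.0 and three slow cores rated 1.0), not measured BogoMIPS values, and the function names are illustrative.

```python
def scaled_load(num_tasks, performance):
    """L = N / P for one core."""
    return num_tasks / performance


def is_balanced(queue_lengths, performances):
    """Balanced when Lmax - Lmin <= 1 across all cores."""
    loads = [scaled_load(n, p) for n, p in zip(queue_lengths, performances)]
    return max(loads) - min(loads) <= 1


# Hypothetical relative performance: one fast core, three slow cores.
performances = [2.0, 1.0, 1.0, 1.0]

balanced = is_balanced([4, 2, 2, 2], performances)    # loads 2, 2, 2, 2
imbalanced = is_balanced([6, 1, 1, 1], performances)  # loads 3, 1, 1, 1
```

Because the load is scaled by performance, the fast core is expected to hold more tasks than a slow core before the assignment counts as imbalanced.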


−LLC-delay buffer

Append an LLC-delay buffer to each run-queue

Save each task’s LLC miss latency

[Figure (a): thread0 becomes ready and is placed in a run-queue. Each run-queue slot has a matching LLC-delay buffer entry: 0 for thread0, x for idle slots.]


−Update LLC-delay buffer

When a thread starts running, clear its LLC-delay value

When the thread exhausts its time slice, save its LLC-delay value

When a thread migrates from queue-A to queue-B, migrate its LLC-delay value as well

[Figure: LLC-delay buffer updates. On one processor, the entry for thread0 is cleared when it starts executing and set from 0 to delay0 after it executes. On migration, thread0 leaves Run-queueA (slot thread0->idle, buffer delay0->0, thread1 keeps delay1) and enters Run-queueB (slot idle->thread0, buffer 0->delay0).]
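The three update rules can be sketched as a small data structure. This is an illustrative user-space model under stated assumptions, not the kernel code: the class `RunQueue`, its method names, and the delay value 1200 are all hypothetical.

```python
class RunQueue:
    """Run-queue paired with an LLC-delay buffer (one entry per thread)."""

    def __init__(self):
        self.threads = []    # queued thread names
        self.llc_delay = {}  # LLC-delay buffer: thread -> saved delay

    def enqueue(self, thread, delay=0):
        self.threads.append(thread)
        self.llc_delay[thread] = delay

    def start_running(self, thread):
        # Rule 1: clear the thread's LLC-delay value when it starts running.
        self.llc_delay[thread] = 0

    def time_slice_expired(self, thread, measured_delay):
        # Rule 2: save the delay measured during the time slice.
        self.llc_delay[thread] = measured_delay

    def migrate_to(self, thread, other):
        # Rule 3: the LLC-delay value travels with the thread.
        self.threads.remove(thread)
        other.enqueue(thread, self.llc_delay.pop(thread))


queue_a, queue_b = RunQueue(), RunQueue()
queue_a.enqueue("thread0")
queue_a.start_running("thread0")
queue_a.time_slice_expired("thread0", 1200)  # hypothetical measured delay
queue_a.migrate_to("thread0", queue_b)
```

After the migration, queue_b holds thread0 together with its saved delay, and queue_a no longer has an entry for it.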


−LA-ACMP algorithm

Runs as part of the load-balance check

Must not destroy balance

[Flowchart; node labels: Time slice over; Update buffer; Load imbalance; busiest-idlest-heter; proper thread exists; Thread migration; do balance; memory-bound thread on fast-core; processor-bound thread on slower-core; Exchange pairs of threads; do nothing. Decision branches are marked Y / N.]
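One possible reading of the exchange step in the flowchart, sketched under explicit assumptions. The slides do not state how a thread is classified as memory-bound, so the `threshold` parameter is hypothetical, as are the function names and the choice to pick the most extreme pair. Swapping one thread for one thread leaves both queue lengths unchanged, which is how this sketch keeps the load balance intact.

```python
def pick_exchange(fast_queue, slow_queue, threshold):
    """Pick (memory-bound thread on fast core, CPU-bound thread on slow core).

    fast_queue, slow_queue: dict of thread -> saved LLC-delay value.
    threshold: hypothetical cut-off; above it a thread counts as memory-bound.
    Returns None when no suitable pair exists (the "do nothing" branch).
    """
    mem_on_fast = [t for t, d in fast_queue.items() if d > threshold]
    cpu_on_slow = [t for t, d in slow_queue.items() if d <= threshold]
    if not mem_on_fast or not cpu_on_slow:
        return None
    # Assumption: exchange the most memory-bound with the most CPU-bound.
    return (max(mem_on_fast, key=fast_queue.get),
            min(cpu_on_slow, key=slow_queue.get))


def exchange(fast_queue, slow_queue, pair):
    """Swap the pair; each thread carries its LLC-delay value with it."""
    mem_thread, cpu_thread = pair
    slow_queue[mem_thread] = fast_queue.pop(mem_thread)
    fast_queue[cpu_thread] = slow_queue.pop(cpu_thread)


fast = {"a": 900, "b": 50}   # hypothetical saved LLC-delay values
slow = {"c": 30, "d": 800}
pair = pick_exchange(fast, slow, threshold=500)
exchange(fast, slow, pair)
second_pair = pick_exchange(fast, slow, threshold=500)
```

In this example the memory-bound thread "a" moves to the slow core and the processor-bound thread "c" moves to the fast core, after which no further exchange is found.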

Evaluation and analysis (1)

Platform

−Godson-3A-heter

Four cores: one runs at 1 GHz, three run at 500 MHz

Asynchronous FIFOs are used for synchronization

Benchmark

−SPEC CPU2000

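[Figure: Godson-3A-heter block diagram. Labels: cores P0 to P3; X1 Switch with ports m0 to m5 and s0 to s5; X2 Switch with ports s0 to s3; two DMA Controller blocks, each with an HT interface; memory controllers MC0 and MC1; Asynchronous FIFO.]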

Evaluation and analysis (2)

Applications’ execution speedup

−Compared to the original OS

−Using LLC miss rate: 15.4% performance improvement

−Using LLC miss delay: 19.8% performance improvement

−Application groups with higher heterogeneity get a higher performance improvement

The third group has the highest improvement

The second group has the lowest improvement

Thanks ！
