lcu14-410: how to build an energy model for your soc

1

How to build an energy model for your SoC

Linaro Connect LCU14, Burlingame, CA.

Morten Rasmussen

2

Why do you need an energy model?

Most of the Linux kernel is blissfully unaware of SoC power management features: P-states, clock domains, C-states, power domains, ...

Only largely autonomous subsystems are aware of some of these details (cpufreq, cpuidle, …)

The plan is to change that by coordinating task scheduling, frequency scaling, and idle-state selection to improve power management.

Energy saving techniques must be applied under the right circumstances which vary between SoCs.

The kernel must therefore have a better understanding of power(energy)/performance trade-offs for the particular SoC to make the right decisions.

An energy model can provide that information.

As a bonus, the energy model may also be used by tools to produce quick energy estimates based on execution traces.

3

Modelling limitations

Models are never fully accurate, but we only need enough detail to make the right decisions most of the time.

The model will be used by critical code paths in the kernel, so it has to be as simple as possible.

The model only considers cpus; memory and peripherals are not covered.

4

A simplified system view

[Figure: cpu0 and cpu1 with their shared HW, and cpu2 and cpu3 with their shared HW. Two clock sources and the power supply are shown. Legend entries: clock gating (G), power gating (G), power domain.]

5

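Energy consumption simplified

[Figure: power plotted against time, showing busy periods at P-states Px and Py and an idle period in C-state Cz. Legend: Busy, Transition, Idle. Labelled areas: busy energy (two periods), transition energy, idle energy.]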

6

Scheduler Topology Hierarchy

[Figure: sched_domain hierarchy over cpus 0 1 2 3, with energy model tables attached to struct sched_group.]

Disclaimer: This is a simplified view of the sched_domain hierarchy.

Energy model tables:

Per-core C-states

Cluster/package C-states

Cluster/package P-states

7

Energy model data

P-states:

Compute capacity: Performance score normalized to the highest P-state of the fastest cpu in the system (1024). Choose the benchmark carefully; preferably use a suite of benchmarks.

Power: Busy power = energy/second. Normalized to any reference, but must be consistent across all cpus.

C-states:

Power: Idle power = energy/second. Normalized.

Wake-up energy: Energy consumed during the P->C and C->P state transitions. The unit must be consistent with the power numbers.

Note: Power numbers should only include the power consumption associated with the group where the tables are attached, i.e. per-core P-state power should only include power consumed by the core itself; shared HW is accounted for in the table belonging to the level above.
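The capacity normalization described above can be sketched as follows. This is only an illustration: the function name and the benchmark scores are hypothetical, and the slides do not prescribe how the scores are obtained.

```python
def normalize_capacity(scores, reference_score, scale=1024):
    """Scale raw benchmark scores so that reference_score maps to `scale`."""
    return [round(score * scale / reference_score) for score in scores]


# Hypothetical benchmark scores for one cpu type at three P-states.
little_scores = [350.0, 700.0, 1000.0]
# Hypothetical score of the fastest cpu in the system at its highest P-state.
fastest_score = 2000.0

little_capacity = normalize_capacity(little_scores, fastest_score)
print(little_capacity)
```

The same reference score has to be used for every cpu type, otherwise capacities are not comparable across cpus.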

8

Energy model data

[Figure: cpus 0 and 1 with C-state and P-state tables attached at two levels, labelled Cluster and CPU.]

C-states:

power  wu   (state)
0      0    (WFI)
...    ...  ...

C-states:

power  wu   (state)
10     6    (C1)
...    ...  ...

P-states:

capacity  power  (freq)
358       2967   (350)
...       ...    ...
1024      4905   (1000)

P-states:

capacity  power  (freq)
358       187    (350)
...       ...    ...
1024      1024   (1000)

9

Energy model algorithm

for_each_domain(cpu, sd) {
        sg = sched_group_of(cpu)

        energy_before = curr_util(sg) * busy_power(sg)
                        + (1-curr_util(sg)) * idle_power(sg)

        energy_after = new_util(sg) * busy_power(sg)
                       + (1-new_util(sg)) * idle_power(sg)
                       + (1-new_util(sg)) * wakeups * wakeup_energy(sg)

        energy_diff += energy_before - energy_after

        if (energy_before == energy_after)
                break;
}

return energy_diff
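A minimal Python sketch of the walk above, assuming each level of the hierarchy is reduced to a single busy power, idle power and wake-up energy. The Group class, the tuple layout of the levels and the example numbers are hypothetical and only serve to make the pseudocode executable.

```python
from dataclasses import dataclass


@dataclass
class Group:
    """Energy model data for one sched_group level (hypothetical layout)."""
    busy_power: float
    idle_power: float
    wakeup_energy: float


def group_energy(util, group, wakeups=0.0):
    """Busy + idle energy of a group, plus wake-up cost for its idle share."""
    energy = util * group.busy_power + (1.0 - util) * group.idle_power
    energy += (1.0 - util) * wakeups * group.wakeup_energy
    return energy


def energy_diff(levels):
    """levels: (group, curr_util, new_util, wakeups) tuples, lowest level first."""
    diff = 0.0
    for group, curr_util, new_util, wakeups in levels:
        before = group_energy(curr_util, group)
        after = group_energy(new_util, group, wakeups)
        diff += before - after
        if before == after:
            break  # nothing changes at this level, so stop walking upwards
    return diff


# Illustrative numbers only.
core = Group(busy_power=1024.0, idle_power=0.0, wakeup_energy=0.0)
cluster = Group(busy_power=4905.0, idle_power=10.0, wakeup_energy=6.0)

saved = energy_diff([(core, 0.5, 0.25, 0.0), (cluster, 0.5, 0.5, 0.0)])
print(saved)
```

A positive result means the change is estimated to save energy, which matches the sign convention of the pseudocode (energy_before - energy_after).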

10

Backups

11

Platform performance/energy data/model in scheduler or user-space

Energy-Aware Workshop @ Kernel Summit 2014, Chicago

Morten Rasmussen

12

Sub-topics

Techniques for reducing energy consumption vary between platforms:

Race-to-idle

Task packing

P- and C-state constraints (Turbo Mode, package C-states, …)

… but none of them is universally good. Most likely each helps only to a certain extent.

We need to know when to apply each of the techniques for a particular platform.

Proposals:

Tunable heuristics for each technique that can be controlled by somebody else (user-space?), basically passing the problem to others.

Provide an in-kernel performance/energy model that can estimate the impact of scheduling decisions.

13

Backup/More stuff

14

Model Validation: ARM TC2, sysbench

Correlation (Pearson):

A15 = 0.93

A7 = 0.96
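The slides only name the metric (Pearson correlation) and do not show how it was computed. As a reminder of what the coefficient measures, here is a small sketch that correlates model estimates with measurements. The sample values are hypothetical and are not TC2 data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical energy estimates from a model and matching measurements.
estimated = [1.0, 2.0, 3.0, 4.0]
measured = [1.1, 1.9, 3.2, 3.8]
print(pearson(estimated, measured))
```

A value close to 1 means the estimates rise and fall with the measurements, which is what matters for ranking scheduling alternatives, even if the absolute numbers are off.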

15

Model Validation: ARM TC2, periodic


A15 = 0.17

A7 = -0.01

16

Model Validation: ARM TC2, Android audio


A15 = 0.03

A7 = 0.48

17

Model Validation: ARM TC2, Android bbench


A15 = 0.67

A7 = 0.80

18

Old slides

19

Motivation

Energy cost driven task placement (load-balancing)

Focus on the actual goal of the energy-aware scheduling activities:

Saving energy while achieving (near) optimum performance.

The energy benefit of a scheduling decision is clear when it is made, assuming the energy cost estimates are fairly accurate.

Introduce a simple energy model to estimate costs and guide scheduling decisions. Requested by maintainers at the KS workshop.

Gives the right amount of packing and spreading.

May simplify balancing decision logic.

Strong focus on saving energy in load balancing algorithms.

big.LITTLE support comes naturally and almost for free.

This is just one part of the energy efficiency work. There are several related sessions this week.

20

Energy Load Balancing

The idea (a bit simplified): Let the resulting energy consumption guide all balancing decisions:

if (energy_diff(task, src_cpu, dst_cpu) > 0) {
        move_task(task, src_cpu, dst_cpu);
} else {
        /* Try some other task */
}

Ideally, we should get the optimum balance if we try all combinations of tasks and cpus.

In reality it is not that simple. We can't try all combinations, but we can get fairly close for most scenarios.

If the energy model is accurate enough, we get packing and spreading implicitly, and only when it saves energy.

Should work for any system: SMP and big.LITTLE (with a few extensions).
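The balancing idea can be sketched in Python as a single pass over the tasks of a source cpu. Everything here is hypothetical scaffolding: the runqueue layout, balance_pair() and especially toy_energy_diff(), which is a stand-in for a real energy model and not the model from the slides.

```python
def balance_pair(runqueues, src_cpu, dst_cpu, energy_diff):
    """Try every task on src_cpu once; move those the model says save energy."""
    moved = []
    for task in list(runqueues[src_cpu]):
        if energy_diff(task, src_cpu, dst_cpu, runqueues) > 0:
            runqueues[src_cpu].remove(task)
            runqueues[dst_cpu].append(task)
            moved.append(task)
        # otherwise: try some other task
    return moved


def toy_energy_diff(task, src_cpu, dst_cpu, runqueues):
    """Stand-in model: packing pays off until dst would exceed 0.5 utilization."""
    dst_util = sum(runqueues[dst_cpu])
    return 1.0 if dst_util + task <= 0.5 else -1.0


# Tasks are represented by their load; cpu 1 is the source, cpu 0 the target.
runqueues = {0: [0.2], 1: [0.1, 0.3]}
moved = balance_pair(runqueues, src_cpu=1, dst_cpu=0, energy_diff=toy_energy_diff)
print(moved, runqueues)
```

Because the estimate is re-evaluated after every move, the second task is rejected once the first one has been packed onto cpu 0.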

21

Power and Energy

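Goal: Save energy, not power.

[Figure: plot with axes Power and Time and the label Energy.]

$$e_{cpu} = P \cdot t, \quad t = \frac{inst}{cc}$$

$$e_{cpu} = P(cc) \cdot \frac{inst}{cc}$$

Here cc is the compute capacity (~ freq * uarch), and P(cc)/cc = Energy/inst: This is what we try to minimize.

$$e_{cpu} = P(cc) \left( \frac{inst_{task}}{cc} + \frac{inst_{idle}}{cc} \right)$$

$$e_{cpu} = e_{task} + e_{idle}$$

If we have cpuidle support we get:

$$e_{cpu} = P_{busy}(cc) \cdot \frac{inst_{task}}{cc} + P_{idle} \cdot \frac{inst_{idle}}{cc}$$

The term inst_task/cc is ~ utilization.

We have to add an additional leakage energy term to reflect that it is better not to wake cpus unnecessarily.

[Figure labels: Tracked load, Time, Time in runnable state, ~ utilization*, Work.]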

22

Simple Energy Model

cpu_energy = power(cc) * util/cc
             + idle_power * (1-(util/cc))
             + leakage_energy

cluster_energy = c_active_power * c_util
                 + c_idle_power * (1-c_util)

util = Scale invariant cpu utilization (Tracked load).

cc = Current compute capacity (depends on freq and uarch).

power(cc) = Busy power (fully loaded) at current capacity from table.

idle_power = Idle power consumption (~WFI).

leakage_energy = Constant representing the cost of waking the cpu.

c_util = Cluster utilization. Depends on max(util/cc) ratio of its cpus.

c_active_power = Cluster active power.

c_idle_power = Cluster idle power.
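The two formulas translate directly into Python. The sketch below is a literal transcription; the function signatures are hypothetical, and cluster_util() assumes that c_util is exactly the largest util/cc ratio among the cpus, whereas the slide only says it "depends on" that ratio.

```python
def cpu_energy(power_cc, util, cc, idle_power, leakage_energy):
    """Busy share at power(cc), idle share at idle_power, plus leakage."""
    busy_fraction = util / cc
    return (power_cc * busy_fraction
            + idle_power * (1.0 - busy_fraction)
            + leakage_energy)


def cluster_util(cpus):
    """cpus: iterable of (util, cc) pairs. Assumed to be max(util/cc)."""
    return max(util / cc for util, cc in cpus)


def cluster_energy(c_active_power, c_idle_power, c_util):
    """Cluster is active for c_util of the time and idle for the rest."""
    return c_active_power * c_util + c_idle_power * (1.0 - c_util)


# A cpu with util 0.1 at capacity 0.2, where power(0.2) = 0.4.
print(cpu_energy(power_cc=0.4, util=0.1, cc=0.2, idle_power=0.1, leakage_energy=0.1))
# A cluster whose busiest cpu has util 0.3 at capacity 0.4.
print(cluster_energy(2.4, 0.0, cluster_util([(0.3, 0.4), (0.0, 0.4)])))
```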

23

Compute Capacity and Power

Processor-specific table expressing power and compute capacity at each P-state. The sched domain hierarchy is in a good position to hold this type of information.

Example (entirely made up):

Per-cpu tables:

Little                  Big
Capacity  Power         Capacity  Power
0.2       0.4           0.4       1.6
0.4       0.9           0.8       4.4
0.6       1.5           1.2       9.0
0.8       2.2           1.6       15.0
1.0       3.2           2.0       23.0
idle      0.1           idle      0.3
leakage   0.1           leakage   0.5

Equal compute capacity: capacities 0.4 and 0.8 appear in both tables.

Cluster tables:

          Little  Big
active    2.4     6.0
idle      0.0     0.0

24

energy_diff()

Balancing two cpus:

def energy_diff(tload, scpu, dcpu):
    # Estimate the next compute capacity (P-state)
    s_new_cc = find_cpu_cc(scpu, cpu_util(scpu))
    # energy model cost for task on source cpu
    s_task_energy = tload/s_new_cc * cpu_cc_power(scpu, s_new_cc)
    if nr_running(scpu) == 1:
        s_task_energy += cpu_leakage_energy[cpu_type[scpu]]
    # Estimate destination cpu cc after adding the task
    d_new_cc = find_cpu_cc(dcpu, cpu_util(dcpu)+tload)
    # energy model cost for task on destination cpu
    d_task_energy = tload/d_new_cc * cpu_cc_power(dcpu, d_new_cc)
    if nr_running(dcpu) == 0:
        d_task_energy += cpu_leakage_energy[cpu_type[dcpu]]
    return s_task_energy - d_task_energy

Balancing sched domains is slightly more complicated as it involves cluster power as well.
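A self-contained version of the two-cpu case, using the made-up capacity/power tables from the deck. The helper find_cc() and the dict layout of a cpu are assumptions: the slides do not define how the next P-state is chosen, so this sketch picks the lowest P-state whose capacity covers the utilization.

```python
# Made-up (capacity, power) tables from the slides, lowest P-state first.
CC_POWER = {
    "little": [(0.2, 0.4), (0.4, 0.9), (0.6, 1.5), (0.8, 2.2), (1.0, 3.2)],
    "big": [(0.4, 1.6), (0.8, 4.4), (1.2, 9.0), (1.6, 15.0), (2.0, 23.0)],
}
LEAKAGE = {"little": 0.1, "big": 0.5}


def find_cc(cpu_type, util):
    """Return (capacity, power) of the lowest P-state that fits util (assumed policy)."""
    for cc, power in CC_POWER[cpu_type]:
        if util <= cc:
            return cc, power
    return CC_POWER[cpu_type][-1]  # saturate at the highest P-state


def energy_diff(tload, src, dst):
    """src/dst: dicts with 'type', 'util' and 'nr_running' (hypothetical layout)."""
    s_cc, s_power = find_cc(src["type"], src["util"])
    s_energy = tload / s_cc * s_power
    if src["nr_running"] == 1:  # source cpu could go idle after the move
        s_energy += LEAKAGE[src["type"]]
    d_cc, d_power = find_cc(dst["type"], dst["util"] + tload)
    d_energy = tload / d_cc * d_power
    if dst["nr_running"] == 0:  # destination cpu would have to be woken
        d_energy += LEAKAGE[dst["type"]]
    return s_energy - d_energy


cpu0 = {"type": "little", "util": 0.2, "nr_running": 1}
cpu1 = {"type": "little", "util": 0.1, "nr_running": 1}
saving = energy_diff(0.1, src=cpu1, dst=cpu0)
print(round(saving, 3))
```

With these inputs the task costs 0.3 on the source cpu and 0.225 on the destination, so the estimated saving is about 0.075.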

25

Example

Before EA load balance:

cpu      rq          util  cap  cc_power  leak  power
0        {0.2}       0.2   0.2  0.4       0.1   0.5
1        {0.1}       0.1   0.2  0.4       0.1   0.35
2        {}          0.0   0.2  0.4       0.1   0.1
cluster  -           1.0   -    2.4       -     2.4
Total                                           3.35

energy_diff() = 0.075*

After EA load balance:

cpu      rq          util  cap  cc_power  leak  power
0        {0.2, 0.1}  0.3   0.4  0.9       0.1   0.8
1        {}          0.0   0.4  0.9       0.1   0.1
2        {}          0.0   0.4  0.9       0.1   0.1
cluster  -           0.75  -    2.4       -     1.8
Total                                           2.8

Saved: 0.55

* energy_diff() ignores cluster power and other tasks to keep computations cheap and simple. Better accuracy can be added if necessary.
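The totals in this example can be recomputed with a few lines of Python. Two assumptions are inferred from the table numbers rather than stated on the slide: all cpus in the cluster share the lowest P-state that fits the busiest cpu, and the leak term is only charged to cpus that have tasks.

```python
# Made-up Little tables from the deck: (capacity, power) per P-state.
LITTLE = [(0.2, 0.4), (0.4, 0.9), (0.6, 1.5), (0.8, 2.2), (1.0, 3.2)]
IDLE_POWER, LEAKAGE = 0.1, 0.1
CLUSTER_ACTIVE, CLUSTER_IDLE = 2.4, 0.0


def total_energy(runqueues):
    """Sum per-cpu and cluster energy for one Little cluster."""
    utils = [sum(rq) for rq in runqueues]
    busiest = max(utils)
    # Assumed: one shared P-state, the lowest that fits the busiest cpu.
    cc, power = next(((c, p) for c, p in LITTLE if busiest <= c), LITTLE[-1])
    total = 0.0
    for util in utils:
        busy = util / cc
        total += power * busy + IDLE_POWER * (1.0 - busy)
        if util > 0:  # assumed: leak only charged to cpus with tasks
            total += LEAKAGE
    c_util = busiest / cc
    total += CLUSTER_ACTIVE * c_util + CLUSTER_IDLE * (1.0 - c_util)
    return total


before = total_energy([[0.2], [0.1], []])
after = total_energy([[0.2, 0.1], [], []])
print(round(before, 2), round(after, 2), round(before - after, 2))
```

The full-system saving (0.55) is larger than the energy_diff() estimate (0.075) because energy_diff() leaves out cluster power and the other tasks, as the footnote points out.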

26

Is the energy model too simple?

It is essential that the energy model is fast and easy to use for load-balancing. The scheduler is a critical path and already complex enough.

Python model tests

Disclaimer: These numbers have not been validated in any way.

Test configuration: 3+3 big.LITTLE, 1000 random balance scenarios.

Rand/Opt: How much worse the random balance energy (the starting point) is than the best possible balance energy (found by brute force).

EA/Opt: How much worse the energy model based balance energy is than the best possible balance energy.

EA == Opt: Scenarios where EA found best possible balance.

Tasks  Rand/Opt  EA/Opt  EA == Opt
2      7.86%     0.09%   72.60%
3      7.79%     0.15%   64.80%
4      9.39%     0.45%   62.00%
5      10.02%    1.15%   51.10%
6      11.44%    2.23%   38.30%

27

What is next?

Early prototype to validate the idea. The initial focus is on getting energy_diff() working on a simple SMP system. To be posted on LKML very soon.

Open Issues

Exposing power/capacity tables to the kernel. Essential to make the right decisions.

Plumbing: Where do the tables come from? DT?

Next steps

Scale invariance: Requirement for the energy model to work.

Fix cpu_power/compute capacity use in scheduler.

Tooling and benchmarks (covered in another session)

Idle integration (covered in another session)

28

Questions?
