Chapter 7: Performance Analysis Techniques
Outline
1. Real-time performance analysis
2. Applications of Queueing Theory
3. Input / Output Performance
4. Analysis of memory requirements
7.1 REAL-TIME PERFORMANCE ANALYSIS
Theoretical preliminaries
• Complexity classes P, NP, NP-complete, NP-hard
– P: the class of problems that can be solved by an algorithm that runs in polynomial time on a deterministic computing machine.
– NP: the class of problems that can be solved in polynomial time by a nondeterministic machine; no deterministic polynomial-time algorithm is known for them.
• But a candidate solution can be verified as correct or not by a P-class algorithm.
– NP-complete: a problem that belongs to the class NP and to which all other problems in NP are polynomially transformable.
– NP-hard: a problem to which all problems in NP are polynomially transformable, but which has not been shown to belong to the class NP.
Examples
• The Boolean satisfiability problem (N-SAT, with N Boolean variables) is NP-complete.
– But the SAT problem restricted to only two or three Boolean variables is in P, since the fixed-size search space can be checked exhaustively.
– Such problems can arise in requirements consistency checking.
• In general, the NP-complete problems in RTSs tend to be those relating to resource allocation in multitask scheduling situations.
– This implies there is no easy way to find the solutions.
More examples
• The problem of deciding whether it is possible to schedule a set of periodic tasks that use only semaphores to enforce mutual exclusion is NP-hard.
• The multiprocessor scheduling problem with two processors, no resources, arbitrary partial-order relations, and every task having a 1-unit computation time is polynomial.
• The multiprocessor scheduling problem with two processors, no resources, independent tasks, and arbitrary task computation times is NP-complete.
• The multiprocessor scheduling problem with two processors, no resources, arbitrary partial order, and task computation times of either 1 or 2 units of time is NP-complete.
– Partial order: any task can precede itself (reflexivity); if A precedes B, the reverse is not possible (antisymmetry); and if A precedes B and B precedes C, then A precedes C (transitivity).
Arguments related to parallelization
• Amdahl's law
– Statement: For a constant problem size, the incremental speedup approaches zero as the number of processing elements grows.
– Formalism: Let N be the number of equal processors available for parallel processing, and let S (0 ≤ S ≤ 1) be the fraction of program code that is serial in nature (cannot be parallelized). The achievable speedup is

    Speedup(N) = 1 / (S + (1 − S)/N)

– The speedup saturates to a limit value of 1/S as N approaches infinity.
Some discussion
• Amdahl's pessimistic law is cited as an argument against parallel systems and, in particular, against massively parallel processors.
– It was taken as an insurmountable bottleneck that limited the efficiency and application of parallelism to various problems.
• Later research provided new insights into Amdahl's law and its relation to large-scale parallelism.
Flaws of Amdahl's law
• Key assumption of Amdahl's law: "problem size remains constant"
– But the problem size tends to scale with the size of a parallel system.
– Items that scale with the problem size: the parallel or vector part of a program.
– Items that do not grow with the problem size: the inherent time for vector start-up, program loading, serial bottlenecks, and I/O, which make up the serial component.
Gustafson's law
• Definition: If the firmly serial code fragment, S, and the parallelized fragment, (1 − S), are processed by a parallel computer system with N equal processors, the achievable speedup is

    Speedup(N) = S + N(1 − S) = N − (N − 1)S

– Does not saturate as N approaches infinity.
– Provides a more optimistic picture of speedup.
– The current "multi-core era" could be viewed as a partial consequence of Gustafson's law. "A more efficient way to use a parallel computer is to have each processor perform similar work, but on a different section of the data .. where large computations are concerned" (Hillis, 1998).
Gustafson vs. Amdahl
• Gustafson's unbounded speedup compared with Amdahl's saturating speedup when 50% of the code is suitable for parallelization
[Figure: speedup vs. number of processors (2 to 64, S = 0.5) — Amdahl's curve saturates below 2, while Gustafson's grows linearly to about 32.5 at N = 64.]
Execution time estimation from program code
• Analyzing RTSs to see if they meet their critical deadlines is
– rarely possible exactly, due to the NP-completeness of most scheduling problems
– but it is possible to get a handle on the system's behavior through approximate analysis.
• The first step in performing schedulability analysis is to predict, estimate, or measure the execution time of essential code units.
• Methods to determine a task's execution time ei:
– Using a logic analyzer (most accurate; employed in the final stages, during system integration)
– Counting CPU-specific instructions, manually or using automated tools
– Reading the system clock before and after executing the particular program code
Example: instruction-counting application
• A certain program module converts raw sensor pulses into the actual acceleration components, which are later compensated for temperature and other effects.
– The module must also decide whether the aircraft is still on the ground, in which case only a small acceleration reading is allowed for each of the X, Y, and Z components (represented by the symbolic constant PRE_TAKE).
– The C code with assembly instructions is given.
Example 1
• Tracing the worst-case execution path and counting the instructions shows
– 12 integer instructions (7.2 µs) and
– 15 floating-point instructions (75 µs), for a total execution time of 82.2 µs.
– Since this sequence of code runs in a 5 ms cycle, the corresponding time-loading is only 82.2 µs / 5000 µs ≈ 1.6%.
Example 2: estimation on a non-pipelined CPU platform
• All execution paths
– Path 1: instructions 1-4, 9-10, 12
• 7 instructions @ 0.6 µs each → 4.2 µs (BCET)
– Path 2: instructions 1-7, 11-12
– Path 3: instructions 1-8, 12
• 9 instructions @ 0.6 µs each → 5.4 µs (WCET)
Example 2: estimation on a pipelined CPU platform
• Assume a three-stage pipeline
– Fetch (F), decode (D), execute (E)
– Each stage takes 0.6 µs / 3 = 0.2 µs
• See Figures 7.2, 7.3, and 7.4.
– The execution time of all three paths is 2.6 µs.
Some discussions
• RTS designers frequently use special software to estimate instruction execution times and CPU throughput. Users
– can typically input
• CPU type
• memory speeds for different address ranges
• instruction mix
– and can compute total instruction times and throughput.
Example 3: timing accuracy with a 60-kHz system clock
• Suppose
– 2000 repetitions of the program code take 450 ms
– the clock granularity is 1/60 kHz ≈ 16.67 µs.
• Hence the execution time per repetition is 450 ms / 2000 = 225 µs, and the measurement has high accuracy: the granularity error of at most 16.67 µs is spread over 2000 repetitions, i.e., at most about 0.008 µs per repetition.
C code to compute the time of instruction execution
• API functions
– current_clock_time(): a system function that returns the current time
– function_to_be_timed(): the actual code to be timed
• See the timer code.
Analysis of polled-loop systems
• The response time consists of three components:
1. The cumulative hardware delays involved in setting the software flag by some external device
2. The time for the polled loop to test the flag
3. The time needed to process the event associated with the flag
– Assumption: sufficient processing time is available between consecutive events.
[Figure: from excitation to response, the response time is the sum of the flag-setting delay (nanoseconds), the flag-testing delay (microseconds), and the processing delay (milliseconds).]
Analysis of polled-loop systems
• If events overlap each other,
– a new event is initiated while a previous one is still being processed.
• Then the response time becomes worse; the time for the Nth overlapping event is bounded by

    N(tF + tP)

– tF: the time to check the flag
– tP: the time to process the event
– The time for the external device to set the flag is ignored.
• In practice, we place some limit on N
– N is the number of events that are allowed to overlap.
– Overlapping events may not be desirable at all in certain situations.
Review: Coroutines

void task_a(void) {
    for (;;) {
        switch (state_a) {
            case 1: phase_a1(); break;
            case 2: phase_a2(); break;
            case 3: phase_a3(); break;
        }
    }
}

void task_b(void) {
    for (;;) {
        switch (state_b) {
            case 1: phase_b1(); break;
            case 2: phase_b2(); break;
            case 3: phase_b3(); break;
        }
    }
}

Two tasks, task_a and task_b, execute in parallel and in isolation. state_a and state_b are global variables managed by the dispatcher to maintain synchronization and inter-task communication.
[Figure: a central dispatcher invokes the phases in turn: phase_a1(); phase_b1(); phase_a2(); phase_b2(); ...]
Analysis of coroutine systems
• The absence of interrupts in coroutine systems makes the determination of response time easy.
– The time is obtained by tracing the worst-case execution path through all tasks.
– We must first determine the execution time of each phase.

void task_1(void) {
    ...
    task_1a(); return;
    task_1b(); return;
    task_1c(); return;
}

void task_2(void) {
    ...
    task_2a(); return;
    task_2b(); return;
}

Tracing the execution path in a two-task coroutine system: beginning with task_1a(), a central dispatcher calls task_1 and task_2 by turns and repeats the sequence; the switch statement is not shown here.
Review: round-robin scheduling is simple and predictable
• It achieves fair allocation of CPU resources among tasks of the same priority by time multiplexing.
– Each executable task is assigned a fixed time quantum, or time slice, in which to execute.
– A fixed-rate clock is used to initiate an interrupt at a rate corresponding to the time slice.
[Figure: round-robin timeline — beginning with task A, which finishes its slice; B takes over and completes; A resumes and is preempted; C runs its slice; then A runs again.]
Analysis of round-robin systems
• Assumptions and definitions
– n tasks in the ready queue; no new ones arrive after scheduling, and none terminates prematurely
– Let q be the constant time slice for each task
– Possible slack times within a time slice are not utilized
– Let c = max{c1, ..., cn} be the maximum execution time
• Thus the worst-case time T from readiness to completion for any task (an upper bound) is

    T = nc
Example: turnaround time calculation without context-switching overhead
1. Suppose there is only one task with a maximum execution time of 500 ms, and the time quantum is 100 ms; thus T = 1 × 500 ms = 500 ms.
2. Suppose there are five equally important tasks, each with a maximum execution time of 500 ms, and the time quantum is 100 ms; thus T = 5 × 500 ms = 2500 ms.
Non-negligible context-switching overhead
• Let o be the context-switching overhead incurred with each task switch.
– Thus, each task waits no longer than (n − 1)q until its next time slice, plus
– an inherent overhead of n·o time units each time around for context switching.
• Since a task needs about c/q rounds to complete, the worst-case turnaround time becomes

    T = (c/q)[(n − 1)q + n·o] + c = nc(1 + o/q)
Examples
1. Suppose one task with a maximum execution time of 500 ms, a time quantum of 40 ms, and a context-switch time of 1 ms; thus T = 1 × 500 × (1 + 1/40) = 512.5 ms.
2. Suppose six equally important tasks, each with a maximum execution time of 600 ms, a time quantum of 40 ms, and a context-switch cost of 2 ms; thus T = 6 × 600 × (1 + 2/40) = 3780 ms.
Selection of the time quantum q
• In terms of the time quantum, it is desirable that q < c to achieve fair behavior of the round-robin system.
• If q is very large, the round-robin algorithm is in effect the first-come, first-served algorithm, in that each task will execute to completion within the very large time quantum.
Review: Fixed-priority scheduling: the rate-monotonic approach
• Theorem: Given a set of periodic tasks and preemptive priority scheduling, assigning priorities such that the tasks with shorter periods have higher priorities yields an optimal scheduling algorithm.
• Optimality implies: if a fixed-priority schedule that meets all the deadlines exists, the RM algorithm will produce a feasible schedule.
Analysis of fixed-period/priority systems
• For any task τi with an execution time of ei time units, the response time Ri is

    Ri = ei + Ii    (7.7)

where Ii is the maximum possible delay in the execution of τi (caused by higher-priority tasks) during [t, t + Ri).
– At the most critical time instant, when all higher-priority tasks are released together with τi, Ii makes its maximum contribution to Ri.
Analysis of fixed-period systems
• Consider a task τj of higher priority than τi.
– Within the interval [0, Ri), the number of releases of τj will be ⌈Ri/pj⌉, where pj is the execution period of τj.
• Each release of τj contributes ej to the amount of interference that τi will suffer from tasks of higher priority:

    Ii,j = ⌈Ri/pj⌉ ej    (7.8)
A recursive solution to response time
• Every task τj of higher priority interferes with task τi. Hence

    Ii = Σ_{j ∈ HPR(i)} ⌈Ri/pj⌉ ej    (7.9)

where HPR(i) is the set of higher-priority tasks w.r.t. τi.
• Substituting this into Equation 7.7 yields

    Ri = ei + Σ_{j ∈ HPR(i)} ⌈Ri/pj⌉ ej    (7.10)

    Ri(n+1) = ei + Σ_{j ∈ HPR(i)} ⌈Ri(n)/pj⌉ ej    (7.11)
A recursive solution to response time
• Because of the inconvenient ceiling function, it is difficult to solve for Ri directly. A neat recursive solution is the following:

    Ri(n+1) = ei + Σ_{j ∈ HPR(i)} ⌈Ri(n)/pj⌉ ej    (7.11)

– Compute the consecutive values Ri(n) iteratively until the first value m is found such that Ri(m+1) = Ri(m); then Ri = Ri(m).
– If the recursive equation does not have a solution, the value of Ri(n) will continue to grow,
• as in the overloaded case: a task set whose CPU utilization factor is greater than 100%.
Example: computing response times in a rate-monotonic case
• Consider a task set to be scheduled rate-monotonically, as shown below.
• Let us first calculate the CPU utilization factor U to make sure the RTS is not overloaded.
Example: computing response times in a rate-monotonic case
• The highest-priority task has a response time equal to its execution time, so R1 = 3.
• The medium- and lowest-priority tasks have their response times computed iteratively according to Equation 7.11.
Analysis of non-periodic systems
• In practice, an RTS having one or more aperiodic or sporadic cycles can be modeled as a rate-monotonic system:
– each non-periodic task is approximated as having a period equal to its worst-case expected inter-arrival time.
– If this rough approximation leads to unacceptably high utilizations, use some heuristic analysis instead (queueing theory).
Response times for interrupt-driven systems
• The calculation depends on several factors:
– Interrupt latency
– Scheduling/dispatching times
• Negligible when the CPU uses a separate interrupt controller supporting multiple interrupts
• Computable by simple instruction counting when a single interrupt is supported with an interrupt controller
– Context-switch times
• Determining context save/restore times is similar to execution time estimation for any application code
Interrupt latency
• A varying period defined between the instants when
– a device requests an interrupt and
– the first instruction of the associated interrupt service routine executes.
• Worst-case interrupt latency
– Occurs when all possible interrupts in the system are requested simultaneously.
• Main contributors
1. The number of tasks, as the RTOS needs to disable interrupts while it processes lists of blocked or waiting tasks.
• Perform some latency analysis to verify that the OS is not disabling interrupts for an unacceptably long time.
• In hard RTSs, keep the number of tasks as low as possible.
2. The time needed to complete the execution of the particular machine-language (ML) instruction being interrupted.
• Find the WCET of every ML instruction by measurement, simulation, or the manufacturer's datasheet.
• The instruction with the longest execution time makes the largest contribution to interrupt latency if it has just begun executing when the interrupt request arrives.
• Example: in a certain 32-bit MCU,
– all fixed-point instructions take 2 µs
– floating-point instructions take 10 µs
– special instructions, like trigonometric functions, take 50 µs
3. Deliberate disabling of interrupts by real-time software.
• Interrupts are disabled for a number of reasons:
– protection of critical regions
– buffering routines
– context switching
• Therefore, allow interrupt disabling by system software only, not by application software.
Architectural enhancements can render a system unanalyzable for RT performance
• Instruction and data caches
– On a miss, instructions must be fetched from slower main memory,
– and a time-consuming replacement algorithm brings the missing instructions into the cache.
• Instruction pipelines
– Assume that at every possible opportunity, the pipeline needs to be flushed.
• Direct memory access (DMA)
– Assume that cycle stealing is occurring at every chance, inflating instruction fetch times.
• These features improve average computing performance but destroy determinism, and thus make prediction troublesome.
Review: DMA controller
• During a DMA transfer, the ordinary CPU data-transfer process cannot proceed.
– The CPU proceeds only with non-bus-related activities.
– The CPU cannot service any interrupts until the DMA cycle is over.
• Cycle-stealing mode
– No more than a few bus cycles are used at a time for the DMA transfer.
– Thus, a single transfer cycle of a large data block is split into several shorter transfer cycles.
Discussions
• Traditional worst-case analysis leads to impractically pessimistic outcomes.
• Solution: use a probabilistic performance model for caches, pipelines, and DMA.
– It is not necessary to definitely meet all the required deadlines; it is sufficient to have a probabilistic guarantee very close to 100% instead of an absolute guarantee.
– This practical relaxation dramatically reduces the WCET to be considered in schedulability analysis.
• But in hard RTSs, it remains problematic to use these advanced CPU and memory architectures.