Performance Issues: Algorithm

Peng-Sheng Chen


Algorithm

A good algorithm solves a problem in a fast and efficient manner

A poor algorithm, no matter how well implemented, is never as fast

Algorithm performance can be evaluated by computational complexity, Big-O


Computational Complexity (1)

Algorithm    Complexity          256 elements       1,000 elements  10,000 elements
Bubble Sort  n^2                 65,000 operations  1,000,000       100,000,000
QuickSort    n log n (average)   2,048              9,965           133,000
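The figures in the table can be approximated with a short sketch. The function names `ops_quadratic` and `ops_n_log_n` are hypothetical, and the second one assumes an integer ceiling of log2(n), so it matches the table exactly only when n is a power of two (256) and is slightly high otherwise.

```c
/* Estimated operation count for an O(n^2) algorithm. */
unsigned long long ops_quadratic(unsigned long long n)
{
    return n * n;
}

/* Estimated operation count for an O(n log n) algorithm, using the
 * integer ceiling of log2(n) so no floating-point library is needed.
 * Exact when n is a power of two, slightly high otherwise. */
unsigned long long ops_n_log_n(unsigned long long n)
{
    unsigned long long log2_ceil = 0;
    unsigned long long power = 1;
    while (power < n) {
        power *= 2;
        log2_ceil++;
    }
    return n * log2_ceil;
}
```

For 256 elements the two estimates are 65,536 and 2,048 operations, which is the gap the table rounds to 65,000 versus 2,048.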


Computational Complexity (2)

Computational complexity only takes loop iterations into account, not all of the factors affecting an algorithm’s performance

You need to consider additional factors when comparing algorithms, especially when their computational complexities are similar


Choice of Instructions (1)

The instructions needed to implement an algorithm have a big impact on performance

Ex: integer addition => 1 clock

Ex: integer division => 68 clocks

How to evaluate the speed of an instruction?

Instruction latency

Instruction throughput


Choice of Instructions (2)

Instruction latency

The number of clocks required to complete one instruction after the instruction’s inputs are ready and execution begins

Ex: integer multiplication => 9 clocks


Choice of Instructions (3)

Instruction throughput

The number of clocks that the processor is required to wait before starting the execution of an identical instruction

Instruction throughput is always less than or equal to instruction latency

Instruction pipelining causes the difference

Ex: integer multiplication throughput => 4 clocks

A new multiply can begin execution every 4 clocks
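The difference between latency and throughput can be illustrated with a simplified timing model. This is a sketch under stated assumptions, not a description of any specific processor: the instructions are independent, the first one finishes after `latency` clocks, and each later one starts `throughput` clocks after its predecessor. The function names are hypothetical.

```c
/* Simplified timing model (an assumption for illustration): the first
 * instruction finishes after `latency` clocks, and each further
 * independent instruction of the same kind can start `throughput`
 * clocks after the previous one. */
unsigned pipelined_clocks(unsigned count, unsigned latency, unsigned throughput)
{
    if (count == 0)
        return 0;
    return latency + (count - 1) * throughput;
}

/* Without any overlap, each instruction waits for the previous one
 * to finish completely. */
unsigned serial_clocks(unsigned count, unsigned latency)
{
    return count * latency;
}
```

With the multiply figures quoted on these slides (latency 9, throughput 4), three independent multiplies take 17 clocks in this model instead of 27.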


Algorithm Selection

Computational complexity

Take latency and throughput into account

Ex:

Algorithm 1: 10 additions

Algorithm 2: 1 divide

Algorithm 1 will be faster because a divide takes about 60 times longer to execute than an addition


Example: Finding the Greatest Common Divisor (GCD)

Basic steps

Factor each number

Find the factors that are common to both numbers

Multiply the common factors together to get the greatest common divisor


Finding the Greatest Common Divisor (GCD)

Ex: two numbers 40, 48

Basic steps

Factor each number

40 = 2 * 2 * 2 * 5

48 = 2 * 2 * 2 * 2 * 3

Find the common factors

2 * 2 * 2

Multiply the common factors to get the greatest common divisor

GCD = 2 * 2 * 2 = 8

This approach takes a long time because each number must be factored
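The factoring steps can be sketched in C as follows. This is one possible rendering using trial division, not necessarily the implementation the slides had in mind; the name `gcd_by_factoring` is hypothetical, and both inputs are assumed to be greater than 0.

```c
/* GCD by trial-division factoring: strip each candidate factor p
 * from both numbers, multiplying it into the result while it divides
 * both. Assumes a > 0 and b > 0. */
int gcd_by_factoring(int a, int b)
{
    int gcd = 1;
    for (int p = 2; p <= a && p <= b; p++) {
        /* p is a common factor as long as it divides both numbers */
        while (a % p == 0 && b % p == 0) {
            gcd *= p;
            a /= p;
            b /= p;
        }
        /* discard any remaining copies of p that are not shared */
        while (a % p == 0)
            a /= p;
        while (b % p == 0)
            b /= p;
    }
    return gcd;
}
```

For 40 and 48 the loop collects the factor 2 three times and returns 8. Every candidate costs at least two divisions, which is why this approach is slow.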


Euclid’s Algorithm for GCD (1)

Basic steps

Step 1: Larger number = larger number - smaller number

Step 2: If the numbers are the same, that number is the greatest common divisor; otherwise go to step 1



Euclid’s Algorithm for GCD (2)

Two numbers 48, 40

Basic steps

48, 40 → 48 - 40 = 8, leaving 8, 40

8 != 40, so repeat step 1

8, 40 → 40 - 8 = 32, leaving 8, 32

8, 32 → 32 - 8 = 24, leaving 8, 24

8, 24 → 24 - 8 = 16, leaving 8, 16

8, 16 → 16 - 8 = 8, leaving 8, 8

8 = 8, so 8 is the GCD



Euclid’s Algorithm for GCD (3)

int find_gcf(int a, int b)
{
    /* assumes both a and b are greater than 0 */
    while (1) {
        if (a > b)
            a = a - b;      /* reduce the larger value */
        else if (a < b)
            b = b - a;
        else                /* they are equal */
            return a;
    }
}

For a = 48 and b = 40: 5 compares, 14 branches, and 5 subtracts => a total of 24 instructions


Euclid’s Algorithm for GCD (4)

Variation of Euclid’s algorithm

int find_gcd(int a, int b)
{
    /* assumes both a and b are greater than 0 */
    while (1) {
        a = a % b;
        if (a == 0) return b;
        if (a == 1) return 1;
        b = b % a;
        if (b == 0) return a;
        if (b == 1) return 1;
    }
}

For a = 48 and b = 40: 2 divides, 3 compares, 3 branches, 4 moves, and 2 cdq instructions => a total of 14 instructions

cdq => convert double-word to quad-word


A Variation of Euclid’s Algorithm for GCD

Two numbers a = 48, b = 40

Basic steps

a = 48 % 40 = 8

Test a equal to 0? No

Test a equal to 1? No

b = 40 % 8 = 0

Test b equal to 0? Yes, so return a = 8

Is the modulo version faster than the subtraction version?


A Rough Comparison

Repetitive subtraction version

Instruction    Quantity  Latency  Total clocks
Subtractions   5         1        5
Compares       5         1        5
Branches       14        1        14
Other          0         1        0
Totals         24                 24

Modulo version

Instruction    Quantity  Latency  Total clocks
Modulo         2         68       136
Compares       3         1        3
Branches       3         1        3
Other          6         1        6
Totals         14                 148
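The totals in this comparison are simply quantity times latency, summed per version. A minimal sketch of that arithmetic follows; the struct `cost_row`, the function `total_clocks`, and the array names are hypothetical, and the model deliberately ignores pipelining, branch misprediction, and cache effects, as the rough comparison does.

```c
#include <stddef.h>

/* One row of a rough cost table: how many times an instruction
 * executes and how many clocks each execution is assumed to take. */
struct cost_row {
    const char *name;
    unsigned quantity;
    unsigned latency;
};

/* Sum quantity * latency over all rows. */
unsigned total_clocks(const struct cost_row *rows, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += rows[i].quantity * rows[i].latency;
    return total;
}

/* Rows copied from the two tables for a = 48, b = 40. */
static const struct cost_row subtraction_rows[] = {
    {"subtract", 5, 1}, {"compare", 5, 1}, {"branch", 14, 1}, {"other", 0, 1},
};
static const struct cost_row modulo_rows[] = {
    {"modulo", 2, 68}, {"compare", 3, 1}, {"branch", 3, 1}, {"other", 6, 1},
};
```

Calling `total_clocks` on the two arrays gives 24 and 148 clocks, so for this input the version with fewer instructions is the slower one.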

Which is Better?

From the previous comparison, the repetitive subtraction version looks better

Counterexample: a = 1000, b = 1

Repetitive subtraction => 999 iterations (~5000 cycles)

Modulo => 1 iteration (~74 cycles)
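The iteration counts can be checked by instrumenting both loops. The sketch below assumes a > 0 and b > 0; the names `count_subtractions` and `count_modulos` are hypothetical. It counts only the expensive operation in each version, not every instruction.

```c
/* Number of subtractions the repetitive subtraction version performs.
 * Assumes a > 0 and b > 0. */
unsigned count_subtractions(int a, int b)
{
    unsigned steps = 0;
    while (a != b) {
        if (a > b)
            a -= b;
        else
            b -= a;
        steps++;
    }
    return steps;
}

/* Number of modulo operations the modulo version performs.
 * Assumes a > 0 and b > 0. */
unsigned count_modulos(int a, int b)
{
    unsigned steps = 0;
    while (1) {
        a %= b;
        steps++;
        if (a == 0 || a == 1)
            return steps;
        b %= a;
        steps++;
        if (b == 0 || b == 1)
            return steps;
    }
}
```

For a = 1000, b = 1 the counts are 999 subtractions against a single modulo; for a = 48, b = 40 they are 5 and 2.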


Blended Version

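int find_gcd(int a, int b)
{
    /* assumes both a and b are greater than 0 */
    while (1) {
        if (a > (b * 4)) {
            a = a % b;          /* a is much larger: use one modulo */
            /* the exit tests in each branch are missing from the
               extracted slide; they are assumed here, mirroring the
               modulo version */
            if (a == 0) return b;
            if (a == 1) return 1;
        } else if (a >= b) {
            a = a - b;          /* values are close: subtract */
            if (a == 0) return b;
            if (a == 1) return 1;
        }
        if (b > (a * 4)) {
            b = b % a;
            if (b == 0) return a;
            if (b == 1) return 1;
        } else if (b >= a) {
            b = b - a;
            if (b == 0) return a;
            if (b == 1) return 1;
        }
    }
}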

Evaluation

Run for all combinations of values a and b in [1 .. 9999]

3.6-GHz Pentium 4

Version                          Time
Repetitive subtraction version   14.56 sec.
Modulo version                   18.55 sec.
Blended version                  12.14 sec.
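A measurement of this kind could be driven by a harness like the sketch below. It is not the harness used for the reported timings. The type `gcd_fn`, the stand-in `gcd_subtract`, and `run_all_pairs` are hypothetical names, and the use of `clock()` for CPU time is an assumption.

```c
#include <time.h>

/* Any GCD implementation with this signature can be measured. */
typedef int (*gcd_fn)(int, int);

/* Stand-in implementation so the sketch compiles on its own;
 * assumes a > 0 and b > 0. */
int gcd_subtract(int a, int b)
{
    while (a != b) {
        if (a > b)
            a -= b;
        else
            b -= a;
    }
    return a;
}

/* Call f on every pair (a, b) with 1 <= a, b <= limit.
 * Returns the sum of the results so the work cannot be optimized
 * away, and stores the elapsed CPU time in *seconds if the pointer
 * is not null. */
unsigned long long run_all_pairs(gcd_fn f, int limit, double *seconds)
{
    unsigned long long sum = 0;
    clock_t start = clock();
    for (int a = 1; a <= limit; a++)
        for (int b = 1; b <= limit; b++)
            sum += (unsigned long long)f(a, b);
    if (seconds)
        *seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    return sum;
}
```

Passing `limit = 9999` reproduces the range used in the evaluation. Comparing the returned sums across implementations is a cheap check that they agree before their times are compared.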


Data Dependencies and Instruction Parallelism (1)

Data dependencies affect the processor’s ability to execute instructions simultaneously

Ex:

a = u * v

b = w * x

c = y * z

[Figure: execution timeline starting at clock 0; labels: 5-clock throughput, 15-clock latency]


Data Dependencies and Instruction Parallelism (2)

Ex: a = w * x * y * z

w * x

y * z

wx * yz

[Figure: execution timeline starting at clock 0; labels: 5-clock throughput, 15-clock latency, data dependency]

Instruction parallelism, limited by data dependencies and latencies, is a key limiting factor for algorithm performance
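The two ways of grouping the product can be written out explicitly. This is an illustrative sketch with hypothetical function names; an optimizing compiler may regroup integer multiplies on its own, so the source form is not guaranteed to determine the instruction order.

```c
/* Serial chain: each multiply needs the previous result,
 * so three multiplies execute one after another. */
long long product_chain(long long w, long long x, long long y, long long z)
{
    long long t = w * x;
    t = t * y;
    t = t * z;
    return t;
}

/* Tree form: w * x and y * z do not depend on each other, so a
 * pipelined processor may overlap them; only the final multiply
 * has to wait for both. */
long long product_tree(long long w, long long x, long long y, long long z)
{
    long long wx = w * x;
    long long yz = y * z;
    return wx * yz;
}
```

Both functions return the same value (barring overflow); the difference is the length of the dependency chain, three multiplies against two.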


Example

Single accumulator: every addition depends on the previous one

a = 0;
for (x = 0; x < 1000; x++)
    a += buffer[x];

Four accumulators: four independent dependency chains

a = b = c = d = 0;
for (x = 0; x < 250; x++) {
    a += buffer[x];
    b += buffer[x + 250];
    c += buffer[x + 500];
    d += buffer[x + 750];
}
a = a + b + c + d;

More arithmetic operations can occur on each clock due to fewer data dependencies

Memory Requirements

Fetching from main memory is among the slowest operations a processor performs

An algorithm that uses less memory is usually faster

Any benefit that an algorithm gains by using extra memory might be lost to the cost of the memory accesses involved

Treat memory accesses as high-latency instructions


Generality of Algorithms

Readily available algorithms often solve general problems

The specific problem at hand can often be solved more efficiently than the more general problem

Example: determine whether a string is empty

strlen function => O(n)

Test whether the first element of the string is '\0' => O(1)
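The two approaches to the empty-string test look like this in C. The function names are hypothetical, and both assume a valid NUL-terminated string. An optimizing compiler may reduce the first form to the second, so the sketch shows the principle rather than a guaranteed speedup.

```c
#include <stdbool.h>
#include <string.h>

/* General approach: strlen scans to the terminator, O(n). */
bool is_empty_strlen(const char *s)
{
    return strlen(s) == 0;
}

/* Specific approach: only the first character matters, O(1). */
bool is_empty_first_char(const char *s)
{
    return s[0] == '\0';
}
```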


Detecting Algorithm Issues

Use the call graph feature in the VTune analyzer

Optimizing the complete algorithm instead of individual functions results in more optimization opportunities and higher performance

Check CPI (clocks per instruction) by sampling

Clockticks event

Instructions Retired event
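CPI is the ratio of the two sampled event counts. A minimal sketch of the calculation follows; the function name is hypothetical, the counts are assumed to have been read from a profiler's report, and no profiler API is used.

```c
/* Clocks per instruction from two sampled event counts.
 * Returns 0.0 when no instructions were retired, an arbitrary
 * convention chosen here to avoid dividing by zero. */
double cycles_per_instruction(unsigned long long clockticks,
                              unsigned long long instructions_retired)
{
    if (instructions_retired == 0)
        return 0.0;
    return (double)clockticks / (double)instructions_retired;
}
```

A high CPI for a hot function suggests that its instructions are slow or stalled (long latencies, data dependencies, memory accesses) rather than simply numerous.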


Key Points (1)

Selecting the right algorithm is absolutely critical to great performance

Computational complexity

Instruction selections

Memory accesses

Avoiding processor issues

Algorithm performance issues

Instruction latency, instruction throughput, data dependencies, memory accesses

Key Points (2)

Keep data dependencies low enough that the processor can execute at least four operations at the same time

Choose algorithms that allow much of the computation to be done in parallel or with vector instructions

Tailor algorithms to the problem to eliminate inefficiency

Use VTune’s call graph analysis to detect an algorithm’s hotspots


Backup

Approximate Instruction Performance on Pentium 4

Instruction                                             Latency (clocks)  Throughput (clocks)
Addition, subtraction, increment, decrement, logic, …   0.5               0.5
Push, pop, rotate, shift, SIMD memory moves, …          1                 1
128-bit SIMD integer operations, …                      2                 2
Integer multiplication                                  15                4
Integer division, …                                     23                23
SIMD single-precision floating-point divide             32                32
Double-precision (64-bit) floating-point division       38                38

Memory operations may take much longer depending upon the state of the cache
