Performance Issues: Algorithm
Peng-Sheng Chen
Algorithm
A good algorithm solves a problem in a fast and efficient manner
A poor algorithm, no matter how well implemented, is never as fast as a good one
Algorithm performance can be evaluated by computational complexity, expressed in Big-O notation
Computational Complexity (1)
| Algorithm | Complexity | 256 elements | 1000 elements | 10000 elements |
| --- | --- | --- | --- | --- |
| Bubble Sort | n^2 | 65,000 operations | 1,000,000 | 100,000,000 |
| QuickSort | n log n (average) | 2,048 | 9,965 | 133,000 |
Computational Complexity (2)
Computational complexity only takes loop iterations into account, not all of the factors that affect an algorithm's performance
You need to consider additional factors when comparing algorithms, especially when their computational complexities are similar
Choice of Instructions (1)
The instructions needed to implement an algorithm have a big impact on performance
Ex: integer addition => 1 clock
Ex: integer division => 68 clocks
How to evaluate the speed of an instruction?
Instruction latency
Instruction throughput
Choice of Instructions (2)
Instruction latency
The number of clocks required to complete one
instruction after the instruction’s inputs are ready
and execution begins
Ex: integer multiplication => 9 clocks
Choice of Instructions (3)
Instruction throughput
The number of clocks that the processor is
required to wait before starting the execution of
an identical instruction
Instruction throughput is always less than or
equal to instruction latency
Instruction pipelining is what makes throughput differ from latency
Ex: integer multiplication throughput => 4 clocks
A new multiply can begin execution every 4
clocks
Algorithm Selection
Computational complexity
Take latency and throughput into account
Ex:
Algorithm 1
10 additions
Algorithm 2
1 divide
Algorithm 1 will be faster because a divide takes more than 60 times longer to execute than an addition (68 clocks versus 1 clock)
Example: Finding the Greatest Common Divisor (GCD)
Basic steps
Factor each number
Find the factors that are common between
both numbers
Multiply the common factors together to get
the greatest common divisor
Finding the Greatest Common Divisor (GCD)
Ex: two numbers 40, 48
Basic steps
Factor each number
40 = 2 * 2 * 2 * 5
48 = 2 * 2 * 2 * 2 * 3
Find the common factors
2 * 2 * 2
Multiply the common factors to get the greatest common divisor
GCD = 2 * 2 * 2 = 8
This approach takes a long time because every number has to be factored first
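The factoring steps above can be sketched in C as follows. This is only an illustrative version, not code from the slides: the name `gcd_by_factoring` is hypothetical, and the three steps (factor, find common factors, multiply) are merged into a single trial-division loop that strips each shared prime factor from both numbers as it is found.

```c
/* GCD by trial division (hypothetical helper, not from the slides).
 * Every factor p that divides both numbers is multiplied into the result
 * and divided out of both, once per shared occurrence.
 * Assumes both a and b are greater than 0. */
unsigned gcd_by_factoring(unsigned a, unsigned b)
{
    unsigned gcd = 1;
    unsigned p = 2;

    while (p <= a && p <= b) {
        if (a % p == 0 && b % p == 0) {
            /* p is a common factor: keep it and remove it from both numbers */
            gcd *= p;
            a /= p;
            b /= p;
        } else {
            /* p is not shared (any more): try the next candidate */
            p++;
        }
    }
    return gcd;
}
```

For the slide's example, `gcd_by_factoring(40, 48)` divides out 2 three times and returns 8. Each candidate costs two modulo operations, which is part of why this approach is slow.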
Euclid’s Algorithm for GCD (1)
Basic steps
1. Larger number = larger number – smaller number
2. If the numbers are the same, that number is the greatest common divisor; otherwise go to step 1
Euclid’s Algorithm for GCD (2)
Two numbers 48, 40
Basic steps
48, 40: 48 – 40 = 8, leaving 8, 40
8 != 40, so repeat step 1
8, 40: 40 – 8 = 32, leaving 8, 32
8 != 32, so repeat step 1
8, 32: 32 – 8 = 24, leaving 8, 24
8 != 24, so repeat step 1
8, 24: 24 – 8 = 16, leaving 8, 16
8 != 16, so repeat step 1
8, 16: 16 – 8 = 8, leaving 8, 8
8 = 8, so 8 is the GCD
Euclid’s Algorithm for GCD (3)
int find_gcf(int a, int b)
{
    /* assumes both a and b are greater than 0 */
    while (1) {
        if (a > b)
            a = a - b;
        else if (a < b)
            b = b - a;
        else /* they are equal */
            return a;
    }
}
For a=48, and b=40,
5 compares, 14 branches, and 5 subtracts => a total of 24
instructions
Euclid’s Algorithm for GCD (4)
Variation of Euclid’s algorithm
int find_gcd(int a, int b)
{
    /* assumes both a and b are greater than 0 */
    while (1) {
        a = a % b;
        if (a == 0) return b;
        if (a == 1) return 1;
        b = b % a;
        if (b == 0) return a;
        if (b == 1) return 1;
    }
}
For a=48, and b=40,
2 divides, 3 compares, 3 branches, 4 moves, and 2 cdq instructions => a total of 14 instructions
cdq => convert double-word to quad-word
A Variation of Euclid’s Algorithm for GCD
Two numbers a = 48, b = 40
Basic steps
a = 48 % 40 = 8
Is a equal to 0? No
Is a equal to 1? No
b = 40 % 8 = 0
Is b equal to 0? Yes, so return a = 8
Is the modulo version faster than the subtraction version?
A Rough Comparison
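Repetitive subtraction version

| Instruction | Quantity | Latency | Total clocks |
| --- | --- | --- | --- |
| Subtractions | 5 | 1 | 5 |
| Compares | 5 | 1 | 5 |
| Branches | 14 | 1 | 14 |
| Other | 0 | 1 | 0 |
| Totals | 24 | | 24 |

Modulo version

| Instruction | Quantity | Latency | Total clocks |
| --- | --- | --- | --- |
| Modulo | 2 | 68 | 136 |
| Compares | 3 | 1 | 3 |
| Branches | 3 | 1 | 3 |
| Other | 6 | 1 | 6 |
| Totals | 14 | | 148 |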
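The totals in this comparison are just quantity times latency, summed per instruction type. A minimal sketch of that arithmetic is shown below; the struct and function names are hypothetical, and the quantities and latencies are copied from the slide (a = 48, b = 40). Like the slide, it ignores throughput, caches, and branch misprediction.

```c
#include <stddef.h>

/* One row of the rough comparison: how often an instruction executes
 * and its assumed latency in clocks (illustrative type, not from the slides). */
struct instr_count {
    const char *name;
    int quantity;
    int latency;
};

/* Sum quantity * latency over all rows. */
int total_clocks(const struct instr_count *rows, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += rows[i].quantity * rows[i].latency;
    return total;
}

/* Repetitive subtraction version for a = 48, b = 40. */
int subtraction_version_clocks(void)
{
    static const struct instr_count rows[] = {
        {"subtract", 5, 1}, {"compare", 5, 1}, {"branch", 14, 1}, {"other", 0, 1},
    };
    return total_clocks(rows, sizeof rows / sizeof rows[0]);
}

/* Modulo version for a = 48, b = 40. */
int modulo_version_clocks(void)
{
    static const struct instr_count rows[] = {
        {"modulo", 2, 68}, {"compare", 3, 1}, {"branch", 3, 1}, {"other", 6, 1},
    };
    return total_clocks(rows, sizeof rows / sizeof rows[0]);
}
```

The two functions return 24 and 148, matching the totals above: fewer instructions do not mean fewer clocks when one of them is a 68-clock divide.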
Which is Better?
From the previous comparison, the repetitive subtraction version looks better
But consider a = 1000, b = 1
Repetitive subtraction => 999 iterations (~5000 cycles)
Modulo => 1 iteration (~74 cycles)
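The iteration counts quoted here can be reproduced by instrumenting both loops. The sketch below is an illustration only: `count_subtractions` and `count_modulos` are hypothetical helpers that mirror the control flow of the two versions in the slides but return the number of arithmetic steps instead of the GCD.

```c
/* Number of subtractions performed by the repetitive-subtraction version.
 * Assumes both a and b are greater than 0. */
long count_subtractions(int a, int b)
{
    long steps = 0;
    while (a != b) {
        if (a > b)
            a -= b;
        else
            b -= a;
        steps++;
    }
    return steps;
}

/* Number of modulo operations performed by the modulo version.
 * Assumes both a and b are greater than 0. */
long count_modulos(int a, int b)
{
    long steps = 0;
    while (1) {
        a = a % b;
        steps++;
        if (a == 0 || a == 1)
            return steps;
        b = b % a;
        steps++;
        if (b == 0 || b == 1)
            return steps;
    }
}
```

For a = 48, b = 40 the counts are 5 subtractions against 2 modulo operations; for a = 1000, b = 1 they are 999 against 1, which is where the expensive divide pays for itself.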
Blended Version
int find_gcd(int a, int b)
{
    /* assumes both a and b are greater than 0 */
    while (1) {
        if (a > (b*4)) {
            /* a is much larger than b: one modulo beats many subtractions */
            a = a % b;
            if (a == 0) return b;
            if (a == 1) return 1;
        } else if (a >= b) {
            a = a - b;
            if (a == 0) return b;
            if (a == 1) return 1;
        }
        if (b > (a*4)) {
            /* b is much larger than a: same idea with the roles swapped */
            b = b % a;
            if (b == 0) return a;
            if (b == 1) return 1;
        } else if (b >= a) {
            b = b - a;
            if (b == 0) return a;
            if (b == 1) return 1;
        }
    }
}
Evaluation
Run for all combinations of values a and b in [1 .. 9999]
Pentium 4 at 3.6 GHz

| Repetitive Subtraction Version | Modulo Version | Blended Version |
| --- | --- | --- |
| 14.56 sec. | 18.55 sec. | 12.14 sec. |
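A measurement of this kind could be set up roughly as follows. This is a sketch under assumptions, not the harness used for the slide: the names `gcd_sub`, `gcd_mod`, `gcd_blend`, `gcd_checksum`, and `time_all_pairs` are hypothetical (the three GCD versions are renamed so they can live in one file), and timing uses the standard `clock()` function. The numbers it produces on a modern machine and compiler will not match the Pentium 4 figures above.

```c
#include <time.h>

typedef int (*gcd_fn)(int, int);

/* Repetitive subtraction version; assumes a, b > 0. */
int gcd_sub(int a, int b)
{
    while (a != b) {
        if (a > b) a -= b; else b -= a;
    }
    return a;
}

/* Modulo version; assumes a, b > 0. */
int gcd_mod(int a, int b)
{
    while (1) {
        a %= b;
        if (a == 0) return b;
        if (a == 1) return 1;
        b %= a;
        if (b == 0) return a;
        if (b == 1) return 1;
    }
}

/* Blended version; assumes a, b > 0 and that 4*a and 4*b do not overflow. */
int gcd_blend(int a, int b)
{
    while (1) {
        if (a > b * 4)   a %= b;
        else if (a >= b) a -= b;
        if (a == 0) return b;
        if (a == 1) return 1;
        if (b > a * 4)   b %= a;
        else if (b >= a) b -= a;
        if (b == 0) return a;
        if (b == 1) return 1;
    }
}

/* Sum of f(a, b) over all pairs in [1 .. limit]; equal sums across the
 * three versions are a cheap sanity check that they agree. */
long gcd_checksum(gcd_fn f, int limit)
{
    long sum = 0;
    for (int a = 1; a <= limit; a++)
        for (int b = 1; b <= limit; b++)
            sum += f(a, b);
    return sum;
}

/* CPU seconds spent running f over all pairs in [1 .. limit]. */
double time_all_pairs(gcd_fn f, int limit)
{
    volatile long sink;              /* keeps the loop from being optimized away */
    clock_t start = clock();
    sink = gcd_checksum(f, limit);
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_all_pairs(gcd_sub, 9999)`, `time_all_pairs(gcd_mod, 9999)`, and `time_all_pairs(gcd_blend, 9999)` covers the same range of inputs as the slide; comparing `gcd_checksum` across the three first confirms they compute the same results.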
Data Dependencies and Instruction Parallelism (1)
Data dependencies affect the processor’s ability to execute instructions simultaneously
Ex:
a = u * v
b = w * x
c = y * z
[Figure: pipeline timeline starting at clock 0 for the three multiplies, labeled with a 5-clock throughput and a 15-clock latency]
Data Dependencies and Instruction Parallelism (2)
Ex: a = w * x * y * z
w * x
y * z
wx * yz
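[Figure: pipeline timeline starting at clock 0, labeled with a 5-clock throughput and a 15-clock latency; wx * yz is marked as waiting on a data dependency]
Instruction parallelism, limited by data dependencies and latencies, is a key limiting factor for algorithm performance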
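The cost of the dependency can be worked out with a very simple timing model. The sketch below assumes a single multiplier pipeline, the 15-clock latency and 5-clock throughput used in these two slides, and no other stalls; the function names are hypothetical and the model is far cruder than a real processor.

```c
/* Clock at which the last of n independent, identical instructions finishes
 * when the first one starts at clock 0 and a new one can start every
 * `throughput` clocks. */
int independent_finish(int n, int latency, int throughput)
{
    return (n - 1) * throughput + latency;
}

/* Clock at which a = (w * x) * (y * z) finishes: w*x and y*z are independent,
 * but the final multiply cannot start until both of its inputs are ready. */
int dependent_product_finish(int latency, int throughput)
{
    int wx_done = latency;               /* w*x starts at clock 0 */
    int yz_done = throughput + latency;  /* y*z starts one throughput interval later */
    int start = (wx_done > yz_done) ? wx_done : yz_done;
    return start + latency;              /* wx*yz runs only after both are done */
}
```

Under these assumptions, three independent multiplies finish at clock 25 (`independent_finish(3, 15, 5)`), while the dependent product finishes at clock 35 (`dependent_product_finish(15, 5)`) even though it also performs three multiplies.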
Example
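/* one accumulator: every addition depends on the previous one */
a = 0;
for (x = 0; x < 1000; x++)
    a += buffer[x];

/* four accumulators: the four additions in each iteration are independent */
a = b = c = d = 0;
for (x = 0; x < 250; x++)
{
    a += buffer[x];
    b += buffer[x+250];
    c += buffer[x+500];
    d += buffer[x+750];
}
a = a + b + c + d;
More arithmetic operations can occur on each clock due to fewer data dependencies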
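The two loops can be wrapped as functions to check that they compute the same sum. The sketch below generalizes the slide's fixed size of 1000 to any length; the function names and the handling of leftover elements are assumptions added for illustration. Whether the four-accumulator form is actually faster depends on the processor and on what the compiler already does on its own.

```c
#include <stddef.h>

/* Single accumulator: each addition must wait for the previous one. */
long sum_one_accumulator(const int *buffer, size_t n)
{
    long a = 0;
    for (size_t x = 0; x < n; x++)
        a += buffer[x];
    return a;
}

/* Four accumulators over the four quarters of the buffer: the four
 * additions in each iteration do not depend on one another. */
long sum_four_accumulators(const int *buffer, size_t n)
{
    size_t quarter = n / 4;
    long a = 0, b = 0, c = 0, d = 0;

    for (size_t x = 0; x < quarter; x++) {
        a += buffer[x];
        b += buffer[x + quarter];
        c += buffer[x + 2 * quarter];
        d += buffer[x + 3 * quarter];
    }
    /* elements left over when n is not a multiple of 4 */
    for (size_t x = 4 * quarter; x < n; x++)
        a += buffer[x];

    return a + b + c + d;
}
```

With n = 1000 the second function runs the same 250-iteration loop as the slide.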
Memory Requirements
Fetching data from main memory is among the slowest operations for a processor
An algorithm that uses less memory is usually faster
Any benefit that an algorithm gains by using extra memory might be lost to the cost of the memory accesses involved
Treat memory accesses as high-latency instructions
Generality of Algorithms
Readily available algorithms often solve general problems
The specific problem at hand may be solvable more efficiently than the more general problem
Example: determine whether a string is empty
strlen function => O(n)
Test whether the first element of the string is ‘\0’ => O(1)
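The string example can be written out as two small C functions. The names `is_empty_general` and `is_empty_tailored` are hypothetical; both assume a valid NUL-terminated string. An optimizing compiler may reduce the first form to the second, so treat this as an illustration of the idea rather than a guaranteed speedup.

```c
#include <stdbool.h>
#include <string.h>

/* General solution: measure the whole string, then compare with 0.
 * strlen walks the string up to its terminator, so this is O(n). */
bool is_empty_general(const char *s)
{
    return strlen(s) == 0;
}

/* Tailored solution: an empty string starts with the terminator,
 * so looking at the first character is enough. O(1). */
bool is_empty_tailored(const char *s)
{
    return s[0] == '\0';
}
```

Both functions return the same answer for every string; only the amount of work differs.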
Detecting Algorithm Issues
Use the call graph feature of the VTune analyzer
Optimizing the complete algorithm instead of individual functions exposes more optimization opportunities and yields higher performance
Check CPI (clocks per instruction) by sampling two events
Clockticks event
Instructions Retired event
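CPI is the ratio of the two sampled event counts. The helper below is a hypothetical illustration of that arithmetic only; it does not call any VTune API, and the event totals are assumed to have been read from the profiler's output.

```c
/* Clocks per instruction from two sampled event totals.
 * Returns 0.0 when no instructions were retired, to avoid dividing by zero. */
double clocks_per_instruction(unsigned long long clockticks,
                              unsigned long long instructions_retired)
{
    if (instructions_retired == 0)
        return 0.0;
    return (double)clockticks / (double)instructions_retired;
}
```

For example, 3000 clockticks over 1000 retired instructions gives a CPI of 3.0. A high CPI suggests the processor is spending time waiting, for instance on long-latency instructions or memory accesses.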
Key Points (1)
Selecting the right algorithm is absolutely critical to great performance
Computational complexity
Instruction selection
Memory accesses
Avoiding processor issues
Performance issues in algorithms
Instruction latency, instruction throughput, data dependencies, memory accesses
Key Points (2)
Keep data dependencies low enough that the processor can execute four or more operations at the same time
Choose algorithms that allow much of the computation to be done in parallel or with vector instructions
Tailor algorithms to the problem to eliminate inefficiency
Use VTune’s call graph analysis to detect an algorithm’s hotspots
Backup
Approximate Instruction Performance on Pentium 4
| Instruction | Latency (clocks) | Throughput (clocks) |
| --- | --- | --- |
| Addition, subtraction, increment, decrement, logic, … | 0.5 | 0.5 |
| Push, pop, rotate, shift, SIMD memory moves, … | 1 | 1 |
| 128-bit SIMD integer operations, … | 2 | 2 |
| Integer multiplication | 15 | 4 |
| Integer division, … | 23 | 23 |
| SIMD single-precision floating-point divide | 32 | 32 |
| Double-precision (64-bit) floating-point division | 38 | 38 |

Memory operations may take much longer depending upon the state of the cache