
ECE 1747H : Parallel Programming

Lecture 1: Overview

ECE 1747H

• Meeting time: Mon 4-6 PM

• Meeting place: BA 4164

• Instructor: Cristiana Amza, http://www.eecg.toronto.edu/~amza, amza@eecg.toronto.edu, office Pratt 484E

Material

• Course notes

• Web material (e.g., published papers)

• No required textbook, some recommended

Prerequisites

• Programming in C or C++

• Data structures

• Basics of machine architecture

• Basics of network programming

• Please send e-mail to eugenia@eecg to get an eecg account (include name, student ID, class, and instructor).

Other than that

• No written homeworks, no exams

• 10% for each small programming assignment (expect 1-2)

• 10% class participation

• Rest comes from major course project

Programming Project

• Parallelizing a sequential program, or improving the performance or the functionality of a parallel program

• Project proposal and final report

• In-class project proposal and final report presentation

• “Sample” project presentation posted

Parallelism (1 of 2)

• Ability to execute different parts of a single program concurrently on different machines

• Goal: shorter running time

• Grain of parallelism: how big are the parts?

• Can be instruction, statement, procedure, …

• Will mainly focus on relatively coarse grain

Parallelism (2 of 2)

• Coarse-grain parallelism mainly applicable to long-running, scientific programs

• Examples: weather prediction, prime number factorization, simulations, …

Lecture material (1 of 4)

• Parallelism
  – What is parallelism?
  – What can be parallelized?
  – Inhibitors of parallelism: dependences

Lecture material (2 of 4)

• Standard models of parallelism
  – shared memory (Pthreads)
  – message passing (MPI)
  – shared memory + data parallelism (OpenMP)

• Classes of applications
  – scientific
  – servers

Lecture material (3 of 4)

• Transaction processing
  – classic programming model for databases
  – now being proposed for scientific programs

Lecture material (4 of 4)

• Performance of parallel & distributed programs
  – architecture-independent optimization
  – architecture-dependent optimization

Course Organization

• First month of semester:
  – lectures on parallelism, patterns, models
  – small programming assignments, done individually

• Rest of the semester:
  – major programming project, done individually or in a small group
  – research paper discussions

Parallel vs. Distributed Programming

Parallel programming has matured:

• Few standard programming models

• Few common machine architectures

• Portability between models and architectures

Bottom Line

• Programmer can now focus on the program and use a suitable programming model

• Reasonable hope of portability

• Problem: much performance optimization is still platform-dependent
  – performance portability is a problem

ECE 1747H: Parallel Programming

Lecture 1-2: Parallelism, Dependences

Parallelism

• Ability to execute different parts of a program concurrently on different machines

• Goal: shorten execution time

Measures of Performance

• To computer scientists: speedup, execution time.

• To applications people: size of problem, accuracy of solution, etc.

Speedup of Algorithm

• Speedup of algorithm = sequential execution time / execution time on p processors (with the same data set).

[Figure: typical speedup curve, plotting speedup against the number of processors p]

Speedup on Problem

• Speedup on problem = sequential execution time of best known sequential algorithm / execution time on p processors.

• A more honest measure of performance.

• Avoids picking an easily parallelizable algorithm with poor sequential execution time.
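
• For instance (hypothetical numbers, not from the slides): if the best known sequential algorithm runs in 100 s, and a parallel algorithm runs in 125 s on one processor and 10 s on 16 processors, the speedup of the algorithm is 125/10 = 12.5, but the speedup on the problem is only 100/10 = 10.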

What Speedups Can You Get?

• Linear speedup
  – Confusing term: implicitly means a 1-to-1 speedup per processor.
  – (Almost always) as good as you can do.

• Sub-linear speedup: more common, due to the overhead of startup, synchronization, communication, etc.

Speedup

[Figure: speedup vs. number of processors p, showing the ideal linear speedup line and a typical actual speedup curve]

Scalability

• No really precise definition.

• Roughly speaking, a program is said to scale to a certain number of processors p, if going from p-1 to p processors results in some acceptable improvement in speedup (for instance, an increase of 0.5).

Super-linear Speedup?

• Due to cache/memory effects:
  – Subparts fit into the cache/memory of each node.
  – Whole problem does not fit in the cache/memory of a single node.

• Nondeterminism in search problems:
  – One thread finds a near-optimal solution very quickly => leads to drastic pruning of the search space.

Cardinal Performance Rule

• Don’t leave (too) much of your code sequential!

Amdahl’s Law

• If 1/s of the program is sequential, then you can never get a speedup better than s.
  – (Normalized) sequential execution time = 1/s + (1 - 1/s) = 1
  – Best parallel execution time on p processors = 1/s + (1 - 1/s)/p
  – When p goes to infinity, parallel execution time = 1/s
  – Speedup = s
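
As a quick check of the algebra above, here is a minimal C sketch (the values of s and p are hypothetical) that evaluates the normalized parallel time and the resulting speedup:

    #include <stdio.h>

    int main(void) {
        double s = 10.0;   /* hypothetical: 1/s = 10% of the program is sequential */
        double p = 1000.0; /* hypothetical: number of processors */

        /* normalized parallel execution time: sequential part + parallel part spread over p */
        double t_par = 1.0 / s + (1.0 - 1.0 / s) / p;

        printf("speedup = %.2f (Amdahl bound = %.0f)\n", 1.0 / t_par, s);
        return 0;
    }

With these numbers the printed speedup is about 9.9, just under the bound of s = 10.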

Why keep something sequential?

• Some parts of the program are not parallelizable (because of dependences)

• Some parts may be parallelizable, but the overhead dwarfs the increased speedup.

When can two statements execute in parallel?

• On one processor:
    statement1;
    statement2;

• On two processors:
    processor1:      processor2:
    statement1;      statement2;

Fundamental Assumption

• Processors execute independently: no control over order of execution between processors

When can 2 statements execute in parallel?

• Possibility 1
    Processor1:      Processor2:
    statement1;
                     statement2;

• Possibility 2
    Processor1:      Processor2:
                     statement2;
    statement1;

When can 2 statements execute in parallel?

• Their order of execution must not matter!

• In other words,
    statement1; statement2;
  must be equivalent to
    statement2; statement1;

Example 1

a = 1;
b = 2;

• Statements can be executed in parallel.

Example 2

a = 1;
b = a;

• Statements cannot be executed in parallel.

• Program modifications may make it possible.

Example 3

a = f(x);
b = a;

• May not be wise to change the program (sequential execution would take longer).

Example 5

a = 1;
a = 2;

• Statements cannot be executed in parallel.

True dependence

Statements S1, S2

S2 has a true dependence on S1

iff

S2 reads a value written by S1

Anti-dependence

Statements S1, S2.

S2 has an anti-dependence on S1

iff

S2 writes a value read by S1.

Output Dependence

Statements S1, S2.

S2 has an output dependence on S1

iff

S2 writes a variable written by S1.

When can 2 statements execute in parallel?

S1 and S2 can execute in parallel

iff

there are no dependences between S1 and S2:
  – true dependences
  – anti-dependences
  – output dependences

Some dependences can be removed.
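
A small C sketch (not from the slides) of why anti- and output dependences are removable: they are dependences on names rather than on computed values, so renaming (introducing a new variable) eliminates them, whereas a true dependence cannot be renamed away.

    /* S2 has an anti-dependence on S1: S2 writes a variable that S1 reads. */
    void with_anti_dependence(int a, int *b, int *c) {
        *b = a + 1;   /* S1: reads a  */
        a  = 5;       /* S2: writes a */
        *c = a;
    }

    /* Renaming the second write removes the dependence; S1 and S2 are now independent. */
    void renamed(int a, int *b, int *c) {
        int a2;
        *b = a + 1;   /* S1: reads a            */
        a2 = 5;       /* S2: writes a2 instead  */
        *c = a2;      /* later uses refer to a2 */
    }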

Example 6

• Most parallelism occurs in loops.

for(i=0; i<100; i++) a[i] = i;

• No dependences.
• Iterations can be executed in parallel.
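
A minimal sketch of how a dependence-free loop like this one could be parallelized with OpenMP (one of the shared-memory models listed earlier); the pragma distributes iterations over threads:

    void fill(int a[100]) {
        /* each iteration writes a distinct a[i], so iterations may run in any order */
        #pragma omp parallel for
        for (int i = 0; i < 100; i++)
            a[i] = i;
    }

Compiled with OpenMP enabled, e.g. gcc -fopenmp.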

Example 7

for (i = 0; i < 100; i++) {
    a[i] = i;
    b[i] = 2*i;
}

Iterations and statements can be executed in parallel.

Example 8

for (i = 0; i < 100; i++)
    a[i] = i;
for (i = 0; i < 100; i++)
    b[i] = 2*i;

Iterations and loops can be executed in parallel.

Example 9

for(i=0; i<100; i++)

a[i] = a[i] + 100;

• There is a dependence … on itself!

• Loop is still parallelizable.

Example 10

for (i = 1; i < 100; i++)
    a[i] = f(a[i-1]);

• Dependence between a[i] and a[i-1].

• Loop iterations are not parallelizable.

Loop-carried dependence

• A loop-carried dependence is a dependence that is present only if the statements are part of the execution of a loop.

• Otherwise, we call it a loop-independent dependence.

• Loop-carried dependences prevent loop iteration parallelization.

Example 11

for (i = 0; i < 100; i++)
    for (j = 1; j < 100; j++)
        a[i][j] = f(a[i][j-1]);

• Loop-independent dependence on i.
• Loop-carried dependence on j.
• Outer loop can be parallelized, inner loop cannot.
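
A sketch of parallelizing only the outer loop of the example above with OpenMP; f is the abstract function from the slide, declared here (with an assumed double element type) just so the fragment is complete:

    double f(double);   /* stand-in declaration for the slide's f */

    void update(double a[100][100]) {
        /* rows are independent of each other; within a row, column j depends on j-1 */
        #pragma omp parallel for
        for (int i = 0; i < 100; i++)
            for (int j = 1; j < 100; j++)
                a[i][j] = f(a[i][j-1]);
    }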

Example 12

for (j = 1; j < 100; j++)
    for (i = 0; i < 100; i++)
        a[i][j] = f(a[i][j-1]);

• Inner loop can be parallelized, outer loop cannot.

• Less desirable situation.

• Loop interchange is sometimes possible.
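
A sketch of that interchange: swapping the two loop headers of Example 12 puts the dependence-free i loop back on the outside, recovering the structure of Example 11, where the outer loop can be parallelized.

    /* Example 12 after loop interchange: i is outermost again and carries no dependence */
    for (i = 0; i < 100; i++)
        for (j = 1; j < 100; j++)
            a[i][j] = f(a[i][j-1]);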

Level of loop-carried dependence

• Is the nesting depth of the loop that carries the dependence.

• Indicates which loops can be parallelized.

Be careful … Example 13

printf("a");
printf("b");

Statements have a hidden output dependence due to the output stream.

Be careful … Example 14

a = f(x);
b = g(x);

Statements could have a hidden dependence if f and g update the same variable.

Also depends on what f and g can do to x.

Be careful … Example 15

for(i=0; i<100; i++)

a[i+10] = f(a[i]);

• Dependence between a[10], a[20], …

• Dependence between a[11], a[21], …

• …

• Some parallel execution is possible.
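
One way to exploit that partial parallelism (a sketch, not from the slides): the dependences form ten independent chains, one per remainder of i modulo 10, so the chains can run in parallel while each chain stays sequential. The array size and element type are assumptions.

    double f(double);   /* stand-in declaration for the slide's f */

    void update(double a[110]) {
        /* chain k touches only a[k], a[k+10], a[k+20], ...; chains never share elements */
        #pragma omp parallel for
        for (int k = 0; k < 10; k++)
            for (int i = k; i < 100; i += 10)
                a[i + 10] = f(a[i]);
    }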

Be careful … Example 16

for (i = 1; i < 100; i++) {
    a[i] = ...;
    ... = a[i-1];
}

• Dependence between a[i] and a[i-1]

• Complete parallel execution impossible

• Pipelined parallel execution possible
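
Besides pipelining, loop fission is another option, shown in this sketch under the assumption that the elided "..." parts hide no further dependences; g and use are hypothetical stand-ins for them. Each of the two resulting loops is then free of loop-carried dependences and can itself be parallelized.

    double g(int i);      /* hypothetical stand-in for the elided right-hand side */
    void   use(double v); /* hypothetical stand-in for the elided use of a[i-1]   */

    void split(double a[100]) {
        for (int i = 1; i < 100; i++)   /* produces every a[i]; iterations independent */
            a[i] = g(i);

        for (int i = 1; i < 100; i++)   /* only reads values already produced above    */
            use(a[i-1]);
    }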

Be careful … Example 17

for( i=0; i<100; i++ )

a[i] = f(a[indexa[i]]);

• Cannot tell for sure.

• Parallelization depends on user knowledge of values in indexa[].

• User can tell, compiler cannot.

An aside

• Parallelizing compilers analyze program dependences to decide parallelization.

• In parallelization by hand, user does the same analysis.

• Compiler: more convenient and more reliably correct.

• User: more powerful, can analyze more patterns.

To remember

• Statement order must not matter.

• Statements must not have dependences.

• Some dependences can be removed.

• Some dependences may not be obvious.
