TRANSCRIPT
Implementation: . . .
RISC Technology
Pipelining
Superscalar Processors
Cache Memory
Memory Hierarchy
Parallel Computers – . . .
Flynn’s Classification . . .
Memory Access . . .
Parallelization
The Programming . . .
MPI Messages
Programming with MPI
Load Distribution
Designing Load . . .
Classification of . . .
Examples of LD- . . .
Performance Evaluation
Page 1 of 18
Introduction to Scientific Computing
9. Implementation
Miriam Mehl
1. Implementation: Target Architectures
• different target architectures for numerical simulations:
– monoprocessors
– supercomputers
• modern microprocessors:
– obvious trends:
* increasing clock rates (> 2 GHz almost standard)
* more MIPS, more FLOPS
* very-, ultra-, and ???-large-scale integration; hence, more transistors and more functionality on the chip
* longer words: 64-bit architectures are standard (workstations) or coming (PCs)
– important features:
* RISC (Reduced Instruction Set Computer) technology
* well-developed pipelining
* superscalar processor organization
* caching and a multi-level memory hierarchy
* VLIW, multithreaded architectures, on-chip multiprocessors, ...
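The clock-rate and FLOPS trends above combine into a simple peak-performance estimate. A minimal sketch, with illustrative numbers (the 2 GHz clock and two FP results per cycle are assumptions for the example, not a specific processor):

```python
# Peak floating-point performance = clock rate x FP operations per cycle.
# A 2 GHz processor retiring one FP add and one FP mul per cycle
# (two FP results per cycle) peaks at 4 GFLOPS -- an upper bound that
# real codes rarely reach because of memory and dependency stalls.

def peak_gflops(clock_ghz, flops_per_cycle):
    """Theoretical peak performance in GFLOPS."""
    return clock_ghz * flops_per_cycle

print(peak_gflops(2.0, 2))  # 4.0
```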
2. RISC Technology
• counter-trend to CISC (Complex Instruction Set Computer), where more and more complex instructions entailed microprogramming
• now instead:
– relatively small number of instructions (tens)
– simple machine instructions, fixed format, few address modes
– load-and-store principle: only explicit LOAD/STORE instructions access memory
– no more need for microprogramming
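The load-and-store principle can be illustrated with a toy register-machine interpreter. The mini-ISA below (the names LOAD/ADD/STORE, four registers, word-addressed memory) is invented for illustration, not a real instruction set: arithmetic works only on registers, and memory is touched solely by the explicit LOAD and STORE instructions.

```python
def run(program, memory):
    """Interpret a tiny RISC-like program on registers r0..r3."""
    regs = [0] * 4
    for op, *args in program:
        if op == "LOAD":     # LOAD rd, addr  -- the only read from memory
            rd, addr = args
            regs[rd] = memory[addr]
        elif op == "ADD":    # ADD rd, rs, rt -- register-to-register only
            rd, rs, rt = args
            regs[rd] = regs[rs] + regs[rt]
        elif op == "STORE":  # STORE rs, addr -- the only write to memory
            rs, addr = args
            memory[addr] = regs[rs]
    return memory

# a = b + c, with a, b, c at memory addresses 0, 1, 2:
mem = [0, 5, 7]
run([("LOAD", 0, 1), ("LOAD", 1, 2), ("ADD", 2, 0, 1), ("STORE", 2, 0)], mem)
print(mem[0])  # 12
```

On a CISC machine the same statement might be a single memory-to-memory instruction; the RISC version trades instruction count for simple, fixed-format operations that pipeline well.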
3. Pipelining
• decompose instructions into simple steps involving different parts of the CPU:
– load,
– decode,
– reserve registers,
– execute,
– write results
• further improvement: reorder the steps of an instruction (LOAD as early as possible, WRITE as late as possible) to avoid the risk of idle waiting time
• best case: identical instructions to be pipelined/overlapped, as in vector processors
• pipelining needs different functional units in the CPU that can deal with the different steps in parallel; therefore: superscalar processors
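The benefit of overlapping the steps above can be sketched with the standard pipeline timing model; this is an idealized sketch assuming k equal-length stages and no hazards or stalls:

```python
# Without pipelining, n instructions of k stages take n*k cycles.
# With an ideal k-stage pipeline the first instruction needs k cycles
# and every further one retires each cycle: k + (n - 1) cycles total,
# so the speedup approaches k for large n.

def cycles_sequential(n, k):
    return n * k

def cycles_pipelined(n, k):
    return k + (n - 1)

n, k = 1000, 5  # e.g. load/decode/reserve/execute/write as the 5 stages
speedup = cycles_sequential(n, k) / cycles_pipelined(n, k)
print(round(speedup, 2))  # 4.98, close to the stage count k = 5
```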
4. Superscalar Processors
• several parts of the CPU are available in more than one copy
• example: the MIPS R10000 has 5 execution pipelines:
– one for FP multiplication, one for FP addition
– two integer ALUs (arithmetic-logical units)
– one address pipeline
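A toy issue model shows why multiple functional units help: independent instructions that need different units can start in the same cycle. The unit table below is a hypothetical simplification loosely inspired by the R10000 list above, assuming in-order issue, 1-cycle units, and no data hazards:

```python
UNITS = {"fadd": 1, "fmul": 1, "alu": 2, "addr": 1}  # copies of each unit

def cycles_needed(instrs):
    """Cycles to issue all instructions, greedily filling free units in order."""
    cycles = 0
    i = 0
    while i < len(instrs):
        free = dict(UNITS)  # all unit copies become free each new cycle
        issued = 0
        # issue in program order while the needed unit has a free copy
        while i < len(instrs) and free.get(instrs[i], 0) > 0:
            free[instrs[i]] -= 1
            issued += 1
            i += 1
        if issued == 0:
            raise ValueError(f"unknown unit type: {instrs[i]}")
        cycles += 1
    return cycles

print(cycles_needed(["fadd"] * 8))          # 8: adds serialize on one pipeline
print(cycles_needed(["fadd", "fmul"] * 4))  # 4: add+mul pairs issue together
```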
5. Cache Memory
• CPU performance has increased faster than memory access speed
• thus: reduce memory access time / latency
• cache memory: small and fast on-chip memory, keeps part of the main memory
• optimum: needed data is always available in cache memory
• look for strategies to ensure a hit probability p close to 1:
– choice of section: what is to be kept in the cache?
– ensure locality of data (instructions in the cache need data in the cache)
– strategies for fetching, replacement, and updating
– association: how to check whether data are available in the cache?
– consistency: no different versions in cache and main memory
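The payoff of a hit probability close to 1 can be quantified with the standard average-access-time model. The sketch below is illustrative and uses hypothetical latency numbers (1 ns cache, 100 ns main memory), not figures from the slides:

```python
def avg_access_time(p_hit, t_cache, t_mem):
    """Average memory access time for hit probability p_hit:
    a hit costs t_cache, a miss costs the main-memory latency t_mem."""
    return p_hit * t_cache + (1.0 - p_hit) * t_mem

# Hypothetical latencies: 1 ns cache, 100 ns main memory.
for p in (0.90, 0.99, 0.999):
    print(p, avg_access_time(p, 1.0, 100.0))
```

Even raising p from 0.9 to 0.99 cuts the average latency from 10.9 ns to 1.99 ns in this model, which is why all the strategies above target p ≈ 1.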
6. Memory Hierarchy
• today: several cache levels → memory hierarchy:
– register,
– (level-1/2/3) cache,
– main memory,
– hard disk,
– remote memory
the faster, the smaller
• a notion of the target computer’s memory hierarchy is important for numerical algorithms’ efficiency:
– example: matrix-vector product Ax with A too large for the cache
– standard algorithm:
* outer loop over the rows of A,
* inner loop for the scalar product of one row of A with x
– if the current contents of the cache are some rows of A: fine
– if the current contents of the cache are some columns of A: slow!
– tuning is crucial: peak performance can be up to 4 orders of magnitude higher than the performance observed in practice (without tuning)
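The standard algorithm from the example can be sketched as follows (plain Python lists stand in for a real array; the point is the row-wise access pattern, which matches a row-major memory layout):

```python
def matvec(A, x):
    """Standard matrix-vector product: outer loop over the rows of A,
    inner loop forming the scalar product of one row with x.
    Each row of A is traversed contiguously, which is cache-friendly
    when A is stored row by row."""
    y = []
    for row in A:                      # outer loop over the rows of A
        s = 0.0
        for a_ij, x_j in zip(row, x):  # scalar product row . x
            s += a_ij * x_j
        y.append(s)
    return y

A = [[1.0, 2.0],
     [3.0, 4.0]]
x = [1.0, 1.0]
print(matvec(A, x))  # → [3.0, 7.0]
```

Traversing A column by column instead would jump across rows on every access; in a compiled row-major implementation that access pattern defeats the cache and causes the slowdown noted on the slide.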
7. Parallel Computers – Topologies
• parallel computers vs. distributed systems: where is the frontier?
• different possible arrangements:
– static network topologies:
* bus, ring, grid, or torus
* binary tree or fat tree
* hypercube
– dynamic network topologies:
* crossbar switch
* shuffle-exchange network
• crucial quantities:
– diameter (longest shortest path between two processors)
– number of network connections (ports) per processor
– are parallel communications possible?
– are there bottlenecks?
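For the static topologies these quantities can be worked out directly. The helper below is illustrative (not from the slides) and computes the node count, ports per processor, and diameter for a ring of n processors and for a d-dimensional hypercube:

```python
def ring_metrics(n):
    """Ring of n processors: 2 ports each, diameter floor(n/2)."""
    return {"nodes": n, "ports": 2, "diameter": n // 2}

def hypercube_metrics(d):
    """d-dimensional hypercube: 2**d processors, each with d ports
    (one neighbour per dimension); diameter d, since a shortest route
    flips one differing address bit per hop."""
    return {"nodes": 2 ** d, "ports": d, "diameter": d}

print(ring_metrics(16))      # 16 nodes, 2 ports, diameter 8
print(hypercube_metrics(4))  # 16 nodes, 4 ports, diameter 4
```

With the same 16 processors, the hypercube halves the diameter at the price of more ports per processor — exactly the trade-off the crucial quantities above are meant to expose.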
8. Flynn’s Classification (1972)
• SISD: Single Instruction, Single Data
– classical von Neumann monoprocessor
• SIMD: Single Instruction, Multiple Data
– vector computers: extreme pipelining, one instruction applied to a sequence (vector) of data (CRAY 1, 2, X, Y, J/C/T90, . . . )
– array computers: array of processors, concurrency (Thinking Machines CM-2, MasPar MP-1, MP-2)
• MIMD: Multiple Instruction, Multiple Data
– multiprocessors:
* distributed memory (loose coupling, explicit communication; Intel Paragon, IBM SP-2) or
* shared memory (tight coupling, global address space, implicit communication; most workstation servers) or
* nets/clusters
• MISD: Multiple Instruction, Single Data: rare
9. Memory Access Classification
• other criteria for classification:
scalability (S), programming model (PM), portability (P), and load distribution (L)
• UMA: Uniform Memory Access
– shared-memory systems: SMP (symmetric multiprocessors, parallel vector processors); PC and WS servers, CRAY YMP
– advantages: P, PM, L; drawback: S
• NORMA: No Remote Memory Access
– distributed-memory systems; clusters, IBM SP-2, iPSC/860
– advantage: S; drawbacks: P, PM, L
• NUMA: Non-Uniform Memory Access
– systems with virtually shared memory; KSR-1, CRAY T3D/T3E, CONVEX SPP
– advantages: PM, S, P; drawbacks: cache coherence, communication
10. Parallelization
• classical programming paradigms are, in principle, all well-suitedfor explicit or implicit parallelization:
– imperative: FORTRAN, C (dominant male, recently withsome OO-touch like in C++)
– logical/relational: PROLOG
– object-oriented: SMALLTALK
– functional/applicative: LISP
• implicit parallelization typically via special compilers
• explicit parallelization typically via linked communication libraries
• traditional way in Scientific Computing: FORTRAN code,vectorizing compiler, CRAY, wait for results
Implementation: . . .
RISC Technology
Pipelining
Superscalar Processors
Cache Memory
Memory Hierarchy
Parallel Computers – . . .
Flynn’s Classification . . .
Memory Access . . .
Parallelization
The Programming . . .
MPI Messages
Programming with MPI
Load Distribution
Designing Load . . .
Classification of . . .
Examples of LD- . . .
Performance Evaluation
Page 10 of 18
Introduction to Scientific Computing
9. ImplementationMiriam Mehl
10. Parallelization
• classical programming paradigms are, in principle, all well-suited for explicit or implicit parallelization:
– imperative: FORTRAN, C (the dominant male, recently with some OO touch as in C++)
– logical/relational: PROLOG
– object-oriented: SMALLTALK
– functional/applicative: LISP
• implicit parallelization typically via special compilers
• explicit parallelization typically via linked communication libraries
• traditional way in Scientific Computing: write FORTRAN code, feed it to a vectorizing compiler on a CRAY, wait for results
• explicit parallelization is often difficult (cf. Gauß-Seidel), which makes non-conventional approaches attractive
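The Gauß-Seidel difficulty can be made concrete with a toy 1-D averaging stencil (an illustrative sketch, not an example from the lecture): a Jacobi sweep reads only old values, so all updates are independent, while a Gauß-Seidel sweep reuses values already updated in the same sweep, creating a sequential dependency.

```python
# Jacobi: every new value depends only on the OLD array -> trivially parallel.
# Gauss-Seidel: u[i] uses u[i-1] ALREADY UPDATED in this sweep -> sequential.
# Hypothetical 1-D averaging stencil with fixed boundary values.

def jacobi_sweep(u):
    # all interior updates read the old array; they could run concurrently
    return [u[0]] + [(u[i-1] + u[i+1]) / 2 for i in range(1, len(u)-1)] + [u[-1]]

def gauss_seidel_sweep(u):
    u = u[:]
    for i in range(1, len(u) - 1):
        # u[i-1] was already overwritten in this sweep: order matters
        u[i] = (u[i-1] + u[i+1]) / 2
    return u

u0 = [0.0, 4.0, 0.0, 4.0, 0.0]
print(jacobi_sweep(u0))        # [0.0, 0.0, 4.0, 0.0, 0.0]
print(gauss_seidel_sweep(u0))  # [0.0, 0.0, 2.0, 1.0, 0.0] -- depends on sweep order
```

Because the Gauß-Seidel result depends on the traversal order, a naive parallel version changes the numerics; this is why reordering tricks (e.g. red-black colouring) or non-conventional approaches become attractive.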
11. The Programming Model MPI
• How to write parallel programs?
– UMA systems: simple answer – just as sequential ones
– distributed memory systems: MPI model or standard
* Message Passing Interface
* originally for clusters, today used on massively parallel computers, too
* MPI-1 developed 1992–1994
* explicit exchange of messages: more programming work, but more possibilities for tuning and optimizing
• MPI features:
– parallel program: n processes, separate address spaces, no remote access
– message exchange via the system calls send and receive
– MPI kernel: library of communication routines, allowing MPI commands to be integrated into standard languages
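The model above — separate address spaces, no remote access, explicit send/receive — can be sketched without an MPI installation by emulating two ranks with threads and per-rank receive buffers. The `send`/`recv` helpers below are illustrative stand-ins, not the MPI API (in real MPI these would be `MPI_Send`/`MPI_Recv` on actual processes):

```python
# Sketch of the message-passing model: n "processes" with purely local
# state, communicating only via explicit messages. Emulated with threads
# and queues; this is an illustration of the model, not MPI itself.
import threading
import queue

mailbox = {0: queue.Queue(), 1: queue.Queue()}  # one receive buffer per rank

def send(dest, data):
    mailbox[dest].put(data)      # deposit message in the receiver's buffer

def recv(rank):
    return mailbox[rank].get()   # blocking receive: waits for a message

results = {}

def worker(rank, other):
    local = (rank + 1) * 10      # separate address space: local data only
    send(other, local)           # explicit message exchange, no shared memory
    results[rank] = local + recv(rank)

t0 = threading.Thread(target=worker, args=(0, 1))
t1 = threading.Thread(target=worker, args=(1, 0))
t0.start(); t1.start(); t0.join(); t1.join()
print(results)   # both ranks hold 10 + 20 = 30
```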
12. MPI Messages
• messages consist of a
– header (recipient, buffer, type, context of communication) and a
– body (contents)
• messages are buffered (send buffer, receive buffer)
• sending a message can be
– blocking (finished only after the message has left the node) or
– non-blocking (finished immediately, the message may be sent later)
• the same holds for receiving a message:
– blocking: waiting;
– non-blocking: looking for it from time to time
• cost of passing a message of length N (buffer capacity K):
t(N) = α · ⌈N/K⌉ + β · N
with initialization cost α per packet and transport cost β per unit of message length
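The cost model is easy to evaluate: a message of length N is split into ⌈N/K⌉ packets, each paying the startup cost α, and every unit of data pays β. A short sketch (the numeric values of α and β are illustrative, not measured):

```python
# Message-passing cost model: t(N) = alpha * ceil(N/K) + beta * N,
# alpha = per-packet initialization cost, beta = per-unit transport cost,
# K = buffer (packet) capacity. Illustrative parameter values.
from math import ceil

def message_cost(N, K, alpha, beta):
    return alpha * ceil(N / K) + beta * N

# short message: the startup term alpha dominates (latency-bound)
print(message_cost(8, 1024, alpha=100, beta=2))       # 100*1 + 2*8 = 116
# long message: the beta*N term dominates (bandwidth-bound)
print(message_cost(10**6, 1024, alpha=100, beta=2))   # 100*977 + 2*10**6 = 2097700
```

The practical consequence: many small messages are expensive because each pays α, so parallel codes typically aggregate data into fewer, larger messages.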
13. Programming with MPI
• a simple example:
P1:                        P2:
compute something          compute something
store result in SBUF       store result in SBUF
SendBlocking(P2,SBUF)      SendBlocking(P1,SBUF)
RecBlocking(P2,RBUF)       RecBlocking(P1,RBUF)
read data in RBUF          read data in RBUF
compute again              compute again
• without buffering: deadlocks possible
– nothing specified: buffering possible, but not imperative
– never: no buffering (efficient, but risky)
– always: secure, but sometimes costly
• collective communication features available:
– broadcast, gather, gather-to-all, scatter, all-to-all, . . .
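Why the example deadlocks without buffering: an unbuffered blocking send is a rendezvous, completing only when the matching receive is posted. Since both P1 and P2 send first, each waits for the other forever; reordering one process fixes it. A tiny step-by-step simulator (a simplified illustration of rendezvous semantics, not full MPI semantics):

```python
# Simulates unbuffered blocking send/receive: an operation completes only
# when its peer is at the complementary operation (rendezvous).
# Detects deadlock when no process can make progress.

def run(programs):
    """programs: {rank: [('send', dest) or ('recv', src), ...]}.
    Returns 'ok' if all programs finish, 'deadlock' if all block."""
    pc = {r: 0 for r in programs}           # program counter per rank
    while True:
        progress = False
        for r, prog in programs.items():
            if pc[r] >= len(prog):
                continue                     # rank r already finished
            op, peer = prog[pc[r]]
            match = programs[peer][pc[peer]:pc[peer] + 1]
            needed = ('recv', r) if op == 'send' else ('send', r)
            if match and match[0] == needed:  # rendezvous: both sides ready
                pc[r] += 1
                pc[peer] += 1
                progress = True
        if all(pc[r] >= len(p) for r, p in programs.items()):
            return 'ok'
        if not progress:
            return 'deadlock'               # everyone blocked, nobody matches

both_send_first = {0: [('send', 1), ('recv', 1)],
                   1: [('send', 0), ('recv', 0)]}   # the slide's example
reordered       = {0: [('send', 1), ('recv', 1)],
                   1: [('recv', 0), ('send', 0)]}   # rank 1 receives first
print(run(both_send_first))  # deadlock
print(run(reordered))        # ok
```

With "always buffer", the send completes as soon as the message is copied to the send buffer, so even the symmetric program runs — at the cost of the extra copy.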
14. Load Distribution
• load: amount of work on the processors
– optimum: minimize idle times; needs estimates and monitoring
– strategy: load balancing or load distribution or scheduling
– important: avoid overhead
• one distinguishes
– scheduling:
* global: where do which processes run?
* local: when does which processor run which process?
– load balancing:
* static: a priori
* dynamic: during runtime
• in Scientific Computing applications, the load is often not predictable:
– adaptive refinement of a finite element mesh,
– convergence behaviour of iterations may differ
– thus: static load balancing is not sufficient
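The insufficiency of static balancing can be seen in a small sketch (hypothetical task costs): a static block distribution is perfect while all tasks cost the same, but once a few tasks become expensive at runtime (e.g. through adaptive refinement), the most loaded processor dictates the runtime while the others idle.

```python
# Static load balancing: tasks are assigned a priori, in contiguous
# blocks of (nearly) equal count. Hypothetical per-task costs.

def block_distribution(costs, p):
    """Split the task list into p contiguous blocks of (nearly) equal size."""
    n = len(costs)
    size = (n + p - 1) // p                      # ceil(n / p) tasks per block
    return [costs[i * size:(i + 1) * size] for i in range(p)]

# tasks 3 and 4 turned out expensive at runtime (not known a priori):
costs = [1, 1, 1, 8, 8, 1, 1, 1, 1]
parts = block_distribution(costs, 3)
loads = [sum(block) for block in parts]
print(loads)                          # [3, 17, 3]: processor 1 is the bottleneck
print(max(loads) / (sum(loads) / 3))  # imbalance factor > 2: others sit idle
```

A dynamic strategy would migrate tasks away from the overloaded processor during the run, at the price of monitoring and migration overhead.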
15. Designing Load Distribution
• Which are the primary objectives?
– optimization of system loador application runtime?
– placementof new processes or migration of running pro-cesses?
• Which is the level of integration?
– Who initiates actions (measure load, chose strategy)?
* application program
Implementation: . . .
RISC Technology
Pipelining
Superscalar Processors
Cache Memory
Memory Hierarchy
Parallel Computers – . . .
Flynn’s Classification . . .
Memory Access . . .
Parallelization
The Programming . . .
MPI Messages
Programming with MPI
Load Distribution
Designing Load . . .
Classification of . . .
Examples of LD- . . .
Performance Evaluation
Page 15 of 18
Introduction to Scientific Computing
9. ImplementationMiriam Mehl
15. Designing Load Distribution
• Which are the primary objectives?
– optimization of system loador application runtime?
– placementof new processes or migration of running pro-cesses?
• Which is the level of integration?
– Who initiates actions (measure load, chose strategy)?
* application program
* runtime system
Implementation: . . .
RISC Technology
Pipelining
Superscalar Processors
Cache Memory
Memory Hierarchy
Parallel Computers – . . .
Flynn’s Classification . . .
Memory Access . . .
Parallelization
The Programming . . .
MPI Messages
Programming with MPI
Load Distribution
Designing Load . . .
Classification of . . .
Examples of LD- . . .
Performance Evaluation
Page 15 of 18
Introduction to Scientific Computing
9. ImplementationMiriam Mehl
15. Designing Load Distribution
• Which are the primary objectives?
– optimization of system loador application runtime?
– placementof new processes or migration of running pro-cesses?
• Which is the level of integration?
– Who initiates actions (measure load, chose strategy)?
* application program
* runtime system
* OS?
Implementation: . . .
RISC Technology
Pipelining
Superscalar Processors
Cache Memory
Memory Hierarchy
Parallel Computers – . . .
Flynn’s Classification . . .
Memory Access . . .
Parallelization
The Programming . . .
MPI Messages
Programming with MPI
Load Distribution
Designing Load . . .
Classification of . . .
Examples of LD- . . .
Performance Evaluation
Page 15 of 18
Introduction to Scientific Computing
9. ImplementationMiriam Mehl
15. Designing Load Distribution
• Which are the primary objectives?
– optimization of system loador application runtime?
– placementof new processes or migration of running pro-cesses?
• Which is the level of integration?
– Who initiates actions (measure load, chose strategy)?
* application program
* runtime system
* OS?
• Any special features of the application to be considered?
– restrictions in allocation process-to-processor frequent inS.C.
Implementation: . . .
RISC Technology
Pipelining
Superscalar Processors
Cache Memory
Memory Hierarchy
Parallel Computers – . . .
Flynn’s Classification . . .
Memory Access . . .
Parallelization
The Programming . . .
MPI Messages
Programming with MPI
Load Distribution
Designing Load . . .
Classification of . . .
Examples of LD- . . .
Performance Evaluation
Page 15 of 18
Introduction to Scientific Computing
9. ImplementationMiriam Mehl
15. Designing Load Distribution
• Which are the primary objectives?
– optimization of system loador application runtime?
– placementof new processes or migration of running pro-cesses?
• Which is the level of integration?
– Who initiates actions (measure load, chose strategy)?
* application program
* runtime system
* OS?
• Any special features of the application to be considered?
– restrictions in allocation process-to-processor frequent inS.C.
• Which units shall be distributed or displaced?
Implementation: . . .
RISC Technology
Pipelining
Superscalar Processors
Cache Memory
Memory Hierarchy
Parallel Computers – . . .
Flynn’s Classification . . .
Memory Access . . .
Parallelization
The Programming . . .
MPI Messages
Programming with MPI
Load Distribution
Designing Load . . .
Classification of . . .
Examples of LD- . . .
Performance Evaluation
Page 15 of 18
Introduction to Scientific Computing
9. ImplementationMiriam Mehl
15. Designing Load Distribution
• Which are the primary objectives?
– optimization of system loador application runtime?
– placementof new processes or migration of running pro-cesses?
• Which is the level of integration?
– Who initiates actions (measure load, chose strategy)?
* application program
* runtime system
* OS?
• Any special features of the application to be considered?
– restrictions in allocation process-to-processor frequent inS.C.
• Which units shall be distributed or displaced?
– whole processes (coarse grain)
Implementation: . . .
RISC Technology
Pipelining
Superscalar Processors
Cache Memory
Memory Hierarchy
Parallel Computers – . . .
Flynn’s Classification . . .
Memory Access . . .
Parallelization
The Programming . . .
MPI Messages
Programming with MPI
Load Distribution
Designing Load . . .
Classification of . . .
Examples of LD- . . .
Performance Evaluation
Page 15 of 18
Introduction to Scientific Computing
9. ImplementationMiriam Mehl
15. Designing Load Distribution
• Which are the primary objectives?
– optimization of system loador application runtime?
– placementof new processes or migration of running pro-cesses?
• Which is the level of integration?
– Who initiates actions (measure load, chose strategy)?
* application program
* runtime system
* OS?
• Any special features of the application to be considered?
– restrictions in allocation process-to-processor frequent inS.C.
• Which units shall be distributed or displaced?
– whole processes (coarse grain)
– threads (fine grain)
Implementation: . . .
RISC Technology
Pipelining
Superscalar Processors
Cache Memory
Memory Hierarchy
Parallel Computers – . . .
Flynn’s Classification . . .
Memory Access . . .
Parallelization
The Programming . . .
MPI Messages
Programming with MPI
Load Distribution
Designing Load . . .
Classification of . . .
Examples of LD- . . .
Performance Evaluation
Page 15 of 18
Introduction to Scientific Computing
9. ImplementationMiriam Mehl
15. Designing Load Distribution
• Which are the primary objectives?
– optimization of system load or of application runtime?
– placement of new processes or migration of running processes?
• Which is the level of integration?
– Who initiates actions (measure load, choose strategy)?
* application program
* runtime system
* operating system?
• Any special features of the application to be considered?
– restrictions in the process-to-processor allocation are frequent in Scientific Computing
• Which units shall be distributed or displaced?
– whole processes (coarse grain)
– threads (fine grain)
– objects or data (typical for simulation applications)
16. Classification of Strategies
• origin of the idea:
from physics (diffusion model), from combinatorics (graph theory), from economics (bidding, brokerage)
• for networks, for bus topologies
• data represented as grids, trees, sets, or . . .
• distribution mechanisms:
– load handed over to neighbouring nodes only?
– distribution of new units only, or also migration of running ones (and how)?
• flow of information:
to whom is load communicated, and where does the information come from?
• coordination:
who makes decisions? autonomous/cooperative/competitive?
• algorithms:
who initiates measures? adaptivity? are costs relevant? evaluation?
17. Examples of LD-Strategies
• diffusion model:
permanent balancing process between neighbours
• bidding model:
supply and demand, establishment of some market
• broker model:
– especially for heterogeneous hierarchical topologies, scalable
– broker with partial knowledge; budget-based decision whether to process locally or to look for better offers
– prices for use of resources and for brokerage
• matching model:
construct a matching in the topology graph, balance along its edges
• balanced allocation, space-filling curves, . . .
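The diffusion model can be sketched in a few lines: every node repeatedly exchanges a fixed fraction of the load difference with each neighbour, so the loads relax towards the average like heat in a rod. The ring topology with 8 nodes and the diffusion parameter alpha are assumptions made for this illustration.

```python
def diffusion_step(load, alpha=0.25):
    """One balancing step on a ring: node i exchanges the fraction
    alpha of the load difference with each of its two neighbours."""
    n = len(load)
    new = list(load)
    for i in range(n):
        left, right = load[(i - 1) % n], load[(i + 1) % n]
        new[i] += alpha * (left - load[i]) + alpha * (right - load[i])
    return new

# all load initially on one node; total load is conserved in every step
load = [16.0] + [0.0] * 7
for _ in range(100):
    load = diffusion_step(load)
print(load)  # every node approaches the average load 2.0
```

This is why diffusion schemes hand load to neighbouring nodes only: each step is purely local, yet the iteration converges to a globally balanced state.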
18. Performance Evaluation
• performance evaluation of algorithms and computers
• average parallelism (for p processors):
A(p) = (sum of processor runtimes) / (parallel runtime)
• speedup S:
S = (sequential runtime) / (parallel runtime)
• efficiency E:
E = S/p
• Amdahl's Law:
assumption: each program has some part 0 < seq < 1 that can only be treated in a sequential way; then
S ≤ 1 / (seq + (1 − seq)/p) ≤ 1/seq
• another important quantity: CCR (communication-to-computation ratio)
– CCR often increases with increasing p at constant problem size (example: iterative methods for Ax = b)
– therefore: do not compare speedups for different p at the same, constant problem size
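The quantities above can be checked numerically. The following is a small sketch (function names are chosen for the example); note how the Amdahl bound saturates at 1/seq no matter how large p becomes.

```python
def speedup(t_seq, t_par):
    """S = sequential runtime / parallel runtime."""
    return t_seq / t_par

def efficiency(t_seq, t_par, p):
    """E = S / p for p processors."""
    return speedup(t_seq, t_par) / p

def amdahl_bound(seq, p):
    """Amdahl's Law: upper bound on the speedup for a program with
    sequential fraction 0 < seq < 1 run on p processors."""
    return 1.0 / (seq + (1.0 - seq) / p)

# with seq = 0.1, even arbitrarily many processors give at most 1/seq = 10
for p in (2, 8, 64, 1024):
    print(p, amdahl_bound(0.1, p))
```

The saturation at 1/seq is the practical message of Amdahl's Law: shrinking the sequential part of a code matters more than adding processors.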