CPEG421-2001-F-Topic-3-II 1
Topic 3 -- II: System Software Fundamentals:
Multithreaded Execution Models, Virtual Machines
and Memory Models
Guang R. Gao
ACM Fellow and IEEE Fellow
Endowed Distinguished Professor
Electrical & Computer Engineering
University of Delaware
Outline
• An introduction to parallel program execution models
• Coarse-grain vs. fine-grain multithreading
• Evolution of fine-grain multithreaded program execution models
• Memory and synchronization models
• Fine-grain multithreaded execution and virtual machine models for peta-scale computing: a case study on HTMT/EARTH
Terminology Clarification
• Parallel Model of Computation
  – Parallel Models for Algorithm Designers
  – Parallel Models for System Designers
• Parallel Programming Models
• Parallel Execution Models
• Parallel Architecture Models
System Characterization
Questions:
Q1: What characteristics of a computational system are required …
Q2: The diversity of existing and potential multi-core architectures…
Response:
R1: An important characteristic of such a compiler should include, at both chip level and system level, a program execution model that should at least include the specification and API
Gao, ECCD Workshop, Washington D.C., Nov. 2007
What Does Program Execution Model (PXM) Mean?
• The notion of PXM: the program execution model (PXM) is the basic low-level abstraction of the underlying system architecture upon which our programming model, compilation strategy, runtime system, and other software components are developed.
• The PXM (and its API) serves as an interface between the architecture and the software.
Program Execution Model (PXM) – Cont’d
Unlike an instruction set architecture (ISA) specification, which usually focuses on lower-level details (such as instruction encoding and organization of registers for a specific processor), the PXM refers to machine organization at a higher level for a whole class of high-end machines as viewed by the users.
Gao et al., 2000
What is your “Favorite” Program Execution Model?
A Generic MIMD Architecture
[Figure: a node containing processors (P) with caches ($), memory, and a communication assist with NIC; one further element is labeled IC]
Node: Processor(s), memory system, plus communication assist (network interface & communication controller)
Full-feature interconnect networks. Packet-switching fabrics. Key: scalable network
Objective: Make efficient use of scarce communication resources, providing high-bandwidth, low-latency communication between nodes at minimum cost and energy
Programming Models for Multi-Processor Systems
• Message Passing Model
  – Multiple address spaces
  – Communication can only be achieved through “messages”
• Shared Memory Model
  – Memory address space is accessible to all
  – Communication is achieved through memory
[Figure: message passing shown as processors with local memories exchanging messages; shared memory shown as processors attached to a global memory]
Comparison
Message Passing
+ Less contention
+ Highly scalable
+ Simplified synchronization: message passing combines synchronization and communication
– But does not mean highly programmable
– Load balancing
– Deadlock prone
– Overhead of small messages
Shared Memory
+ Global shared address space
+ Easy to program (?)
+ No (explicit) message passing (e.g. communication through memory put/get operations)
– Synchronization (memory consistency models, cache models)
– Scalability
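The contrast above can be made concrete with a small sketch. It is only an illustration: Python threads always share one address space, so the message-passing version merely follows the discipline (workers communicate only through a queue), and the function names `sum_with_messages` and `sum_with_shared_memory` are made up for this example.

```python
import queue
import threading


def sum_with_messages(chunks):
    """Message-passing style: workers share no state and send partial sums as messages."""
    inbox = queue.Queue()

    def worker(chunk):
        inbox.put(sum(chunk))  # the message carries both the data and the "I am done" event

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(inbox.get() for _ in chunks)


def sum_with_shared_memory(chunks):
    """Shared-memory style: workers update one shared accumulator, guarded by a lock."""
    total = {"value": 0}
    lock = threading.Lock()

    def worker(chunk):
        partial = sum(chunk)
        with lock:  # explicit synchronization is needed to avoid a data race
            total["value"] += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total["value"]


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5], [6]]
    print(sum_with_messages(data), sum_with_shared_memory(data))
```

In the first version synchronization comes for free with the message; in the second, communication is implicit but the lock has to be added by hand.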
What is a Shared Memory Execution Model?
• Thread Model: a set of rules for creating, destroying and managing threads
• Memory Model: dictates the ordering of memory operations
• Synchronization Model: provides a set of mechanisms to protect from data races
Execution Model
The Thread Virtual Machine
Essential Aspects in User-Level Shared Memory Support?
• Shared address space support and management
• Access control and management
- Memory consistency model (MCM)
- Cache management mechanism
Grand Challenge Problems
• How to build a shared-memory multiprocessor that is scalable both within a multi-core/many-core chip and across a system with many chips?
• How to program and optimize application programs?
Our view: One major obstacle in solving these problems is the memory coherence assumption in today’s hardware-centric memory consistency model.
A Parallel Execution Model
Application Programming Interface (API)
Execution / Architecture Model
Thread Model
Memory Model
Synchronization Model
A Parallel Execution Model: Our Model
Application Programming Interface (API)
With Dataflow Origins
Execution / Architecture Model
Fine-Grained Multithreaded Model
Memory Adaptive / Aware Model
Fine-Grained Synchronization Model
Comment on OS impact?
• Should the compiler be OS-aware too? If so, how?
• Or other alternatives? Compiler-controlled runtime, or compiler-aware kernels, etc.
• Example: software pipelining …
Gao, ECCD Workshop, Washington D.C., Nov. 2007
Outline
• An introduction to multithreaded program execution models
• Coarse-grain vs. fine-grain parallel execution models: a historical overview
• Fine-grain multithreaded program execution models
• Memory and synchronization models
• Fine-grain multithreaded execution and virtual machine models for extreme-scale machines: a case study on HTMT/EARTH
Coarse-Grain Execution Models
The Single Instruction Multiple Data (SIMD) Model
[Figure: pipelined vector unit or array of processors]
The Single Program Multiple Data (SPMD) Model
[Figure: four processors, each paired with a program]
The Data Parallel Model
[Figure: four tasks over one data structure]
Limitations
• Difficult to write unstructured programs: convenient only for problems with regular structured parallelism.
• Limited composability! Inherent limitation of coarse-grain multithreading.
[Figure: alternating compute and communication phases, ending in a question mark]
Dataflow Model of Computation
[Figure, animated over five slides: a dataflow graph with two "+" nodes and one "*" node, with arcs labeled a, b, c, d, e. Token values per frame: (1, 3, 4, 3); (4, 3, 4); (7, 4); (28); and finally a new set (1, 3, 4, 3) together with the result 28.]
Dataflow Software Pipelining
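The firing rule behind the animation can be sketched in a few lines. This is a minimal illustration, not the notation used in the lecture: the graph shape is an assumption, chosen as ((1 + 3) + 3) * 4 because that reading reproduces the token frames 1, 3, 4, 3 → 4, 3, 4 → 7, 4 → 28; the names `Node`, `deliver`, `run` and `example` are hypothetical.

```python
import operator
from collections import deque


class Node:
    """An operator node; it may fire only when every input slot holds a token."""

    def __init__(self, name, op, arity=2):
        self.name = name
        self.op = op
        self.slots = [None] * arity
        self.targets = []  # (node, slot) pairs that receive this node's result

    def enabled(self):
        return all(token is not None for token in self.slots)


def deliver(node, slot, value, ready):
    """Place a token on an input arc and enqueue the node once it is enabled."""
    node.slots[slot] = value
    if node.enabled():
        ready.append(node)


def run(initial_tokens):
    """Fire enabled nodes until none remain; return (firing trace, outputs)."""
    ready, trace, outputs = deque(), [], []
    for node, slot, value in initial_tokens:
        deliver(node, slot, value, ready)
    while ready:
        node = ready.popleft()
        result = node.op(*node.slots)
        node.slots = [None] * len(node.slots)  # firing consumes the input tokens
        trace.append((node.name, result))
        if not node.targets:
            outputs.append(result)
        for target, slot in node.targets:
            deliver(target, slot, result, ready)
    return trace, outputs


def example():
    # Assumed graph: add1 feeds add2, add2 feeds mul.
    add1 = Node("add1", operator.add)
    add2 = Node("add2", operator.add)
    mul = Node("mul", operator.mul)
    add1.targets = [(add2, 0)]
    add2.targets = [(mul, 0)]
    tokens = [(add1, 0, 1), (add1, 1, 3), (add2, 1, 3), (mul, 1, 4)]
    return run(tokens)


if __name__ == "__main__":
    print(example())
```

No program counter appears anywhere: the order of execution falls out of token availability alone, which is also what lets a second set of inputs enter the graph while earlier results are still flowing through it.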
Outline
• An introduction to multithreaded program execution models
• Coarse-grain vs. fine-grain parallel execution models: a historical overview
• Fine-grain multithreaded program execution models
• Memory and synchronization models
• Fine-grain multithreaded execution and virtual machine models for peta-scale machines: a case study on HTMT/EARTH
Coarse-Grain vs. Fine-Grain Multithreading
[Figure: two CPU-plus-memory diagrams, each with a thread unit and an executor locus. Coarse-grain thread, the “family home” model: a single thread. Fine-grain non-preemptive thread, the “hotel” model: a pool of threads.]
[Gao: invited talk at Fran Allen’s Retirement Workshop, 07/2002]
Evolution of Multithreaded Execution and Architecture Models
[Figure: timeline with two tracks, "non-dataflow based" and "dataflow model inspired". Entries shown:]
• CDC 6600 (1964)
• MASA (Halstead, 1986)
• HEP (B. Smith, 1978)
• Cosmic Cube (Seitz, 1985)
• J-Machine (Dally, 1988-93)
• M-Machine (Dally, 1994-98)
• MIT TTDA (Arvind, 1980)
• Manchester (Gurd & Watson, 1982)
• *T/StarT-NG (MIT/Motorola, 1991-)
• SIGMA-I (Shimada, 1988)
• Monsoon (Papadopoulos & Culler, 1988)
• P-RISC (Nikhil & Arvind, 1989)
• EM-5/4/X, RWC-1 (1992-97)
• Iannucci’s (1988-92)
• Flynn’s Processor (1969)
• CHoPP ’77, CHoPP ’87
• TAM (Culler, 1990)
• Tera (B. Smith, 1990-)
• Alewife (Agarwal, 1989-96)
• Cilk (Leiserson)
• LAU (Syre, 1976)
• Eldorado
• CASCADE
• Static Dataflow (Dennis, 1972, MIT)
• Arg-Fetching Dataflow (Dennis, Gao, 1987-88)
• MDFA (Gao, 1989-93)
• MTA (Hum, Theobald, Gao, 94)
• EARTH, CARE (PACT ’95, ISCA ’96, Theobald 99, Marquez 04)
• Others: Multiscalar (1994), SMT (1995), etc.
The Von Neumann-type Processing
[Figure: source code (begin for i = 1 … … endfor end) passes through the compiler to a sequential machine representation, which is loaded onto the CPU / processor]
A Multithreaded Architecture
[Figure: one PE, with links to other PEs]
McGill Data Flow Architecture Model (MDFA)
Argument-flow Principle vs. Argument-fetching Principle
[Figure: node n1 feeding nodes n2 and n3, drawn once for each principle, with store and fetch labels on the nodes]
A Dataflow Program Tuple
Program Tuple = { P-Code . S-Code }
P-Code
N1: x = a + b;
N2: y = c – d;
N3: z = x * y;
S-Code
[Figure: three nodes (labeled n1, n2, n1 in the figure), each annotated with the numbers 2 and 3; operands a, b and c, d appear next to the first two nodes. Unit labels: IPU, ISU.]
The McGill Dataflow Architecture Model
[Figure: a Pipelined Instruction Processing Unit (PIPU) and a Dataflow Instruction Scheduling Unit (DISU) connected by "fire" and "done" signals. DISU labels: Enable Memory & Controller, Signal Processing, Waiting Instructions, Enabled Instructions = PC.]
Important Features
• The pipeline can be kept fully utilized provided that the program has sufficient parallelism.
The Scheduling Memory (Enable)
[Figure: the enable memory of the Dataflow Instruction Scheduling Unit (DISU), drawn as a table of 0/1 entries next to the controller. Legend: 0 = waiting instructions, 1 = enabled instructions. Other labels: Signal Processing, Count Signal(s), Fire, Done.]
Advantages of the McGill Dataflow Architecture Model
• Eliminate unnecessary token copying and transmission overhead
• Instruction scheduling is separated from the main datapath of the processor (e.g. asynchronous, decoupled)
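A small sketch may help show how scheduling can sit outside the datapath. It reuses the N1 to N3 example from the program-tuple slide, but the encoding is an assumption made for illustration: the S-code here is a per-instruction signal count plus a list of instructions to signal, and `pipu_execute` / `run_program` are hypothetical names, not the actual MDFA formats.

```python
import operator
from collections import deque

# P-code: what each instruction computes (destination, operation, operands).
P_CODE = {
    "N1": ("x", operator.add, "a", "b"),
    "N2": ("y", operator.sub, "c", "d"),
    "N3": ("z", operator.mul, "x", "y"),
}

# S-code (assumed encoding): how many "done" signals an instruction must
# receive before it is enabled, and which instructions it signals afterwards.
S_CODE = {
    "N1": {"count": 0, "signals": ["N3"]},
    "N2": {"count": 0, "signals": ["N3"]},
    "N3": {"count": 2, "signals": []},
}


def pipu_execute(name, memory):
    """Processing side: fetch operands from memory, compute, store the result."""
    dest, op, lhs, rhs = P_CODE[name]
    memory[dest] = op(memory[lhs], memory[rhs])


def run_program(memory):
    """Scheduling side: count signals, fire enabled instructions, return firing order."""
    pending = {name: entry["count"] for name, entry in S_CODE.items()}
    enabled = deque(name for name, count in pending.items() if count == 0)
    fired = []
    while enabled:
        name = enabled.popleft()          # "fire"
        pipu_execute(name, memory)
        fired.append(name)                # "done"
        for successor in S_CODE[name]["signals"]:
            pending[successor] -= 1       # count signal
            if pending[successor] == 0:
                enabled.append(successor)
    return fired


if __name__ == "__main__":
    mem = {"a": 1, "b": 3, "c": 9, "d": 2}
    print(run_program(mem), mem["z"])
```

Only signals travel between instructions; the operand values stay in memory and are fetched by the instruction that needs them, so no data tokens are copied along the arcs.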
Von Neumann Threads as Macro Dataflow Nodes
[Figure: instructions 1, 2, 3 through k grouped into a single node]
• A sequence of instructions is “packed” into a macro-dataflow node
• Synchronization is done at the macro-node level
Hybrid Evaluation: “Von Neumann Style Instruction Execution” on the McGill Dataflow Architecture
• Group a “sequence” of dataflow instructions into a “thread” or a macro dataflow node.
• Data-driven synchronization among threads.
• “Von Neumann style sequencing” within a thread.
Advantage: Preserves the parallelism among threads but avoids unnecessary fine-grain synchronization between instructions within a sequential thread.
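The three points above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the McGill implementation: `MacroNode`, `assign`, `run_threads` and the example threads T1 to T3 are all hypothetical, and each thread body is simply a list of Python callables run in order.

```python
from collections import deque


def assign(dest, fn, *srcs):
    """Build one instruction: frame[dest] = fn(frame[src] for each src)."""
    def instr(frame):
        frame[dest] = fn(*(frame[s] for s in srcs))
    return instr


class MacroNode:
    """A thread: instructions run in sequence, synchronized only as a whole."""

    def __init__(self, name, body, sync_count, successors=()):
        self.name = name
        self.body = body
        self.sync_count = sync_count      # signals required before the thread may start
        self.successors = tuple(successors)


def run_threads(nodes, frame):
    """Data-driven scheduling among threads; returns (firing order, signals sent)."""
    by_name = {n.name: n for n in nodes}
    pending = {n.name: n.sync_count for n in nodes}
    ready = deque(n for n in nodes if n.sync_count == 0)
    order, sync_signals = [], 0
    while ready:
        node = ready.popleft()
        for instr in node.body:           # von Neumann sequencing: no per-instruction sync
            instr(frame)
        order.append(node.name)
        for succ in node.successors:      # synchronization only at the macro-node level
            sync_signals += 1
            pending[succ] -= 1
            if pending[succ] == 0:
                ready.append(by_name[succ])
    return order, sync_signals


def example():
    frame = {"a": 1, "b": 3, "c": 9, "d": 2}
    t1 = MacroNode("T1", [assign("x", lambda a, b: a + b, "a", "b"),
                          assign("x2", lambda x: x * 2, "x")], 0, ["T3"])
    t2 = MacroNode("T2", [assign("y", lambda c, d: c - d, "c", "d")], 0, ["T3"])
    t3 = MacroNode("T3", [assign("z", lambda x2, y: x2 * y, "x2", "y")], 2)
    order, sync_signals = run_threads([t1, t2, t3], frame)
    return frame["z"], order, sync_signals


if __name__ == "__main__":
    print(example())
```

Four instructions execute here but only two synchronization signals are sent, because the two instructions inside T1 are ordered by plain sequencing rather than by signals.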
What Do We Get?
• A hybrid architecture model without sacrificing the advantage of fine-grain parallelism! (latency hiding, pipelining support)
A Realization of the Hybrid Evaluation
[Figure: the Pipelined Instruction Processing Unit (PIPU) and Dataflow Instruction Scheduling Unit (DISU) with "fire" and "done" signals, plus a "shortcut" path; instructions 1, 2 through k; label: Von Neumann bit]