A Portable Runtime Interface For Multi-Level Memory Hierarchies
Mike Houston, Ji-Young Park, Manman Ren, Timothy Knight, Kayvon Fatahalian,
Alex Aiken, William Dally, Pat Hanrahan
Stanford University
Mike Houston - Stanford University Pervasive Parallelism Lab – PPoPP 2008
The Problem
Lots of different architectures
– Shared memory
– Distributed memory
– Exposed communication
– Disk systems
Each architecture has its own programming system
Composed machines difficult to program and manage
– Different mechanisms for each architecture
Previous runtimes and languages limited
– Designed for a single architecture
– Struggle with memory hierarchies
Shared Memory Machines
AMD Barcelona | SGI Altix
Distributed Memory Machines
MareNostrum: 2,282 nodes
ASC Blue Gene/L: 65,536 nodes
Exposed Communication Architectures
STI Cell processor: 90 nm | ~220 mm^2 | ~100 W | ~200 GFLOPS
Complex machines
Cluster of SMPs?
– MPI?
Cluster of Cell processors?
– MPI + ALF/Cell SDK
Cluster of clusters of SMPs with Cell accelerators?
What about disk systems?
– Disk I/O often second class
Previous Work
APIs
– MPI/MPI-2
– Pthreads
– OpenMP (Dagum et al. 1998)
– GASNet (Bonachea et al. 2002)
– Charm (Kale et al. 1993)
– SVM (Labonte et al. 2004)
– Cell SDK (IBM 2006)
– CellSs (Bellens et al. 2006)
– HTA (Bikshandi et al. 2006)
– …
Languages
– Co-Array Fortran (Numrich et al. 1994)
– Titanium (Yelick et al. 1998)
– UPC (Carlson et al. 1999)
– ZPL (Deitz et al. 2004)
– Chapel (Callahan et al. 2004)
– X10 (Charles et al. 2005)
– Brook (Buck et al. 2004)
– Cilk (Blumofe et al. 1995)
– …
Contributions
Uniform scheme for explicitly describing memory hierarchies
– Capture common traits important for performance
– Allow composition of memory hierarchies
Simple, portable API for many parallel machines
– Mechanism independence for communication and management of parallel resources
– Few entry points
– Efficient execution
Efficient compiler target
– Compiler can concentrate on global optimizations
– Runtime manages mechanisms
The Sequoia System
Programming language for memory hierarchies
– Fatahalian et al. (Supercomputing 2006)
– Adapts Parallel Memory Hierarchies programming abstraction
Compiler for exposed communication architectures
– Knight et al. (PPoPP 2007)
– Takes in Sequoia program, machine file, and mapping file
– Generates task calls and data transfers
– Performs large granularity optimizations
Portable runtime system
– Houston et al. (PPoPP 2008)
– Target for Sequoia compiler
– Serves as abstract machine
http://sequoia.stanford.edu
Abstract Machine Model
Parallel Memory Hierarchy Model (PMH)
Abstract machines as trees of memories (each memory is an address space) (Alpern et al. 1995)
[Figure: a root memory above four child memories, each child memory attached to a CPU]
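A minimal sketch of the tree-of-memories idea, assuming a hypothetical `MemoryNode` type that is not part of the Sequoia runtime. It only shows that a machine can be described as a tree whose leaves are the memories paired with processors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type: one memory (address space) per tree node.
struct MemoryNode {
    std::string name;                                   // e.g. "Main Memory", "LS"
    std::vector<std::unique_ptr<MemoryNode>> children;  // child memories
    bool isLeaf() const { return children.empty(); }
};

// Build a two-level machine: one root memory with `fanout` leaf memories.
std::unique_ptr<MemoryNode> makeTwoLevel(const std::string& root,
                                         const std::string& leaf,
                                         std::size_t fanout) {
    auto node = std::make_unique<MemoryNode>();
    node->name = root;
    for (std::size_t i = 0; i < fanout; ++i) {
        auto child = std::make_unique<MemoryNode>();
        child->name = leaf;
        node->children.push_back(std::move(child));
    }
    return node;
}

// Count leaf memories (in the model, each leaf is paired with a processor).
std::size_t countLeaves(const MemoryNode& n) {
    if (n.isLeaf()) return 1;
    std::size_t total = 0;
    for (const auto& c : n.children) total += countLeaves(*c);
    return total;
}

// Number of levels in the hierarchy, counting the root as level 1.
std::size_t depth(const MemoryNode& n) {
    std::size_t d = 0;
    for (const auto& c : n.children) d = std::max(d, depth(*c));
    return d + 1;
}
```

For example, `makeTwoLevel("Main Memory", "LS", 8)` describes a Cell-like machine with eight local stores under one main memory.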
Parallel Memory Hierarchy Model
[Figure: multi-level tree of memories with eight CPUs at the leaves]
Example Mappings
[Figure: four example mappings of machines onto memory trees]
– Cell Processor: Main Memory → 8 × LS, each with an SPE
– Cluster: Aggregate Node Memory (Virtual Level) → Node Memory … → CPU
– Disk: Disk → Node Memory → CPU
– Shared Memory Multi-processor: Main Memory → L2 … → CPU
Multi-Level Configurations
[Figure: three multi-level configurations]
– Disk + Playstation 3: Disk → Main Memory → 6 × LS, each with an SPE
– Cluster of Playstation 3s: Aggregate Cluster Memory → Main Memory … → 6 × LS, each with an SPE
– Cluster of SMPs: Aggregate Cluster Memory → Node Memory … → L2 → CPU
Abstraction Rules
Tree of nodes
1 control thread per node
1 memory per node
Threads can:
– Transfer in bulk from/to parent memory asynchronously
– Wait for transfers from/to parent to complete
– Allocate data
– Only access their memory directly
– Transfer control to child node(s)
– Non-leaf threads only operate to move data and control
– Synchronize with siblings
[Figure: a parent memory with its CPU above two child memories, each with its own CPU]
Similar to Space Limited Procedures (Alpern et al. 1995)
Portable Runtime Interface
Requirements
Resource allocation
– Data allocation and naming
– Setup parallel resources
Explicit bulk asynchronous communication
– Transfer lists
– Transfer commands
Parallel execution
– Launch tasks on children
– Asynchronous
Synchronization
– Make sure tasks/transfers complete before continuing
Runtime isolation
– No direct knowledge of other runtimes
Graphical Runtime Representation
[Figure: a runtime sits between the Memory/CPU pair at level i+1 and the Memory/CPU pairs of children 1 … N at level i]
Top Interface
// create and free runtime
Runtime(TaskTable table, int numChildren);
virtual ~Runtime();
// allocate and deallocate arrays
virtual Array_t* AllocArray(Size_t elmtSize, int dimensions, Size_t* dim_sizes, ArrayDesc_t descriptor, int alignment) = 0;
virtual void FreeArray(Array_t* array) = 0;
// array naming
virtual void AddArray(Array_t array);
virtual Array_t GetArray(ArrayDesc_t descriptor);
virtual void RemoveArray(ArrayDesc_t descriptor);
// launch and synchronize on tasks
virtual TaskHandle_t CallChildTask(TaskID_t taskid, ChildID_t start, ChildID_t end) = 0;
virtual void WaitTask(TaskHandle_t handle) = 0;
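The launch-then-wait pattern on the parent side can be sketched with a toy stand-in. `SerialRuntime` is hypothetical, the type aliases are assumptions (the slide does not define them), and treating `end` as inclusive is also an assumption. The toy runs each task immediately on the calling thread, where a real runtime would run it asynchronously on the children.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Assumed stand-ins for the slide's types.
using TaskID_t = int;
using ChildID_t = int;
using TaskHandle_t = int;
using TaskTable = std::vector<std::function<void(ChildID_t)>>;

// Hypothetical toy runtime: executes child tasks serially, in the caller.
class SerialRuntime {
public:
    SerialRuntime(TaskTable table, int numChildren)
        : table_(std::move(table)), numChildren_(numChildren) {}

    // Run task `taskid` on children start..end (inclusive, by assumption).
    TaskHandle_t CallChildTask(TaskID_t taskid, ChildID_t start, ChildID_t end) {
        for (ChildID_t c = start; c <= end && c < numChildren_; ++c)
            table_[taskid](c);
        return nextHandle_++;
    }

    // Nothing to wait for here: the work finished inside CallChildTask.
    void WaitTask(TaskHandle_t) {}

private:
    TaskTable table_;
    int numChildren_;
    TaskHandle_t nextHandle_ = 0;
};

// Parent-side pattern: launch one task across all children, then wait.
int sumOfChildIds(int numChildren) {
    int sum = 0;
    TaskTable table = {[&sum](ChildID_t id) { sum += id; }};
    SerialRuntime rt(table, numChildren);
    TaskHandle_t h = rt.CallChildTask(0, 0, numChildren - 1);
    rt.WaitTask(h);
    return sum;
}
```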
Bottom Interface
// array naming
virtual Array_t* GetArray(ArrayDesc_t descriptor);
// create, free, invoke, and synchronize on transfer lists
virtual XferList* CreateXferList(Array_t* dst, Array_t* src, Size_t* dst_idx, Size_t* src_idx, Size_t* lengths, int count) = 0;
virtual void FreeXferList(XferList* list) = 0;
virtual XferHandle_t Xfer(XferList* list) = 0;
virtual void WaitXfer(XferHandle_t handle) = 0;
// get number of children in bottom level, get local processor id, and barrier
int GetSiblingCount();
int GetID();
virtual void Barrier(ChildID_t start, ChildID_t stop) = 0;
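A transfer list can be read as a batch of (destination offset, source offset, length) triples. The sketch below is a toy, not the runtime's code: the free functions mirror the slide's method names, `Array_t` is an assumed flat float buffer, offsets are assumed to be in elements, and `Xfer` copies synchronously with `memcpy` where a real runtime would start an asynchronous transfer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Assumed stand-ins; the slide does not define these types.
using Size_t = std::size_t;
using XferHandle_t = int;
struct Array_t { std::vector<float> data; };

// A batch of copies: element offsets into dst and src, plus lengths.
struct XferList {
    Array_t* dst;
    const Array_t* src;
    std::vector<Size_t> dst_idx, src_idx, lengths;
};

XferList* CreateXferList(Array_t* dst, Array_t* src, const Size_t* dst_idx,
                         const Size_t* src_idx, const Size_t* lengths, int count) {
    auto* list = new XferList{dst, src, {}, {}, {}};
    list->dst_idx.assign(dst_idx, dst_idx + count);
    list->src_idx.assign(src_idx, src_idx + count);
    list->lengths.assign(lengths, lengths + count);
    return list;
}

void FreeXferList(XferList* list) { delete list; }

// Toy Xfer: performs every copy right away instead of asynchronously.
XferHandle_t Xfer(XferList* list) {
    for (std::size_t i = 0; i < list->lengths.size(); ++i)
        std::memcpy(list->dst->data.data() + list->dst_idx[i],
                    list->src->data.data() + list->src_idx[i],
                    list->lengths[i] * sizeof(float));
    return 0;
}

// Nothing pending in the toy version.
void WaitXfer(XferHandle_t) {}

// Child-side pattern: gather two blocks of the parent array into local memory.
std::vector<float> gatherExample() {
    Array_t parent{{0.f, 1.f, 2.f, 3.f, 4.f, 5.f, 6.f, 7.f}};
    Array_t local{std::vector<float>(4, 0.0f)};
    Size_t dst_idx[] = {0, 2}, src_idx[] = {1, 5}, lengths[] = {2, 2};
    XferList* list = CreateXferList(&local, &parent, dst_idx, src_idx, lengths, 2);
    WaitXfer(Xfer(list));
    FreeXferList(list);
    return local.data;
}
```

Building the list once and reusing it across iterations is the point of separating `CreateXferList` from `Xfer`.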
Compiler/Runtime Interaction
Compiler initializes runtime for each pair of memories in the hierarchy
Initialize runtime for root memory
– Machine description specifies runtime to initialize (SMP, Cluster, Disk, Cell, etc.)
If more levels in hierarchy
– Initialize runtimes for child levels
Runtime cleanup is inverse
– Call exit on children, wait, cleanup local resources, return to parent
Control of hierarchy via task calls
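The top-down initialization and bottom-up cleanup order can be sketched as a pair of recursive functions. `LevelDesc`, `RuntimeNode`, and the log strings are hypothetical; a real machine file and runtime constructor would carry far more information.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical machine description: one runtime kind per pair of levels.
struct LevelDesc {
    std::string runtimeKind;          // e.g. "Cluster", "SMP", "Cell", "Disk"
    std::vector<LevelDesc> children;  // empty at the bottom of the hierarchy
};

// Hypothetical handle for an initialized runtime.
struct RuntimeNode {
    std::string kind;
    std::vector<std::unique_ptr<RuntimeNode>> children;
};

// Initialize this level's runtime first, then the child levels (top-down).
std::unique_ptr<RuntimeNode> initRuntimes(const LevelDesc& d,
                                          std::vector<std::string>& log) {
    auto rt = std::make_unique<RuntimeNode>();
    rt->kind = d.runtimeKind;
    log.push_back("init " + d.runtimeKind);
    for (const auto& c : d.children)
        rt->children.push_back(initRuntimes(c, log));
    return rt;
}

// Cleanup is the inverse: exit the children first, then this level (bottom-up).
void shutdownRuntimes(std::unique_ptr<RuntimeNode> rt,
                      std::vector<std::string>& log) {
    for (auto& c : rt->children) shutdownRuntimes(std::move(c), log);
    log.push_back("exit " + rt->kind);
}

// Example: a cluster of SMPs is a Cluster runtime composed over an SMP runtime.
std::vector<std::string> clusterOfSmpsLog() {
    LevelDesc machine{"Cluster", {LevelDesc{"SMP", {}}}};
    std::vector<std::string> log;
    shutdownRuntimes(initRuntimes(machine, log), log);
    return log;
}
```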
Runtime Implementations
[Figure: the four runtimes and the memory levels each one spans]
– SMP Runtime: Main Memory → CPU …
– Disk Runtime: Disk → Node Memory → CPU
– Cell Runtime: Main Memory (PowerPC) → 6 × LS, each with an SPE
– Cluster Runtime: Aggregate Cluster Memory → Node Memory … → CPU
SMP Implementation
Pthreads based
– Launch a thread per child
– CallChildTask enqueues work on task queue per child
Data transfers
– Memory copy from source to destination
– Optimizations
• Pass reference to parent array
• Feedback to compiler to remove transfers
– Machine file information
No processor at parent level
– Processor 0 represents the parent node and a child node
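The thread-per-child, queue-per-child structure can be sketched as follows. This is an illustration, not the Sequoia SMP runtime: it uses `std::thread` in place of raw Pthreads, and `ChildWorkers` and `runOnChildren` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// One worker thread per child, each draining its own task queue.
class ChildWorkers {
public:
    explicit ChildWorkers(int numChildren) : queues_(numChildren) {
        for (int i = 0; i < numChildren; ++i)
            threads_.emplace_back([this, i] { run(i); });
    }

    // Tell every worker to finish its queue, then join all threads.
    ~ChildWorkers() {
        for (auto& q : queues_) {
            { std::lock_guard<std::mutex> lk(q.m); q.done = true; }
            q.cv.notify_one();
        }
        for (auto& t : threads_) t.join();
    }

    // Analogue of CallChildTask for one child: enqueue the work and return.
    void enqueue(int child, std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lk(queues_[child].m);
            queues_[child].tasks.push(std::move(task));
        }
        queues_[child].cv.notify_one();
    }

private:
    struct Queue {
        std::mutex m;
        std::condition_variable cv;
        std::queue<std::function<void()>> tasks;
        bool done = false;
    };

    void run(int i) {
        Queue& q = queues_[i];
        for (;;) {
            std::unique_lock<std::mutex> lk(q.m);
            q.cv.wait(lk, [&q] { return q.done || !q.tasks.empty(); });
            if (q.tasks.empty()) return;  // shutdown requested and queue drained
            auto task = std::move(q.tasks.front());
            q.tasks.pop();
            lk.unlock();                  // run the task without holding the lock
            task();
        }
    }

    std::vector<Queue> queues_;
    std::vector<std::thread> threads_;
};

// Launch one task on each child; the destructor acts as the wait.
int runOnChildren(int numChildren) {
    std::atomic<int> sum{0};
    {
        ChildWorkers workers(numChildren);
        for (int c = 0; c < numChildren; ++c)
            workers.enqueue(c, [&sum, c] { sum += c + 1; });
    }
    return sum.load();
}
```

A per-child queue keeps enqueueing cheap for the parent and avoids contention between children on a single shared queue.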
Disk Implementation
No processor at top level
– Host CPU represents the parent node and child node
Implementation
– Allocation
• Open file on disk
– Data transfers
• Use Async I/O API to read/write data to disk
Cell Implementation
Overlay handling
– At runtime creation, load overlay loader into SPEs
– On task call
• PowerPC notifies SPE of function to load
• SPE loads overlay and executes
Data alignment
– All data allocated to 128 byte boundaries
– Multi-dimensional arrays padded to maintain alignment for dimensions
Heavy use of mailboxes
– SPE synchronization
– PPE<->SPE communication
Data transfers
– DMA lists
– DMA commands
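The alignment rule reduces to rounding sizes up to a multiple of 128 bytes. The sketch below assumes that "padded to maintain alignment for dimensions" means padding each row so the next row also starts on a 128-byte boundary; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kAlign = 128;  // byte boundary named on the slide

// Round `bytes` up to the next multiple of kAlign.
constexpr std::size_t roundUp(std::size_t bytes) {
    return (bytes + kAlign - 1) / kAlign * kAlign;
}

// Padded row pitch in bytes, so that every row starts on a 128-byte boundary
// when the array base itself is 128-byte aligned.
constexpr std::size_t paddedRowBytes(std::size_t cols, std::size_t elemSize) {
    return roundUp(cols * elemSize);
}

// Total allocation size for a rows x cols array with padded rows.
constexpr std::size_t paddedArrayBytes(std::size_t rows, std::size_t cols,
                                       std::size_t elemSize) {
    return rows * paddedRowBytes(cols, elemSize);
}
```

A row of 100 floats occupies 400 bytes, so under this assumption it is padded to 512 bytes.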
Cluster Implementation
No processor at top level
– Node 0 represents the parent node and a child node
Virtual level
– Distributed Shared Memory (DSM)
• Many software and hardware solutions
• IVY (Li et al. 1988), Mether (Minnich et al. 1991), …
• Alewife (Agarwal et al. 1991), FLASH (Kuskin et al. 1994), …
• Complex coherence and consistency mechanisms
– Overkill for our needs
• No sharing between sibling processors = no coherence requirements
• Sequoia enforces consistency by disallowing aliasing on writes to parent memory
Cluster Implementation, cont.
Virtual level implementation
– Interval trees
• Represent arrays as covering a range of 0->N elements
• Each node can have any sub-portion (interval) of the range
• Tree structure allows fast lookup of all nodes that cover the interval of interest
• Allows complex data distributions
• See CLRS Section 14.3 for more detailed information
– Array allocation
• Define distribution as interval tree and broadcast
• Allocate data for intervals and register data pointers with MPI-2 (MPI_Win_create)
• Align data to page boundaries for fast network transfers
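The lookup of all nodes that cover an interval can be sketched with a small static interval tree. This is a simplified variant of the CLRS augmented tree: it is built once from sorted intervals and stored implicitly in an array, which is an assumption about layout and not the runtime's actual data structure. `Interval`, `IntervalTree`, and `owners` are hypothetical names, and ranges are half-open.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open element range [lo, hi) held by cluster node `owner`.
struct Interval { std::size_t lo, hi; int owner; };

class IntervalTree {
public:
    explicit IntervalTree(std::vector<Interval> iv)
        : iv_(std::move(iv)), maxHi_(iv_.size()) {
        std::sort(iv_.begin(), iv_.end(),
                  [](const Interval& a, const Interval& b) { return a.lo < b.lo; });
        build(0, iv_.size());
    }

    // Owners of every stored interval that overlaps [lo, hi), in order of lo.
    std::vector<int> owners(std::size_t lo, std::size_t hi) const {
        std::vector<int> out;
        query(0, iv_.size(), lo, hi, out);
        return out;
    }

private:
    // The subtree over sorted slots [b, e) is rooted at the middle slot;
    // maxHi_[m] holds the largest `hi` anywhere in that subtree.
    std::size_t build(std::size_t b, std::size_t e) {
        if (b >= e) return 0;
        std::size_t m = b + (e - b) / 2;
        maxHi_[m] = std::max({iv_[m].hi, build(b, m), build(m + 1, e)});
        return maxHi_[m];
    }

    void query(std::size_t b, std::size_t e, std::size_t lo, std::size_t hi,
               std::vector<int>& out) const {
        if (b >= e) return;
        std::size_t m = b + (e - b) / 2;
        if (maxHi_[m] <= lo) return;  // nothing in this subtree reaches past lo
        query(b, m, lo, hi, out);     // left subtree
        if (iv_[m].lo < hi && lo < iv_[m].hi) out.push_back(iv_[m].owner);
        if (iv_[m].lo < hi) query(m + 1, e, lo, hi, out);  // right subtree starts at or after iv_[m].lo
    }

    std::vector<Interval> iv_;
    std::vector<std::size_t> maxHi_;
};
```

With a block distribution of 300 elements over three nodes, a request for elements 50 to 149 resolves to nodes 0 and 1, which are the nodes a transfer would then have to contact.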
Cluster Implementation, cont.
Virtual level implementation
– Data transfer
• Compare requested data range against interval tree
• Read from parent: issue MPI_Get to any nodes with matching intervals (MPI_LOCK_SHARED)
• Write to parent: issue MPI_Put to all nodes with matching intervals (MPI_LOCK_EXCLUSIVE)
– Optimizations
• If requested range is local, return reference to parent memory
Runtime Composition
[Figure: runtimes composed for three multi-level machines]
– Disk + Playstation 3: Disk Runtime over Cell Runtime
– Cluster of Playstation 3s: Cluster Runtime over Cell Runtime
– Cluster of SMPs: Cluster Runtime over SMP Runtime
Evaluation
Evaluation Metrics
Can we run unmodified Sequoia code on all these systems?
How much overhead do we have in our abstraction?
How well can we utilize machine resources?
– Computation
– Bandwidth
Is our application performance competitive with best available implementations?
Can we effectively compose runtimes for more complex systems?
Sequoia Benchmarks
Linear Algebra: BLAS Level 1 SAXPY, Level 2 SGEMV, and Level 3 SGEMM benchmarks
Conv2D: 2D single precision convolution with 9x9 support (non-periodic boundary constraints)
FFT3D: Complex single precision FFT
Gravity: 100 time steps of N-body stellar dynamics simulation (N^2), single precision
HMMER: Fuzzy protein string matching using HMM evaluation (Horn et al. SC2005 paper)
Best available implementations used as leaf task
Single Runtime System Configurations
Scalar
– 2.4 GHz Intel Pentium4 Xeon, 1GB
8-way SMP
– 4 dual-core 2.66GHz Intel P4 Xeons, 8GB
Disk
– 2.4 GHz Intel P4, 160GB disk, ~50MB/s from disk
Cluster
– 16 Intel 2.4GHz P4 Xeons, 1GB/node, Infiniband interconnect (780MB/s)
Cell
– 3.2 GHz IBM Cell blade (1 Cell – 8 SPE), 1GB
PS3
– 3.2 GHz Cell in Sony Playstation 3 (6 SPE), 256MB (160MB usable)
System Utilization
[Figure: percentage of runtime (0 to 100) for SAXPY, SGEMV, SGEMM, CONV2D, FFT3D, GRAVITY, and HMMER on SMP | Disk | Cluster | Cell | PS3]
Resource Utilization – IBM Cell
[Figure: resource utilization (%), 0 to 100, showing bandwidth utilization and compute utilization]
Single Runtime Configurations – GFlop/s
Benchmark  Scalar  SMP  Disk   Cluster  Cell  PS3
SAXPY      0.3     0.7  0.007  4.9      3.5   3.1
SGEMV      1.1     1.7  0.04   12       12    10
SGEMM      6.9     45   5.5    91       119   94
CONV2D     1.9     7.8  0.6    24       85    62
FFT3D      0.7     3.9  0.05   5.5      54    31
GRAVITY    4.8     40   3.7    68       97    71
HMMER      0.9     11   0.9    12       12    7.1
SGEMM Performance
Cluster
– Intel Cluster MKL: 101 GFlop/s
– Sequoia: 91 GFlop/s
SMP
– Intel MKL: 44 GFlop/s
– Sequoia: 45 GFlop/s
FFT3D Performance
Cell
– Mercury: 58 GFlop/s
– FFTW 3.2 alpha 2: 35 GFlop/s
– Sequoia: 54 GFlop/s
Cluster
– FFTW 3.2 alpha 2: 5.3 GFlop/s
– Sequoia: 5.5 GFlop/s
SMP
– FFTW 3.2 alpha 2: 4.2 GFlop/s
– Sequoia: 3.9 GFlop/s
Best Known Implementations
HMMer
– ATI X1900XT: 9.4 GFlop/s (Horn et al. 2005)
– Sequoia Cell: 12 GFlop/s
– Sequoia SMP: 11 GFlop/s
Gravity
– Grape-6A: 2 billion interactions/s (Fukushige et al. 2005)
– Sequoia Cell: 4 billion interactions/s
– Sequoia PS3: 3 billion interactions/s
Multi-Runtime System Configurations
Cluster of SMPs
– Four 2-way, 3.16GHz Intel Pentium 4 Xeons connected via GigE (80MB/s peak)
Disk + PS3
– Sony Playstation 3 bringing data from disk (~30MB/s)
Cluster of PS3s
– Two Sony Playstation 3s connected via GigE (60MB/s peak)
Multi-Runtime Utilization
[Figure: percentage of runtime (0 to 100) for SAXPY, SGEMV, SGEMM, CONV2D, FFT3D, GRAVITY, and HMMER on Cluster of SMPs | Disk + PS3 | Cluster of PS3s]
Cluster of PS3 Issues
[Figure: percentage of runtime (0 to 100) for SAXPY, SGEMV, SGEMM, CONV2D, FFT3D, GRAVITY, and HMMER on Cluster of SMPs | Disk + PS3 | Cluster of PS3s]
Cluster of PS3 Issues
[Figure: percentage of runtime (0 to 100) for SAXPY and SGEMV on Cluster of PS3s | PS3]
Multi-Runtime Configurations - GFlop/s
Benchmark  Cluster-SMP  Disk+PS3  PS3 Cluster
SAXPY      1.9          0.004     5.3
SGEMV      4.4          0.014     15
SGEMM      48           3.7       30
CONV2D     4.8          0.48      19
FFT3D      1.1          0.05      0.36
GRAVITY    50           66        119
HMMER      14           8.3       13
Conclusion
Summary
Uniform runtime interface for multi-level memory hierarchies
– Horizontal portability
• SMP, cluster, disk, Cell, PS3
– Complex machines supported through composition
• Cluster of SMPs, disk + PS3, cluster of PS3s
– Provides mechanism independence for communication and thread management
Efficient abstraction for multiple machines
– Maximize machine resource utilization
– Low overhead
– Competitive performance
Simple interface
– <20 entry points
Code portability
– Unmodified code running on 9 system configurations
Demonstrates viability of the Parallel Memory Hierarchies model
Future Work
Higher level functions in the runtime?
Load balancing?
Running on more complex machines?
Combine with transactional memory?
What machines can be mapped as a tree?
Questions?
Acknowledgements:
Intel Graduate Fellowship Program
DOE ASC
IBM
LANL