
Fuat Keceli, Alexandros Tzannes, George C. Caragea, Rajeev Barua and Uzi Vishkin

University of Maryland, College Park, MD

Toolchain For Programming, Simulating and Studying the XMT Many-Core Architecture

HIPS 2011, 5/20/2011

XMT: eXplicit Multi-Threading (Not Cray XMT)

Available at: http://www.umiacs.umd.edu/users/vishkin/XMT/index.shtml#sw-release

Or search XMTC on SourceForge

Overview

• XMT framework and architecture
• XMTSim: the cycle-accurate simulator of the architecture
• XMT Compiler: optimizing compiler for explicit parallelism
• Importance of the toolchain
• Summary

Explicit Multi-Threading (XMT)

General Purpose Many-Core Computer

Framework and Architecture


Searching for parallel platforms which provide …

• Ease-of-programming
• Competitive performance for both high and low degrees of parallelism
• Manageable power

Explicit Multi-Threading: “Explicitly” (definition)

Parallelism is explicitly exposed by the programmer

HP 4th ed.: “If you want your program to run significantly faster, you're going to have to parallelize it”

XMTC Programming

Execution modes: serial and parallel

start
serial code
spawn(# threads) {
    parallel SPMD code*
}
serial code
halt

* Not MPI type SPMD.
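
As a rough illustration of the serial/parallel/serial structure above, the following Python sketch emulates a spawn block with a thread pool. The helper name `spawn`, its signature, and the sample data are assumptions made for this sketch; real XMTC code is compiled by the XMT toolchain and is not Python.

```python
from concurrent.futures import ThreadPoolExecutor


def spawn(low, high, body):
    """Run body(tid) once for every thread ID in [low, high], then join.

    Hypothetical helper: it mimics the shape of an XMTC spawn block, where
    every virtual thread runs the same code and knows its own ID.
    """
    with ThreadPoolExecutor() as pool:
        # list() waits for all threads, playing the role of the implicit join.
        list(pool.map(body, range(low, high + 1)))


# Serial section: set up the data.
a = [3, 1, 4, 1, 5, 9, 2, 6]
b = [0] * len(a)


def double_element(tid):
    # Parallel section: thread `tid` touches only its own element.
    b[tid] = 2 * a[tid]


spawn(0, len(a) - 1, double_element)

# Serial section resumes only after every thread has finished.
total = sum(b)
```

The serial code after `spawn` can rely on all of `b` being written, which is the property the spawn/join pair provides.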

XMT Architecture

[Figure: XMT architecture, with the hardware used in parallel sections and in serial sections marked]

Why focus on explicitness?

• Automatic parallelization: limited scale & generality
• Irregular parallelism
• Serial coding sometimes hides parallelism (BFS).

A Holistic Approach: Vertical Stack of Programming

Programmer’s workflow

Execution Abstraction [CACM2011]

Algorithmic theory – PRAM

Program (software) – XMTC

Compiler/Run-Time (middle-ware)

Hardware/Simulator


Why is the toolchain important?

“A perfection of means, and confusion of aims, seems to be our main problem.”
Albert Einstein

XMTSim

The Cycle-Accurate Simulator of the XMT Architecture

• Overview and modeled components

• Software structure

• Discrete event vs. discrete time simulation

• Other features


Highlights

• Highly configurable: number of units, clock frequencies, …
• Relatively easy to replace large components (e.g. the ICN)
• Cycle-accuracy validated against the FPGA prototype
• Custom mechanisms for statistics collection, debugging, runtime modifications

Implementation details

• Written in Java (object oriented)
• ~28K lines of code + ~14K lines of inline documentation
• Simulation speed (Intel Xeon server @ 3GHz)
  • 10K – 2M instructions/second
  • Order of 1K cycles/second (1024-core configuration)

Software Structure

[Figure: XMTSim software structure diagram]

Simulation Strategy

Discrete Time vs. Discrete Event Simulation


Discrete Time Simulation

Main loop:

while (still running) {
    Set clock to next cycle
    Execute simulation code for one cycle
}
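
The discrete time main loop can be sketched in a few lines of Python. The `Counter` component and the `max_cycles` stopping rule are hypothetical stand-ins chosen for this sketch; XMTSim itself is written in Java and its classes are not shown on the slides.

```python
class Counter:
    """Toy component (hypothetical): counts the cycles it was stepped."""

    def __init__(self):
        self.ticks = 0

    def step(self, clock):
        # One cycle of simulation code for this component.
        self.ticks += 1


def run_discrete_time(components, max_cycles):
    """Advance the clock one cycle at a time and step every component."""
    clock = 0
    while clock < max_cycles:          # "still running"
        clock += 1                     # set clock to next cycle
        for component in components:   # execute simulation code for one cycle
            component.step(clock)
    return clock


parts = [Counter(), Counter()]
final_clock = run_discrete_time(parts, max_cycles=5)
```

Every component is visited on every cycle, whether or not it has work to do.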


Discrete Event Simulation

[Figure: actors schedule events on the event list; the simulator iterates over the list and wakes up the scheduling actors]

Main loop:

while (still running) {
    Set clock to next event time
    Remove event from queue
    Notify the scheduling actor
}
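
A minimal Python sketch of the discrete event main loop follows, assuming a binary-heap event list and a toy actor that reschedules itself periodically. The names `EventList`, `PeriodicActor`, and `notify` are inventions of this sketch, not XMTSim APIs.

```python
import heapq
import itertools


class EventList:
    """Priority queue of wake-up events, ordered by time."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps insertion order

    def schedule(self, time, actor):
        heapq.heappush(self._heap, (time, next(self._order), actor))

    def pop(self):
        time, _, actor = heapq.heappop(self._heap)
        return time, actor

    def __bool__(self):
        return bool(self._heap)


class PeriodicActor:
    """Toy actor (hypothetical): asks to be woken up every `period` cycles."""

    def __init__(self, name, period, stop_time):
        self.name, self.period, self.stop_time = name, period, stop_time

    def notify(self, now, events, log):
        log.append((now, self.name))
        if now + self.period <= self.stop_time:
            events.schedule(now + self.period, self)


def run_discrete_event(events):
    log = []
    clock = 0
    while events:                         # "still running"
        clock, actor = events.pop()       # set clock to next event time, remove event
        actor.notify(clock, events, log)  # notify the scheduling actor
    return clock, log


events = EventList()
events.schedule(1, PeriodicActor("fast", period=1, stop_time=4))
events.schedule(2, PeriodicActor("slow", period=2, stop_time=4))
final_clock, log = run_discrete_event(events)
```

The clock jumps directly from one event time to the next, so an actor costs nothing while it has no event pending.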


DE vs. DT Simulation

Discrete Time Simulation
  Advantages:
  • Efficient if a lot of work is done for every simulated cycle
  • More compact code for smaller simulations
  Disadvantages:
  • Requires complex case analysis for a large simulator
  • Slow if not all components do work every clock cycle

Discrete Event Simulation
  Advantages:
  • Naturally suitable for an object-oriented structure
  • More flexible in quantization of simulated time
  • Can simulate asynchronous logic
  Disadvantages:
  • Event list operations are expensive
  • Might require more work for emulating one clock cycle

How do I simulate this?


Best of both worlds

• Discrete event for a modular structure
• Heavy-weight action code for every actor (similar to the code for one iteration of DT simulation)

[Figure: Cluster, ICN, … actors schedule events on the event list and are woken up as the simulator iterates over it]
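
The combination can be sketched as follows: a discrete event core wakes up a few large actors, and each wake-up runs one full cycle of that actor's work. The `Cluster` and `Icn` classes below are toy stand-ins with made-up behavior (one request issued or delivered per cycle); they only illustrate the scheduling pattern, not the real XMTSim components.

```python
import heapq
import itertools


class Simulator:
    """Discrete event core: an event list of (time, actor) wake-ups."""

    def __init__(self):
        self.now = 0
        self._heap = []
        self._order = itertools.count()

    def wake_at(self, time, actor):
        heapq.heappush(self._heap, (time, next(self._order), actor))

    def run(self):
        while self._heap:
            self.now, _, actor = heapq.heappop(self._heap)
            actor.cycle(self)


class Cluster:
    """Hypothetical macro-actor: issues one request per cycle while it has any."""

    def __init__(self, icn, requests):
        self.icn, self.pending, self.active_cycles = icn, requests, 0

    def cycle(self, sim):
        # Heavy-weight action code: everything the cluster does in one cycle.
        self.active_cycles += 1
        self.pending -= 1
        self.icn.accept(sim)
        if self.pending > 0:
            sim.wake_at(sim.now + 1, self)  # stay awake only while busy


class Icn:
    """Hypothetical interconnect: delivers one queued packet per cycle."""

    def __init__(self):
        self.queued, self.delivered, self.active_cycles = 0, 0, 0
        self.scheduled = False

    def accept(self, sim):
        self.queued += 1
        if not self.scheduled:
            self.scheduled = True
            sim.wake_at(sim.now + 1, self)

    def cycle(self, sim):
        self.active_cycles += 1
        self.queued -= 1
        self.delivered += 1
        self.scheduled = self.queued > 0
        if self.scheduled:
            sim.wake_at(sim.now + 1, self)


sim = Simulator()
icn = Icn()
cluster = Cluster(icn, requests=3)
sim.wake_at(10, cluster)  # the cluster is idle until cycle 10
sim.run()
```

No code runs for cycles 1 through 9, when neither actor has work, yet each wake-up still looks like one iteration of a discrete time loop for that actor.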


In the paper/tech. report

Implementation details
• Handling the communication between actors
• Optimizing the event list

Other features
• Visualizing data on floorplan
• Default plug-ins
• Execution check-points

XMT Compiler

Optimizing compiler for explicit parallelism

• Serial GCC for parallel code

• Latency Hiding

• XMT Runtime

• Software structure of the compiler


Designing a compiler from scratch is hard…

How to hire 1000s of developers for free?

Use GCC!

Serial GCC for parallel code?

• Illegal dataflow
• Reasons
  • Concurrent semantics
  • Control transfer between serial core and parallel cores

Example: Illegal Code Motion

XMTC code:

int A[N]=...;
bool found=false;
spawn (0, N-1) {
    if (A[$]!=0)
        found=true;
}
if (found) counter+=1;

Compiler doesn’t know the spawn construct! What the compiler sees:

int A[N]=...;
bool found=false;
asm (spawn 0, N-1)
if (A[$]!=0)
    found=true;
asm (join)
if (found) counter+=1;

After an illegal code motion, the serial statement ends up inside the parallel section:

int A[N]=...;
bool found=false;
asm (spawn 0, N-1)
if (A[$]!=0)
    found=true;
if (found) counter+=1;
asm (join)
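
To see why this motion changes the result, the Python sketch below emulates both layouts, assuming that any code placed between spawn and join is executed by every virtual thread. The threads are emulated by a plain loop in ID order, which is only one possible schedule; the function names and sample data are hypothetical.

```python
def legal(a):
    """Serial statement runs once, after the join."""
    found = False
    counter = 0
    for tid in range(len(a)):  # emulates spawn(0, N-1): one iteration per thread
        if a[tid] != 0:
            found = True
    # join: all threads are done before serial execution resumes
    if found:
        counter += 1
    return counter


def after_illegal_motion(a):
    """The serial statement moved above the join, so every thread runs it."""
    found = False
    counter = 0
    for tid in range(len(a)):  # emulates spawn(0, N-1)
        if a[tid] != 0:
            found = True
        if found:              # now inside the parallel section
            counter += 1
    return counter


data = [0, 7, 0, 3]
legal_result = legal(data)
moved_result = after_illegal_motion(data)
```

With this schedule the moved statement increments `counter` three times instead of once; on real hardware the value would also depend on thread timing.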


A Partial Solution: Outlining

The compiler's view:

int A[N]=...;
bool found=false;
asm (spawn 0, N-1)
if (A[$]!=0)
    found=true;
asm (join)
if (found) counter+=1;

After outlining:

int A[N]=...;
bool found=false;
outlined(A,&found);
if (found) counter+=1;

outlined(int (*A), bool *found) {
    asm (spawn 0, N-1)
    if (A[$]!=0)
        (*found) = true;
    asm (join)
}
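
The shape of the outlined code can be mirrored in Python, using a one-element list as a stand-in for the `bool *found` pointer. The `spawn` helper here is a hypothetical sequential emulation of a spawn/join pair; the point of the sketch is only that the parallel region sits behind a function call and communicates through explicit by-reference arguments.

```python
def spawn(low, high, body):
    """Hypothetical stand-in for a spawn/join pair: runs body(tid) for each ID."""
    for tid in range(low, high + 1):
        body(tid)


def outlined(a, found_ref):
    """Outlined parallel region; `found_ref` plays the role of `bool *found`."""

    def body(tid):
        if a[tid] != 0:
            found_ref[0] = True  # (*found) = true;

    spawn(0, len(a) - 1, body)


def caller(a):
    found = [False]              # one-element list used as an addressable cell
    counter = 0
    outlined(a, found)           # the whole parallel region is one call
    if found[0]:
        counter += 1
    return counter
```

Because the caller only sees a call that may write through `found`, a serial compiler has to keep the `if (found)` test after the call, and it has no way to move that test into the spawn region.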


Other Issues

• Register liveness across MTCU/TCU transfers of control
  • Broadcast live MTCU registers
• Assembly code layout escaping spawn/join XMT statements
  • Analysis for relocating offending basic blocks

Details in the paper!

On-chip Memory Architecture

• Multiple shared cache memory modules
• No Coherent Private Cache
  → Reduced power and ICN bandwidth traffic (+)
  → Increased latency to d-Cache (-)
• But hide latency (~30 cycles to shared caches) with …
  • Parallelism
  • Compiler optimizations

Latency Hiding

• Non-blocking stores
• Broadcasting of code and data
• Software prefetching

Planned for next release:
• Read-only caches (at cluster level)
• Scratch pads

Latency Hiding: Prefetching

Loop prefetching
• Unique resource-aware prefetch algorithm
• Motivated by scarcity of TCU resources
  • 4 prefetch buffer locations per TCU
• Existing algorithms do not take resources into account and assume that after prefetching all reads are cache hits
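
The tension between latency and buffer space can be illustrated with a toy calculation. The heuristic below is an assumption made for this sketch and is not the algorithm from the paper: it computes the prefetch distance that would hide the latency with unlimited buffers, then caps it by the 4 buffer locations per TCU mentioned above, using the ~30 cycle shared-cache latency quoted on the memory architecture slide.

```python
import math

PREFETCH_SLOTS_PER_TCU = 4  # from the slide
SHARED_CACHE_LATENCY = 30   # cycles, approximate figure from the memory slide


def plan_prefetch(num_streams, cycles_per_iteration,
                  slots=PREFETCH_SLOTS_PER_TCU, latency=SHARED_CACHE_LATENCY):
    """Toy heuristic (an assumption, not the published algorithm)."""
    # Distance, in loop iterations, that would hide the latency
    # if prefetch buffers were unlimited.
    ideal_distance = math.ceil(latency / cycles_per_iteration)
    # Each stream keeps `distance` prefetches in flight, one slot each.
    affordable_distance = slots // num_streams
    distance = min(ideal_distance, affordable_distance)
    return {
        "ideal_distance": ideal_distance,
        "distance": distance,
        "slots_used": distance * num_streams,
        "latency_fully_hidden": distance >= ideal_distance,
    }


one_stream = plan_prefetch(num_streams=1, cycles_per_iteration=10)
two_streams = plan_prefetch(num_streams=2, cycles_per_iteration=10)
```

With one array stream the ideal distance of 3 fits in the buffer; with two streams only a distance of 2 is affordable, so some reads would still miss, which is the case a resource-oblivious algorithm would overlook.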


Software Structure of Compiler

source code → CIL pre-pass (source-to-source) → GCC core pass (source-to-assembly) → SableCC post-pass (assembly-to-binary) → .sim, .b

• CIL
  • Outlining
  • …
• GCC
  • Broadcasting
  • Cactus-stack allocation
  • Prefetching
  • …
• SableCC
  • Basic block layout
  • Linking
  • …

Why is the toolchain important?


Current Implementation & Vision

• 64-core FPGA prototype available
• 1024 cores on-chip feasible with today’s technology [HotPar2010]
• Developing the specifications via the toolchain!

Figure from [CACM2011]

How to advance towards the vision?

Architecture/compiler research
• XMT vs. GPU comparison [HotPar2010]
• Scheduler for nested parallelism [PPoPP2010]
• Software prefetching [IJPP2010]

Algorithms research
• Max-Flow [SPAA2011]
• Bi-Connectivity [DIMACS2011]

Teaching
• Undergrad. and grad. classes – UMD/UIUC [EduPar2011]
• High school/middle school [SIGCSE2010]

Under Release Testing

Compiler
• Parallel function calls
• Scheduling for nested parallelism (lazy binary splitting)

Simulator
• Power/temperature models included (HotSpot from UVA)

Under Development

Compiler
• More latency hiding mechanisms (read-only caches and scratch pads)

Simulator
• Increasing simulator speed: parallel simulation, phase sampling and fast forwarding
• Simulation of asynchronous interconnect in [TCAD2010]

Operating system

Summary

Explicit Multi-Threading
• From abstraction/algorithms to hardware

XMTSim
• Highly configurable
• Can simulate GALS, asynchronous logic, dynamic management

XMT Compiler
• How to use serial GCC for a parallel language?
• Optimizations: latency hiding, efficient scheduling

Impact
• Architecture/compiler/algorithms research, teaching

Past Contributors

• Yoni Ashar, Thomas Dubois, Bryant Lee (UMD students)
• Tali Moreshet, Katherine Bertaut (Swarthmore College)
• Genette Gill (Columbia University)

References

[CACM2011] U. Vishkin. Using simple abstraction to reinvent computing for parallelism. CACM 54, 1, p. 75-85, Jan. 2011.

[SPAA2011] G.C. Caragea and U. Vishkin. Better speedups for parallel max-flow. SPAA, 2011.

[EduPar2011] D. Padua, U. Vishkin and J. Carver. Joint UIUC/UMD parallel algorithms/programming course. EduPar-11, in IPDPS, 2011.

[SIGCSE2010] S. Torbert, U. Vishkin, R. Tzur and D. Ellison. Is teaching parallel algorithmic thinking to high-school students possible? One teacher's experience. SIGCSE, 2010.

[IJPP2010] G.C. Caragea, A. Tzannes, F. Keceli, R. Barua and U. Vishkin. Resource-aware compiler prefetching for many-cores. Intern. J. of Parallel Programming, 2010.

[HotPar2010] G.C. Caragea, F. Keceli, A. Tzannes and U. Vishkin. General-purpose vs. GPU: Comparison of many-cores on irregular workloads. HotPar, 2010.

[PPoPP2010] A. Tzannes, G.C. Caragea, R. Barua and U. Vishkin. Lazy binary splitting: A run-time adaptive dynamic work-stealing scheduler. PPoPP, 2010.

[DIMACS2011] From asymptotic PRAM speedups to easy-to-obtain concrete XMT ones, invited talk, DIMACS Workshop on Parallelism: A 20-20 Vision, 2011.

[TCAD2010] M.N. Horak, S.M. Nowick, M. Carlberg and U. Vishkin. A low-overhead asynchronous interconnection network for GALS chip multiprocessor. TCAD, p. 494-507, special issue for NOCS, Apr. 2010.

Backup Slides


Speedups

• XMT vs. GTX280 GPU: 6x speedup on irregular parallelism
• MaxFlow: >100x over serial
• Bi-Connectivity: up to 33x over serial

Immediate Concurrent Execution

Figure from [CACM2011]

How do actors communicate?

Problem: order of concurrent events is not deterministic

• Ports
  • Concept similar to HDLs/SystemC
• Event priority mechanism for ports
  • Two phases/priorities: evaluate and update
• How to handle halt signals?
• Macro Actor contract: no combinatorial path between inputs and outputs
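
The evaluate/update idea can be sketched in Python as a port that buffers writes until a separate update phase. The `Port` and `Incrementer` classes and the `simulate_cycle` helper are hypothetical and only illustrate the two-phase discipline; they are not taken from XMTSim.

```python
class Port:
    """Hypothetical port: writes become visible only after the update phase."""

    def __init__(self, initial=None):
        self.value = initial  # what readers see during evaluate
        self._next = initial  # what writers produced during evaluate

    def read(self):
        return self.value

    def write(self, value):
        self._next = value

    def update(self):
        self.value = self._next


class Incrementer:
    """Toy actor: output = input + 1, visible one cycle later."""

    def __init__(self, inp, out):
        self.inp, self.out = inp, out

    def evaluate(self):
        self.out.write(self.inp.read() + 1)


def simulate_cycle(actors, ports, order):
    for index in order:  # phase 1: evaluate, in any order
        actors[index].evaluate()
    for port in ports:   # phase 2: update, after all evaluations
        port.update()


def run(order):
    a, b, c = Port(0), Port(0), Port(0)
    actors = [Incrementer(a, b), Incrementer(b, c)]
    simulate_cycle(actors, [a, b, c], order)
    return b.read(), c.read()
```

Both `run([0, 1])` and `run([1, 0])` give the same port values, so the order in which concurrent actors are evaluated no longer affects the outcome. This relies on no actor output depending on an input within the same cycle, which is what the macro actor contract states.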


DE Simulation Example

• Time = T: one actor asks the event list “Wake me up at T+1”, the other “Wake me up at T+2”
• Time = T+1: the event list wakes up the first actor
• Time = T+2: the event list wakes up the second actor

DT Example: Pipeline simulation

How to simulate advancing the pipeline for a single clock cycle?

[Figure: pipeline contents initially and after 1 simulated cycle]

DE Example: Pipeline simulation

How to simulate advancing the pipeline for a single clock cycle? Use intermediate storage.

[Figure: pipeline contents initially, in intermediate storage, and after 1 simulated cycle]
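
The two pipeline examples can be contrasted in a short Python sketch. It assumes a pipeline modeled as a list of stage contents; the function names are hypothetical. In the discrete time style one routine shifts all stages in a safe order, while in the discrete event style each stage acts on its own, in an order the simulator does not control, so values are first latched into intermediate storage.

```python
def advance_in_place(stages, new_input):
    """DT style: one routine shifts the whole pipeline, last stage first."""
    for i in range(len(stages) - 1, 0, -1):
        stages[i] = stages[i - 1]
    stages[0] = new_input
    return stages


def advance_naive(stages, new_input, order):
    """DE style without intermediate storage: the result depends on the order."""
    for i in order:
        stages[i] = new_input if i == 0 else stages[i - 1]
    return stages


def advance_with_intermediate_storage(stages, new_input, order):
    """DE style: stages act independently, in an arbitrary order."""
    latched = [None] * len(stages)
    for i in order:  # phase 1: each stage latches its input
        latched[i] = new_input if i == 0 else stages[i - 1]
    for i in order:  # phase 2: each stage commits what it latched
        stages[i] = latched[i]
    return stages
```

Calling `advance_naive(["a", "b", "c"], "d", [0, 1, 2])` smears the new value through every stage, whereas the version with intermediate storage gives the same result as the in-place shift for any order.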


XMTC Runtime

• Allocation of TCUs to virtual threads
  • Using XMT hardware support (ps, chkid)
• Dynamic memory allocation
  • Serial only for now; parallel is left for future work
• Parallel stack allocation & nested parallelism scheduling
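
One way to picture the allocation of TCUs to virtual threads is a shared counter from which every TCU atomically claims the next thread ID. The Python sketch below emulates that with a lock-protected fetch-and-add; the class and function names are hypothetical, and the exact semantics of the ps and chkid instructions are assumed rather than taken from the slides.

```python
import threading


class PrefixSumCounter:
    """Emulates an atomic fetch-and-add, standing in for the ps primitive."""

    def __init__(self, start):
        self._value = start
        self._lock = threading.Lock()

    def ps(self, increment):
        with self._lock:
            old = self._value
            self._value += increment
            return old


def run_virtual_threads(num_tcus, low, high, body):
    """Each TCU repeatedly grabs the next unclaimed virtual-thread ID."""
    next_id = PrefixSumCounter(low)
    executed = [[] for _ in range(num_tcus)]

    def tcu(index):
        while True:
            tid = next_id.ps(1)  # claim a fresh ID
            if tid > high:       # assumed role of the ID check: no work left
                return
            body(tid)
            executed[index].append(tid)

    workers = [threading.Thread(target=tcu, args=(i,)) for i in range(num_tcus)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return executed


squares = [0] * 20


def square(tid):
    squares[tid] = tid * tid


per_tcu = run_virtual_threads(num_tcus=4, low=0, high=19, body=square)
```

Every ID from 0 to 19 is executed exactly once, while the split of IDs across the four emulated TCUs varies from run to run.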


XMTC memory model

Problem: Many processors read and write to the memory in parallel.

Question: In what order does processor A see memory operations from processor B? (Is the order experienced the same by A and B?)

Example: Initially: x=0, y=0;

Thread A        Thread B
x:=1            Read y
y:=1            Read x

Can B read (x,y) = (1,0) ?
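
As a baseline for the question above, the Python sketch below enumerates every interleaving of the two threads under sequential consistency, that is, assuming each thread's operations take effect in program order and all threads see one global order. This is a reference model chosen for the sketch; it makes no claim about which orderings the XMT hardware or the XMTC memory model actually allows.

```python
from itertools import combinations


def sc_outcomes():
    """All (x, y) pairs thread B can read under sequential consistency."""
    thread_a = [("write", "x"), ("write", "y")]  # x:=1 ; y:=1
    thread_b = [("read", "y"), ("read", "x")]    # Read y ; Read x
    total = len(thread_a) + len(thread_b)
    outcomes = set()
    # Choose which global steps belong to A; program order is kept per thread.
    for a_slots in combinations(range(total), len(thread_a)):
        memory = {"x": 0, "y": 0}
        seen = {}
        a_ops, b_ops = iter(thread_a), iter(thread_b)
        for step in range(total):
            kind, var = next(a_ops) if step in a_slots else next(b_ops)
            if kind == "write":
                memory[var] = 1
            else:
                seen[var] = memory[var]
        outcomes.add((seen["x"], seen["y"]))
    return outcomes


reachable = sc_outcomes()
```

Under this model B can read (x,y) = (0,0), (1,0) or (1,1), but never (0,1). A weaker memory model may admit more outcomes unless the program adds ordering.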


XMTC memory model

Initially: x=0, y=0

Thread A                        Thread B
x=1;                            ... = psm(0,y);  // Read y
psm(1,y);  // atomic{y++}       ... = x;         // Read x

(Slide annotation: implicit memory barrier)