ATLAS (a.k.a. RAMP Red): Parallel Programming with Transactional Memory. Njuguna Njoroge and Sewook Wee, Transactional Coherence and Consistency, Computer System Lab, Stanford University. http://tcc.stanford.edu/prototypes

Post on 21-Dec-2015

ATLAS (a.k.a. RAMP Red)

Parallel Programming with Transactional Memory

Njuguna Njoroge and Sewook Wee

Transactional Coherence and Consistency

Computer System Lab

Stanford University

http://tcc.stanford.edu/prototypes

Why we built ATLAS

Multicore processors expose the challenges of multithreaded programming
• Transactional Memory simplifies parallel programming
  • As simple as coarse-grain locks
  • As fast as fine-grain locks

Currently missing for evaluating TM
• Fast TM prototypes to develop software on

FPGAs' improving capabilities make them attractive for CMP prototyping
• Fast: can operate at > 100 MHz
• More logic, memory, and I/Os
• Larger libraries of pre-designed IP cores

ATLAS: 8-processor Transactional Memory System
• 1st FPGA-based Hardware TM system
• Member of the RAMP initiative: RAMP Red

ATLAS provides …

Speed
• > 100x speed-up over SW simulator [FPGA 2007]

Rich software environment
• Linux OS
• Full GNU environment (gcc, glibc, etc.)

Productivity
• Guided performance tuning
• Standard GDB environment + deterministic replay

TCC’s Execution Model

Transaction
• Building block of a program
• Critical region
• Executed atomically & isolated from others

TCC’s Execution Model

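[Figure: execution timelines for CPU 0, CPU 1, and CPU 2. Each transaction passes through Execute Code, Arbitrate, and Commit. One transaction commits "st 0xbeef" and the address 0xbeef is broadcast; a transaction that had executed "ld 0xbeef" is violated, goes through Undo, and then Re-Executes its code. Other accesses shown: ld 0xaaaa, ld 0xbbbb, ld 0xdddd, ld 0xeeee.]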

In TCC, All Transactions All The Time [PACT 2004]

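CMP Architecture for TCC

[Figure: block diagram of one TCC node. Blocks and signals shown: Processor with Register Checkpoint; Load/Store Address; Data Cache with a 2-ported TAG array (V, R7:0, and W7:0 bits) and a single-ported DATA array; Store Address FIFO; Snoop Control with Commit Address In and Violation; Commit Control with Commit Address Out and Commit Data Out; Commit Address and Commit Data; Commit Bus; Refill Bus.]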

Speculatively Read Bits: set by loads (e.g. ld 0xdeadbeef)

Speculatively Written Bits: set by stores (e.g. st 0xcafebabe)

Violation Detection: Compare incoming commit address to R bits

Commit: Read pointers from Store Address FIFO, flush addresses with W bits set
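The R/W-bit bookkeeping on this slide can be sketched in C. This is only an illustrative software model, not the ATLAS hardware: the names `Cpu`, `tcc_load`, `tcc_store`, `tcc_snoop`, `tcc_commit`, and `demo_violation` are hypothetical, memory is a 16-word array, and each CPU is assumed to have room for every word, so overflow is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MEM_WORDS 16                 /* hypothetical tiny shared memory */

uint32_t memory[MEM_WORDS];          /* committed (non-speculative) state */

typedef struct {
    bool     r[MEM_WORDS];           /* speculatively read bits */
    bool     w[MEM_WORDS];           /* speculatively written bits */
    uint32_t data[MEM_WORDS];        /* speculative values, valid where w is set */
    int      fifo[MEM_WORDS];        /* Store Address FIFO: addresses with W set */
    int      fifo_len;
    bool     violated;
} Cpu;

/* Discard all speculative state (start of a transaction, or Undo). */
void tcc_reset(Cpu *c) { memset(c, 0, sizeof *c); }

/* Load: a CPU sees its own speculative write; otherwise mark the R bit. */
uint32_t tcc_load(Cpu *c, int addr) {
    if (c->w[addr]) return c->data[addr];
    c->r[addr] = true;
    return memory[addr];
}

/* Store: buffer the value, set the W bit, and remember the address once. */
void tcc_store(Cpu *c, int addr, uint32_t value) {
    if (!c->w[addr]) {
        c->w[addr] = true;
        c->fifo[c->fifo_len++] = addr;
    }
    c->data[addr] = value;
}

/* Violation detection: compare an incoming commit address to the R bits. */
void tcc_snoop(Cpu *c, int addr) {
    if (c->r[addr]) c->violated = true;
}

/* Commit: walk the Store Address FIFO, flush each written word to memory,
 * and broadcast its address to every other CPU. */
void tcc_commit(Cpu *self, Cpu *all, int n) {
    for (int i = 0; i < self->fifo_len; i++) {
        int addr = self->fifo[i];
        memory[addr] = self->data[addr];
        for (int k = 0; k < n; k++)
            if (&all[k] != self) tcc_snoop(&all[k], addr);
    }
    tcc_reset(self);
}

/* Usage: CPU 1 reads word 3, CPU 0 commits a store to word 3, and CPU 1
 * is violated, undoes its work, and re-executes the load. */
uint32_t demo_violation(void) {
    static Cpu cpu[2];
    tcc_reset(&cpu[0]);
    tcc_reset(&cpu[1]);
    memory[3] = 0;

    uint32_t value = tcc_load(&cpu[1], 3);   /* reads the old value */
    tcc_store(&cpu[0], 3, 42);
    tcc_commit(&cpu[0], cpu, 2);
    assert(cpu[1].violated);

    tcc_reset(&cpu[1]);                      /* Undo */
    value = tcc_load(&cpu[1], 3);            /* Re-execute */
    return value;
}
```

Tracking written addresses in a FIFO means commit only visits lines that were actually written, which is the role the slide gives to the Store Address FIFO.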

ATLAS 8-way CMP on BEE2 Board

[Figure: BEE2 board layout. Four User FPGAs, each with two TCC PPC cores (each with a TCC$ and an I$) and a User Switch. One Control FPGA with the Linux PPC, DDR2 DRAM Controller, Commit Token Arbiter, I/O (disk, net), and the Control Switch.]

User FPGAs
• 4 FPGAs for a total of 8 TCC CPUs
• PPC, TCC caches, BRAMs, and busses run @ 100 MHz

Control FPGA
• Linux PPC @ 300 MHz
  • Launch TCC apps here
  • Handle system services for TCC PowerPCs
• Fabric runs @ 100 MHz

ATLAS Software Overview

[Figure: software stack, top to bottom: TM Application; TM API and ATLAS Profiler; ATLAS Subsystem; Linux OS; ATLAS HW on BEE2.]

TM applications can be easily written with the TM API

The ATLAS profiler provides runtime profiling and guided performance tuning

The ATLAS subsystem provides Linux OS support for TM applications

ATLAS subsystem

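[Figure: ATLAS subsystem timeline across the Linux PPC and TCC PPC0 through TCC PPC7. Steps labeled: transfers initial context (Linux PPC to TCC PPC0), invokes parallel work, joins parallel work, exit with app. stats. Transactions on the TCC PPCs are marked Commit or Violation.]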

ATLAS System Support

[Figure: request/reply exchange between a TCC PPC and the Linux PPC.]

TCC PPC requests OS support (TLB miss, system call).

Linux PPC regenerates and services the request.

Linux PPC replies back to the requestor.

Serialize, if request is irrevocable
• System call
• Page-out

Coding with TM API: histogram

    int main(int argc, char** argv) {
        /* … sequential code … */
        TM_PARALLEL(run, NULL, numCpus);
        /* … sequential code … */
    }

    // static scheduling with interleaved access to A[]
    void* run(void* args) {
        int i = TM_GET_THREAD_ID();
        for (; i < NUM_LOOP; i += TM_GET_NUM_THREAD()) {
            TM_BEGIN();
            bucket[A[i]]++;   // each increment is its own transaction
            TM_END();
        }
        return NULL;
    }

OpenTM will provide high-level (OpenMP style) pragmas
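To try the histogram loop away from the ATLAS board, the TM API macros can be replaced by stand-ins. The sketch below is an assumption-laden emulation, not the ATLAS implementation: the macro bodies are hypothetical, the "threads" run one after another on a single core (so TM_BEGIN and TM_END are no-ops), and NUM_LOOP, NUM_BUCKETS, the contents of A[], and the helper `histogram` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the TM API names used on the slide. */
int tm_thread_id;
int tm_num_threads;
#define TM_GET_THREAD_ID()   (tm_thread_id)
#define TM_GET_NUM_THREAD()  (tm_num_threads)
#define TM_BEGIN()           ((void)0)   /* no-op: "threads" run one at a time */
#define TM_END()             ((void)0)
#define TM_PARALLEL(fn, arg, n)                                      \
    do {                                                             \
        tm_num_threads = (n);                                        \
        for (tm_thread_id = 0; tm_thread_id < (n); tm_thread_id++)   \
            (fn)(arg);                                               \
    } while (0)

#define NUM_LOOP    12   /* illustrative input size */
#define NUM_BUCKETS 4

const int A[NUM_LOOP] = {0, 1, 2, 3, 0, 1, 2, 0, 1, 0, 3, 3};
int bucket[NUM_BUCKETS];

/* Static scheduling with interleaved access to A[]:
 * thread t handles elements t, t + N, t + 2N, ... */
void *run(void *args) {
    (void)args;
    int i = TM_GET_THREAD_ID();
    for (; i < NUM_LOOP; i += TM_GET_NUM_THREAD()) {
        TM_BEGIN();
        bucket[A[i]]++;
        TM_END();
    }
    return NULL;
}

/* Clear the buckets and run the "parallel" region with numCpus threads. */
void histogram(int numCpus) {
    for (int b = 0; b < NUM_BUCKETS; b++) bucket[b] = 0;
    TM_PARALLEL(run, NULL, numCpus);
}
```

Because every element of A[] is visited by exactly one thread ID, the bucket counts are the same for any thread count; on real hardware the transaction around `bucket[A[i]]++` is what keeps concurrent increments of the same bucket from being lost.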

Guided Performance Tuning

TAPE: Light-weight runtime profiler [ICS 2005]

Tracking most significant violations (longest loss time)

• Violated object address

• PC where object was read

• Loss time & number of occurrences

• Committing thread’s ID and transaction PC

Tracking most significant overflows (longest duration)

• Overflows: when speculative state can no longer stay in TCC$

• PC where the overflow occurred

• Overflow duration & number of occurrences

• Type of overflow (LRU or Write Buffer)
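The violation fields listed above map naturally onto a small record table. The following C sketch is a guess at how such bookkeeping could look in software, not TAPE's actual design: `ViolationRecord`, `record_violation`, `worst_violation`, the table size, and the eviction policy are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RECORDS 8   /* hypothetical table size */

typedef struct {
    uint32_t obj_addr;       /* violated object address */
    uint32_t read_pc;        /* PC where the object was read */
    uint64_t loss_cycles;    /* accumulated loss time */
    uint32_t occurrences;    /* number of occurrences */
    uint8_t  committer_id;   /* committing thread's ID (most recent) */
    uint32_t committer_pc;   /* committing transaction's PC (most recent) */
} ViolationRecord;

ViolationRecord tape_table[MAX_RECORDS];
int tape_len;

/* Fold one violation into the table, keyed by (object address, read PC). */
void record_violation(uint32_t obj_addr, uint32_t read_pc, uint64_t loss,
                      uint8_t committer_id, uint32_t committer_pc) {
    int slot = -1;
    for (int i = 0; i < tape_len; i++) {
        if (tape_table[i].obj_addr == obj_addr &&
            tape_table[i].read_pc == read_pc) {
            slot = i;
            break;
        }
    }
    if (slot < 0) {
        if (tape_len < MAX_RECORDS) {
            slot = tape_len++;
        } else {
            /* Table full: replace the entry with the smallest loss time,
             * but only if the new violation is more significant. */
            slot = 0;
            for (int i = 1; i < tape_len; i++)
                if (tape_table[i].loss_cycles < tape_table[slot].loss_cycles)
                    slot = i;
            if (tape_table[slot].loss_cycles >= loss) return;
        }
        tape_table[slot] = (ViolationRecord){ .obj_addr = obj_addr,
                                              .read_pc = read_pc };
    }
    tape_table[slot].loss_cycles += loss;
    tape_table[slot].occurrences++;
    tape_table[slot].committer_id = committer_id;
    tape_table[slot].committer_pc = committer_pc;
}

/* Most significant violation: the entry with the longest total loss time. */
const ViolationRecord *worst_violation(void) {
    const ViolationRecord *worst = NULL;
    for (int i = 0; i < tape_len; i++)
        if (worst == NULL || tape_table[i].loss_cycles > worst->loss_cycles)
            worst = &tape_table[i];
    return worst;
}
```

A fixed-size table keeps the profiler's footprint bounded, which fits the "light-weight" goal; the same shape would work for overflow records with duration in place of loss time.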

Deterministic Replay

All Transactions All The Time

• TM 101: Transaction is executed atomically and in isolation

• TM’s illusion: transaction starts after older transactions finish

Only need to record “the order of commit”

• Minimal runtime overhead & footprint size = 1B / transaction

[Figure: two timelines, "Logging execution" and "Replay execution", each showing transactions T0, T1, and T2 and their write-sets over time. The logging run records LOG: T0 T1 T2.]

Token arbiter enforces commit order specified in LOG
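The log-then-enforce idea can be sketched in a few lines of C. This is a minimal model under stated assumptions, not the ATLAS arbiter: the slide only says the log costs 1 B per transaction, so storing the committing CPU's ID in that byte is an assumption, and `CommitLog`, `log_commit`, and `replay_try_grant` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_CAPACITY 1024   /* hypothetical log size, in transactions */

typedef struct {
    uint8_t entries[LOG_CAPACITY];  /* one byte per commit: committer's CPU ID */
    size_t  len;                    /* number of commits recorded */
    size_t  next;                   /* replay cursor */
} CommitLog;

/* Logging run: call when the arbiter grants the commit token to `cpu`. */
bool log_commit(CommitLog *log, uint8_t cpu) {
    if (log->len == LOG_CAPACITY) return false;   /* log full */
    log->entries[log->len++] = cpu;
    return true;
}

/* Replay run: grant the token only if `cpu` is next in the logged order.
 * A refused CPU simply keeps waiting and asks again later. */
bool replay_try_grant(CommitLog *log, uint8_t cpu) {
    if (log->next >= log->len || log->entries[log->next] != cpu)
        return false;
    log->next++;
    return true;
}
```

For example, after logging commits from CPUs 0, 1, 2 in that order, a replay request from CPU 2 is refused until CPUs 0 and 1 have been granted the token. Since each transaction appears to start after older ones finish, reproducing the commit order is enough to reproduce the run.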

Useful Features of Replay

Monitoring code in the transaction

• Remember we only record the transaction order

Verification

• Log is not written in stone

• Complete runtime scenario coverage is possible

Choice of running Replay on

• ATLAS itself
  • HW support for other debugging tools (see next slide)

• Local machine (your favorite desktop or workstation)
  • Runs natively on faster local machine, sequentially
  • Seamless access to existing debugging tools

GDB support

Current status

• GDB integrated with local machine replay
  • GDB provides debuggability while guaranteeing deterministic replay

• The items below are work in progress

Breakpoint

• Thread local BP vs. global BP

• Stop the world by controlling commit token

Stepping

• Backward stepping: Transaction is ready to roll back

• Transaction stepping

Unlimited data-watch (ATLAS only)

• Separate monitor TCC cache to register data-watches

Conclusion: ATLAS provides

Speed
• > 100x speed-up over SW simulator [FPGA 2007]

Software environment
• Linux OS
• Full GNU environment (gcc, glibc, etc.)

Productivity
• TAPE: Guided performance tuning
• Deterministic replay
• Standard GDB environment

Future Work
• High-level language support (Java, Python, …)

Questions and Answers

ATLAS Team Members

• System Hardware – Njuguna Njoroge, PhD Candidate

• System Software – Sewook Wee, PhD Candidate

• High level languages – Jiwon Seo, PhD Candidate

• HW Performance – Lewis Mbae, BS Candidate

Past contributors

• Interconnection Fabric – Jared Casper, PhD Candidate

• Undergrads – Justy Burdick, Daxia Ge, Yuriy Teslar