software transactional memory system for c++ serge preis, ravi narayanaswami intel corporation

Software Transactional Memory system for C++

Serge Preis, Ravi Narayanaswami

Intel Corporation

2Software and Services Group 2/20

Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others

Agenda

•Parallel programming challenges

•Transactional memory introduction and overview

−Software implementation of TM specifics

•Intel STM system

−Language extensions

−C++ support

−Compiler overview

−Library overview

•Performance results and analysis

•Conclusion

•Questions and answers



Parallel programming

•Shared memory gets more popular −Multi-core is getting momentum−2-socket workstations for maximum desktop performance−Hyper-threading is back

•Shared resources access may be a bottle-neck−Locks required to avoid races−Single global lock is simple, but limiting−Fine-grain locks may provide best scalability−Fine-grain locks are hard to design, implement and test−Poor implementation is error-prone and limiting−Locks are not composable



What is transactional memory

•Syntax:

__tm_atomic {

// transaction code goes here

}

•Semantics:

−Isolation: effects are localized

−Atomicity: commit or rollback

−Retry if data conflict is detected

−Publication and privatization safety

−Composability via nesting

−Fine-grain: based on data accesses



Example: Insert Node into a link list

{ new_node->prev = node; new_node->next = node->next; node->next->prev = new_node; node->next = new_node; }




Thread 1 Thread 2

{ new_node->prev = node; new_node->next = node->next; node->next->prev = new_node; node->next = new_node; }




Lock { new_node->prev = node; new_node->next = node->next; node->next->prev = new_node; node->next = new_node; }

Lock { new_node->prev = node; new_node->next = node->next; node->next->prev = new_node; node->next = new_node; }

Thread 1

Thread 2

Single global lockget lock and execute

waits for lock to be released




__tm_atomic { new_node->prev = node; new_node->next = node->next; node->next->prev = new_node; node->next = new_node; }

__tm_atomic { new_node->prev = node; new_node->next = node->next; node->next->prev = new_node; node->next = new_node; }

Thread 1 Thread 2

TM Based

Both threads execute in parallel, if they write to same node then abort and retry one of transactions



Transactional memory overview

• Fine-grain concurrency management without locks−Concurrent readers are welcome−Re-execute entire transaction if conflict is detected

• Simple syntax and semantics−Looks and behaves like single global lock−Simpler create race-free programs

• Possible implementations−Purely software (STM)−Software with HW acceleration (HaSTM)−Separate HW and SW (HyTM)−HW-based with TX size restrictions (RTM)−HW-based for short transactions, software for unbounded (VTM)

• No TM hardware is out yet−SUN* ROCK* planned to be released next year−AMD* has published spec for TM H/W assistance



Software transactional memory

•Weak atomicity: guarantees are only for transactional code−Same as for locks

•Unbounded: transactions of arbitrary size are supported−HW resources (memory) is the limit

•Instrumentation of memory accesses is required within transactions−Spatial and performance overhead−Including called functions−Not always possible−Different data and contention management techniques−Object based or word-based depending on language



Intel STM implementation

•C/C++ Compiler + run-time library

•Word-based STM system

•Close nesting

•Failure atomicity (__tm_abort)

•Irrevocable execution support for I/O and legacy

•Support for C++ constructs

•Highly optimized

System is complete, based on production compiler and published for everyone to try



Basic language extensions

•Atomic blocks

−Nesting is possible

>Composability is a great advantage

−Support of function calls

>Including indirect

−Failure atomicity using user aborts

>Lexical only

−Support of pre-compiled code

>Dynamically incompatible with aborts

•Escape from atomic execution

−Programmer ensures correctness

__tm_atomic { if (local_var > Global_val) { Global_max = local_var; local_var = Global_arr[++loc_i]; foo(Global_max); } __tm_atomic { foo(Global_max++) }}

__tm_atomic { cout << “HelloWorld!”; //precompiled __tm_atomic { // some code __tm_abort; // inner TX aborted } if (some_condition) { __tm_abort; // runtime error }}



Annotations

• All functions called from atomic block are either annotated or treated as pre-compiled−Including class methods, template instantiations etc.−Can be deduced by compiler automatically for some cases−Special care is taken over indirect calls

• Main annotations are:−tm_callable: may be called from atomic region if processed by compiler−tm_safe: same as callable, but safe to mix with tm_abort (no unsafe pre-

compiled code inside)−tm_pure: may be called from atomic region as is−tm_unknown: pre-compiled code (same as no annotation at all), overrides

all other annotations in some cases

• Some functions processed specially−Memory allocations−Memory copying−There is special annotation to code TM wrappers for functions



Annotating many functions at once

• Class annotations

− tm_callable, tm_safe and tm_unknow supported

−Serves as default to all methods introduced in the class

−Inherited and may be overridden on child

−May be overridden on individual methods, overriding rules apply

• Virtual members annotations

−All annotations are supported

−Inheritance has higher priority than class one

−Overriding rules apply

• Templates annotations

−My be overridden on explicit instantiation and specialization

−Both functions and classes are supported

Derived class

Base unknown callable safe pure

unknown Yes Yes Yes Yes

callable no Yes Yes no

safe no No Yes no

pure no No no Yes

#define _ds(x) __declspec(x)struct A { virtual int _ds(tm_safe) getA() const; virtual int process(); //unknown };_ds(tm_callable) struct B { virtual int getA() const; // tm_callable virtual int __ds(tm_unknown) process();};struct AB : A, B { // tm_callable int _ds(tm_pure) getA(); //tm_safe→error int process(); // tm_unknown};



Exception Handling with TM

•Do not let exceptions leave atomic block open

−Catch all exceptions at block boundary and commit implicitly

−Abort on exception is also supported explicitly also with rethrow

•Process all handlers belonging to atomic block accordingly

•Support correct calls to copy c’tors and d’tors from C++ RTL during stack unwinding

−Dynamically decide whether transactional or non-transactional version should be called at each frame

•Do not corrupt C++ RTL EH state during abort and retry

−Should not long-jump out of catch block as we do for retry

−At the same time should not execute user code after conflict is detected

−Transparent for programmer

−Extremely complicated task, now in implementation



Intel STM optimizing compiler

• Language extensions• Instrumentation of code

−Transaction boundaries−Memory accesses and function calls−Exception handling

• Optimizations−Reduce deoptimization overhead−Optimized memory instrumentation

>Taking order of access into account>Taking care of locals

−Annotations propagation−TM mode based on transaction properties−Simplified instrumentation for some TM modes

There is still plenty of room for optimization

The producto Based on production version

of Intel® C/C++ compiler o Full set of classic

optimizations including IPO, vectorizer and parallelizer

o Windows*/Linux, IA32/Intel 64o Compatible with GCC on Linux

and various versions of Microsoft* Visual Studio* on Windows*

o Version 3.0 is comingo Publicly available to tryo Feedback is appreciated



Intel STM library

• Comprehensive and flexible ABI−Supports various STM

implementations−Allows compiler optimizations

• Effective contention management with switchable strategies−In-place updates−2 dynamically interchangeable

strategies−Effective implementation−Additional obstinate mode for long

transactions

• Irrevocable execution support• Failure atomicity support for locals• Nesting support with local aborts• Special handling for memory

allocation and copying

STM library ABI at a glanceTxnDesc* getTransaction()int beginTx(TxnDesc*,int modes_hints)void commitTx(TxnDesc*)int beginInnerTx(TxnDesc*,int)void commitInnerTx(TxnDesc*)void abortTx(TxnDesc*)void switchToSerialMode(TxnDesc*)void write<Type>(TxnDesc*,Type*,Type)Type read<Type>(TxnDesc*,Type*)void logValue<Type>(TxnDesc*,Type*)void logMem(TxnDesc*,void*,size t)void writeAfterRead(Type*,Type)void writeAfterWrite<Type>(Type*,Type)Type readAfterRead<Type>(Type*)Type readAfterWrite<Type>(Type*)Type readForWrite<Type>(Type*)



Example of translation

__declcpec(tm_callable)

foo();

//...

__tm_atomic {

foo(++c);

}

Label:

action = StartTX();

if (action & restoreLiveVariables)

liveX = saved_liveX;

if (action & retryTransaction)

goto Label;

else if (action & saveLiveVariables)

saved_liveX = liveX;

if (action & InstrumentedCode) { temp = Read<Int>(c);

temp = temp + 1;

Write<Int>(c);

foo_$TXN(c); // TM-version of foo

} else { c = c + 1;

foo(c);

} CommitTX();



Performance data (SPLASH2 benchmarks)

0.000

1.000

2.000

3.000

4.000

5.000

6.000

7.000

8.000

1T 2T 4T 8T

FGL

STM

GLOCK

Speedup vs. serial execution on 8T

0.000

1.000

2.000

3.000

4.000

5.000

6.000

7.000

8.000

BARNES

CHOLESKY

FFTFM

M

RADIOSIT

Y

RAYTRACE

GEOMEAN

FGL

STM

GLOCK

Scalability of BARNESS benchmark

Performance of STM depends highly on workload



Performance analysis

• For many workloads STM outperforms Single Global Lock

−On 8 or more threads

−Results still lower than Fine Grain Locks

−More optimizations and improvements are on the way

• RAYTRACE is and example where ceiling is hit

−When all 8 threads are running the HW is 100% busy and thus STM code increases the machine load beyond the limit and performance drops

−Picture is quite different for 8T on 16-way HW, but for 16T is the same

−Optimization to run short transactions in Global Lock mode helps much to thisbenchmark

• Optimizations are not good for all benchmarks: RADIOSYTY performs best in pure STM runs

0.000

1.000

2.000

3.000

4.000

5.000

6.000

7.000

8.000

FGL

STM

GLOCK

Small transactions optimization is ON



Conclusion

• Pros−Simple programming model−Composability−Failure atomicity−Decent scalability at 8 threads and beyond for many workloads

• Cons−Overhead−Profitability highly depend on workload−Retries eat power−Function annotation are not that convenient

• Intel C/C++ STM compiler prototype edition is publicly available at http://whaif.intel.com −Includes Intel STM library, user documentation and examples−Active discussion forum for questions and comments

We target the future



Q&A

software transactional memory system for c++ serge preis, ravi narayanaswami intel corporation

Documents

node node

node new

services group

property of othersexample

property of otherswhat

link listlock

link listthread

tm basedboth threads