software transactional memory system for c++ serge preis, ravi narayanaswami intel corporation
TRANSCRIPT
2Software and Services Group 2/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
Agenda
•Parallel programming challenges
•Transactional memory introduction and overview
−Software implementation of TM specifics
•Intel STM system
−Language extensions
−C++ support
−Compiler overview
−Library overview
•Performance results and analysis
•Conclusion
•Questions and answers
3Software and Services Group 3/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
Parallel programming
•Shared memory gets more popular −Multi-core is getting momentum−2-socket workstations for maximum desktop performance−Hyper-threading is back
•Shared resources access may be a bottle-neck−Locks required to avoid races−Single global lock is simple, but limiting−Fine-grain locks may provide best scalability−Fine-grain locks are hard to design, implement and test−Poor implementation is error-prone and limiting−Locks are not composable
4Software and Services Group 4/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
What is transactional memory
•Syntax:
__tm_atomic {
// transaction code goes here
}
•Semantics:
−Isolation: effects are localized
−Atomicity: commit or rollback
−Retry if data conflict is detected
−Publication and privatization safety
−Composability via nesting
−Fine-grain: based on data accesses
5Software and Services Group 5/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
Example: Insert Node into a link list
{ new_node->prev = node; new_node->next = node->next; node->next->prev = new_node; node->next = new_node; }
6Software and Services Group 6/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
Example: Insert Node into a link list
Thread 1 Thread 2
{ new_node->prev = node; new_node->next = node->next; node->next->prev = new_node; node->next = new_node; }
7Software and Services Group 7/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
Example: Insert Node into a link list
Lock { new_node->prev = node; new_node->next = node->next; node->next->prev = new_node; node->next = new_node; }
Lock { new_node->prev = node; new_node->next = node->next; node->next->prev = new_node; node->next = new_node; }
Thread 1
Thread 2
Single global lockget lock and execute
waits for lock to be released
8Software and Services Group 8/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
Example: Insert Node into a link list
__tm_atomic { new_node->prev = node; new_node->next = node->next; node->next->prev = new_node; node->next = new_node; }
__tm_atomic { new_node->prev = node; new_node->next = node->next; node->next->prev = new_node; node->next = new_node; }
Thread 1 Thread 2
TM Based
Both threads execute in parallel, if they write to same node then abort and retry one of transactions
9Software and Services Group 9/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
Transactional memory overview
• Fine-grain concurrency management without locks−Concurrent readers are welcome−Re-execute entire transaction if conflict is detected
• Simple syntax and semantics−Looks and behaves like single global lock−Simpler create race-free programs
• Possible implementations−Purely software (STM)−Software with HW acceleration (HaSTM)−Separate HW and SW (HyTM)−HW-based with TX size restrictions (RTM)−HW-based for short transactions, software for unbounded (VTM)
• No TM hardware is out yet−SUN* ROCK* planned to be released next year−AMD* has published spec for TM H/W assistance
10Software and Services Group 10/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
Software transactional memory
•Weak atomicity: guarantees are only for transactional code−Same as for locks
•Unbounded: transactions of arbitrary size are supported−HW resources (memory) is the limit
•Instrumentation of memory accesses is required within transactions−Spatial and performance overhead−Including called functions−Not always possible−Different data and contention management techniques−Object based or word-based depending on language
11Software and Services Group 11/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
Intel STM implementation
•C/C++ Compiler + run-time library
•Word-based STM system
•Close nesting
•Failure atomicity (__tm_abort)
•Irrevocable execution support for I/O and legacy
•Support for C++ constructs
•Highly optimized
System is complete, based on production compiler and published for everyone to try
12Software and Services Group 12/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
Basic language extensions
•Atomic blocks
−Nesting is possible
>Composability is a great advantage
−Support of function calls
>Including indirect
−Failure atomicity using user aborts
>Lexical only
−Support of pre-compiled code
>Dynamically incompatible with aborts
•Escape from atomic execution
−Programmer ensures correctness
__tm_atomic { if (local_var > Global_val) { Global_max = local_var; local_var = Global_arr[++loc_i]; foo(Global_max); } __tm_atomic { foo(Global_max++) }}
__tm_atomic { cout << “HelloWorld!”; //precompiled __tm_atomic { // some code __tm_abort; // inner TX aborted } if (some_condition) { __tm_abort; // runtime error }}
13Software and Services Group 13/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
Annotations
• All functions called from atomic block are either annotated or treated as pre-compiled−Including class methods, template instantiations etc.−Can be deduced by compiler automatically for some cases−Special care is taken over indirect calls
• Main annotations are:−tm_callable: may be called from atomic region if processed by compiler−tm_safe: same as callable, but safe to mix with tm_abort (no unsafe pre-
compiled code inside)−tm_pure: may be called from atomic region as is−tm_unknown: pre-compiled code (same as no annotation at all), overrides
all other annotations in some cases
• Some functions processed specially−Memory allocations−Memory copying−There is special annotation to code TM wrappers for functions
14Software and Services Group 14/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
Annotating many functions at once
• Class annotations
− tm_callable, tm_safe and tm_unknow supported
−Serves as default to all methods introduced in the class
−Inherited and may be overridden on child
−May be overridden on individual methods, overriding rules apply
• Virtual members annotations
−All annotations are supported
−Inheritance has higher priority than class one
−Overriding rules apply
• Templates annotations
−My be overridden on explicit instantiation and specialization
−Both functions and classes are supported
Derived class
Base unknown callable safe pure
unknown Yes Yes Yes Yes
callable no Yes Yes no
safe no No Yes no
pure no No no Yes
#define _ds(x) __declspec(x)struct A { virtual int _ds(tm_safe) getA() const; virtual int process(); //unknown };_ds(tm_callable) struct B { virtual int getA() const; // tm_callable virtual int __ds(tm_unknown) process();};struct AB : A, B { // tm_callable int _ds(tm_pure) getA(); //tm_safe→error int process(); // tm_unknown};
15Software and Services Group 15/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
Exception Handling with TM
•Do not let exceptions leave atomic block open
−Catch all exceptions at block boundary and commit implicitly
−Abort on exception is also supported explicitly also with rethrow
•Process all handlers belonging to atomic block accordingly
•Support correct calls to copy c’tors and d’tors from C++ RTL during stack unwinding
−Dynamically decide whether transactional or non-transactional version should be called at each frame
•Do not corrupt C++ RTL EH state during abort and retry
−Should not long-jump out of catch block as we do for retry
−At the same time should not execute user code after conflict is detected
−Transparent for programmer
−Extremely complicated task, now in implementation
16Software and Services Group 16/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
Intel STM optimizing compiler
• Language extensions• Instrumentation of code
−Transaction boundaries−Memory accesses and function calls−Exception handling
• Optimizations−Reduce deoptimization overhead−Optimized memory instrumentation
>Taking order of access into account>Taking care of locals
−Annotations propagation−TM mode based on transaction properties−Simplified instrumentation for some TM modes
There is still plenty of room for optimization
The producto Based on production version
of Intel® C/C++ compiler o Full set of classic
optimizations including IPO, vectorizer and parallelizer
o Windows*/Linux, IA32/Intel 64o Compatible with GCC on Linux
and various versions of Microsoft* Visual Studio* on Windows*
o Version 3.0 is comingo Publicly available to tryo Feedback is appreciated
17Software and Services Group 17/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
Intel STM library
• Comprehensive and flexible ABI−Supports various STM
implementations−Allows compiler optimizations
• Effective contention management with switchable strategies−In-place updates−2 dynamically interchangeable
strategies−Effective implementation−Additional obstinate mode for long
transactions
• Irrevocable execution support• Failure atomicity support for locals• Nesting support with local aborts• Special handling for memory
allocation and copying
STM library ABI at a glanceTxnDesc* getTransaction()int beginTx(TxnDesc*,int modes_hints)void commitTx(TxnDesc*)int beginInnerTx(TxnDesc*,int)void commitInnerTx(TxnDesc*)void abortTx(TxnDesc*)void switchToSerialMode(TxnDesc*)void write<Type>(TxnDesc*,Type*,Type)Type read<Type>(TxnDesc*,Type*)void logValue<Type>(TxnDesc*,Type*)void logMem(TxnDesc*,void*,size t)void writeAfterRead(Type*,Type)void writeAfterWrite<Type>(Type*,Type)Type readAfterRead<Type>(Type*)Type readAfterWrite<Type>(Type*)Type readForWrite<Type>(Type*)
18Software and Services Group 18/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
Example of translation
__declcpec(tm_callable)
foo();
//...
__tm_atomic {
foo(++c);
}
Label:
action = StartTX();
if (action & restoreLiveVariables)
liveX = saved_liveX;
if (action & retryTransaction)
goto Label;
else if (action & saveLiveVariables)
saved_liveX = liveX;
if (action & InstrumentedCode) { temp = Read<Int>(c);
temp = temp + 1;
Write<Int>(c);
foo_$TXN(c); // TM-version of foo
} else { c = c + 1;
foo(c);
} CommitTX();
19Software and Services Group 19/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
Performance data (SPLASH2 benchmarks)
0.000
1.000
2.000
3.000
4.000
5.000
6.000
7.000
8.000
1T 2T 4T 8T
FGL
STM
GLOCK
Speedup vs. serial execution on 8T
0.000
1.000
2.000
3.000
4.000
5.000
6.000
7.000
8.000
BARNES
CHOLESKY
FFTFM
M
RADIOSIT
Y
RAYTRACE
GEOMEAN
FGL
STM
GLOCK
Scalability of BARNESS benchmark
Performance of STM depends highly on workload
20Software and Services Group 20/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
Performance analysis
• For many workloads STM outperforms Single Global Lock
−On 8 or more threads
−Results still lower than Fine Grain Locks
−More optimizations and improvements are on the way
• RAYTRACE is and example where ceiling is hit
−When all 8 threads are running the HW is 100% busy and thus STM code increases the machine load beyond the limit and performance drops
−Picture is quite different for 8T on 16-way HW, but for 16T is the same
−Optimization to run short transactions in Global Lock mode helps much to thisbenchmark
• Optimizations are not good for all benchmarks: RADIOSYTY performs best in pure STM runs
0.000
1.000
2.000
3.000
4.000
5.000
6.000
7.000
8.000
FGL
STM
GLOCK
Small transactions optimization is ON
21Software and Services Group 21/20
Copyright © 2008, Intel Corporation. All rights reserved.* Other names and brands may be claimed as the property of others
Conclusion
• Pros−Simple programming model−Composability−Failure atomicity−Decent scalability at 8 threads and beyond for many workloads
• Cons−Overhead−Profitability highly depend on workload−Retries eat power−Function annotation are not that convenient
• Intel C/C++ STM compiler prototype edition is publicly available at http://whaif.intel.com −Includes Intel STM library, user documentation and examples−Active discussion forum for questions and comments
We target the future