intel threading building blocks tbb

86
ntel Threading Building Block TBB http://www.threadingbuildingblocks.org

Upload: candace-myers

Post on 31-Dec-2015

75 views

Category:

Documents


5 download

DESCRIPTION

Intel Threading Building Blocks TBB. http://www.threadingbuildingblocks.org. Overview. TBB enables you to specify tasks instead of threads TBB targets threading for performance TBB is compatible with other threading packages TBB emphasize scalable, data parallel programming - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Intel Threading Building Blocks TBB

Intel Threading Building BlocksTBB

http://www.threadingbuildingblocks.org

Page 2: Intel Threading Building Blocks TBB

Overview

• TBB enables you to specify tasks instead of threads

•TBB targets threading for performance

•TBB is compatible with other threading packages

•TBB emphasize scalable, data parallel programming

•TBB relies on generic programming

Page 3: Intel Threading Building Blocks TBB
Page 4: Intel Threading Building Blocks TBB
Page 5: Intel Threading Building Blocks TBB
Page 6: Intel Threading Building Blocks TBB
Page 7: Intel Threading Building Blocks TBB

void SerialAverage(float *output,float *input, size_t){ for(size_t i=0 ; I < n ; i++) { output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f);

}}

Example 1: Compute Average of 3 numbers

Page 8: Intel Threading Building Blocks TBB

#include "tbb/parallel_for.h"#include "tbb/blocked_range.h"#include "tbb/task_scheduler_init.h"

using namespace tbb;

class Average {public: float* input; float* output; void operator()( const blocked_range<int>& range ) const { for( int i=range.begin(); i!=range.end(); ++i ) output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f); }};

// Note: The input must be padded such that input[-1] and input[n] // can be used to calculate the first and last output values.void ParallelAverage( float* output, float* input, size_t n ) { Average avg; avg.input = input; avg.output = output; parallel_for( blocked_range<int>( 0, n, 1000 ), avg );}

Page 9: Intel Threading Building Blocks TBB

//! Problem sizeconst int N = 100000;

int main( int argc, char* argv[] ) {

float output[N]; float raw_input[N+2]; raw_input[0] = 0; raw_input[N+1] = 0; float* padded_input = raw_input+1;

task_scheduler_init ;

............ ............ ParallelAverage(output, padded_input, N);

}

Page 10: Intel Threading Building Blocks TBB

Serial

Parallel

Page 11: Intel Threading Building Blocks TBB
Page 12: Intel Threading Building Blocks TBB

Notes on Grain Size

Page 13: Intel Threading Building Blocks TBB

Effect of Grain Size on A[i]=B[i]*c Computation (one million indices)

Page 14: Intel Threading Building Blocks TBB

Auto Partitioner

Page 15: Intel Threading Building Blocks TBB

Reduction

Serial

Parallel

Page 16: Intel Threading Building Blocks TBB

Class for use by Parallel Reduce

Page 17: Intel Threading Building Blocks TBB

Split-Join Sequence

Page 18: Intel Threading Building Blocks TBB
Page 19: Intel Threading Building Blocks TBB
Page 20: Intel Threading Building Blocks TBB
Page 21: Intel Threading Building Blocks TBB
Page 22: Intel Threading Building Blocks TBB
Page 23: Intel Threading Building Blocks TBB
Page 24: Intel Threading Building Blocks TBB
Page 25: Intel Threading Building Blocks TBB
Page 26: Intel Threading Building Blocks TBB
Page 27: Intel Threading Building Blocks TBB

Parallel_scan (parallel prefix)

Page 28: Intel Threading Building Blocks TBB

Parallel Scan

Page 29: Intel Threading Building Blocks TBB

Parallel Scan with Partitioner

• Parallel_scan, breaks a range into subranges and computes a partial result in each subrange in parallel.

• Then, the partial result for subrange k is used to update the information in subrange k+1, starting from k=0 and proceeding sequentially up to the last subrange.

• Finally, each subrange uses its updated information to compute its final result in parallel with all the other subranges.

Page 30: Intel Threading Building Blocks TBB

Parallel Scan Requirements

Page 31: Intel Threading Building Blocks TBB

Parallel Scan

Page 32: Intel Threading Building Blocks TBB
Page 33: Intel Threading Building Blocks TBB

Parallel While and Pipeline

Page 34: Intel Threading Building Blocks TBB

Linked List Example

• Assume Foo takes at least a few thousand instructions to run, then it is possible to get speedup by parallelizing

Page 35: Intel Threading Building Blocks TBB

Parallel_while

• Requires two user-defined objects1. Object that defines the stream of objects - must have pop_if_present - pop_if_present need not be thread safe - nonscalable, since fetching is serialized - possible to get useful speedup 2. Object that defines the loop body i.e. the operator()

Page 36: Intel Threading Building Blocks TBB

Stream of Objects

Page 37: Intel Threading Building Blocks TBB

Loop Body (Operator)

Page 38: Intel Threading Building Blocks TBB

Parallelized Foo Acting on Linked List

• Note: the body of parallel_while can add more work by calling

w.add(item)

Page 39: Intel Threading Building Blocks TBB

Notes on parallel_while scaling

Page 40: Intel Threading Building Blocks TBB

Pipelining

•Single pipeline

Page 41: Intel Threading Building Blocks TBB

Pipelining

•Parallel pipeline

Page 42: Intel Threading Building Blocks TBB

Pipelining Example

Page 43: Intel Threading Building Blocks TBB

Character Buffer

Page 44: Intel Threading Building Blocks TBB

Top Level Code for for Building and Running the Pipeline

Page 45: Intel Threading Building Blocks TBB
Page 46: Intel Threading Building Blocks TBB
Page 47: Intel Threading Building Blocks TBB
Page 48: Intel Threading Building Blocks TBB

Non-Linear Pipes

Page 49: Intel Threading Building Blocks TBB

Trick soln: Use Topologically Sorted Pipeline

•Note that the latency is increased

Page 50: Intel Threading Building Blocks TBB

parallel_sort

Page 51: Intel Threading Building Blocks TBB

CONTAINERS

Page 52: Intel Threading Building Blocks TBB

concurrent_queue<T>

•The template class concurrent_queue<T> implements a concurrent queue with values of type T.

•Multiple threads may simultaneously push and pop elements from the queue.

•In a single-threaded program, a queue is a first-in first-out structure.

•But if multiple threads are pushing and popping concurrently, the definition of first is uncertain.

•The only guarantee of ordering offered by concurrent_queue is that if a thread pushes multiple values, and another thread pops those same values, they will be popped in the same order that they were pushed.

Page 53: Intel Threading Building Blocks TBB

•Pushing is provided by the push method. There are blocking and nonblocking flavors of pop:

•pop_if_present : This method is nonblocking: it attempts to pop a value, and if it cannot because the queue is empty, it returns anyway.

•pop : This method blocks until it pops a value. If a thread must wait for an item to become available and it has nothing else to do, it should use pop(item) and notwhile(!pop_if_present(item)) continue; because pop uses processor resources more efficiently than the loop.

concurrent_queue<T>

Page 54: Intel Threading Building Blocks TBB

concurrent_queue<T>

•Unlike STL, TBB containers are not templated with an allocator argument. The library retains control over memory allocation.

•Unlike most STL containers, concurrent_queue::size_type is a signed integral type, not unsigned. This is because concurrent_queue::size( ) is defined as the number of push operations started minus the number of pop operations started.

•By default, a concurrent_queue<T> is unbounded. It may hold any number of values until memory runs out. It can be bounded by setting the queue capacity with the set_capacity method. Setting the capacity causes push to block until there is room in the queue.

Page 55: Intel Threading Building Blocks TBB

concurrent_queue Example

Page 56: Intel Threading Building Blocks TBB

concurrent_vector<T>

•A concurrent_vector<T> is a dynamically growable array of items of type T for which it is safe to simultaneously access elements in the vector while growing it.

•However, be careful not to let another task access an element that is under construction or is otherwise being modified.

•A concurrent_vector<T> never moves an element until the array is cleared, which can be an advantage over the STL std::vector (which can move elements to resize the vector), even for single-threaded code.

Page 57: Intel Threading Building Blocks TBB

concurrent_vector<T>

Page 58: Intel Threading Building Blocks TBB

concurrent_vector<T>

Page 59: Intel Threading Building Blocks TBB

concurrent_vector Example

Page 60: Intel Threading Building Blocks TBB

concurrent_hash_map<Key,T,HashCompare>

Page 61: Intel Threading Building Blocks TBB
Page 62: Intel Threading Building Blocks TBB
Page 63: Intel Threading Building Blocks TBB

concurrent_hash_map<Key,T,HashCompare>

•A concurrent_hash_map acts as a container of elements of type std::pair<const Key,T>.

•Typically, when accessing a container element, you are interested in either updating it or reading it.

•The template class concurrent_hash_map supports these two operations with the accessor and const_accessor classes

•An accessor represents update (write) access. As long as it points to an element, all other attempts to look up that key in the table block until the accessor is done.

•const_accessor is similar, except that it represents read-only access. Therefore, multiple const_accessors can point to the same element at the same time.

Page 64: Intel Threading Building Blocks TBB

concurrent_hash_map<Key,T,HashCompare>

• find and insert methods take an accessor or const_accessor as an argument.

•The choice tells concurrent_hash_map whether you are asking for update or read-only access, respectively.

•Once the method returns, the access lasts until the accessor or const_accessor is destroyed.

•Because having access to an element can block other threads, try to shorten the lifetime of the accessor or const_accessor. To do so, declare it in the innermost block possible.

•To release access even sooner than the end of the block, use the release method.

•The method remove(key) can also operate concurrently. It implicitly requests write access. Therefore, before removing the key, it waits on any other accesses on the key.

Page 65: Intel Threading Building Blocks TBB

Use of release method

Page 66: Intel Threading Building Blocks TBB

MUTUAL EXCLUSION in TBB

•In TBB, you program in terms of tasks, not threads, therefore, you will probably think of mutual exclusion of tasks.

•TBB offers two kinds of mutual exclusion:-Mutexes :

These will be familiar to anyone who has used locks in other environments, and they include common variants such as reader-writer locks. - Atomic operations These are based on atomic operations offered by hardware processors, and they provide a solution that is simpler and faster than mutexes in a limited set of situations.

Page 67: Intel Threading Building Blocks TBB

MUTEXES

•In TBB, mutual exclusion is implemented by classes known as mutexes and locks.

•A mutex is an object on which a task can acquire a lock. Only one task at a time can have a lock on a mutex; other tasks have to wait their turn.

Page 68: Intel Threading Building Blocks TBB

MUTEX EXAMPLE

•With the object-oriented interface, destruction of the scoped_lock object causes the lock to be released, no matter whether the protected region was exited by normal control flow or an exception.

Page 69: Intel Threading Building Blocks TBB

MUTEX EXAMPLE

Page 70: Intel Threading Building Blocks TBB

FLAVORS of MUTEXES

•The simplest mutex is the spin_mutex. A task trying to acquire a lock on a busy spin_mutex waits until it can acquire the lock. A spin_mutex is appropriate when the lock is held for only a few instructions.

Page 71: Intel Threading Building Blocks TBB

•Some mutexes are called scalable. In a strict sense, this is not an accurate name because a mutex limits execution to one task at a time and is therefore necessarily a drag on scalability.

•A scalable mutex is rather one that does no worse than forcing single- threaded performance.

•A mutex actually can do worse than serialize execution if the waiting tasks consume excessive processor cycles and memory bandwidth, reducing the speed of tasks trying to do real work.

•Scalable mutexes are often slower than nonscalable mutexes under light contention, so a nonscalable mutex may be better. When in doubt, use a scalable mutex.

SCALABLE MUTEX

Page 72: Intel Threading Building Blocks TBB

•Mutexes can be fair or unfair.

•A fair mutex lets tasks through in the order they arrive.

•Fair mutexes avoid starving tasks. Each task gets its turn.

•However, unfair mutexes can be faster because they let tasks that are running go through first, instead of the task that is next in line, which may be sleeping because of an interrupt.

FAIR MUTEX

Page 73: Intel Threading Building Blocks TBB

•Mutexes can be reentrant or nonreentrant.

•A reentrant mutex allows a task that is already holding a lock on the mutex to acquire another lock on it.

•This is useful in some recursive algorithms, but it typically adds overhead to the lock implementation.

REENTRANT MUTEX

Page 74: Intel Threading Building Blocks TBB

SLEEP or SPIN MUTEX

•Mutexes can cause a task to spin in user space or sleep while it is waiting.

•For short waits, spinning in user space is fastest because putting a task to sleep takes cycles.

•For long waits, sleeping is better because it causes the task to give up its processor to some task that needs it. Spinning is also undesirable in processors with multiple-task support in a single core, such as Intel processors with hyperthreading technology.

Page 75: Intel Threading Building Blocks TBB

MUTEX TYPES

•A spin_mutex is nonscalable, unfair, nonreentrant, and spins in user space. It would seem to be the worst of all possible worlds, except that it is very fast in lightly contended situations. If you can design your program so that contention is somehow spread out among many spin mutexes, you can improve performance over other kinds of mutexes. If a mutex is heavily contended, your algorithm will not scale anyway. Consider redesigning the algorithm instead of looking for a more efficient lock.

•A queuing_mutex is scalable, fair, nonreentrant, and spins in user space. Use it when scalability and fairness are important.

• A spin_rw_mutex and a queuing_rw_mutex are similar to spin_mutex and queuing_mutex, but they additionally support reader locks.

•A mutex is a wrapper around the system’s native mutual exclusion mechanism. On Windows systems, it is implemented on top of a CRITICAL_SECTION. On Linux systems, it is implemented on top of a pthread mutex.

Page 76: Intel Threading Building Blocks TBB

•Requests for a reader lock are distinguished from requests for a writer lock via an extra Boolean parameter in the constructor for scoped_lock. The parameter is false to request a reader lock and true to request a writer lock. It defaults to true when it is omitted

•It is also possible to upgrade a reader lock to a writer lock by using the method upgrade_to_writer.

Reader/Writer, Upgrade/Downgrade

Page 77: Intel Threading Building Blocks TBB

Reader Writer Mutex Example

Note: upgrade_to_writer returns true if the upgrade happened withoutre-acquiring the lock and false if opposite

Page 78: Intel Threading Building Blocks TBB

ATOMIC OPERATIONS

•Atomic operations are a fast and relatively easy alternative to mutexes.

•They do not suffer from the deadlock and convoying problems

•The main limitation of atomic operations is that they are limited in current computer systems to fairly small data sizes: the largest is usually the size of the largest scalar, often a double-precision floating-point number.

•Atomic operations are also limited to a small set of operations supported by the underlying hardware processor.

•The class atomic<T> implements atomic operations with C++ style.

Page 79: Intel Threading Building Blocks TBB

Fundamental Operations on an atomic variable

Page 80: Intel Threading Building Blocks TBB

SCALABLE MEMORY ALLOCATION

•The scalable memory allocator is cleanly separate from the rest of TBB so that your choice of memory allocator for concurrent usage is independent of your choice of parallel algorithm and container templates.

•When ordinary, nonthreaded allocators are used, memory allocation becomes a serious bottleneck in a multithreaded program because each thread competes for a global lock for each allocation and deallocation of memory from a single global heap. •TBB scalable allocator is built for scalability and speed. In some situations, this comes at a cost of wasted virtual space. Specifically, it wastes a lot of space when allocating blocks in the 9K to 12K range. It is also not yet terribly sophisticated about paging issues.

Page 81: Intel Threading Building Blocks TBB

FALSE SHARING

•False sharing occurs when multiple threads use memory locations that are close together, even if they are not actually using the same memory locations.

•Because processor cores fetch and hold memory in chunks called cache lines, any memory accesses within the same cache line should be done only by the same thread.

•Otherwise, accesses to memory on the same cache line will cause unnecessary contention and swapping of cache lines back and forth, resulting in slowdowns which can easily be a hundred times worse for the affected memory accesses.

•Example: float A[1000] float B[1000]

Page 82: Intel Threading Building Blocks TBB

Memory Allocators

• TBB scalable memory allocator utilizes a memory management algorithm divided on a per-thread basis to minimize contention associated with allocation from a single global heap.

•TBB offers two choices, both similar to the STL template class, std::allocator: - scalable_allocator This template offers just scalability, but it does not completely protect against false sharing. Memory is returned to each thread from a separate pool, which helps protect against false sharing if the memory is not shared with other threads.

-cache_aligned_allocator This template offers both scalability and protection against false sharing. It addresses false sharing by making sure each allocation is done on a cache line.

•Note that protection against false sharing between two objects is guaranteed only if both are allocated with cache_aligned_allocator.

Page 83: Intel Threading Building Blocks TBB

Memory Allocators

•The functionality of cache_aligned_allocator comes at some cost in space because it allocates in multiples of cache-line-size memory chunks, even for a small object.

• The padding is typically 128 bytes. Hence, allocating many small objects with cache_ aligned_allocator may increase memory usage.

Page 84: Intel Threading Building Blocks TBB

Library to Link

•Both the debug and release versions for Threading Building Blocks are divided into two dynamic shared libraries, one with general support and the other with scalable memory allocator.

•The latter is distinguished by malloc in its name (although it does not define a routine actually called malloc).

•For example, the release versions for Windows are tbb.dll and tbbmalloc.dll, respectively.

Page 85: Intel Threading Building Blocks TBB

Using the Allocator Argument to C++ STL Template Classes

•The following code shows how to declare an STL vector that uses cache_aligned_allocator for allocation:

std::vector< int, cache_aligned_allocator<int> >;

Page 86: Intel Threading Building Blocks TBB

MALLOC/FREE/REALLOC/CALLOC