Optimizing Sharing Patterns and Locality via Thread Migration
Vadim Gleizer
Supervisor: Prof. Assaf Schuster

Contributions of this research
• Internal Distributed Shared Memory (DSM) Mechanisms
• Thread Migration (TM) in DSM Systems
• Load Balancing in DSM Systems

Internal DSM Mechanisms
• An internal DSM mechanism, or DSM handler, is responsible for guaranteeing a consistent memory view on each workstation, as follows:
  • When a DSM region becomes invalid, it is protected
  • Each access to the protected area causes an exception
  • The internal DSM mechanism catches and handles these exceptions

Implementation of DSM Handlers
• An exception handling service provided by the operating system significantly simplifies this task
• Consider the Win32 Structured Exception Handling (SEH) service of Windows NT:
  • A block of code that is allowed to use DSM is wrapped in an exception block using the Win32 __try/__except keywords, similarly to try/catch blocks in C++:
      __try {
          user_main();
      } __except (DSM_handler()) {
          /* handler block; DSM_handler() is evaluated as the filter expression */
      }
• Let us see how such services work and the drawbacks of using them

Inside the SEH Service
• For each type of exception the CPU generates a code (an interrupt vector number, in hexadecimal), e.g. division by zero has code 0, a page fault has code E, and a GPF (General Protection Fault) has code D
• In the case of a page fault exception, the kernel routine _KiTrap0E is called

Inside the SEH Service (cont.)
• The following sequence of calls occurs before control is passed to the DSM_handler:
  _KiTrap0E → KiUserExceptionDispatcher → RtlDispatchException → RtlpExecuteHandlerForException → ExecuteHandler → __except_handler3 → DSM_handler

Drawbacks of using SEH in DSM Systems
1. Performance
  • The SEH service is highly time-consuming, while most of its functionality is unnecessary for the DSM handler
  • The user’s exception handlers are called before the DSM handler
2. The programmer may accidentally intercept a DSM exception
  • The internal DSM handler should work transparently to the programmer
  • Thus, a programmer who does not know that the DSM handler uses SEH may accidentally intercept a DSM exception

User-Mode First-Chance Exception Handling (UMFC-EH)
• Only the kernel-level part of SEH is used, i.e. the DSM_handler is called directly by _KiTrap0E
• Thus, exceptions are intercepted before any of the SEH user-mode functions is called:
  • _KiTrap0E → DSM_handler
  • instead of _KiTrap0E → KiUserExceptionDispatcher → RtlDispatchException → RtlpExecuteHandlerForException → ExecuteHandler → __except_handler3 → DSM_handler
• To implement this scheme, the Detours library may be used

UMFC-EH (cont.)
• Advantages:
  • Solves both drawbacks of the SEH service
  • No __try/__except blocks are needed
• Drawbacks:
  • The kernel-level part of SEH is still used
  • All exceptions are intercepted, e.g. division by zero

Kernel-Mode First-Chance Exception Handling (KMFC-EH)
• Exceptions are intercepted in kernel mode by a special supervisor-level device driver, which we call DSM_filter
• The DSM_filter informs the DSM_handler about DSM exceptions
• Thus, the SEH service is not used

KMFC-EH (cont.)
• Advantages:
  • preserves all the advantages of the UMFC-EH scheme
  • SEH is not used, i.e. the CPU directly informs the DSM_filter about page fault exceptions
  • only page fault exceptions are intercepted
• Drawbacks:
  • all page fault exceptions are intercepted by the DSM_filter, including those of other processes
    • fortunately, the overhead of this drawback is low

Performance Evaluation
• Our experimental environment consists of the Millipede 4.0 DSM system running on a cluster of eight uniprocessor workstations interconnected by a switched Myrinet LAN
• Each workstation is equipped with:
  • Pentium-II 300MHz
  • 128MB of RAM
  • 512KB of L2 cache
  • Windows NT 4.0 SP6 operating system
• We have tested our DSM handlers on several benchmarks commonly used for DSM systems and on microbenchmarks

Performance Evaluation (cont.)
Microbenchmarks (100000 page faults):
Scheme     Time     Relative to SEH
SEH        14 sec   100%
UMFC-EH    8 sec    57%
KMFC-EH    3 sec    21%
Related results (Brazos):
Handler        Time     Platform
SEH            20 sec   200 MHz Pentium Pro with 192 MB of RAM
Segv Handler   47 sec   Solaris 2.5.1 running on the same hardware
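The microbenchmark totals above can be turned into a per-fault cost and a relative percentage with simple arithmetic. The following C sketch shows that calculation; the function names are illustrative and are not part of Millipede.

```c
#include <assert.h>

/* Average cost of one page fault in microseconds, given the total run
 * time in seconds and the number of page faults in the run. */
double per_fault_usec(double total_sec, long faults)
{
    assert(faults > 0);
    return total_sec * 1e6 / (double)faults;
}

/* Run time of a scheme relative to a baseline, rounded to a whole percent. */
int relative_percent(double sec, double baseline_sec)
{
    assert(baseline_sec > 0.0);
    return (int)(100.0 * sec / baseline_sec + 0.5);
}
```

With the figures from the microbenchmark, per_fault_usec(14.0, 100000) gives 140 µsec per fault for SEH, and the same call yields 80 µsec for UMFC-EH and 30 µsec for KMFC-EH.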

Thread Migration (TM) in DSM Systems
Introduction:
• A thread can be stopped at almost any moment of its execution and relaunched on another machine from the point at which it was stopped
• Applications of this facility:
  • load balancing
  • communication reduction
  • fault tolerance
  • cluster management
  • powerful programming primitive

Designing a TM Mechanism
• Restrictions on TM: there are some situations in which migration makes no sense:
  • the thread owns some local operating system resources, e.g. a synchronization object
  • the thread executes a location-dependent operation, e.g. prints a message
• Therefore the programmer should be aware of thread migration and explicitly mark the situations in which a thread cannot migrate

Designing a TM Mechanism (cont.)
• The state of a thread consists of:
  • code
  • global data
  • heap data
  • stack data
  • the processor’s register set
  • other thread-specific data

Designing a TM Mechanism (cont.)
[Figure: stack contents of a thread on Host 1 and on Host 2; labels: A; addresses: 1000, 1004, 2000, 2004]

Designing a TM Mechanism (cont.)
Stack address translation
• Drawbacks:
  • register values and stack values have to be examined and possibly updated (very inefficient for large stacks)
  • identification of pointers is a correctness problem, since a value may merely resemble a pointer; possible solutions:
    • special compiler or hardware support: more complex compiled code, often prevents compiler optimizations
    • special programming primitives that register all pointers: harms efficiency and simplicity of programming, limits free use of pointers
  • the whole stack has to be copied at migration time

Designing a TM Mechanism (cont.)
Creating all mobile threads at DSM initialization time
• Advantages:
  • no pointer investigation and modification
• Drawbacks:
  • lack of scalability: the maximum number of threads is created on each host
  • lack of portability: may not work in future versions of the same operating system and cannot be used for heterogeneous systems
  • the whole stack has to be copied at migration

Designing a TM Mechanism (cont.)
Placement of stacks in a predefined memory region
• Advantages:
  • no pointer investigation and modification
  • scalability: threads are created on application demand or at migration time
  • portability
• Drawbacks:
  • the whole stack has to be copied at migration

Designing a TM Mechanism (cont.)
Placement of stacks in a DSM region
• Advantages:
  • preserves all the advantages of the previous approach
  • the stack does not have to be copied at migration

Implementation of TM
Placement of stacks in a predefined memory region, or the default stack approach
• the same address region is reserved on each host at DSM initialization time
• at creation, each thread receives a slot for its stack according to its id
• UNIX-like operating systems provide an option to control the stack location in their thread creation API
• this approach is difficult to implement in Windows NT, since there is no conventional way to control the stack location
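The slot assignment described above can be sketched as simple address arithmetic. The following C fragment is a minimal illustration, assuming a hypothetical region base, slot size, and thread limit (STACK_REGION_BASE, STACK_SLOT_SIZE, MAX_THREADS); none of these values come from Millipede.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of the reserved stack region (same on every host). */
#define STACK_REGION_BASE ((uintptr_t)0x40000000u)
#define STACK_SLOT_SIZE   ((uintptr_t)0x10000u)   /* 64 KB per thread */
#define MAX_THREADS       256u

/* Lowest address of the stack slot assigned to a thread id. */
uintptr_t stack_slot_base(unsigned thread_id)
{
    assert(thread_id < MAX_THREADS);
    return STACK_REGION_BASE + (uintptr_t)thread_id * STACK_SLOT_SIZE;
}

/* One past the highest address of the slot. On x86 the stack grows
 * downward, so the initial stack pointer would be placed here. */
uintptr_t stack_slot_top(unsigned thread_id)
{
    return stack_slot_base(thread_id) + STACK_SLOT_SIZE;
}
```

Because every host reserves the same region, a given thread id maps to the same stack address on every host, so pointers into the stack remain valid after migration.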

Implementation of TM (cont.)
Stack location control in Windows NT
• an application asks the DSM system to create a thread
• the thread is created in the suspended state (the initial stack is empty)
• the address of the initial stack is obtained through the thread's ESP register, and the initial stack is freed
• the value of the ESP register is changed to point to the new stack location
• a pointer to the Win32 data structure Thread Information Block (TIB) is obtained through the FS register
• two fields inside the TIB are modified accordingly: pvStackUserTop and pvStackUserBase
• the thread is resumed

Implementation of TM (cont.)
Placement of stacks in a DSM region
• a separate region is added to DSM
• the stack location of a thread is changed to a slot inside the new DSM region, similarly to the previous approach
• however, the stack cannot be handled as a regular DSM region

Implementation of TM (cont.)
Why can a thread’s stack not be handled as a regular DSM region? Consider an example:
• thread A migrates from host 1 to host 2
• the stack of thread A remains on host 1, since it is placed in DSM; therefore the first access to the stack on host 2 will cause a page fault exception
• DSM_handler should be called in order to bring in the missing part of the stack
• however, the stack is protected, so DSM_handler cannot be called in the regular way ...
[Figure: thread A migrates from host 1 to host 2]

Implementation of TM (cont.)
The auxiliary stack approach:
• this approach is based on the KMFC-EH technique
• a memory region, called the auxiliary stacks region, is allocated on each host at DSM initialization time
• page fault exceptions are intercepted at kernel level by DSM_filter (the driver)
• when an exception occurs on a stack, DSM_filter changes the stack location of the thread to a slot inside the auxiliary stacks region and calls DSM_handler
• DSM_handler brings in the page for the original stack, sets the appropriate protection, switches the stack back, and transfers control to the thread

TM in the Millipede 4.0 DSM System
In sum, our TM mechanism has the following powerful features:
• two TM approaches
• kernel-level threads are migrated
• SEH support
• the FastMessages service is used to transfer migrating threads efficiently
• thread suspension and resumption are location-independent and may be recursive
• safety of all API functions provided by Millipede 4.0 is supported
• statistics tool

Performance Evaluation
[Figure: Cost of Communication in Myrinet; x-axis: bytes (0 to 4608); y-axis: µsecs (0 to 140); data labels: 40.1, 44, 69, 92, 112, 126.5]

Performance Evaluation (cont.)
• The cost of Win32 calls used in TM (averaged over 1,000,000 instances of each call):
Win32 Function     Cost
GetThreadContext   8.86 µsec
SetThreadContext   9.57 µsec
SuspendThread      5.42 µsec
ResumeThread       6.35 µsec
• Performance of TM in Millipede 4.0 (averaged over 1,000,000 TMs with a stack size of 176B):
Component       Cost
Win32 Calls     30.2 µsec
Network/copy    149.9 µsec
Total TM Time   180.1 µsec

Performance Evaluation (cont.)
• Migration time on various systems as a function of stack size (µsec):
System          1K 2K 4K       Hardware Characteristics
Ariadne         1100 1400      SPARC, Ethernet
PM2             210            450MHz Pentium-II
ActiveThreads   630 1100       4*50MHz HyperSPARC
CVM             1597 1704B     66.7MHz Power2, 128MB of RAM, 64KB of cache, 40 MB switch
Brazos          1010           4*200MHz Pentium Pro, 256MB of RAM, 256KB of cache, Gigabit Ethernet
Millipede       70000          100-MBs Ethernet
Millipede 4.0   202 219 256    Pentium-II 300MHz, 128MB of RAM, 512KB of cache

Load Balancing (LB) in DSM Systems
Introduction:
• Definition of load in DSM systems:
  • the CPU time that a computational thread consumes
  • the amount of communication that the thread causes during its work
• Dynamic load sharing computes a less precise placement of threads but, due to its relaxed requirements, can often be as efficient as dynamic load balancing

Introduction (cont.)
[Figure: three panels, each showing items numbered 1 to 15]

Designing an LS Mechanism
The Goals of Load Sharing
• A uniform distribution of threads among the stations
• Minimization of communication overheads
  • Improving the locality of accesses
  • Avoiding page ping-pong situations, in which a page is transferred frequently among several hosts

Designing an LS Mechanism (cont.)
• We propose a load sharing mechanism that works as a separate module, called the Load Sharing Module (LS-Module).
• The LS-Module performs the following tasks:
  • load imbalance detection
  • load imbalance treatment
  • ping-pong detection
  • ping-pong treatment

Designing an LS Mechanism (cont.)
The Load Imbalance Detection protocol has a centralized entity called the Load Sharing Server (LS-Server) that:
• knows the power parameter of each host
• is notified by an external module of each change in the load
• for each change in the load, calculates two threshold values l and h for a host, in this way determining whether the host is normally loaded
• begins the load imbalance treatment protocol when load imbalance is detected
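The threshold test performed by the LS-Server can be sketched as follows. The slides do not say how l and h are calculated, so this C fragment assumes a hypothetical rule: a host's fair share of threads is proportional to its power parameter, and l and h lie a fixed tolerance below and above that share. Measuring load as a thread count is also an assumption made only for this illustration.

```c
#include <assert.h>

enum load_state { UNDERLOADED, NORMAL, OVERLOADED };

struct thresholds {
    double low;   /* l: below this the host is underloaded */
    double high;  /* h: above this the host is overloaded  */
};

/* Assumed rule: fair share proportional to host power, widened by a
 * tolerance band (e.g. 0.25 for +/- 25%). */
struct thresholds compute_thresholds(double power, double total_power,
                                     int total_threads, double tolerance)
{
    assert(total_power > 0.0 && power >= 0.0);
    double share = (double)total_threads * power / total_power;
    struct thresholds t = { share * (1.0 - tolerance),
                            share * (1.0 + tolerance) };
    return t;
}

/* A host is normally loaded when its load lies within [l, h]. */
enum load_state classify(int threads_on_host, struct thresholds t)
{
    if (threads_on_host < t.low)
        return UNDERLOADED;
    if (threads_on_host > t.high)
        return OVERLOADED;
    return NORMAL;
}
```

For example, with four hosts of equal power, 16 threads, and a tolerance of 0.25, the thresholds are l = 3 and h = 5, so a host running 7 threads would trigger the load imbalance treatment protocol.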

Designing an LS Mechanism (cont.)
• The Load Imbalance Treatment protocol is performed by the LS-Server, which decides how many threads, say n, should be migrated from an overloaded host, say H1, to balance its load
• An entity called the Load Sharing Client (LS-Client), which runs on each host, is responsible for selecting the n threads whose migration will best minimize future communication

Designing an LS Mechanism (cont.)
• The Ping-Pong Detection protocol is performed by the Ping-Pong Client (PP-Client) entity
• Each time there is an access to a remote page, the PP-Client (one per host) is invoked
• A ping-pong situation exists when the following two conditions are met:
  1. local threads attempt to access a page a short time after it leaves the host
  2. a page leaves the host a short time after it has arrived

Designing an LS Mechanism (cont.)
• The Ping-Pong Treatment protocol is performed by a centralized Ping-Pong Server (PP-Server) entity
• The PP-Server determines which group of threads participates in a ping-pong, then chooses a destination host and migrates the threads to this host
• If too many threads participate in a ping-pong, or a ping-pong is detected a short time after it has been resolved, the PP-Server decides to treat the ping-pong using delays

LS in the Millipede 4.0 DSM System
• We have implemented the load sharing mechanism in the Millipede 4.0 DSM system
• Millipede 4.0 architecture
  • The Thread-Server module
  • The TM module
  • The LS module:
    • one centralized LS-Server
    • LS-Clients (one per host)
    • PP-Clients (one per host)

LS in the Millipede 4.0 DSM System (cont.)
Access History
• In order to select the threads for migration, we keep an access history for each thread
• The access history contains at most one entry for each page that was referenced by the local threads in the last Tepoch time units
• The access history has to be updated as time passes
• The access history also keeps an old history, or prehistory
  • it summarizes the old access history of a thread
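A minimal C sketch of such an access history is shown below. It assumes a fixed-capacity table, an integer clock, and a prehistory that is simply a count of references that have aged out; the names MAX_ENTRIES, T_EPOCH, history_record, and history_age are hypothetical, and the slides do not specify how Millipede summarizes the prehistory.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENTRIES 64   /* hypothetical capacity of one history    */
#define T_EPOCH     10   /* hypothetical epoch length in time units */

struct access_entry {
    unsigned long page;  /* page identifier                         */
    long last_access;    /* time of the most recent reference       */
    unsigned count;      /* references recorded for this entry      */
};

struct access_history {
    struct access_entry entries[MAX_ENTRIES];
    size_t n;                  /* number of live entries            */
    unsigned long prehistory;  /* assumed summary: aged-out refs    */
};

/* Record a reference to `page` at time `now`, keeping at most one
 * entry per page. A reference is dropped if the table is full. */
void history_record(struct access_history *h, unsigned long page, long now)
{
    assert(h != NULL);
    for (size_t i = 0; i < h->n; i++) {
        if (h->entries[i].page == page) {
            h->entries[i].last_access = now;
            h->entries[i].count++;
            return;
        }
    }
    if (h->n < MAX_ENTRIES) {
        h->entries[h->n].page = page;
        h->entries[h->n].last_access = now;
        h->entries[h->n].count = 1;
        h->n++;
    }
}

/* Fold entries not referenced within the last T_EPOCH time units into
 * the prehistory and compact the remaining entries. */
void history_age(struct access_history *h, long now)
{
    assert(h != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < h->n; i++) {
        if (now - h->entries[i].last_access > T_EPOCH)
            h->prehistory += h->entries[i].count;
        else
            h->entries[kept++] = h->entries[i];
    }
    h->n = kept;
}
```

Calling history_age periodically keeps the table bounded to recently referenced pages while the prehistory counter retains a coarse record of older behavior.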

LS in the Millipede 4.0 DSM System (cont.)
[Figure: Access History Structure; per-thread histories (Thread 0 ... Thread 7), per-page entries (Page 0x0DCC, Page 0xACDC, ...) with timestamps (0:12:00, 0:12:01, 0:12:13), and a Prehistory entry]

LS in the Millipede 4.0 DSM System (cont.)
Thread Selection Algorithm
• A heuristic value h(j) is calculated for each thread j on the local host L. It takes into account the following characteristics:
  • Maximal frequency of remote references to pages on R
  • Minimal access frequency of the threads remaining in L to the pages used by the selected threads
  • Minimal access frequency to local pages
  • Maximal frequency of any remote references
• Until enough threads are selected, the following procedure is performed:
  • The thread j having the maximal value h(j) is chosen
  • The heuristic value of each thread i that has not yet been selected is revised, taking into account the migration of j
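The greedy loop described above can be sketched in C as follows. The slides list the characteristics that h(j) takes into account but not its formula, so the heuristic below is a hypothetical stand-in: it rewards remote references (twice for pages on R), and penalizes local references and sharing with threads that would remain on L. The fixed sizes NTHREADS and NPAGES and all function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NTHREADS 4   /* hypothetical number of threads on host L */
#define NPAGES   3   /* hypothetical number of tracked pages     */

enum page_loc { PAGE_LOCAL, PAGE_ON_R, PAGE_ELSEWHERE };

/* Hypothetical h(j), computed against the threads not yet selected. */
long heuristic(size_t j, int freq[NTHREADS][NPAGES],
               const enum page_loc loc[NPAGES],
               const int selected[NTHREADS])
{
    long h = 0;
    for (size_t p = 0; p < NPAGES; p++) {
        if (freq[j][p] == 0)
            continue;                    /* thread j does not use page p */
        if (loc[p] == PAGE_ON_R)
            h += 2L * freq[j][p];        /* remote, and on the target R  */
        else if (loc[p] == PAGE_ELSEWHERE)
            h += freq[j][p];             /* any other remote reference   */
        else
            h -= freq[j][p];             /* local page: prefer to stay   */
        for (size_t i = 0; i < NTHREADS; i++)
            if (i != j && !selected[i])
                h -= freq[i][p];         /* sharing with remaining threads */
    }
    return h;
}

/* Greedily select up to n threads; their indices are written to out[]
 * in selection order. Returns the number of threads selected. */
size_t select_threads(size_t n, int freq[NTHREADS][NPAGES],
                      const enum page_loc loc[NPAGES], size_t out[])
{
    int selected[NTHREADS] = {0};
    size_t count = 0;
    assert(out != NULL);
    while (count < n && count < NTHREADS) {
        size_t best = NTHREADS;
        long best_h = 0;
        for (size_t j = 0; j < NTHREADS; j++) {
            if (selected[j])
                continue;
            /* Recomputing h each round revises it for earlier choices. */
            long h = heuristic(j, freq, loc, selected);
            if (best == NTHREADS || h > best_h) {
                best = j;
                best_h = h;
            }
        }
        selected[best] = 1;
        out[count++] = best;
    }
    return count;
}
```

In this sketch the revision step is implemented by recomputing every remaining thread's value in each round, so a thread that shares pages with an already selected thread becomes a more attractive candidate.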

LS in the Millipede 4.0 DSM System (cont.)
Ping-Pong Detection
[Figure: timeline of page P on the local host, with the events "send page P to Hi", "access page P (bring it from Hj)", "receive page P from Hj", "send page P to Hk" and the intervals Tunused, Twaiting, Tuseful between them]
The page ping-pong condition is:
$$\mathrm{PPRatio} = \frac{T_{unused} + T_{useful}}{T_{waiting}} < S$$
(S is called the sensitivity of the ping-pong)
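The ping-pong condition can be checked with a few lines of C. This sketch reads the condition as (Tunused + Tuseful) / Twaiting < S, with all three intervals measured in the same time unit; the function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Ratio of the time the page was away-but-unneeded plus the time it was
 * held locally, to the time local threads spent waiting for it. */
double pp_ratio(double t_unused, double t_waiting, double t_useful)
{
    assert(t_waiting > 0.0);
    return (t_unused + t_useful) / t_waiting;
}

/* A page is considered to be in a ping-pong when the ratio falls below
 * the sensitivity s. */
bool is_ping_pong(double t_unused, double t_waiting, double t_useful,
                  double s)
{
    return pp_ratio(t_unused, t_waiting, t_useful) < s;
}
```

A small ratio means the page is requested again soon after it leaves and leaves again soon after it arrives, which matches the two ping-pong conditions stated for the detection protocol.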

LS in the Millipede 4.0 DSM System (cont.)
Dynamic calculation of a per-page value for page P
• The value depends on the number of threads that are using the page and on their behavior
• For page P it is computed as a constant multiplied by Nthpp and divided by f(Nth), where:
  • Nthpp is the number of threads involved in the ping-pong that reside on the local host
  • Nth is the total number of threads residing on the local host
  • f(Nth) is a function of that number

Performance Evaluation
• We have tested the LS module on several benchmarks that are common in DSM systems, as well as on synthetic microbenchmarks specially designed for this purpose
• We refer to the version of Millipede 4.0 with the LS module as the LS version, and to the version without the LS module as the no-LS version

Performance Evaluation (cont.)
• Microbenchmark applications were designed to simulate various load imbalance situations
• Using the microbenchmark applications, we measured the individual performance of each part of the load sharing protocol:
  • load imbalance treatment
  • ping-pong treatment:
    • locality optimization part
    • stabilization part

Performance Evaluation (cont.)
[Figures: Locality optimization protocol]
[Figures: Stabilization protocol]
[Figures: Load imbalance protocol]
[Figures: Overhead for benchmark applications]

Conclusions
In this thesis we have researched and contributed to three different aspects of DSM systems.
1. Internal DSM handling mechanisms
• We researched how these mechanisms work
• We studied in depth the functionality of the SEH service
• We detailed two major drawbacks of using SEH in DSM systems
• We presented two new techniques for internal DSM mechanisms that make these mechanisms more efficient and reliable
• We analyzed the performance implications of using these techniques in the Millipede 4.0 DSM system

Conclusions (cont.)
2. Thread migration facilities in DSM Systems
• We observed how the correct application of this facility can significantly increase the efficiency and reliability of the underlying DSM system
• We presented our design of a TM facility
• We discussed some correctness problems that a developer of a TM facility has to consider
• We investigated different approaches to implementing this facility in DSM systems
• We developed two new approaches: the stack on DSM approach and the default stack approach
• We presented two new techniques to overcome several technical difficulties in implementing these approaches
• We implemented our TM facility in the Millipede 4.0 DSM system, analyzed its performance, and compared it with other systems

Conclusions (cont.)
3. Load Balancing in DSM Systems
• We described several problems that occur as a result of poor distribution of application threads
• We reviewed well-known strategies for load balancing in distributed systems
• We presented a design of a load sharing mechanism for DSM systems
• This mechanism efficiently obtains thread communication patterns
• It tries to avoid load imbalance by optimizing sharing patterns and locality via thread migration
• We implemented the load sharing mechanism as a separate module in the Millipede 4.0 DSM system
• Finally, we analyzed the performance implications of this mechanism on several benchmark and microbenchmark applications