Threads and Multithreading
Antonio Cesarano, Bonaventura Del Monte
Università degli Studi di Salerno
7th April 2014
Operating Systems II
Agenda
• Introduction
• Thread models
• Multithreading: single-core vs multi-core
• Implementation
• A case study
• Conclusions
CPU Trends
Introduction: What's a Thread?
Memory: heavyweight processes vs lightweight threads
Introduction
Why should I care about threads?
Pros:
• Responsiveness
• Resource sharing
• Economy
• Scalability
Cons:
• Hard to implement
• Synchronization: critical sections, deadlock, livelock…
Introduction
Thread Models
Two kinds of threads:
• User threads
• Kernel threads
Thread Models: User-level Threads
Implemented in a software library: Pthreads, Win32 API
Pros:
• Easy handling
• Fast context switch
• Transparent to the OS
• No new address space, no need to change address space
Cons:
• Do not benefit from multithreading or multiprocessing
• One blocked thread blocks the whole process
Thread Models: Kernel-level Threads
Executed only in kernel mode, managed by the OS (children of kthreadd)
Pros:
• Resource aware
• No need for a new address space
• When a thread blocks, another can be scheduled
Con:
• Slower than user threads
Thread Models
Thread implementation models:
• Many-to-one
• One-to-one
• Many-to-many
Thread Models: Many-to-one
Whole process is blocked if one thread blocks
Useless on multicore architectures
Thread Models: One-to-one
Works fine on multicore architectures
Many kernel threads = high overhead
Thread Models: Many-to-many
Works fine on multicore architectures
Less overhead than the one-to-one model
Multithreading
Multitasking: single core vs Symmetric Multi-Processor
Multithreading and HyperThreading
How can we use multithreading architectures?
• Thread-Level Parallelism (TLP)
• Data-Level Parallelism (DLP)
Granularity: Coarse-grained Multithreading
• Context switch on high-latency events
• Very fast thread switching; no single thread is slowed down
• Loss of throughput due to short stalls: pipeline start-up
Granularity: Fine-grained Multithreading
• Context switch on every cycle
• Interleaved execution of multiple threads: it can hide both short and long stalls
• Rarely-stalling threads are slowed down
Context Switching: Single-core vs Multi-core
Xthread_ctxtswitch:
    pusha             ; push the old thread's context onto its stack
    movl esp, [eax]   ; save the old stack pointer into thread 1's TCB
    movl edx, esp     ; change the stack pointer to the one saved in thread 2's TCB
    popa              ; pop thread 2's old context off its stack
    ret               ; RET pops the return address and assigns its value to the PC register
(Figure: the CPU's ESP moving between the TCBs of thread 1, running → ready, and thread 2, ready → running, one instruction per step.)
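The same save/restore idea can be sketched portably with the (old, but illustrative) ucontext API: swapcontext() saves the current registers and stack pointer and loads another context's, just like the pusha/movl/popa/ret routine above. This is a minimal cooperative sketch, not the routine from the slide; the names worker and run_demo are ours.

```c
#include <ucontext.h>

/* Two contexts: the main thread of control and one worker. */
static ucontext_t main_ctx, worker_ctx;
static char worker_stack[64 * 1024];
static int step;

static void worker(void) {
    step = 1;                               /* first slice of the worker */
    swapcontext(&worker_ctx, &main_ctx);    /* yield: save our context, resume main */
    step = 3;                               /* second slice after being resumed */
}                                           /* returning follows uc_link back to main */

int run_demo(void) {
    step = 0;
    getcontext(&worker_ctx);
    worker_ctx.uc_stack.ss_sp = worker_stack;
    worker_ctx.uc_stack.ss_size = sizeof worker_stack;
    worker_ctx.uc_link = &main_ctx;         /* where to go when worker() returns */
    makecontext(&worker_ctx, worker, 0);

    swapcontext(&main_ctx, &worker_ctx);    /* run worker until it yields */
    step = 2;                               /* main runs in between */
    swapcontext(&main_ctx, &worker_ctx);    /* resume worker, which finishes */
    return step;
}
```

Each swapcontext() call plays the role of Xthread_ctxtswitch: the old context is saved in one ucontext_t (the "TCB") and the new one is loaded from another.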
Problems
• Critical section: code where a thread A accesses a shared variable simultaneously with a thread B
• Deadlock: a process A is waiting for a resource reserved by B, which is waiting for a resource reserved by A
• Race condition: the result of an execution depends on the order in which the threads execute
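A race condition on a shared counter, and the mutex that turns the increment into a proper critical section, can be sketched with Pthreads (the names counter, adder and run_counters are ours):

```c
#include <pthread.h>
#include <stddef.h>

static long counter;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread adds 100000 to the shared counter. Without the mutex,
   the read-modify-write of counter++ could interleave between threads
   and updates would be lost -- a race condition. */
static void *adder(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&counter_lock);    /* enter critical section */
        counter++;
        pthread_mutex_unlock(&counter_lock);  /* leave critical section */
    }
    return NULL;
}

long run_counters(int nthreads) {
    pthread_t t[16];
    counter = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, adder, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    return counter;    /* with the mutex: always nthreads * 100000 */
}
```

Locks must also be acquired in a consistent order everywhere, otherwise the deadlock scenario above (A waits for B's lock, B waits for A's) appears.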
More Issues
• fork() and exec() system calls: to duplicate or not to duplicate all threads?
• Signal handling in multithreaded applications
• Scheduler activations: kernel threads have to communicate with user threads, i.e. upcalls
• Thread cancellation: terminating a thread before it has completed
  • Deferred cancellation
  • Asynchronous cancellation: immediate
Designing a thread library
• Multiprocessor support
• Virtual processors
• Real-time support
• Memory management
• Provide a library of functions rather than a module
• Portability
• No kernel mode
Implementation
POSIX Threads: the POSIX standard for threads, IEEE POSIX 1003.1c
A library made up of a set of types and procedure calls written in C, for UNIX platforms
It supports:
a) Thread management
b) Mutexes
c) Condition variables
d) Synchronization between threads using R/W locks and barriers
Implementation
Thread Pool
• Several threads kept available in a pool
• When a task arrives, it is assigned to a free thread
• Once a thread completes its service, it returns to the pool and awaits more work
Implementation: Pthread library base operations
pthread_create() - create and launch a new thread
pthread_exit() - terminate the calling thread
pthread_attr_init() - set thread attributes to their default values
pthread_join() - the calling thread blocks and waits for another thread to finish
pthread_self() - retrieve the ID assigned to the calling thread
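These base operations fit together in a few lines: create a thread, let it hand back a result through pthread_exit, and join on it. The names square and run_square are ours, for illustration only.

```c
#include <pthread.h>
#include <stdint.h>

/* Worker: computes n*n and returns it through pthread_exit's void*
   channel (fine for small integers in this sketch). */
static void *square(void *arg) {
    intptr_t n = (intptr_t)arg;
    pthread_exit((void *)(n * n));   /* terminate the calling thread with a value */
}

long run_square(long n) {
    pthread_t tid;
    void *result;
    /* create and launch a new thread with default attributes */
    pthread_create(&tid, NULL, square, (void *)(intptr_t)n);
    /* the caller blocks and waits for the thread to finish */
    pthread_join(tid, &result);
    return (long)(intptr_t)result;
}
```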
Implementation Example
N x N Matrix Multiplication
Implementation Example
A simple algorithmfor (int i = 0; i < MATRIX_ELEMENTS; i += MATRIX_LINE){ for (int j = 0; j < MATRIX_LINE; ++j) {
float tmp = 0;for (int k = 0; k < MATRIX_LINE; k++){
tmp += A[i + k] * B[(MATRIX_LINE * k) + j];
}C[i + j] = tmp;
}}
Implementation Example
SIMD Approach:

transpose(B);
for (int i = 0; i < MATRIX_LINE; i++) {
    for (int j = 0; j < MATRIX_LINE; j++) {
        __m128 tmp = _mm_setzero_ps();
        for (int k = 0; k < MATRIX_LINE; k += 4) {
            tmp = _mm_add_ps(tmp, _mm_mul_ps(_mm_load_ps(&A[MATRIX_LINE * i + k]),
                                             _mm_load_ps(&B[MATRIX_LINE * j + k])));
        }
        tmp = _mm_hadd_ps(tmp, tmp);
        tmp = _mm_hadd_ps(tmp, tmp);
        _mm_store_ss(&C[MATRIX_LINE * i + j], tmp);
    }
}
transpose(B);
Implementation Example
TLP Approach:

struct thread_params {
    pthread_t id;
    float* a;
    float* b;
    float* c;
    int low;
    int high;
    bool flag;
};
………
int main(int argc, char** argv) {
    int ncores = sysconf(_SC_NPROCESSORS_ONLN);
    int stride = MATRIX_LINE / ncores;
    for (int j = 0; j < ncores; ++j) {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        thread_params* par = new thread_params;
        par->low = j * stride;
        par->high = j * stride + stride;
        par->a = A;
        par->b = B;
        par->c = C;
        pthread_create(&(par->id), &attr, runner, par);
        // set cpu affinity for thread
        // sched_setaffinity
    }
}
Implementation Example
TLP Approach (continued):

int main(int argc, char** argv) {
    ….
    int completed = 0;
    while (true) {
        if (completed >= ncores) break;
        completed = 0;
        usleep(100000);
        for (int j = 0; j < ncores; ++j) {
            if (p[j]->flag) completed++;
        }
    }
    ….
}

void runner(void* p) {
    thread_params* params = (thread_params*) p;
    int low = params->low;
    // unpack the other values
    for (int i = low; i < high; i++) {
        for (int j = 0; j < MATRIX_LINE; j++) {
            float tmp = 0;
            for (int k = 0; k < MATRIX_LINE; k++) {
                tmp += A[MATRIX_LINE * i + k] * B[(MATRIX_LINE * k) + j];
            }
            C[MATRIX_LINE * i + j] = tmp;
        }
    }
    params->flag = true;
    pthread_exit(0);
}
Implementation Performance
(Figure: bar chart of execution times for the Simple, SIMD, TLP and SIMD&TLP versions on 4 and 8 cores; scale 0–9000.)
A case study
Using threads in interactive systems
• Research by Xerox PARC, Palo Alto
• Analysis of two large interactive systems: Cedar and GVX
• Goals:
  i. identifying paradigms of thread usage
  ii. analysing the architecture of thread-based environments
  iii. pointing out the most important properties of an interactive system
A case study: Thread model
• Mesa language
• Multiple, lightweight, pre-emptively scheduled threads in a shared address space; threads may have different priorities
• FORK, JOIN, DETACH
• Support for condition variables and monitors: critical sections and mutexes
• Finer-grained locks: directly on data structures
A case study
Three types of threads:
1. Eternal: run forever, waiting on condition variables
2. Worker: perform some computation
3. Transient: short-lived threads, forked off by long-lived threads
A case study
Dynamic analysis
(Figure: bar chart comparing Cedar and GVX on number of idle threads, maximum fork rate and maximum number of threads; scale 0–45.)
Switching intervals: (130/sec, 270/sec) vs. (33/sec, 60/sec)
A case study
Paradigms of thread usage
• Defer work: forking to reduce latency (e.g. printing documents)
• Pumps or slack processes: components of a pipeline (e.g. preprocessing user input, requests to the X server)
• Sleepers and one-shots: wait for some event and then execute (e.g. blinking cursor, double click)
• Deadlock avoiders: avoid violating lock-order constraints (e.g. window repainting)
A case study
Paradigms of thread usage (continued)
• Task rejuvenation: recover a service from a bad state, either by forking a new thread or by reporting the error (e.g. avoiding fork overhead in Cedar's input event dispatcher)
• Serializers: a thread processing a queue (e.g. a window system with input events from many sources)
• Concurrency exploiters: for using multiple processors
• Encapsulated forks: a mix of the previous paradigms, for code modularity
A case study
Common Mistakes and Issues
• Timeout hacks to compensate for a missing NOTIFY
• IF instead of WHILE around condition-variable waits in monitors
• Handling resource consumption
• Slack processes may need the YieldButNotToMe hack
• Using libraries designed for a single thread in a multi-threaded environment: Xlib and XI
• Spurious lock conflicts
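The "IF instead of WHILE" mistake above can be illustrated with a Pthreads monitor-style wait (the names ready, wait_until_ready and signal_ready are ours):

```c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t c = PTHREAD_COND_INITIALIZER;
static int ready;

/* Correct monitor-style wait: re-check the predicate in a WHILE loop,
   because a waiter can wake spuriously, or another thread can consume
   the state between the notify and the wake-up. Replacing this while
   with an if is the classic bug named in the slide. */
void wait_until_ready(void) {
    pthread_mutex_lock(&m);
    while (!ready)
        pthread_cond_wait(&c, &m);   /* atomically unlocks m and sleeps */
    pthread_mutex_unlock(&m);
}

void signal_ready(void) {
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_signal(&c);         /* the NOTIFY */
    pthread_mutex_unlock(&m);
}
```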
A case study
Xerox scientists' conclusions
• Interesting difficulties were discovered both in the use and in the implementation of multi-threaded environments
• A starting point for new studies
Conclusion