Threads and Multithreading
Antonio Cesarano, Bonaventura Del Monte
Università degli Studi di Salerno
7th April 2014
Operating Systems II
Agenda
• Introduction
• Thread models
• Multithreading: single-core vs multi-core
• Implementation
• A case study
• Conclusions
CPU Trends
Introduction: What's a Thread?
Memory: heavyweight processes vs lightweight threads
Introduction
Why should I care about threads?
Pros:
• Responsiveness
• Resource sharing
• Economy
• Scalability
Cons:
• Hard to implement
• Synchronization: critical sections, deadlock, livelock…
Introduction
Thread Models
Two kinds of threads:
• User threads
• Kernel threads
Thread Models: User-level Threads
Implemented in a software library: Pthreads, Win32 API
Pros:
• Easy handling
• Fast context switch
• Transparent to the OS
• No new address space, no need to change address space
Cons:
• Do not benefit from multithreading or multiprocessing
• One blocked thread blocks the whole process
Thread Models: Kernel-level Threads
Executed only in kernel mode, managed by the OS (children of kthreadd)
Pros:
• Resource aware
• No need for a new address space
• When a thread blocks, another can be scheduled
Con:
• Slower than user threads
Thread Models
Thread implementation models:
• Many-to-one
• One-to-one
• Many-to-many
Thread Models: Many-to-one
Whole process is blocked if one thread blocks
Useless on multicore architectures
Thread Models: One-to-one
Works fine on multicore architectures
Many kernel threads = high overhead
Thread Models: Many-to-many
Works fine on multicore architectures
Less overhead than the one-to-one model
Multithreading
Multitasking: single core vs Symmetric Multi-Processor
Multithreading and HyperThreading
How can we use multithreading architectures?
• Thread-Level Parallelism (TLP)
• Data-Level Parallelism (DLP)
Granularity: Coarse-grained Multithreading
• Context switch on high-latency events
• Very fast thread switching; no single thread is slowed down
• Loss of throughput due to short stalls: pipeline start-up
Granularity: Fine-grained Multithreading
• Context switch on every cycle
• Interleaved execution of multiple threads: it can hide both short and long stalls
• Rarely-stalling threads are slowed down
Context Switching: Single-core vs Multi-core
Xthread_ctxtswitch:
    pusha             ; push the old thread's context onto its stack
    movl esp, [eax]   ; save the old stack pointer into thread 1's TCB
    movl edx, esp     ; change the stack pointer to the one saved in thread 2's TCB
    popa              ; pop thread 2's old context off its stack
    ret               ; RET pops the return address and assigns its value to the PC register
(Figure: the CPU's ESP moving between the TCBs of thread 1, running → ready, and thread 2, ready → running, one instruction per step.)
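The same save/restore idea can be sketched portably with the (old, but illustrative) ucontext API: swapcontext() saves the current registers and stack pointer and loads another context's, just like the pusha/movl/popa/ret routine above. This is a minimal cooperative sketch, not the routine from the slide; the names worker and run_demo are ours.

```c
#include <ucontext.h>

/* Two contexts: the main thread of control and one worker. */
static ucontext_t main_ctx, worker_ctx;
static char worker_stack[64 * 1024];
static int step;

static void worker(void) {
    step = 1;                               /* first slice of the worker */
    swapcontext(&worker_ctx, &main_ctx);    /* yield: save our context, resume main */
    step = 3;                               /* second slice after being resumed */
}                                           /* returning follows uc_link back to main */

int run_demo(void) {
    step = 0;
    getcontext(&worker_ctx);
    worker_ctx.uc_stack.ss_sp = worker_stack;
    worker_ctx.uc_stack.ss_size = sizeof worker_stack;
    worker_ctx.uc_link = &main_ctx;         /* where to go when worker() returns */
    makecontext(&worker_ctx, worker, 0);

    swapcontext(&main_ctx, &worker_ctx);    /* run worker until it yields */
    step = 2;                               /* main runs in between */
    swapcontext(&main_ctx, &worker_ctx);    /* resume worker, which finishes */
    return step;
}
```

Each swapcontext() call plays the role of Xthread_ctxtswitch: the old context is saved in one ucontext_t (the "TCB") and the new one is loaded from another.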
Problems
• Critical section: code where a thread A accesses a shared variable simultaneously with a thread B
• Deadlock: a process A is waiting for a resource reserved by B, which is waiting for a resource reserved by A
• Race condition: the result of an execution depends on the order in which the threads execute
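A race condition on a shared counter, and the mutex that turns the increment into a proper critical section, can be sketched with Pthreads (the names counter, adder and run_counters are ours):

```c
#include <pthread.h>
#include <stddef.h>

static long counter;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread adds 100000 to the shared counter. Without the mutex,
   the read-modify-write of counter++ could interleave between threads
   and updates would be lost -- a race condition. */
static void *adder(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&counter_lock);    /* enter critical section */
        counter++;
        pthread_mutex_unlock(&counter_lock);  /* leave critical section */
    }
    return NULL;
}

long run_counters(int nthreads) {
    pthread_t t[16];
    counter = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, adder, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    return counter;    /* with the mutex: always nthreads * 100000 */
}
```

Locks must also be acquired in a consistent order everywhere, otherwise the deadlock scenario above (A waits for B's lock, B waits for A's) appears.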
More Issues
• fork() and exec() system calls: to duplicate or not to duplicate all threads?
• Signal handling in multithreaded applications
• Scheduler activations: kernel threads have to communicate with user threads, i.e. upcalls
• Thread cancellation: terminating a thread before it has completed
  • Deferred cancellation
  • Asynchronous cancellation: immediate
Designing a thread library
• Multiprocessor support
• Virtual processors
• Real-time support
• Memory management
• Provide a library of functions rather than a module
• Portability
• No kernel mode
Implementation
POSIX Threads: the POSIX standard for threads, IEEE POSIX 1003.1c
A library made up of a set of types and procedure calls written in C, for UNIX platforms
It supports:
a) Thread management
b) Mutexes
c) Condition variables
d) Synchronization between threads using R/W locks and barriers
Implementation
Thread Pool
• Several threads kept available in a pool
• When a task arrives, it is assigned to a free thread
• Once a thread completes its service, it returns to the pool and awaits more work
Implementation: Pthread library base operations
pthread_create() - create and launch a new thread
pthread_exit() - terminate the calling thread
pthread_attr_init() - set thread attributes to their default values
pthread_join() - the calling thread blocks and waits for another thread to finish
pthread_self() - retrieve the ID assigned to the calling thread
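These base operations fit together in a few lines: create a thread, let it hand back a result through pthread_exit, and join on it. The names square and run_square are ours, for illustration only.

```c
#include <pthread.h>
#include <stdint.h>

/* Worker: computes n*n and returns it through pthread_exit's void*
   channel (fine for small integers in this sketch). */
static void *square(void *arg) {
    intptr_t n = (intptr_t)arg;
    pthread_exit((void *)(n * n));   /* terminate the calling thread with a value */
}

long run_square(long n) {
    pthread_t tid;
    void *result;
    /* create and launch a new thread with default attributes */
    pthread_create(&tid, NULL, square, (void *)(intptr_t)n);
    /* the caller blocks and waits for the thread to finish */
    pthread_join(tid, &result);
    return (long)(intptr_t)result;
}
```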
Implementation Example
N x N Matrix Multiplication
Implementation Example
A simple algorithmfor (int i = 0; i < MATRIX_ELEMENTS; i += MATRIX_LINE){ for (int j = 0; j < MATRIX_LINE; ++j) {
float tmp = 0;for (int k = 0; k < MATRIX_LINE; k++){
tmp += A[i + k] * B[(MATRIX_LINE * k) + j];
}C[i + j] = tmp;
}}
Implementation Example
SIMD Approach:

transpose(B);
for (int i = 0; i < MATRIX_LINE; i++) {
    for (int j = 0; j < MATRIX_LINE; j++) {
        __m128 tmp = _mm_setzero_ps();
        for (int k = 0; k < MATRIX_LINE; k += 4) {
            tmp = _mm_add_ps(tmp, _mm_mul_ps(_mm_load_ps(&A[MATRIX_LINE * i + k]),
                                             _mm_load_ps(&B[MATRIX_LINE * j + k])));
        }
        tmp = _mm_hadd_ps(tmp, tmp);
        tmp = _mm_hadd_ps(tmp, tmp);
        _mm_store_ss(&C[MATRIX_LINE * i + j], tmp);
    }
}
transpose(B);
Implementation Example
TLP Approach:

struct thread_params {
    pthread_t id;
    float* a;
    float* b;
    float* c;
    int low;
    int high;
    bool flag;
};
………
int main(int argc, char** argv) {
    int ncores = sysconf(_SC_NPROCESSORS_ONLN);
    int stride = MATRIX_LINE / ncores;
    for (int j = 0; j < ncores; ++j) {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        thread_params* par = new thread_params;
        par->low = j * stride;
        par->high = j * stride + stride;
        par->a = A;
        par->b = B;
        par->c = C;
        pthread_create(&(par->id), &attr, runner, par);
        // set cpu affinity for thread
        // sched_setaffinity
    }
}
Implementation Example
TLP Approach (continued):

int main(int argc, char** argv) {
    ….
    int completed = 0;
    while (true) {
        if (completed >= ncores) break;
        completed = 0;
        usleep(100000);
        for (int j = 0; j < ncores; ++j) {
            if (p[j]->flag) completed++;
        }
    }
    ….
}

void runner(void* p) {
    thread_params* params = (thread_params*) p;
    int low = params->low;
    // unpack the other values
    for (int i = low; i < high; i++) {
        for (int j = 0; j < MATRIX_LINE; j++) {
            float tmp = 0;
            for (int k = 0; k < MATRIX_LINE; k++) {
                tmp += A[MATRIX_LINE * i + k] * B[(MATRIX_LINE * k) + j];
            }
            C[MATRIX_LINE * i + j] = tmp;
        }
    }
    params->flag = true;
    pthread_exit(0);
}
Implementation Performance
(Figure: bar chart of execution times for the Simple, SIMD, TLP and SIMD&TLP versions on 4 and 8 cores; scale 0–9000.)
A case study
Using threads in interactive systems
• Research by Xerox PARC, Palo Alto
• Analysis of two large interactive systems: Cedar and GVX
• Goals:
  i. identifying paradigms of thread usage
  ii. analysing the architecture of thread-based environments
  iii. pointing out the most important properties of an interactive system
A case study: Thread model
• Mesa language
• Multiple, lightweight, pre-emptively scheduled threads in a shared address space; threads may have different priorities
• FORK, JOIN, DETACH
• Support for condition variables and monitors: critical sections and mutexes
• Finer-grained locks: directly on data structures
A case study
Three types of threads:
1. Eternal: run forever, waiting on condition variables
2. Worker: perform some computation
3. Transient: short-lived threads, forked off by long-lived threads
A case study
Dynamic analysis
(Figure: bar chart comparing Cedar and GVX on number of idle threads, maximum fork rate and maximum number of threads; scale 0–45.)
Switching intervals: (130/sec, 270/sec) vs. (33/sec, 60/sec)
A case study
Paradigms of thread usage
• Defer work: forking to reduce latency (e.g. printing documents)
• Pumps or slack processes: components of a pipeline (e.g. preprocessing user input, requests to the X server)
• Sleepers and one-shots: wait for some event and then execute (e.g. blinking cursor, double click)
• Deadlock avoiders: avoid violating lock-order constraints (e.g. window repainting)
A case study
Paradigms of thread usage (continued)
• Task rejuvenation: recover a service from a bad state, either by forking a new thread or by reporting the error (e.g. avoiding fork overhead in Cedar's input event dispatcher)
• Serializers: a thread processing a queue (e.g. a window system with input events from many sources)
• Concurrency exploiters: for using multiple processors
• Encapsulated forks: a mix of the previous paradigms, for code modularity
A case study
Common Mistakes and Issues
• Timeout hacks to compensate for a missing NOTIFY
• IF instead of WHILE around condition-variable waits in monitors
• Handling resource consumption
• Slack processes may need the YieldButNotToMe hack
• Using libraries designed for a single thread in a multi-threaded environment: Xlib and XI
• Spurious lock conflicts
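The "IF instead of WHILE" mistake above can be illustrated with a Pthreads monitor-style wait (the names ready, wait_until_ready and signal_ready are ours):

```c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t c = PTHREAD_COND_INITIALIZER;
static int ready;

/* Correct monitor-style wait: re-check the predicate in a WHILE loop,
   because a waiter can wake spuriously, or another thread can consume
   the state between the notify and the wake-up. Replacing this while
   with an if is the classic bug named in the slide. */
void wait_until_ready(void) {
    pthread_mutex_lock(&m);
    while (!ready)
        pthread_cond_wait(&c, &m);   /* atomically unlocks m and sleeps */
    pthread_mutex_unlock(&m);
}

void signal_ready(void) {
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_signal(&c);         /* the NOTIFY */
    pthread_mutex_unlock(&m);
}
```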
A case study
Xerox scientists' conclusions
• Interesting difficulties were discovered both in the use and in the implementation of multi-threaded environments
• A starting point for new studies
Conclusion