Porting NANOS on SDSM

Post on 25-Feb-2016


DESCRIPTION

Porting NANOS on SDSM. GOAL: Porting a shared memory environment to distributed memory. What is missing from current SDSMs? Christian Perez. Who am I? December 1999: PhD at LIP, ENS Lyon, France. Data-parallel languages, distributed memory, load balancing, preemptive thread migration.

TRANSCRIPT

Porting NANOS on SDSM

GOAL

Porting a shared memory environment to distributed memory. What is missing from current SDSMs?

Christian Perez

Who am I?

• December 1999: PhD at LIP, ENS Lyon, France
– Data-parallel languages, distributed memory, load balancing, preemptive thread migration

• Winter 1999/2000: TMR at UPC
– OpenMP, Nanos, SDSM

• October 2000: INRIA researcher
– Distributed programs, code coupling

Contents

• Motivation
• Related work
• Nanos execution model (NthLib)
• Nanos on top of 2 SDSMs (JIAJIA & DSM-PM2)
• Missing SDSM functionalities
• Conclusion

Motivation

• OpenMP: emerging standard
– simplicity (no data distribution)

• Clusters of machines (mono- or multiprocessor)
– excellent performance/price ratio

OpenMP on top of a cluster!

OpenMP / Cluster: how?

• OpenMP paradigm: shared memory
• Cluster paradigm: message passing
Use a software DSM system!

Hardware DSM system: SCI (write: 2 µs)
– specific hardware
– not yet stable

Related work

• Several OpenMP/DSM implementations
– OpenMP NOW!, Omni

• But,
– modification of OpenMP semantics
– one level of parallelism
– no exploitation of high-performance networks

OpenMP on classical DSM

• Compiler extracts shared data from the stack
– expensive local-variable creation
• shared memory allocation

• Modification of the OpenMP standard:
– default should be private instead of shared variables
– new synchronization primitives:
• condition variables & semaphores

OpenMP on classical DSM

• One level of parallelism (SPMD)

OpenMP source:

!$omp parallel do
do i = 1, 4
  x(i) = x(i) + x(i+1)
end do

Generated SPMD code:

call schedule(lb, ub, …)
do i = lb, ub
  x(i) = x(i) + x(i+1)
end do
call dsm_barrier()

Taken from pdplab.trc.rwcp.or.jp/pdperf/Omni/wgcc2k/

Omni compilation approach

Our goals

• Support the OpenMP standard
• High performance
• Allow exploitation of
– multithreading (SMP)
– high-performance networks

Nanos OpenMP compiler

• Converts an OpenMP program to a task graph
• Communication via shared memory

!$omp parallel do
do i = 1, 4
  x(i) = x(i) + x(i+1)
end do

Generated task graph: one task for i = 1,2 and one for i = 3,4.

NthLib runtime support

• The Nanos compiler generates intermediate code
• Communication still via shared memory

call nthf_depadd(…)
do nth_p = 1, proc
  nth = nthf_create_1s(…, f, …)
end do
call nth_block()

subroutine f(…)
  x(i) = x(i) + x(i+1)

NthLib details

• Assumes it runs on top of kernel threads
• Provides user-level threads (QT)
• Stack management (allocation)
• Stack initialization (arguments)
• Explicit context switch

NthLib queues

• Global/local queues
• Thread descriptor
– rich functionality
• Work descriptor
– high performance

NthLib: memory management

• Mutual exclusion: mmap allocation
• SLOT_SIZE stack alignment

[Slide figure: nano-thread descriptor layout with successors list, stack, and guard zone]

Porting NthLib to SDSM

• Data consistency
• Shared memory management
• Nanos threads
• JIAJIA implementation
• DSM-PM2 implementation
• Summary of DSM requirements

Data consistency

• Mutual exclusion for defined data structures: acquire/release
• User-level shared memory data: barrier


Shared memory management

• Asynchronous shared memory allocation
• Alignment parameter (> PAGE_SIZE)
• Global variables / common declarations

Not yet supported

Nano-threads

• Run-to-block execution model
• Shared stacks (father/sons relationship)
• Implicit thread migration (scheduler)

JIAJIA

• Developed in China by W. Hu, W. Shi & Z. Tang
• Public-domain DSM
• User-level DSM
• DSM: lock/unlock, barrier, condition variables
• MP: send/receive, broadcast, reduce
• Solaris, AIX, Irix, Linux, NT (not distributed)

JIAJIA: memory allocation

• No control of memory alignment (×2)
• Synchronous memory allocation primitive

Development of an RPC version
– based on the send/receive primitives
– addition of a user-level message handler

Problems
– global lock
– interference with JIAJIA blocking functions

JIAJIA: discussion

• Global barrier for data synchronization
– no multiple levels of parallelism
• Not thread-aware
– no efficient use of SMP nodes

DSM/PM2

• Developed at LIP by G. Antoniu (PhD student)
• Public domain
• User level, a module of PM2
• Generic and multi-protocol DSM
• DSM: lock/unlock
• MP: LRPC
• Linux, Solaris, Irix (32 bits)

PM2 organization

[Slide figure: the PM2 software stack. The DSM module and the Marcel thread library (mono, SMP, activation flavors) sit on the Madeleine communication layers (MAD1: TCP, PVM, MPI, SCI, VIA, SBP; MAD2: TCP, MPI, SCI, VIA, BIP), together with the PM2, TBX and NTBX toolboxes]

http://www.pm2.org

DSM/PM2: memory allocation

• Only static memory allocation

Build a dynamic memory allocation primitive
– centralized memory allocation
– LRPC to node 0
Integration of an alignment parameter

Summer 2000: dynamic memory allocation ready!

DSM/PM2: marcel descriptor

[Slide figure: each stack slot starts on a page boundary, with the marcel_t descriptor located from the stack pointer as (sp & MASK) + SLOT_SIZE]

NthLib requirement: one kernel thread runs many nano-threads

DSM/PM2: marcel descriptor

[Slide figure: with one extra indirection, only a marcel_t* is stored at the page boundary; the descriptor is reached as *((sp & MASK) + SLOT_SIZE) instead of (sp & MASK) + SLOT_SIZE]

DSM/PM2: discussion

• Using page-level sequential consistency
+ no need for barriers (multiple levels of parallelism)
– false sharing

Dedicated stack layout
[Slide figure: the stack slot is padded to a page boundary, and the marcel_t* is kept on its own page]

DSM/PM2: discussion (cont.)

• No alternate stack for the signal handler
Prefetch pages before a context switch: O(n)
Pad to the next page before opening parallelism

[Slide figure: shared data padded to start on its own page boundary]

DSM/PM2 improvements

• Availability of an asynchronous DSM malloc
• Lazy data consistency protocols under evaluation
– eager consistency, multiple writers
– scope consistency
• Support for stacks in shared memory (Linux)

DSM/PM2 shared stack support

[Slide animation: a nano-thread stack placed in shared memory; the marcel_t descriptor is still found via (sp & MASK) + SLOT_SIZE, and page faults are handled on a separate SEGV stack]


DSM requirements

• Support for static global shared variables
– efficient code: removes one level of indirection
– enables use of a classical compiler: support for common
• « Sharedization » of already allocated memory:

dsm_to_shared(void* p, size_t size);

• Support for multiple levels of parallelism
– partial barriers: group management
– dependency support: like acquire/release, but without a lock


[Slide figure: dependency support illustrated with start(1), stop(1), update(1,2), start(2), stop(2) primitives]

Summary of DSM requirements

• Support for static global shared variables
• « Sharedization » of already allocated memory
• Acquire/release primitives
• Partial barriers: group management
• Asynchronous shared memory allocation
• Alignment parameter for memory allocation
• Threads (SMP nodes)
• Optimized stack management

Conclusion

• Successfully ported Nanos to 2 DSMs: JIAJIA & DSM-PM2
• DSM requirements to obtain performance
– support the MIMD model
– automatic thread migration
• Performance?
