

Communicating Process Architectures 2015
P.H. Welch et al. (Eds.)
Open Channel Publishing Ltd., 2015
© 2015 The authors and Open Channel Publishing Ltd. All rights reserved.


CoCoL: Concurrent Communications Library

Kenneth SKOVHEDE¹ and Brian VINTER

Niels Bohr Institute, University of Copenhagen

Abstract. In this paper we examine a new CSP inspired library for the Common Intermediate Language, dubbed CoCoL: Concurrent Communications Library. The use of the Common Intermediate Language makes the library accessible from a number of languages, including C#, F#, Visual Basic and IronPython. The processes are based on tasks and continuation callbacks, rather than threads, which enables networks with millions of running processes on a single machine. The channels are based on request queues with two-phase commit tickets, which enables external choice without coordination among channels. We evaluate the performance of the library on different operating systems, and compare the performance with JCSP and C++CSP.

Keywords. CSP, concurrent programming, process oriented programming, C#, .Net, Common Intermediate Language

Introduction

Since C. A. Hoare introduced the CSP algebra [1], a large number of implementations have appeared, of which the occam family [2,3] and later JCSP [4] have attracted the most attention. Where the occam family presents the user with a new language, designed to give easy access to CSP features, the JCSP approach is to use the Java language and environment and add CSP functionality.

Introducing a new language for CSP has some obvious benefits, such as natural constructs for expressing processes, external choice, side-effect-free guarantees, and other central CSP elements. When introducing CSP elements into an existing language, the features provided by the language limit the design freedom.

On the other hand, when adding CSP features to an existing language, the CSP implementation can leverage the existing user base, rather than require newcomers to learn a new syntax and semantics. With an existing language there is usually also an existing ecosystem with toolchains, support libraries, etc. Another important benefit of adding CSP support to an existing language is that the user can choose a mixed approach, where only parts of the program use CSP constructs, and other parts use the native language approach, for example functional or object oriented.

With the Concurrent Communications Library (CoCoL) we choose the latter approach: implementing CSP functionality as a library for the Common Intermediate Language (CIL).

In CoCoL, we have experimented with implementing a communication oriented programming paradigm for the CIL languages, hosted entirely within the Common Language Runtime (CLR), and leveraging features of the CIL languages. With this paper we present the design considerations and measure the achieved performance compared to a number of related libraries.

¹Corresponding Author: Kenneth Skovhede, Niels Bohr Institute, Blegdamsvej 17, DK-2100 Copenhagen OE. Tel.: +45 35325209; E-mail: [email protected].

CPA 2015 preprint – the proceedings version will have other page numbers and may have minor differences.


Through the use of CoCoL, it is possible to use a CSP-like design approach from any of the languages supported by the CIL, including C#, F# and VisualBasic. This enables existing users of a number of languages to apply CSP design principles without learning a new language.

1. Background

As CoCoL relies heavily on features found in the C# language and runtime, this section provides an overview of some of the components in that environment. This section is by no means an exhaustive listing of all features, but seeks to provide the foundation for understanding the implementation of CoCoL.

1.1. CLI Terminology

The Common Language Infrastructure (CLI) is a specification [5] for a runtime environment, which comprises the Common Language Runtime (CLR), the Common Type System (CTS), the Common Metadata format, and the Common Intermediate Language (CIL).

CIL is an assembly-like bytecode, comparable to Java bytecode, but with the difference that CIL is designed to support a number of different languages. CIL is executed by the CLR, which can be compared to the Java Virtual Machine (JVM). Just as Java source code is compiled into Java bytecode and executed by the JVM, languages such as C#, F# and VisualBasic are compiled into CIL and executed by the CLR. The JVM and CLR environments also share other traits, such as being based on JIT compilation, having garbage collected memory, and differentiating between value-based and reference-based types [6,7].

Any language that compiles into CIL should also follow the Common Language Specification (CLS), which describes the rules a compatible language should observe. If a language uses the CTS and honors the CLS, any other language in the CLI can use compiled methods and types from that language, and vice versa. This interoperability feature is used in the CLI to provide a set of Standard Libraries, which provide common functionality, such as file access, network access, and XML, to all languages.

The most prominent implementation of the CLI is the Microsoft .Net Runtime, which is available only on Windows. The open source Mono [8] implementation is feature complete in terms of the CLI, and available on all major platforms, but does not implement all of the support libraries shipped with the .Net implementation.

1.2. Generics

In a strongly typed language, a method that needs to operate on any type of data often uses the common Object type. As an example, a dynamic list (e.g. an ArrayList) can contain all types of data, by forcing the caller to convert, or cast, the data to the Object type before storing it, and then reversing the operation when retrieving it. This allows for a single implementation of a dynamic list, but comes with a large overhead when the data consists of primitives, such as integers, because these need to be boxed, that is, encapsulated in heap allocated objects that later need to be de-allocated. This issue has compelled both Java and CIL to introduce generics, which can be considered a type-safe kind of templates.

Since CIL was introduced later than Java, it has less legacy code to support. This prompted Microsoft to introduce a compatibility breaking change in the type system, which allows for the use of type-safe generic types [9]. Java has chosen the type erasure approach, which gives full backward compatibility, but completely removes the generic type information from the resulting bytecode [10]. In contrast, CIL introduced generics into the type system, so the type information is available at runtime. The CIL runtime system uses the JIT compiler to generate typed classes based on the actual type. This allows the runtime to completely avoid casting into the common object super-type. Listing 1 shows an example of a generic method, which returns the first element of an array. In Java, no type information is preserved, so it would be required to use Object in place of T, and then cast the input. In the CIL version, the type information is present, so it can be transformed into the version shown in Listing 2, with no casting required. In a C++ template setting, the transformation would be performed at compile time, so a C++ binary must include the transformed versions for all used types. In CIL this is not required, because the expansion happens at runtime when the method is JIT compiled, which allows a library to export a function that takes any type as input.

T First<T>(T[] data) {
    if (data == null)
        throw new NullException();
    else if (data.Length == 0)
        return default(T);
    else
        return data[0];
}

Listing 1: Generic function example

int First'(int[] data) {
    if (data == null)
        throw new NullException();
    else if (data.Length == 0)
        return 0;
    else
        return data[0];
}

Listing 2: Instantiated generic function
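As a point of comparison, the erasure contrast described above can be observed directly in Java, where the generic type parameter is gone at runtime. The following is an illustrative sketch of ours (the first method mirrors Listing 1; it is not CoCoL code): both a List<String> and a List<Integer> share one runtime class, so no per-type instantiation like Listing 2 is possible.

```java
import java.util.ArrayList;
import java.util.List;

public class ErasureDemo {
    // After erasure this compiles to a method over Object[], so the
    // element type T is not recoverable at runtime.
    static <T> T first(T[] data) {
        if (data == null)
            throw new NullPointerException();
        return data.length == 0 ? null : data[0];
    }

    public static void main(String[] args) {
        List<String> strings = new ArrayList<>();
        List<Integer> ints = new ArrayList<>();
        // Both lists share the same runtime class: the type parameter is erased.
        System.out.println(strings.getClass() == ints.getClass()); // true
        System.out.println(first(new String[] { "a", "b" }));      // a
    }
}
```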

1.3. Delegates and Lambda Functions

What is known as higher-order functions or function pointers is implemented in CIL using a delegate, which encapsulates the context of the caller, that is, the value of this [9]. When the delegate is created, the this context and target method are stored in a lookup table, and a pointer to this memory area is returned¹. When the delegate is invoked, the callback function is invoked in the correct context. This makes it possible to create a callback method that automatically carries state, and can invoke a method on a specific object instance.

T First<T>(T[] data, Func<T, bool> test) {
    if (data == null)
        throw new NullException();

    foreach (var n in data)
        if (test(n))
            return n;

    return default(T);
}

Listing 3: Generic function with delegate

var data = new int[] { 1, 2, 3 };
var n = 0;

var oneplus = First(data, x => {
    if (x > 0)
        n++;
    return n >= 2;
});

Listing 4: Example use of a lambda method

A related technique is anonymous methods, also known as lambda methods, which are methods that can only be referenced by their handle (i.e. they have no name). In C# the keyword => creates a lambda function, which can be combined with generics to implement common methods. As an example, consider the generic method in Listing 3, which accepts a delegate called test. In Listing 4, this method is used to pick the second positive integer in an array. Note that the variable n is declared outside the lambda function, but is still accessible from within, and can be used to keep state inside the lambda function. This reveals that while the lambda method looks like a delegate, it is in fact more complicated, as it needs to create a closure object that captures all accessible variables. The this context is used with the delegate to invoke the closure instance and allow the delegate code to access the variables inside the scope.

¹Here, delegates refer to the pure function pointer-like feature in C#, not the delegate in F#, which is equivalent to a lambda closure in C#.
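For comparison, a similar closure-carrying callback can be sketched in Java, where captured locals must be effectively final, so mutable state is held in a small container object; this is our illustrative analogue of Listings 3 and 4, not CoCoL code.

```java
import java.util.function.Predicate;

public class ClosureDemo {
    // Analogue of Listing 3: a generic method taking a delegate-like callback.
    static <T> T first(T[] data, Predicate<T> test) {
        for (T n : data)
            if (test.test(n))
                return n;
        return null;
    }

    public static void main(String[] args) {
        // Java captured locals must be effectively final, so mutable state
        // lives in a one-element array; the lambda becomes a closure object.
        int[] n = { 0 };
        Integer secondPositive = first(new Integer[] { 0, 1, 2, 3 }, x -> {
            if (x > 0)
                n[0]++;
            return n[0] >= 2;
        });
        System.out.println(secondPositive); // 2
    }
}
```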

1.4. Continuations

Another use of lambda methods is to provide a callback method for long-running operations. This kind of callback is commonly referred to as a continuation, as the code "continues" inside the callback method. Continuations can be considered similar to an event-based model, where the event "fires" once the long-running operation has completed.

Without callbacks, a multithreading approach would need to introduce a worker thread and handle communication with locks and monitors, which is known to be error prone and difficult for novice programmers [11,12].

void Example() {
    var a = LoadUrl();
    var b = Download(a);
    file.Write(a, b);
}

Listing 5: Sequential code

void Example() {
    LoadUrl((a) => {
        Download(a, (b) => {
            file.Write(a, b);
        });
    });
}

Listing 6: Continuation with lambda functions

async void Example() {
    var a = await LoadUrl();
    var b = await Download(a);
    await file.Write(a, b);
}

Listing 7: Finite state machine with await

class State {
    object a, b;
    int state = 0;

    public void Next() {
        switch (this.state) {
            case 0:
                this.state = 1;
                LoadUrl(this.SetA);
                break;
            case 1:
                this.state = 2;
                Download(this.a, this.SetB);
                break;
            case 2:
                file.Write(this.a, this.b);
                break;
        }
    }

    void SetA(object arg) {
        this.a = arg;
        this.Next();
    }
    void SetB(object arg) {
        this.b = arg;
        this.Next();
    }
}

void Example() {
    new State().Next();
}

Listing 8: Callbacks with a finite state machine

A simple program that loads a URL from (slow) storage, and then downloads the content from the (slow) network can be written sequentially, as in Listing 5. Such a sequential program has the benefit that the flow is easy to follow, as each line is executed in full before advancing to the next line, and local variables store the program state. With the use of lambda functions, a similar program can be written in a continuation style, shown in Listing 6. This approach means that the initiating thread is not blocked during the long-running operations, but it has the drawback that the program flow becomes harder to follow, especially if one of the methods needs to throw an exception. The continuation approach also complicates the storage of state data (i.e. "a" and "b"), although the compiler handles this automatically.

A more structured approach is to use a finite state machine, as shown in Listing 8. From the number of lines alone, it is clear that this approach is less intuitive and harder to use. But it does define a strict program flow, which allows error handling to be introduced.
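The contrast between the sequential and continuation styles has a close analogue on the JVM, where java.util.concurrent.CompletableFuture plays a role similar to Task. The sketch below chains continuations without blocking the caller; loadUrl and download are hypothetical stand-ins of ours for the slow operations in the listings, not CoCoL or JCSP code.

```java
import java.util.concurrent.CompletableFuture;

public class ContinuationDemo {
    // Hypothetical stand-ins for the slow operations in the listings.
    static CompletableFuture<String> loadUrl() {
        return CompletableFuture.supplyAsync(() -> "http://example.org");
    }
    static CompletableFuture<String> download(String url) {
        return CompletableFuture.supplyAsync(() -> "content of " + url);
    }

    public static void main(String[] args) {
        // Chained continuations: each callback runs when the previous
        // long-running step completes, without blocking the caller.
        String result = loadUrl()
            .thenCompose(a -> download(a).thenApply(b -> a + " -> " + b))
            .join();
        System.out.println(result);
    }
}
```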

Fortunately, .Net 4.5 features two new keywords: async and await [13]. The async keyword is simply used for ensuring backwards compatibility with older code, where await was not a keyword and thus could be used as a variable name. By adding the async modifier to a function declaration, the compiler interprets await as a keyword. The await keyword is the major change: it automatically transforms the function into a finite state machine, and captures all variables in the scope into a new object instance. Each use of the await statement corresponds to a state in the state machine, and the callback will point to the state object, such that once the operation completes, it will advance the state.

This rewrite is performed solely at compile time, and is thus as efficient as if it was written by hand. This allows a function that needs to use long-running calls to be rewritten as a continuation based method with only the addition of the async and await keywords, as illustrated in Listing 7. This does not solve the inherent problems found in concurrent programming, but fortunately these can be handled by a CSP-like channel approach!

1.5. Tasks

To further simplify the use of the await statement, CIL introduces a common Task class that can be called with await. Any method can thus signal that it is running asynchronously by returning a Task object, and the caller can use the await keyword, or call the Wait method on the Task to suspend the thread until completion. If the Wait method is called, the execution becomes sequential like that in Listing 5. If the await keyword is used instead, the program is written as shown in Listing 7, with the compiler automatically implementing it as shown in Listing 8.

A number of simple helper methods are also found in the Task class, such as WhenAll and WhenAny, which return a Task representing some combination of other Tasks.

The Task class resembles the promise or future idea found in NodeJS [14], Smalltalk [15] and C++11 [16], among others. For Java, the java.util.concurrent.Future [17] class is similar.
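For comparison, CompletableFuture in Java offers allOf and anyOf combinators that correspond roughly to WhenAll and WhenAny; the following is a minimal sketch of ours, not an API from the paper.

```java
import java.util.concurrent.CompletableFuture;

public class CombinatorDemo {
    // allOf corresponds to WhenAll: the returned future completes
    // only when every input task has completed.
    static int whenAllSum(int x, int y) {
        CompletableFuture<Integer> a = CompletableFuture.supplyAsync(() -> x);
        CompletableFuture<Integer> b = CompletableFuture.supplyAsync(() -> y);
        CompletableFuture.allOf(a, b).join();
        return a.join() + b.join();
    }

    // anyOf corresponds to WhenAny: the returned future completes
    // with the result of the first task to finish.
    static Object whenAnyValue(CompletableFuture<?>... tasks) {
        return CompletableFuture.anyOf(tasks).join();
    }

    public static void main(String[] args) {
        System.out.println(whenAllSum(1, 2)); // 3
        System.out.println(whenAnyValue(CompletableFuture.completedFuture("first")));
    }
}
```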

2. Implementation

The overall design approach has been to make the library API as simple as possible, and use as many existing language and runtime features as possible. This resulted in a library that contains only a single implementation of a channel, and no implementation of a process. The channel is a generic type, such that it can be typed to transfer a specific type of data, for example an integer.

2.1. Channels

The channel implementation supports multiple readers and multiple writers, that is, it is an any-to-any channel. Any communication on the channel is ordered, such that the first registered reader is guaranteed to read the first value written, and the same applies to the writers. Each communication is also atomic, meaning that a communication will always notify both the reader and the writer of the communication, or do nothing.

The source code for the channel implementation is 300 source lines, and the entire CoCoL library is implemented in less than 1500 lines [18].

A key feature of the channels is that they are based on continuations; that is, rather than block the caller until the operation has completed, they return a Task (see sections 1.4 and 1.5). Internally, the returned Tasks are stored in queues, to ensure ordered responses. Since the queues will have exactly one entry if the channels are used in a one-to-one manner, there would be very little gained in implementing specialized versions of the channels.

A typical machine cannot run more than a few thousand threads, due to the memory required for each thread stack. This is normally not a prohibitive limit, as it is far more threads than there are physical execution units in the system. But in a CSP context, it is common to create a large number of processes which stay inactive for long periods of time.

By mapping each process to a Task instead of a thread, processes are stored as callback methods with no stack. As mentioned in section 1.4, the state machine encapsulates any local variables that might otherwise be stored on the stack. This makes it possible to create millions of processes with a moderate amount of memory.

The Task approach also allows the runtime system to choose how many concurrent threads it will use to run the tasks. The default implementation leaves this decision to the ThreadPool, which automatically adjusts the number of active threads, based on the number of queued Tasks.
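The stackless nature of continuations can be illustrated on the JVM as well: the sketch below registers a million pending continuations on a single future, something that would be impossible with a thread per process. This is our illustration of the general technique, not CoCoL code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class ManyTasksDemo {
    // Registers `count` pending continuations on one future; each is stored
    // as a callback object, not a thread, so no per-process stack is needed.
    static int runMany(int count, int input) {
        CompletableFuture<Integer> start = new CompletableFuture<>();
        List<CompletableFuture<Integer>> tasks = new ArrayList<>(count);
        for (int i = 0; i < count; i++)
            tasks.add(start.thenApply(x -> x + 1));
        start.complete(input); // one event wakes every suspended continuation
        return tasks.get(count - 1).join();
    }

    public static void main(String[] args) {
        System.out.println(runMany(1_000_000, 41)); // 42
    }
}
```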

A related approach to ensuring that processes waiting for communication do not require a stack can be found in the ProcessJ language, which can also be used to increase the number of processes in Java [19]. The project is similar, in that it provides an environment which allows millions of processes. The difference is that the ProcessJ approach relies on a custom compiler and a custom language, whereas CoCoL is implemented with features already in the CLR, and supports existing languages.

2.2. Integration With CIL

By using the built-in Task system found in the .Net 4.5 library it is possible to mix channel communications with other kinds of blocking operations. It is also possible to use the utility methods that operate on Task objects, notably the ability to wait for multiple Tasks to complete.

async Task ParDelta() {
    // Omitted channel declarations
    while (true) {
        var data = await in.ReadAsync();
        await Task.WhenAll(
            outA.WriteAsync(data),
            outB.WriteAsync(data)
        );
    }
}

Listing 9: Parallel Delta function

async Task SeqDelta() {
    // Omitted channel declarations
    while (true) {
        var data = await in.ReadAsync();

        await outA.WriteAsync(data);
        await outB.WriteAsync(data);
    }
}

Listing 10: Sequential Delta function

As an example it is possible to write the classic CSP Delta process as the method shown in Listing 9, which reads an input channel and copies the value to two or more output channels. Note that the implementation in Listing 9 awaits both write operations in parallel through the WhenAll helper, and thus writes to both output channels in any order, making it a Parallel Delta. If a sequential delta is required, it can be written as shown in Listing 10, where the state machine will ensure that channel "A" is written before channel "B". The parallel version uses a counter internally to wait for all the writes to complete. Our measurements show that this extra overhead is optimized away by the JIT compiler.

2.3. Additional Channel Features

CoCoL supports two complementary methods for obtaining a channel; both are found in the ChannelManager factory class. An anonymous channel can be created with a call to CreateChannel, which will simply create a new channel instance that can be passed around. The method GetChannel takes a channel name as an argument, and will create a channel if no existing channel has that name; otherwise it will return the existing channel with that name.

Other than the creation logic, there is no difference between the channels. If the channels are shared in a complex call hierarchy with no easy way to pass the channel instance, the named approach may simplify this, but comes at the cost of managing a global (channel) namespace.

The channels default to being un-buffered, but can optionally be created as buffered implementations. In a buffered setup, a number of writes are allowed without a designated reader. When reading from a buffered channel, a buffered write is taken from the queue, and if there are pending writers, the next writer is allowed to write.
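The buffered-channel behaviour described above resembles a bounded blocking queue. A rough Java sketch of the semantics (our illustration, not the CoCoL implementation):

```java
import java.util.concurrent.ArrayBlockingQueue;

public class BufferedChannelDemo {
    public static void main(String[] args) throws InterruptedException {
        // A buffer of two: two writes complete with no reader present.
        ArrayBlockingQueue<Integer> channel = new ArrayBlockingQueue<>(2);
        channel.put(1);
        channel.put(2);
        // A third write cannot proceed while the buffer is full.
        System.out.println(channel.offer(3)); // false
        // Reads are ordered: the first buffered write is taken first,
        // after which the pending writer may proceed.
        System.out.println(channel.take());   // 1
        System.out.println(channel.offer(3)); // true
    }
}
```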

Additionally, the channel implementation also supports poison logic, which is named Retire in CoCoL. A retired channel will return an exception for all non-buffered operations happening after Retire() has been called.

To support a mixed-mode program, where only parts of the code are written to utilize the concurrent model, a number of support methods are also present. One of the support methods is the blocking mode extension that simply blocks on each task.

2.4. Processes

As mentioned, there is no explicit support for processes in CoCoL. Instead, the user can start a process in a number of ways. One way is to simply start a thread and run the process as a normal thread. Another method is to create a class that implements the IProcess or IAsyncProcess interface, and start it using the CoCoL loader methods, and yet another way is to simply run an asynchronous function. These different approaches are shown in Listings 11, 12, 13, and 14. Once a process is running, it can simply exit the function to stop running. A thread can be joined, but if the process was started through the loader system, the caller does not know when it stopped. This can be remedied by a signal, such as a channel, or any other inter-process communication method. If the process was started as shown in Listing 12, the returned Task object can be (a-)waited upon.

In JCSP, the user would typically create instances of the different processes, and then execute them explicitly in either sequence or parallel. With CoCoL and asynchronous programming, the processes are started simply by calling a function, thus implicitly running all processes in parallel, but allowing the user to wait for completion.


void Run(object data) {
    while (true) {
        out.Write(
            in.Read()
        );
    }
}

new Thread(Run).Start();

Listing 11: Thread process

async Task Run() {
    while (true) {
        await out.WriteAsync(
            await in.ReadAsync()
        );
    }
}

Run();

Listing 12: Asynchronous process

class Identity : IProcess {
    public void Run() {
        while (true) {
            out.Write(
                in.Read()
            );
        }
    }
}

CoCoL.Loader.StartFromTypes(typeof(Identity));

Listing 13: IProcess process

class Identity : IAsyncProcess {
    public async Task RunAsync() {
        while (true) {
            await out.WriteAsync(
                await in.ReadAsync()
            );
        }
    }
}

CoCoL.Loader.StartFromTypes(typeof(Identity));

Listing 14: IAsyncProcess process

2.5. Alternation With Two-Phase Commit

In CSP there is a construct known as external choice or alternation, which is used to choose between multiple available channels in a race-free manner. In JCSP and C++CSP this is implemented with channel guards being passed to a method that chooses which channel to use.

bool Offer(object caller)
{
    Monitor.Enter(m_lock);

    // Return and keep lock
    if (!m_taken)
        return true;

    Monitor.Exit(m_lock);
    return false;
}

void Commit()
{
    m_taken = true;
    Monitor.Exit(m_lock);
}

void Withdraw()
{
    Monitor.Exit(m_lock);
}

Listing 15: Basic implementation of a two-phase-commit that allows a single operation

To keep CoCoL more in line with existing CIL terminology, the operations that perform external choice are called ReadFromAny and WriteToAny. Rather than implement guards for the channels, the functions simply take a list, or array, of channels. To implement the skip and timeout guards, the methods take a timeout argument, which can be either Timeout.Immediate for skip or any positive TimeSpan value for timeout. For ease of use, the default timeout is set to Timeout.Infinite, causing the operations to wait forever.

Task Read(TwoPhaseCommit reader_commit)
{
    var reader_task = new Task();
    while (!writer_queue.Empty()) {
        var (writer_task, writer_commit) = writer_queue.PeekHead();

        if (writer_commit.Offer()) {
            // Writer agreed, check reader
            if (reader_commit.Offer()) {
                // Agreement, commit to communication
                reader_commit.Commit();
                writer_commit.Commit();

                // Remove pending writer
                writer_queue.Dequeue();

                // Exchange value
                reader_task.SetValue(writer_task.GetValue());

                // Schedule callbacks
                reader_task.SignalReady();
                writer_task.SignalReady();

                // Communication complete
                return reader_task;
            } else {
                // Reader declined, notify writer
                writer_commit.Withdraw();

                // Reader declined, so we stop trying
                return reader_task;
            }
        } else {
            // Writer is no longer available,
            // so we remove it
            writer_queue.Dequeue();
        }
    }

    // No matching writer, suspend reader
    // Note that this can cause "littering"
    reader_queue.Enqueue(reader_task, reader_commit);
    return reader_task;
}

Listing 16: Pseudo code for a channel using two-phase logic

Inside the ReadFromAny and WriteToAny methods, the selection is performed by creating a two-phase-commit object and, in turn, passing it to each channel. This will register a pending read or write on all the channels in the list, which is the intent, or voting, phase of the two-phase commit protocol [20]. Once a channel has a matching operation, it will invoke the Offer method on the two-phase-commit object passed by both the read and the write end. If one or both sides decline the offer (i.e. they have already communicated elsewhere), the declined requests are removed from the channel, and the Withdraw method is invoked. If both sides agree to take the offer, the Commit method is called and the channel enqueues both the reader and the writer callback methods.

The implementation of the TwoPhaseCommit object is thus very simple: a call to Offer returns false if a communication has already completed; otherwise a lock is acquired and true is returned. The Withdraw method releases the lock, and the Commit method marks the instance completed and releases the lock, thus allowing only a single communication to succeed. A simplified version of the code in a two-phase-commit object is shown in Listing 15. The simplified Read function shown in Listing 16 illustrates that each channel end relies solely on the two-phase-commit object for coordination, and thus each channel works the same way, regardless of how many channels the read and write are registered with.
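To make the protocol concrete, a minimal Java rendition of the two-phase-commit object in Listing 15 might look as follows; offer holds the lock across the voting phase, and commit/withdraw release it. This is our sketch under those assumptions, not the CoCoL source.

```java
import java.util.concurrent.locks.ReentrantLock;

public class TwoPhaseCommit {
    private final ReentrantLock lock = new ReentrantLock();
    private boolean taken = false;

    // Voting phase: succeeds at most once; on success the lock is held
    // until the caller invokes commit() or withdraw().
    public boolean offer() {
        lock.lock();
        if (!taken)
            return true;       // keep the lock while the offer is pending
        lock.unlock();
        return false;
    }

    // Mark the instance completed, so every later offer() is declined.
    public void commit() {
        taken = true;
        lock.unlock();
    }

    // Decline: release the lock without completing.
    public void withdraw() {
        lock.unlock();
    }

    public static void main(String[] args) {
        TwoPhaseCommit c = new TwoPhaseCommit();
        System.out.println(c.offer()); // true: first offer wins
        c.commit();
        System.out.println(c.offer()); // false: already communicated
    }
}
```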

This approach scales to a large number of channels, because the channels themselves do not participate in the communication but rely on the TwoPhaseCommit object to handle the choice. This also allows custom versions of external choice, should it be desired. A drawback to this method is that there could potentially be many unused communication offers in the channels, which will grow if the channel is repeatedly being used as part of a set, but never succeeds. This can be fixed by pruning items in the queues inside the channels if the queues exceed a threshold.

2.6. Alternation

As described above, the two-phase commit implementation in Listing 15 allows a single communication to continue. This makes it very simple to perform an external choice on any number of channels, by simply handing the same two-phase-commit instance to each channel.

The CSP priority alternation scheme can be implemented simply by registering the operation on each channel in the desired order. The first channel that is able to complete the communication will trigger the two-phase-commit instance, as illustrated in Listing 16. The random alternation scheme can be implemented in much the same way, by shuffling the channel list, and then using the priority alternation method.

The fair alternation scheme requires that each channel receive an equal amount of communication. Another way of expressing this is to say that the channel priority is ordered, so that the least communicating channel has the highest priority. This way of expressing the rules for fair alternation transforms the problem into a question of sorting the channels.

A simple approach to sorting the channels is to keep a counter for each channel, and then simply sort the list of channels by counter value in increasing order. While this works, a full sort is not very efficient, because each change to a usage counter is a simple increment, which leaves the list almost sorted.

For an efficient solution, we have identified two different usage scenarios, shown in Figure 2 and Figure 1. In the first scenario, some channels communicate very often and others less frequently; in the latter, all channels communicate an equal amount.

From Figure 2 we can see that this sorted list can become an “almost-sorted list” if the first channel with usage count 66 is used again. If any of the other channels are used, the list remains sorted. For such a scenario, the bubble sort algorithm is very efficient, as it can terminate early, minimizing the number of swaps and compares.

In the scenario shown in Figure 1, a bubble sort could also be fairly efficient. However, we consider this scenario to be the most likely, and thus we have an optimization for it, which is to keep an index to the first element with the lowest usage count. If the channel being used is to the left of this index, we swap with the element at the index, and increment the index. If the channel being used is to the right, we apply bubble sort. Once the index becomes -1, we re-scan the list to find the new lowest index. This optimization means that, in a case where the communication always succeeds (i.e. there is always a waiting process), we do not need to bubble the first element all the way to the end of the list.
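A simplified Python sketch of this bookkeeping (class and method names are illustrative, not the CoCoL API) uses a closely related trick: since a use only increments one counter by one, swapping the used element with the last element holding the same count keeps the list sorted with a single swap, which is the effect the index optimization achieves without bubbling.

```python
class FairChannelList:
    """Keeps channels sorted by usage count, least-used first (sketch).

    The counter list and channel list are kept separate, with swaps
    performed on both, so the channel list can be used without copying.
    """
    def __init__(self, channels):
        self.channels = list(channels)
        self.counts = [0] * len(self.channels)

    def in_priority_order(self):
        # Registering offers in this order yields fair alternation:
        # the least communicating channel has the highest priority.
        return list(self.channels)

    def mark_used(self, channel):
        i = self.channels.index(channel)
        c = self.counts[i]
        # Find the last element with the same count; swapping with it and
        # incrementing there keeps the list sorted with one swap, instead
        # of bubbling the element past its whole group of equal counts.
        j = i
        while j + 1 < len(self.counts) and self.counts[j + 1] == c:
            j += 1
        self.counts[i], self.counts[j] = self.counts[j], self.counts[i]
        self.channels[i], self.channels[j] = self.channels[j], self.channels[i]
        self.counts[j] += 1
```

Under balanced load (Figure 1) the swap target is the end of the long run of equal counts; under un-balanced load (Figure 2) the list is already almost sorted and the swap is local.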

CPA 2015 preprint – the proceedings version will have other page numbers and may have minor differences.

K. Skovhede and B. Vinter / CoCoL: Concurrent Communications Library 11

This approach makes the actual sorting as limited as possible, with only a single counteras extra overhead. In the implementation, the counter array and channel arrays are kept sep-arate, with swaps performed on both, such that the channel list can be used without anycopying.

Figure 1. The fair alternation usage list under balanced load. (Usage counts 0 0 0 0 0 1 1 1 1 1, sorted in increasing order; a marker indicates the first element with the lowest usage count.)

Figure 2. The fair alternation usage list under un-balanced load. (Usage counts 0 4 6 12 32 42 66 66 85 92, sorted in increasing order; a marker indicates the first element with the lowest usage count.)

3. Results

To evaluate the performance of CoCoL we have chosen to compare with two existing CSP libraries: C++CSP [21] and JCSP [4].

These libraries all include examples of the common CSP benchmarks: CommsTime and Stressed Alt. This makes the performance results here somewhat comparable to those reported in 2003 [21].

The benchmarks are used mostly unmodified from the source, with minor modifications to even out differences such as the number of iterations and the problem size.

To further expand on these results, and to make an attempt at producing comparable cross-OS results, we have executed all benchmarks on the same hardware: an Apple MacBook Pro with an i7 2.8 GHz processor and 16 GB 1600 MHz DDR3 RAM, running OSX 10.10.3.

To produce results from other operating systems, we have used the Parallels Desktop 10.2.0 software to create virtual machines, and installed 64-bit Windows 8.1 and Ubuntu 14.04.2 guest operating systems. While there is certainly an overhead associated with running another operating system inside a virtual machine, we consider the results to be a good indicator of the relative performance. We base this on the fact that the benchmarks are highly CPU intensive and the Parallels software does not emulate the CPU, but presents it directly to the guest OS.

For the benchmarks running on the CLR, the most popular open source implementation is Mono [8], which is available on all tested operating systems. For all operating systems we have used Mono version 4.0.1 and, additionally for Windows, we used the Microsoft .Net Runtime version 4.5.50709.

For Java and JCSP, we use the current Oracle JRE version 1.8.0_45 throughout the tests on all three operating systems.



3.1. CommsTime

The CommsTime benchmark is a classic micro-benchmark, whose purpose is to measure the communication overhead introduced by a channel communication. While the benchmark is a bit simplistic, it does give a measure of how much overhead each communication adds. A schematic representation of the CommsTime network is shown in Figure 3.

Figure 3. The CommsTime network (the processes Prefix, Delta, Successor and Consumer).
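For reference, the network can be sketched with Python's asyncio, which mirrors CoCoL's task-based (rather than thread-based) model: each process is a coroutine, and `asyncio.Queue` stands in for a channel. This is an illustrative model, not the benchmark code itself.

```python
import asyncio
import time

async def prefix(value, cin, cout):
    await cout.put(value)              # inject the initial value
    while True:
        await cout.put(await cin.get())

async def delta(cin, cout1, cout2):
    while True:
        v = await cin.get()
        await cout1.put(v)
        await cout2.put(v)

async def successor(cin, cout):
    while True:
        await cout.put(await cin.get() + 1)

async def consumer(cin, rounds):
    start = time.perf_counter()
    for _ in range(rounds):
        await cin.get()
    # four channel communications happen per round trip of the network
    return (time.perf_counter() - start) / (rounds * 4)

async def commstime(rounds=10_000):
    a, b, c, d = (asyncio.Queue(maxsize=1) for _ in range(4))
    procs = [asyncio.ensure_future(p)
             for p in (prefix(0, d, a), delta(a, b, c), successor(b, d))]
    per_comm = await consumer(c, rounds)
    for p in procs:
        p.cancel()
    return per_comm

print(f"{asyncio.run(commstime()) * 1e6:.3f} microseconds per communication")
```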

Figure 4. The CommsTime scalable network (a combined Prefix + Delta process feeding a ring of Identity processes, with a Consumer attached).

3.1.1. CommsTime With CIL

To evaluate the different approaches to implementing channel communication in the Common Intermediate Language, we have implemented the CommsTime example in multiple configurations. The Await version uses the await keyword when reading and writing, and thus gets an automatic implementation of the finite state machine. The Building blocks version also uses await statements, but uses pre-cooked processes for the Prefix, Delta and Successor processes, similar to other CSP libraries. The Blocking version uses the asynchronous communication channels, but blocks on each call, thus requiring a thread for each process. The Minimal version is an experiment using an extremely simple un-buffered channel based on traditional locking and events. The BlockingCollection version explores the option of using the BlockingCollection data structure found in the .Net 4.5 libraries. The BlockingCollection looks similar to a channel, in that it supports a buffered collection that blocks readers and writers, and some set operations that are similar to the ReadFromAny and WriteToAny methods. The results for all operating systems with Mono, and for the Microsoft .Net runtime on Windows, are shown in Figure 5.

From the results we can see that the Mono runtime is generally between 5 and 10 times slower than the Microsoft .Net runtime. As expected, the version written with await and the version using pre-cooked processes perform almost identically. If the CoCoL channels are fitted with locks to provide a blocking interface, the execution time is approximately 5 times slower. If the CoCoL library were implemented with locks and events, it would generally be slower, except for Mono on Windows and OSX. The BlockingCollection is generally many times slower, except on OSX, where it seems to use an OSX-specific feature to obtain very fast execution times.



Figure 5. Communication time (microseconds per communication) with CommsTime for a variety of similar approaches to implementing a communication channel in C#. Lower is better. (Configurations: Await, Building blocks, Blocking, Minimal, BlockingCollection; platforms: Win/.Net, Win/Mono, OSX/Mono, Linux 32/Mono, Linux 64/Mono.)

3.1.2. CommsTime Compared to Other Libraries

To evaluate how the CoCoL approach compares to existing libraries, we run the same CommsTime experiment with both JCSP and C++CSP. For Linux, we also measure with OpenJDK 1.7.0_79, which is the default Java runtime on Ubuntu.

Due to outdated dependencies, we were only able to get C++CSP running on Linux. The 64-bit version compiles, but produces deadlocks and segmentation faults, so we only include the 32-bit results. The combined results are shown in Figure 6.

Overall, the fastest implementation is C++CSP when using a single thread for all processes, resulting in a cooperative threading model with very fast switching. When C++CSP uses multiple threads for the processes, the context switches cause the C++CSP implementations to be consistently slower than the Mono and JCSP versions. Interestingly, the (correct) parallel version in JCSP is consistently faster than the sequential version. The Mono and JCSP versions are generally comparable, with neither being consistently faster. The .Net runtime is significantly faster than any of the Mono or Java based versions, being more than twice as fast as the fastest of them.

3.1.3. CommsTime Scaling

To evaluate the relative overhead when scaling systems beyond the small CommsTime example, we have implemented a variant of the CommsTime network where we introduce forwarding processes to form a communication ring, as shown in Figure 4. With this ring setup it becomes trivial to increase the number of processes, and thus experiment with the scalability of the systems. Figure 7 shows how the communication overhead increases slowly as the number of processes and channels grows. The general tendency is that there is little extra overhead from running more processes: when increasing the number of processes and channels by 5 orders of magnitude, the communication time no more than doubles in the worst case.
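The ring variant is easy to reproduce in the same asyncio sketch style (again a stand-in for the CoCoL version, not the benchmark code itself): the number of forwarding processes is a parameter, so the process count can be scaled arbitrarily.

```python
import asyncio

async def identity(cin, cout):
    # Forwarding process: reads a value and passes it on unchanged.
    while True:
        await cout.put(await cin.get())

async def ring(n_identities, rounds):
    # A chain of n identity processes; the main coroutine plays the
    # combined Prefix + Delta + Consumer role, closing the ring.
    queues = [asyncio.Queue(maxsize=1) for _ in range(n_identities + 1)]
    tasks = [asyncio.ensure_future(identity(queues[i], queues[i + 1]))
             for i in range(n_identities)]
    await queues[0].put(0)
    token = 0
    for _ in range(rounds):
        token = await queues[-1].get()
        await queues[0].put(token + 1)
    for t in tasks:
        t.cancel()
    return token

# 1000 coroutine processes cost little more per communication than 10 do
print(asyncio.run(ring(1000, 100)))
```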



Figure 6. Communication time (microseconds per communication) with CommsTime for three different libraries on different operating systems. Lower is better. (Platforms: Win, OSX, Linux 32-bit, Linux 64-bit; series: CoCoL/.Net, CoCoL/Mono, Java/Par, Java/Seq, OpenJDK/Par, OpenJDK/Seq, CPP/Par/single, CPP/Par/multi, CPP/Seq/single, CPP/Seq/multi.)

Figure 7. Communication time (microseconds per communication) when scaling the number of channels and processes in CommsTime from 10 to 1,000,000. Lower is better. (Series: Ubuntu 32/Mono, Win/Mono, Ubuntu 64/Mono, OSX/Mono, Win/.Net.)

3.2. Stressed Alt

The Stressed Alt benchmark uses a number of shared channels, with each channel having a number of writers contending to write to it. At the receiving end of the channels is a single stressed reader, performing a fair read from the channel set, ensuring that no channel starves.
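The fairness requirement can be illustrated with a small Python sketch (illustrative only; CoCoL's reader uses the sorted-channel-list mechanism from Section 2.6): the reader always serves the channel it has served least often.

```python
from collections import deque

class StressedReader:
    """Reads fairly from a set of channels so that no channel starves.

    Channels are modeled as deques of pending writes; `served` counts
    how often each channel has been read (names are illustrative).
    """
    def __init__(self, channels):
        self.channels = list(channels)
        self.served = [0] * len(self.channels)

    def read(self):
        # Try channels in order of how often they have been served;
        # ties go to the lower index, since Python's sort is stable.
        for i in sorted(range(len(self.channels)),
                        key=lambda k: self.served[k]):
            if self.channels[i]:
                self.served[i] += 1
                return self.channels[i].popleft()
        return None  # nothing pending on any channel

# Two channels, each with a backlog from contending writers:
chans = [deque(["a1", "a2", "a3"]), deque(["b1", "b2", "b3"])]
reader = StressedReader(chans)
print([reader.read() for _ in range(6)])
# prints ['a1', 'b1', 'a2', 'b2', 'a3', 'b3'] -- the reader alternates
```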

When the Stressed Alt benchmark runs with JCSP on OSX, there is an issue with scaling



to more channels: when increasing from a 10x100 problem size to a 20x100 problem size, the communication time increases from 46 microseconds to 140 microseconds. This appears to be caused by an implementation detail in the JVM that produces an exponential time increase. To make the figure easier to read, this measurement has been excluded from Figure 8. Again, the Mono runtime is approximately 5 times slower than the .Net version. The .Net version performs almost as well as the JCSP version, which has consistently good performance across all operating systems, except OSX, where the scaling issue appears.

One thing to note is that the JCSP version cannot handle more than approximately 3000 processes on Linux. Even with 4 GB system memory, an out-of-memory exception is thrown when the system attempts to start the many processes. Others report that the limit is around 7000 processes [19], but do not specify machine details.

With the CoCoL library, it is possible to use a million processes on all platforms. With the .Net runtime it takes around 4 hours to complete 1000 rounds of 1 million communications. The test setup with 1 million processes using the Mono runtime has not been completed, as each test would take approximately 20 hours.

Figure 8. Communication time (microseconds per communication) for Stressed Alt with JCSP and CoCoL on different operating systems, for problem sizes 10x10, 10x100, 20x100, 100x100 and 1000x1000. Lower is better. (Platforms: Win/.Net, Win/Mono, OSX/Mono, Ubuntu 32/Mono, Ubuntu 64/Mono, Win/JCSP, OSX/JCSP, Ubuntu 32/JCSP, Ubuntu 64/JCSP.)

3.3. Mandelbrot

To investigate a slightly more realistic system, where each process has a varying, non-zero amount of work to do, we have implemented a renderer for a Mandelbrot fractal. The implementation takes the image dimensions and iteration count as input and then forwards each pixel to the workers. The workers forward each result pixel to a renderer, which assembles the picture.

We have implemented the process network in two flavors: static and dynamic. In the static setup, the processes are loaded initially and wait for channel input; there are 32 workers, reading from a shared input channel and writing to a shared output channel. In the dynamic setup, a worker is created per pixel, and the values are passed into the constructor of each worker. The process that spawns the workers also collects all results through a shared channel.
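A minimal sketch of the static flavor, using Python threads and queues in place of CoCoL processes and shared channels (the pixel-to-plane mapping, helper names and worker count here are illustrative, not taken from the benchmark code):

```python
import threading
import queue

MAX_ITER = 100

def mandelbrot(cx, cy, max_iter=MAX_ITER):
    # Iteration count before |z| escapes, or max_iter if it never does.
    z = 0j
    for i in range(max_iter):
        z = z * z + complex(cx, cy)
        if abs(z) > 2.0:
            return i
    return max_iter

def worker(pixels, results):
    # Static setup: each worker loops, reading pixel coordinates from a
    # shared input channel and writing results to a shared output channel.
    while True:
        job = pixels.get()
        if job is None:          # poison pill shuts the worker down
            return
        x, y, cx, cy = job
        results.put((x, y, mandelbrot(cx, cy)))

def render(width, height, n_workers=32):
    pixels, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(pixels, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for y in range(height):
        for x in range(width):
            # map the pixel onto a region of the complex plane
            pixels.put((x, y, -2.0 + 3.0 * x / width, -1.5 + 3.0 * y / height))
    image = {}
    for _ in range(width * height):
        x, y, it = results.get()
        image[(x, y)] = it
    for _ in threads:
        pixels.put(None)
    for t in threads:
        t.join()
    return image
```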



To evaluate how well the system scales with a large number of communications, we have varied the size of the output image. For the dynamic approach, this means that as many as 4 million short-lived processes are created.

Figure 9. Time (microseconds) per pixel for a Mandelbrot renderer, for image sizes 100x100, 500x500, 1000x1000 and 2000x2000. Lower is better. (Platforms: Win/.Net, Win/Mono, OSX/Mono, Ubuntu 32/Mono and Ubuntu 64/Mono, each in Static and Dynamic configurations.)

When running the benchmarks, the actual image drawing was disabled to reduce issues with the graphics library, and the maximum number of iterations for each pixel was set to 100. The results for various platforms are shown in Figure 9, and show that even when the size of the image grows, the time to compute each pixel is almost constant. For most results, the 100x100 pixel setup is too small to allow running at full speed. The Mono runtime has nearly identical performance across all operating systems.

In the static configuration, the .Net runtime is clearly fastest, with the slowest being approximately 5 times slower.

The dynamic setup is clearly slower than the static one, but only by about a factor of 2 for the Mono runtime. The .Net runtime has some issues handling this particular setup, and ends up 10 times slower than the static version. This is most likely an issue with the thread pool not matching the increase and decrease in workload well enough.

4. Future Work

The primary focus of CoCoL has been to define a minimal API for a single machine. The CSP model lends itself nicely to multiple machines, so this should be investigated.

With distributed processes, it is required that data being passed on a channel is serializable for the network transport. Fortunately, CIL has a rich set of capabilities for serializing and deserializing objects, which should make such efforts possible. A distributed external choice is relatively straightforward to implement with the two-phase-commit approach, but requires an efficient distributed lock.

Many kinds of data, such as file handles and the channels themselves, cannot be serialized. One approach to solving this is to use CIL remoting, which is a kind of RPC call, where a proxy object forwards calls and data to the process that contains the original entry.



Once the serialization issues are resolved, the fact that processes do not depend on a stack makes them portable. This portability follows from the finite state machine encapsulation, where all state and accessible variables are captured in an object instance. By serializing the state object, it is possible to migrate a process without handling potential pointers to the local memory space. This allows inactive processes to be suspended and migrated to a different machine to provide workload balancing.

The external choice implementation lends itself to different forms of external choice. One such implementation could be a multicast operation, where the writer will atomically write a value to n channels in a set of m channels, or not write at all.

5. Conclusion

CoCoL is a new library for handling concurrent programs without the need for traditional locking constructs. By using CSP ideas, it becomes possible to build programs using traditional CSP logic. By using a terminology that is unlike the traditional CSP wording, and by allowing a mixed-paradigm approach, it is the authors' hope that the library will become popular outside the CSP community.

As CoCoL uses the CIL runtime, it works seamlessly across all major operating systems. Combined with the very small open source codebase, the authors consider it a candidate for teaching CSP-like concurrency.

The use of continuations and general language integration makes it possible to write most common CSP examples in a single file. With the task-based parallelism it becomes possible to run millions of processes with moderate memory requirements.

All source code, including the benchmarks, can be found on the project website [18].

Acknowledgements

This research was supported by grant number 131-2014-5 from Innovation Fund Denmark.

References

[1] C.A.R. Hoare. Communicating Sequential Processes. Prentice-Hall, London, 1985. ISBN: 0-131-53271-5.

[2] Peter H. Welch and Fred R.M. Barnes. Communicating mobile processes: introducing occam-pi. In 25 Years of CSP, volume 3525 of Lecture Notes in Computer Science, pages 175–210.

[3] Peter H. Welch and Fred R.M. Barnes. A CSP model for mobile channels. In CPA, pages 17–33, 2008.

[4] Peter H. Welch, Neil C.C. Brown, James Moores, Kevin Chalmers, and Bernhard H.C. Sputh. Integrating and extending JCSP. Communicating Process Architectures 2007, 65:349–370, 2007.

[5] ECMA. 335: Common Language Infrastructure (CLI). ECMA, Geneva (CH), 2005.

[6] Shyamal Suhana Chandra and Kailash Chandra. A comparison of Java and C#. Journal of Computing Sciences in Colleges, 20(3):238–254, 2005.

[7] Jeremy Singer. JVM versus CLR: a comparative study. In Proceedings of the 2nd International Conference on Principles and Practice of Programming in Java, pages 167–169. Computer Science Press, Inc., 2003.

[8] Mono Project. The Mono project. http://www.mono-project.com/. [Online; accessed June 2015].

[9] Anders Hejlsberg, Scott Wiltamuth, and Peter Golde. C# Language Specification. Addison-Wesley Longman Publishing Co., Inc., 2003.

[10] Gilad Bracha, Martin Odersky, David Stoutamire, and Philip Wadler. Making the future safe for the past: Adding genericity to the Java programming language. ACM SIGPLAN Notices, 33(10):183–200, 1998.

[11] D. Szafron, J. Schaeffer, and A. Edmonton. An experiment to measure the usability of parallel programming systems. Concurrency Practice and Experience, 8(2):147–166, 1996.

[12] L. Hochstein, J. Carver, F. Shull, S. Asgari, and V. Basili. Parallel programmer productivity: A case study of novice parallel programmers. In Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, pages 35–35. IEEE, 2005.

[13] Semih Okur, David L. Hartveld, Danny Dig, and Arie van Deursen. A study and toolkit for asynchronous programming in C#. In Proceedings of the 36th International Conference on Software Engineering, pages 1117–1127. ACM, 2014.

[14] Stefan Tilkov and Steve Vinoski. Node.js: Using JavaScript to build high-performance network programs. IEEE Internet Computing, 14(6):80–83, 2010.

[15] Adele Goldberg and David Robson. Smalltalk-80: The Language and its Implementation. Addison-Wesley Longman Publishing Co., Inc., 1983.

[16] C++11. std::future in C++11. http://en.cppreference.com/w/cpp/thread/future. [Online; accessed June 2015].

[17] Oracle. java.util.concurrent.Future class. https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/Future.html. [Online; accessed June 2015].

[18] K. Skovhede. CoCoL source code. https://github.com/kenkendk/cocol. [Online; accessed June 2015].

[19] Jan B. Pedersen and Andreas Stefik. Towards millions of processes on the JVM. 2014.

[20] Butler Lampson and Howard Sturgis. Crash recovery in a distributed data storage system. Xerox Palo Alto Research Center, Palo Alto, California, 1979.

[21] Neil C.C. Brown. C++CSP2: A many-to-many threading model for multicore architectures. In Alistair A. McEwan, Wilson Ifill, and Peter H. Welch, editors, Communicating Process Architectures 2007, pages 183–205, July 2007.
