
The Split-Phase Synchronisation Technique: Reducing the Pessimism in the WCET Analysis of Parallelised Hard Real-Time Programs

Mike Gerdes, Florian Kluge and Theo Ungerer
University of Augsburg, Germany
Email: {gerdes,kluge,ungerer}@informatik.uni-augsburg.de

Christine Rochange
University of Toulouse, France
Email: [email protected]

Abstract—In this paper we present the split-phase synchronisation technique to reduce the pessimism in the WCET analysis of parallelised hard real-time (HRT) programs on embedded multi-core processors. We implemented the split-phase synchronisation technique in the memory controller of the HRT capable MERASA multi-core processor. The split-phase synchronisation technique allows reordering memory requests and splitting of atomic RMW operations, while preserving atomicity, consistency and timing predictability. We determine the improvement of worst-case guarantees, that is the estimated upper bounds, for two parallelised HRT programs. We achieve a WCET improvement of up to 1.26 with the split-phase synchronisation technique, and an overall WCET improvement of up to 2.9 for parallel HRT programs with different software synchronisations.

I. INTRODUCTION

Research in parallel programs and architectures was bound to the domain of high-performance computing for a long time. With the advent of multi-core processors, parallelisation also became important in other domains, namely desktop end-user systems and embedded systems. However, embedded systems have different needs and must fulfil other requirements than high-performance systems. Today's HRT programs in the automotive, avionics or machinery industry are executed on single-core processors. The new trend of using multi-cores in safety-critical domains sparks off research on running HRT tasks in parallel with other tasks to execute mixed-criticality application workloads. Our research goes even one step further: we target multi-core execution of parallelised HRT tasks without sacrificing timing guarantees.

The threads of a parallelised program require synchronised access to shared data. Hence, it is essential for parallelised HRT programs to ensure predictable access to shared resources as well as upper bounds on the waiting times introduced by the execution of synchronisation primitives. Although it has been shown that parallelised HRT programs are timing analysable with static worst-case execution time (WCET) analysis tools [1], [2], it is an open problem to reduce the pessimism in the static WCET analysis introduced by interfering accesses to shared resources. This pessimism becomes apparent, for instance, in the worst-case latencies of memory requests in shared-memory multi-core processors.

The contribution of this paper is the split-phase synchronisation technique, which reduces this additional pessimism in the static WCET analysis of parallel HRT programs. Our technique aims at making the frequent case faster: it reduces the WCET of frequent (and fast) load/store accesses, while sacrificing the worst-case performance of rarer (and slower) synchronisation accesses. The split-phase synchronisation technique has been implemented in hardware, allowing the reordering of memory accesses in the memory controller. We show that our proposal preserves consistency through weak ordering in hardware, and predictability by using HRT capable software synchronisation techniques as introduced in [1]. We discuss and motivate why we implement the synchronisation logic and split-phase synchronisation in an augmented memory controller, instead of locking the interconnect or implementing a dedicated shared memory for synchronisations at the interconnect. We evaluate the improvement of the worst-case guarantees and compare the WCET estimates from the static WCET analysis tool OTAWA [3] for different parallelised HRT programs with and without the split-phase synchronisation technique.

In Section II we discuss related work. In Section III we briefly present the modelled HRT capable MERASA multi-core processor [4], introduce worst-case memory latencies (WCMLs), and discuss consistency and atomicity requirements on the hardware for synchronisations in parallel HRT programs. The split-phase synchronisation technique is then presented in Section IV, and evaluation results with the static WCET tool OTAWA are shown in Section V.

II. RELATED WORK

Monchiero et al. [5] present an augmented global memory controller, the Synchronisation-operation Buffer (SB), to reduce contention for busy-waiting synchronisation primitives in future mobile systems with complex Networks-on-Chip (NoCs). Their main focus is on reducing contention, and thereby enabling an efficient use of busy-waiting synchronisations like spin locks. The goal of their technique is to decrease the average-case execution time by speeding up slow synchronisation primitives, while also enabling fine-grained synchronisation. Another approach, the Request-Store-Forward (RSF) model, has been proposed by Liu and Gaudiot [6], targeting many-core architectures in high-performance computing. The goal of the RSF technique is to provide a fine-grained synchronisation technique and to reduce contention from busy-waiting and polling synchronisation methods. A synchronisation buffer implemented in on-chip memory (e.g. a shared cache) keeps track of synchronisations (request), orders them (store), and notifies (forward) the cores when the synchronisation access is ready. By offloading this computation near the memory, the waiting times in the cores can be used to execute other tasks until the cores are notified that their synchronisation access is ready. Contrary to the above solutions, we focus on speeding up the worst-case performance of the frequent case of memory operations, namely loads and stores, with our augmented memory controller and the split-phase synchronisation technique.

In this paper we do not go into detail about the physical implementation of DRAM accesses. The memory latencies used in this paper have been derived from an FPGA prototype of the MERASA architecture [4]. However, the authors of [7], [8], and [9] provide different solutions concerning predictable DRAM access and their detailed physical implementation.

Only very few publications have targeted the WCET analysis of parallel HRT programs so far. Gustavsson et al. [10] present a possible tool chain for the static WCET analysis of multi-core architectures. They use timed automata to model the various components of a multi-core architecture, including private and shared caches, but also software-level shared resources like spin locks. The WCET of the parallel program is then derived by model checking. To estimate the WCMLs, a predictable arbitration scheme for shared resources, that is the off-chip memory, is mandatory. In the MERASA processor, which we use as WCET model in this paper, this is achieved by a predictable round-robin arbitration in the bus [11]. In a recent publication [12], the authors refined the round-robin arbitration, proposing a harmonic round-robin arbitration. In that way, memory-intensive programs are given access to the bus more frequently by prioritising them in the bus scheduling. Further approaches for predictable bus arbitration using a TDMA scheme are presented in [13] and [14]. In [15] the authors present a different method for estimating upper bounds on memory latencies by linking task- and system-level analyses. In [16], we introduce basic principles of analysing the worst-case waiting times in synchronisation functions. The idea is to determine all the paths on which a thread holds any system-level or application-level synchronisation variable; their estimated WCETs are combined to compute the worst-case waiting times at synchronisation points. In [2] we present first results on the static WCET analysis of an industrial, parallel HRT application. We considered a limited set of synchronisation functions based on test-and-set (TAS). In [1] we then further investigated predictable, HRT capable implementations of common software and hardware synchronisation techniques, and their impact on the program's WCET. We used TAS and Fetch&Increment/Fetch&Decrement (F&I/F&D) as hardware primitives, and mutex locks, semaphores, and barriers as software synchronisation techniques. In the current paper, we consider the above proposed synchronisation techniques in parallelised HRT programs and combine them with the split-phase synchronisation technique in a static WCET analysis.

Fig. 1. Overview of the MERASA multi-core processor, stressing the embedded hardware synchronisation primitives in the memory controller.

III. PREDICTABILITY IN MULTI-CORE PROCESSORS

We use a WCET model of the bus-based, HRT capable SMT multi-core MERASA processor [4], which has also been implemented as a SystemC simulator and FPGA prototype. The modelled MERASA processor features a configurable number of HRT capable cores and hardware thread slots. One hardware thread slot of each core is reserved for a HRT thread, and the other hardware thread slots are used by non-hard real-time (NHRT) threads. The HRT threads are isolated within the cores [17], but the memory controller and interconnect cannot isolate concurrent accesses of different cores. Besides, a partitioning of global memory would impede the use of a global address space, and hence narrow down the programmability for users. Therefore, we have chosen to allow shared resources. Interferences are handled by upper-bounding accesses to shared resources, using a real-time capable bus [11] as interconnect between memory and cores, as well as a real-time capable memory controller. As local memories we use scratchpad memories for each core, namely a data scratchpad (D-SPM) and a dynamic instruction scratchpad (I-SPM) [18], but no caches for the HRT threads. However, we allow caches to be used by NHRT threads. Fig. 1 depicts an overview of the MERASA multi-core processor.

A. Worst-case Memory Latencies

The WCETs of parallelised HRT programs running on shared-memory multi-core processors depend highly on the knowledge of competing off-chip memory accesses and the WCMLs. The latency of a memory request is split into three parts: 1) the time the bus needs to dispatch the memory request from a core to the memory controller, the so-called bus cycle time; 2) the time the memory controller needs to execute the memory request, which depends on the kind of memory request executed, either a load, a store, a TAS, or a F&I/F&D operation; and 3) again the bus cycle time to return a value to the core that requested the memory operation. The memory requests from all cores to the global shared memory are arbitrated by a real-time aware bus in the MERASA processor. The bus arbitrates accesses in a round-robin fashion, dispatching a waiting memory request of a core to the memory controller. When a memory request from a core is accepted and dispatched to the bus and subsequently to the memory controller, follow-up memory requests from the same core are dispatched only after the previous access has finished. In the following, the WCML is defined as the upper-bound delay on a HRT memory request from when it is ready to be dispatched to the shared memory (over the bus) until it is successfully finished and a following request could be dispatched. The bus is treated as full duplex, meaning that a request from a core to the memory and a result from the memory to a core can be dispatched in the same cycle.

B. Consistency and Atomicity

Sequential consistency, introduced by Lamport in [19], has two requirements: (R1) each processor issues memory requests in the order specified by its program, and (R2) memory requests from all processors issued to an individual memory module are serviced from a single FIFO queue. The HRT capable MERASA multi-core processor fulfils those two requirements through the arbitration in the cores (R1) and the augmented memory controller (R2) (see [4]). In later publications, the notion of weak ordering [20] has been introduced (see [21] for a further definition of weak ordering). The idea of weakly ordered systems is that they appear sequentially consistent by ordering accesses dispatched from different processors with explicit synchronisation operations that can be recognised by the hardware. In detail, bringing the requirements for weakly ordered memory operations stated in [20] together with the MERASA multi-core processor: 1) accesses to global synchronisation variables are strongly ordered, 2) no access to a synchronisation variable is issued by a core before all previous global data accesses have been performed, and 3) no access to global data is issued by a core before a previous access to a synchronisation variable has been performed. In Section IV-B we show that these requirements still hold with the split-phase synchronisation technique.

The use of synchronisation techniques (see [22] for a survey on software synchronisations, and [23], [1] for implementations of predictable synchronisation techniques), for instance to avoid data races, is mandatory for the functional correctness of parallel programs. One possibility to use software synchronisation techniques in parallel HRT programs is with the support of hardware-implemented RMW operations. A mandatory requirement for the implementation of RMW operations is atomicity. It ensures that an operation consisting of a read, a modification and a write cannot be interrupted, and will be executed completely. For a bus-based, shared-memory multi-core, two different possibilities to implement atomicity for RMW operations are conceivable: 1) locking the interconnect and modifying in the cores, or 2) logic for atomic operations in the memory. The latter could be implemented either in the memory controller of the shared global memory or as a dedicated shared memory for synchronisations at the interconnect (e.g. like shared L2 caches in high-performance systems). In the following we discuss why we augmented the memory controller with the needed logic for the atomicity of RMW operations and the split-phase synchronisation technique.
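To make this concrete, the following is a minimal sketch of a TAS spin lock in portable C11. The atomic_exchange call stands in for the hardware TAS primitive (which always stores the constant '1' and returns the previous value); the type and function names are ours, not part of the MERASA software stack.

```c
#include <stdatomic.h>

/* Minimal TAS spin lock sketch; atomic_exchange models the hardware
 * TAS primitive, which atomically stores 1 and returns the old value. */
typedef struct { atomic_int flag; } tas_lock_t;

static void tas_lock(tas_lock_t *l) {
    /* The read-modify-write must be atomic: without it, two cores could
     * both read 0 and both believe they acquired the lock. */
    while (atomic_exchange(&l->flag, 1) != 0)
        ;  /* busy-wait until the previous value was 0 */
}

static void tas_unlock(tas_lock_t *l) {
    /* Unlock is a plain store of 0 to the synchronisation variable. */
    atomic_store(&l->flag, 0);
}
```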

Fig. 2. Memory access pattern for implementing RMW operations with a locked interconnect (a), with the augmented memory controller (b), and with the split-phase synchronisation technique (c).

1) Locking the Interconnect: Fig. 2 depicts the impact of a RMW operation (from core 1) on the memory latency of a load (from core 2) with a locked interconnect (Fig. 2(a)), with the augmented memory controller (Fig. 2(b)), and with the split-phase synchronisation technique (Fig. 2(c)). The blocks labelled 'B' depict bus accesses, M1, ..., Mn depict memory accesses, 'C' is the computation for the modification in RMW operations, and the blocks labelled 'X' depict idle cycles, e.g. when the bus is locked or the memory controller is busy with another memory access. When locking the interconnect for RMW operations, the WCML of every memory access increases. For instance, Fig. 2(a) shows for two cores that the WCML of a load in core 2 is at least 3 cycles higher (the additional bus cycles are labelled in Fig. 2(a)) than with an unlocked interconnect and the synchronisation logic embedded in the memory controller (Fig. 2(b)). These additional latency cycles add up when scaling the number of cores. Also note that the time the computation phase takes to manipulate the value in a RMW operation depends on where this computation is done. The computation time might be higher if the computation is done in the core, whereas one cycle is possible if the computation is done in the augmented memory controller. Also, the additional latency adds up for every memory access, thus the estimated WCET of the whole program increases.

2) Augmented Memory Controller: The augmented memory controller (see Fig. 1), described in more detail in [1], includes the needed logic for the atomicity of RMW operations. It serves all memory requests in the order they arrive (FIFO). The memory controller recognises a synchronisation access, i.e. a RMW operation, and executes the load, the modification and the subsequent store atomically. We do not distinguish HRT and NHRT requests in the memory controller, as e.g. proposed in [7], because in our case it does not speed up the worst-case performance. For the WCML of a memory request from one core we have to assume that all other concurrent memory requests issued from the other cores are HRT requests as well; therefore prioritising them does not cause any speedup in the worst case. In Section V-A we describe the impact on the WCMLs in more detail. Nonetheless, we isolate HRT and NHRT threads inside the SMT cores [17], [4].

Fig. 3. Schematic overview of a dedicated synchronisation memory at the memory interconnect, including the synchronisation logic at the real-time bus.

3) Synchronisation Memory on the Interconnect: Fig. 3 shows an additional possibility to achieve atomicity for RMW operations by using a dedicated shared memory at the interconnect for synchronisation variables. The advantage of this approach is that faster load/store operations could be executed in parallel with slow RMW operations. The needed synchronisation logic is nearly the same as for the augmented memory controller, but additional arbitration logic in the bus is needed, as it is possible that requests from the off-chip memory and the synchronisation memory finish in the same cycle. This also leads to the problem of a possible increase in the WCMLs of loads/stores: even if synchronisation memory requests are served with lower priority, they need to be handled eventually, as otherwise they cannot be bounded anymore. This would add an extra cycle to the WCML of every load/store. Another drawback is the additional cost of the on-chip memory and the loss of flexibility, e.g. as the number of possible synchronisation variables is bounded by the size of the synchronisation memory. Additional initialisation and memory management for synchronisation variables, however, should not be a problem. Still, it might be a promising approach, as e.g. shown in [24] for NoC-based multi-core processors. The authors of [24] present results on average synchronisation performance for a 16-core NoC-based multi-core processor. The best results were achieved with a dedicated on-chip memory for synchronisation variables. They conclude that for future NoC-based multi-cores the trade-off of area versus performance should be taken into consideration. In this paper, though, we favour the approach of the augmented memory controller with the split-phase synchronisation technique, as it promises higher flexibility, lower hardware costs, and a less complex bus arbitration.

IV. SPLIT-PHASE SYNCHRONISATION

The split-phase synchronisation technique is a modification of the augmented memory controller to reduce the pessimism in the WCET for loads/stores introduced by slower (synchronisation) memory operations. To achieve reduced WCMLs for loads/stores, we reorder memory operations in the augmented memory controller. We prioritise load/store operations over RMW operations, while keeping sequential consistency with weak ordering as defined in [21]. In Section IV-B we show that the split-phase synchronisation technique maintains consistency and atomicity of RMW operations.

The split-phase synchronisation technique uses a similar mechanism to the load-linked/store-conditional (LL/SC) primitive, which is used e.g. in the Alpha AXP [25], PowerPC, ARM, and MIPS architectures. The advantage of LL/SC over e.g. compare-and-swap (CAS) is that the two separate instructions only need two registers (address, data) instead of three. Most LL/SC implementations apply a coarse-grained approach, namely they do not monitor changes at the granularity of memory words, but lines of memory or even complete memory pages. LL/SC was initially intended to scale well on large multiprocessors with distant shared memory. However, as the conditional store might fail for competing accesses, the latency of a successful conditional store cannot be bounded. Thus, its use is not safe in HRT systems. Also, LL/SC is a hardware primitive, whereas the split-phase synchronisation is a technique applied to all implemented RMW operations. It splits their load, modification and store phases to reduce the worst-case memory latencies of loads/stores by prioritising them over concurrent RMW operations, and uses a fine-grained approach monitoring accessed synchronisation variables in the memory controller. Please note that the term split-phase synchronisation is not related to the commonly known split-phase access introduced by Culler et al. in Split-C [26].
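For contrast, here is a sketch of how F&I would have to be emulated on top of LL/SC. The intrinsics load_linked and store_conditional are hypothetical (LL/SC has no portable C interface); the point is the retry loop, whose iteration count has no upper bound under contention, which is exactly why LL/SC is unsafe for HRT systems while a split-phase RMW always completes its store phase.

```c
/* Hypothetical intrinsics modelling an LL/SC instruction pair; they do
 * not exist as a portable C API and are only used for illustration. */
extern int load_linked(volatile int *addr);               /* LL: load and monitor addr */
extern int store_conditional(volatile int *addr, int v);  /* SC: 1 on success, 0 if addr was touched */

/* Fetch&Increment emulated with LL/SC: the store may fail whenever a
 * competing access hits the monitored location, so the number of retry
 * iterations has no upper bound under contention. */
int fetch_and_increment_llsc(volatile int *counter) {
    int old;
    do {
        old = load_linked(counter);
    } while (!store_conditional(counter, old + 1));
    return old;
}
```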

In the following we present a hardware implementation of the split-phase synchronisation technique in the augmented memory controller of the MERASA processor. Discussions of the impact of the split-phase synchronisation technique on the WCMLs and the estimated WCET are presented in Section V.

A. Implementation in the Augmented Memory Controller

The split-phase synchronisation technique is implemented in the augmented memory controller. In detail, we split the RMW operations into three phases: a load phase, a modification phase, and a store phase. We allow other memory operations that do not access the same variable to be brought forward and executed before the store phase of the RMW operation. The target of the split-phase synchronisation is to achieve WCMLs for loads/stores that are, in a manner of speaking, the best possible worst case. That means that the WCML of loads/stores only depends on concurrent (fast) loads/stores and not on concurrent (slower) RMW operations from other cores. Memory requests are handled as described in Section III. For the split-phase synchronisation, further hardware changes in the augmented memory controller are needed to allow the reordering while preserving atomicity (see Section IV-B). The following proposed implementation does not claim to be the best possible technical solution. Further enhancements might decrease the needed logic and space, or even increase the average-case performance. It is mandatory that the logic of the added register files executes as fast as possible, preferably in one cycle, to reduce the impact on the WCMLs. From the worst-case timing analysis perspective, we think it is sufficient to prove that a working technical implementation is possible that fulfils the requirements of consistency and atomicity for the split-phase synchronisation technique. Therefore, the main focus is not on the details of the technical implementation, but on demonstrating predictable worst-case timing.

Fig. 4. Schematic overview of the augmented memory controller with the implemented hardware for the split-phase synchronisation technique.

The proposed hardware implementation in the augmented memory controller uses two register files as FIFO buffers for memory requests (see Fig. 4). One register file, the mem_buffer, is used to store all memory requests, whereas the other register file, the reorder_buffer, is used as a temporary buffer to reorder the load/store requests of split RMW operations and load/store accesses on synchronisation variables. In addition, a buffer sync_buffer is used to store synchronisation variables and a counter for each ongoing synchronisation access. Synchronisation accesses are either RMW operations, or loads/stores on a synchronisation variable, e.g. the store in the unlock operation of a TAS spin lock (see also Section V-B).

1) Incoming Requests: Memory requests are distinguished between load/store and RMW operations in the augmented memory controller. In Fig. 4 we use the following syntax for the different memory requests: 1load a for a load from core 1 on memory address a, and 2loadRMW b and 2storeRMW b for the load respectively the store phase of a RMW operation on memory address b from core 2. For an incoming load/store operation, the memory controller first checks if the load/store accesses a synchronisation variable that is already being accessed (and therefore would be in the sync_buffer). If not, the load/store is just added to the mem_buffer without setting the reorder flag. In the other case, it is added with the reorder flag set, and the counter of the accessed synchronisation variable is incremented in the sync_buffer. When a RMW operation is detected, the load and store accesses are split, and if no other synchronisation request on that variable is stored in the sync_buffer, the memory address of the RMW operation is added to the sync_buffer with the counter set to two. In the mem_buffer both accesses are stored, where only the reorder flag of the storeRMW is set, but not that of the loadRMW access.

On the other hand, if there is already an access to that synchronisation variable in the sync_buffer, the counter for that address is increased by two (e.g. to four as depicted in Fig. 4 for the synchronisation variable b), and both split accesses are stored in the mem_buffer with the reorder flag set. This is done because in the reordering phase we must ensure that this RMW operation does not start before the store phase of the previous ongoing RMW operation on the same memory address is completed, to maintain atomicity.

2) Dispatching: Each time the memory controller is ready to dispatch a new request from the mem_buffer, it checks its reorder flag. If the reorder flag is not set, that memory request is dispatched. Otherwise the next memory request without the reorder flag set is selected from the mem_buffer and dispatched, and the reordering starts. If there is no request without the flag set, the first entry is dispatched and the reordering phase also starts. Furthermore, when a synchronisation access is dispatched, the counter of the corresponding memory address in the sync_buffer is decremented, and the synchronisation logic is notified what kind of memory access is currently processed. This is needed because in case 1), the memory access is finished and its result returned directly to the cores over the real-time bus (dotted arrow in Fig. 4), e.g. for a normal load/store. In case 2), for a RMW operation that does not need the loaded value for its modification, that is a TAS operation, the synchronisation logic removes the reorder flag of the corresponding store in the mem_buffer. Finally, in case 3), for all other RMW operations, for instance F&I/F&D operations, the loaded value is modified and then transferred to the corresponding store in the mem_buffer.

3) Reordering: In the reordering phase, all accesses in the mem_buffer with the reorder flag set are moved to the reorder_buffer. For the first access that is moved to the reorder_buffer, e.g. 2storeRMW b in Fig. 4, the reorder flag is removed. Otherwise the waiting store of a RMW operation might be deferred indefinitely by incoming concurrent loads/stores of other cores that would be executed before that waiting store (see also the worst-case access pattern in Fig. 5). By removing the reorder flag, and with the FIFO policy of the mem_buffer, we ensure that this access is dispatched before all freshly incoming requests. When all accesses in the mem_buffer are processed, the accesses in the reorder_buffer are appended to the mem_buffer. For instance, in Fig. 4 the 4load c access would advance past the 2storeRMW b access in the reordering phase.
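Putting Sections IV-A1 to IV-A3 together, the following behavioural sketch in C models the interplay of mem_buffer, reorder_buffer and sync_buffer. It is our software approximation of the register-file logic in Fig. 4, not the actual hardware; the buffer sizes, names and cycle-less sequencing are simplifying assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Behavioural model (not the MERASA RTL) of the buffers in Fig. 4;
 * buffer sizes and names are illustrative assumptions. */
typedef enum { OP_LOAD, OP_STORE, OP_LOAD_RMW, OP_STORE_RMW } op_t;
typedef struct { int core; op_t op; uint32_t addr; bool reorder; } req_t;

#define BUF 16
typedef struct { req_t e[BUF]; int n; } fifo_t;
typedef struct { uint32_t addr; int cnt; } sync_entry_t;

static fifo_t mem_buf, reorder_buf;          /* mem_buffer, reorder_buffer */
static sync_entry_t sync_buf[BUF];           /* sync_buffer: address + counter */
static int sync_n;

static int sync_find(uint32_t a) {
    for (int i = 0; i < sync_n; i++)
        if (sync_buf[i].addr == a) return i;
    return -1;
}

static void push(fifo_t *f, req_t r) { if (f->n < BUF) f->e[f->n++] = r; }

/* IV-A1: classify an incoming request; RMW operations are split. */
void incoming(req_t r) {
    int s = sync_find(r.addr);
    if (r.op == OP_LOAD || r.op == OP_STORE) {
        r.reorder = (s >= 0);                /* flag only if addr is under synchronisation */
        if (s >= 0) sync_buf[s].cnt++;
        push(&mem_buf, r);
        return;
    }
    req_t ld = r, st = r;                    /* split into load and store phase */
    ld.op = OP_LOAD_RMW;  st.op = OP_STORE_RMW;
    st.reorder = true;                       /* the store phase is always flagged */
    if (s < 0) {                             /* first RMW on this address */
        sync_buf[sync_n++] = (sync_entry_t){ r.addr, 2 };
        ld.reorder = false;
    } else {                                 /* must wait for the previous RMW's store */
        sync_buf[s].cnt += 2;
        ld.reorder = true;
    }
    push(&mem_buf, ld);
    push(&mem_buf, st);
}

/* IV-A3: flagged entries are parked in reorder_buf so unflagged ones
 * overtake them; the first parked entry loses its flag so it cannot be
 * deferred forever, and reorder_buf is appended back afterwards. */
static void reorder_phase(void) {
    fifo_t keep = { .n = 0 };
    bool first = true;
    for (int i = 0; i < mem_buf.n; i++) {
        req_t r = mem_buf.e[i];
        if (r.reorder) {
            if (first) { r.reorder = false; first = false; }
            push(&reorder_buf, r);
        } else {
            push(&keep, r);
        }
    }
    for (int i = 0; i < reorder_buf.n; i++) push(&keep, reorder_buf.e[i]);
    reorder_buf.n = 0;
    mem_buf = keep;
}

/* IV-A2: dispatch the next request, reordering first if the head is flagged. */
bool dispatch(req_t *out) {
    if (mem_buf.n == 0) return false;
    if (mem_buf.e[0].reorder) reorder_phase();
    req_t r = mem_buf.e[0];
    for (int i = 1; i < mem_buf.n; i++) mem_buf.e[i - 1] = mem_buf.e[i];
    mem_buf.n--;
    int s = sync_find(r.addr);               /* synchronisation access? */
    if (s >= 0 && --sync_buf[s].cnt == 0)
        sync_buf[s] = sync_buf[--sync_n];    /* last phase done: retire entry */
    *out = r;
    return true;
}
```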

B. Consistency and Atomicity of RMW operations

A mandatory requirement is to maintain consistency and atomicity of RMW operations, meaning that the parallel program must still execute functionally correctly when using the split-phase synchronisation technique. Atomicity of split RMW operations requires that 1) the accessed variable is not changed by accesses other than the ongoing RMW operation, and 2) the RMW operation cannot finish incompletely (e.g. the store phase never finishing). Only load/store accesses to other variables are brought forward and executed between the load/modification phase and the store phase of a split RMW operation. Therefore, the accessed variable is not changed between the load/modification phase and the store phase, and requirement 1) holds. Also, the logic of the reordering phase asserts that every waiting memory request is dispatched eventually, that is, the waiting time for every access to finish has an upper bound. So 2) is also satisfied, and thus the split-phase synchronisation technique does not breach the atomicity of RMW operations.

We assume that the programmer takes care of explicit synchronisation, e.g. critical sections are secured with locks and temporal dependencies are handled with barriers, both implemented with RMW operations as detailed in [1]. Also, we presume that the hardware and software implement weak consistency as described in Section III-B. However, we must assure that the split-phase synchronisation technique maintains the consistency model. Requirement 1), strongly ordered accesses to synchronisation variables, is maintained by the use of reorder flags in the augmented memory controller and the atomicity of RMW operations. The other two requirements are trivially maintained by the MERASA processor, because due to in-order program execution in the cores, only one memory request per core can be dispatched at a time to the memory controller (see Section III). In this paper, we assume only one single memory controller. However, it is possible to extend our approach to architectures with multiple memory controllers, if the needed logic for the split-phase synchronisation technique is implemented in each memory controller.

V. EVALUATION

Approaches to estimate the WCET of critical tasks have received much attention in the last fifteen years [27]. Those based on static analysis techniques aim at determining guaranteed upper bounds on the real WCET, so-called worst-case guarantees, taking into account the specificities of the target hardware. In this work, we use the open-source static WCET analysis tool OTAWA, which implements state-of-the-art algorithms for WCET analysis [3]. It supports the target multi-core architecture, the MERASA architecture, and accounts for possible contentions on the shared bus and memory controller by considering WCMLs. Please note that considering WCMLs is safe only for processors that are free from timing anomalies [28]. Otherwise, all possible latency values would have to be considered.

A. WCMLs without split-phase synchronisation

To determine the WCMLs of the different HRT memory requests, namely a load, a store, a TAS, or a F&I/F&D operation, two situations need to be covered. On the one hand, as we employ SMT cores, a HRT memory request might be delayed by a NHRT memory request on the same core that was, in the worst case, dispatched just one cycle before the HRT memory request is ready to be dispatched. One must assume that this NHRT memory request is a RMW memory request, that is, the type of memory request that takes the longest time in our architecture. In the following, this delay is denoted Tmax. So, when analysing the WCML of a HRT memory request from one core in an N-core processor, an additional delay of Tmax, introduced by a NHRT memory request, has to be taken into account. On the other hand, additional delays on the analysed HRT memory request are introduced by memory requests of other cores. For an N-core processor, this adds an additional delay of (N − 1) · Tmax, as in the worst case the memory requests of each of the other N − 1 cores are handled before the analysed HRT memory request. Also, the extra bus cycle TB to return a value from the memory controller to the core needs to be taken into account. Finally, THRT must be added, which is the time the HRT memory request itself takes. The bus cycle time TB only needs to be taken into account for the NHRT and HRT memory accesses of the analysed core, as by employing a full-duplex bus, the other bus cycle times are hidden (see Fig. 5). In summary, the worst-case memory latency TWCML in the N-core MERASA processor adds up to:

$$T_{WCML} = \underbrace{T_{HRT} + T_B}_{\text{HRT access}} + \underbrace{T_{max} + T_B}_{\text{NHRT access}} + \underbrace{(N-1) \cdot T_{max}}_{\text{other } N-1 \text{ cores}} \quad (1)$$

Equation 1 can be combined and rewritten as:

$$T_{WCML} = T_{HRT} + 2 \cdot T_B + N \cdot T_{max} \quad (2)$$

In the WCET model of the MERASA multi-core processor, the bus cycle time is assumed to be 1 cycle. A load is assumed to take 5 cycles in the memory controller, whereas a store takes 4 cycles. A store operation is handled faster than a load operation, as no actual return value needs to be transferred back to the core. However, a notification that the store has successfully finished is returned over the bus to the core, so the store operation does not spare the bus cycle time after the memory controller finishes the store operation. This notification is needed, as only then does the core dispatch the next waiting memory access. The RMW operations, that is the TAS and the F&I/F&D operations, consisting of a load, a modification, and a store, take more time. For a TAS operation, no actual modification needs to be done; a TAS operation just needs to load a value, and then store back a constant value (always a '1'). Hence, a TAS operation takes 9 cycles, that is the sum of the 5 cycles of a load operation and the 4 cycles of a store operation. For a F&I respectively a F&D operation, the loaded value needs to be incremented or decremented. Thus, an additional cycle is needed to modify the loaded value before it is stored back. So, the time of a F&I/F&D operation sums up to 10 cycles, that is 5 cycles for the load operation, 1 cycle for the increment/decrement, and 4 cycles for the store. Including the bus cycle time, it is possible to derive the WCML TWCML with Equation 2 above. Table I presents the different WCMLs in the quad-core MERASA WCET model for a load, a store, and the two implemented RMW operations of a HRT thread without the split-phase synchronisation technique.
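As a worked check of Equation 2, the following snippet recomputes the "without split-phase" column of Table I from the cycle parameters stated above; it is merely an illustration of the formula, not part of any analysis tool.

```c
#include <stdio.h>

/* Cycle parameters of the quad-core MERASA WCET model as stated in the
 * text: bus cycle 1, load 5, store 4, TAS 9, F&I/F&D 10; the slowest
 * request (F&I/F&D) determines T_max. */
enum { T_B = 1, T_LOAD = 5, T_STORE = 4, T_TAS = 9, T_FI = 10, T_MAX = T_FI };

static int wcml(int t_hrt, int n) {
    return t_hrt + 2 * T_B + n * T_MAX;    /* Equation 2 */
}

int main(void) {
    const int N = 4;                        /* quad-core model */
    printf("load:    %d\n", wcml(T_LOAD,  N));  /* 47 */
    printf("store:   %d\n", wcml(T_STORE, N));  /* 46 */
    printf("TAS:     %d\n", wcml(T_TAS,   N));  /* 51 */
    printf("F&I/F&D: %d\n", wcml(T_FI,    N));  /* 52 */
    return 0;
}
```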

Fig. 5. Worst-case memory latencies in a quad-core MERASA multi-core processor for a HRT RMW operation of core 1 with the split-phase synchronisation technique.

B. WCMLs with split-phase synchronisation

We distinguish two cases to determine the WCMLs of a HRT thread's memory requests with the split-phase synchronisation technique: 1) load/store operations on non-synchronisation variables, and 2a) RMW operations respectively 2b) load/store operations on synchronisation variables.

By prioritising load/store operations in the augmented memory controller with the split-phase synchronisation technique, the WCML of a load/store from Equation 2 decreases: the load/store operation has to wait for the NHRT memory request of its own core and for the load/store operations of other cores, but not for the RMW operations of other cores. As a load operation (taking TL cycles) takes longer than a store operation, we have to assume that in the worst case the other cores issue load operations, or RMW operations on different synchronisation variables whose load phases do not have the reorder flag set. Therefore, for case 1), the WCML for a load/store on a non-synchronisation variable is calculated rather simply as:

$$T_{WCML} = T_{HRT} + 2 \cdot T_B + T_{max} + (N-1) \cdot T_L$$

For the cases 2a) and 2b) the worst-case scenario is more complex. Fig. 5 depicts that worst-case scenario for case 2a); it also covers case 2b), which, in the worst case, finishes in cycle 60 (59) for a load (store) on a synchronisation variable. To explain the worst-case scenario in detail, we introduce an ordered list of operations σ, where Lσ_x is a load operation of core x, and LPσ_y and SPσ_y are the load phase and store phase of a RMW operation of core y. SPσ*_y stands for the store phase of a RMW operation of core y with the reorder flag set (see Section IV-A3), that is an operation with lower priority. Keep in mind that in the reorder phase of the split-phase synchronisation technique an operation SPσ*_y transforms into SPσ_y when the reorder flag is deleted. With the consistency requirement in the MERASA processor that only one memory operation per core can be active at a time, and with N cores, we get an ordered list of memory operations Lσ_2 > Lσ_3 > ... > Lσ_N > LPσ_1 > SPσ*_1 in the memory controller, where Lσ_2 > Lσ_3 means that Lσ_2 is executed before Lσ_3. For the worst-case scenarios above (see Section V-A) that ordered list never changed, as no σ* operations were involved, that is, no memory operations on synchronisation variables. Therefore, the worst case was rather simple to compute. For the cases 2a) and 2b), σ* operations need to be covered. The worst-case scenario for a memory operation of core 1 is then, after cycle 14 in Fig. 5, as follows: LPσ_2 > SPσ*_2 > ... > LPσ*_N > SPσ*_N > LPσ*_1 > SPσ*_1. Now, we need to assume that once one of the other cores finishes its memory operation, it sends a new memory request. To represent the worst case, these memory operations need to be σ operations (e.g. the loads of core 2 in cycles 26, 40, 59), as then these memory operations are executed before the σ_1 operations (see cycles 54 and 74 of core 1 in Fig. 5). For σ* operations of the other cores this would not hold, as they would be executed after the σ* operations of core 1, and therefore would not represent the worst case. Taking this into account, in the worst case $\sum_{i=1}^{N-1} i$ operations Lσ are executed before the SPσ*_1 operation. In summary, for an N-core processor with N > 2, the WCML can then be computed as:

$$T_{WCML} = 2 \cdot T_B + (N+1) \cdot T_{max} + \frac{N \cdot (N-1)}{2} \cdot T_L - (N-1)$$

For case 2b), as mentioned above, the WCML of loads/stores on synchronisation variables is similar to the WCML of RMW operations, but the store and modification phases are omitted. An access to a synchronisation variable then starts, in the worst case, in the same cycle as the load phase of a RMW operation in core 1 as depicted in Fig. 5, but finishes already in cycle 59 (store) respectively cycle 60 (load). The WCML is then calculated for N > 2 as:

$$T_{WCML} = T_{HRT} + 2 \cdot T_B + N \cdot T_{max} + (N-1) \cdot T_L - (N-2)$$
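Analogously, evaluating the three split-phase formulas with the quad-core parameters reproduces the "with split-phase" column of Table I; again only a worked example under the stated model parameters.

```c
#include <stdio.h>

/* Quad-core MERASA WCET model parameters: T_B = 1, T_L = 5 (load),
 * T_max = 10 (slowest request), N = 4 cores. */
enum { T_B = 1, T_L = 5, T_MAX = 10, N = 4 };

/* Case 1: load/store on a non-synchronisation variable. */
static int wcml_ldst(int t_hrt)      { return t_hrt + 2*T_B + T_MAX + (N-1)*T_L; }

/* Case 2a: RMW operation on a synchronisation variable (N > 2). */
static int wcml_rmw(void)            { return 2*T_B + (N+1)*T_MAX + N*(N-1)/2*T_L - (N-1); }

/* Case 2b: load/store on a synchronisation variable (N > 2). */
static int wcml_sync_ldst(int t_hrt) { return t_hrt + 2*T_B + N*T_MAX + (N-1)*T_L - (N-2); }

int main(void) {
    printf("load/store:         %d/%d\n", wcml_ldst(5), wcml_ldst(4));           /* 32/31 */
    printf("load/store (sync):  %d/%d\n", wcml_sync_ldst(5), wcml_sync_ldst(4)); /* 60/59 */
    printf("RMW (TAS, F&I/F&D): %d\n",    wcml_rmw());                           /* 79 */
    return 0;
}
```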

TABLE I. WCMLs with and without split-phase synchronisation for a HRT thread in the quad-core MERASA WCET model.

Memory operation      WCML    WCML (with split-phase)
load                  47      32
store                 46      31
load/store (sync)     47/46   60/59
TAS                   51      79
F&I/F&D               52      79

In Table I we depict the WCMLs of memory accesses on a quad-core MERASA processor with and without the split-phase synchronisation technique. The WCML of normal loads/stores is decreased by 15 cycles, whereas the WCML increases by 13 cycles for a load/store on a synchronisation variable, respectively by 27/28 cycles for a RMW operation.

C. Impact on Pessimism in the WCET

One major source of pessimism in the WCET is the lack of knowledge about parallel accesses to shared resources in parallel programs. From Table I we can calculate the correlation between the types of memory accesses and the impact on the estimated WCET in a quad-core MERASA processor. If n denotes the fraction of executed normal loads/stores, and m the fraction of executed RMW and load/store operations on synchronisation variables in the worst-case path of a parallelised HRT program, the split-phase synchronisation technique produces better upper bounds if:

$$32 \cdot n + 79 \cdot m \le 47 \cdot n + 52 \cdot m$$
$$\Rightarrow\ 32 \cdot n + 79 \cdot (1-n) \le 47 \cdot n + 52 \cdot (1-n)$$
$$\Rightarrow\ n \ge \frac{27}{42} \approx 64.3\,\%$$

with $n, m \in [0, 1] \wedge n + m = 1$. Solving the inequation shows that if more than 64.3 % of the executed memory operations are loads/stores, or, in other words, if less than 35.7 % of all executed memory operations in the worst-case path are operations on synchronisation variables, the split-phase synchronisation technique produces lower upper bounds. However, this result gives only a hint when looking at the source or binary code of a parallel program, as it denotes the correlation between executed memory operations in the worst-case path of the program. Still, if we consider that parallel programs mostly contain only few synchronisation operations, e.g. many load operations are needed for instruction fetches alone, we can conclude that the split-phase synchronisation technique is beneficial for the estimated WCETs of almost all parallelised programs. Certainly, this may not hold for a high number of cores, as e.g. the equation for the WCMLs of RMW operations on synchronisation variables includes the number of cores N as a quadratic term. However, we think that 8 cores connected over a shared bus to one memory controller is a feasible upper limit for a shared-memory multi-core processor [4].

D. WCET Analysis of Parallel Programs

Our target architecture and system software include support to start all threads simultaneously, so that the WCET of the program is the WCET of the longest-running thread. The difficulty is then to account for the waiting times at any synchronisation point. In [1], we show how these waiting times can be analysed for a wide set of primitives, and we exploit those results in the context of full parallel programs. In brief, computing the waiting time linked to a lock/semaphore synchronisation function consists in determining the worst-case time during which the synchronisation variable could be held by another thread. This is done by analysing the WCET of all possible paths from any point where the variable is locked to any point where it is released. As far as barriers are concerned, the longest thread is, by definition, the one that reaches the barrier last; this thread then does not wait at this point. The approach is further detailed in [2]. To analyse the impact of the split-phase synchronisation technique on the WCETs of parallel programs, we employed two different parallelised programs: a data-parallel version of a matrix multiplication (matmul), and a data-parallel, consumer-producer Integer Fast Fourier Transformation (IFFT).

We use a dynamically partitioned version of matmul, that is the matrix multiplication A = B · C, which has been partitioned into working units consisting of the scalar multiplications in row i of A_ij = B_i · C_j. Each row is computed by one thread, and fetching the next row/working unit is secured by either a mutex lock, a binary semaphore, or alternatively a ticket lock. Matmul can usually be parallelised rather simply without any locks, e.g. statically. However, we have chosen a dynamically partitioned version to study the effects of different software synchronisations and the split-phase synchronisation technique on a parallelised program with a rather balanced synchronisation-to-computation ratio. A sketch of such a worker loop is given below.
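To make the synchronisation-to-computation ratio concrete, here is a sketch of such a worker; lock()/unlock() are placeholders for the HRT capable mutex, semaphore or ticket lock of [1], and all identifiers are ours, not taken from the benchmark sources.

```c
#define DIM 30                       /* matrix dimension used in Table II */

/* Placeholders for the HRT capable lock primitives of [1]. */
typedef struct lock lock_t;
extern void lock(lock_t *l);
extern void unlock(lock_t *l);

static int next_row;                 /* shared work counter: next row of A */
static int A[DIM][DIM], B[DIM][DIM], C[DIM][DIM];

/* Each thread repeatedly fetches the next row (the critical section)
 * and then computes that row of A = B * C (the computation part). */
void matmul_worker(lock_t *l) {
    for (;;) {
        lock(l);                     /* secure the work-unit fetch */
        int i = next_row++;
        unlock(l);
        if (i >= DIM) return;        /* all rows handed out */
        for (int j = 0; j < DIM; j++) {
            int sum = 0;
            for (int k = 0; k < DIM; k++)
                sum += B[i][k] * C[k][j];
            A[i][j] = sum;
        }
    }
}
```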

The IFFT program has been parallelised based on an integer version of the iterative radix-2 algorithm, which works in place and stores all samples in an array. In our parallelised version, for N samples the pairwise combination and rearranging in each of the k = log2(N) stages is done in parallel. Each thread independently combines a pair of samples, and, as in the above version of the matmul program, the fetching of the next working unit is secured using a mutex lock, a ticket lock, or respectively a binary semaphore. After each stage, we use a barrier to assure that all threads have finished their computation for the current stage before beginning to compute the results of the next stage. The barriers have been implemented either using F&I barriers, or accordingly the subbarrier implementation. Details on the used synchronisation techniques are given in [1].
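The stage structure can be sketched as follows; fetch_pair, combine_pair and barrier_wait are hypothetical names standing for the lock-protected work-unit fetch, the radix-2 butterfly, and the F&I barrier or subbarrier implementation of [1], respectively.

```c
/* Hypothetical helpers, not part of the actual benchmark code. */
extern int  log2_int(int n);                    /* integer log2 */
extern int  fetch_pair(void);                   /* lock-protected fetch; -1 if stage done */
extern void combine_pair(int stage, int pair);  /* radix-2 butterfly on one pair */
extern void barrier_wait(void);                 /* F&I barrier or subbarrier from [1] */

/* Each thread processes pairs of samples within a stage; the barrier
 * guarantees that stage s is complete before stage s+1 starts. */
void ifft_worker(int n_samples) {
    int stages = log2_int(n_samples);           /* k = log2(N) stages */
    for (int s = 0; s < stages; s++) {
        int p;
        while ((p = fetch_pair()) >= 0)         /* dynamic work distribution */
            combine_pair(s, p);
        barrier_wait();                         /* all threads finish stage s */
    }
}
```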

E. Results

Wilhelm et al. [27] define the timing predictability of a real-time system as the difference between an estimated lower bound and an estimated upper bound, with lower bound ≤ BCET ≤ WCET ≤ upper bound. In other words, if the upper bound can be estimated as tightly as possible to the unknown WCET, while the lower bound does not change, the timing predictability increases. Also, they define worst-case performance as the real, but unknown, WCET, and worst-case guarantee as the estimated upper bound. We define the WCET improvement as the ratio WCETref/WCETnew, where WCETnew is the estimated upper bound of the program version for which we calculate the WCET improvement, and WCETref is the estimated upper bound of the program's reference implementation. For example, the reference upper bound could be the estimated WCET of a single-threaded program, whereas WCETnew is the estimated WCET of an n-threaded implementation of that program.

We do not estimate lower bounds, and therefore we cannot make assumptions on the timing predictability as defined in [27]. For instance, with reference to the previous example, it might be possible to achieve better lower bounds for the n-threaded program than for the single-threaded one; thus the timing predictability might not change when obtaining lower upper bounds. However, for dimensioning a HRT system, usually only the worst-case guarantee is taken into account. Hence, achieving an improvement of the worst-case guarantees could decrease the costs of over-dimensioning such a HRT system. For this reason, we delineate in this paper the improvement of worst-case guarantees of parallelised HRT programs with different synchronisation techniques, and with and without the split-phase synchronisation technique.

The WCET estimates presented in Table II, that is the estimated upper bounds, have been derived from the WCET model of the MERASA quad-core processor. The parallel programs have been implemented with three kinds of primitives to guard critical sections: mutex locks, binary blocking semaphores and ticket locks. In addition, IFFT includes synchronisation barriers and was compiled with barriers implemented using subbarriers and conditionals [29] or F&I instructions. Details on the used synchronisations have already been presented in [1]; here we focus on the impact of the split-phase synchronisation technique on the WCET improvement of those parallel programs.

Fig. 6 depicts the WCET improvement of the analysed four-threaded IFFT program. The WCET improvement is normalised to the reference WCET estimate derived from the parallelised IFFT with mutex locks, conditional subbarriers, and without the split-phase synchronisation technique. F&I barriers outperform subbarriers, and ticket locks outperform binary semaphores and mutex locks, but the main point is the improvement of the WCET guarantees when using the split-phase synchronisation technique.

The results in Table II show an improved estimated WCET when using the split-phase synchronisation technique of up to 1.23 (with ticket locks) for the IFFT program with conditional subbarriers, and a WCET improvement of up to 1.26 with F&I barriers and ticket locks. From Table II, the WCET improvement for matmul with the split-phase synchronisation technique is up to 1.47, that is for the matmul program with ticket locks. Overall, when taking all software synchronisations into consideration, the WCET improvement using the split-phase synchronisation technique is up to 2.9 for the parallelised IFFT program.

TABLE II. WCET estimates (# cycles) of parallelised HRT programs analysed on a quad-core MERASA processor with and without the split-phase synchronisation technique applied.

Parallelised program            mutex      semaphore  ticket lock
matmul (DIM=30)                 1,347,342  1,041,525  938,312
- with split-phase (DIM=30)     1,053,267  832,725    639,332
IFFT (conditional subbarriers)  233,921    196,085    183,936
- with split-phase              195,734    171,470    147,360
IFFT (F&I barriers)             156,664    110,529    102,252
- with split-phase              134,164    103,320    80,688

Fig. 6. WCET improvements on a quad-core MERASA processor for the parallelised IFFT using three different software synchronisations, and the augmented memory controller with and without split-phase synchronisation. The plotted improvement factors (normalised to mutex locks with conditional subbarriers, without split-phase) are:

             basic (subbarrier)  split-phase (subbarrier)  basic (F&I barrier)  split-phase (F&I barrier)
mutex lock   1.0                 1.2                       1.5                  1.7
semaphore    1.2                 1.4                       2.1                  2.3
ticket lock  1.3                 1.6                       2.3                  2.9

That is, the IFFT version with conditional subbarriers, mutex locks and without the split-phase synchronisation technique is compared to the IFFT version with F&I barriers, ticket locks, and with the split-phase synchronisation technique (233,921 / 80,688 ≈ 2.9). For matmul there is a similar WCET improvement of up to 2.1 for the version with ticket locks and the split-phase synchronisation technique, compared to the matmul version with mutex locks and without the split-phase synchronisation technique (1,347,342 / 639,332 ≈ 2.1).

VI. CONCLUSION

Future performance requirements of safety-critical systems will soon motivate the design of parallel programs running on multi-cores. However, this will require predictable hardware and software support, in particular to implement safe and efficient inter-thread synchronisation. Also, parallel programs introduce pessimism in the WCET because of the lack of information on synchronisation and waiting times. In this paper, we investigate a solution for such problems in HRT capable multi-core processors with the split-phase synchronisation technique. True to the motto "make the frequent case fast", the split-phase synchronisation technique reduces the WCMLs of frequent loads/stores while sacrificing the worst-case performance of RMW operations. We show that implementing such a technique in hardware is possible, and that consistency and atomicity are maintained. We evaluate the gain in the worst-case guarantees of different parallelised HRT programs as WCET improvement, and the split-phase synchronisation achieves WCET improvements of up to 2.9 for a parallelised IFFT program.

As future challenges to further reduce the pessimism and effort in static WCET analyses of parallelised HRT programs, we see the need for an integrated approach of e.g. developing parallel programs with parallel design patterns [30] which also provide some sort of annotations for the static WCET analysis. In that way, for instance, the pessimism introduced by not knowing when what happens in parallel programs, especially for concurrent accesses to shared resources, should be further reduced. Also, the use of parallel design patterns should help programmers to better estimate the impact of the program's design on its functional and non-functional behaviour. In [31], the authors state that upcoming and today's standards, e.g. ISO 26262 in the automotive domain, require proving the correctness of non-functional behaviour, that is timing. We think that the use of timing-analysable multi-core processors and the support of predictable, HRT capable synchronisation techniques in the RTOS is mandatory for providing safe and low WCET guarantees with static WCET analysis tools. For those reasons, we plan to investigate in our future work how selected parallel design patterns could provide significant information to improve the WCET analyses of parallel HRT programs. Also, we intend to implement HRT capable, timing-predictable implementations of lock-free and wait-free data structures and evaluate their impact on the WCET guarantees of parallel programs. This might be of especially high interest for future multi-core architectures with high core counts, which are not connected over a shared bus, but via a network-on-chip (NoC).

ACKNOWLEDGMENTS

Part of this research has been supported by the EC FP7 project parMERASA under Grant Agreement No. 287519.

REFERENCES

[1] M. Gerdes, F. Kluge, T. Ungerer, C. Rochange, and P. Sainrat, “Time Analysable Synchronisation Techniques for Parallelised Hard Real-Time Applications,” in Proc. of Design, Automation and Test in Europe (DATE’12), March 2012, pp. 671–676.

[2] C. Rochange, A. Bonenfant, P. Sainrat, M. Gerdes, J. Wolf, T. Ungerer, Z. Petrov, and F. Mikulu, “WCET Analysis of a Parallel 3D Multigrid Solver Executed on the MERASA Multi-Core,” in 10th Int’l Workshop on WCET Analysis (WCET 2010), vol. 268, July 2010, pp. 92–102.

[3] C. Ballabriga, H. Casse, C. Rochange, and P. Sainrat, “OTAWA: An Open Toolbox for Adaptive WCET Analysis,” in Software Technologies for Embedded and Ubiquitous Systems, 2011, vol. 6399, pp. 35–46.

[4] T. Ungerer, F. Cazorla, P. Sainrat, G. Bernat, Z. Petrov, C. Rochange, E. Quinones, M. Gerdes, M. Paolieri, J. Wolf, H. Casse, S. Uhrig, I. Guliashvili, M. Houston, F. Kluge, S. Metzlaff, and J. Mische, “MERASA: Multicore Execution of HRT Applications Supporting Analyzability,” IEEE Micro, vol. 30, pp. 66–75, 2010.

[5] M. Monchiero, G. Palermo, C. Silvano, and O. Villa, “An Efficient Synchronization Technique for Multiprocessor Systems on-Chip,” in Proc. of MEDEA, 2005, pp. 33–40.

[6] S. Liu and J.-L. Gaudiot, “Synchronization Mechanisms on Modern Multi-core Architectures,” in Advances in Computer Systems Architecture. Springer Berlin/Heidelberg, 2007, vol. 4697, pp. 290–303.

[7] M. Paolieri, E. Quinones, F. Cazorla, and M. Valero, “An Analyzable Memory Controller for Hard Real-Time CMPs,” IEEE Embedded Systems Letters, vol. 1, no. 4, pp. 86–90, Dec. 2009.

[8] B. Akesson, K. Goossens, and M. Ringhofer, “Predator: A Predictable SDRAM Memory Controller,” in Proc. of the 5th Int’l Conf. on HW/SW Codesign and System Synthesis (CODES+ISSS’07), 2007, pp. 251–256.

[9] J. Reineke, I. Liu, H. D. Patel, S. Kim, and E. A. Lee, “PRET DRAM Controller: Bank Privatization for Predictability and Temporal Isolation,” in Proc. of the 7th IEEE/ACM/IFIP Int’l Conf. on Hardware/Software Codesign and System Synthesis (CODES+ISSS’11), 2011, pp. 99–108.

[10] A. Gustavsson, A. Ermedahl, B. Lisper, and P. Pettersson, “Towards WCET Analysis of Multicore Architectures using UPPAAL,” in Proc. Int’l Workshop on WCET Analysis (WCET 2010), 2010, pp. 103–113.

[11] M. Paolieri, E. Quinones, F. J. Cazorla, G. Bernat, and M. Valero, “Hardware Support for WCET Analysis of Hard Real-Time Multicore Systems,” in Proc. 36th Int’l Symposium on Computer Architecture (ISCA’09), 2009, pp. 57–68.

[12] M.-K. Yoon, J.-E. Kim, and L. Sha, “Optimizing Tunable WCET with Shared Resource Allocation and Arbitration in HRT Multicore Systems,” in Real-Time Systems Symposium (RTSS’11), 2011, pp. 227–238.

[13] A. Andrei, P. Eles, Z. Peng, and J. Rosen, “Predictable Implementation of Real-Time Applications on Multiprocessor Systems-on-Chip,” in 21st Int’l Conf. on VLSI Design (VLSID 2008), Jan. 2008, pp. 103–110.

[14] A. Schranzhofer, J.-J. Chen, and L. Thiele, “Timing Analysis for TDMA Arbitration in Resource Sharing Systems,” in Real-Time and Embedded Technology and Applications Symposium (RTAS), 2010, pp. 215–224.

[15] J. Staschulat, S. Schliecker, M. Ivers, and R. Ernst, “Analysis of Memory Latencies in Multi-Processor Systems,” in 5th Int’l Workshop on Worst-Case Execution Time (WCET) Analysis, 2007.

[16] J. Wolf, M. Gerdes, F. Kluge, S. Uhrig, J. Mische, S. Metzlaff, C. Rochange, H. Casse, P. Sainrat, and T. Ungerer, “RTOS Support for Parallel Execution of Hard Real-Time Applications on the MERASA Multi-core Processor,” in Proc. of IEEE ISORC’10, 2010, pp. 193–201.

[17] J. Mische, I. Guliashvili, S. Uhrig, and T. Ungerer, “How to Enhance a Superscalar Processor to Provide Hard Real-Time Capable In-Order SMT,” in Proc. 23rd Int’l Conf. on Architecture of Computing Systems (ARCS’10), vol. 5974, February 2010, pp. 2–14.

[18] S. Metzlaff, I. Guliashvili, S. Uhrig, and T. Ungerer, “A Dynamic Instruction Scratchpad Memory for Embedded Processors Managed by Hardware,” in 24th Int’l Conf. on ARCS, February 2011, pp. 122–134.

[19] L. Lamport, “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs,” IEEE Trans. on Computers, vol. C-28, no. 9, pp. 690–691, September 1979.

[20] M. Dubois, C. Scheurich, and F. A. Briggs, “Memory Access Buffering in Multiprocessors,” in Proc. 13th Annual Int’l Symposium on Computer Architecture, vol. 14, no. 2, June 1986, pp. 434–442.

[21] S. V. Adve and M. D. Hill, “Weak Ordering - A New Definition,” in 25 Years of the Int’l Symposia on Computer Architecture (Selected Papers), ISCA’98, 1998, pp. 363–375.

[22] C. P. Kruskal, L. Rudolph, and M. Snir, “Efficient Synchronization of Multiprocessors with Shared Memory,” ACM Trans. Program. Lang. Syst., vol. 10, pp. 579–601, October 1988.

[23] L. D. Molesky, C. Shen, and G. Zlokapa, “Predictable Synchronization Mechanisms for Multiprocessor Real-Time Systems,” Real-Time Systems, vol. 2, pp. 163–180, 1990.

[24] G. Tian and O. Hammami, “Performance Measurements of Synchronization Mechanisms on 16PE NoC Based Multi-Core with Dedicated Synchronization and Data NoC,” in 16th IEEE Int’l Conf. on Electronics, Circuits, and Systems (ICECS 2009), December 2009, pp. 988–991.

[25] R. L. Sites, “Alpha AXP architecture,” Commun. ACM, vol. 36, pp. 33–44, February 1993.

[26] D. Culler, A. Dusseau, S. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick, “Parallel Programming in Split-C,” in Proc. of Supercomputing ’93, Nov. 1993, pp. 262–273.

[27] R. Wilhelm, J. Engblom, A. Ermedahl, N. Holsti, S. Thesing, D. Whalley, G. Bernat, C. Ferdinand, R. Heckmann, T. Mitra, F. Mueller, I. Puaut, P. Puschner, J. Staschulat, and P. Stenstrom, “The Worst-Case Execution Time Problem—Overview of Methods and Survey of Tools,” ACM Trans. on Embedded Computing Systems (TECS), vol. 7, no. 3, 2008.

[28] J. Reineke and R. Sen, “Sound and Efficient WCET Analysis in the Presence of Timing Anomalies,” in 9th Int’l Workshop on WCET Analysis (WCET 2009), 2009.

[29] R. Marejka, “A Barrier for Threads,” SunOpsis - The Solaris 2.0 Migration Support Centre Newsletter, vol. 4, no. 1, November 1994.

[30] B. L. Massingill, T. G. Mattson, and B. A. Sanders, “More Patterns for Parallel Application Programs,” in Proc. of the 8th Pattern Languages of Programs Workshop (PLoP 2001), September 2001.

[31] R. Johansson and T. Heurung, “ISO-26262 Implications on Timing of Automotive E/E System Design Processes,” SAE Technical Paper 2009-01-0743, 2009.