
Page 1: Parallel Programming and Timing Analysis  on Embedded Multicores

Parallel Programming and Timing Analysis on Embedded Multicores

Eugene Yip, The University of Auckland

Supervisors: Dr. Partha Roop, Dr. Morteza Biglari-Abhari
Advisor: Dr. Alain Girault

Page 2: Parallel Programming and Timing Analysis  on Embedded Multicores

Outline

• Introduction
• ForeC language
• Timing analysis
• Results
• Conclusions

Page 3: Parallel Programming and Timing Analysis  on Embedded Multicores

Introduction

• Safety-critical systems:
– Perform specific tasks.
– Must behave correctly at all times.
– Must comply with strict safety standards [IEC 61508, DO-178].
– Time-predictability is useful in real-time designs.

[Paolieri et al 2011] Towards Functional-Safe Timing-Dependable Real-Time Architectures.

Page 4: Parallel Programming and Timing Analysis  on Embedded Multicores

Introduction

• Safety-critical systems:
– Shift from single-core to multicore processors.
– Better power and execution performance.

[Figure: multicore architecture with cores Core0 … Coren and shared resources connected by a system bus.]

[Blake et al 2009] A Survey of Multicore Processors.
[Cullmann et al 2010] Predictability Considerations in the Design of Multi-Core Embedded Systems.

Page 5: Parallel Programming and Timing Analysis  on Embedded Multicores

Introduction

• Parallel programming:
– Moved from supercomputers to mainstream computers.
– Threaded programming model.
– Frameworks designed for systems without resource constraints or safety concerns.
– Aimed at improving average-case performance (FLOPS), not time-predictability.

Page 6: Parallel Programming and Timing Analysis  on Embedded Multicores

Introduction

• Parallel programming:
– The programmer is responsible for managing shared resources.
– Concurrency errors:
• Deadlock
• Race condition
• Atomicity violation
• Order violation
– Non-deterministic thread interleaving.
– Determinism is essential for understanding and debugging.

[McDowell et al 1989] Debugging Concurrent Programs.

Page 7: Parallel Programming and Timing Analysis  on Embedded Multicores

Introduction

• Synchronous languages:
– Deterministic concurrency.
– Based on the synchrony hypothesis.
– Threads execute in lock-step with a global clock.
– Concurrency is logical and is typically compiled away.

[Benveniste et al 2003] The Synchronous Languages 12 Years Later.

[Figure: execution over global ticks 1–4, with inputs sampled and outputs emitted at each tick.]

Page 8: Parallel Programming and Timing Analysis  on Embedded Multicores

Introduction

• Synchronous languages

[Figure: reaction times within physical time, ticks 1–4. The time between ticks is defined by the timing requirements of the system.]

Must validate: max(reaction time) < min(time between each tick)

[Benveniste et al 2003] The Synchronous Languages 12 Years Later.

Page 9: Parallel Programming and Timing Analysis  on Embedded Multicores

Introduction

• Synchronous languages:
– Esterel
– Lustre
– Signal
– Synchronous extensions to C:
• PRET-C
• Reactive C with shared variables
• Synchronous C (SC – see Michael’s talk)
• Esterel C Language

[Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs.
[Boussinot 1993] Reactive Shared Variables Based Systems.
[Hanxleden et al 2009] SyncCharts in C – A Proposal for Light-Weight, Deterministic Concurrency.
[Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.

Concurrent threads are scheduled sequentially in a cooperative manner.

Atomic execution of threads ensures thread-safe access to shared variables.

Page 10: Parallel Programming and Timing Analysis  on Embedded Multicores

Introduction

• Synchronous languages:
– Esterel
– Lustre
– Signal
– Synchronous extensions to C:
• PRET-C
• Reactive C with shared variables
• Synchronous C (SC – see Michael’s talk)
• Esterel C Language

[Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs.
[Boussinot 1993] Reactive Shared Variables Based Systems.
[Hanxleden et al 2009] SyncCharts in C – A Proposal for Light-Weight, Deterministic Concurrency.
[Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.

Writes to shared variables are delayed to the end of the global tick.

At the end of the global tick, the writes are combined and assigned to the shared variable.

Associative and commutative “combine function”.

Page 11: Parallel Programming and Timing Analysis  on Embedded Multicores

Outline

• Introduction
• ForeC language
• Timing analysis
• Results
• Conclusions

Page 12: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language (“Foresee”)

• Deterministic parallel programming of embedded multicores.
• C with a minimal set of synchronous constructs for deterministic parallelism.
• Fork/join parallelism (explicit).
• Shared memory model.
• Deterministic thread communication using shared variables.

Page 13: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

• Constructs (a small sketch follows below):
– par(t1, …, tn)
• Fork threads t1 to tn to execute in parallel, in any order.
• The parent thread is suspended until all child threads terminate.
– thread t1(...) {b}
• Thread definition.
– pause
• Synchronisation barrier.
• When a thread pauses, it completes a local tick.
• When all threads have paused, the program completes a global tick.
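To make the fork/join and pause semantics concrete, here is a minimal hedged sketch in ForeC-style C, using only the constructs listed above; the thread names t_a and t_b and the loop bounds are invented for illustration:

thread t_a(void) {
  int i;
  for (i = 0; i < 3; i++) {
    pause;               // t_a completes a local tick; the global tick ends
  }                      // once t_b has also paused or terminated
}

thread t_b(void) {
  pause;                 // t_b pauses twice, then terminates
  pause;
}

void main(void) {
  par(t_a(), t_b());     // fork; main is suspended until t_a and t_b terminate
}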

Page 14: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

• Constructs (a small sketch follows below):
– abort {b} when (c)
• Preempts the body b when the condition c is true. The condition is checked before executing the body.
– weak abort {b} when (c)
• Preempts the body when the body reaches a pause and the condition c is true. The condition is checked before executing the body.
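A hedged sketch of the two preemption constructs; the condition variable stop is an assumed environment input and does not come from the slides:

input int stop;          // assumed input used as the preemption condition

void main(void) {
  abort {
    while (1) {
      // body b: work performed each tick
      pause;
    }
  } when (stop);         // strong abort: checked before the body executes each tick

  weak abort {
    while (1) {
      pause;             // weak abort: preempted when the body reaches a pause
    }                    // and stop is true
  } when (stop);
}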

Page 15: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

• Variable type qualifiers:
– input
• The variable gets its value from the environment.
– output
• The variable emits its value to the environment.
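A hedged sketch of how these qualifiers might be written at the top of a program; the variable names are invented and the exact declaration syntax is an assumption:

input  int sensor;    // sampled from the environment for each global tick
output int actuator;  // emitted to the environment at the end of each global tick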

Page 16: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

• Variable type qualifiers:
– shared
• A variable that may be accessed by multiple threads.
• At the start of its local tick, each thread creates local copies of the shared variables it accesses.
• During its local tick, a thread modifies only its local copies (isolation).
• At the end of the global tick, the copies that have been modified are combined using a commutative and associative function (the combine function).
• The combined result is committed back to the original shared variable (sketched below).
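The copy/combine mechanics can be pictured with this plain-C sketch of what happens to one shared variable during a global tick; the names x_global, x_t0, x_t1 and combine() are invented, and the real ForeC compiler/runtime may implement this differently:

int x_global = 0;                 // the shared variable x

int combine(int a, int b) {       // programmer-defined, associative and commutative
  return a + b;                   // e.g. summation
}

void one_global_tick(void) {
  int x_t0 = x_global;            // each thread starts its local tick with a copy
  int x_t1 = x_global;

  x_t0 = x_t0 + 1;                // threads modify only their own copies (isolation)
  x_t1 = x_t1 * 2;

  // end of the global tick: combine the modified copies and commit the result
  x_global = combine(x_t0, x_t1);
}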

Page 17: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

shared int x = 0;

void main(void) {
  x = 1;
  par(t0(), t1());
  x = x - 1;
}

thread t0(void) {
  x = 10;
  x = x + 1;
  pause;
  x = x + 1;
}

thread t1(void) {
  x = x * 2;
  pause;
  x = x * 2;
}

Page 18: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

shared int x = 0;

void main(void) {
  x = 1;
  par(t0(), t1());
  x = x - 1;
}

thread t0(void) {
  x = 10;
  x = x + 1;
  pause;
  x = x + 1;
}

thread t1(void) {
  x = x * 2;
  pause;
  x = x * 2;
}

[Figure: Concurrent Control-Flow Graph (CCFG) of the example program. Node types: Graph Start, Graph End, Fork, Join, Computation, Condition, Pause, Abort. main executes x = 1, forks t0 (x = 10; x = x + 1; pause; x = x + 1) and t1 (x = x * 2; pause; x = x * 2), joins, then executes x = x - 1.]

Page 19: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

• Sequential control-flow along a single path.

• Parallel control-flow along branches from a fork node.

• Global tick ends when all threads pause or terminate.

Page 20: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

State of the shared variables: Global x = 0.

Page 21: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

State of the shared variables: Global x = 0.

Thread main creates a local copy of x.

Page 22: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

Thread main creates a local copy of x.

State of the shared variables: Global x = 0; main’s copy = 0.

Page 23: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

State of the shared variables: Global x = 0; main’s copy = 1.

Page 24: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

State of the shared variables: Global x = 0; main’s copy = 1.

Threads t0 and t1 take over main’s copy of the shared variable x.

Page 25: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

Threads t0 and t1 take over main’s copy of the shared variable x.

State of the shared variables: Global x = 0; t0’s copy = 1; t1’s copy = 1.

Page 26: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

State of the shared variables: Global x = 0; t0’s copy = 10; t1’s copy = 1.

Page 27: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

State of the shared variables: Global x = 0; t0’s copy = 11; t1’s copy = 1.

Page 28: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

State of the shared variables: Global x = 0; t0’s copy = 11; t1’s copy = 2.

Page 29: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

State of the shared variables: Global x = 0; t0’s copy = 11; t1’s copy = 2.

Global tick is reached.
• Combine the copies of x using a (programmer-defined) associative and commutative function.
• Assume the combine function for x implements summation.

Page 30: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

State of the shared variables: Global x = 0; t0’s copy = 11; t1’s copy = 2.

• Assign the combined value back to x.

Page 31: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

State of the shared variables: Global x = 13 (11 + 2).

• Assign the combined value back to x.

Page 32: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

Next global tick.
• Active threads create a copy of x.

State of the shared variables: Global x = 13; t0’s copy = 13; t1’s copy = 13.

Page 33: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

State of the shared variables: Global x = 13; t0’s copy = 14; t1’s copy = 13.

Page 34: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

State of the shared variables: Global x = 13; t0’s copy = 14; t1’s copy = 26.

Page 35: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

State of the shared variables: Global x = 13; t0’s copy = 14; t1’s copy = 26.

Threads t0 and t1 terminate and join back to the parent thread main.
• Local copies of x are combined into a single copy and given back to the parent thread main.

Page 36: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

Threads t0 and t1 terminate and join back to the parent thread main.
• Local copies of x are combined into a single copy (14 + 26 = 40) and given back to the parent thread main.

State of the shared variables: Global x = 13; main’s copy = 40.

Page 37: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

State of the shared variables: Global x = 13; main’s copy = 39.

Page 38: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

[CCFG of the example program]

State of the shared variables: Global x = 39.

Page 39: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

• Shared variables:
– Threads modify local copies of shared variables.
– Isolates thread execution behaviour.
– The order/interleaving of thread execution has no impact on the final result.
– Prevents concurrency errors.
– Associative and commutative combine functions.
• The order of combining doesn’t matter.

Page 40: Parallel Programming and Timing Analysis  on Embedded Multicores

Scheduling

• Light-weight static scheduling:
– Takes advantage of multicore performance while delivering time-predictability.
– Thread allocation and the scheduling order on each core are decided at compile time by the programmer.
– Cooperative (non-preemptive) scheduling.
– Fork/join semantics and the notion of a global tick are preserved via synchronisation.

Page 41: Parallel Programming and Timing Analysis  on Embedded Multicores

Scheduling

• One core performs housekeeping tasks at the end of the global tick:
– Combining shared variables.
– Emitting outputs.
– Sampling inputs and starting the next global tick.

Page 42: Parallel Programming and Timing Analysis  on Embedded Multicores

Outline

• Introduction
• ForeC language
• Timing analysis
• Results
• Conclusions

Page 43: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

• Compute the program’s worst-case reaction time (WCRT).

[Figure: reaction times within physical time, ticks 1–4. The time between ticks is defined by the timing requirements of the system.]

Must validate: max(reaction time) < min(time between each tick)

Page 44: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

Existing approaches for synchronous programs:
• Integer Linear Programming (ILP)
• Max-Plus
• Model Checking

Page 45: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

Existing approaches for synchronous programs:
• Integer Linear Programming (ILP)
– The execution time of the program is described as a set of integer equations.
– Solving ILP is known to be NP-hard.
• Max-Plus
• Model Checking

[Ju et al 2010] Timing Analysis of Esterel Programs on General-Purpose Multiprocessors.

Page 46: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

Existing approaches for synchronous programs:
• Integer Linear Programming (ILP)
• Max-Plus (a sketch of this style of estimate follows below)
– Compute the WCRT of each thread.
– The program’s WCRT is then computed from the thread WCRTs.
– Assumes there is a global tick in which all threads execute their worst case.
• Model Checking
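A hedged C sketch of the Max-Plus-style estimate described above; the function names and the way scheduling overhead is added are assumptions, and the actual analysis is more detailed:

#include <stddef.h>

// Per-core cost under the Max-Plus assumption: every thread on the core
// exhibits its worst case in the same global tick.
unsigned long core_wcrt(const unsigned long *thread_wcrt, size_t n) {
  unsigned long sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += thread_wcrt[i];          // threads on one core run sequentially
  return sum;
}

// Program WCRT estimate: the slowest core plus worst-case scheduling overheads.
unsigned long program_wcrt(const unsigned long *core_sums, size_t cores,
                           unsigned long sched_overhead) {
  unsigned long worst = 0;
  for (size_t i = 0; i < cores; i++)
    if (core_sums[i] > worst) worst = core_sums[i];
  return worst + sched_overhead;
}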

Page 47: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

Existing approaches for synchronous programs:
• Integer Linear Programming (ILP)
• Max-Plus
• Model Checking
– Eliminates false paths by explicit path exploration (reachability over the program’s CFG).
– Binary search: check whether the WCRT is less than “x” (sketched below).
– State-space explosion problem.
– Trades off analysis time for precision.
– Provides an execution trace for the WCRT.
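The binary-search use of a model checker can be sketched as follows; wcrt_at_most stands in for a model-checking query (“do all global ticks finish within the bound?”) and is purely illustrative:

typedef int (*wcrt_check_fn)(unsigned long bound);  // hypothetical model-checking query

unsigned long find_wcrt(wcrt_check_fn wcrt_at_most,
                        unsigned long lo, unsigned long hi) {
  // Binary search for the smallest bound the model checker can verify.
  while (lo < hi) {
    unsigned long mid = lo + (hi - lo) / 2;
    if (wcrt_at_most(mid))
      hi = mid;          // bound holds: the WCRT is at most mid
    else
      lo = mid + 1;      // bound violated: the WCRT is larger than mid
  }
  return lo;             // smallest verified bound = WCRT
}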

Page 48: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

• Our approach using reachability (sketched below):
– Same benefits as model checking, but a binary search for the WCRT is not required.
– To handle state-space explosion:
• Reduce the program’s CCFG before analysis.

[Flow: program binary (annotated) → reconstruct the program’s CCFG → find the global ticks (reachability) → WCRT]
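A hedged sketch of reachability over global ticks: explore every reachable global tick and keep the maximum tick cost. The TickState structure and the fixed-size worklist are invented simplifications; the actual tool works on the reconstructed CCFG of the binary:

#include <stddef.h>

#define MAX_SUCC 8

typedef struct TickState {
  unsigned long     cost;             // cost of the global tick ending in this state
  size_t            nsucc;
  struct TickState *succ[MAX_SUCC];   // states reachable in the next global tick
  int               seen;             // visited flag for pruning
} TickState;

// Explicit exploration of all reachable global ticks.
// The WCRT is the maximum cost of any single tick - no binary search needed.
unsigned long explore(TickState *init) {
  TickState *stack[4096];             // fixed-size worklist, enough for a sketch
  size_t top = 0;
  unsigned long wcrt = 0;
  stack[top++] = init;
  while (top > 0) {
    TickState *s = stack[--top];
    if (s->seen) continue;
    s->seen = 1;
    if (s->cost > wcrt) wcrt = s->cost;
    for (size_t i = 0; i < s->nsucc && top < 4096; i++)
      stack[top++] = s->succ[i];
  }
  return wcrt;
}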

Page 49: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

• Programs will execute on the following multicore:

[Figure: cores Core0 … Coren, each with private data and instruction memories, connected to a global memory by a TDMA shared bus.]

Page 50: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

• Computing the execution time:
1. Overlapping of thread execution time from parallelism and inter-core synchronisations.
2. Scheduling overheads.
3. Variable delay in accessing the shared bus.

Page 51: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

1. Overlapping of thread execution time from parallelism and inter-core synchronisations.
• An integer counter tracks each core’s execution time.
• Synchronisation occurs when forking/joining and when ending the global tick.
• At a synchronisation, the execution times of the participating cores are advanced together (sketched below).

[Figure: example thread allocation (Core 1: main, t1; Core 2: t2) and the resulting overlap of execution between fork and join on the example CCFG.]
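A hedged sketch of how the analysis might advance the per-core time counters at a synchronisation point (fork, join, or end of global tick); representing the cores as a plain array of counters is an assumption:

#include <stddef.h>

// At a synchronisation point, no participating core can proceed until the
// slowest participating core arrives, so all their counters are advanced to
// the maximum.
void synchronise(unsigned long *core_time, const size_t *cores, size_t n) {
  unsigned long latest = 0;
  for (size_t i = 0; i < n; i++)
    if (core_time[cores[i]] > latest) latest = core_time[cores[i]];
  for (size_t i = 0; i < n; i++)
    core_time[cores[i]] = latest;
}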

Page 52: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

2. Scheduling overheads.
– Synchronisation: fork/join and the global tick, via global memory.
– Thread context-switching: copying of shared variables at the start and end of a thread’s local tick, via global memory.

[Figure: Gantt chart of Core 1 (main, t1) and Core 2 (t2) over one global tick, marking synchronisation and thread context-switch overheads.]

Page 53: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

2. Scheduling overheads.
– The required scheduling routines are statically known.
– Analyse the scheduling control-flow.
– Compute the execution time of each scheduling overhead.

[Figure: example Gantt charts of the scheduling overheads for Core 1 (main, t1) and Core 2 (t2).]

Page 54: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

3. Variable delay in accessing the shared bus.
– Global memory is accessed by the scheduling routines.
– The TDMA bus delay has to be considered.

[Figure: Gantt chart of Core 1 (main, t1) and Core 2 (t2) accessing the shared bus.]

Page 55: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

3. Variable delay in accessing the shared bus.
– Global memory is accessed by the scheduling routines.
– The TDMA bus delay has to be considered.

[Figure: the same Gantt chart with the TDMA bus schedule overlaid; bus slots alternate 1, 2, 1, 2, … between Core 1 and Core 2.]

Page 56: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

3. Variable delay in accessing the shared bus.
– Global memory is accessed by the scheduling routines.
– The TDMA bus delay has to be considered (a sketch of the delay computation follows below).

[Figure: Gantt charts with the TDMA slots (1, 2, 1, 2, …); each global-memory access waits for the issuing core’s next bus slot.]
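A hedged sketch of the TDMA wait the analysis has to account for on each global-memory access; the slot length is a parameter (5 cycles per core on the simulated platform described later), and the exact arbitration details are assumptions:

// A core may only use the bus during its own slot.
// round = slot_len * num_cores (e.g. 5 cycles/core on the simulated platform).
unsigned long tdma_delay(unsigned long t,          // time the access is issued
                         unsigned core,            // index of the issuing core
                         unsigned num_cores,
                         unsigned long slot_len) {
  unsigned long round      = slot_len * num_cores;
  unsigned long slot_start = core * slot_len;      // offset of this core's slot
  unsigned long pos        = t % round;            // position within the round
  if (pos >= slot_start && pos < slot_start + slot_len)
    return 0;                                      // already inside our slot
  return (slot_start + round - pos) % round;       // cycles until our slot starts
}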

Page 57: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

• CCFG optimisations:
– merge: reduce the number of CFG nodes that need to be traversed for each local tick.
– merge-b: reduce the number of alternate paths between CFG nodes.

Page 58: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

• CCFG optimisations:
– merge: reduce the number of CFG nodes that need to be traversed for each local tick.

[Figure: a small CFG fragment with nodes of cost 1, 4, 3 and 1.]

Page 59: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

• CCFG optimisations:
– merge: reduce the number of CFG nodes that need to be traversed for each local tick.

[Figure: the same CFG fragment with nodes of cost 1, 4, 3 and 1.]

Page 60: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

• CCFG optimisations:
– merge: reduce the number of CFG nodes that need to be traversed for each local tick (a sketch follows after the figure).

[Figure: after merge, consecutive nodes are collapsed: one branch becomes a single node of cost 1 + 3 = 4, the other a single node of cost 1 + 4 + 1 = 6.]
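The merge optimisation can be illustrated with a hedged C sketch that folds runs of consecutive computation nodes into one node whose cost is the sum; the Node representation is invented and far simpler than the real CCFG:

typedef struct Node {
  unsigned long cost;       // execution cost of this computation node
  struct Node  *next;       // single successor (straight-line code only)
  int           barrier;    // non-zero for fork/join/pause nodes, which must be kept
} Node;

// merge: collapse each straight-line run of computation nodes into its head
// node, so reachability traverses one node per run instead of many.
void merge(Node *head) {
  for (Node *n = head; n != NULL; n = n->next) {
    while (!n->barrier && n->next != NULL && !n->next->barrier) {
      n->cost += n->next->cost;      // accumulate the run's cost
      n->next  = n->next->next;      // splice the merged node out
    }
  }
}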

Page 61: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

• CCFG optimisations:
– merge-b: reduce the number of possible paths between CFG nodes.
• Reduces the number of reachable global ticks.

[Figure: the merged CFG from the previous slide, with alternative branches of cost 4 and 6.]

Page 62: Parallel Programming and Timing Analysis  on Embedded Multicores

Timing analysis

• CCFG optimisations:
– merge-b: reduce the number of possible paths between CFG nodes.
• Reduces the number of reachable global ticks.

[Figure: after merge-b, the alternative branches of cost 4 and 6 are conservatively collapsed into a single path of cost 6.]

Page 63: Parallel Programming and Timing Analysis  on Embedded Multicores

Outline

• Introduction
• ForeC language
• Timing analysis
• Results
• Conclusions

Page 64: Parallel Programming and Timing Analysis  on Embedded Multicores

Results

• For the proposed reachability-based timing analysis, we demonstrate:
– the precision of the computed WCRT.
– the efficiency of the analysis, in terms of analysis time.

Page 65: Parallel Programming and Timing Analysis  on Embedded Multicores

Results

• Timing analysis tool:

[Flow: program binary (annotated) → program CCFG (optimisations) → explicit path exploration (reachability) or implicit path exploration (Max-Plus), taking the 3 factors into account → WCRT]

Page 66: Parallel Programming and Timing Analysis  on Embedded Multicores

Results

• Multicore simulator (Xilinx MicroBlaze):
– Based on http://www.jwhitham.org/c/smmu.html and extended to be cycle-accurate and to support multiple cores and a TDMA bus.

[Figure: cores Core0 … Coren, each with 16KB data memory and 16KB instruction memory (1-cycle access), connected by a TDMA shared bus to a 32KB global memory (5-cycle access); 5 cycles per core, so one bus schedule round = 5 × number of cores.]

Page 67: Parallel Programming and Timing Analysis  on Embedded Multicores

Results

• Benchmark programs: a mix of control/data computations, thread structures and computation loads.

* [Pop et al 2011] A Stream-Computing Extension to OpenMP.
# [Nemer et al 2006] A Free Real-Time Benchmark.

[Table: benchmark programs; * and # mark benchmarks adapted from the cited sources.]

Page 68: Parallel Programming and Timing Analysis  on Embedded Multicores

Results

• Each benchmark program was distributed over a varying number of cores.
– Up to the maximum number of parallel threads.
• Observed WCRT:
– Test vectors to elicit different execution paths.
• Computed WCRT:
– Reachability
– Max-Plus

Page 69: Parallel Programming and Timing Analysis  on Embedded Multicores

802.11a Results

• WCRT decreases until 5 cores.
• Global memory becomes increasingly expensive.
• Scheduling overheads.

[Chart: observed, Reachability and Max-Plus WCRTs (clock cycles) for 1–10 cores.]

Page 70: Parallel Programming and Timing Analysis  on Embedded Multicores

802.11a Results

[Chart: observed, Reachability and Max-Plus WCRTs (clock cycles) for 1–10 cores.]

Reachability:
• ~2% over-estimation.
• Benefit of explicit path exploration.

Page 71: Parallel Programming and Timing Analysis  on Embedded Multicores

802.11a Results

Max-Plus:
• Loss of execution context: uses only the thread WCRTs.
• Assumes one global tick where all threads execute their worst case.
• Max execution time of the scheduling routines.

[Chart: observed, Reachability and Max-Plus WCRTs (clock cycles) for 1–10 cores.]

Page 72: Parallel Programming and Timing Analysis  on Embedded Multicores

802.11a Results

Both approaches:
• The estimation of the synchronisation cost is conservative: it is assumed that the receiver only starts after the last sender.

[Chart: observed, Reachability and Max-Plus WCRTs (clock cycles) for 1–10 cores.]

Page 73: Parallel Programming and Timing Analysis  on Embedded Multicores

802.11a Results

[Chart: Reachability analysis time (seconds) for 1–10 cores.]

• Max-Plus takes less than 2 seconds.

Page 74: Parallel Programming and Timing Analysis  on Embedded Multicores

802.11a Results

[Chart: analysis time (seconds) for 1–10 cores: Reachability vs. Reachability (merge).]

merge:
• Reduction of ~9.34x in analysis time.

Page 75: Parallel Programming and Timing Analysis  on Embedded Multicores

802.11a Results

[Chart: analysis time (seconds) for 1–10 cores: Reachability, Reachability (merge), Reachability (merge-b).]

merge:
• Reduction of ~9.34x in analysis time.

Page 76: Parallel Programming and Timing Analysis  on Embedded Multicores

802.11a Results

[Chart: analysis time (seconds) for 1–10 cores: Reachability, Reachability (merge), Reachability (merge-b).]

merge:
• Reduction of ~9.34x in analysis time.
merge-b:
• Reduction of ~342x in analysis time.
• Less than 7 seconds.

Page 77: Parallel Programming and Timing Analysis  on Embedded Multicores

802.11a Results

• A reduction in the number of explored states gives a corresponding reduction in analysis time.

[Table: number of global ticks explored with each CCFG optimisation.]

Page 78: Parallel Programming and Timing Analysis  on Embedded Multicores

Results

Reachability:
• ~1 to 8% over-estimation.
• Loss in precision mainly from over-estimating the synchronisation costs.

[Charts: observed, Reachability and Max-Plus WCRTs (clock cycles) versus number of cores for FmRadio, Fly by Wire, Life and Matrix.]

Page 79: Parallel Programming and Timing Analysis  on Embedded Multicores

Results

Max-Plus:
• Over-estimation is very dependent on program structure.
• FmRadio and Life are very imprecise: loops iterate over par statement(s) multiple times, so the over-estimations are multiplied.
• Matrix is quite precise: it executes in one global tick, so the thread WCRT assumption is valid.

[Charts: observed, Reachability and Max-Plus WCRTs (clock cycles) versus number of cores for FmRadio, Fly by Wire, Life and Matrix.]

Page 80: Parallel Programming and Timing Analysis  on Embedded Multicores

Results

• Timing trace of the WCRT:
– For each core: thread start/end times, context-switching, fork/join, ...
– Can be used to tune the thread distribution.
– Used to find good thread distributions for each benchmark program.

Page 81: Parallel Programming and Timing Analysis  on Embedded Multicores

Outline

• Introduction
• ForeC language
• Timing analysis
• Results
• Conclusions

Page 82: Parallel Programming and Timing Analysis  on Embedded Multicores

Conclusions

• ForeC language for deterministic parallel programming.
• Based on a synchronous framework.
• Able to achieve WCRT speedups while providing time-predictability.
• Very precise, fast and scalable timing analysis for multicore programs using reachability.

Page 83: Parallel Programming and Timing Analysis  on Embedded Multicores

Future work

• Complete the formal semantics of ForeC.
• Prune additional infeasible paths using value analysis.
• WCRT-guided, automatic thread distribution.
• Cache hierarchy in the analysis.

Page 84: Parallel Programming and Timing Analysis  on Embedded Multicores

Questions?

Page 85: Parallel Programming and Timing Analysis  on Embedded Multicores

Introduction

• Existing parallel programming solutions:
– Shared memory model:
• OpenMP, Pthreads
• Intel Cilk Plus, Thread Building Blocks
• Unified Parallel C, ParC, X10
– Message passing model:
• MPI, SHIM
– These provide ways to manage shared resources but do not prevent concurrency errors.

[OpenMP] http://openmp.org
[Pthreads] https://computing.llnl.gov/tutorials/pthreads/
[X10] http://x10-lang.org/
[Intel Cilk Plus] http://software.intel.com/en-us/intel-cilk-plus
[Intel Thread Building Blocks] http://threadingbuildingblocks.org/
[Unified Parallel C] http://upc.lbl.gov/
[Ben-Asher et al] ParC – An Extension of C for Shared Memory Parallel Processing.
[MPI] http://www.mcs.anl.gov/research/projects/mpi/
[SHIM] SHIM: A Language for Hardware/Software Integration.

Page 86: Parallel Programming and Timing Analysis  on Embedded Multicores

Introduction

• Deterministic runtime support:
– Pthreads: dOS, Grace, Kendo, CoreDet, Dthreads.
– OpenMP: Deterministic OMP.
– Concept of logical time.
– Each logical time step is broken into an execution phase and a communication phase.

[Bergan et al 2010] Deterministic Process Groups in dOS.
[Olszewski et al 2009] Kendo: Efficient Deterministic Multithreading in Software.
[Bergan et al 2010] CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution.
[Liu et al 2011] Dthreads: Efficient Deterministic Multithreading.
[Aviram 2012] Deterministic OpenMP.

Page 87: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

• The behaviour of shared variables is similar to:
• Intel Cilk Plus (Reducers)
• Unified Parallel C (Collectives)
• DOMP (Workspace consistency)
• Grace (Copy-on-write)
• Dthreads (Copy-on-write)

Page 88: Parallel Programming and Timing Analysis  on Embedded Multicores

ForeC language

• Parallel programming patterns:
– Specifying an appropriate combine function.
– The sacrifice for deterministic parallel programs.
– Map-reduce
– Scatter-gather
– Software pipelining
– Delayed broadcast or point-to-point communication.