
Using Run-Time Checking to Provide Safety and Progress for Distributed Cyber-Physical Systems

Stanley Bak, Fardin Abdi Taghi Abad, Zhenqi Huang, Marco Caccamo

Presenter: Renato Mancusu

Distributed Coordination

• Interconnected systems that physically affect each other

• The state of each node is a function of the control inputs of other nodes, as determined by the system connection graph

Images: http://geospatial.blogs.com/geospatial/2009/07/alternative-energy-green-nonemitting-clean-renewable-or-low-carbon-.html, http://www.thewatertreatments.com/water/distribution-system/
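The coupling described on this slide can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical three-node graph, a made-up linear update rule, and an invented coupling constant; none of these come from the paper.

```python
from typing import Dict, List

# Hypothetical connection graph: node -> nodes whose control inputs affect it.
GRAPH: Dict[str, List[str]] = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b"],
}

COUPLING = 0.5  # assumed strength with which a neighbor's input reaches a node


def step(state: Dict[str, float], inputs: Dict[str, float]) -> Dict[str, float]:
    """Advance every node one step using its own input plus its neighbors' inputs."""
    new_state = {}
    for node, neighbors in GRAPH.items():
        influence = sum(inputs[n] for n in neighbors)
        new_state[node] = state[node] + inputs[node] + COUPLING * influence
    return new_state


state = {"a": 0.0, "b": 0.0, "c": 0.0}
inputs = {"a": 1.0, "b": 0.0, "c": 2.0}
next_state = step(state, inputs)
```

Node "b" applies no input of its own here, yet its state still changes because both of its neighbors act on it.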


Communication: An Essential Component

• Distributed systems rely on communication for:
– Reaching the desired state
– Functionality and stability

Communication Faults

Violation of Safety

Limits of Distributed Coordination

• Unreliable communication
– Unbounded message delays and drops
– Impossible to achieve consensus in a lossy network

• One approach:
– Use middleware that provides guarantees of communication and latency
– If the guarantees cannot be met, an error is raised to the high-level logic

• Problem: scalability

Image: “A Swarm of Nano Quadrotors”, UPENN, http://www.youtube.com/watch?v=YQIMGV5vtd4
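The middleware approach can be pictured as a wrapper that retries a send and escalates when its guarantee fails. This is a rough sketch under stated assumptions: `send_with_guarantee`, `GuaranteeViolated`, and the scripted channel are hypothetical names, and a retry budget stands in for a real latency bound.

```python
from typing import Callable, Iterable


class GuaranteeViolated(Exception):
    """Raised to the high-level logic when the delivery guarantee cannot be met."""


def send_with_guarantee(transmit: Callable[[str], bool], message: str, max_attempts: int) -> int:
    """Retry until acknowledged; return the attempts used, or raise after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        if transmit(message):
            return attempt
    raise GuaranteeViolated(f"no acknowledgement after {max_attempts} attempts")


def make_channel(outcomes: Iterable[bool]) -> Callable[[str], bool]:
    """Hypothetical channel that replays a scripted list of ack / no-ack outcomes."""
    script = iter(outcomes)
    return lambda message: next(script)


# Two drops followed by a successful delivery.
attempts_used = send_with_guarantee(make_channel([False, False, True]), "goto 3,4", max_attempts=5)

# Every attempt is dropped, so the error reaches the caller.
try:
    send_with_guarantee(make_channel([False, False]), "goto 3,4", max_attempts=2)
    error_raised = False
except GuaranteeViolated:
    error_raised = True
```

Each additional node adds more such guarded exchanges, which is one way to read the scalability problem named on the slide.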


Paper Goals

• Goal: Examine fundamental requirements for safety in distributed systems with unreliable communication
– Safety: global invariant (for example, collisions are avoided)

• Goal: Provide a mechanism for safe progress, if the communication works adequately well
– Progress: all distributed agents follow the same goal




Safety Theorem

• A coordinating distributed system is safe under unreliable communication (arbitrary delays, unbounded packet loss) if and only if both:
– Condition 1: The system is safe if no communication takes place
– Condition 2: For each message m that is received by any node, the system remains safe if no other messages are ever received after m

• Proof intuition (formal details in the paper)
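The two conditions can be read as predicates over a system state. Here is a minimal sketch, assuming a hypothetical two-vehicle system on a line in which each vehicle drives to its last commanded target and then stops. `Vehicle`, `MIN_GAP`, and the conservative interval test are illustrative assumptions, not the paper's formal model.

```python
from dataclasses import dataclass
from typing import List, Tuple

MIN_GAP = 1.0  # assumed minimum separation that counts as "no collision"


@dataclass(frozen=True)
class Vehicle:
    position: float
    target: float  # the vehicle drives to its target and then stops


def swept_interval(v: Vehicle) -> Tuple[float, float]:
    """Every point the vehicle can occupy if it never hears another message."""
    return (min(v.position, v.target), max(v.position, v.target))


def safe_forever(vehicles: List[Vehicle]) -> bool:
    """Conservative test: all swept intervals stay at least MIN_GAP apart."""
    intervals = sorted(swept_interval(v) for v in vehicles)
    return all(nxt[0] - cur[1] >= MIN_GAP for cur, nxt in zip(intervals, intervals[1:]))


def condition_1(initial: List[Vehicle]) -> bool:
    """Safe when no communication ever takes place."""
    return safe_forever(initial)


def condition_2(current: List[Vehicle], index: int, new_target: float) -> bool:
    """Safe if vehicle `index` adopts new_target and nothing else is ever received."""
    after = list(current)
    after[index] = Vehicle(current[index].position, new_target)
    return safe_forever(after)


fleet = [Vehicle(0.0, 0.0), Vehicle(5.0, 5.0)]
ok_initial = condition_1(fleet)             # both parked, 5.0 apart
ok_short_move = condition_2(fleet, 0, 3.0)  # sweeps [0, 3], still 2.0 from vehicle 1
ok_long_move = condition_2(fleet, 0, 4.5)   # sweeps [0, 4.5], only 0.5 from vehicle 1
```

In this toy setting the message "go to 4.5" violates Condition 2: if it were accepted and every later message were lost, vehicle 0 would end up closer than `MIN_GAP` to vehicle 1.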

Runtime Checking

• Condition 2 is difficult to check ahead of time, since it is quantified over every message
– “Condition 2: For each message m that is received, the system remains safe if no other messages are ever received after m”

• To build a usable system with this result, we check this condition at runtime and drop messages that violate it
– Of course, dropping messages impacts progress; progress is addressed by the second goal of the paper
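A runtime check of this kind amounts to a filter in front of the node's command input. The sketch below assumes a single hypothetical vehicle confined to a corridor. `SafetyFilter`, `CORRIDOR`, and `stays_in_corridor` are invented for illustration, and the safety predicate is a simple stand-in for whatever check a real system would run.

```python
from typing import Callable, List


class SafetyFilter:
    """Adopts a received command only if it is safe with no further messages."""

    def __init__(self, target: float, safe_forever: Callable[[float], bool]) -> None:
        self.target = target
        self._safe_forever = safe_forever
        self.dropped: List[float] = []

    def receive(self, new_target: float) -> bool:
        """Return True if the command was adopted, False if it was dropped."""
        if self._safe_forever(new_target):
            self.target = new_target
            return True
        self.dropped.append(new_target)
        return False


CORRIDOR = (0.0, 10.0)  # assumed safe region for the vehicle


def stays_in_corridor(target: float) -> bool:
    """Driving to `target` and stopping there keeps the vehicle in the corridor."""
    return CORRIDOR[0] <= target <= CORRIDOR[1]


node = SafetyFilter(target=2.0, safe_forever=stays_in_corridor)
accepted = [node.receive(t) for t in (4.0, 12.0, 7.0)]
```

The command 12.0 is dropped, so the node keeps following 4.0 until 7.0 arrives; safety holds at every point, at the cost of ignoring one command.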

Proposed Architecture

• Perform a safety test on each command (check Condition 2)

• Safe commands pass

• Unsafe commands are filtered

Safe Progress

• Progress depends on the notion of compatible actions. These are actions that all agents can take while remaining globally safe.

• When put together, chains of compatible actions allow for global progress towards a goal. The rate of progress depends on the quality of the communication channel.
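One way to picture compatibility is to test every mix of agents that did and did not receive the next action. This is a small sketch, assuming 2-D point agents, a hypothetical collision distance `MIN_GAP`, and a check on end positions only (motion between positions is ignored); it is not the paper's definition or algorithm.

```python
import itertools
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
MIN_GAP = 1.0  # assumed collision distance


def formation_is_safe(points: Sequence[Point]) -> bool:
    """All pairs of agents are at least MIN_GAP apart."""
    return all(math.dist(p, q) >= MIN_GAP for p, q in itertools.combinations(points, 2))


def compatible(old: Sequence[Point], new: Sequence[Point]) -> bool:
    """True if every mix of agents on `old` and agents on `new` is collision free."""
    n = len(old)
    for received in itertools.product([False, True], repeat=n):
        mixed = [new[i] if received[i] else old[i] for i in range(n)]
        if not formation_is_safe(mixed):
            return False
    return True


def chain_is_compatible(chain: Sequence[Sequence[Point]]) -> bool:
    """Each action in the chain is compatible with the one before it."""
    return all(compatible(a, b) for a, b in zip(chain, chain[1:]))


start = [(0.0, 0.0), (0.0, 2.0)]
goal = [(0.0, 2.0), (0.0, 4.0)]

# Jumping straight to the goal: one lost message puts both agents at (0, 2).
direct_ok = chain_is_compatible([start, goal])

# Approaching the goal in steps of 0.5 keeps every mix at least 1.5 apart.
stepped = [[(0.0, s), (0.0, s + 2.0)] for s in (0.0, 0.5, 1.0, 1.5, 2.0)]
stepped_ok = chain_is_compatible(stepped)
```

The stepped chain reaches the same goal as the direct jump, but any single lost message leaves the agents in a safe configuration.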

Example System

• A flock of vehicles moves along a path with fixed offsets

• The user can input “detour points”, which redirect the motion of the flock

• Collisions should always be avoided

• Detour points should be reached, communication permitting


Non-Compatible Actions

A new waypoint for the flock is entered

Collision may occur due to a communication fault

Compatible Actions – Iteratively Approach Goal

Compatible Actions are Robust to Communication Failures

New Detour point entered by operator

Desired final path generated for the flock

Paths generated for all the followers

Paths sent to followers!

Tractor 1 did not receive the path

Tractor 1 did not receive the new path but safety is maintained!


Vehicle Flocking Application

• We created the vehicle flocking system within StarL, a Java-based environment for testing vehicle flocking algorithms

• StarL code can be run on a Roomba flock at UIUC, or in the built-in simulator

• Communication effects (timing, packet loss) can be simulated and have been evaluated in the paper

• Video: https://www.youtube.com/watch?v=dIGU8OTfCh8


Vehicle Flocking Measurement

• We measured the effect of packet loss and vehicle count on convergence time and number of messages sent
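A measurement of this shape can be mocked up with a toy model. The sketch below is not the StarL experiment and does not reproduce the paper's results. It assumes a hypothetical protocol in which a coordinator resends each action to every vehicle that has not yet adopted it, with independent losses and perfectly reliable acknowledgements.

```python
import random
from typing import Dict, Tuple


def simulate(num_vehicles: int, loss_probability: float, chain_length: int, seed: int = 0) -> Tuple[int, int]:
    """Return (rounds, messages_sent) until all vehicles adopt the last action."""
    if not 0.0 <= loss_probability < 1.0:
        raise ValueError("loss_probability must be in [0, 1)")
    rng = random.Random(seed)
    rounds = 0
    messages = 0
    for _action in range(chain_length):
        pending = num_vehicles  # vehicles that have not yet adopted this action
        while pending > 0:
            rounds += 1
            messages += pending  # one transmission per vehicle still waiting
            delivered = sum(1 for _v in range(pending) if rng.random() >= loss_probability)
            pending -= delivered
    return rounds, messages


# Sweep the two parameters named on the slide: vehicle count and packet loss.
results: Dict[Tuple[int, float], Tuple[int, int]] = {
    (n, p): simulate(n, p, chain_length=10, seed=1)
    for n in (2, 4, 8)
    for p in (0.0, 0.2, 0.5)
}
```

With no loss, every action takes exactly one round, so the lossless entries give a baseline against which the lossy runs of this toy model can be compared.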

Future Extensions

• Replace runtime reachability checks with ahead-of-time computation

• Propose a progress framework where commands do not originate from a centralized coordinator

• Implementation on a large swarm of robots

Review

• Provide fundamental requirements for safety in distributed systems with unreliable communication

• Provide a mechanism for safe progress, if the communication works adequately well

• Evaluate the proposed techniques on a vehicle flocking scenario
