Page 1: Better decision making in presence of faults:  formal modelling and analysis

Better decision making in presence of faults: formal modelling and analysis

Professor Muffy Calder Dr Michele Sevegnani

Computing Science

December 2013

Page 2: Better decision making in presence of faults:  formal modelling and analysis

Better decision making in presence of faults: do I need to fix a fault now or can I wait until tomorrow?

Professor Muffy Calder FREng Dr Michele Sevegnani

Computing Science

December 2013

Page 3: Better decision making in presence of faults:  formal modelling and analysis

A stochastic event-based model and analysis of the NATS communications links monitoring system

Professor Muffy Calder Dr Michele Sevegnani

Computing Science

December 2013

Page 4: Better decision making in presence of faults:  formal modelling and analysis

Outline
• Who am I

Part I
• Why model; what to model

Part II
• How to model and analyse

Part III
• Results for example sites and sectors; inference from field data
• Decision making

Part IV
• Implementation and GUI; how to use the model(s)

Part V
• Conclusions; next steps

Page 5: Better decision making in presence of faults:  formal modelling and analysis

Who am I - Related work

Domestic Network and Policy Management
• real-time analysis of policies and configurations – spatial and temporal (on router)

Feature interactions in advanced telecomms
• logical properties (off-line, on-line)

Homecare sensor system: assessing configurations for usability, interaction modality
• real-time logical analysis (on system hub)

Populations of users of ubiquitous computing/mobile apps
• stochastic models and logical analysis of actual use – from user traces

Cellular biology
• signalling pathways for coordination/cancer; phosphorylation is signal

Page 6: Better decision making in presence of faults:  formal modelling and analysis

Part I: Why model

Motivation
• engineering team maintain large number of complex systems, many different management systems, reliance on experience
• a low level fault can give rise to a plethora of alarms
• systems do not allow easy visualisation, interrogation of current state, or prediction of future
• need to quantify criticality or urgency
• need to relate asset behaviour to service behaviour

Page 7: Better decision making in presence of faults:  formal modelling and analysis

Why model

Event based, stochastic modelling based on monitored behaviours
• quantify service quality across different sectors and dynamically changing assets/systems
• experiment with different monitoring strategies and system architectures
• experiment with different strategies for repair and maintenance
• visualise criticality
• better decision making: ATC users, engineers, technical staff, management

Quantify how system
• is designed to meet requirements
• actually meets requirements

Page 8: Better decision making in presence of faults:  formal modelling and analysis

Why model

Analysis allows us to answer questions like:

• What is the probability of no service from a given degraded configuration in a given frequency/sector/site over the next 48 hours?

• What proportion of time is the service functioning, in the long run?

sector RRR mean repair times: 20h, 15h

Page 9: Better decision making in presence of faults:  formal modelling and analysis

Why model

Analysis allows us to answer questions like:

• What is the probability of no service from a given degraded configuration in a given frequency/sector/site over the next 48 hours?

• What is the effect of an intervention?

Page 10: Better decision making in presence of faults:  formal modelling and analysis

What to model

Monitoring Systems
• radar - communication links - oceanic routes - local machines - voice - weather - power lines

Communication links monitoring
• civilian, military, emergency, oceanic frequencies
• sectors, sites, frequencies and channels
• 35 sectors, each with set of frequencies (+ emergency)
• 17 sites, each with antennas (channels) that send (Tx) and receive (Rx) on different frequencies
• redundancy:
  • a frequency is covered by more than one site
  • each site has main channel A and backup channel B
• site environment: powerline status, comm link status, flooding, intrusion

Page 11: Better decision making in presence of faults:  formal modelling and analysis

What to model

Monitoring system colour codes

Green functioning
Red faulty -- alarm goes off
Blue under maintenance
Amber not fully functioning/reduced redundancy (e.g. a frequency when one antenna is down)

We model sectors (comprising) sites (comprising) channels

Page 12: Better decision making in presence of faults:  formal modelling and analysis

What to model

Event-based, parameterised model

parameters:
- number of sites in a sector
- rates of events in a site
- state of Tx and Rx in a site

assumptions:
- events are independent, unless explicitly linked

Page 13: Better decision making in presence of faults:  formal modelling and analysis

Overview of project

[Flow diagram. Elements: Field Data; inference; Event Rates; Parameterised Model (CTMC for counter abstraction of subsystem); PRISM model checker (appears twice); Validation (predictive temporal properties, e.g. steady state probability of no service, reduced redundancy, etc.); Prediction (predictive temporal properties, e.g. transient probability of no service); Static Analysis; Safety Cases; Business Cases; GUI; possible action(s)]

Page 14: Better decision making in presence of faults:  formal modelling and analysis

Part II: How to model

Principles
• model observed/recorded events between discrete states
• an event occurs with a rate
• rate determines probability of reaching a state by a given time
• possibility of race conditions

[Diagram: transitions labelled k and l]

Page 15: Better decision making in presence of faults:  formal modelling and analysis

How to model

Principles
• model observed/recorded events between discrete states
• an event occurs with a rate
• rate determines probability of reaching a state by a given time
• possibility of race conditions

• At rate k: $P(t) = 1 - e^{-kt}$

• Continuous time Markov chain
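As a rough illustration of the two principles above, the sketch below evaluates the probability that an event with rate k has occurred by time t, and the chance that one event wins a race against another. The rate values and the names `prob_event_by` and `race` are hypothetical and chosen only for this example; they are not taken from the model.

```python
import math

def prob_event_by(rate: float, t: float) -> float:
    """P(t) = 1 - exp(-rate * t): probability the event has occurred by time t."""
    return 1.0 - math.exp(-rate * t)

def race(rates: dict) -> dict:
    """Probability that each competing exponential event occurs first."""
    total = sum(rates.values())
    return {name: r / total for name, r in rates.items()}

# Hypothetical rates (per hour), for illustration only.
rate_k, rate_l = 0.05, 0.20
p_k_24h = prob_event_by(rate_k, 24.0)
winners = race({"k": rate_k, "l": rate_l})
print(f"P(event k within 24 h) = {p_k_24h:.4f}")
print(f"P(k occurs before l)   = {winners['k']:.2f}")
```

In a race, the event with the higher rate is more likely to occur first, but the slower one can still win.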

Page 16: Better decision making in presence of faults:  formal modelling and analysis

Simple Example: Markov chain

[Diagram: states S, F and M, with transitions labelled rate1, rate2, rate3 and rate4]

S serviceable
F faulty
M under maintenance

A Markov chain has no memory.

A rate only depends on the current state, not how we got to a state.

We can reason about paths.

We can reason about the probability, over time, to reach a state.

e.g. what is probability to reach state M in 4 hours?
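A question such as "what is the probability to reach M in 4 hours?" can be answered numerically. The sketch below is one possible way to do it in plain Python, using uniformisation; it is not the PRISM implementation. The rate values are hypothetical, and because the diagram is lost in this transcript, the assignment of rate1, rate2 and rate4 to particular arrows is an assumption.

```python
import math

def transient(Q, p0, t, eps=1e-12, max_terms=100_000):
    """Transient distribution at time t of a CTMC with generator Q (uniformisation)."""
    n = len(Q)
    q = max(-Q[i][i] for i in range(n)) or 1.0   # uniformisation rate
    # Embedded DTMC: P = I + Q / q
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)] for i in range(n)]
    weight = math.exp(-q * t)                    # Poisson(q*t) probability of 0 jumps
    vec = list(p0)
    result = [weight * v for v in vec]
    k, acc = 0, weight
    while 1.0 - acc > eps and k < max_terms:
        k += 1
        vec = [sum(vec[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= q * t / k                      # Poisson probability of k jumps
        acc += weight
        result = [r + weight * v for r, v in zip(result, vec)]
    return result

# Hypothetical rates per hour; the slide names rate1..rate4 but gives no values.
# Assumed arrows: S->F (rate1), F->M (rate2), F->S (rate4). M->S (rate3) is unused here.
rate1, rate2, rate4 = 0.01, 0.5, 0.1
S, F, M = 0, 1, 2
# M is made absorbing, so the mass in M at time t is the probability of having reached M by t.
Q_reach = [
    [-rate1, rate1, 0.0],
    [rate4, -(rate2 + rate4), rate2],
    [0.0, 0.0, 0.0],
]
p_reach_M = transient(Q_reach, [1.0, 0.0, 0.0], 4.0)[M]
print(f"P(reach M within 4 h, starting in S) = {p_reach_M:.6f}")
```

Making the target state absorbing is what turns "where is the chain at time t" into "has the chain reached the target by time t".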

Page 17: Better decision making in presence of faults:  formal modelling and analysis

Overview of model

channel component: (counter abstraction) A and B channels, Tx or Rx

[Diagram: A/B CHANNELS component with states SS, SF, FF, SM, FM, MM]

S serviceable
F faulty
M under maintenance

Page 18: Better decision making in presence of faults:  formal modelling and analysis

Overview of model

channel component

[Diagram: A/B CHANNELS component with states SS, SF, FF, SM, FM, MM; groups of states are labelled "reduced redundancy" and "no service"]

S serviceable
F faulty
M under maintenance
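The six channel-pair states can be generated mechanically. The sketch below enumerates unordered pairs over S, F and M, which is what the counter abstraction amounts to (it forgets which of A and B is in which status). The labelling rule, counting serviceable channels, is an assumption that agrees with the W/R/N definitions given later in the talk; the function names are hypothetical.

```python
from itertools import combinations_with_replacement

STATUSES = "SFM"  # S serviceable, F faulty, M under maintenance

def pair_states():
    """Unordered A/B channel pairs: SS, SF, SM, FF, FM, MM."""
    return ["".join(p) for p in combinations_with_replacement(STATUSES, 2)]

def classify(pair: str) -> str:
    """Label a channel pair by how many of its two channels are serviceable."""
    serviceable = pair.count("S")
    if serviceable == 2:
        return "serviceable"
    if serviceable == 1:
        return "reduced redundancy"
    return "no service"

labels = {p: classify(p) for p in pair_states()}
for pair, label in labels.items():
    print(pair, label)
```

Tracking ordered pairs instead would give nine states per channel pair rather than six, which matters once several sites are composed into a sector.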

Page 19: Better decision making in presence of faults:  formal modelling and analysis

Overview of model

channel component

[Diagram: A/B CHANNELS component with states SS, SF, FF, SM, FM, MM and E]

S serviceable
F faulty
M under maintenance
E external site failure

Page 20: Better decision making in presence of faults:  formal modelling and analysis

Overview of model

site environment component

[Diagram: SITE ENVIRONMENT component with states E0, E1, E2]

S serviceable
F faulty
M under maintenance
E external site failure

Synchronise red events green events

Page 21: Better decision making in presence of faults:  formal modelling and analysis

Overview of model

A site consists of 3 concurrent components: Tx, Rx, Env

At any moment, a site is in a configuration

Examples:

(SS,SS,E0) green - serviceable site
(SF,SS,E1) amber - reduced redundancy site
(FF,*,*) red - no service site

NB: Not all configurations are reachable.

S serviceable
F faulty
M under maintenance
E external site failure

Page 22: Better decision making in presence of faults:  formal modelling and analysis

Overview of model

Every component is represented in PRISM by a generic module.

Modules for
• channel (pair)
• site environment
• site
• n-ary sector (n = 2…5)

Rates of events vary from site to site (sector to sector).

Page 23: Better decision making in presence of faults:  formal modelling and analysis

Analysis

Use (stochastic) logic for analysis to

• validate long run behaviour against (long run) observations
  What is the % time in a no service state? E.g. 8.5E-4
  What is the % time in a reduced redundancy state? E.g. 30%

• predict a transient behaviour
  What is the probability of being in a no service state over the next t hours?
  P=? [ F<=T noservice_sector(X) ]
  How does the probability change over those t hours?
  How does the probability distribution depend on the current state?

Possible action: If the prediction from the current state is unacceptable, then change state to one with a more acceptable prediction.

Page 24: Better decision making in presence of faults:  formal modelling and analysis

Transient behaviour from which current state?

• Distance (in time) to no service configurations depends on current configuration

The colour code adopted by the monitoring system does not allow us to quantify this distance or to compare possible current configurations. The model allows us to do this!

[Figure: time axis]

Page 26: Better decision making in presence of faults:  formal modelling and analysis

Transient behaviour from which current state?

• Distance (in time) to no service configurations depends on current configuration

The stochastic model allows us to measure the distance precisely.

[Figure: time axis]

Page 27: Better decision making in presence of faults:  formal modelling and analysis

Transient behaviour from which current state?

• Distance (in time) to no service configurations depends on current configuration

Important: distance does not depend on the number of transitions but on the rates on the transitions.

[Figure: time axis]
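A small worked example of this point, under the simplifying assumption of a linear path with no competing transitions: the mean time to traverse a chain of exponential transitions is the sum of the reciprocals of the rates. The rates are hypothetical.

```python
def expected_time(path_rates):
    """Mean time to traverse consecutive exponential transitions (hours if rates are per hour)."""
    return sum(1.0 / r for r in path_rates)

# Hypothetical rates per hour, for illustration only.
one_slow_step = expected_time([0.01])              # one transition at a low rate
three_fast_steps = expected_time([1.0, 1.0, 1.0])  # three transitions at high rates
print(f"one slow step:    {one_slow_step:.1f} h")
print(f"three fast steps: {three_fast_steps:.1f} h")
```

A configuration that is three transitions away from no service can therefore be much closer in time than one that is a single transition away.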

Page 28: Better decision making in presence of faults:  formal modelling and analysis

Part III: Analysis results for example sectors/sites

• Example sector with three sites
  FIR sector (sites CGL, WHD and LWH).
  Each site consists of the synchronisation of two channel components (Tx/Rx) and a site environment component.
  Total number of components for FIR: 6 channels and 3 environments.
  Event rates are inferred from historical data (Feb 2012 – Feb 2013) - maintenance and failure data for FIR sector and individual sites.

• Analysis of the transient behaviour from different sector states
  Out of all the possible configurations (389,017 states) we compare the expected behaviour of selected states over the next 48 hours.

A state represents a configuration of a sector, e.g. three sites, two of which are serviceable and one is no service.

Page 29: Better decision making in presence of faults:  formal modelling and analysis

Inference from field data

• All the events that occurred in sector FIR from Feb 2012 to Feb 2013 are counted and categorised
  • Total number of alarms: 61
  • Total number of site events: 24

• Events are used to derive transition rates:
  • Mean inter-failure time: 452 h
  • Mean repair time: 23 h
  • Response: 57 min
  • Site event: 1107 h
  • Percentage of quick repairs: 15%
  • Site failure: extremely rare event, 1 every 11.33 years
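The slides do not show how these figures become rates. A minimal sketch, assuming each figure is the mean of an exponentially distributed delay so that the rate is its reciprocal, is given below. The dictionary keys are hypothetical labels, the reading of "Site event: 1107 h" as a mean time between site events is an assumption, and a 365-day year is assumed for the site failure figure.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year

def rate_from_mean(mean_hours: float) -> float:
    """Rate (per hour) of an exponential delay with the given mean (in hours)."""
    return 1.0 / mean_hours

# Mean times in hours, taken from the figures on the slide.
mean_times_h = {
    "failure": 452.0,
    "repair": 23.0,
    "response": 57.0 / 60.0,
    "site_event": 1107.0,
    "site_failure": 11.33 * HOURS_PER_YEAR,
}
rates_per_hour = {name: rate_from_mean(m) for name, m in mean_times_h.items()}
for name, rate in rates_per_hour.items():
    print(f"{name:12s} {rate:.6g} per hour")
```

How the 15% of quick repairs enters the model is not stated here, so it is left out of the sketch.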

Page 30: Better decision making in presence of faults:  formal modelling and analysis

Analysis results for example sectors/sites

• Selected sector configurations

• Configuration W corresponds to configuration (Tx,Rx,Env) = (SS,SS,E0)
• Configuration N corresponds to configuration (Tx,Rx,Env) = (FF,*,*), (FM,*,*), (MM,*,*), (E,E,E2)
• Configuration R corresponds to configuration (Tx,Rx,Env) = (SF,(SM|SF|SS),(E0|E1)), (SM,(SM|SF|SS),(E0|E1))

Site CGL   Site WHD   Site LWH
W          W          W
W          W          N
W          N          N
W          W          R
W          R          R
R          R          R
R          R          N
R          N          N

W = serviceable (working) site
R = reduced redundancy site
N = no service site

Page 31: Better decision making in presence of faults:  formal modelling and analysis

Examples

And how to interpret results…

Page 32: Better decision making in presence of faults:  formal modelling and analysis

Steady state for example sectors/sites

No Service is 1.03E-8

Page 33: Better decision making in presence of faults:  formal modelling and analysis

Validation

Compare historical data (over 1 year) with model steady state analysis

           3 Sites   2 Sites   1 Site
Ratio R    13.46%    9.03%     4.45%
Ratio W    86.54%    91.09%    95.51%
Ratio N    0.00%     0.00%     0.03%

           3 Sites   2 Sites   1 Site
Ratio R    11.54%    6.19%     2.88%
Ratio W    88.46%    93.80%    96.85%
Ratio N    0.00%     0.00%     0.26%

[Table labels on the slide: Historical data; Steady state analysis]

Page 34: Better decision making in presence of faults:  formal modelling and analysis

Transient properties

Page 35: Better decision making in presence of faults:  formal modelling and analysis

Analysis results for example sectors/sites

Prediction of sector FIR from states (W,W,W), (W,W,R), (W,W,N)

[Plot: x-axis ticks 1 to 46; y-axis 0.00E+00 to 8.00E-07; series WWW, WWR, WWN]

Page 36: Better decision making in presence of faults:  formal modelling and analysis

Analysis results for example sectors/sites

Prediction of sector FIR with states (W,R,R), (R,R,R)

[Plot: x-axis ticks 1 to 46; y-axis 0.00E+00 to 4.00E-04; series WRR, RRR]

Page 37: Better decision making in presence of faults:  formal modelling and analysis

Analysis results for example sectors/sites

Prediction of sector FIR with states (R,N,N), (W,R,N), (R,R,N), (W,N,N)

[Plot: x-axis ticks 1 to 46; y-axis 0 to 0.008; series RNN, WRN, RRN, WNN]

Page 38: Better decision making in presence of faults:  formal modelling and analysis

Decision making

Make an intervention to move into a better state.

Page 39: Better decision making in presence of faults:  formal modelling and analysis

Decision making

Analysis from RRR
• solid line: mean repair time is 20 hours (unsafe at 20 hrs)
• dashed line: mean repair time is 15 hours (unsafe at 34 hrs)

WRR is the state with one site repaired

Page 40: Better decision making in presence of faults:  formal modelling and analysis

Decision making

WRR
• dashed line: under standard assumptions
• solid line: site repaired after 20 hours

20 = random value when mean repair time is 15 hours

Page 41: Better decision making in presence of faults:  formal modelling and analysis

Decision making - real-time support

Idea: catalogue of scenarios and interventions

Page 42: Better decision making in presence of faults:  formal modelling and analysis

Part IV: Implementation

• Implementation of the model in PRISM (probabilistic model checker)

Source code is a text file

Example

    module Channel_A_Tx
      // 0 = serviceable, 1 = faulty, 2 = repairing, 3 = under maintenance, 4 = site-event
      status_A_Tx : [0..4];
      // local events: failure, acknowledgement, quick repair or call-out, fix
      [] status_A_Tx=0 -> rate_failure : (status_A_Tx'=1);
      [] status_A_Tx=1 -> rate_ack : (status_A_Tx'=2);
      [] status_A_Tx=2 -> rate_repair : (status_A_Tx'=0) + rate_send_fix : (status_A_Tx'=3);
      [] status_A_Tx=3 -> rate_fix : (status_A_Tx'=0);
      // events synchronised with other modules on the labels "event" and "fix"
      [event] status_A_Tx=0 | status_A_Tx=1 | status_A_Tx=2 -> (status_A_Tx'=4);
      [fix] status_A_Tx=4 -> (status_A_Tx'=0);
    endmodule
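To give a feel for what a steady state query computes on such a module, the sketch below solves the balance equations for statuses 0 to 3 of a single channel in plain Python. It is only an illustration: it ignores the synchronised [event] and [fix] commands (status 4), the rate values are hypothetical, and in the project the analysis is done by PRISM on the full composed model.

```python
def steady_state(rate_failure, rate_ack, rate_repair, rate_send_fix, rate_fix):
    """Long-run fraction of time in statuses 0..3, from the balance equations."""
    # Unnormalised weights relative to status 0 (serviceable).
    w0 = 1.0
    w1 = rate_failure / rate_ack                       # flow 0->1 equals flow 1->2
    w2 = rate_failure / (rate_repair + rate_send_fix)  # flow 1->2 equals total flow out of 2
    w3 = w2 * rate_send_fix / rate_fix                 # flow 2->3 equals flow 3->0
    total = w0 + w1 + w2 + w3
    return [w / total for w in (w0, w1, w2, w3)]

# Hypothetical rates per hour, for illustration only.
fractions_of_time = steady_state(
    rate_failure=0.002, rate_ack=1.0, rate_repair=0.05, rate_send_fix=0.01, rate_fix=0.1
)
for status, fraction in enumerate(fractions_of_time):
    print(f"status {status}: {fraction:.6f}")
```

The first entry is the long-run proportion of time this one channel is serviceable under the assumed rates.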

PRISM is freely available software: www.prismmodelchecker.org

(32/64 bit Windows, Linux, MacOS -- Java)

Page 43: Better decision making in presence of faults:  formal modelling and analysis

GUI

Client-server architecture based on a Node.js web server and a web interface.

Page 44: Better decision making in presence of faults:  formal modelling and analysis

How to use the model(s)

Set the rates

Select the number of sites in the sector

Select initial configuration and duration for a predictive model

Tick this box for a steady state analysis

Page 45: Better decision making in presence of faults:  formal modelling and analysis

Android implementation

Page 46: Better decision making in presence of faults:  formal modelling and analysis

More on field data

• Scheduled maintenance
  • The model assumes stochastic failure rates

• Combined failures
  • Failure events in the Tx and Rx modules are assumed independent
  • However, only 16% of the faults (over the entire dataset) affect only one module

• Quick repairs
  • The entries of the database do not record if the fault was repaired locally (quick repair) or if an engineering team call was required
  • Even when a fault is fixed quickly locally, the equipment is often monitored for some time

• Site failures
  • More data is required for statistical significance.
  • Positive result: the data confirms these are extremely rare events.

Page 47: Better decision making in presence of faults:  formal modelling and analysis

Conclusions

What we have done
• Entire framework is implemented
• Instantiated for communications subsystem
• Parameterised model driven by a bespoke GUI
• Parameter instances derived from field data
• Model validation, leading to …
• Model as predictor: “can I wait 4 hours to fix problem at site X?”

What we uncovered
• Some issues about field data
  • retrieval from SAP
  • formats for recording (free text)

What we published
“Do I need to fix a failed component now, or can I wait until tomorrow” Submitted to: 10th European Dependable Computing Conference

Page 48: Better decision making in presence of faults:  formal modelling and analysis

Next steps

1. Field data
• More data – longitudinal and spatial
• Automated inference from data, every time dataset updated
• Automated updating of model from inferred rates

Fully automated, self-updating model of monitoring system

2. Extend model to more sectors, subsystems etc
• model all sectors; model all subsystems
• include spatial aspects; frequency redundancy

3. Decision making
• Role of online model; catalogue of scenarios

4. Modify model
• dependent events and scheduled maintenance
• experiment with other types of formalisms

5. Feedback into other processes
• alignment with safety and business cases; SLAs; ticketing

Page 49: Better decision making in presence of faults:  formal modelling and analysis

Thank you

Page 50: Better decision making in presence of faults:  formal modelling and analysis

Representation