resiliency and self-healing visa holopainen, [email protected]

Resiliency and self-healing

Visa Holopainen, [email protected]

Reinforcement Learning for Autonomic Network Repair, M. Littman, N. Ravi, E. Fenson, R. Howard, 2004

Reinforcement learning

– Used to solve Markov decision problems (MDPs): states, actions, rewards, transitions, transition probabilities

– Agent explores an environment in which it perceives its current state and takes actions to reach new states

– A reward is associated with every state

– Reinforcement learning tries to find a policy that maximizes the cumulative reward for a task

(Simplified) Reinforcement Learning example

Which direction should the agent move?

[Figure: grid world with Goal cells, ordinary State cells, and the Agent]

Reinforcement Learning example (cont)

Agent makes random moves until a Goal state is reached

[Figure: grid world with Goal and State cells; arrows show the Agent's moves toward a Goal]


Now a policy is associated with the state from which the goal state was reached

[Figure: grid world with Goal and State cells]


If a state S that already has a policy associated with it is later reached from a state S’, a policy is assigned to S’ as well

[Figure: grid world with Goal and State cells; states S and S’ marked]


After some amount of iterations the optimal policies have been formed

[Figure: grid world with Goal and State cells]


The corresponding state rewards

Goal   -1   -2   -3
 -1    -2   -3   -2
 -2    -3   -2   -1
 -3    -2   -1  Goal
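The state rewards above can be reproduced with a short sketch. It assumes a 4x4 grid with Goal cells in two opposite corners, deterministic moves, and a reward of -1 per move; none of these details are stated on the slides. It also uses value iteration with a known model instead of the random exploration described above, so it illustrates the resulting values and policy, not the learning procedure itself.

```python
# Hypothetical 4x4 grid world: Goal cells in the top-left and bottom-right corners.
SIZE = 4
GOALS = {(0, 0), (SIZE - 1, SIZE - 1)}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
STEP_REWARD = -1  # assumed cost of every move


def step(state, move):
    """Deterministic transition; a move off the grid leaves the agent in place."""
    row, col = state[0] + move[0], state[1] + move[1]
    return (row, col) if 0 <= row < SIZE and 0 <= col < SIZE else state


def value_iteration(sweeps=50):
    """Repeatedly back up the best one-step return until the values settle."""
    values = {(r, c): 0 for r in range(SIZE) for c in range(SIZE)}
    for _ in range(sweeps):
        new_values = {}
        for state in values:
            if state in GOALS:
                new_values[state] = 0  # nothing left to pay once a Goal is reached
            else:
                new_values[state] = max(
                    STEP_REWARD + values[step(state, move)] for move in MOVES.values()
                )
        values = new_values
    return values


def greedy_policy(values):
    """For every non-goal state, pick the move that leads to the best-valued state."""
    return {
        state: max(MOVES, key=lambda name: values[step(state, MOVES[name])])
        for state in values
        if state not in GOALS
    }


values = value_iteration()
policy = greedy_policy(values)
for r in range(SIZE):
    print(" ".join(f"{values[(r, c)]:3d}" for c in range(SIZE)))
```

Under these assumptions each value is the negated number of moves to the nearest Goal, and the greedy policy points along a shortest path.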

Implemented concept

Reinforcement learning is used to restore network connectivity after a failure

Starting state: no connectivity, Goal state: connectivity

Actions: PingGateway, PingIP, DNSLookup, UseCachedIP, FixIP, RenewLease

Learned policy in the picture

Prototype implemented

Nice concept but not very useful…

Approaches to Building Self Healing Systems using Dependency Analysis, J. Gao, G. Kar, P. Kermani, 2004

Problems

– Is there a way to automatically determine the root cause(s) of degraded performance of, e.g., an Internet shopping site?

– Provided that the root cause(s) can be determined, is there a way to fix the problem automatically?

Architecture

Distributed System

– A typical multi-tier e-Business system (web access, database)

The Monitoring System

– Includes monitoring agents that monitor 1) the response time of the system from the user’s perspective and 2) the application components (servlets, EJBs, …)

The Dependency Matrix

– Which transactions depend on which system components

Self-healing Engine

– Launched when the monitoring system notices a performance problem

Problem description

Based on previous work a dependency matrix can be formed

The matrix informs which customer transactions depend on which system resources

Using this matrix, the system resource that causes a performance problem can be tracked down

The initial goal was to minimize the number of transactions needed to find the root cause of a problem

This problem was found to be NP-hard, so a heuristic solution is presented

Solution

A solution cannot be guaranteed if two or more matrix columns are identical

Assume that 1) all matrix columns are different and 2) there is only one broken system component

– Now the solution can be found by the following algorithm

The set of all resources is denoted S. The set of all transactions is denoted T

1) Run all transactions in T one by one

2) If a transaction succeeds, remove all resources that this transaction depends on from S

3) Finally, only one resource is left in S. This is the broken resource
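The three steps can be sketched in a few lines. The dependency matrix is represented here as a dictionary from transaction to the set of resources it depends on; the transaction and resource names, and the `run_transaction` callback, are hypothetical and only serve the illustration.

```python
def find_broken_resource(dependencies, run_transaction):
    """Return the resources that remain suspect after all transactions are run.

    dependencies: dict mapping a transaction name to the set of resources it uses
                  (one row of the dependency matrix per transaction).
    run_transaction: callable that returns True if the transaction succeeds.
    """
    suspects = set().union(*dependencies.values())  # S: all resources
    for transaction, needed in dependencies.items():  # T: all transactions
        if run_transaction(transaction):
            # A successful transaction proves every resource it used is healthy.
            suspects -= needed
    return suspects


# Hypothetical e-Business site with four resources and four customer transactions.
dependencies = {
    "browse": {"web", "cache"},
    "login": {"web", "app"},
    "checkout": {"web", "app", "db"},
    "search": {"web", "app", "cache"},
}

# Simulate a single broken resource: a transaction fails if it depends on it.
broken = "db"
remaining = find_broken_resource(dependencies, lambda t: broken not in dependencies[t])
print(remaining)
```

The function returns whatever is left in S instead of assuming a single element, so the caller can see when the available transactions were not enough to isolate one resource.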

Solution (cont)

If the fixed set of customer transactions cannot locate the root cause of the performance problem, synthetic transactions need to be created and executed

Many practical difficulties exist in doing so

No testing

Ensembles of Models for Automated Diagnosis of System Performance Problems, S. Zhang, I. Cohen, M. Goldszmidt, J. Symons, A. Fox, 2005

Ensemble = collection

SLA (Service Level Agreement) contains Service Level Objectives (SLO)

– SLO example: “Server downtime < X sec in a day”

Problem: Which system metrics correlate with SLO violations?

– Example system metrics: CPU metrics, Memory, I/O, Network activity coming in and out of servers, Swap space usage, Paging, etc.

Tree Augmented Naïve Bayes (TAN) models

– Determine which low-level metrics most likely contributed to an SLO violation

– A mapping function is learned by the algorithm

TAN model example

“Given SLO state (SLO violation) S, what is the most predictive set of system-level metrics for S?”

Combinations of metrics more predictive of SLO violations than individual metrics

Small numbers of metrics (3-8) usually sufficient to predict SLO violation

Multiple TAN models

TAN models that are built using data collected under some conditions don't work well on data collected under different conditions -> need to maintain multiple TAN models

The model that best suits the current conditions is chosen using the Brier score

– The Brier score is similar to Mean Squared Error (MSE) and offers a fine-grained evaluation of a model
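A minimal sketch of choosing a model by Brier score follows. The score is the mean squared difference between the predicted probability of an SLO violation and the observed 0/1 outcome, so lower is better. The model names, the probabilities, and the idea of scoring every model on the same recent window of samples are assumptions for illustration; the slides do not say how the paper windows the data.

```python
def brier_score(predicted_probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if not outcomes or len(predicted_probs) != len(outcomes):
        raise ValueError("need equally long, non-empty prediction and outcome lists")
    return sum((p - o) ** 2 for p, o in zip(predicted_probs, outcomes)) / len(outcomes)


def pick_best_model(model_predictions, outcomes):
    """Return the name of the model with the lowest Brier score on the window."""
    return min(
        model_predictions,
        key=lambda name: brier_score(model_predictions[name], outcomes),
    )


# Observed SLO state for four recent samples: 1 = violation, 0 = compliance.
outcomes = [1, 0, 1, 1]

# Hypothetical violation probabilities from two models in the ensemble.
model_predictions = {
    "model_a": [0.9, 0.2, 0.8, 0.7],
    "model_b": [0.5, 0.5, 0.5, 0.5],
}

score_a = brier_score(model_predictions["model_a"], outcomes)
best = pick_best_model(model_predictions, outcomes)
print(best, score_a)
```

Because the score works on probabilities and not on thresholded alarms, it separates a confident correct model from one that is merely on the right side of 0.5.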

Results

Ensembles of models outperform a single model

They also do slightly better than the workload-specific approach

– Indicates that some workload conditions are too complex for a single model

BA = Balanced Accuracy, FA = False Alerts, Det = Detections

TAN summary

An ensemble of models performs better than a single model

The approach allows for rapid adaptation to changing conditions

No domain-specific knowledge is required

Different workloads seem to be characterized by different metric-attribution “signatures” (future work)

Towards Autonomic Web Services: Achieving Self-Healing Using Web Services, S. Gurguis, A. Zeid, 2005

CBE-log is a representation format into which log files of all different applications can be converted

Diagnosis Engine selects a set of repair actions

The Symptoms Database is an XML-file containing symptoms and recovery actions

Rule Engine decides which repair actions should be taken based on the Policy Database

No prototype implemented

A typical record in the Symptom Database presented in the picture

Possible application: legacy systems

Reflection, Self-Awareness and Self-Healing in OpenORB, G. Blair, G. Coulson, et al. 2002

OMG (Object Management Group)

– An open-membership, not-for-profit consortium that produces and maintains computer industry specifications for interoperable enterprise applications

OMG CORBA (Common Object Request Broker Architecture)

– Open, vendor-independent architecture and infrastructure that computer applications use to work together over networks

– Supports communication between different types of operating systems, programming languages and networks

– Interfaces defined in OMG IDL (Interface Definition Language)

– Mappings exist between IDL and C, C++, Java, COBOL, Smalltalk, Ada, Lisp, Python, and IDLscript

OpenORB

– Provides a Java implementation of the OMG CORBA 2.4.2 specification

Example, OMG IDL <-> C mappings

OpenORB self-healing

Meta-interface supports access to the underlying platform

Open ORB supports the ability to discover meta-information about the current system, both in terms of its structure and ongoing behaviour

System properties can also be adapted by using the appropriate meta-interfaces

Management component can be introduced (dynamically) into the various meta-space models

??

Measuring the Effectiveness of Self-Healing Autonomic Systems, A. Brown, C. Redlin, 2005

SPEC (Standard Performance Evaluation Corporation)

– Non-profit corporation that maintains a standardized set of relevant benchmarks applicable to the newest generation of high-performance computers

SPEC jAppServer2004

– Benchmark for measuring the performance of J2EE application servers

– An end-to-end application which exercises all major J2EE technologies

Based on jAppServer2004 a benchmarking system was created that is capable of quantifying the autonomic self-healing capability of a large-scale J2EE software solution

The system is used in various production environments

The Architecture

30 different types of disturbances representing common failure modes can be injected into the SUT (system under test)

– Component shutdowns, data loss, resource exhaustion, load surges, operator errors, ...

Two metrics are used to evaluate the SUT’s self-healing capacity

1) How effectively the SUT heals itself

– Basically measured by counting how many requests jAppServer2004 gets right under a disturbance compared to normal working conditions

2) How autonomic the healing response is

– A 90-question survey is used

The Survey

The 90-question survey assigns points to the SUT based on the level of automation present in its response to each disturbance (based on IBM’s autonomic computing maturity model)

– 0 points for a basic manual response, 1 point for a managed response, 2 for predictive, 4 for adaptive, and 8 for autonomic

“...Our baseline run on SUT #1 resulted in an average healing effectiveness score of 0.79 and an autonomic maturity score of 0.15 (both out of 1.0), indicating a relatively low level of autonomic self-healing capability. In comparison, SUT #2 attained an effectiveness score of 0.83 and a maturity score of 0.22. Comparing the two results indicates that SUT #2’s system management technology provided a small—but measurable—improvement in autonomic capability...”
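The two metrics can be sketched as follows. The point values per maturity level come from the survey description above; dividing the total by the maximum possible points to get a score out of 1.0, and computing effectiveness as a plain ratio of correct requests, are assumptions made for this illustration and may differ from the paper's exact formulas. The request counts and survey answers are made up.

```python
# Points per maturity level, as listed for the survey.
MATURITY_POINTS = {
    "basic": 0,
    "managed": 1,
    "predictive": 2,
    "adaptive": 4,
    "autonomic": 8,
}


def maturity_score(answers):
    """Assumed normalization: points earned divided by the maximum possible points."""
    if not answers:
        raise ValueError("at least one survey answer is required")
    earned = sum(MATURITY_POINTS[level] for level in answers)
    return earned / (MATURITY_POINTS["autonomic"] * len(answers))


def healing_effectiveness(correct_under_disturbance, correct_under_normal):
    """Assumed ratio: correct requests with a disturbance vs. in a normal run."""
    if correct_under_normal <= 0:
        raise ValueError("the normal run must have at least one correct request")
    return correct_under_disturbance / correct_under_normal


# Hypothetical answers for four survey questions.
answers = ["basic", "managed", "autonomic", "adaptive"]
maturity = maturity_score(answers)

# Hypothetical request counts for one injected disturbance.
effectiveness = healing_effectiveness(790, 1000)
print(maturity, effectiveness)
```

The doubling point scale means a few fully autonomic responses outweigh many managed ones, which is why the maturity scores quoted above stay low even when effectiveness is high.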

Personal Autonomic Computing Self-Healing Tool, R. Sterritt, S. Chung, 2004

A self-healing tool consisting of a pulse monitor and a health monitor

Used in PC-environment

Pulse Monitoring application (PBM) is a UDP-based peer-to-peer application which

1) Checks whether hosts are providing a ‘heartbeat’ or not

2) Indicates the health level of the system (state of processes)

3) Reboots a neighbor if no heartbeat is heard from it (security?)

Health Monitoring runs on a host and restarts a process on the same host if it’s not responding

Combines three old concepts: watchdog processes, hello-mechanism, and remote control
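The heartbeat check at the core of the pulse monitor can be sketched without any networking. The class below only tracks when each peer was last heard and reports the peers that have been silent for longer than a timeout; the class name, the timeout value, and the host names are hypothetical, and the actual tool's UDP message format and reboot mechanism are not shown on the slides.

```python
import time


class PulseTracker:
    """Tracks the last heartbeat time per host and reports silent hosts."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock  # injectable so the logic can be exercised without waiting
        self.last_seen = {}

    def record_heartbeat(self, host):
        """Call whenever a heartbeat message arrives from a peer."""
        self.last_seen[host] = self.clock()

    def silent_hosts(self):
        """Hosts whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            host for host, seen in self.last_seen.items() if now - seen > self.timeout
        )


# Simulated timeline with a fake clock instead of real UDP traffic.
fake_time = [0.0]
tracker = PulseTracker(timeout_seconds=5.0, clock=lambda: fake_time[0])

tracker.record_heartbeat("pc-a")
tracker.record_heartbeat("pc-b")

fake_time[0] = 4.0
tracker.record_heartbeat("pc-a")  # pc-a keeps beating, pc-b has gone quiet

fake_time[0] = 7.0
silent = tracker.silent_hosts()
print(silent)
```

In the tool described above, a host reported as silent would be the candidate for a remote reboot; the sketch deliberately stops at detection.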

The Architecture

Pulse Monitor (Java) communicates with platform-specific Health Monitor (C) through JNI

Main monitor monitors Pulse monitor and Health monitor

Testing

A proof-of-concept prototype system was built on the MS Windows platform

Future topics: more autonomic functionality & supported platforms

Maybe useful when human administration not possible (sensor networks?)

Conclusions

1) Reinforcement Learning for Autonomic Network Repair

– Learn autonomically the best sequence of actions to repair a network outage

– Prototype implemented and tested (useful?)

2) Approaches to Building Self Healing Systems using Dependency Analysis

– Determine the root cause of degraded performance and try to fix it

– No testing, use 3) instead?

3) Ensembles of Models for Automated Diagnosis of System Performance Problems

– Suitable (tested) system for (Hewlett Packard) server systems

– Pinpoints causes of SLO violation

4) Towards Autonomic Web Services: Achieving Self-Healing Using Web Services

– Autonomic web server healing system

– No testing

Conclusions (cont)

5) Reflection, Self-Awareness and Self-Healing in OpenORB

– ?

6) Measuring the Effectiveness of Self-Healing Autonomic Systems

– Suitable system for J2EE server systems

– Provides users with a quantitative way to measure the self-healing capability of their IT systems

– Implemented and in use

7) Personal Autonomic Computing Self-Healing Tool

– Enables a group of PCs to monitor the health of each other

– Applications?

– Prototype implemented

Overall much discussion about server self-healing
