pod-diagnosis: error detection and diagnosis of sporadic operations on cloud applications


NICTA Copyright 2012 From imagination to impact

POD-Diagnosis: Error Detection and Diagnosis of

Sporadic Operations on Cloud Applications

Dr. Liming Zhu

[email protected]

Principal Researcher, NICTA/UNSW

April, 2014 at Berkeley AMPLab


Outline

• Dependable Cloud Operation

• Approach: Process-Oriented Dependability (POD)
  – POD-Diagnosis
  – Undo/Recovery Planning using AI Planning
  – Modeling and Analysis using DTMC

• Connections with AMPLab BDAS


Dependable Cloud Operation: Motivation

• Sporadic operations cause most outages
  – Deployment, reconfiguration, (rolling) upgrade, rollback…
    • as opposed to normal operations
  – DevOps-related: continuous integration/deploy/delivery
    • Etsy.com: 25 full deployments per day at 10 commits per deploy
  – Other drivers: resource sharing, micro services/partition migration, backup/recovery, auto-mitigation itself…

• Limited control & visibility during sporadic operation
  – Heavy reliance on Cloud APIs
  – Limited visibility and exception handling capabilities


Dependable Cloud Operation: Challenges

• Our Context
  – Large-scale web/enterprise operation in Cloud
  – Distributed data analytics in Cloud (Hadoop/Spark)

• Goal: detect, diagnose and react to errors occurring during a sporadic cloud operation

• Challenges
  1. Anomaly detection during sporadic operations
  2. Undo/Recovery planning
  3. Modelling and analysis of sporadic operation


Sporadic Operation Example: Rolling Upgrade

- Have 100 servers in cloud with version 1 software

- Upgrade 10 servers at a time to version 2 software

- No downtime or redundancy cost

- Can take a long time to complete, with errors and other interfering operations occurring during the upgrade
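The wave structure of this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `fleet` dictionary, the `rolling_upgrade` function and the server names are hypothetical, and the "upgrade" is a stand-in assignment rather than a real cloud API call.

```python
def rolling_upgrade(fleet, batch_size):
    """Upgrade `fleet` (server name -> version) in place, `batch_size` servers per wave.

    Returns the smallest number of servers that stayed in service during any wave.
    """
    names = sorted(fleet)
    min_in_service = len(names)
    for start in range(0, len(names), batch_size):
        wave = names[start:start + batch_size]
        # While a wave is being upgraded, only the other servers serve traffic.
        min_in_service = min(min_in_service, len(names) - len(wave))
        for name in wave:
            fleet[name] = "v2"  # stand-in for the real upgrade step (assumption)
    return min_in_service


# 100 hypothetical servers on version 1, upgraded 10 at a time.
fleet = {f"server-{i:03d}": "v1" for i in range(100)}
capacity_floor = rolling_upgrade(fleet, batch_size=10)
```

In this sketch at least 90 servers remain in service at any time, which is the sense in which the operation avoids downtime without paying for a second full fleet.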


Challenge 1: Anomaly Detection

• Traditional anomaly-based error detection is designed for “normal operation”
  – during sporadic operation it either produces significant false positives OR all monitoring has to be disabled

• Continuous changes to the production systems
  – From months at scheduled downtime to hours at all times
  – Multiple operations at the same time

• Quality of automation scripts + human
  – fully testing the operation (scripts + human) in an uncertain cloud environment is very difficult


Our Approach: Use Process Context

• Offline: treat an operation as a process
  – Process discovered automatically from logs/scripts
    • Clustering of log lines and process mining
  – Intermediary step outcomes specified as assertions

• Online: use process context
  – Process context: process/instance/step ids, expected states…
  – Errors detected by examining logs and monitoring data
    • Assertion evaluation integrated with monitoring facilities
    • Compliance checking against expected processes using logs
  – Detected errors are further diagnosed for (root) causes
    • Examining a fault tree to locate potential root causes
    • Performing more diagnostic tests and on-demand assertions

X. Xu, L. Zhu, et al., "POD-Diagnosis: Error Diagnosis of Sporadic Operations on Cloud Applications," 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014.


Example: Rolling Upgrade Using Asgard

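[Figure: diagram with the components Operator, Asgard, Log data, Process Mining Service and Discovered Model; edge labels Controls, Generates, Read by and Outputs; discovered-model steps Check AZs, Create Snapshot, Create instance from snapshot, Create AMI from instance and Evaluate AMI; regions labelled Offline and Online.]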


Process Mining Service: how it works

• Process Mining: Discovery

1. Collect the logs (using Logstash)

2. Filter the logs

3. Calculate the string distance (Levenshtein distance) between each pair of log lines

4. Cluster the log lines

5. Look at the dendrogram to decide on threshold

6. Name & combine clusters

7. Derive regular expressions for the clusters

8. Classify the log lines using the regular expressions and cluster names

9. Import altered log into process mining tools

10. Apply different process discovery algorithms

11. If anything requires changes, go back to the respective steps and redo from there
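Steps 3 and 4 above can be sketched as follows. This is a minimal illustration, not the service's actual implementation: the log lines are invented, the distance is normalised by line length, the clustering is plain single-linkage with a fixed cut-off, and the threshold of 0.3 is an assumption standing in for the value an analyst would read off the dendrogram in step 5.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def cluster(lines, threshold):
    """Single-linkage clustering: lines closer than `threshold` end up in one cluster."""
    parent = list(range(len(lines)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(lines)), 2):
        longest = max(len(lines[i]), len(lines[j]), 1)
        if levenshtein(lines[i], lines[j]) / longest <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, line in enumerate(lines):
        groups.setdefault(find(i), []).append(line)
    return list(groups.values())


# Hypothetical log lines; real Asgard logs look different.
logs = [
    "Remove instance i-1a2b from ELB",
    "Remove instance i-9f8e from ELB",
    "Terminate instance i-1a2b",
    "Terminate instance i-9f8e",
    "Waiting for new instance to launch",
]
clusters = cluster(logs, threshold=0.3)
```

Each resulting cluster would then be named and turned into a regular expression (steps 6 and 7), for example by replacing the varying instance id with a wildcard.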


POD-Detection: Error Detection

Error Detection Service has two methods for detecting errors:
• Assertion Checking
• Conformance Checking


Assertion Checking: how it works


Log line:
• Remove ...

Assertions:
• i has been de-registered from ELB
• i has been removed from ASG
• there is 1 less instance of v1


Assertion Checking: how it works

Log line:
• Remove ...
• Terminate ...

Assertions:
• i successfully terminated


Assertion Checking: how it works

Log line:
• Remove ...
• Terminate ...
• Wait ...

Assertions:
• Next log line should appear within 17m35s (95th percentile)


Assertion Checking: how it works

Log line:
• Remove ...
• Terminate ...
• Wait ...
• New instance ...

Assertions:
• i′ successfully launched
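The assertion-checking slides above can be read as a table from log-line patterns to checks on the expected state. The sketch below illustrates that idea only: the log-line format, the `state` dictionary and the function names are hypothetical, and a real deployment would query cloud and monitoring APIs instead of an in-memory snapshot. The 1055-second limit is the 17m35s figure from the slide.

```python
import re

# Hypothetical rules: log-line pattern -> named checks on a state snapshot.
ASSERTIONS = [
    (re.compile(r"Remove instance (?P<i>\S+)"), [
        ("de-registered from ELB", lambda state, i: i not in state["elb"]),
        ("removed from ASG", lambda state, i: i not in state["asg"]),
    ]),
    (re.compile(r"Terminate instance (?P<i>\S+)"), [
        ("successfully terminated", lambda state, i: i not in state["running"]),
    ]),
]

WAIT_LIMIT_SECONDS = 17 * 60 + 35  # 95th percentile quoted on the slide


def check(line, state):
    """Return descriptions of the assertions that fail for this log line."""
    failures = []
    for pattern, checks in ASSERTIONS:
        match = pattern.search(line)
        if match:
            instance = match.group("i")
            failures += [f"{instance}: {name}"
                         for name, holds in checks if not holds(state, instance)]
    return failures


def waited_too_long(previous_ts, now_ts, limit=WAIT_LIMIT_SECONDS):
    """Timing assertion: the next log line should appear within `limit` seconds."""
    return now_ts - previous_ts > limit


# Hypothetical snapshot: i-2 is still registered with the ELB, i-1 is still running.
state = {"elb": {"i-2"}, "asg": set(), "running": {"i-1"}}
```

With this snapshot, `check("Remove instance i-2", state)` reports the failed ELB assertion, while the same line for `i-1` passes both checks.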


Conformance Checking: how it works


Log lines:
• Remove ...


Conformance Checking: how it works

Log lines:
• Remove ...
• Terminate ...


Conformance Checking: how it works

Log lines:
• Remove ...
• Terminate ...
• Wait ...


Conformance Checking: how it works

Log lines:
• Remove ...
• Terminate ...
• Wait ...
• Terminate ... ???
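The unexpected second "Terminate" above is the kind of deviation conformance checking flags. A minimal sketch follows, assuming the discovered model can be reduced to a "directly follows" relation between activities. Both the `MODEL` table and the assumed step order (Remove, Terminate, Wait, New instance, repeated per instance) are illustrative; process mining tools typically use richer models and replay techniques.

```python
# Hypothetical process model: activity -> activities allowed to follow it.
MODEL = {
    "START": {"Remove"},
    "Remove": {"Terminate"},
    "Terminate": {"Wait"},
    "Wait": {"New instance"},
    "New instance": {"Remove", "END"},
}


def first_violation(activities, model=MODEL):
    """Return (position, previous, actual) for the first non-conforming step, else None."""
    previous = "START"
    for position, activity in enumerate(activities):
        if activity not in model.get(previous, set()):
            return position, previous, activity
        previous = activity
    return None


conforming = ["Remove", "Terminate", "Wait", "New instance"]
deviating = ["Remove", "Terminate", "Wait", "Terminate"]
```

Here `first_violation(deviating)` points at the fourth event, where "Terminate" follows "Wait" although the model expects "New instance".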


POD-Diagnosis: how it works

• Fault trees are built as knowledge base

• Process context used for fault tree pruning

• On-demand diagnosis tests to locate the (root) causes
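The three points above can be illustrated with a small fault tree walk. This is a sketch under assumptions: the `Node` structure, the example faults and the stubbed diagnostic tests are all hypothetical, and the real knowledge base is certainly richer. The walk skips subtrees that cannot apply to the current process step and runs a test only when a branch is still a candidate.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class Node:
    name: str
    steps: Set[str] = field(default_factory=set)  # steps where the fault can occur; empty = any
    test: Optional[Callable[[], bool]] = None     # on-demand test; True means "fault confirmed"
    children: List["Node"] = field(default_factory=list)


def diagnose(node, step):
    """Return confirmed leaf causes, pruning branches irrelevant to `step`."""
    if node.steps and step not in node.steps:
        return []                      # pruned by process context, test never runs
    if node.test is not None and not node.test():
        return []                      # diagnostic test rules this branch out
    if not node.children:
        return [node.name]
    causes = []
    for child in node.children:
        causes += diagnose(child, step)
    return causes


tests_run = []  # records which hypothetical tests were actually executed


def stub_test(label, result):
    def run():
        tests_run.append(label)
        return result
    return run


tree = Node("instance launch failed", children=[
    Node("AMI unavailable", steps={"New instance"}, test=stub_test("ami", True)),
    Node("key pair missing", steps={"New instance"}, test=stub_test("key", False)),
    Node("ELB misconfigured", steps={"Remove"}, test=stub_test("elb", True)),
])
causes = diagnose(tree, step="New instance")
```

Because the error was detected at the "New instance" step, the ELB branch is pruned and its test is never executed.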


Evaluation: POD-Detection/Diagnosis

• Experiments
  – Rolling upgrade of 100+ node cluster in AWS
    • Fault injection + confounding processes: random kill, scaling-in…

• Detected errors
  – Assertion checking: known errors and global errors
    • Examples: key management, launch configuration, images…
  – Compliance checking: unknown errors
    • skipping activities or undone activities

• Time and precision
  – Compared with Asgard/Monitoring internal mechanisms
    • Detected more errors earlier
  – Diagnosis: limited to known causes in the fault tree
    • 95th percentile of diagnosis time less than 4 s; accuracy ranges from 80% to 100%


Other Related Research

Challenges
1. Anomaly detection during sporadic operations
2. Undo/Recovery planning
3. Modelling and analysis of sporadic operation


Challenge 2: Undo/Recovery Planning


Undo/Undoability Approach in a Nutshell

• Goal: undo support for “indirect control” setting
  – Problem 1: some actions are irreversible, e.g., delete
  – Problem 2: undo ≠ copy back previous state of memory
    • Have to call the right actions on the right resources in the right order
  – Problem 3: partly irreversible operations, e.g. on Amazon WS:
    • Stopping a machine disassociates an elastic IP address (if any), and releases internal IP / public DNS
    • Starting the machine isn’t undo: elastic IP is dangling, internal IP / public DNS / timestamps are different

• Solution components:
  – Replace “do” with “pseudo-do”
  – Undo System based on AI Planning
    • Outcome: sequence of undo actions
  – Undoability Checking:
    • Is the operation I’m about to execute undoable?
    • Learn which aspects can be fully undone for each operation (whole domain)
    • If not, can we abstract / change so that undoability is given?
  – Projection (of a domain)


Ingo Weber et al., "Supporting undoability in systems operations," USENIX LISA'13: Large Installation System Administration Conference, Washington, DC, USA, November 2013.


Undoability Checking Approach

[Figure: undoability checking workflow. Labels: the tool user (e.g., sys admin) defines the operation(s) to execute (e.g., script, command) and the resources and properties required to be undoable; the tool provider defines the full domain model (e.g., AWS); a projection specification is generated and applied to obtain a projected domain model; per operation, the undoability checker generates pre- and post-states, checks undoability per pre-post state pair and calls the AI planner for each pair; the result, fed back to the user, is undoability (yes/no) and a list of causes if not undoable.]
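The per-pair check on this slide (generate a pre-state and a post-state, then ask a planner whether the pre-state can be reached again) can be sketched with a toy domain. Everything here is an assumption made for illustration: the two-field state, the four actions and their effects are a drastic simplification of a cloud domain model, and breadth-first search stands in for the AI planner.

```python
from collections import deque

# Toy state: (machine_status, elastic_ip_attached). Each action returns the
# successor state, or None when its precondition does not hold.
ACTIONS = {
    "start": lambda s: ("running", s[1]) if s[0] == "stopped" else None,
    "stop": lambda s: ("stopped", False) if s[0] == "running" else None,
    "associate_ip": lambda s: (s[0], True) if s[0] == "running" and not s[1] else None,
    "terminate": lambda s: ("terminated", False) if s[0] in ("running", "stopped") else None,
}


def plan(start, goal, actions=ACTIONS):
    """Breadth-first search for a shortest action sequence from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for name, apply in actions.items():
            successor = apply(state)
            if successor is not None and successor not in seen:
                seen.add(successor)
                queue.append((successor, path + [name]))
    return None  # goal unreachable: no undo plan exists


def undoable(pre_state, operation, actions=ACTIONS):
    """An operation is undoable from pre_state if a plan leads back from its post-state."""
    post_state = actions[operation](pre_state)
    return post_state is not None and plan(post_state, pre_state, actions) is not None


pre = ("running", True)
undo_plan = plan(ACTIONS["stop"](pre), pre)
```

In this toy domain, undoing "stop" takes two actions (start, then re-associate the elastic IP), which mirrors the observation that starting the machine alone is not an undo. "terminate" has no undo plan at all, so the check reports it as not undoable.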


Challenge 3: Modeling and Analysis

• Approach: Model as stochastic processes
  – Discrete-/continuous-time Markov chains (DTMC/CTMC)
    • Forward states: net successful operations
    • Backward states: failure or deliberate rollback/undo
    • A family of g-k chains with different parameters
      – g: rolling-upgrade wave granularity; k: number of failures/rollbacks per wave

Daniel Sun, L. Zhu, et al., "Understanding Rolling Upgrade," 33rd International Symposium on Reliable Distributed Systems (SRDS), 2014 (submitted).


Model used for

• Predictions
  – e.g. completion time, failure rate impact

• Optimization and Decision Problems
  – e.g. when to activate new versions to guarantee a 99.99% success
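As an illustration of the completion-time prediction, consider a much simpler chain than the g-k family: the state is the number of completed waves, each step moves one wave forward with probability p and one wave back (a rollback) otherwise, and state 0 cannot go back any further. This chain and the function name are assumptions for the sketch, not the model from the cited paper. If h_i is the expected number of steps to get from wave i to wave i+1, then h_0 = 1/p and h_i = (1 + (1 - p) h_{i-1}) / p, and the expected completion time is the sum of the h_i.

```python
def expected_steps(waves, p_success):
    """Expected number of steps until `waves` waves are completed.

    Simplified chain (assumption): each step succeeds with probability
    p_success and advances one wave; otherwise one completed wave is
    rolled back (the chain stays put when no wave is completed yet).
    """
    if waves < 0 or not 0 < p_success <= 1:
        raise ValueError("need waves >= 0 and 0 < p_success <= 1")
    total = 0.0
    h = 0.0  # with h = 0 the first iteration yields h_0 = 1 / p_success
    for _ in range(waves):
        h = (1 + (1 - p_success) * h) / p_success
        total += h
    return total


no_failures = expected_steps(10, 1.0)
coin_flip = expected_steps(2, 0.5)
```

With no failures, ten waves take exactly ten steps. With p = 0.5 and two waves the recurrence gives h_0 = 2 and h_1 = 4, so six steps in expectation, which shows how quickly rollbacks inflate completion time.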


Connection with AMPLab BDAS


Projects Related to BDAS (1/2)

1. Log/Metrics analysis in POD-Diagnosis
  – Currently using Spark/MLBase
  – Voluminous log/events into Spark Streaming

2. Dependable deployment/operation of BDAS
  – POD applied to Hadoop before, maybe BDAS?

3. Multi-level granularity access for data analytics
  – Australian Urban Research Infrastructure Network (AURIN)
    • Portal to provide transport-related data to international researchers
    • Cluster sharing for in-portal pre-processing and analytics
    • De-anonymization concerns and different views for the same data
  – Evaluating how BDAS can support this


Projects Related to BDAS (2/2)

Redacted

4. Data scientist workflow and local exploration

5. Distributed machine learning


Team Acknowledgement

• Researchers
  – Len Bass
  – Alan Fekete
  – Anna Liu
  – Daniel Sun
  – Hiroshi Wada
  – Ingo Weber
  – Sherry Xu
  – Liming Zhu

• Engineers
  – Adnene Guabtni
  – Chao Li

• Students
  – Amer Abdalamer
  – Ahmed Alqahtani
  – Mostafa Farshchi
  – Min Fu
  – Jin Li
  – Matthew Sladescu
  – Donna Xu
  – DongYao Wu