G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks He Yan, Lee Breslau, Zihui Ge, Dan Massey, Dan Pei, Jennifer Yates

Post on 31-Dec-2015

TRANSCRIPT

Page 1: G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

He Yan, Lee Breslau, Zihui Ge, Dan Massey, Dan Pei, Jennifer Yates

Page 2: G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

Abstract

● Best effort networks --> QoS

● Manage end-to-end service quality as a whole

● Generic Root Cause Analysis (G-RCA)

o Service Quality Management (SQM)

● FCAPS

Page 3: G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

Introduction

• Finding the root cause of errors

– transient errors

• Gather information for network operators

• Helps Service Quality Management (SQM) for ISPs.

Page 4: G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

G-RCA Architecture

• Consists of five main components.

• G-RCA determines where and when to look for diagnostic events.

• Used to:

– Troubleshoot ongoing network issues

– Investigate past behavior

Page 5: G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

Data Collection and Management

• Proactively collects data from network, such as alarms, logs and performance measurements.

• Uses a data collector and database to store data

• “Events”

– event-name, location type, retrieval process and information
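The four attributes listed for an event can be pictured as a small record. The following is a minimal sketch only: the class name `EventDefinition`, the field names, the `link-down` event, and the canned rows are all hypothetical, and the retrieval function stands in for a real database query.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# One observed occurrence: (start_time, end_time, location), times in seconds.
EventInstance = Tuple[float, float, str]


@dataclass(frozen=True)
class EventDefinition:
    name: str            # event-name
    location_type: str   # location type, e.g. "router" or "interface" (assumed values)
    retrieve: Callable[[float, float], List[EventInstance]]  # retrieval process
    description: str     # free-text information about the event


def fetch_link_down(start: float, end: float) -> List[EventInstance]:
    """Stand-in for a database query: returns canned rows overlapping [start, end]."""
    rows = [(100.0, 130.0, "r1:ge-0/0/1"), (500.0, 505.0, "r2:ge-0/0/3")]
    return [r for r in rows if r[0] <= end and r[1] >= start]


# Hypothetical event definition used only for illustration.
link_down = EventDefinition(
    name="link-down",
    location_type="interface",
    retrieve=fetch_link_down,
    description="Interface reported down (illustrative).",
)

print(link_down.retrieve(0.0, 200.0))
```

Keeping the retrieval process as a plain callable means the rest of the pipeline can ask any event type for its instances in a time range without knowing where the data lives.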

Page 6: G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

Service Dependency Model

● The model in Figure 2 is used to include the network elements associated with a problem

● Hard to put the theory into practice

o Traffic sampling data

o Snapshots of router configs

Page 7: G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

Spatial-Temporal Correlation (1)

● How to relate what has happened to the service problem?

● G-RCA defines a temporal and a spatial joining rule

● Temporal Joining Rule

○ Defines a time window within which a symptom and a diagnostic event can be joined.

○ 6 parameters: 3 each for the symptom and the diagnostic event

■ Left expansion margin

■ Right expansion margin

■ Expanding option (Start/End, Start/Start or End/End)

Page 8: G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

Spatial-Temporal Correlation (2)

○ Symptom and diagnostic event are joined when the windows overlap.
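The temporal joining rule can be sketched in a few lines. This is an illustration under assumptions, not the platform's implementation: the class `TemporalRule` and the function `temporally_joined` are hypothetical names, and the reading of the expanding option (which of the event's start or end timestamps each margin is applied to) is an interpretation of the slide.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TemporalRule:
    left_margin: float   # seconds subtracted from the left anchor
    right_margin: float  # seconds added to the right anchor
    option: str          # "start/end", "start/start" or "end/end"

    def window(self, start: float, end: float) -> Tuple[float, float]:
        # Assumed meaning: the option picks the (left, right) anchors to expand.
        anchors = {
            "start/end": (start, end),
            "start/start": (start, start),
            "end/end": (end, end),
        }
        left, right = anchors[self.option]
        return left - self.left_margin, right + self.right_margin


def temporally_joined(sym_rule: TemporalRule, sym: Tuple[float, float],
                      diag_rule: TemporalRule, diag: Tuple[float, float]) -> bool:
    """True when the expanded symptom and diagnostic windows overlap."""
    s_lo, s_hi = sym_rule.window(*sym)
    d_lo, d_hi = diag_rule.window(*diag)
    return s_lo <= d_hi and d_lo <= s_hi


# Illustrative numbers: look back 180 s from the symptom start.
symptom_rule = TemporalRule(left_margin=180, right_margin=5, option="start/start")
diagnostic_rule = TemporalRule(left_margin=5, right_margin=5, option="start/end")

print(temporally_joined(symptom_rule, (1000, 1060), diagnostic_rule, (900, 910)))
print(temporally_joined(symptom_rule, (1000, 1060), diagnostic_rule, (1100, 1110)))
```

Each side carries its own three parameters, which is where the six parameters in total come from.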

Page 9: G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

Spatial-Temporal Correlation (3)

● Spatial Joining Rule

○ Symptom event location type

○ Diagnostic event location type

○ Joining level

● Joining level

○ Link symptom locations and diagnostic event locations together

● Model diagnostic signatures using diagnosis graph

● A symptom and diagnostic event pair is called diagnosis rule

● G-RCA evaluates the time and location conditions and collected data

● Determine whether diagnostic signature is present
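The spatial joining rule can be illustrated by mapping both locations to a common joining level and checking for a shared element. The sketch below assumes a single joining level ("router") and a hypothetical `INTERFACE_TO_ROUTER` table standing in for the service dependency model; none of these names come from the slides.

```python
from typing import Dict, Set

# Hypothetical dependency data: the router that owns each interface.
INTERFACE_TO_ROUTER: Dict[str, str] = {
    "r1:ge-0/0/1": "r1",
    "r1:ge-0/0/2": "r1",
    "r2:ge-0/0/3": "r2",
}


def to_router_level(location_type: str, location: str) -> Set[str]:
    """Project a location of the given type onto the 'router' joining level."""
    if location_type == "router":
        return {location}
    if location_type == "interface":
        return {INTERFACE_TO_ROUTER[location]}
    raise ValueError(f"unsupported location type: {location_type}")


def spatially_joined(sym_type: str, sym_loc: str,
                     diag_type: str, diag_loc: str) -> bool:
    """True when both locations map to at least one common router."""
    return bool(to_router_level(sym_type, sym_loc) & to_router_level(diag_type, diag_loc))


print(spatially_joined("router", "r1", "interface", "r1:ge-0/0/2"))
print(spatially_joined("router", "r1", "interface", "r2:ge-0/0/3"))
```

A diagnosis rule would then hold for a symptom and diagnostic event pair only when both the time condition and this location condition are satisfied.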

Page 10: G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

Reasoning Logic

Rule-Based Reasoning Module

• Priority value in the diagnosis graph

– Assigned by operator

– Higher value means more confidence that the diagnostic event is the real root cause

– Can be examined by G-RCA’s Result Browser

• How does rule-based reasoning work?
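One simple reading of the priority mechanism: among the diagnostic events that joined with a symptom, report the one whose rule carries the highest priority. The sketch below flattens the diagnosis graph into a single priority table, which is a simplifying assumption; the event names and priority numbers are invented for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical operator-assigned priorities (higher means more confidence).
PRIORITIES: Dict[str, int] = {
    "interface-flap": 30,
    "cpu-overload": 20,
    "link-congestion": 10,
}


def pick_root_cause(joined_events: List[str],
                    priorities: Dict[str, int]) -> Optional[str]:
    """Return the joined diagnostic event with the highest priority, if any."""
    candidates = [e for e in joined_events if e in priorities]
    if not candidates:
        return None  # no rule matched: the symptom stays unexplained
    return max(candidates, key=priorities.__getitem__)


print(pick_root_cause(["link-congestion", "interface-flap"], PRIORITIES))
print(pick_root_cause([], PRIORITIES))
```

Returning `None` for unmatched symptoms keeps unexplained cases visible for later investigation.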

Page 11: G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

Diagnosis graph for BGP flaps root cause analysis

Page 12: G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

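Bayesian Inference

• Determining the root cause means identifying the class $r$ that produces the following maximum likelihood ratio:

$$\arg\max_{r} \frac{P(r \mid e_1, \dots, e_n)}{P(\bar{r} \mid e_1, \dots, e_n)} = \arg\max_{r} \underbrace{\frac{P(r)}{P(\bar{r})}}_{\text{first term}} \cdot \underbrace{\frac{P(e_1, \dots, e_n \mid r)}{P(e_1, \dots, e_n \mid \bar{r})}}_{\text{second term}}$$

– $r$: potential root causes (classes)

– $e_1, \dots, e_n$: presence or absence of the diagnostic evidence and of the symptom events themselves (features)

• When the features are conditionally independent

– The second term can be decoupled to

$$\prod_{i=1}^{n} \frac{P(e_i \mid r)}{P(e_i \mid \bar{r})}$$

• Parameter configuration (ratios of $P(r)/P(\bar{r})$ and $P(e_i \mid r)/P(e_i \mid \bar{r})$)

– bootstrap using the rule-based reasoning

– define a fuzzy type of discrete values

• Low, Medium, and High, which correspond to values 2, 100, and 20 000.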
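Under the conditional independence assumption, scoring a candidate root cause reduces to multiplying a prior ratio by one ratio per observed feature. The sketch below works in log space and uses the Low/Medium/High values from the slide; the candidate names, their ratio assignments, and the choice to count only features that are present are assumptions made for illustration.

```python
import math
from typing import Dict, Iterable

# Fuzzy ratio levels taken from the slide.
FUZZY = {"low": 2.0, "medium": 100.0, "high": 20000.0}


def log_ratio(prior: str, evidence: Dict[str, str], observed: Iterable[str]) -> float:
    """Log of (prior ratio) times the product of ratios for observed features."""
    score = math.log(FUZZY[prior])
    for feature in observed:
        if feature in evidence:  # assumption: absent features contribute nothing
            score += math.log(FUZZY[evidence[feature]])
    return score


def most_likely(candidates: Dict[str, dict], observed: Iterable[str]) -> str:
    """Return the candidate root cause with the largest likelihood ratio."""
    seen = set(observed)
    return max(
        candidates,
        key=lambda name: log_ratio(
            candidates[name]["prior"], candidates[name]["evidence"], seen
        ),
    )


# Hypothetical candidates and ratio levels.
CANDIDATES = {
    "line-card-failure": {
        "prior": "low",
        "evidence": {"interface-flap": "high", "bgp-flap": "medium"},
    },
    "cpu-overload": {
        "prior": "medium",
        "evidence": {"cpu-high": "high"},
    },
}

print(most_likely(CANDIDATES, ["interface-flap", "bgp-flap"]))
```

Summing logarithms avoids overflow when many High ratios are multiplied together, and it does not change which candidate wins.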

Page 13: G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

Comparison

• In operational practice, rule-based reasoning logic is often preferred over Bayesian inference

– Easier to configure

– Gives simple and direct association between the diagnosed root cause and the evidence

– Effective in most applications

• However, there are a few cases where Bayesian inference is preferred

– Root cause condition is unobservable

Page 14: G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

Domain Knowledge Building

● Issue: The specification of a diagnosis graph for an SQM application offered by an operator, especially the initial version, can be inaccurate and incomplete.

● G-RCA addresses this concern about incomplete diagnosis graphs through iterative use of the Correlation Tester and Result Browser.

○ First, the operator filters out the symptom events with known root causes, using the root cause classification capability provided in the Result Browser.

○ Second, the operator focuses on the remaining symptom events, comparing them with other suspected diagnostic events that occur at the same time and are spatially related to the service problem.

Page 15: G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

Domain Knowledge Building

● On one hand, the second step can be done via the manual drill-down and data exploration capability in the Result Browser;

● On the other hand, operators can also run the Correlation Tester blindly between the symptom events without known root causes and each type of suspected diagnostic event.

● As G-RCA emphasizes usability, the newly uncovered diagnosis rules need to be verified by operators before being incorporated into the diagnosis graph.
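The two-step loop can be sketched as: drop symptoms that already have a root cause, then rank candidate diagnostic event types by how often they accompany the remaining symptoms. In this illustration a plain co-occurrence fraction stands in for whatever test the Correlation Tester applies; the record layout, the field names `root_cause` and `nearby`, and the `min_support` threshold are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def suggest_rules(symptoms: List[Dict], min_support: float = 0.5) -> List[Tuple[str, float]]:
    """Rank diagnostic event types seen near symptoms that lack a root cause."""
    # Step 1: keep only symptoms without a known root cause.
    unexplained = [s for s in symptoms if s["root_cause"] is None]
    if not unexplained:
        return []
    # Step 2: count each nearby event type once per unexplained symptom.
    counts: Counter = Counter()
    for s in unexplained:
        counts.update(set(s["nearby"]))
    total = len(unexplained)
    ranked = [(event, n / total) for event, n in counts.items() if n / total >= min_support]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical symptom records; "nearby" lists events already joined in time and space.
SYMPTOMS = [
    {"id": 1, "root_cause": "interface-flap", "nearby": ["interface-flap"]},
    {"id": 2, "root_cause": None, "nearby": ["router-reboot", "cpu-high"]},
    {"id": 3, "root_cause": None, "nearby": ["router-reboot"]},
    {"id": 4, "root_cause": None, "nearby": ["config-change"]},
]

print(suggest_rules(SYMPTOMS))
```

The output is only a list of suggestions: each candidate rule would still need operator verification before it is added to the diagnosis graph.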

Page 16: G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

Introduction of G-RCA Applications

• The key advantage of G-RCA in SQM is its capability to be rapidly customized into different RCA applications in the ISP’s network.

• In this section, the following three case studies are included to demonstrate the effectiveness of G-RCA:

– 1) customer BGP flaps

– 2) end-to-end throughput management in a CDN service

– 3) network PIM flaps in multicast VPN

Page 17: G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

BGP Flaps Root Cause Analysis

Purpose: Understanding the root cause of flaps.

● Achieving this using G-RCA by constructing application-specific events and rules.

○ Starting by constructing our BGP flap-specific events.

○ Adding a few application-specific diagnosis rules.

○ Specifying priorities for different diagnosis rules for BGP flaps RCA. (Please refer to the figure of “Diagnosis graph for BGP flaps root cause analysis” shown in the previous slides)

Application-specific events for BGP flaps root cause analysis

Page 18: G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

Conclusion

1. G-RCA captures the layered network model in its knowledge library, by implementing

- temporal/spatial correlation,

- rule-based reasoning, and

- Bayesian inference.

2. Domain knowledge in an existing RCA application can be refined by the interaction between the RCA engine and the Correlation Tester.

3. In order to analyse a large number of service quality issues and to classify and trend their root causes, it proactively collects all types of data from different sources and normalizes them in real time.