rick schlichting - inpe

48
© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies. Diagnosis in Practice Rick Schlichting Executive Director, Software Systems Research AT&T Labs Florham Park, NJ 07932, US 1

Upload: others

Post on 02-Jan-2022

13 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Diagnosis in Practice

Rick Schlichting

Executive Director, Software Systems Research

AT&T Labs

Florham Park, NJ 07932, US

1

Page 2: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Special thanks to Matti Hiltunen and Jen Yates from AT&T for slides and advice on the talk.

2

Page 3: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Diagnosis

Three fundamental steps: 1.  Detecting that something is wrong 2.  Figuring out what is causing the problem (diagnosis, finger

pointing, root cause analysis) 3.  Fixing it (repair)

But — lots of questions and issues in realistic networked systems:

•  What are “something” and “wrong”?

•  Is the source of the problem under my control?

•  Is it possible for me to fix the problem?

•  How to deal with scale?

•  ….

3

Page 4: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

1970 1980 1990 2000

system diagnosis membership

PMC model (Preparata, Metze & Chien, 1967)

Probabilistic diagnosis Distributed diagnosis

Context

System Diagnosis

Group Membership

Failure Diagnosis

failure diagnosis

Page 5: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Talk

Focus on enterprise- and ISP-scale infrastructure and services

Two themes:

•  Reality

•  Two example research projects that attempt to deal with the challenges

Goals:

•  A better understanding of the issues involved

•  Energized to work on the challenges!

5

Page 6: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

So what’s the reality?

So What’s the Reality?

➥ Scale and Complexity

Page 7: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

The Visible Internet

7

Page 8: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Global Networks (AT&T) Over 20 Petabytes of Traffic Average Business Day

MPLS-based Services in 140+ countries at 3800+ service nodes 35+ data centers on 4 continents Over 875,000 route miles

Page 9: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Page 9

IP Networks

PoP: Point-of-Presence P: Backbone (core) Router PE: Provider Edge Router CE: Customer Edge Router

Rough stats: 100s of offices 100s of Ps, 1000s of PEs, 10000s of CEs 100,000s of transport facilities CPE Access

Intercity

Metro

CE

CPE CE

PE

P

Customer facing PE interfaces

P

P

P P

P P

P P

PoP PE

PE

PE

OC-48, OC-192, OC-768 DWDM

DWDM systems

LEC

PoP PoP

P

P PE

Page 10: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Access Networks

MSP

AT&T Local Network

Hig

hes

t

Multi-Services Platform (MSP)

Metro WDM

MSP

DSL

Cab

le

Plan

t

LEC Colo

LEC L

oop

Cable Modem

Cable Head End

AT&T Local Node

Line of Sight Wireless (Microwave)

AT&T Local Node

WDM

WiFi

AT&T Core Network

VSAT Systems

AT&T Core Pop

MPLS Network / VoIP

PSTN

per Customer Lowest Highest Bandwidth

LEC loop Most Prevalent

LEC LSO

Low

est

AT&T Core Pop

VSAT Head End

WiFi Access Pt.

Cellular Voice & Data

Non-Line of Sight Fixed Wireless via WiMax

Wir

ele

ss

Wir

elin

e

Page 11: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

SMS Gateway

SMS Gateway

SMS Gateway

Invizeon HTTP

Gateway

SOAP Auth Services

EMN Server1

EMN Server2

EMN Server4

Alerting Web Service

Infolet

ACK Infolet

Oracle 1

HTTP Gateway

Gateways Servers

Provisioning Web Service

Infolet

Enterprise Applications

SMS Gateway

FAX Gateway

Voice Gateway

Ap

ach

e

HTTPS

JMS Service Sonic 2

JMS Service Sonic 1

Oracle 2

MD MD ACK

Alert

MD MD ACK

Alert

MD MD

Alert

MD MD

Alert

Replicated

SOAP

EMN Server3

Billing Infolet ? Email

Gateway Email

Gateway

SM

TP

Srv

U

NIM

OB

ILE I

nc.

V

oic

eg

en

ie

MD – Mobile device

Page 12: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Page 12

Execution Infrastructure

Dynamic shared environments

•  Cloud computing and utility computing.

•  VMs

No longer as simple as server + OS

•  Layers upon layers of software

Page 13: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Massive scale 100s of offices, 1000s of routers,

10,000s of interfaces, Millions of consumers

Immense software complexity Scale, Bugs, Interactions

Diverse technologies and vendors Layer-1, Layer-2, Switches, Routers,

IP, Multicast, MPLS, wireless access points Continuous evolution

Upgrades, Installations

Applications Scale, sensitivity

Page 14: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Other Practical Implications

Size: •  Potentially multiple independent problems may be present in the

system at the same time •  Data volume

Data quality: •  Limited forms of event data available (e.g., trouble tickets, monitor

outputs, some logs, only failure data) •  Data available for only part of the system •  Gaps in data, delay in data collection •  Correlation of data from heterogeneous systems

Page 15: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

So what’s the reality?

So What’s the Reality?

➥ Spectrum of issues

Page 16: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Specially Designed Recovery Equipment More than a $0.5 Billion investment in over 600 trailers and support vehicles

containing network technology, infrastructure, and support elements.

AT&T Network Disaster Recovery

Page 17: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Security Network Traffic Analysis Customer Business Issues

Issues and Challenges:

•  Networks are becoming more complex

–  Thousands of vulnerable targets and hundreds of applications

–  Different threats such as internal misuse, zero-day attacks, etc.

–  Variety of security services 24x7 at low cost

•  Disruptive network attacks could negatively impact business

Solution should address:

•  Ability to analyze security network traffic

•  Alert and Notification functionality

•  Capacity to track and report historical traffic

•  Minimum “Total Cost of Ownership” solutions

17

Page 18: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

The AT&T Internet Protect® Service Family

18

AT&T Internet Protect® My Internet Protect DDoS Defense

Private Intranet Protect

Scope •  Security alerts & notification for identified threats impacting AT&T IP Network

•  Security alerts & notification for identified threats entering customer’s network

•  Security alerts and mitigation of identified DDoS attacks impacting customer’s network

•  Security alerts, notification, and analysis for identified threats within customer’s network

Features •  Notification via email/paging

•  Mitigation recommendations

•  Tracks private network IP addresses

•  Helps detect profile and misuse anomalies

•  Tracks private network IP addresses

•  Helps detect profile and misuse anomalies

•  DDoS mitigation

•  Alerts based on unsampled network data

Benefits •  Early warning of identified possible attacks

•  Requires no additional hardware or software

•  Measure anomaly traffic volume flowing towards the private network

•  Requires no additional hardware or software

•  Designed to help pro-actively eliminate DDoS attacks before they penetrate private network

•  Requires no additional hardware or software

•  Perform security analysis within and outside private network

•  Requires no additional hardware or software

Reports •  Alert Summary •  Alert Details •  Port Summary

•  Traffic Summary •  TCP Application

Summary •  UDP Application

Summary

•  Traffic Summary •  TCP Application

Summary •  UDP Application

Summary

•  Host Details •  Service Details •  Group Details •  Compliance with

customer’s security policies

Requirements •  None •  AT&T Internet Protect®

•  MIS, or GMIS •  AT&T Internet Protect®

•  MIS •  AT&T Internet Protect®

•  EVPN

Page 19: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Failures at different layers •  Network, hardware, software, software configuration

Different failure types •  Crash, performance, omission, quality, value, …

Permanent and transient failures

Chronic failures •  low impact, repeating

➥  Focus increasingly on service quality management (SQM)

•  individual component or network element -> end-to-end service quality – Machine, router, link, line-card -> User perceived performance in a CDN or VPN service

• hard failure -> transient problem –  Fiber cut, router failure, crash -> transient performance degradation, protocol flap

Page 20: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

So what’s the reality?

So What’s the Reality?

➥ Current practice

Page 21: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Global Network | Center-based Operations

Switched Voice

Voice Switching

Transport

Transport

IP Backbone

IP Backbone

Mobility

Mobility

Networked Service

Operations

Networked Service

Operations

Tier 1 Alarms

Network and Service Applications Reliability Centers Tier 1 & 2 Break-Fix

Incident Management COE Professional Outage Management

Advanced Technical Services Advanced Technical Support

Global Network Operations Center Overarching Command & Control

ESCALA

TIO

N

TH

RESH

OLD

SS

Page 22: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

AT&T Global Networks Operations Center (GNOC)

Page 23: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Page 23 7

Customization

FinancialFinancialBilling DataCollection

NonSNMP

Auto-Ticketing

Data Warehouse

Event

Correlation

Root Cause

Analysis

Incidents - AOTS

Service Delivery GPS, EFMS, NC3

Web Service / Methods

ICMP, Polling,

SNMP Traps

Thre

shol

dA

larm

sEnhanced

Event

Handling

Architecture

Configuration

PMOSS

XML GatewayE-Bonds

B2BiG F P

EMS’s

Performance Reporting & Capacity Planning

SNMP DevicesLANFRADSRoutersG3 PBX

Unix, NTApplications

iGEMS Architecture based on ITILPortal for Customers & Operations

Auto Notifications

C-BusSNMPTrap

Gateway

HSPS

CBB

Other

Stewardship & SLA Reports

Performance & Capacity Trending and Thresholds

Historical

Operation

Reports

Surveillance

Planning

BusinessAutomation

RUBY

CTP TA

G F P

http://www.naspi.org/meetings/workgroup/2008_march/presentations/ibm_network_managment_nicoletti.pdf

Page 24: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

G-RCA: A Generic Root Cause Analysis Platform

He Yan1, Lee Breslau2, Zihui Ge2, Dan Massey1, Dan Pei2, Jennifer Yates2

1Colorado State Univ, 2AT&T

24

ACM CoNEXT 2010, Nov 30-Dec 3, 2010, Philadelphia, PA, US

Page 25: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Service Quality Management (SQM) and Root Cause Analysis (RCA)

Customer location A

Customer location B

SYSLOG, SNMP, Router

Configuration, Router

Command logs

performance degradation

between location A and B.

ISP

Customer location C

??? SYSLOG, SNMP, Router

Configuration, Router

Command logs

SYSLOG, SNMP, Router

Configuration, Router

Command logs

CPU LOAD

SYSLOG, SNMP, Router

Configuration, Router

Command logs

SYSLOG, SNMP, Router

Configuration, Router

Command logs

SONET/DWDM

Alarms and logs

SONET/DWDM

Alarms and logs

SONET/DWDM

Alarms and logs

SONET/DWDM

Alarms and logs

SONET/DWDM

Alarms and logs

SONET/DWDM

Alarms and logs

SONET/DWDM

Alarms and logs

Other Customer

Other Customer

Other Customer Other

Customer Other Customer

Other Customer

Other Customer

Other Customer

CDN VPN

VOIP Mobility

IPTV

Page 26: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Challenges in RCA for SQM

Complex service dependency model

•  The quality of a VPN connection across the ISP network depends on the status of the routers and links along the network path carrying the traffic, which is dynamically determined based on the link weights

Ever-changing environment •  New services (e.g., multicast VPN), new technologies (e.g., MPLS TE), new devices

(e.g., OC768 line cards) are introduced into ISP networks at a fast rate

•  Operators need new RCA tools for new services

•  RCA tool needs to deal with imperfect domain Knowledge

Data Collection •  Proactively collect data (alarms, logs and performance measurements) to enable the

analysis of historical transient service disruptions

Scalability •  Analyze a large number of transient service disruptions over extended period to

determine the primary root causes

•  Dig into millions of alarms and logs distributed all over the network.

Page 27: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

G-RCA Design

Generic RCA Engine Temporal/

Spatial Correlation

All events and their associated root causes

New Diagnosis Rule Miner

Network Operators

layer-1 alarms

router logs

SNMP MIBs and Traps routing

data

router command logs

router configurations

end to end measurements

Data Collector

Events Definition

Service Dependency

Model

Application Diagnosis Graph

Only events with unknown root causes

Result Browser • Classify/Trend

Root Causes • Manual Drilling-

Through Data

Network Operators

New diagnosis rules

Verified new diagnosis rules

Spatial Model

Root cause reasoning

RCA Knowledge Library

Page 28: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

RCA Knowledge Library

➥ End result is Application Diagnostic Graph

Spatial Model •  Topological information, cross-layer dependency, logical and physical device

association

•  Dynamic routing •  Mapping is time dependent

Event •  A signature that captures a particular type of network condition, e.g. Interface flap or

router reboot

•  G-RCA pre-defines and implements a wide range of commonly used events.

Service Dependency Model •  Consists of diagnosis rules •  Symptom and diagnostic events are picked from event definitions, e.g.: interface flap

-> line protocol flap

•  G-RCA pre-defines the commonly used diagnosis rules

28

Page 29: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Generic RCA Engine

Temporal Correlation Module •  Given a symptom event instance, find out the temporally correlated the

diagnostic event instances •  Imperfect timing

Spatial Correlation Module •  Given a symptom event instance, find out the spatially correlated

diagnostic event instances •  Indirect mapping

Root Cause Reasoning Module • Determine the root cause of a symptom event instance based on

temporally and spatially correlated diagnostic event instances.

• Rule-based decision-tree-like reasoning – associate a priority value for each edge in the diagnosis graph – The higher the priority value, the more likely the diagnostic event to be the real root

cause

29

Page 30: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Example: BGP flaps root cause analysis

Minimize the number of eBGP session flaps

A challenging problem across the trust domain

eBGP Session

Page 31: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Application Diagnosis Graph

interface flap

line protocol

flap

ebgp flap

BGP hold timer

expiration

Layer-1 outage

router reboot customer

reset session

Layer-1 flaps

200 200

120

120

150

150

18

0

180

CPU high (average)

CPU high

(spike)

Page 32: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

5 provider edge routers, each with several hundred eBGP sessions with customer routers

An interesting discovery!

Operational Experience

Page 33: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Key Take Away

A generic, automated and scalable root cause analysis platform for SQM in large IP networks

33

Page 34: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Draco: Statistical Diagnosis in VOIP Systems Soila Pertet Kavula1, Kaustubh Joshi2, Matti Hiltunen2, Scott Daniels2, Rajeev Gandhi1, Priya Narasimhan1

1Carnegie-Mellon Univ, 2AT&T

34

http://www.pdl.cmu.edu/Publications/

Page 35: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Motivation

Business VoIP services – 10+ service types, 200+ network elements – Millions of calls per day, growing rapidly

Faults occur continuously in the system – Minor incidents + major incidents, e.g., failed upgrades, can increase fault rate

– Faults cause failed calls (blocked or cut-off calls)

Research question – How to diagnose faults that have never happened before? – Faults due to combinations of unexpected events?

Page 36: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

VoIP SIP Call Flow

IP Base Element (IPBE)

1. Calling party sends SIP invite

Call Control Element (CCE)

2. Get customer charge number & IP

Application Server (AS)

3. Executes service logic.

4. Send routing number for destination

Gateway Server

Exchange (GSX)

6. Call established.

5. Send invite to destination. Call rings.

Call Setup

Data Transfer

Page 37: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

VoIP Call Detail Records (CDRs)

Network elements record call outcome in CDRs – Timestamp, name of network element, service type – Telephone numbers, customer IP addresses – Outcome of call (successful, blocked, or cut-off call)

Blocked calls: Fail during setup

IPBE CCE AS ×

IPBE CCE AS

GSX

×

Cut-off calls: Failure after setup

Page 38: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Adopt a technique from software debugging literature: – Liu, C., Lian, Z., and Han, J. “How Bayesians Debug”. In Proc. 6th IEEE Int.

Conf. on Data Mining. Dec. 2006.

Operates on CDR “attributes” – Service types, defect codes, network element names – Customer IP addresses – Easily extended to include additional call attributes

– e.g. software versions, QOS data

➥ Identifies call attributes most correlated with failed calls

Steps: 1.  Extract call attributes from CDRs and model using a truth table 2.  Compute distribution of each attribute in failed and successful calls 3.  Iteratively select attributes that best differentiate failed from successful calls 4.  Rank problems based on severity

Iterative Bayesian Approach

Page 39: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

“Truth table” Call Representation

ny4ny01sdh ph4pa0102bap at4ga03wap ny4ny02gh Call Outcome 1 1 0 1 SUCCESS 1 0 1 1 FAIL

Successful Call • ny4ny01sdh • ph4pa0102bap • ny4ny02gh •  ….

Failed Call • ny4ny01sdh • at4ga03wap • ny4ny02gh •  ….

Page 40: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Suspect Attribute Identification

In each call: – Assume each attribute has a stable but unknown occurrence probability. – Reflects service volume, call distribution, routing, etc.

Bayesian estimation: – Construct failure/success distributions for attribute occurrence probabilities: • P[Attribute occurs in failed calls], • P[Attribute occurs in successful call]

– Each CDR updates either failure or success distribution

Distribution divergence – Compute the difference between success and failure distributions – Attribute with largest divergence is chosen as the suspect attribute

Page 41: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Success and Failure Attribute Distributions

Page 42: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Iterative Bayesian approach

Blo.P.ATTEMPT. SIP.1.2.3.4.5.orig ny4ny02gh

IP: 123.456.789.0 at4ga03wap

ph4pa04wap ny4ny01sdh FlexReach

1.   Select top k attributes prevalent in defective vs. successful calls

2.   Select attribute that best explains difference in defective and successful calls

Merge attributes present in same set of calls

3.   Update counts

4.   Iterate steps 1-3

5.   Diagnosis is path from root to leaf

Page 43: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Diagnosis Tool Architecture

Network Elements Raw Call Logs

Defect code: 1

Hostname: SVR3

Service: Service1

TN: 555-555-5555

Start: 2010/12/01 06:45

....

Data Collectors

(consolidated end-to-end logs)

Diagnosis

EngineData Collectors

Draco

Defect

History

43

Datasets are 10s of millions of records!

Page 44: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Operator Dashboard

Defect Signature1

Sample callsDefect code: 1 Origin: IP Service: Service1 TN: 555-555-5555 Start: 2010/12/01 06:45 End: 2010/12/01 06:50

Defect history

2010/12/20

+ +

SRV3PHONE1

1. Browse defects

Platform defectsNon-platform defectsAll defectsService A

By date

By type

2. View ranked list of defects

Calls affected: 100 % of total defects: 50%

Date - 2011/1/1Total calls: 1000000Total defects: 1000

3. View samples of calls affected

4. View defect frequency for day

5. Identify chronic problems

2010/12/23

00:00 24:00

Service B

44

Page 45: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Case Study

Time in GMT

00:00 12:00 00:00 12:00 00:00 12:00 00:00 12:00 00:00 12:00 00:00 12:00

Faile

d ca

lls

Day1 Day2 Day3 Day4 Day5 Day6

Incident 1Incident 2

Chronic nightly problem due tounsupported fax codec Unrelated chronic server

problem emergesCustomers stop usingunsupported fax codec

Problematic server reset

45

Used to assist in identifying root cases of chronic defects:

•  Low impact defects that persist for days or weeks •  Six months of use

Success stories

Page 46: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Key take Away

A top down statistical approach that minimizes the need for domain expertise and implemented in a scalable tool

46

Page 47: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

Conclusions

Lessons: • Scale and complexity make the reality daunting! • Examples—G-RCA and Draco—illustrate different techniques for

dealing with these. • Importance of data handling. • Service quality as a key metric. • Value of working with real data and real systems.

➥  Many years of research left!

47

Page 48: Rick Schlichting - INPE

© 2011 AT&T Intellectual Property. All rights reserved. AT&T, AT&T logo and all other marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies.

48