mplane reasoner(s) & analysis modules pedro casas ftw vienna mplane final workshop 30 november...

mPlane Reasoner(s) & Analysis Modules

Pedro Casas, FTW Vienna

mPlane final workshop, 30 November 2015, Heidelberg

[Figure: mPlane architecture for iterative measurement. Measurement Layer (WP2): mProbes 1..N and legacy probes 1..N, each exposed through an mInterface, delivering raw data. Repository and Analysis Layer (WP3): mPlane repositories (DBStream, Blockmon) and legacy DBs 1..N for data collection and processing. WP4: Supervisor, Intelligent Reasoner and Analysis Modules 1..N, providing coordination, automation and analysis.]

Outline

The Useful – Coordination and Analysis

The mPlane Reasoner(s)

Analysis Modules

WP4 Overview

Intelligent Reasoner for Iterative and Adaptive Analysis: guides and automates the iterative measurement, exploration and diagnosis process

Monitoring Data Analysis Modules: complex data analysis, high visibility; filter data accessed at Repos, very specific data (low volume) from probes

Supervisor: the glue of the mPlane protocol; provides centralized control of the distributed measurement framework

Useful

The Reasoner is responsible for driving the measurement analysis process, which by nature is iterative, and ideally adaptive (learning).

Depending on the use-case, the Reasoner has different roles:

In the case of troubleshooting support: iteratively find the Root Causes of the associated problems

In the case of generic measurement analysis: automate the iterative process

Each use case defines/instantiates a specific Reasoner addressing its goals

Still, generic design rules of a specific Reasoner can be reused in other use cases

The mPlane Reasoner

The Reasoner – Components

The Reasoner consists of 3 different blocks:

The Knowledge Structure: the memory or knowledge of the system. Initially based on expert domain knowledge (diagnosis rules). Extended by learning from past experiences (knowledge discovery).

The Reasoning/Diagnosis Process: automates/structures the iterative analysis.

The Knowledge Discovery Process: enriches the knowledge structure and the reasoning process. Based on learning (supervised/unsupervised).

The Reasoner – The Overall Picture

[Figure: the “Knowledge” of the Reasoner (what I know) drives the Reasoning/Diagnosis Process (automate analysis, based on what I know); Knowledge Discovery extends it through (un)supervised learning.]

The Diagnosis Process (1/2)

The Reasoner does not work on raw data, but on events

An event captures a particular type of network conditions

E.g., link congestion, YouTube throughput drop, overloaded cell, Google CDN load-balancing, etc.

Events are extracted from raw measurements through a retrieval process (actual algorithms at WP2, WP4, queries, etc.)

Events are defined as m-tuples including the following fields:

event name: e.g., link overload.

location type: e.g., Gn downlink interface.

time span: e.g., 2013-10-21-12:30:00, 2013-10-21-12:35:00.

retrieval process: e.g., Simple Link Congestion Detection Algorithm – SLCDA (with utilization threshold Cth).

additional diagnosis features: e.g., number of flows, number of bytes, list of server IPs originating the flows, etc.
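As an illustration, the event m-tuple could be represented as a small record type. This is a minimal sketch, not the mPlane data model: the class name, the field names, the `overlaps` helper and the feature values are all assumptions chosen to mirror the fields listed above.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Tuple


@dataclass
class Event:
    """One Reasoner event; fields mirror the m-tuple described on the slide."""
    name: str                              # event name, e.g. "link overload"
    location_type: str                     # e.g. "Gn downlink interface"
    time_span: Tuple[datetime, datetime]   # (start, end) of the event
    retrieval_process: str                 # algorithm or query that produced the event
    features: Dict[str, Any] = field(default_factory=dict)  # additional diagnosis features

    def overlaps(self, other: "Event") -> bool:
        """True if the two time spans intersect (hypothetical helper for temporal correlation)."""
        return (self.time_span[0] <= other.time_span[1]
                and other.time_span[0] <= self.time_span[1])


congestion = Event(
    name="link overload",
    location_type="Gn downlink interface",
    time_span=(datetime(2013, 10, 21, 12, 30), datetime(2013, 10, 21, 12, 35)),
    retrieval_process="SLCDA (utilization threshold Cth)",
    features={"n_flows": 1200, "n_bytes": 3_500_000},  # illustrative values only
)
```

Keeping the time span as a pair of timestamps makes the temporal correlation between events a one-line comparison.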

The Diagnosis Process (2/2)

Some examples of events related to Root Cause Analysis (RCA)

1. A congested Gn interface in a mobile ISP during 5’:

2. An anomaly detected in YouTube traffic, impacting users’ QoE for 5’:

Diagnosis Graph (1/4)

Relates problems/issues with events and root causes, exploring the temporal and spatial relationships between events

Which type of diagnosis graph reasoning? Rule-based reasoning (decision-tree like graph)

Easier to implement and configure (easy to add domain knowledge)

Gives a simple and direct association between the diagnosed root cause and the evidence(s), for better interpretation

It is very effective in practice

Other types of iterative reasoning can be implemented in the same way (not only RCA, but generic iterative measurement processes)

Using per use-case graphs, the Reasoner looks for the presence of events, and identifies the root cause as the leaf with the highest probability
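The rule-based traversal can be sketched as follows. This is only an illustration under stated assumptions: the `Node` structure, the example graph, the event names and the probabilities are hypothetical, and the real diagnosis graphs are defined per use case.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Node:
    label: str                # question for inner nodes, root cause for leaves
    probability: float = 0.0  # only meaningful on leaves (assumed to be expert-assigned)
    # (required event name, child node) pairs; an empty list marks a leaf
    children: List[Tuple[str, "Node"]] = field(default_factory=list)


def reachable_leaves(node: Node, observed: Set[str]) -> List[Node]:
    """Follow every branch whose guarding event was observed and collect the leaves."""
    if not node.children:
        return [node]
    leaves: List[Node] = []
    for required_event, child in node.children:
        if required_event in observed:
            leaves.extend(reachable_leaves(child, observed))
    return leaves


def diagnose(root: Node, observed: Set[str]) -> Optional[Node]:
    """Return the reachable leaf with the highest probability, or None if no rule fires."""
    return max(reachable_leaves(root, observed),
               key=lambda leaf: leaf.probability, default=None)


# Hypothetical two-rule graph for a YouTube QoE anomaly
graph = Node("YouTube QoE anomaly?", children=[
    ("link congestion", Node("ISP link congestion", probability=0.7)),
    ("CDN server change", Node("CDN load-balancing", probability=0.5)),
])
```

Calling `diagnose(graph, {"link congestion", "CDN server change"})` reaches both leaves and keeps the more probable one, so the evidence behind the diagnosis is simply the chain of events along that branch.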

Example: Who to blame when YouTube is not working?

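[Figure: end-to-end path from the user devices through the ISP network and the Internet (AS 1, AS 2) to the Google CDN (G-CDN). Who is to blame: Devices? ISP? Internet? YouTube?]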

Diagnosis Graph (2/4)

An example of a Diagnosis Graph (DG) associated to the detection and RCA of QoE-relevant anomalies in YouTube:

In the example, the DG is structured in 5 different macro-blocks:

① QoE-relevant Anomaly Detection block

② End-device Diagnosis block

③ ISP Diagnosis block

④ Internet paths Diagnosis block

⑤ CDN servers Diagnosis block

Example of root causes and the associated rules’ description

ISP Diagnosis block

Purpose: detect QoE degradation

BASIC PROCESS:

1) Continuous passive monitoring

2) Trigger of active monitoring in case of alarms

Diagnosis Graph (3/4)

High level Diagnosis Graph for ISP (simplified from D4.2):

[Figure: decision flowchart. Decision nodes: "Alarms from different POPs?", "Alarms from different BRAS?", inter-domain measurements check. Actions: triggers Internet Active Probe, triggers POP Active Probe, triggers DSLAM Active Probe. Diagnosed outcomes: issue external to SP domain, issue in SP Core Network, issue on BRAS, issue on DSLAM, issue on Access Lines.]

Diagnosis Graph (4/4)

Knowledge Discovery

Domain knowledge and operational experience are incomplete (using only domain-based diagnosis graphs limits the system capabilities)

Therefore, an initial diagnosis graph can under-perform, both in accuracy and completeness

The role of Automatic Knowledge Discovery: correlate all the events that occur at the same time and are spatially related to the service problem under investigation…

…and learn new diagnosis rules (new knowledge) from past experiences

Supervised learning in case of labeled data

Unsupervised learning in the general case

Some mPlane techniques: Automatic Rule Mining, Sub-Space Clustering, Decision-Tree Learning

Final expert intervention to validate the identified diagnosis rules, which are then added to the Knowledge Structure
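To make the rule-learning step concrete, here is a deliberately simple sketch of supervised rule mining from labeled past incidents. It is not the mPlane rule-mining algorithm: the support/confidence thresholds, the function name and the incident data are assumptions for illustration. The output is a list of candidate rules that an expert would still have to validate.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Rule = Tuple[str, str, int, float]  # (event, root cause, support, confidence)


def mine_rules(incidents: Iterable[Tuple[Set[str], str]],
               min_support: int = 2,
               min_confidence: float = 0.8) -> List[Rule]:
    """Propose 'event -> root cause' rules from incidents given as (events, labeled cause)."""
    event_count: Counter = Counter()
    pair_count: Counter = Counter()
    for events, cause in incidents:
        for event in set(events):
            event_count[event] += 1
            pair_count[(event, cause)] += 1
    rules: List[Rule] = []
    for (event, cause), support in pair_count.items():
        confidence = support / event_count[event]  # share of this event's incidents with this cause
        if support >= min_support and confidence >= min_confidence:
            rules.append((event, cause, support, confidence))
    # Most reliable rules first
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0]))


# Synthetic incident history, for illustration only
history = [
    ({"dns spike", "push errors"}, "push service outage"),
    ({"dns spike", "push errors"}, "push service outage"),
    ({"dns spike"}, "daily synchronization"),
    ({"link congestion"}, "ISP congestion"),
]
candidate_rules = mine_rules(history)
```

In this toy history only "push errors" is both frequent enough and specific enough to be proposed; "dns spike" alone is ambiguous because it also appears with a different cause.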

Multiple mPlane Reasoners

An mPlane Reasoner is an extended mPlane client, which performs sequential tasks based on intermediate analysis results, acting through the mPlane Supervisor interfaces

In practice, we implemented different Reasoners following the aforementioned principles, but tailored to the specific needs of each use case:

1. Reasoner in nodejs: basic mPlane Reasoner

2. Reasoner for Content Popularity Estimation

3. Reasoner for Content Curation

4. Reasoner for Web browsing QoE

5. Reasoner for Mobile Network RCA

6. Reasoner for Anomaly Detection and RCA

7. Reasoner for SLA Verification

8. Reasoner for Multimedia Content Delivery Analysis

9. Reasoner for GLIMPSE

Analysis Modules or Algorithms further evaluate the measurements gathered and pre-analyzed by the lower layers of mPlane

They operate on low amounts of data (as compared with the data available on WP3 or eventually gathered at WP2)

Analysis Modules

Per-use case algorithms

The main Analysis Modules are linked to the proposed use cases:

Find the cause of Quality of Experience (QoE) degradations

Estimate the future popularity trends of services and contents for network optimization

Classify and promote interesting web content to end-users

Assess and troubleshoot performance and quality of multimedia stream delivery

Diagnose performance issues in web and identify the segment that is responsible for the QoE degradation

Find root cause of problems related to connectivity and poor QoE on mobile devices

Detect and diagnose anomalies in Internet-scale services (e.g., CDN-based services)

Verify SLAs

…but there is more

QoE

QoE-based monitoring for YouTube: metrics to detect playback stallings

Relate OWD variation to QoE, for generic class of applications

Topology

Detect Anycast Services: determine if a service uses IP anycast

Reverse Traceroute – DisNETPerf: find probes near some point of interest in the network to launch active measurements

Topology discovery: identification of middle boxes, TCP proxies and NATs

MPLS transit tunnel analysis: classification of MPLS tunnels based on their usage/purpose (mono-path, ECMP, multi-FEC, etc.)

Topology/Performance

Analyze dynamics of forwarding and routing paths: determine whether routing paths follow perturbations experienced by forwarding paths or vice versa

Prediction of Unmeasured Paths: inference of path properties (RTT, Available Bandwidth, etc.) on unmeasured network paths

Some Extended Analysis Modules

Partial Mapping of Analysis Modules to Use Cases

Reasoners and Analysis Modules (as well as everything presented so far during the day) are available at the mPlane website as software tools:

https://www.ict-mplane.eu/public/software


[Figure: MOS (1 to 5) versus number of stallings (0 to 6), crowdsourcing and laboratory results, 4 seconds of stalling. Panels: lab studies; on the real mobile network.]

1 single stalling event heavily deteriorates the experience of the end-user

2 or more stallings already means bad quality

Duration of the stallings is less critical, but also has an important impact on QoE

Stallings are the impairments perceived by the end-user (independently of the video resolution, or even DASH)

MOS = F(N, L)

Selected examples I: YouTube QoE

We introduced a simple KPI to monitor YouTube QoE from passive network measurements

Buffer depletion generally occurs because the downlink bandwidth (DBW) is lower than the video bitrate (VBR)

Example: standard 360p YouTube videos, VBR = 600 kbps → DBW > 750 kbps
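A minimal sketch of such a KPI, assuming it is the ratio between downlink bandwidth and video bitrate. The 1.25 margin is simply 750/600 from the 360p example; treating it as a general threshold, as well as the function name and signature, are assumptions.

```python
def stalling_risk(downlink_kbps: float, video_bitrate_kbps: float,
                  margin: float = 1.25) -> bool:
    """Flag a flow as at risk of stalling when the downlink bandwidth does not
    exceed the video bitrate by the given margin (1.25 = 750/600, assumed)."""
    if video_bitrate_kbps <= 0:
        raise ValueError("video bitrate must be positive")
    return downlink_kbps / video_bitrate_kbps <= margin


# 360p example: VBR = 600 kbps
at_risk = stalling_risk(500, 600)   # downlink below the video bitrate
smooth = stalling_risk(900, 600)    # downlink comfortably above 750 kbps
```

Both inputs can be estimated from passive flow measurements, which is what makes a ratio of this kind usable as a network-side QoE indicator.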

Stallings and Download Throughput

Selected examples II: Anomaly Detection and Diagnosis

We conceived a statistical AD tool which works with full feature distributions

AD algorithm consists of two phases:

(1) Reference-Set identification: find past traffic distributions which are a suitable reference of normality

(2) AD test: use a normalized variant of the Kullback-Leibler divergence to decide if current distribution is compatible with the reference-set

[Figure: CDFs of a feature for three distributions x1, x2, x3. x1 and x2 are similar → L(x1,x2) is small; x1 and x3 are dissimilar → L(x1,x3) is large.]
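The AD test can be illustrated with one possible normalization of the Kullback-Leibler divergence. The slides do not give the exact formula, so the symmetrized, entropy-normalized form used here, the smoothing constant, the decision rule in `is_anomalous` and the histograms are all assumptions.

```python
import math
from typing import List, Sequence


def _to_distribution(counts: Sequence[float], eps: float = 1e-9) -> List[float]:
    """Turn histogram counts into probabilities, smoothing empty bins."""
    smoothed = [c + eps for c in counts]
    total = sum(smoothed)
    return [s / total for s in smoothed]


def _kl(p: Sequence[float], q: Sequence[float]) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def _entropy(p: Sequence[float]) -> float:
    return -sum(pi * math.log(pi) for pi in p)


def divergence(counts_a: Sequence[float], counts_b: Sequence[float]) -> float:
    """L(a, b): symmetrized KL divergence, each term normalized by the entropy (assumed form)."""
    if len(counts_a) != len(counts_b) or len(counts_a) < 2:
        raise ValueError("histograms must share the same binning, with at least 2 bins")
    p, q = _to_distribution(counts_a), _to_distribution(counts_b)
    return 0.5 * (_kl(p, q) / _entropy(p) + _kl(q, p) / _entropy(q))


def is_anomalous(current: Sequence[float], reference_set: List[Sequence[float]],
                 threshold: float) -> bool:
    """Hypothetical rule: anomalous if the current distribution is far from every reference."""
    return min(divergence(current, ref) for ref in reference_set) > threshold


# Synthetic histograms over the same four bins
x1, x2, x3 = [10, 30, 40, 20], [12, 28, 41, 19], [60, 25, 10, 5]
small, large = divergence(x1, x2), divergence(x1, x3)
```

With these toy histograms `small` is far below `large`, matching the intuition in the figure: similar distributions give a small L, dissimilar ones a large L.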

Using ADTool for Detecting and Diagnosing Anomalies

Many interesting service anomalies are observed as abrupt changes in the DNS counts

Reasoner approach: correlate observations from multiple metrics revealing service-related and/or device-related anomalies:

Fully Qualified Domain Name

Device OS

Device manufacturer (TAC number in mobile devices)

HTTP response code

and so on…

Example: service/device related real anomaly in mobile devices

Selected examples: Anomaly Detection and Diagnosis

DNS queries counts in a mobile network

Periodic spikes: daily synchronization events

Peak hour utilization

Traffic anomaly, what’s that? Easy to detect, not so easy to diagnose

Similar behavior in tablets

The anomaly is only observable for Apple devices

akadns.net (Akamai DNS)

push.apple.com (Apple Push Notification Service)

Connection issues to Apple push notification servers

Problem solved:

Anycast enumeration and geolocation

Iterative methodology based on geographically distributed VPs:

Determine if a service uses IP anycast

Enumerate replicas sharing the same IP address

Geolocate those replicas

The iterative workflow is lightweight, O(100) packets, and fast, O(100) ms

Shall support RIPE, mPlane/Planetlab probes (RIPE integration in mPlane)
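The slides do not spell out the detection test. One common idea, assumed here, is a speed-of-light check: each vantage point (VP) bounds the target within a disk whose radius follows from its RTT, and two disks that do not intersect cannot contain the same host, which suggests anycast. The function names, the coordinates and the RTT values below are illustrative.

```python
import math
from typing import List, Tuple

SPEED_KM_PER_MS = 299.792458  # speed of light in vacuum: a conservative upper bound
EARTH_RADIUS_KM = 6371.0

Measurement = Tuple[float, float, float]  # (VP latitude, VP longitude, RTT in ms)


def great_circle_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Haversine distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def looks_anycast(measurements: List[Measurement]) -> bool:
    """True if two VPs see RTTs too low to be explained by a single location."""
    disks = [((lat, lon), rtt_ms / 2 * SPEED_KM_PER_MS) for lat, lon, rtt_ms in measurements]
    for i in range(len(disks)):
        for j in range(i + 1, len(disks)):
            (center_i, radius_i), (center_j, radius_j) = disks[i], disks[j]
            if great_circle_km(center_i, center_j) > radius_i + radius_j:
                return True  # disjoint disks: at least two replicas answer
    return False


# Illustrative RTTs from VPs near Vienna and New York
anycast_like = looks_anycast([(48.21, 16.37, 5.0), (40.71, -74.01, 6.0)])
unicast_like = looks_anycast([(48.21, 16.37, 5.0), (40.71, -74.01, 100.0)])
```

The same disks could then support enumeration and geolocation: every set of mutually disjoint disks corresponds to distinct replicas, each located somewhere inside its disk.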

Selected examples III: Anycast Detection

Selected examples IV: DisNETPerf

Problem solved: Reverse Traceroute (no IP spoofing nor IP record):

find the mPlane probe that is closest to a given Point of Interest (PoI) to enable troubleshooting on the path from that PoI to some user without control on the PoI side (e.g., YouTube server)

Neighborhood model: combined topology- and delay-based distance (BGP same AS + min RTT)

Main idea: we rely on a large set of widely spread probes (e.g., RIPE Atlas)

Given IPs (e.g., YouTube) and IPd (e.g., PoP @Heidelberg), locate IPdisnet

IPdisnet “mimics” IPs in terms of IPs → IPd path similarity

Run traceroute measurements from IPc to IPd

Collect data for troubleshooting purposes
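The neighborhood model (same AS, then minimum RTT) can be sketched as a simple selection over candidate probes. This is an illustration only: the `Probe` record, the fallback to all probes when none shares the AS, and the private-range AS numbers are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Probe:
    probe_id: str
    asn: int            # origin AS of the probe, e.g. looked up in BGP data
    min_rtt_ms: float   # minimum RTT measured from the probe to the point of interest


def select_closest_probe(probes: List[Probe], poi_asn: int) -> Optional[Probe]:
    """Prefer probes in the same AS as the point of interest, then pick the lowest min RTT."""
    if not probes:
        return None
    same_as = [p for p in probes if p.asn == poi_asn]
    candidates = same_as or probes  # fallback to all probes is an assumption
    return min(candidates, key=lambda p: p.min_rtt_ms)


# Hypothetical probes; AS numbers are taken from the private-use range
probes = [
    Probe("probe-a", asn=64512, min_rtt_ms=12.0),
    Probe("probe-b", asn=64513, min_rtt_ms=3.0),
    Probe("probe-c", asn=64512, min_rtt_ms=7.5),
]
chosen = select_closest_probe(probes, poi_asn=64512)
```

Here `probe-c` is chosen even though `probe-b` has a lower RTT, because topology (same AS) takes precedence over delay in this reading of the model.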

DisNETPerf in a Nutshell

mPlane – 2nd Review Meeting, Brussels, February 10th, 2015

Reverse Traceroute IPs → IPd?

Backup slides

Selected examples I: Content Popularity

Early detection of contents which will receive attention

How mPlane can make it happen

Probes (passive): HTTP requests

Repository

Analysis Modules: Popularity Modeler, Popularity Predictor

Reasoner: detect devices and caches close to location

Supervisor: notify popular contents

CDN supervisor: caching strategies based on future popular contents (mPlane, Cache)

Preliminary Results

Popularity Modeler and Predictor modules

Topic models: GMM + LDA

Maximum likelihood

Caching policy based on content popularity vs. LRU and LFU (Least Recently/Frequently Used)

We improve on the SotA algorithms by obtaining a similar RMSE with a much smaller observation window (30 minutes vs. 4 hours)


Selected examples II: Passive Media Curation

A new way of helping users quickly find relevant content on the web

mPlane

User clicks are a good measure of Interest (users don’t click randomly)

Curated (relevant) content

How can mPlane make it happen

[Figure: processing pipeline. WP2: user URLs; WP2’ (active). WP3: scalable data analysis. WP4.1 Analysis Modules: portals vs. contents, content popularity, statistics, classify contents, elect content to promote; output: interesting URLs. Supervisor: orchestrate, publish content. Up to 5M requests/hour.]

http://webrowse.polito.it/ (prototype running since few months)

Content versus Portal classifier module

Features:

Hostname

URL length

Frequency as hostname

Request Arrival Process cross-correlation

1-day periodicity

« Feed » a naive Bayes classifier

Tested different combinations

The best is URL length + period: as accurate as the 5 features together

Accuracy

Tested on manually verified ground truth traces

Used 2/3 for training and 1/3 for “prediction”

Overall 96% accuracy of the classifier

94% precision, 100% recall in detecting content-URLs

Content-URL: www.news.com/region/news1.htm

Web portal: www.news.com, www.news.com/region/
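A naive Bayes classifier on the two best features (URL length and 1-day periodicity) can be sketched in a few lines. This is a generic Gaussian naive Bayes, not the project's classifier: the training points are synthetic, and the assumption that portals have short URLs with strong daily periodicity while content-URLs are longer and less periodic is made only for the example.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Model = Dict[str, Tuple[float, List[Tuple[float, float]]]]  # label -> (log prior, [(mean, var)])


def fit(samples: Sequence[Sequence[float]], labels: Sequence[str]) -> Model:
    """Estimate a per-class prior and a per-feature Gaussian (mean, variance)."""
    by_class: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model: Model = {}
    for y, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-6  # avoid zero variance
            stats.append((mean, var))
        model[y] = (math.log(len(rows) / len(samples)), stats)
    return model


def predict(model: Model, x: Sequence[float]) -> str:
    """Return the label with the highest log posterior, features assumed independent."""
    def log_posterior(y: str) -> float:
        log_prior, stats = model[y]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
            for v, (mean, var) in zip(x, stats)
        )
    return max(model, key=log_posterior)


# Synthetic training set: (URL length in characters, 1-day periodicity score in [0, 1])
X = [(12, 0.90), (18, 0.80), (15, 0.85), (62, 0.20), (75, 0.10), (58, 0.15)]
y = ["portal", "portal", "portal", "content", "content", "content"]
model = fit(X, y)
label = predict(model, (70, 0.12))
```

On real traces the model would be trained on the 2/3 split and evaluated on the remaining 1/3, as described above.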

Content promotion module (in progress)

Three types of promoted content-URLs so far:

Live stream: News/Videos/Blogs currently attracting the attention of the crowd

Top: most popular (last day, week, month, etc.) over all content-URLs

Hot: a « mixture » of popularity and freshness (adapted from reddit’s hot algorithm)

First users like them!

Timestamp of first view

Absolute reference: start date of Netcurator

A freshness constant period (12 hours)
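The three annotations above suggest a score in the style of reddit's hot ranking: a logarithmic popularity term plus a freshness term that grows with the timestamp of first view, measured from an absolute reference and scaled by the 12-hour period. The exact formula is not recoverable from the slide, so the following is a sketch under that assumption; the function name, the view counts and the reference date are hypothetical.

```python
import math
from datetime import datetime, timedelta

FRESHNESS_PERIOD = timedelta(hours=12)  # freshness constant period from the slide


def hot_score(views: int, first_view: datetime, reference: datetime) -> float:
    """Popularity (log10 of views) plus freshness (12-hour periods since the reference)."""
    popularity = math.log10(max(views, 1))
    freshness = (first_view - reference) / FRESHNESS_PERIOD
    return popularity + freshness


reference = datetime(2015, 1, 1)  # placeholder; the actual Netcurator start date is not given
older = hot_score(1000, first_view=reference, reference=reference)
newer = hot_score(100, first_view=reference + timedelta(hours=12), reference=reference)
```

With this form, a URL first seen 12 hours later needs ten times fewer views to reach the same score, which is how popularity and freshness get mixed.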

[Figure: relevance ratings. Relevant 28%, Very relevant 42%, Extremely relevant 13%, Poor 6%, Not that relevant 8%.]
