mPlane Reasoner(s) & Analysis Modules
Pedro Casas, FTW Vienna
mPlane final workshop, 30 November 2015, Heidelberg
[Figure: mPlane architecture overview
Measurement Layer (WP2, raw data): mProbe 1 … mProbe N and legacyProbe 1 … legacyProbe N, each exposing an mInterface
Repository and Analysis Layer (WP3, data collection & processing): mPlane Repository, DBStream, Blockmon, legacyDB 1 … legacyDB N
WP4 (coordination, automation, analysis): Supervisor, Intelligent Reasoner, Analysis Modules 1 … N, driving the mPlane iterative measurement]
Outline
The Useful – Coordination and Analysis
The mPlane Reasoner(s)
Analysis Modules
WP4 Overview
Intelligent Reasoner for Iterative and Adaptive Analysis: guides and automates the iterative measurement, exploration and diagnosis process
Monitoring Data Analysis Modules: complex data analysis, high visibility; filter data accessed at Repos, very specific data (low volume) from probes
Supervisor: the glue of the mPlane protocol; provides centralized control of the distributed measurement framework
Useful
The Reasoner is responsible for driving the measurement analysis process, which by nature is iterative, and ideally adaptive (learning).
Depending on the use-case, the Reasoner has different roles:
In the case of troubleshooting support: iteratively find the root causes of the associated problems
In the case of generic measurement analysis: automate the iterative process
Each use case defines/instantiates a specific Reasoner addressing its goals
Still, generic design rules of a specific Reasoner can be reused in other use cases
The mPlane Reasoner
The Reasoner – Components
The Reasoner consists of 3 different blocks:
The Knowledge Structure: the memory or knowledge of the system
Initially based on expert domain knowledge (diagnosis rules)
Extended by learning from past experiences (knowledge discovery)
The Reasoning/Diagnosis Process: automates/structures the iterative analysis
The Knowledge Discovery Process: enriches the knowledge structure and the reasoning process
Based on learning (supervised/unsupervised)
The Reasoner – The Overall Picture
[Figure: the Reasoning/Diagnosis Process automates the analysis based on what I know; the “Knowledge” of the Reasoner holds what I know; Knowledge Discovery extends it through (un)supervised learning]
The Diagnosis Process (1/2)
The Reasoner does not work on raw data, but on events
An event captures a particular type of network condition
E.g., link congestion, YouTube throughput drop, overloaded cell, Google CDN load-balancing, etc.
Events are extracted from raw measurements through a retrieval process (actual algorithms at WP2, WP4, queries, etc.)
Events are defined as m-tuples including the following fields:
event name: e.g., link overload
location type: e.g., Gn downlink interface
time span: e.g., 2013-10-21-12:30:00, 2013-10-21-12:35:00
retrieval process: e.g., Simple Link Congestion Detection Algorithm – SLCDA (with utilization threshold Cth)
additional diagnosis features: e.g., number of flows, number of bytes, list of server IPs originating the flows, etc.
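The event tuple listed above can be sketched as a small data structure. This is only an illustration: the class name, the attribute names and the `overlaps` helper are assumptions, and the feature value is made up; only the five fields and their example values come from the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class Event:
    """One event tuple with the five fields listed on the slide (names are assumed)."""
    name: str                              # event name
    location_type: str                     # where the event was observed
    time_span: Tuple[datetime, datetime]   # (start, end)
    retrieval_process: str                 # algorithm or query that extracted the event
    features: Dict[str, Any] = field(default_factory=dict)  # additional diagnosis features

    def overlaps(self, other: "Event") -> bool:
        """True when the two time spans intersect (a simple temporal relationship)."""
        return (self.time_span[0] <= other.time_span[1]
                and other.time_span[0] <= self.time_span[1])


# Example built from the field examples above; the feature value is illustrative only.
congestion = Event(
    name="link overload",
    location_type="Gn downlink interface",
    time_span=(datetime(2013, 10, 21, 12, 30), datetime(2013, 10, 21, 12, 35)),
    retrieval_process="SLCDA",
    features={"num_flows": 1200},
)
```

Keeping the time span as a pair of timestamps makes the temporal correlation between events a one-line check.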
The Diagnosis Process (2/2)
Some examples of events related to Root Cause Analysis (RCA):
1. A congested Gn interface in a mobile ISP for 5 minutes:
2. An anomaly detected in YouTube traffic, impacting users’ QoE for 5 minutes:
Diagnosis Graph (1/4)
Relates problems/issues with events and root causes, exploring the temporal and spatial relationships between events
Which type of diagnosis graph reasoning? Rule-based reasoning (decision-tree-like graph)
Easier to implement and configure (easy to add domain knowledge)
Gives a simple and direct association between the diagnosed root cause and the evidence, for better interpretation
Very effective in practice
Other types of iterative reasoning can be implemented in the same way (not only RCA, but generic iterative measurement processes)
Using per-use-case graphs, the Reasoner looks for the presence of events, and identifies the root cause as the leaf with the highest probability
Example: Who to blame when YouTube is not working? Devices? ISP? Internet? YouTube?
[Figure: end devices, ISP network, Internet (AS 1, AS 2), Google CDN (G-CDN)]
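A minimal sketch of rule-based reasoning over a per-use-case graph, picking the leaf with the highest score. The event names, the graph contents and the scoring rule (fraction of a leaf's supporting events that were observed) are all assumptions made for illustration; the slides do not specify how the probability of a leaf is computed.

```python
from typing import Dict, Optional, Set

# Hypothetical graph for the YouTube example: root cause -> events supporting it.
YOUTUBE_GRAPH: Dict[str, Set[str]] = {
    "end device": {"high device CPU load", "weak Wi-Fi signal"},
    "ISP network": {"link overload", "access line errors"},
    "Internet path": {"inter-AS RTT increase", "path change"},
    "CDN servers": {"server change", "server throughput drop"},
}


def diagnose(graph: Dict[str, Set[str]], observed: Set[str]) -> Optional[str]:
    """Score each leaf by the share of its supporting events that are present.

    Returns the best-scoring root cause, or None when no evidence matches.
    """
    scores = {cause: len(events & observed) / len(events)
              for cause, events in graph.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

With `{"link overload", "access line errors", "path change"}` as observed events, the ISP leaf is fully supported and the Internet path leaf only half, so the sketch blames the ISP network.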
Diagnosis Graph (2/4)
An example of a Diagnosis Graph (DG) associated with the detection and RCA of QoE-relevant anomalies in YouTube:
In the example, the DG is structured in 5 different macro-blocks:
① QoE-relevant Anomaly Detection block
② End-device Diagnosis block
③ ISP Diagnosis block
④ Internet paths Diagnosis block
⑤ CDN servers Diagnosis block
Example of root causes and the associated rules’ description
ISP Diagnosis block
Purpose: detect QoE degradation
BASIC PROCESS:
1) Continuous passive monitoring
2) Trigger of active monitoring in case of alarms
Diagnosis Graph (3/4)
High level Diagnosis Graph for ISP (simplified from D4.2):
[Figure: diagnosis graph
Decision nodes: alarms from different POPs? alarms from different BRAS? inter-domain measurements check
Actions: trigger Internet active probe, trigger POP active probe, trigger DSLAM active probe
Root causes (leaves): issue external to SP domain, issue in SP core network, issue on BRAS, issue on DSLAM, issue on access lines]
Diagnosis Graph (4/4)
Knowledge Discovery
Domain knowledge and operational experience are incomplete (using only domain-based diagnosis graphs limits the system's capabilities)
Therefore, an initial diagnosis graph can under-perform in both accuracy and completeness
The role of Automatic Knowledge Discovery: correlate all the events that occur at the same time and are spatially related to the service problem under investigation…
…and learn new diagnosis rules (new knowledge) from past experiences
Supervised learning in case of labeled data
Unsupervised learning in the general case
Some mPlane techniques: Automatic Rule Mining, Sub-Space Clustering, Decision-Tree Learning
Final expert intervention to validate the identified diagnosis rules, which are added to the Knowledge Structure
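To make the supervised case concrete, here is a small sketch of rule mining over labeled past incidents. It proposes "event → root cause" candidates using plain support and confidence counts. This is a generic association-rule heuristic chosen for illustration, not the mPlane rule-mining algorithm; the thresholds and the incident data are hypothetical, and the output is meant to be reviewed by an expert before being added to the Knowledge Structure.

```python
from collections import Counter
from typing import List, Set, Tuple

Incident = Tuple[Set[str], str]  # (events observed, root cause label)


def mine_rules(incidents: List[Incident],
               min_support: int = 2,
               min_confidence: float = 0.8) -> List[Tuple[str, str, int, float]]:
    """Return candidate rules as (event, cause, support, confidence)."""
    event_count: Counter = Counter()
    pair_count: Counter = Counter()
    for events, cause in incidents:
        for event in events:
            event_count[event] += 1
            pair_count[(event, cause)] += 1
    rules = []
    for (event, cause), support in pair_count.items():
        confidence = support / event_count[event]  # share of the event's incidents with this cause
        if support >= min_support and confidence >= min_confidence:
            rules.append((event, cause, support, confidence))
    return sorted(rules)


# Hypothetical labeled history.
history: List[Incident] = [
    ({"link overload", "throughput drop"}, "ISP network"),
    ({"link overload"}, "ISP network"),
    ({"throughput drop"}, "CDN servers"),
    ({"link overload", "server change"}, "ISP network"),
]
```

On this history only "link overload → ISP network" passes both thresholds (support 3, confidence 1.0); "throughput drop" is split between two causes and is rejected.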
Multiple mPlane Reasoners
An mPlane Reasoner is an extended mPlane client, which performs sequential tasks based on intermediate analysis results, acting through the mPlane Supervisor interfaces
In practice, we implemented different Reasoners following the aforementioned principles, but tailored to the specific needs of each use case:
1. Reasoner in Node.js: basic mPlane Reasoner
2. Reasoner for Content Popularity Estimation
3. Reasoner for Content Curation
4. Reasoner for Web browsing QoE
5. Reasoner for Mobile Network RCA
6. Reasoner for Anomaly Detection and RCA
7. Reasoner for SLA Verification
8. Reasoner for Multimedia Content Delivery Analysis
9. Reasoner for GLIMPSE
Analysis Modules
Analysis Modules or Algorithms further evaluate the measurements gathered and pre-analyzed by the lower layers of mPlane
They operate on small amounts of data (compared with the data available at WP3 or gathered at WP2)
Per-use-case algorithms
The main Analysis Modules are linked to the proposed use cases:
Find the cause of Quality of Experience (QoE) degradations
Estimate the future popularity trends of services and contents for network optimization
Classify and promote interesting web content to end-users
Assess and troubleshoot performance and quality of multimedia stream delivery
Diagnose performance issues in web browsing and identify the segment that is responsible for the QoE degradation
Find root cause of problems related to connectivity and poor QoE on mobile devices
Detect and diagnose anomalies in Internet-scale services (e.g., CDN-based services)
Verify SLAs
Some Extended Analysis Modules
…but there is more
QoE
QoE-based monitoring for YouTube: metrics to detect playback stallings
Relate OWD variation to QoE, for a generic class of applications
Topology
Detect Anycast Services: determine if a service uses IP anycast
Reverse Traceroute – DisNETPerf: find probes near some point of interest in the network to launch active measurements
Topology discovery: identification of middleboxes, TCP proxies and NATs
MPLS transit tunnel analysis: classification of MPLS tunnels based on their usage/purpose (mono-path, ECMP, multi-FEC, etc.)
Topology/Performance
Analyze dynamics of forwarding and routing paths: determine whether routing paths follow perturbations experienced by forwarding paths or vice versa
Prediction of Unmeasured Paths: inference of path properties (RTT, available bandwidth, etc.) on unmeasured network paths
Partial Mapping of Analysis Modules to Use Cases
Reasoners and Analysis Modules (as well as everything presented so far during the day) are available on the mPlane website as software tools:
https://www.ict-mplane.eu/public/software
[Figure: MOS (1 to 5) versus number of stallings (0 to 6), for 4 seconds of stalling. Series: crowdsourcing, laboratory. Annotations: on the real mobile network; lab studies]
A single stalling event heavily deteriorates the experience of the end-user
2 or more stallings already mean bad quality
Duration of the stallings is less critical, but also has an important impact on QoE
Stallings are the impairments perceived by the end-user (independently of the video resolution, or even DASH)
MOS = F(N, L)
Selected examples I: YouTube QoE
We introduced a simple KPI to monitor YouTube QoE from passive network measurements
Buffer depletion generally occurs because the downlink bandwidth is lower than the video bitrate
Ex: standard 360p YouTube videos, video bitrate (VBR) = 600 kbps → downlink bandwidth (DBW) > 750 kbps
Stallings and Download Throughput
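The 360p example above implies a safety margin of 750 / 600 = 1.25 between downlink bandwidth and video bitrate. A minimal sketch of a check built on that ratio follows; the function name is hypothetical, and applying the same 1.25 margin to other bitrates is an assumption that goes beyond the single example on the slide.

```python
def stalling_risk(dbw_kbps: float, vbr_kbps: float, margin: float = 1.25) -> bool:
    """True when the downlink bandwidth does not exceed margin x video bitrate.

    The default margin is derived from the slide's example (750 kbps / 600 kbps).
    """
    if vbr_kbps <= 0:
        raise ValueError("video bitrate must be positive")
    return dbw_kbps <= margin * vbr_kbps
```

For a 600 kbps video, a flow measured at 500 kbps is flagged as at risk of buffer depletion, while one at 1000 kbps is not.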
Selected examples II: Anomaly Detection and Diagnosis
We conceived a statistical AD tool which works with full feature distributions
The AD algorithm consists of two phases:
(1) Reference-set identification: find past traffic distributions which are a suitable reference of normality
(2) AD test: use a normalized variant of the Kullback-Leibler divergence to decide if the current distribution is compatible with the reference set
x1 and x2 are similar → L(x1,x2) is small
x1 and x3 are dissimilar → L(x1,x3) is large
[Figure: CDFs of a feature for the distributions x1, x2 and x3]
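A rough sketch of the AD test on histograms of a feature. The slides do not give the exact normalized variant of the Kullback-Leibler divergence, so a symmetrized KL divergence with additive smoothing is swapped in here as an assumption; the decision rule (mean divergence to the reference set against a threshold) and the threshold value are also assumptions.

```python
import math
from typing import List, Sequence


def _normalize(counts: Sequence[float], eps: float = 1e-9) -> List[float]:
    """Turn bin counts into probabilities, smoothing empty bins with eps."""
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]


def _kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) for strictly positive p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def divergence(x1: Sequence[float], x2: Sequence[float]) -> float:
    """Symmetrized KL divergence between two histograms over the same bins."""
    p, q = _normalize(x1), _normalize(x2)
    return 0.5 * (_kl(p, q) + _kl(q, p))


def is_anomalous(current: Sequence[float],
                 reference_set: Sequence[Sequence[float]],
                 threshold: float) -> bool:
    """Flag the current distribution if it is, on average, far from the reference set."""
    score = sum(divergence(current, ref) for ref in reference_set) / len(reference_set)
    return score > threshold
```

Similar histograms such as `[10, 20, 30]` and `[11, 19, 30]` give a divergence close to zero, while `[10, 20, 30]` against `[30, 20, 10]` gives a clearly larger value, matching the x1/x2/x3 intuition above.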
Using ADTool for Detecting and Diagnosing Anomalies
Many interesting service anomalies are observed as abrupt changes in the DNS counts
Reasoner approach: correlate observations from multiple metrics revealing service-related and/or device-related anomalies:
Fully Qualified Domain Name
Device OS
Device manufacturer (TAC number in mobile devices)
HTTP response code
and so on…
Example: service/device related real anomaly in mobile devices
Selected examples: Anomaly Detection and Diagnosis
DNS query counts in a mobile network
Periodic spikes: daily synchronization events
Peak hour utilization
Traffic anomaly, what’s that? easy to detect, not so easy to diagnose
Similar behavior in tablets
The anomaly is only observable for Apple devices
akadns.net (Akamai DNS)
push.apple.com (Apple Push Notification Service)
Connection issues to Apple push notification servers
Selected examples III: Anycast Detection
Problem solved: anycast enumeration and geolocation
Iterative methodology based on geographically distributed vantage points (VPs):
Determine if a service uses IP anycast
Enumerate replicas sharing the same IP address
Geolocate those replicas
The iterative workflow is lightweight, O(100) packets, and fast, O(100) ms
Shall support RIPE and mPlane/PlanetLab probes (RIPE integration in mPlane)
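The slides do not detail how the detection step works. One known way to detect anycast from distributed vantage points, assumed here purely for illustration, is to look for speed-of-light violations: if two vantage points both measure RTTs so small that no single location could answer both, the address must be served from more than one place. The propagation speed, coordinates and RTT values below are assumptions.

```python
import math
from typing import List, Tuple

SPEED_KM_PER_MS = 200.0  # assumed propagation speed in fiber, about 2/3 of c
EARTH_RADIUS_KM = 6371.0

Measurement = Tuple[Tuple[float, float], float]  # ((lat, lon) of the VP, RTT in ms)


def great_circle_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Haversine distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def is_anycast(measurements: List[Measurement]) -> bool:
    """True if two vantage points' reachability discs cannot share a replica.

    Each RTT bounds the replica within (RTT / 2) * speed of its vantage point;
    two disjoint discs imply at least two replicas behind the same IP.
    """
    for i, (pos_a, rtt_a) in enumerate(measurements):
        for pos_b, rtt_b in measurements[i + 1:]:
            reach_km = (rtt_a + rtt_b) / 2 * SPEED_KM_PER_MS
            if great_circle_km(pos_a, pos_b) > reach_km:
                return True
    return False
```

For example, 5 ms RTTs measured from both Vienna and Sydney cannot come from one server, whereas 5 ms from Vienna and 10 ms from Heidelberg are compatible with a single unicast location.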
Selected examples IV: DisNETPerf
Problem solved: Reverse Traceroute (no IP spoofing nor IP record):
find the mPlane probe that is closest to a given Point of Interest (PoI), to enable troubleshooting on the path from that PoI to some user without control on the PoI side (e.g., a YouTube server)
Neighborhood model: combined topology- and delay-based distance (BGP same AS + min RTT)
Main idea: we rely on a large set of widely spread probes (e.g., RIPE Atlas)
Given IPs (e.g., YouTube) and IPd (e.g., PoP @Heidelberg), locate IPdisnet
IPdisnet “mimics” IPs in terms of IPs → IPd path similarity
Run traceroute measurements from IPc to IPd
Collect data for troubleshooting purposes
DisNETPerf in a Nutshell
mPlane – 2nd Review Meeting, Brussels, February 10th, 2015
Reverse Traceroute: IPs → IPd?
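The neighborhood model (same AS, then minimum RTT) can be sketched as a simple selection over candidate probes. The `Probe` structure, the AS numbers and the fallback to all probes when none shares the PoI's AS are assumptions for illustration, not a description of the DisNETPerf implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Probe:
    name: str
    asn: int        # AS in which the probe is located
    rtt_ms: float   # minimum RTT measured from the probe to the PoI


def closest_probe(probes: List[Probe], poi_asn: int) -> Optional[Probe]:
    """Prefer probes in the PoI's AS; among the candidates pick the smallest min RTT."""
    same_as = [p for p in probes if p.asn == poi_asn]
    candidates = same_as or probes  # assumed fallback: no probe in the PoI's AS
    return min(candidates, key=lambda p: p.rtt_ms, default=None)


# Hypothetical probes, using AS numbers reserved for documentation.
probes = [
    Probe("probe-a", 64500, 20.0),
    Probe("probe-b", 64501, 2.0),
    Probe("probe-c", 64500, 8.0),
]
```

With the PoI in AS 64500, `probe-c` is selected even though `probe-b` has a lower RTT, because topological proximity is checked first.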
Backup slides
Selected examples I: Content Popularity
Early detection of contents which will receive attention
How mPlane can make it happen
Probes (passive): HTTP requests
Repository
Analysis Modules: Popularity Modeler, Popularity Predictor
Reasoner: detect devices and caches close to location
Supervisor: notify popular contents
CDN supervisor: caching strategies based on future popular contents
[Figure: mPlane and the cache]
Preliminary Results
Popularity Modeler and Predictor modules
Topic models: GMM + LDA
Maximum likelihood
Caching policy based on content popularity vs. LRU and LFU (Least Recently/Frequently Used)
We improve on state-of-the-art algorithms by obtaining a similar RMSE with a much smaller observation window (30 min vs. 4 h)
[Figure: RMSE]
Selected examples II: Passive Media Curation
A new way of helping users quickly find relevant content on the web
User clicks are a good measure of interest (users don’t click randomly)
How can mPlane make it happen
[Figure: mPlane workflow for content curation. Labels: WP2; WP2’ (active); User URLs; up to 5M requests/hour; WP3 – Scalable data analysis; Interesting URLs; WP4.1 – Analysis Modules; Portals vs Contents; Content popularity; Statistics; Classify Contents; Elect Content to promote; Supervisor; Orchestrate; Publish Content; Curated (relevant) content]
http://webrowse.polito.it/ (prototype running for a few months)
Content versus Portal classifier module
Features:
Hostname
URL length
Frequency as hostname
Request Arrival Process cross-correlation
1-day periodicity
“Feed” a naive Bayes classifier
Tested different combinations
The best is URL length + periodicity: as accurate as the 5 features together
Accuracy
Tested on manually verified ground-truth traces
Used 2/3 for training and 1/3 for “prediction”
Overall 96% accuracy of the classifier
94% precision, 100% recall in detecting content-URLs
Example
Content-URL: www.news.com/region/news1.htm
Web portal: www.news.com, www.news.com/region/
Content promotion module (in progress)
Three types of promoted content-URLs so far:
Live stream: News/Videos/Blogs currently attracting the attention of the crowd
Top: most popular (last day, week, month, etc.) over all content-URLs
Hot: a “mixture” of popularity and freshness (adapted from reddit’s hot algorithm), using:
Timestamp of first view
Absolute reference: start date of Netcurator
A freshness constant period (12 hours)
First users like them!
Extremely relevant: 13%
Very relevant: 42%
Relevant: 28%
Not that relevant: 8%
Poor: 6%
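The "Hot" ranking can be sketched along the lines of reddit's hot algorithm, which adds a logarithmic popularity term to a time term scaled by a constant period. The exact adaptation used in the module is not given in the slides, so the formula below is an assumption: log10 of the view count plus the time of first view, measured from an absolute reference and expressed in 12-hour freshness periods. The reference date used in any call is a placeholder.

```python
import math
from datetime import datetime, timedelta

FRESHNESS_PERIOD = timedelta(hours=12)  # freshness constant period from the slide


def hot_score(views: int, first_view: datetime, reference: datetime) -> float:
    """Mix popularity and freshness in one score (assumed adaptation).

    A content-URL first viewed one freshness period later needs ten times
    fewer views to reach the same score.
    """
    popularity = math.log10(max(views, 1))
    freshness = (first_view - reference) / FRESHNESS_PERIOD
    return popularity + freshness
```

Under this sketch, a URL with 10 views first seen at the reference date ties with a URL with a single view first seen 12 hours later, which is the kind of popularity/freshness trade-off the "Hot" list aims for.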