predicting the black swan - itu...jun 17, 2019  · •network management data: cm, fm, pm, ue...

35
© 2019 Nokia 1 Public Predicting the Black Swan Ludic Fallacy and Predictive Self-Healing in Future Cellular Networks ITU Workshop on "Machine Learning for 5G and beyond" June 17, 2019 Janne Ali-Tolppa, Henning Sanneck Nokia Bell Labs Research, Munich, Germany

Upload: others

Post on 11-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia1 Public

Predicting the Black SwanLudic Fallacy and Predictive Self-Healing in Future Cellular Networks

ITU Workshop on "Machine Learning for 5G and beyond"

June 17, 2019

Janne Ali-Tolppa, Henning Sanneck

Nokia Bell Labs Research, Munich, Germany

Page 2: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia2 Public

5G Network ManagementAddressing the challenges

Cloudified RAN & Core

Centralization

Combined management of

physical / virtualized

infrastructure

Multi-serviceNetwork

Slices

Intra- / inter-slice management;Separation (&

sharing) of knowledge; Embedded analytics

Ultra Dense Small Cells

Distribution

Distributed decisions;

Management hierarchy /

aggregation

URLLC network service

Low latency radio, edge cloud

NM data resolution; prognostic diagnosis;

combined network resilience & self-

healing

CPCP

Multi-connectivity (MC)

Multi-RAT / layer

MC-aware, management

incl. aggregation of PM/FM data

„Hybrid“ NM

Flexible VNF re-location / re-configuration

Page 3: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia3 Public

5G Network ManagementAddressing the challenges

Cloudified RAN & Core

Centralization

Combined management of

physical / virtualized

infrastructure

Multi-serviceNetwork

Slices

Intra- / inter-slice management;Separation (&

sharing) of knowledge; Embedded analytics

Ultra Dense Small Cells

Distribution

Distributed decisions;

Management hierarchy /

aggregation

URLLC network service

Low latency radio, edge cloud

NM data resolution; prognostic diagnosis;

combined network resilience & self-

healing

CPCP

Multi-connectivity (MC)

Multi-RAT / layer

MC-aware, management

incl. aggregation of PM/FM data

„Hybrid“ NM

Flexible VNF re-location / re-configuration

Page 4: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia4 PublicPublic

Self-Healing in Future Cellular NetworksRobustness and resilience

Page 5: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia5

Self-Healing in Future Cellular Networks

Robustness and resilience

Resilience

“An ability to recover from or adjust easily tomisfortune or change”

Merriam-Webster Dictionary

Robustness

“Capability of performing without failure under a wide range of conditions ”

Public

Page 6: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia6

Self-Healing in Future Cellular Networks

Robustness and resilience

Networks are by nature dynamic and complex → Unforeseen circumstances are bound to happen

Use cases requiring ultra-high reliability (URLLC)→ Robustness (redundancy etc.) is no longer alone enough, resilient networks are needed!

Public

Page 7: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia7

Self-Healing in Future Cellular Networks

Design principles for resilience

Public

Monitor and adapt

Decoupling, modularity

Common core principles

Focus

Page 8: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia8

Self-Healing in Future Cellular Networks

Design principles for resilience

Public

Monitor and adapt

Decoupling, modularity

Common core principles

Focus

Page 9: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia9

General process:

1. Profiling – establishing a baseline definition of “expected”

2. Comparing measurements to the baseline, generating an anomaly value

3. Detecting anomalous events, both in time and space

• An anomaly is something unusual, not necessarily a fault

4. Diagnosing the detected anomalies

• Comparing the anomaly against a diagnosis knowledgebase of previously diagnosed anomalies (case-based reasoning)

• Assess the impact and possible root cause

5. Plan and deploy potential corrective actions

Self-Healing in Future Cellular Networks

Anomaly detection and diagnosis

Public

Anomaly

detection

OSS

counters

KPIs

Contexts

Anomaly

values

Trend or

environment changes

ProfilesAnomaly

patterns

Diagnosis

Knowledge

Page 10: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia10 PublicPublic

Self-Healing in Future Cellular NetworksPredicting network failures

Page 11: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia11

Self-Healing in Future Cellular Networks

What about predicting faults?

Public

• Of course prediction is still detection, i.e. detecting the early signs of a problem

• In addition to learning to detect and diagnose fault patterns, try to learn also to detect early signs of a degradation, if possible

• Including the temporal context, learning the sequence of events that led to a state labeled as degraded

• Different levels of prediction

• Predict a previously diagnosed and labeled fault

• Predict an anomalous state, which is not previously diagnosed (and may or may not be a fault)

• Predict an anomaly, which is not well-represented in the current network element state model

Page 12: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia12

Self-Healing in Future Cellular Networks

What/how to predict from the data?

Public

Anomaly Detection

Anomaly Diagnosis

Time Series Prediction

Input data, e.g. PM KPIs

Degradation?

State Modeling

State Transition Paths

Input data, e.g. PM KPIs

Degraded state?

State Prediction

a) Predict the input b) Predict the system state

Page 13: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia13 Public

Self-Healing in Future Cellular Networks

Algorithms (example)

• ARIMA

• Triple Exponential Smoothing / Holt-Winters

• Recurrent Neural Networks

• Long-Short Term Memory (LSTM)

• Convolutional Neural Networks

• Sparse auto-encoders (for state modeling)

Page 14: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia14

Can we predict also the more

complicated and even previously

unencountered faults?

Page 15: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia15

Self-Healing in Future Cellular NetworksThree Aspects of the Black Swan

1. It is an outlier, as it lies outside the realm

of regular expectations, because nothing in

the past can convincingly point to its

possibility.

2. It carries an extreme 'impact’.

3. In spite of its outlier status, human nature

makes us concoct explanations for its

occurrence after the fact, making it

explainable and predictable

Photograph by Francis C. Franklin / CC-BY-SA-3.0

[1] Nassim Nicholas Taleb, The Black Swan, Random House, 2007

Public

Page 16: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia16

An argument against applying simplified, game-like statistical models in complex

domains [1]:

• It is impossible to have the entirety of information available

• Even the smallest variations can have very significant effects

• Because of this, what we tend to see, post hoc, as a failure of the predictive

capability of a model, may very well be a Black Swan, something that was real,

but unpredictable (or predictable only after the fact)

Public

Self-Healing in Future Cellular Networks

A Ludic Fallacy

[1] N. N. Taleb, The Black Swan, Random House, 2007

Page 17: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia17

Is general predictive self-healing in

mobile networks a ludic fallacy?

Page 18: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia18 Public

Self-Healing in Future Cellular Networks

What can you predict? Some examples?

• Software problem (e.g. a memory leak)?

• Hardware problem (e.g. a power amplifier failure)?

• (Human-induced) misconfiguration?

• Environmental impacts (building, weather)?

• Unexpected network traffic patterns?

• Often failures are a combination of several root causes, each of which

alone would not have led to a failure

• What can be detected, depends on the available data

Page 19: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia19 Public

Self-Healing in Future Cellular Networks

What information do we have available?

Page 20: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia20 Public

Self-Healing in Future Cellular Networks

What data do we have available?

• Network management data: CM, FM, PM, UE measurements

• Cloud resource KPIs

• Logs

• Traces

• Context data

• Weather

• (Road) traffic

• Calendar with special events

• Production system data in an Industry 4.0 private networks

• Etc.

• Service-based architecture, ML-pipelines etc. enable better utilization of all available data

Page 21: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia21 Public

Self-Healing in Future Cellular NetworksCollecting knowledge – transfer learning

Page 22: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia22 Public

Self-Healing in Future Cellular Networks

Collecting knowledge

• Diagnoses for case-based reasoning are collected in a diagnosis

knowledgebase

• The knowledgebase or the predictive model can also contain knowledge of

the state transition paths that led to a fault

• The main challenge is that the faults are rare occasions

• Collecting sufficient amount of analyzed samples is difficult for diagnosis and

for prediction the dimensionality is even higher

Page 23: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia23 Public

Self-Healing in Future Cellular Networks

Transfer Learning

• A domain 𝒟 consists of a feature space 𝒳 and a marginal probability

distribution 𝑃(𝑋), with 𝑋 Є 𝒳.

• A task 𝒯 consists of a label space 𝒴 and an objective predictive function

𝑓 ⋅ , which is not observed but can be learned from the training data

• The goal: Given a source domain 𝒟S and learning task 𝒯S, a target domain

𝒟𝑇 and learning task 𝒯𝑇, transfer learning aims to help improve the learning

of the target predictive function 𝑓𝑇 ⋅ in 𝒟𝑇 using the knowledge in 𝒟S and

𝒯S, where – in general, but not necessarily – 𝒟S ≠ 𝒟T or 𝒯S ≠ 𝒯T.

Transfer of knowledge

between different domains

with different learning tasks

Page 24: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia24 Public

Self-Healing in Future Cellular Networks

Diagnosis Cloud

Diagnosis Cloud

Operator BOperator COperator A

OSSLocal detection

and diagnosis

models

Global

representation

for models or

parametersOperator specific

contextualized

detection/diagnosis

models or

parameters

Local

engineer PM,FM,CM, topology

Central Diagnostic

Agent (CDA)

Local Diagnostic

Agents (LDA)

Gateway Diagnostic

Agents (GDA)

Network A Network B Network C

Ciaocarlie et al: Diagnosis cloud: Sharing knowledge across cellular networks

Page 25: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia25 Public

Self-Healing in Future Cellular NetworksInherent Resilience

Page 26: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia26 Public

Self-Healing in Future Cellular Networks

Inherent resilience

• Self-healing functions are most effective, when used in

conjunction was other resilient design principles

• Mobile networks typically have distributed and decoupled

by nature and have built-in redundancy

• This makes them less susceptible for catastrophic failures

due to minor changes

Monitor and adapt

Decoupling, modularity

Common core principles

Focus

Page 27: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia27 Public

Self-Healing in Future Cellular NetworksA practical example

Page 28: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia28

Public

Self-Healing in Future Cellular Networks

5G Slice Analytics & Diagnostics @ Hamburg Seaport Testbed

• Hamburg seaport testbed: evaluate concepts in a controlled environment (ground truth available; closed-loop

automation possible)

• Results

• Slice-aware Network Element (NE) state model: quantization of NE KPIs into a selected number of states

• States -> Long-Short Term Memory (LSTM) Recurrent Neural Network (RNN) -> State Prediction

• Additionally Mobility Pattern Prediction (MPP) on the barges moving in the harbor area

• Of 6220 sequences in a validation set, we were able to predict 97.6% of the low-SINR events with 1.4%

false positive rate of the total number of sequences

Predicted states (T+3):

Original states:

Page 29: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia29 Public

Self-Healing in Future Cellular NetworksSummary

Page 30: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia30 Public

Self-Healing in Future Cellular Networks

Black Swan

1. An outlier, an anomaly, unpredictable

• We can at least detect them with anomaly detection

2. Extreme impact, especially in the sense of a surprise

• The more automated diagnosis knowledge we collect, the more prepared

we are for a wide variety of issues

-> Less high-impact surprises and we are prepared to react quickly to

correct or mitigate any problems

3. With points 1. and 2. covered or at least mitigated, the human bias for

hindsight may also be manageable

Page 31: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia31 Public

Self-Healing in Future Cellular Networks

Is predictive mobile network self-healing a ludic fallacy?

• It is impossible to have the entirety of information available

• Correct, but by instrumenting the system much more than before and by including also data

sources for context data, a lot of the required information is available

• Methods like transfer learning can help to collect knowledge for diagnosis

• Already ongoing in 5G and beyond

• Even the smallest variations can have very significant effects

• By following the design principles of redundancy, decoupling and modularity, the networks can be

made inherently resilient -> small variations are unlikely to have catastrophic impacts

• In any system with such complexity as in mobile networks, there are and will always unavoidably be

“black swans”, i.e. completely unpredictable major events

• Still, we argue that predictive self-healing is not a ludic fallacy. By proper data and knowledge

collection, advanced self-healing methods can significantly improve the resiliency of future mobile

networks.

• To enable this, the work done on defining ML-friendly architecture and enablers, e.g. for collecting and

sharing data and information, are critically important.

Page 32: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia32 Public

Page 33: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia33 Public

Self-Healing in Future Cellular Networks

Automatic network element state definition

Cell throughput DL

Cell

thro

ughput

UL

Cell throughput DLC

ell

thro

ughput

UL

K-m

eans

BSQ

Clustering

Clustering methods group observations into clusters,

and try to represent all observations in a group with a

single example, a prototype. The prototype is usually

selected through a minimization of an error measure.

Bounding-Sphere Quantization

Previous work [2] involved automated state

definition for anomaly detection and visualization

uses. For these cases, a quantization algorithm

was developed called Bounding-Sphere

Quantization. BSQ strives for a quantizationwhere the maximum error is kept small.

[2] https://ieeexplore.ieee.org/document/8406263

Page 34: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia34 Public

Self-Healing in Future Cellular Networks

Network element state prediction

Predicted states:

Observed states:

• Advantage is that the high

dimensional network

management data is reduced to

a series of discrete states

• Well-suited for Recurrent Neural

Networks (RNNs)

• The performance depends

highly on the state model

• Tradeoff between short-term and

longer-term accuracy

rb sa e

rb sa e

rb sa e

rb sa e

rb sa e

rb sa e

1 2 3 4 5 6

Page 35: Predicting the Black Swan - ITU...Jun 17, 2019  · •Network management data: CM, FM, PM, UE measurements •Cloud resource KPIs •Logs •Traces •Context data •Weather •(Road)

© 2019 Nokia35 Public

Self-Healing in Future Cellular Networks

Holistic optimization and self-healing

• In a complex system, improving the resilience of only one

part or level of the system can sometimes (unintentionally)

introduce fragility in another

• To improve the resilience, it is often necessary to work in

more than one domain, scale and time granularity at a time

• If there are several optimization and self-healing functions,

possibly operating at different scopes, their actions may

need to be coordinated

• E.g. on UE-level vs. on NE-levelQOE

Orchestration

NM

Coordination