
1 Guide Network and Application Performance Management in the Cloud and 5G Era

Guide

Network and Application Performance Management in the Cloud and 5G Era


Table of Contents

1 Introduction
1.1 What is performance management?
1.2 How this document will help you
2 The Basics: 10 Fundamentals of Performance Management
3 KPIs: The Foundation of Performance Management
3.1 The relationship between KPIs and performance management
3.2 KPIs that affect quality of experience (QoE)
3.3 How to leverage KPIs
4 Achieve Business Goals
4.1 Benefits of complete performance management
4.2 Risk of incomplete performance management
5 Standardized Performance Management
5.1 Correlating network performance and QoE using standards
5.2 How active and passive monitoring work together
5.3 Network visibility granularity: why it’s important
6 Beyond Passive & Active Monitoring: Unified NPM and APM
7 Making Sense of Everything with Service Analytics
7.1 Performance data quality and correlation
8 Putting It All Together: Case Studies
8.1 CellMobile: network expansion and user experience
8.2 CloudMobile: cloud services and QoE
8.3 Value achieved through correlated data and converged NAPM
9 Conclusion: The Future of Performance Management
9.1 A service-centric future
9.2 Business model transformation


For your communications service provider (CSP) organization, the ability to acquire and retain customers, and to make a profit, is increasingly driven by a positive customer experience, all the way from ordering a service and managing service quality through to billing and customer care. Especially as 5G services emerge (many of them ‘mission critical’ with no tolerance for downtime), proactively managing quality of experience (QoE) becomes a key requirement to ensure customer loyalty, sustain or grow revenue, and reduce churn. What is your plan to take control of 5G customer experience?

Your answer should focus on network and application performance management, which encompasses physical and software/virtual layers all the way from the core to the end user. This is vital to collect the key performance indicators (KPIs) you will use to accurately calculate 5G QoE.

However, traditional performance monitoring systems and methods are no longer sufficient for your organization’s operations and customer care teams to respond effectively to customer dissatisfaction with network speed or service performance. Effective performance management for increasingly complex services, applications, and cloud networks must be real-time and dynamic. You need new tools and a new approach.

1.1 What is performance management?

In this context, performance management means the ability to define and measure key performance indicators (KPIs) to understand, control, report, and optimize the end-to-end network and the entire service lifecycle. Its purpose is to extract meaning from performance data and then make informed, effective decisions.

Performance management, unlike performance monitoring, isn’t just about checking on network performance (which can be misleading, since the network may appear to be ‘all green’ while QoE suffers) or fixing obvious problems; components will need to be monitored and managed, including virtualized network functions, distributed edge architectures, service chains, network slices, and cloud data center infrastructure.

Traditionally, “performance management” has been confined mostly or completely to monitoring the core and distribution network, ending at a demarcation point some distance from the customer. Increasingly, service providers are integrating network performance management (NPM) with application performance management (APM), which provides visibility into the performance of last mile delivery and the behavior of specific services and even applications on end user devices. Are users experiencing what they were promised? You must now consider APM as part of performance management.

1. Introduction

1.2 How this guide will help you

While there are many facets to performance management, the purpose of this guide is to share an overview of the discipline, and offer insight into Accedian’s approach to continuous, precise performance monitoring for complex cloud networks and services.

In this overview, we cover:

• Best practice recommendations based on industry data and real-world experiences of service providers globally.

• Meeting service level agreements (SLAs) for different types of networks and services. What are the differences? What core principles of effective, continuous performance monitoring do they have in common?

• Emerging network standards (MEF 3.0, 5G, DOCSIS 3.1, LTE, LTE-A, VoLTE) and their quality and capacity requirements. How do new, more stringent requirements intersect with modern performance management solutions?

• Ensuring a world-class network and a positive customer experience. How can a service provider be sure end customers are getting excellent quality of experience (QoE) and the best value for their money?

• Using existing standards for ‘vendor neutral’, continuous, proactive monitoring. How does network performance impact the core business drivers that are key to revenue generation? How does this apply to networks today, and in the future?

• Networking changes that are making performance management increasingly important. How does this apply to mobile, cable, and backhaul networks? What does this look like in the real world?



Performance monitoring and assurance solutions must be open, programmable, and fully support multi-vendor networks. Ideally, the solutions you use should provide truly independent visibility and provide relevant KPI, QoE and SLA intelligence to support all networks, resources and services. This applies to 5G, cable MSO offerings, mobile services, small cells, business Ethernet, financial networks, software-defined networks (SDNs), and data center connectivity and cloud infrastructure.

Underpinning all this is a set of ten capabilities service providers must include in their performance management systems. While somewhat adaptable, these capabilities must work in unison for effective QoE and SLA management.

2. The Basics: 10 Fundamentals of Performance Management

Figure 1: Ten fundamentals of performance management

• Accurate and precise: High resolution for tight control-plane timing
• Granular: Keep pace with rapid sampling
• One-way metrics: Leverage directional test results
• Programmable: Automated, tailored assurance
• Standards-based: Use available standards
• Continuous: Monitor 24/7
• Interoperable: Leverage multi-vendor infrastructure
• Open: Accessible applications using APIs
• Real-time: Faster response with instant feedback
• Ubiquitous: Cover core-to-edge; eliminate blind spots


We define KPIs as measurable performance reference points that identify key areas you need to understand, control, visualize, report on, and optimize in order to improve service and network performance.

Useful KPIs:

• Are relevant and clear

• Generate timely and useful analytics data

• Lead to meaningful QoE-focused actions that directly impact customers

3.1 The relationship between KPIs and performance management

You can leverage KPIs to map network changes and measure their impact on finances, revenue, customers, and strategic initiatives; to quickly identify the best course of action to achieve business success; and to create actionable results about what to improve. Analytics, machine learning, and AI algorithms can help you make sense of the volumes of data and KPIs and correlate performance data with other data. These tools help answer your key performance management questions and help determine recommended actions. Is my network getting better or worse? What applications are people using? How well are those applications performing? Where are the recurring trouble spots?

3.2 KPIs that affect quality of experience (QoE)

As an operations professional tasked with creating procedures to improve QoE and service quality, you have many KPIs at your disposal. CSPs and vendors typically only use a few of these: delay, packet loss, delay variation, and mean opinion score (MOS). But those are just the tip of the iceberg! Many more KPIs exist to help you pinpoint performance problems, and reduce mean time to innocence (MTTI) and/or mean time to repair (MTTR).

Here are the most important (although not universally-supported) KPIs:

• Packet delay is the time it takes for a packet to travel from source to destination

• Inter-packet delay variation is the delay variation between consecutive packets

• Packet delay variation is the comparison of a delay percentile with the minimum delay value

• Packet loss is the number of lost packets and the number of periods of lost packets

• Packet loss burst is consecutive packets lost within a specific time period

• Packet misorder is when packets arrive out of order and cause anomalies in a service or application

• Packet duplication is when the same packet is generated and received at the termination point

• Type of Service (ToS) is how packets are identified and are therefore processed, prioritized, and buffered
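To make the definitions above concrete, here is a minimal sketch of how several of these KPIs could be derived from a stream of sequence-numbered test packets. The function names, the input layout (a `sent` dictionary and an `arrivals` list), and the nearest-rank percentile are assumptions made for illustration; they are not taken from any particular monitoring product or standard.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def packet_kpis(sent, arrivals, pct=99.0):
    """Derive basic packet KPIs.

    sent:     {sequence number: send time in seconds} (hypothetical layout)
    arrivals: [(sequence number, receive time in seconds)] in arrival order
    """
    first_seen = {}
    duplicates = 0
    misordered = 0
    highest = None
    for seq, received_at in arrivals:
        if seq in first_seen:          # same packet received again
            duplicates += 1
            continue
        first_seen[seq] = received_at
        if highest is not None and seq < highest:
            misordered += 1            # arrived after a later-numbered packet
        else:
            highest = seq

    # One-way packet delay per received packet.
    delays = {seq: first_seen[seq] - sent[seq] for seq in first_seen}
    seqs = sorted(sent)

    # Inter-packet delay variation: delay difference between consecutive packets.
    ipdv = [delays[b] - delays[a]
            for a, b in zip(seqs, seqs[1:])
            if a in delays and b in delays]

    # Packet loss: count of lost packets, number of loss periods, longest burst.
    lost = 0
    loss_periods = 0
    longest_burst = 0
    run = 0
    for seq in seqs:
        if seq in delays:
            run = 0
            continue
        lost += 1
        run += 1
        if run == 1:
            loss_periods += 1
        longest_burst = max(longest_burst, run)

    # Packet delay variation: a delay percentile compared with the minimum delay.
    values = list(delays.values())
    pdv = percentile(values, pct) - min(values) if values else None

    return {
        "delays": delays,
        "ipdv": ipdv,
        "pdv": pdv,
        "lost": lost,
        "loss_periods": loss_periods,
        "longest_burst": longest_burst,
        "misordered": misordered,
        "duplicates": duplicates,
    }
```

One-way delay computed this way is only meaningful if sender and receiver clocks are synchronized, which the sketch simply assumes.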

3. KPIs: The Foundation of Performance Management


Figure 2: Causes of packet loss, delay, and jitter

[Figure with three panels: "What causes packet loss?", "What causes delay variation (jitter)?", and "What causes delay?". Factors shown include bursty traffic, no buffering, over-buffering, network configuration, packet buffering in routers and switches terminating highly utilized links, number of hubs, TCP throughput, packet size, speed and duplex, physical issues (cabling, fiber), maxed-out CPUs, and hardware and software malfunctions. An inset contrasts a packet stream with no jitter against one with jitter.]

For reference, here are MEF-defined recommended values and limits for some of the key KPIs discussed here.

Figure 3: Recommended values and limits for KPIs. Source: Based on MEF 23 Class of Service (CoS) references

Application | Frame delay | Mean frame delay | Frame loss ratio | Frame delay ratio | Inter-frame delay variation
VoIP data | 125 ms pref, 375 ms limit (Pd = 0.999) | 100 ms pref, 350 ms limit | 3e-2 | 50 ms (Pr = 0.999) | 40 ms (Pr = 0.999)
Video conferencing data | 125 ms pref, 375 ms limit (Pd = 0.999) | 100 ms pref, 350 ms limit | 1e-2 | 50 ms (Pv = 0.999) | 40 ms (Pv = 0.999)
VoIP and videoconf signaling | Not specified | 250 ms pref | 1e-3 | Not specified | Not specified
IPTV data plane | 125 ms (Pd = 0.999) | 100 ms | 1e-3 | | 40 ms (Pv = 0.999)
IPTV control plane | Not specified | 75 ms | 1e-3 | Not specified | Not specified
Streaming media | Not specified | Not specified | 1e-3 | 2 s | 1.5 s (Pv = 0.99)
Interactive gaming | 50 ms | 40 ms | 1e-3 | 10 ms | 8 ms
Mobile backhaul H | 10 ms | 7 ms | 1e-4 | 5 ms | 3 ms
Mobile backhaul M | 20 ms | 13 ms | 1e-4 | 10 ms | 8 ms
Mobile backhaul L | 37 ms | 28 ms | 1e-3 | Not specified | Not specified

Key: ms millisecond | pref preferred | Pd Packet Delay | Pr Packet Ratio | Pv Packet Variation


3.4 How to leverage KPIs

Even small application or service-based issues can have a tremendous impact on the customer experience, especially if the network is “green” yet customers are having service issues. You need to be able to quickly identify the root cause of customer dissatisfaction. If you are learning about quality problems from customer complaints to the service center, you are not properly leveraging the power of KPIs.

Here are some of the most common QoE issues and the KPIs you can use to identify and track them:

• Dropped calls are implied if specific cell sites exceed packet loss burst limits.

• Poor voice quality is likely to be present if jitter or delay, or a combination of the two, are present. Packet loss also impacts voice quality.

• Overall degraded internet performance is often the result of TCP re-transmissions due to delay, jitter, and packet loss; this impacts the throughput the end user experiences.
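The mapping from KPIs to likely QoE issues in the list above can be sketched as a small rule table. This is only an illustration: the field names and every threshold value are hypothetical placeholders, and real limits would come from your own SLAs and service classes.

```python
def likely_qoe_issues(site, limits):
    """Return the QoE issues suggested by a cell site's KPIs.

    site and limits are plain dictionaries with hypothetical keys:
    loss_burst (packets), jitter_ms, delay_ms, loss_pct, tcp_retransmit_pct.
    """
    issues = []

    # Dropped calls are implied when packet loss burst limits are exceeded.
    if site["loss_burst"] > limits["loss_burst"]:
        issues.append("dropped calls")

    # Jitter, delay, or packet loss above limits points to poor voice quality.
    if (site["jitter_ms"] > limits["jitter_ms"]
            or site["delay_ms"] > limits["delay_ms"]
            or site["loss_pct"] > limits["loss_pct"]):
        issues.append("poor voice quality")

    # Heavy TCP retransmission suggests degraded internet throughput.
    if site["tcp_retransmit_pct"] > limits["tcp_retransmit_pct"]:
        issues.append("degraded internet performance")

    return issues


# Placeholder limits for illustration only; not recommended values.
EXAMPLE_LIMITS = {
    "loss_burst": 5,
    "jitter_ms": 30.0,
    "delay_ms": 150.0,
    "loss_pct": 1.0,
    "tcp_retransmit_pct": 2.0,
}
```

Running a check like this on every reporting interval, rather than waiting for complaints, is the point the section makes about using KPIs proactively.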

3.3 KPIs for 5G

The specifications shown above are fairly “traditional” in their association with 4G and earlier network architectures. 5G brings a much more intense set of performance demands. While these are not yet fully defined, the chart below gives a good picture of what your CSP organization faces with performance management for future networks. The biggest challenge lies in the orange box, with very low-latency and packet loss-sensitive applications.

See figure 11 for a high-level view of what a next generation performance management system entails.

[Chart: 5G services and KPI mapping. For each service type, the chart plots throughput required (Mbps), latency allowable (ms), packet loss sensitivity (%PL), and local compute required. Each requirement is color-coded by performance level and impact if not met: not demanding (low impact), within LTE specs (moderate impact), or challenging (high impact).]

Classes: mMTC (Massive Machine-Type Communications), uRLLC (Ultra-Reliable, Low-Latency Communications), eMBB (Extreme Mobile Broadband)

Types: IoT, Real-Time Communication/Interaction, Streaming Video/Media, HTTP/HTTPS Internet/Web Services

Examples, by chart row:
• Sensors for agriculture, smart meters, connected car insurance, smart city lighting, package tracking and logistics
• Real-time factory automation, remote patient vital sign monitoring, connected car collision avoidance/federated driving
• Video/hologram calling, VR gaming, augmented reality, telematics, fixed line internet replacement
• UHD 3D YouTube, VR movies and AR sporting events, 8k video
• Content, visuals, client to server-based web applications and basic HTTP/HTTPS connectivity to web

Note on IoT: NB-LTE-M/LTE-IoT allow devices to communicate as rarely as once every 40 minutes.

Figure 4: Expected, very tight specifications for 5G services demand end-to-end performance management. Source: GSMA, Accedian


Service providers must prove they are proactively on the side of the customer. This goes beyond “mean time to innocence”—how long it takes to show that the issue isn’t with your network—to quickly identifying the source of a problem and resolving it, ideally before customers are affected.

The goal here is to reduce mean time to diagnose and mean time to repair (MTTR), leveraging machine learning and analytics capabilities to identify and prevent recurring problems, as well as predict issues and take action to avoid them. The shorter the time, the better for everyone: the customer is happy, and the service provider avoids the high costs of calls to customer support.

Performance issues that might not seem like a big deal can in fact snowball into customer churn and lost revenue. For example, we’ve seen instances where packet loss as low as 0.2% can result in a reduction of 10% in TCP throughput, and only 0.5% packet loss can result in a reduction of up to 50% in TCP throughput. Micro impairments can have macro consequences. Related issues like this become clear only when KPIs are based on meaningful, actionable performance metrics.
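To see why small loss rates matter so much, it can help to look at a widely used approximation for steady-state TCP throughput, often attributed to Mathis et al.: throughput is bounded by roughly (MSS / RTT) × (C / √p), where p is the packet loss rate and C is about √(3/2). This model is not from this guide, and it will not reproduce the field figures quoted above, which depend on the TCP stack, round-trip time, and traffic mix. The sketch below only illustrates the shape of the relationship; the function and parameter names are illustrative.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(3 / 2)):
    """Approximate upper bound on TCP throughput in bits per second.

    mss_bytes: maximum segment size in bytes
    rtt_s:     round-trip time in seconds
    loss_rate: packet loss probability, 0 < loss_rate <= 1
    c:         model constant (about 1.22 in the simplest form of the model)
    """
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    if rtt_s <= 0 or mss_bytes <= 0:
        raise ValueError("mss_bytes and rtt_s must be positive")
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


def relative_throughput(loss_before, loss_after):
    """Throughput ratio when loss changes, with MSS and RTT held constant."""
    return math.sqrt(loss_before / loss_after)
```

Because throughput scales with 1/√p in this model, quadrupling the loss rate halves the achievable throughput, which is consistent with the section's point that micro impairments can have macro consequences.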

4.1 Benefits of complete performance management

Performance management tends to be focused on network operations and the health of the network, its infrastructure, and ultimately the QoE delivered. But it is also a means to an end: achieving business goals. A successful launch of a service means using performance management capabilities to ensure that the “service” is running within the required specs to assure QoE.

What you can do with well-defined KPIs

• Proactively detect, troubleshoot, and resolve issues

• Continuously control and adjust the quality of service (QoS) and bandwidth allocation for business-critical traffic

• Reduce the risk of investment in new technology, and rapidly deploy it

• Verify that a network is ready to launch new services

• Prove performance metrics before and after launching new services and technologies

• Deliver unique, consistently-performing services to generate new revenue streams

• Improve QoE to reduce customer churn

• Consistently meet service level agreements (SLAs)

4. Achieve Business Goals


Existing performance monitoring tools may lead you to think that the network is behaving normally when in fact QoE has significant issues. Through precise, granular, accurate KPIs and analytics, it’s possible to pinpoint previously hidden causes of QoE degradation.

Mean time & cost to repair | With legacy performance monitoring | With high quality performance management | Percentage savings
Remote hours | 7.0 | 3.50 | 50%
On-site hours | 12.0 | 9.0 | 25%
Average time (hours) | 8.20 | 4.44 | 45.8%
Remote repair cost | $332.10 | $179.60 | 45.9%
Total cost with truck roll & labor on-site | $600.00 | $600.00 |
Average cost to repair | $396.40 | $251.08 | 36.6%

Total savings with high quality performance management: 36.6%

Figure 5: Risk of incomplete performance management

4.2 Risk of incomplete performance management

Stop us if this sounds familiar: you spend too much time resolving performance problems because it is so hard (or even impossible) to identify their root cause. Customer QoE suffers and call centers are flooded with complaint calls. The situation snowballs, as dealing with each trouble ticket ties up resources that could focus on less immediate issues that will inevitably become urgent at some point. This waterfall effect (dam gets full, overflows, results in a flood of issues) has long-term consequences like churn, loss of revenue, and increased operational costs.

To stop this in its tracks, you need granular visibility into the root cause of performance issues. When you have that, there’s a butterfly effect that is the opposite of the waterfall: fewer service calls, more resources to work on growth projects, and lower operational costs. Imagine if, when a customer does call with a service issue, you can confidently inform them you are already aware of the problem, your team is working on it, and it will be resolved soon. You can be in control!

Fixing problems that customers are having is a significant contributor to OPEX. Each problem turns into a trouble ticket that must be managed by the network operations center (NOC), operations, marketing, and other teams. When the same issue repeats, it may result in multiple trouble tickets, which not only increases MTTR but also creates an ongoing QoE problem for the customer. Through high quality performance management with granular visibility to identify and resolve issues, OPEX can be significantly reduced.

Figure 5 shows a generalized example of this based on combined/averaged results from a variety of service providers (dollar amounts based on standardized assumptions about labor and other costs). Due to reduction in the number of customer tickets and the time it takes to resolve issues, average cost to repair is reduced by over 36%.
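The arithmetic behind Figure 5 can be checked with a few lines of code. The sketch below computes percentage savings from the before and after values, and, under the assumption that each average is a weighted blend of remote and on-site repairs, back-calculates the share of repairs handled remotely. That share is an inference from the published numbers and is not stated in the figure; the function names are illustrative.

```python
def pct_saving(before, after):
    """Percentage reduction from `before` to `after`."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (before - after) / before * 100


def implied_remote_share(average, remote, on_site):
    """Fraction of repairs done remotely that would yield `average`.

    Assumes average = share * remote + (1 - share) * on_site.
    """
    if on_site == remote:
        raise ValueError("remote and on_site values must differ")
    return (on_site - average) / (on_site - remote)


# Figure 5 values: (legacy monitoring, high quality performance management)
average_hours = (8.20, 4.44)
average_cost = (396.40, 251.08)

time_saving = pct_saving(*average_hours)      # close to the 45.8% shown
cost_saving = pct_saving(*average_cost)       # close to the 36.6% shown
legacy_remote = implied_remote_share(8.20, 7.0, 12.0)
improved_remote = implied_remote_share(4.44, 3.50, 9.0)
```

Under that blending assumption, the figures imply that a larger share of issues is resolved remotely with high quality performance management, which would account for part of the saving alongside shorter repair times.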


In multi-vendor and multi-feature networks and environments, measuring, correlating and ensuring precision and granularity of KPIs is difficult. Across vendors, monitoring features are deployed and used in a variety of ways that differ from one proprietary system to another. Although standards exist to define objective KPIs (more on this below), not all vendors support the same standards, and all offer their own value-added methodologies to differentiate their products.

To get a full picture of what’s really going on and how it affects customers, you need a unified management system that, at the very least, supports all the monitoring standards used by all the different vendors in play, and correlates them in a coherent, meaningful way. A universal monitoring overlay across the various vendors may also be needed to collect KPIs currently missing entirely, or only collected in some locations.

5.1 The role of standards in QoE management

A question you probably struggle with often is, what’s happening at the infrastructure level and how does that impact the service layer? That question is answerable by using an analytics system to correlate objective KPIs like packet loss with subjective key quality indicators (KQIs) like mean opinion score (MOS) to make sense of it all and direct your problem resolution efforts to where they will have the most impact.

This process necessarily combines objective data and subjective quality index scores based on business rules and SLA specifications to provide insight into how network performance affects customer experience. None of this is possible without high-quality data, collected and classified in a standard way. Without a universally-understood reference point for KPIs, confusion would run rampant.
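As a simple illustration of correlating an objective KPI with a subjective KQI, the sketch below computes a Pearson correlation coefficient between paired packet loss and MOS samples. It stands in for what an analytics system would do at much larger scale; the function name and the idea of using Pearson correlation here are assumptions for illustration, not a description of any specific product.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)
```

A coefficient near -1 for packet loss against MOS would suggest that loss is a strong driver of perceived quality on that service, which helps direct resolution effort. Correlation alone does not establish the cause, so it is best read alongside other KPIs.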

Network- and service-based KPIs are well-established by organizations like MEF, ETSI, 3rd Generation Partnership Project (3GPP), Internet Engineering Task Force (IETF), Next Generation Mobile Networks (NGMN), and Institute of Electrical and Electronics Engineers (IEEE).

The question now is how to create a universal understanding of next-generation performance management, so that answering the question “How well are the network and services working?” is less daunting.

5. Standardized Performance Management


To reiterate: without standards-based KPIs and a clearly-defined method of how to use them for performance management, it is unclear how to best measure or calculate network performance and its relationship with QoE. Only very granular, precise metrics can be used to build quality indexes that define and verify the desired customer experience parameters. Leveraging complementary methodologies like active and passive monitoring is part of this process.

5.2 How active and passive monitoring work together

Active and passive monitoring are complementary; you need both, in combination with analytics and insight, to understand the multi-layer dynamic nature of virtual networks and services enabled by network functions virtualization (NFV), software-defined networking (SDN), and 5G. Service providers need to clearly visualize what is happening and take action in real time to optimize network efficiency and QoE, and to keep OPEX and CAPEX under control.

Active monitoring is a proactive, real-time tool mainly used to measure KPIs by injecting synthetic traffic to emulate the way a customer’s traffic would travel across and interact with the network (Layers 2-3).

Passive monitoring is a reactive, retrospective tool, mainly used to understand application behavior and service overlay and connectivity (Layers 4-7).

Figure 6: Performance test functions and standards

Function | Standard(s)
Service Activation Testing (SAT) | IETF RFC 2544 benchmarking; ITU-T Y.1564 turn-up testing
Layer 2 performance monitoring | ITU-T Y.1731; IEEE 802.3ah Ethernet SOAM
Layer 2 and 3 performance monitoring | RFC 5357 TWAMP

MEF has been working with other standards bodies, as well as vendors and service providers, to develop a standard way of defining how QoE and KQIs should be measured against services and the end user experience, through methodologies using measurement protocols defined by IETF, Internet Society (ISOC), and the Telecommunication Standardization Sector of the International Telecommunication Union (ITU-T).


5.3 Network visibility granularity: why it’s important

Data granularity determines the degree to which you can see what’s really happening in the network at any given time or over a specific time period, versus merely an average consensus of events. From experience, you know that the more granular data is, the easier it is to gain insights you can act on immediately to drive efficient problem resolution and network performance optimization. After all, you can only correlate time sequence-based events if you know exactly when they happened.

Is the network truly problem-free? Does it require proactive intervention? These questions can only be answered using data that’s accurate and precise.

Passive monitoring:
• Scope and origin of performance degradations (client, network, server, application)
• Application transaction visibility
• Micro-burst detection through utilization metering
• Two-way latency
• Retransmission rates

Active monitoring:
• Possible impact correlation of underlay service domain
• KPI input for composite QoE metrics
• Delay variation
• One-way latency
• Packet loss
• Hop-by-hop path analysis

Scale shown: application, quality of experience, quality of service

Figure 7: Passive monitoring, active monitoring, QoE, and QoS are interrelated aspects of performance management

Figure 8: The more granular data is, the easier it is to identify and resolve performance issues

• 20 pps: 50 ms visibility
• 40 pps: 25 ms visibility
• 50 pps: 20 ms visibility
• 1,000 pps: 1 ms visibility
• 10,000 pps: 0.1 ms visibility
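The relationship between test packet rate and time resolution in figure 8 is a simple reciprocal: the interval between packets, in milliseconds, is 1,000 divided by the packets-per-second rate. A minimal sketch, with an illustrative function name:

```python
def visibility_ms(packets_per_second):
    """Time between consecutive test packets, in milliseconds."""
    if packets_per_second <= 0:
        raise ValueError("packets_per_second must be positive")
    return 1000.0 / packets_per_second


# Rates taken from figure 8.
for rate in (20, 40, 50, 1_000, 10_000):
    print(f"{rate:>6} pps -> {visibility_ms(rate):g} ms between samples")
```

Any event shorter than this interval can fall between two test packets and go unobserved, which is why higher packet rates give finer visibility.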


6. Beyond Passive & Active Monitoring: Unified NPM and APM

Figure 9: Unified NAPM provides the tools to overcome real-world troubleshooting and operations challenges

Challenge

Reduce MTTR

Collaboratively resolve issues and optimize performance

Manage network performance

Manage infrastructure changes and service roll-outs

How to solve it

Visualize all network and application exchanges

Identify performance degradation scope and origin

Cover all data center environments

Provide fast, retrospective analysis

Deploy capture points anywhere in minutes

Show performance across network, system, and application tiers

Show all transactions for web applications, DNS, databases, and file transfers

Provide visibility into VoIP, Citrix XenApp, and Citrix XenDesktop

Provide alerts of any performance breaches and degradation

Understand network usage, leveraging NetFlow and packet capture

Map usage and performance to track leaks

Track network behavior changes and errors

Track before and after performance

Identify application and service dependencies

Use application profiling

To resolve performance problems in a way that’s fast and efficient for both you and your customers, the focus is typically on reducing mean time to repair (MTTR)/mean time to innocence (MTTI). This requires the use of both network performance management (NPM) and application performance management (APM). Working together, these systems give CSPs what’s needed to proactively improve QoE for end customers.

Complementary to traditional monitoring, NAPM fundamentally does two things: 1) measures the performance of network and applications based on traffic capture, and 2) provides a way to troubleshoot performance degradations.

CSPs are in the midst of transforming business processes, simplifying backend systems, and moving to cloud and IT network infrastructures. As this happens, new challenges arise around performance management, necessitating deployment of an end-to-end, multi-layer solution with real-time analytics and visualization to cover all aspects of application, service and network lifecycle management.

Challenge

North-South visibility

East-West visibility

Edge visibility

Locations

On-premises data center

Software-defined network (SDN) and virtual data center

Public cloud Infrastructure-as-a-Service (IaaS) coverage

Cloud services and Software-as-a-Service (SaaS)

Small form factor (SFP) units

Active testing and distributed packet capture

xFlow collection

Virtual and micro-capture appliances

To address those visibility challenges, you need a comprehensive performance management system that includes many components: service analytics using automation, machine learning, and artificial intelligence (AI); passive and active performance monitoring and measurement; continuous, service-based testing; bandwidth utilization metering; and distributed packet capture. Collectively, these cover the entire end-to-end network from core to end user, the entire infrastructure stack from underlay network to applications (Layer 2-7), and the entire service lifecycle.

Figure 10: An end-to-end, multi-layer performance management solution addresses a variety of visibility challenges


[Diagram. Labels shown:

• Management: end-to-end service management; customer/partner management; 3rd party partner; open APIs; analytics/automation/ML/AI
• End-to-end service orchestration: SLA & QoS intent, spanning 5G CSG/cloudlet access, MEC, core connectivity, and core data center, each hosting VNFs (example apps: NAT, DHCP, firewall, enterprise, DDoS)
• End-to-end service assurance: accurate data, granular data, high quality data, precision data; service KPI/KQI and network KPI/KQI; visualization; closed loop automation; QoE through actionable insight
• Smart analytics (big data): ingest, store, analyze; all layer data; multiple sources; beyond reporting; analytics and machine learning; single consistent view; all layers
• Passive network/app PM (L2-7 NAPM, QoE): real user monitoring; end-to-end application delivery; TCP metrics; North-South/East-West; wide-angle view
• Highly accurate & granular active-synthetic PM (L2-3 PM, QoS): patented one-way measurements; bandwidth utilization < 1 second; service activation testing; remote packet capture]

In a comprehensive, end-to-end performance management solution (that plays nicely with your organization’s existing ecosystem), like the one shown in figure 11, the different aspects of active, passive, and big data analytics function as a whole to give you a detailed view of the network and service layers, correlated with business and service assurance requirements. In doing so, it uses ‘smart AI’ to provide data and metadata that relates performance with customer experience, and helps you improve overlay and underlay connectivity to deliver promised service levels to customers.

Figure 11: The components of comprehensive, end-to-end performance management for future networks


While many CSPs implement various performance measurement tools and methodologies (and these may now cover both NPM and APM), it’s common to miss the important point that not all critical metrics are clearly visible. Such visibility is dependent on a wide-angle view of the entire service infrastructure, with capabilities to drill down deeply into specific problem areas. But an analytics engine can only provide meaningful output if the data it ingests is granular and high-quality. When it is, the result is business, operations, and commercial insight that makes it possible to proactively improve customer experience.

What types of capabilities are we talking about here? And what business benefits do they bring?

Figure 12: Performance assurance solution features and their business impact

Performance management feature: End-to-end monitoring per class of service (CoS) with one-way measurement, fault isolation, traffic generation, and loopbacks
Business impact:
• Improve network visibility
• Optimize network planning
• Reduce mean time to repair (MTTR)
• Increase customer satisfaction and QoE
• Reduce churn
• Predict how changes will affect the network

Performance management feature: Throughput measurement per virtual LAN (VLAN) with sub-second granularity
Business impact:
• Guide network expansion planning
• Optimize bandwidth and investment

Performance management feature: Service activation testing (SAT) that supports RFC 2544 and other standards for circuit “birth certificate”
Business impact:
• Validate SLA compliance before services launch
• Reduce truck rolls

Performance management feature: Packet brokering that can capture flows in any location and filter on any protocol
Business impact:
• Analyze any network flow to identify network issues
• Reduce cost of analytics

Performance management feature: Passive monitoring of applications in real-time, including transaction-level applications and user experience
Business impact:
• Increase business productivity by proactively monitoring application QoE and end user experience
• Optimize SaaS cloud application performance and reduce downtime

7. Making Sense of Everything with Service Analytics


7.1 The importance of precise, granular performance data

Combining objective data with subjective business rules is the foundation of effective performance management. Accurately calculating perceived customer experience based on network and services layer KPIs is possible by correlating high quality, raw performance data with other contextual data from sources like vendor equipment at various locations in core and distribution networks.

This goes beyond using performance monitoring data on its own. Visualizing what’s going on with services and the network assures that correct actions are taken so customers are not impacted.

But again, as stressed earlier, the type of data and how it is collected are hugely important. For example, as seen in figure13below,samplingspeedcanmakeahugedifferenceinunderstandingifeverythingisokay…ornot.

Here,whenbandwidthutilizationissampledat15-secondfrequency,thenetworkappears“green.”At1-secondsampling frequency, it looks like there may be a small problem. Only by getting down to 0.1 second sampling frequencycanyouseethere’sabigproblem:evenalessthan1percentpacketlossratecansignificantlyaffectthecustomer experience by decreasing throughput up to 50 percent!

Note that this is not just a matter of feeding monitoring data into a system with a visual dashboard. To track metrics that have a critical impact on QoE and business goals, analytics dashboards must be designed to show the relationship between reported data values.

A well-designed analytics system with dashboards used appropriately leads to an understanding of what’s happening in the network and what needs to be done to keep it under control to comply with SLAs and maintain a superior level of QoE.

Figure 13: Without very granular bandwidth utilization sampling frequency, the origin of significant performance problems may be invisible. (Figure title: Increasing sampling speed and adding statistical perspectives changes everything. Effects of sampling frequency: measure every 15 sec, no problem; every 5 sec, no problem; every 1 sec, small problem; every 0.1 sec, big problem.)
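The effect shown in Figure 13 can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the traffic trace is synthetic, and the 70 percent and 95 percent thresholds and all function names are assumptions chosen for the example, not values from this guide. It averages the same 0.1-second utilization samples into coarser windows and shows how a sub-second burst disappears as the window grows.

```python
def utilization_by_window(samples, samples_per_window):
    """Average fine-grained utilization samples into coarser windows."""
    windows = []
    for start in range(0, len(samples), samples_per_window):
        chunk = samples[start:start + samples_per_window]
        windows.append(sum(chunk) / len(chunk))
    return windows


def classify(peak, warn=0.70, critical=0.95):
    """Map the worst window to a status (thresholds are hypothetical)."""
    if peak >= critical:
        return "big problem"
    if peak >= warn:
        return "small problem"
    return "no problem"


# Synthetic trace: 15 seconds of link utilization sampled every 0.1 s.
# Baseline load is 20 percent, with one 0.8-second burst at 100 percent.
samples = [0.20] * 150
for i in range(70, 78):
    samples[i] = 1.00

# Number of 0.1 s samples that make up each reporting interval.
WINDOWS = {"15 s": 150, "5 s": 50, "1 s": 10, "0.1 s": 1}

results = {
    label: classify(max(utilization_by_window(samples, n)))
    for label, n in WINDOWS.items()
}

for label, status in results.items():
    print(f"measure every {label}: {status}")
```

With this trace, the 15-second and 5-second views average the burst away, the 1-second view sees only an elevated window, and only the 0.1-second view shows the link saturating.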

8. Putting It All Together: Case Studies

8.1 CellMobile: network expansion and user experience

CellMobile is a mobile communications service provider, ranked as the number two player in its region, battling fiercely to maintain its position in the market. It offers prepaid and postpaid mobile voice services, mobile broadband, enterprise solutions, bulk wholesale services, digital services, and machine-to-machine solutions. Its network assets include 2G, 3G, and 4G LTE infrastructure.

CellMobile has roughly 13 million prepaid and postpaid mobile service customers. It has 7,000 4G LTE cell sites deployed. CellMobile’s ongoing investments focus on network coverage, capacity, and performance with the goal of positioning itself as the country’s best mobile communications service provider.

Goals and challenges

LTE network expansion is CellMobile’s main goal, and it is making tangible progress. Base stations have been standardized on only two vendors, and new base stations are being added as demand grows.

However, some sectors of the network are experiencing low quality and low throughput. Users are experiencing noisy calls, dropped calls, and poor internet service. CellMobile thinks these issues are caused by data requirements of new users squeezing available bandwidth, and troubleshooting is focused on capacity.

CellMobile’s existing network performance monitoring solution is based on traditional SNMP metrics collected from eNodeBs, routers, and switches in the network. Event reporting occurs at 15-minute intervals. Bandwidth utilization is also reported on, using data from network elements. All reports are provided by RAN vendors.

To achieve a more precise, accurate view of what’s going on, and so focus troubleshooting efforts efficiently, CellMobile has put a network performance optimization initiative into place. CellMobile plans to deploy a standard solution with performance measurement points available on all existing cell sites. As new sites are added, it plans to use performance data from vendors. Eventually, all performance data from this variety of sources will be pulled into a centralized, standards-based AI engine to make sense of it all.

Active measurement requirements

CellMobile has developed concrete requirements for network performance monitoring and transport network capacity measurements to align performance management with quality assurance goals. Whatever solution it deploys must include tools for end-to-end transport network performance monitoring and troubleshooting, and end-to-end network utilization monitoring. These requirements focus on active measurements (continuous sampling):

• Ubiquitous coverage

• TWAMP for active measurements and interoperability with third-party reflectors

• Y.1731 testing available for Ethernet layer monitoring

• Virtual (VMware and KVM) solution scalable to thousands of monitoring sessions

• Deployable in a multi-vendor environment

• Smart SFPs capable of reflecting stateful and stateless TWAMP sessions

• Ability to start and stop individual monitoring sessions

• TWAMP initiated either from centralized servers or smart SFPs

• Monitoring sessions capable of up to 10,000 packets per second

• Ability to set packet size settings for each session, per class of service

• 1 minute or smaller monitoring window

• One-way metrics without requiring endpoints to be synchronized

• Measure bandwidth utilization in 10-second intervals or more frequently

• Dashboard-type management system to visualize KPIs and bandwidth utilization metrics, with data exportable to XML or CSV files
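To make the TWAMP requirements above more concrete, here is a minimal sketch of how round-trip delay, delay variation, and loss might be derived from the four timestamps a TWAMP test exchange (RFC 5357) provides. The `ProbeResult` type, the field names, and the sample values are hypothetical and exist only for this illustration. The sketch computes round-trip metrics only; how a given product derives one-way metrics without synchronized endpoints is not described in this guide and is not attempted here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class ProbeResult:
    seq: int
    t1: float            # sender transmit time (sender clock, seconds)
    t2: Optional[float]  # reflector receive time (reflector clock)
    t3: Optional[float]  # reflector transmit time (reflector clock)
    t4: Optional[float]  # sender receive time (sender clock); None if lost


def round_trip_delay(p: ProbeResult) -> float:
    """Round-trip delay excluding the time spent inside the reflector.

    Each difference uses a single clock, so any offset between the
    sender and reflector clocks cancels out.
    """
    return (p.t4 - p.t1) - (p.t3 - p.t2)


def summarize(probes):
    """Summarize loss, delay, and delay variation for one monitoring window."""
    answered = [p for p in probes if p.t4 is not None]
    if not answered:
        return {"loss_ratio": 1.0}
    delays = [round_trip_delay(p) for p in answered]
    # Delay variation between consecutive answered probes.
    variation = [abs(b - a) for a, b in zip(delays, delays[1:])]
    return {
        "loss_ratio": 1 - len(answered) / len(probes),
        "mean_rtt_ms": mean(delays) * 1000,
        "max_rtt_ms": max(delays) * 1000,
        "mean_delay_variation_ms": mean(variation) * 1000 if variation else 0.0,
    }


# Hypothetical window of four probes; the reflector clock is 5 s ahead.
probes = [
    ProbeResult(0, 0.000, 5.0040, 5.0045, 0.0085),
    ProbeResult(1, 0.100, 5.1050, 5.1055, 0.1105),
    ProbeResult(2, 0.200, None, None, None),  # no reply: counted as lost
    ProbeResult(3, 0.300, 5.3060, 5.3065, 0.3125),
]

summary = summarize(probes)
print(summary)
```

In this example the 5-second clock offset has no effect on the result, which is why round-trip measurements are the simplest to obtain from unsynchronized endpoints.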

Performance management requirements

Further, CellMobile defines specific requirements for features and capabilities of a performance management solution to support its business goals, focused on converged NAPM and analytics.

• Cloud native solution with real-time, 360-degree visibility into both network and application performance. This is needed to drive shorter resolution times, optimize productivity, and ensure 24x7x365 availability of business-critical applications.

• Proactive, actionable insights into root causes of degradations and long-term performance trends, through forensic and historic analysis, and transactional analysis of file storage and transfer flows.

• Detailed performance reports and dashboards to help data center staff gain a faster understanding of performance issues and their root causes.

• Unified network and performance monitoring to eliminate inefficiencies resulting from varied one-off solutions for separate functions like device monitoring, WAN performance, and application visibility.

• Cross-functional capabilities useful to both Level 2 and Level 3 teams to gain visibility into performance issues and how to resolve them.

Figure 14: CellMobile’s performance management solution requirements, focused on full-stack, converged NAPM and analytics. (Figure labels. Converged NAPM and analytics: automated discovery, dependency mapping, topology visualization, correlative intelligence, root cause diagnosis, auto-baselining, historical reports. Digital experience monitoring: application discovery, tracing and diagnostics; application analytics; end-user experience. Full-stack application performance monitoring (APM): business transactions, application middleware, database, operating system, server (physical/virtual), network, storage, etc. Deployment: on-premises and cloud.)


8.2 CloudMobile: cloud services and QoE

CloudMobile is a communications service provider with significant revenues generated through business services. It has three large data centers in South Africa, and wants to grow its cloud services presence but is concerned that doing so might adversely affect promised quality of experience (QoE) and SLAs in the region, given the new network complexities involved.

Goals and challenges

Because CloudMobile’s main challenge is managing the complexity of its growing data center business, it is seeking a performance management solution that’s agile and provides complete (Layer 2-7) visibility into infrastructure, service/application performance, and related environments. CloudMobile needs these capabilities to achieve its new revenue stream goals without negatively affecting QoE for existing customers.

CloudMobile realizes that, by changing and growing its cloud-based architecture, it must be able to see the entire “service picture” through clear insight into the overall logical and physical infrastructure. Different applications, after all, have different dependencies for each customer. The only way to effectively manage this complexity is with a converged NAPM and analytics model for its performance management.

8.3 Value achieved through correlated data and converged NAPM

In both cases, implementing a performance management system that correlates data, converges NAPM, and leverages analytics gives these CSPs tools to solve a variety of performance issues and achieve their business goals.

Each of the CSPs highlighted here deployed this performance management model, in order to:

• Notice the timing of bandwidth spikes and correlate that information with a marketing campaign to change customer behavior and improve user experience.

• Observe the location and timing of micro-outages and correlate that information with data from other sources to identify overutilized network segments.

• Identify abnormal ring switchovers that were impacting QoE; fixing these through a software release significantly improves customer experience.

• Perform effective capacity planning by knowing exactly how much bandwidth is utilized, when, and where.

• Proactively monitor network behaviors likely to cause issues and mitigate those before they affect QoE.

• Reduce truck rolls and problem resolution costs by capturing network flows at specific locations, brokering them using smart small form-factor pluggable (SFP) devices, and analyzing the data centrally for actionable insight.

• Achieve service- and application-level visibility through proactive monitoring of the network, applications, and web services.

• Accelerate resolution time, thereby benefiting from significant ROI savings and higher customer satisfaction.

• Identify and pinpoint the root cause of performance issues, quickly and easily.

• Leverage full North-South and East-West traffic monitoring for full visibility into the data center and virtual domains.

• Offer service assurance to cloud-based business customers, through the ability to troubleshoot bottlenecks and degradation and identify their origin/location.
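As one illustration of the correlation described in the list above, the sketch below matches micro-outage events against bandwidth utilization samples from a separate source to flag segments that are repeatedly saturated when outages occur. The record layouts, segment names, time window, and thresholds are all assumptions made for the example; a real deployment would draw these from its own data sources and policies.

```python
from collections import Counter

# Hypothetical micro-outage events: (segment, timestamp in seconds).
outages = [
    ("seg-A", 100.2),
    ("seg-A", 340.7),
    ("seg-B", 120.0),
    ("seg-A", 610.1),
]

# Hypothetical utilization samples from a second source:
# (segment, timestamp in seconds, utilization as a fraction of capacity).
utilization = [
    ("seg-A", 100.0, 0.97),
    ("seg-A", 200.0, 0.30),
    ("seg-A", 340.5, 0.93),
    ("seg-A", 610.0, 0.40),
    ("seg-B", 120.1, 0.35),
]


def overutilized_segments(outages, utilization,
                          window_s=1.0, threshold=0.90, min_hits=2):
    """Return segments where at least `min_hits` outages coincide with
    utilization at or above `threshold` within `window_s` seconds."""
    hits = Counter()
    for segment, t_outage in outages:
        nearby = [
            load for seg, t, load in utilization
            if seg == segment and abs(t - t_outage) <= window_s
        ]
        if nearby and max(nearby) >= threshold:
            hits[segment] += 1
    return sorted(seg for seg, count in hits.items() if count >= min_hits)


suspects = overutilized_segments(outages, utilization)
print(suspects)
```

Here two of the three outages on seg-A line up with near-saturated utilization, so it is flagged, while the outage on seg-B occurs at low load and points to a different cause.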


9. Conclusion: The Future of Performance Management

Communications networks and the services running on them are getting more complicated to design, deploy, and manage. There are more endpoints, types of connectivity, machine-to-machine interactions, and capacity requirements. These drive CSPs to invest in new network and IT systems as a way to manage dynamic, on-demand services. Making the right investments is crucial to ensure these new tools and systems are both cost-effective and efficient, with capabilities that can support fully automating service quality and end-to-end customer experience management. (That automation will not happen immediately, but it’s coming.)

Adding fuel to the fire is the introduction of converged mobile and fixed architecture, along with the move toward 5G with its complex set of factors (including virtualization, orchestration, multi-access edge computing, and network slicing) that demands end-to-end management of both physical and virtual infrastructure.

9.1. A service-centric future

Performance management is shifting to a service-centric architecture model, where the availability of applications and services and the reliability of the network underlay are key. Service-based performance management is needed for QoE assurance and fulfilling service level agreements (SLAs). There are several crucial steps to this:

• Collect the right performance data, with the right granularity and precision

• Use passive and active monitoring

• Correlate data from multiple sources

• Perform analytics on data

• Feed insight to orchestrators for closed-loop automation

• Use SDN capabilities to automate QoE
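The steps above can be sketched as a small closed loop: build a baseline from collected KPI samples, test each new sample against it, and hand deviations to an orchestrator. This is a minimal sketch under stated assumptions. The three-standard-deviation tolerance, the `review_path` action label, and the `notify_orchestrator` callback are hypothetical; real orchestrators and SDN controllers expose their own interfaces, which this guide does not specify.

```python
from statistics import mean, pstdev


def evaluate(kpi_name, history, latest, tolerance=3.0):
    """Return a remediation request if `latest` deviates from the
    baseline built from `history`, otherwise None."""
    baseline = mean(history)
    limit = tolerance * pstdev(history)
    if abs(latest - baseline) <= limit:
        return None
    return {
        "kpi": kpi_name,
        "observed": latest,
        "baseline": baseline,
        "action": "review_path",  # hypothetical action label
    }


def closed_loop(kpi_name, history, new_samples, notify_orchestrator):
    """Feed deviations to the orchestrator; fold normal samples into the baseline."""
    for sample in new_samples:
        request = evaluate(kpi_name, history, sample)
        if request is not None:
            notify_orchestrator(request)
        else:
            history.append(sample)  # only normal samples update the baseline


# Hypothetical latency samples in milliseconds.
history = [10.1, 9.8, 10.3, 10.0, 9.9]
sent = []  # stands in for an orchestrator API in this sketch

closed_loop("latency_ms", history, [10.2, 25.0, 10.1], sent.append)
print(sent)
```

Keeping anomalous samples out of the baseline is a design choice for this sketch: it stops a sustained degradation from gradually redefining what "normal" looks like.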

Traditional network performance management approaches, embedded in the existing network architecture, are difficult or impossible to scale. To overcome that limitation, next-generation performance management must provide:

• Flexibility to support rapid, non-disruptive augmentation or replacement of MPLS networks with any form of broadband connectivity.

• Unified visibility and control across the entire application network and service chain, to centrally apply business intent policies in line with QoE requirements.

• Application performance management to significantly improve end-user satisfaction.

• Bandwidth management to reduce the cost of connectivity, equipment, and network administration.

• Real-time insight into every platform, link, application, and user across complex, multi-cloud deployments.

• Scalability for easy, dynamic setup.


9.2. Business model transformation

The ability to keep customers happy, increasingly the only competitive differentiator that really matters, depends on how well you understand their overall service experience. This means 1) using active and passive monitoring to measure Layer 2-7 performance and 2) analyzing the results and taking action, whether manual or automated, to sustain customer loyalty. New and sustained revenue streams depend on it.

Over 70%* of CSPs say that mobile and edge cloud assets give them a performance advantage over public cloud providers, especially in the enterprise and business services market. The challenge lies in selling 5G, IoT, SD-WAN, and related services with performance guarantees, and actually delivering on that promise. It’s simply not possible to do that using traditional performance management tools and methodologies.

Instead, the way forward is with next-generation performance management that can handle virtualization, automation, network slicing, and other complex aspects of the service-oriented architecture that defines the future. CSPs say that the most important aspects* of this are:

• Unified performance management visibility across all network and application layers

• Real-time analytics for root-cause identification, closed-loop QoE automation, and network edge management

• Continuous monitoring of circuits for full visibility into QoE-impacting performance issues

*Source: Heavy Reading 5G service provider survey Q3 2018


© 2019 Accedian Networks Inc. All rights reserved. Accedian, the Accedian logo and Skylight are trademarks or registered trademarks of Accedian Networks Inc. To view a list of Accedian trademarks visit: accedian.com/legal/trademarks

About Accedian

Accedian is the leader in performance analytics and end user experience solutions, dedicated to providing our customers with the ability to assure their digital infrastructure, while helping them to unlock the full productivity of their users.

We are committed to empowering our customers with the ability to see far and wide across their IT and network infrastructure and a microscopic ability to dive deep and understand the experience of every user, helping them to delight their own customers each and every time.

Accedian has been delivering solutions to high profile customers globally for over 15 years.

Learn more at Accedian.com

2351 Blvd. Alfred Nobel, N-410 Saint-Laurent, QC H4S 2A9 1 866-685-8181 accedian.com
