1 Guide Network and Application Performance Management in the Cloud and 5G Era
Guide
Network and Application Performance Management in the Cloud and 5G Era
Table of Contents
1 Introduction
1.1 What is performance management?
1.2 How this document will help you
2 The Basics: 10 Fundamentals of Performance Management
3 KPIs: The Foundation of Performance Management
3.1 The relationship between KPIs and performance management
3.2 KPIs that affect quality of experience (QoE)
3.3 How to leverage KPIs
4 Achieve Business Goals
4.1 Benefits of complete performance management
4.2 Risk of incomplete performance management
5 Standardized Performance Management
5.1 Correlating network performance and QoE using standards
5.2 How active and passive monitoring work together
5.3 Network visibility granularity: why it’s important
6 Beyond Passive & Active Monitoring: Unified NPM and APM
7 Making Sense of Everything with Service Analytics
7.1 Performance data quality and correlation
8 Putting It All Together: Case Studies
8.1 CellMobile: network expansion and user experience
8.2 CloudMobile: cloud services and QoE
8.3 Value achieved through correlated data and converged NAPM
9 Conclusion: The Future of Performance Management
9.1 A service-centric future
9.2 Business model transformation
1. Introduction
For your communications service provider (CSP) organization, the ability to acquire and retain customers, and to make a profit, is increasingly driven by a positive customer experience, all the way from ordering a service and managing service quality through to billing and customer care. Especially as 5G services emerge, many of them ‘mission critical’ with no tolerance for downtime, proactively managing quality of experience (QoE) becomes a key requirement to ensure customer loyalty, sustain or grow revenue, and reduce churn. What is your plan to take control of 5G customer experience?
Your answer should focus on network and application performance management, which encompasses physical and software/virtual layers all the way from the core to the end user. This is vital to collect the key performance indicators (KPIs) you will use to accurately calculate 5G QoE.
However, traditional performance monitoring systems and methods are no longer sufficient for your organization’s operations and customer care teams to respond effectively to customer dissatisfaction with network speed or service performance. Effective performance management for increasingly complex services, applications, and cloud networks must be real-time and dynamic. You need new tools and a new approach.
1.1 What is performance management?
In this context, performance management means the ability to define and measure key performance indicators (KPIs) to understand, control, report, and optimize the end-to-end network and the entire service lifecycle. Its purpose is to extract meaning from performance data and then make informed, effective decisions.
Performance management, unlike performance monitoring, isn’t just about checking on network performance (which can be misleading, since the network may appear to be ‘all green’ while QoE suffers) or fixing obvious problems; components will need to be monitored and managed, including virtualized network functions, distributed edge architectures, service chains, network slices, and cloud data center infrastructure.
Traditionally, “performance management” has been confined mostly or completely to monitoring the core and distribution network, ending at a demarcation point some distance from the customer. Increasingly, service providers are integrating network performance management (NPM) with application performance management (APM), which provides visibility into the performance of last mile delivery and the behavior of specific services and even applications on end user devices. Are users experiencing what they were promised? You must now consider APM as part of performance management.
1.2 How this guide will help you
While there are many facets to performance management, the purpose of this guide is to share an overview of the discipline, and offer insight into Accedian’s approach to continuous, precise performance monitoring for complex cloud networks and services.
In this overview, we cover:
• Best practice recommendations based on industry data and real-world experiences of service providers globally.
• Meeting service level agreements (SLAs) for different types of networks and services. What are the differences? What core principles of effective, continuous performance monitoring do they have in common?
• Emerging network standards (MEF 3.0, 5G, DOCSIS 3.1, LTE, LTE-A, VoLTE) and their quality and capacity requirements. How do new, more stringent requirements intersect with modern performance management solutions?
• Ensuring a world-class network and a positive customer experience. How can a service provider be sure end customers are getting excellent quality of experience (QoE) and the best value for their money?
• Using existing standards for ‘vendor neutral’, continuous, proactive monitoring. How does network performance impact the core business drivers that are key to revenue generation? How does this apply to networks today, and in the future?
• Networking changes that are making performance management increasingly important. How does this apply to mobile, cable, and backhaul networks? What does this look like in the real world?
2. The Basics: 10 Fundamentals of Performance Management
Performance monitoring and assurance solutions must be open, programmable, and fully support multi-vendor networks. Ideally, the solutions you use should provide truly independent visibility and provide relevant KPI, QoE and SLA intelligence to support all networks, resources and services. This applies to 5G, cable MSO offerings, mobile services, small cells, business Ethernet, financial networks, software-defined networks (SDNs), and data center connectivity and cloud infrastructure.
Underpinning all this is a set of ten capabilities service providers must include in their performance management systems. While somewhat adaptable, these capabilities must work in unison for effective QoE and SLA management.
Figure 1: Ten fundamentals of performance management
• Accurate and precise: High resolution for tight control-plane timing
• Granular: Keep pace with rapid sampling
• One-way metrics: Leverage directional test results
• Programmable: Automated, tailored assurance
• Standards-based: Use available standards
• Continuous: Monitor 24/7
• Interoperable: Leverage multi-vendor infrastructure
• Open: Accessible applications using APIs
• Real-time: Faster response with instant feedback
• Ubiquitous: Cover core-to-edge; eliminate blind spots
3. KPIs: The Foundation of Performance Management
We define KPIs as measurable performance reference points that identify key areas you need to understand, control, visualize, report on, and optimize in order to improve service and network performance.
Useful KPIs:
• Are relevant and clear
• Generate timely and useful analytics data
• Lead to meaningful QoE-focused actions that directly impact customers
3.1 The relationship between KPIs and performance management
You can leverage KPIs to map network changes and measure their impact on finances, revenue, customers, and strategic initiatives; to quickly identify the best course of action to achieve business success; and to create actionable results about what to improve. Analytics, machine learning, and AI algorithms can help you make sense of the volumes of data and KPIs and correlate performance data with other data. These tools help answer your key performance management questions and help determine recommended actions. Is my network getting better or worse? What applications are people using? How well are those applications performing? Where are the recurring trouble spots?
3.2 KPIs that affect quality of experience (QoE)
As an operations professional tasked with creating procedures to improve QoE and service quality, you have many KPIs at your disposal. CSPs and vendors typically only use a few of these: delay, packet loss, delay variation, and mean opinion score (MOS). But those are just the tip of the iceberg! Many more KPIs exist to help you pinpoint performance problems, and reduce mean time to innocence (MTTI) and/or mean time to repair (MTTR).
Here are the most important (although not universally supported) KPIs:
• Packet delay is the time it takes for a packet to travel from source to destination
• Inter-packet delay variation is the delay variation between consecutive packets
• Packet delay variation is the comparison of a delay percentile with the minimum delay value
• Packet loss is the number of lost packets and the number of periods of lost packets
• Packet loss burst is consecutive packets lost within a specific time period
• Packet misorder is when packets arrive out of order and cause anomalies in a service or application
• Packet duplication is when the same packet is generated and received at the termination point
• Type of Service (ToS) is how packets are identified and are therefore processed, prioritized, and buffered
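As a rough illustration of how several of the KPI definitions above could be computed from per-packet test records, here is a minimal Python sketch. The record layout (sequence number, send time, receive time or None for a lost packet), the function name `kpi_summary`, and the nearest-rank percentile method are assumptions made for this example, not part of any particular product or standard. One-way delay values are only meaningful if sender and receiver clocks are synchronized.

```python
import math

def kpi_summary(samples, percentile=99.0):
    """Summarize delay and loss KPIs from per-packet records.

    samples: list of (seq, tx_ms, rx_ms) tuples in send order;
             rx_ms is None when the packet was lost (hypothetical format).
    """
    delays = [rx - tx for _seq, tx, rx in samples if rx is not None]
    if not delays:
        raise ValueError("no packets were received")

    # Inter-packet delay variation: delay difference between consecutive
    # received packets (lost packets are skipped).
    ipdv = [abs(later - earlier) for earlier, later in zip(delays, delays[1:])]

    # Packet delay variation: a high delay percentile minus the minimum delay
    # (nearest-rank percentile, chosen here for simplicity).
    ordered = sorted(delays)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    pdv = ordered[rank - 1] - ordered[0]

    # Packet loss: lost packets, number of loss periods, longest loss burst.
    lost = periods = longest = run = 0
    for _seq, _tx, rx in samples:
        if rx is None:
            lost += 1
            run += 1
            if run == 1:
                periods += 1
            longest = max(longest, run)
        else:
            run = 0

    return {
        "mean_packet_delay_ms": sum(delays) / len(delays),
        "max_inter_packet_delay_variation_ms": max(ipdv, default=0.0),
        "packet_delay_variation_ms": pdv,
        "lost_packets": lost,
        "loss_ratio": lost / len(samples),
        "loss_periods": periods,
        "longest_loss_burst": longest,
    }

# Illustrative data only: six test packets, two of them lost back to back.
example = [
    (1, 0.0, 10.0), (2, 20.0, 32.0), (3, 40.0, None),
    (4, 60.0, None), (5, 80.0, 91.0), (6, 100.0, 115.0),
]
result = kpi_summary(example)
print(result)
```

Misorder and duplication could be derived from the same records by comparing arrival order and counting repeated sequence numbers; they are left out here to keep the sketch short.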
Figure 2: Causes of packet loss, delay, and jitter
The figure groups contributing factors under three questions: What causes packet loss? What causes delay variation (jitter)? What causes delay? Factors shown: bursty traffic; no buffering; network configuration; over-buffering; packet buffering in routers and switches terminating highly utilized links; number of hubs; TCP throughput; packet size; speed and duplex; physical issues; cabling; fiber; maxed-out CPUs; hardware and software malfunctions.
For reference, here are MEF-defined recommended values and limits for some of the key KPIs discussed here.
Figure 3: Recommended values and limits for KPIs. Source: Based on MEF 23 Class of Service (CoS) references
Application | Frame delay | Mean frame delay | Frame loss ratio | Frame delay ratio | Inter frame delay variation
VoIP data | 125 ms pref, 375 ms limit (Pd = 0.999) | 100 ms pref, 350 ms limit | 3e-2 | 50 ms (Pr = 0.999) | 40 ms (Pr = 0.999)
Video conferencing data | 125 ms pref, 375 ms limit (Pd = 0.999) | 100 ms pref, 350 ms limit | 1e-2 | 50 ms (Pv = 0.999) | 40 ms (Pv = 0.999)
VoIP and videoconf signaling | Not specified | 250 ms pref | 1e-3 | Not specified | Not specified
IPTV data plane | 125 ms (Pd = 0.999) | 100 ms | 1e-3 | | 40 ms (Pv = 0.999)
IPTV control plane | Not specified | 75 ms | 1e-3 | Not specified | Not specified
Streaming media | Not specified | Not specified | 1e-3 | 2 s | 1.5 s (Pv = 0.99)
Interactive gaming | 50 ms | 40 ms | 1e-3 | 10 ms | 8 ms
Mobile backhaul H | 10 ms | 7 ms | 1e-4 | 5 ms | 3 ms
Mobile backhaul M | 20 ms | 13 ms | 1e-4 | 10 ms | 8 ms
Mobile backhaul L | 37 ms | 28 ms | 1e-3 | Not specified | Not specified
Key: ms = millisecond | pref = preferred | Pd = Packet Delay | Pr = Packet Ratio | Pv = Packet Variation
3.4 How to leverage KPIs
Even small application or service-based issues can have a tremendous impact on the customer experience, especially if the network is “green” yet customers are having service issues. You need to be able to quickly identify the root cause of customer dissatisfaction. If you are learning about quality problems from customer complaints to the service center, you are not properly leveraging the power of KPIs.
Here are some of the most common QoE issues and the KPIs you can use to identify and track them:
• Dropped calls are implied if specific cell sites exceed packet loss burst limits.
• Poor voice quality is likely to be present if jitter or delay, or a combination of the two, are present. Packet loss also impacts voice quality.
• Overall degraded internet performance is often the result of TCP re-transmissions due to delay, jitter, and packet loss; this impacts the throughput the end user experiences.
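To make the link between loss, delay, and TCP throughput more concrete, here is a small Python sketch based on one widely cited approximation for steady-state TCP throughput (often attributed to Mathis et al.): throughput is roughly (MSS / RTT) × C / √p, where p is the packet loss ratio. This model is not taken from this guide, the function name and parameter values are illustrative assumptions, and real-world results depend heavily on the TCP variant and traffic pattern.

```python
import math

def approx_tcp_throughput_mbps(mss_bytes, rtt_ms, loss_ratio, c=math.sqrt(3 / 2)):
    """Upper-bound estimate of TCP throughput (Mbps) under random loss.

    Simplified steady-state approximation: rate ~ (MSS / RTT) * C / sqrt(p).
    All inputs are illustrative assumptions, not measured values.
    """
    if loss_ratio <= 0:
        raise ValueError("loss_ratio must be greater than zero")
    rate_bps = (mss_bytes * 8) / (rtt_ms / 1000) * c / math.sqrt(loss_ratio)
    return rate_bps / 1e6

# Compare a few loss ratios at a fixed MSS and round-trip time.
for loss in (0.001, 0.002, 0.005, 0.01):
    estimate = approx_tcp_throughput_mbps(mss_bytes=1460, rtt_ms=50, loss_ratio=loss)
    print(f"loss {loss:.1%}: about {estimate:.1f} Mbps")
```

In this model, throughput falls with the square root of the loss ratio and linearly with round-trip time, which is why small increases in loss or delay can show up as a noticeable slowdown for end users.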
3.3 KPIs for 5G
The specifications shown above are fairly “traditional” in their association with 4G and earlier network architectures. 5G brings a much more intense set of performance demands. While these are not yet fully defined, the chart below gives a good picture of what your CSP organization faces with performance management for future networks. The biggest challenge lies in the orange box, with very low-latency and packet loss-sensitive applications.
See figure 11 for a high-level view of what a next generation performance management system entails.
Figure 4: Expected, very tight specifications for 5G services demand end-to-end performance management. Source: GSMA, Accedian
5G services & KPI mapping. For each service type, the chart rates throughput required (Mbps), latency allowable (ms), packet loss sensitivity (%PL), and local compute required, using three performance levels: not demanding (low impact if not met), within LTE specs (moderate impact if not met), and challenging (high impact if not met).
Class | Type | Examples
mMTC (Massive Machine-Type Communications) | IoT | sensors for agriculture, smart meters, connected car insurance, smart city lighting, package tracking and logistics
uRLLC (Ultra-Reliable, Low-Latency Communications) | | real-time factory automation, remote patient vital sign monitoring, connected car collision avoidance/federated driving
eMBB (Extreme Mobile Broadband) | Real-Time Communication/Interaction | video/hologram calling, VR gaming, augmented reality, telematics, fixed line internet replacement
eMBB (Extreme Mobile Broadband) | Streaming Video/Media | UHD 3D YouTube, VR movies & AR sporting events, 8k video
eMBB (Extreme Mobile Broadband) | HTTP/HTTPS Internet/Web Services | content, visuals, client to server-based web applications and basic HTTP/HTTPS connectivity to web
Note: NB-LTE-M/LTE-IoT allow devices to communicate as rarely as once every 40 minutes.
4. Achieve Business Goals
Service providers must prove they are proactively on the side of the customer. This goes beyond “mean time to innocence” (how long it takes to show that the issue isn’t with your network) to quickly identifying the source of a problem and resolving it, ideally before customers are affected.
The goal here is to reduce mean time to diagnose and mean time to repair (MTTR), leveraging machine learning and analytics capabilities to identify and prevent recurring problems, as well as predict issues and take action to avoid them. The shorter the time, the better for everyone: the customer is happy, and the service provider avoids the high costs of calls to customer support.
Performance issues that might not seem like a big deal can in fact snowball into customer churn and lost revenue. For example, we’ve seen instances where packet loss as low as 0.2% can result in a 10% reduction in TCP throughput, and only 0.5% packet loss can result in a reduction of up to 50%. Micro impairments can have macro consequences. Related issues like this become clear only when KPIs are based on meaningful, actionable performance metrics.
4.1 Benefits of complete performance management
Performance management tends to be focused on network operations and the health of the network, its infrastructure, and ultimately the QoE delivered. But it is also a means to an end: achieving business goals. A successful launch of a service means using performance management capabilities to ensure that the “service” is running within the required specs to assure QoE.
What you can do with well-defined KPIs:
• Proactively detect, troubleshoot, and resolve issues
• Continuously control and adjust the quality of service (QoS) and bandwidth allocation for business-critical traffic
• Reduce the risk of investment in new technology, and rapidly deploy it
• Verify that a network is ready to launch new services
• Prove performance metrics before and after launching new services and technologies
• Deliver unique, consistently performing services to generate new revenue streams
• Improve QoE to reduce customer churn
• Consistently meet service level agreements (SLAs)
Existing performance monitoring tools may lead you to think that the network is behaving normally when in fact QoE has significant issues. Through precise, granular, accurate KPIs and analytics, it’s possible to pinpoint previously hidden causes of QoE degradation.
Figure 5: Risk of incomplete performance management
Mean time & cost to repair | With legacy performance monitoring | With high quality performance management | Percentage savings
Remote hours | 7.0 | 3.50 | 50%
On-site hours | 12.0 | 9.0 | 25%
Average time (hours) | 8.20 | 4.44 | 45.8%
Remote repair cost | $332.10 | $179.60 | 45.9%
Total cost with truck roll & labor on-site | $600.00 | $600.00 |
Average cost to repair | $396.40 | $251.08 | 36.6% (total savings with high quality performance management)
4.2 Risk of incomplete performance management
Stop us if this sounds familiar: you spend too much time resolving performance problems because it is so hard (or even impossible) to identify their root cause. Customer QoE suffers and call centers are flooded with complaint calls. The situation snowballs, as dealing with each trouble ticket ties up resources that could focus on less immediate issues that will inevitably become urgent at some point. This waterfall effect (dam gets full, overflows, results in a flood of issues) has long-term consequences like churn, loss of revenue, and increased operational costs.
To stop this in its tracks, you need granular visibility into the root cause of performance issues. When you have that, there’s a butterfly effect that is the opposite of the waterfall: fewer service calls, more resources to work on growth projects, and lower operational costs. Imagine if, when a customer does call with a service issue, you can confidently inform them you are already aware of the problem, your team is working on it, and it will be resolved soon. You can be in control!
Fixing problems that customers are having is a significant contributor to OPEX. Each problem turns into a trouble ticket that must be managed by the network operations center (NOC), operations, marketing, and other teams. When the same issue repeats, it may result in multiple trouble tickets, which not only increases MTTR but also creates an ongoing QoE problem for the customer. Through high quality performance management with granular visibility to identify and resolve issues, OPEX can be significantly reduced.
Figure 5 shows a generalized example of this based on combined/averaged results from a variety of service providers (dollar amounts based on standardized assumptions about labor and other costs). Due to reduction in the number of customer tickets and the time it takes to resolve issues, average cost to repair is reduced by over 36%.
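The percentage savings in Figure 5 follow from simple before-and-after arithmetic. The short Python sketch below recomputes them from the legacy and high quality values listed in the figure; the dictionary layout and function name are illustrative choices for this example. The computed results agree with the published percentages to within a tenth of a point.

```python
def percent_saving(legacy, improved):
    """Relative reduction, as a percentage of the legacy value."""
    return (legacy - improved) / legacy * 100

# (legacy value, high quality performance management value) from Figure 5
figure5 = {
    "Remote hours": (7.0, 3.5),
    "On-site hours": (12.0, 9.0),
    "Average time (hours)": (8.20, 4.44),
    "Remote repair cost ($)": (332.10, 179.60),
    "Average cost to repair ($)": (396.40, 251.08),
}

savings = {name: percent_saving(*values) for name, values in figure5.items()}
for name, value in savings.items():
    print(f"{name}: {value:.2f}% saving")
```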
5. Standardized Performance Management
In multi-vendor and multi-feature networks and environments, measuring, correlating and ensuring precision and granularity of KPIs is difficult. Across vendors, monitoring features are deployed and used in a variety of ways that differ from one proprietary system to another. Although standards exist to define objective KPIs (more on this below), not all vendors support the same standards, and all offer their own value-added methodologies to differentiate their products.
To get a full picture of what’s really going on and how it affects customers, you need a unified management system that, at the very least, supports all the monitoring standards used by all the different vendors in play, and correlates them in a coherent, meaningful way. A universal monitoring overlay across the various vendors may also be needed to collect KPIs currently missing entirely, or only collected in some locations.
5.1 The role of standards in QoE management
A question you probably struggle with often is, what’s happening at the infrastructure level and how does that impact the service layer? That question is answerable by using an analytics system to correlate objective KPIs like packet loss with subjective key quality indicators (KQIs) like mean opinion score (MOS) to make sense of it all and direct your problem resolution efforts to where they will have the most impact.
This process necessarily combines objective data and subjective quality index scores based on business rules and SLA specifications to provide insight into how network performance affects customer experience. None of this is possible without high-quality data, collected and classified in a standard way. Without a universally understood reference point for KPIs, confusion would run rampant.
Network- and service-based KPIs are well established by organizations like MEF, ETSI, 3rd Generation Partnership Project (3GPP), Internet Engineering Task Force (IETF), Next Generation Mobile Networks (NGMN), and Institute of Electrical and Electronics Engineers (IEEE).
The question now is how to create a universal understanding of next-generation performance management, so that answering the question, how well are the network and services working? is less daunting.
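As a simple illustration of correlating an objective KPI with a subjective KQI, as described in this section, the Python sketch below computes the Pearson correlation between packet loss and mean opinion score (MOS). The sample values are invented purely for illustration, and a production analytics system would use far richer models than a single correlation coefficient.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical per-interval measurements (not real data):
packet_loss_pct = [0.0, 0.1, 0.5, 1.0, 2.0]   # objective KPI
mos_score = [4.4, 4.3, 3.9, 3.4, 2.6]         # subjective KQI

r = pearson(packet_loss_pct, mos_score)
print(f"correlation between packet loss and MOS: {r:.3f}")
```

A strongly negative coefficient in a sketch like this suggests that loss is a good place to focus problem resolution efforts for the service in question; a weak one suggests looking at other KPIs such as delay or jitter.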
To reiterate: without standards-based KPIs and a clearly defined method of how to use them for performance management, it is unclear how to best measure or calculate network performance and its relationship with QoE. Only very granular, precise metrics can be used to build quality indexes that define and verify the desired customer experience parameters. Leveraging complementary methodologies like active and passive monitoring is part of this process.
5.2 How active and passive monitoring work together
Active and passive monitoring are complementary; you need both, in combination with analytics and insight, to understand the multi-layer dynamic nature of virtual networks and services enabled by network functions virtualization (NFV), software-defined networking (SDN), and 5G. Service providers need to clearly visualize what is happening and take action in real time to optimize network efficiency and QoE, and to keep OPEX and CAPEX under control.
Active monitoring is a proactive, real-time tool mainly used to measure KPIs by injecting synthetic traffic to emulate the way a customer’s traffic would travel across and interact with the network (Layers 2-3).
Passive monitoring is a reactive, retrospective tool, mainly used to understand application behavior and service overlay and connectivity (Layers 4-7).
Figure 6: Performance test functions and standards
Function | Standard(s)
Service Activation Testing (SAT) | IETF RFC 2544 benchmarking; ITU-T Y.1564 turn-up testing
Layer 2 performance monitoring | ITU-T Y.1731; IEEE 802.3ah Ethernet SOAM
Layer 2 and 3 performance monitoring | RFC 5357 TWAMP
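To show how an active test such as TWAMP turns synthetic packets into delay KPIs, here is a minimal Python sketch working from the four timestamps a two-way test exchange produces: sender transmit (t1), reflector receive (t2), reflector transmit (t3), and sender receive (t4). The function name and the sample values are illustrative assumptions. The one-way results are only valid when the sender and reflector clocks are synchronized; the round-trip result does not depend on synchronization.

```python
def two_way_test_delays(t1, t2, t3, t4):
    """Derive delay metrics (ms) from the four timestamps of one test packet.

    t1: sender transmit, t2: reflector receive,
    t3: reflector transmit, t4: sender receive.
    """
    forward = t2 - t1                   # one-way delay, sender to reflector
    reverse = t4 - t3                   # one-way delay, reflector to sender
    round_trip = (t4 - t1) - (t3 - t2)  # excludes reflector processing time
    return forward, reverse, round_trip

# Illustrative timestamps in milliseconds (not measured values).
forward, reverse, round_trip = two_way_test_delays(0.0, 12.0, 12.5, 22.5)
print(forward, reverse, round_trip)
```

Reporting the forward and reverse directions separately is what makes one-way metrics useful: an asymmetric path can look healthy on round-trip delay alone while one direction is degraded.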
MEF has been working with other standards bodies, as well as vendors and service providers, to develop a standard way of defining how QoE and KQIs should be measured against services and the end user experience, through methodologies using measurement protocols defined by IETF, Internet Society (ISOC), and the Telecommunication Standardization Sector of the International Telecommunication Union (ITU-T).
5.3 Network visibility granularity: why it’s important
Data granularity determines the degree to which you can see what’s really happening in the network at any given time or over a specific time period, versus merely an average consensus of events. From experience, you know that the more granular data is, the easier it is to gain insights you can act on immediately to drive efficient problem resolution and network performance optimization. After all, you can only correlate time sequence-based events if you know exactly when they happened.
Is the network truly problem-free? Does it require proactive intervention? These questions can only be answered using data that’s accurate and precise.
Figure 7: Passive monitoring, active monitoring, QoE, and QoS are interrelated aspects of performance management
Passive monitoring:
• Scope and origin of performance degradations (client, network, server, application)
• Application transaction visibility
• Micro-burst detection through utilization metering
• Two-way latency
• Retransmission rates
Active monitoring:
• Possible impact correlation of underlay service domain
• KPI input for composite QoE metrics
• Delay variation
• One-way latency
• Packet loss
• Hop-by-hop path analysis
Together these span the application, quality of experience, and quality of service.
Figure 8: The more granular data is, the easier it is to identify and resolve performance issues
5pps: 200ms visibility
10pps: 100ms visibility
20pps: 50ms visibility
30pps: 33ms visibility
40pps: 25ms visibility
50pps: 20ms visibility
100pps: 10ms visibility
1,000pps: 1ms visibility
10,000pps: 0.1ms visibility
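The relationship in Figure 8 is simply the interval between test packets: visibility in milliseconds equals 1,000 divided by the packets-per-second rate. The Python sketch below reproduces the figure’s values; the function name is an illustrative choice.

```python
def visibility_ms(packets_per_second):
    """Time between consecutive test packets, in milliseconds."""
    return 1000 / packets_per_second

for rate in (5, 10, 20, 30, 40, 50, 100, 1000, 10000):
    print(f"{rate:>6} pps: {visibility_ms(rate):g} ms visibility")
```

Any event shorter than the interval between two test packets can fall between samples and go unseen, which is why higher test rates expose problems that lower rates miss.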
6. Beyond Passive & Active Monitoring: Unified NPM and APM
Figure 9: Unified NAPM provides the tools to overcome real-world troubleshooting and operations challenges
Challenge
Reduce MTTR
Collaboratively resolve issues and optimize performance
Manage network performance
Manage infrastructure changes and service roll-outs
How to solve it
Visualize all network and application exchanges
Identify performance degradation scope and origin
Cover all data center environments
Provide fast, retrospective analysis
Deploy capture points anywhere in minutes
Show performance across network, system, and application tiers
Show all transactions for web applications, DNS, databases, and file transfers
Provide visibility into VoIP, Citrix XenApp, and Citrix XenDesktop
Provide alerts of any performance breaches and degradation
Understand network usage, leveraging NetFlow and packet capture
Map usage and performance to track leaks
Track network behavior changes and errors
Track before and after performance
Identify application and service dependencies
Use application profiling
To resolve performance problems in a way that’s fast and efficient for both you and your customers, the focus is typically on reducing mean time to repair (MTTR)/mean time to innocence (MTTI). This requires the use of both network performance management (NPM) and application performance management (APM). Working together, these systems give CSPs what’s needed to proactively improve QoE for end customers.
Complementary to traditional monitoring, NAPM fundamentally does two things: 1) measures the performance of network and applications based on traffic capture, and 2) provides a way to troubleshoot performance degradations.
CSPs are in the midst of transforming business processes, simplifying backend systems, and moving to cloud and IT network infrastructures. As this happens, new challenges arise around performance management, necessitating deployment of an end-to-end, multi-layer solution with real-time analytics and visualization to cover all aspects of application, service and network lifecycle management.
Challenge
North-South visibility
East-West visibility
Edge visibility
Locations
On-premises data center
Software-defined network (SDN) and virtual data center
Public cloud Infrastructure-as-a-Service (IaaS) coverage
Cloud services and Software-as-a-Service (SaaS)
Small form-factor pluggable (SFP) units
Active testing and distributed packet capture
xFlow collection
Virtual and micro-capture appliances
To actually address those visibility challenges, you really do need a comprehensive performance management system that includes many components: service analytics using automation, machine learning, and artificial intelligence (AI); passive and active performance monitoring and measurement; continuous, service-based testing; bandwidth utilization metering; and distributed packet capture. Collectively, these cover the entire end-to-end network from core to end user, the entire infrastructure stack from underlay network to applications (Layers 2-7), and the entire service lifecycle.
Figure 10: An end-to-end, multi-layer performance management solution addresses a variety of visibility challenges
In a comprehensive, end-to-end performance management solution (that plays nicely with your organization’s existing ecosystem), like the one shown in figure 11, the different aspects of active, passive, and big data analytics function as a whole to give you a detailed view of the network and service layers, correlated with business and service assurance requirements. In doing so, it uses ‘smart AI’ to provide data and metadata that relates performance with customer experience, and helps you improve overlay and underlay connectivity to deliver promised service levels to customers.
Figure 11: The components of comprehensive, end-to-end performance management for future networks
The figure shows the following building blocks:
• L2-3 PM (QoS): highly accurate and granular active-synthetic PM; patented one-way measurements; bandwidth utilization < 1 second; service activation testing; remote packet capture
• L2-7 NAPM (QoE): passive network/app PM; real user monitoring; end-to-end application delivery; TCP metrics; North-South/East-West; wide-angle view
• Big Data: ingest, store, analyze; all layer data; multiple sources; beyond reporting; analytics and machine learning; single consistent view; all layers
• Smart analytics: accurate, granular, high quality, precision data; service KPI/KQI; network KPI/KQI; visualization; closed loop automation; QoE through actionable insight
• End-to-end service assurance and end-to-end service orchestration, connected through open APIs to analytics/automation/ML/AI, end-to-end service management, customer/partner management, 3rd party partners, and SLA & QoS intent
• Network domains: 5G, CSG/cloudlet access, MEC, core connectivity, and core data center, hosting VNFs and applications such as NAT, DHCP, firewall, enterprise, and DDoS
7. Making Sense of Everything with Service Analytics
While many CSPs implement various performance measurement tools and methodologies (and these may now cover both NPM and APM), it’s common to miss the important point that not all critical metrics are clearly visible. Such visibility is dependent on a wide-angle view of the entire service infrastructure, with capabilities to drill down deeply into specific problem areas. But an analytics engine can only provide meaningful output if the data it ingests is granular and high-quality. When it is, the result is business, operations, and commercial insight that makes it possible to proactively improve customer experience.
What types of capabilities are we talking about here? And what business benefits do they bring?
Figure 12: Performance assurance solution features and their business impact
Performance management feature | Business impact
End-to-end monitoring per class of service (CoS) with one-way measurement, fault isolation, traffic generation, and loopbacks | Improve network visibility; optimize network planning; reduce mean time to repair (MTTR); increase customer satisfaction and QoE; reduce churn; predict how changes will affect the network
Throughput measurement per virtual LAN (VLAN) with sub-second granularity | Guide network expansion planning; optimize bandwidth and investment
Service activation testing (SAT) that supports RFC 2544 and other standards for circuit “birth certificate” | Validate SLA compliance before services launch; reduce truck rolls
Packet brokering that can capture flows in any location and filter on any protocol | Analyze any network flow to identify network issues; reduce cost of analytics
Passive monitoring of applications in real-time, including transaction-level applications and user experience | Increase business productivity by proactively monitoring application QoE and end user experience; optimize SaaS cloud application performance and reduce downtime
7.1 The importance of precise, granular performance data
Combining objective data with subjective business rules is the foundation of effective performance management. Accurately calculating perceived customer experience based on network and services layer KPIs is possible by correlating high quality, raw performance data with other contextual data from sources like vendor equipment at various locations in core and distribution networks.
This goes beyond using performance monitoring data on its own. Visualizing what’s going on with services and the network assures that correct actions are taken so customers are not impacted.
But again, as stressed earlier, the type of data and how it is collected are hugely important. For example, as seen in figure 13 below, sampling speed can make a huge difference in understanding if everything is okay…or not.
Here, when bandwidth utilization is sampled at 15-second frequency, the network appears “green.” At 1-second sampling frequency, it looks like there may be a small problem. Only by getting down to 0.1-second sampling frequency can you see there’s a big problem: even a packet loss rate of less than 1 percent can significantly affect the customer experience by decreasing throughput by up to 50 percent!
Note that this is not just a matter of feeding monitoring data into a system with a visual dashboard. To track metrics that have a critical impact on QoE and business goals, analytics dashboards must be designed to show the relationship between reported data values.
A well-designed analytics system with dashboards used appropriately leads to an understanding of what’s happening in the network and what needs to be done to keep it under control to comply with SLAs and maintain a superior level of QoE.
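The effect of sampling frequency can be illustrated with a small simulation. The sketch below is hypothetical: the traffic trace, the 1 Gbps link, and the function names are assumptions made for illustration, not data taken from this guide. It averages the same synthetic trace over 15-second, 5-second, 1-second, and 0.1-second windows and reports the peak utilization each view would show.

```python
# Synthetic example: how the averaging window hides short bursts.
# Link speed, burst size, and function names are illustrative assumptions.

LINK_MBPS = 1000.0   # assumed link capacity
STEP_S = 0.1         # resolution of the underlying trace, in seconds


def build_trace(duration_s=60.0):
    """Return utilization samples (Mbps) at 0.1 s resolution.

    Background load is 200 Mbps; a single 0.3 s burst saturates the link.
    """
    n = int(round(duration_s / STEP_S))
    trace = [200.0] * n
    burst_start = 300  # sample index, i.e. 30.0 s into the trace
    for i in range(burst_start, burst_start + 3):
        trace[i] = LINK_MBPS
    return trace


def peak_utilization(trace, window_s):
    """Average the trace over fixed windows; return the highest
    window average as a percentage of link capacity."""
    per_window = int(round(window_s / STEP_S))
    averages = []
    for start in range(0, len(trace), per_window):
        chunk = trace[start:start + per_window]
        averages.append(sum(chunk) / len(chunk))
    return 100.0 * max(averages) / LINK_MBPS


if __name__ == "__main__":
    trace = build_trace()
    for window in (15.0, 5.0, 1.0, 0.1):
        peak = peak_utilization(trace, window)
        print(f"window {window:>4} s -> peak {peak:.1f}%")
```

With these assumed values, the 15-second view peaks at roughly 22 percent utilization, while the 0.1-second view shows the link fully saturated during the burst. The traffic is identical in both cases; only the averaging window differs.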
[Figure: “Increasing sampling speed and adding statistical perspectives changes everything.” Effects of sampling frequency: measure every 15 sec, no problem; measure every 5 sec, no problem; measure every 1 sec, small problem; measure every 0.1 sec, big problem.]
Figure 13: Without very granular bandwidth utilization sampling frequency, the origin of significant performance problems may be invisible.
8. Putting It All Together: Case Studies
8.1 CellMobile: network expansion and user experience
CellMobile is a mobile communications service provider, ranked as the number two player in its region, battling fiercely to maintain its position in the market. It offers prepaid and postpaid mobile voice services, mobile broadband, enterprise solutions, bulk wholesale services, digital services, and machine-to-machine solutions. Its network assets include 2G, 3G, and 4G LTE infrastructure.
CellMobile has roughly 13 million prepaid and postpaid mobile service customers. It has 7,000 4G LTE cell sites deployed. CellMobile’s ongoing investments focus on network coverage, capacity, and performance with the goal of positioning itself as the country’s best mobile communications service provider.
Goals and challenges
LTE network expansion is CellMobile’s main goal, and it is making tangible progress. Base stations have been standardized to only two vendors, and new base stations are being added as demand grows.
However, some sectors of the network are experiencing low quality and low throughput. Users are experiencing noisy calls, dropped calls, and poor internet service. CellMobile thinks these issues are caused by data requirements of new users squeezing available bandwidth, and troubleshooting is focused on capacity.
CellMobile’s existing network performance monitoring solution is based on traditional SNMP metrics collected from eNodeBs, routers, and switches in the network. Event reporting occurs at 15-minute intervals. Bandwidth utilization is also reported, using data from network elements. All reports are provided by RAN vendors.
To get a more precise, accurate view of what’s going on and focus troubleshooting efforts efficiently, CellMobile has put a network performance optimization initiative into place. CellMobile plans to deploy a standard solution with performance measurement points available on all existing cell sites. As new sites are added, they plan to use performance data from vendors. Eventually, all performance data from this variety of sources will be pulled into a centralized, standards-based AI engine to make sense of it all.
Active measurement requirements
CellMobile has developed concrete requirements for network performance monitoring and transport network capacity measurements to align performance management with quality assurance goals. Whatever solution it deploys must include tools for end-to-end transport network performance monitoring and troubleshooting, and end-to-end network utilization monitoring. These requirements focus on active measurements (continuous sampling):
• Ubiquitous coverage
• TWAMP for active measurements and interoperability with third-party reflectors
• Y.1731 testing available for Ethernet layer monitoring
• Virtual (VMware and KVM) solution scalable to thousands of monitoring sessions
• Deployable in a multi-vendor environment
• Smart SFPs capable of reflecting stateful and stateless TWAMP sessions
• Ability to start and stop individual monitoring sessions
• TWAMP initiated either from centralized servers or smart SFPs
• Monitoring sessions capable of up to 10,000 packets per second
• Ability to set packet size for each session, per class of service
• 1 minute or smaller monitoring window
• One-way metrics without requiring endpoints to be synchronized
• Measure bandwidth utilization in 10-second intervals or more frequently
• Dashboard-type management system to visualize KPIs and bandwidth utilization metrics, with data exportable to XML or CSV files
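As a rough illustration of what such active measurements yield, the sketch below derives a few KPIs from TWAMP-style timestamps and exports them as CSV. It is not a TWAMP implementation; the `Probe` record, its field names, and the sample values are assumptions. Each answered probe carries four timestamps: sender transmit (t1), reflector receive (t2), reflector transmit (t3), and sender receive (t4). Round-trip delay subtracts the time spent inside the reflector. Forward delay variation is measured relative to the smallest observed value, so a clock offset between unsynchronized endpoints cancels out, assuming the offset stays constant over the measurement window. This is one possible reading of the "one-way metrics without synchronization" requirement, not a description of any particular product.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Probe:
    """One test packet (hypothetical record layout). Times are in seconds."""
    seq: int
    t1: float             # sender transmit time (sender clock)
    t2: Optional[float]   # reflector receive time (reflector clock), None if lost
    t3: Optional[float]   # reflector transmit time (reflector clock)
    t4: Optional[float]   # sender receive time (sender clock)


def summarize(probes: List[Probe]) -> Dict[str, float]:
    """Compute loss, average round-trip delay, and forward delay variation.

    Assumes at least one probe was answered.
    """
    answered = [p for p in probes if p.t4 is not None]
    loss_pct = 100.0 * (len(probes) - len(answered)) / len(probes)
    # Round-trip delay excludes processing time inside the reflector.
    rtt = [(p.t4 - p.t1) - (p.t3 - p.t2) for p in answered]
    # t2 - t1 mixes two clocks, so it contains an unknown offset;
    # subtracting the minimum removes any constant offset.
    forward = [p.t2 - p.t1 for p in answered]
    variation = [d - min(forward) for d in forward]
    return {
        "loss_pct": loss_pct,
        "rtt_avg_ms": 1000.0 * sum(rtt) / len(rtt),
        "fwd_delay_variation_max_ms": 1000.0 * max(variation),
    }


def to_csv(summary: Dict[str, float]) -> str:
    """Serialize one summary row as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(summary))
    writer.writeheader()
    writer.writerow(summary)
    return buf.getvalue()


if __name__ == "__main__":
    # Made-up samples; the reflector clock runs 5 s ahead of the sender clock.
    probes = [
        Probe(0, 0.000, 5.010, 5.011, 0.021),
        Probe(1, 1.000, 6.012, 6.013, 1.023),
        Probe(2, 2.000, None, None, None),   # lost packet
        Probe(3, 3.000, 8.010, 8.011, 3.021),
    ]
    print(to_csv(summarize(probes)))
```

A real deployment would compute such summaries per session and per class of service, over the monitoring window the requirements call for.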
Performance management requirements
Further, CellMobile defines specific requirements for features and capabilities of a performance management solution to support its business goals, focused on converged NAPM and analytics.
• Cloud native solution with real-time, 360-degree visibility into both network and application performance. This is needed to drive shorter resolution times, optimize productivity, and ensure 24x7x365 availability of business-critical applications.
• Proactive, actionable insights into root causes of degradations and long-term performance trends, through forensic and historic analysis, and transactional analysis of file storage and transfer flows.
• Detailed performance reports and dashboards to help data center staff gain a faster understanding of performance issues and their root causes.
• Unified network and performance monitoring to eliminate inefficiencies resulting from varied one-off solutions for separate functions like device monitoring, WAN performance, and application visibility.
• Cross-functional capabilities useful to both Level 2 and Level 3 teams to gain visibility into performance issues and how to resolve them.
Figure 14: CellMobile’s performance management solution requirements, focused on full-stack, converged NAPM and analytics
[Figure labels: automated discovery; dependency mapping; topology visualization; correlative intelligence; root cause diagnosis; auto-baselining; historical reports; converged NAPM and analytics; digital experience monitoring; application discovery, tracing and diagnostics; application analytics; end-user experience; full stack application performance monitoring (APM); business transactions; application middleware; database; operating system; server (physical/virtual); network, storage, etc.; on-premises; cloud.]
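Auto-baselining, one of the capabilities named in Figure 14, is commonly understood as learning a KPI's normal range from its own history and flagging departures from it. The sketch below shows one simple way to do that, assuming a rolling mean and standard deviation. The window size, the three-sigma threshold, the floor on the spread, and the function name are illustrative choices, not part of CellMobile's stated requirements.

```python
from statistics import mean, pstdev
from typing import List


def flag_anomalies(values: List[float], window: int = 5, sigmas: float = 3.0) -> List[int]:
    """Return the indexes of values that depart from a rolling baseline.

    The baseline for each point is the mean of the previous `window` values.
    A point is flagged when it differs from that mean by more than `sigmas`
    standard deviations of the same history.
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        baseline = mean(history)
        # Floor the spread (assumed 0.5 KPI units) so that a perfectly flat
        # history does not flag insignificant changes.
        spread = max(pstdev(history), 0.5)
        if abs(values[i] - baseline) > sigmas * spread:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    latency_ms = [20, 21, 19, 20, 22, 21, 20, 48, 21, 20]  # made-up samples
    print(flag_anomalies(latency_ms))
```

In this made-up series, only the 48 ms sample is flagged. Because the outlier then enters the history, the baseline is temporarily widened for the points that follow, which is one reason production systems often use more robust statistics.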
8.2 CloudMobile: cloud services and QoE
CloudMobile is a communications service provider with significant revenues generated through business services. It has three large data centers in South Africa, and wants to grow its cloud services presence but is concerned that doing so might adversely affect promised quality of experience (QoE) and SLAs in the region, given the new network complexities involved.
Goals and challenges
Because CloudMobile’s main challenge is managing the complexity of its growing data center business, it is seeking a performance management solution that’s agile and provides complete (Layer 2-7) visibility into infrastructure, service/application performance, and related environments. CloudMobile needs these capabilities to achieve its new revenue stream goals without negatively affecting QoE for existing customers.
CloudMobile realizes that, by changing and growing its cloud-based architecture, it must be able to see the entire “service picture” through clear insight into the overall logical and physical infrastructure. Different applications, after all, have different dependencies for each customer. The only way to effectively manage this complexity is with a converged NAPM and analytics model for its performance management.
8.3 Value achieved through correlated data and converged NAPM
In both cases, implementing a performance management system that correlates data, converges NAPM, and leverages analytics gives these CSPs tools to solve a variety of performance issues and achieve their business goals.
Each of the CSPs highlighted here deployed this performance management model, in order to:
• Notice the timing of bandwidth spikes and correlate that information with a marketing campaign to change customer behavior and improve user experience.
• Observe the location and timing of micro-outages and correlate that information with data from other sources to identify overutilized network segments.
• Identify abnormal ring switchovers that were impacting QoE; fixing these through a software release significantly improved customer experience.
• Perform effective capacity planning by knowing exactly how much bandwidth is utilized, when, and where.
• Proactively monitor network behaviors likely to cause issues and mitigate those before they affect QoE.
• Reduce truck rolls and problem resolution costs by capturing network flows at specific locations, brokering them using smart small form-factor pluggable (SFP) devices, and analyzing the data centrally for actionable insight.
• Achieve service- and application-level visibility through proactive monitoring of the network, applications, and web services.
• Accelerate resolution time, thereby benefiting from significant ROI savings and higher customer satisfaction.
• Identify and pinpoint the root cause of performance issues, quickly and easily.
• Leverage full North-South and East-West traffic monitoring for full visibility into the data center and virtual domains.
• Offer service assurance to cloud-based business customers, through ability to troubleshoot bottlenecks and degradation and identify their origin/location.
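The first outcome in this list, matching bandwidth spikes to a marketing campaign, comes down to joining two time-stamped data sets. A minimal sketch follows, using made-up sample data and hypothetical names (`find_spikes`, `correlate`, the 80 percent threshold). A production system would pull both inputs from its analytics store rather than from in-memory lists.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Sample = Tuple[datetime, float]      # (timestamp, utilization in percent)
Window = Tuple[datetime, datetime]   # (start, end) of a campaign


def find_spikes(samples: List[Sample], threshold_pct: float = 80.0) -> List[datetime]:
    """Return the timestamps whose utilization meets or exceeds the threshold."""
    return [ts for ts, util in samples if util >= threshold_pct]


def correlate(spikes: List[datetime], campaigns: Dict[str, Window]) -> Dict[str, List[datetime]]:
    """Group spike timestamps by the campaign window they fall in.

    Spikes outside every window are collected under "unexplained".
    """
    result: Dict[str, List[datetime]] = {name: [] for name in campaigns}
    result["unexplained"] = []
    for ts in spikes:
        for name, (start, end) in campaigns.items():
            if start <= ts <= end:
                result[name].append(ts)
                break
        else:
            # No campaign window contained this spike.
            result["unexplained"].append(ts)
    return result


if __name__ == "__main__":
    t0 = datetime(2019, 1, 1, 18, 0)
    utilization = [40.0, 85.0, 90.0, 50.0, 95.0]   # one sample every 15 minutes
    samples = [(t0 + timedelta(minutes=15 * i), u) for i, u in enumerate(utilization)]
    campaigns = {"evening_promo": (t0, t0 + timedelta(minutes=45))}
    for name, hits in correlate(find_spikes(samples), campaigns).items():
        print(name, [ts.strftime("%H:%M") for ts in hits])
```

Spikes that land in the "unexplained" group are the ones worth correlating with other sources, such as micro-outage records or utilization data from adjacent network segments.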
9. Conclusion: The Future of Performance Management
Communications networks and the services running on them are getting more complicated to design, deploy, and manage. There are more endpoints, types of connectivity, machine-to-machine interactions, and capacity requirements. These drive CSPs to invest in new network and IT systems as a way to manage dynamic, on-demand services. Making the right investments is crucial to ensure these new tools and systems are both cost-effective and efficient, with capabilities that can support fully automating service quality and end-to-end customer experience management. (That automation will not happen immediately, but it’s coming.)
Adding fuel to the fire is the introduction of converged mobile and fixed architecture, along with the move toward 5G with its complex set of factors (including virtualization, orchestration, multi-access edge computing, and network slicing) that demands end-to-end management of both physical and virtual infrastructure.
9.1 A service-centric future
Performance management is shifting to a service-centric architecture model, where the availability of applications and services and the reliability of the network underlay are key. Service-based performance management is needed for QoE assurance and fulfilling service level agreements (SLAs). There are several crucial steps to this:
• Collect the right performance data, with the right granularity and precision
• Use passive and active monitoring
• Correlate data from multiple sources
• Perform analytics on data
• Feed insight to orchestrators for closed-loop automation
• Use SDN capabilities to automate QoE
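Taken together, these steps form a loop: measure, compare against the SLA, and hand a decision to an orchestrator. The sketch below shows only the decision step, under stated assumptions. The KPI names, the SLA thresholds, the policy, and the action strings are invented for illustration, and no real orchestrator or SDN controller API is called.

```python
from typing import Dict, List

# Assumed SLA: an upper bound per KPI. The values are placeholders.
SLA_LIMITS: Dict[str, float] = {
    "delay_ms": 50.0,
    "jitter_ms": 10.0,
    "loss_pct": 0.1,
}


def sla_violations(kpis: Dict[str, float], limits: Dict[str, float] = SLA_LIMITS) -> List[str]:
    """Return, in sorted order, the KPI names whose measured value exceeds its limit.

    KPIs missing from the measurement are treated as compliant.
    """
    return sorted(name for name, limit in limits.items() if kpis.get(name, 0.0) > limit)


def decide_action(violations: List[str]) -> str:
    """Map violations to a hypothetical request for an orchestrator."""
    if not violations:
        return "no_action"
    if "loss_pct" in violations:
        # Assumed policy: packet loss triggers a path change.
        return "reroute_service"
    # Assumed policy: delay or jitter alone triggers a class-of-service change.
    return "raise_priority"


if __name__ == "__main__":
    measured = {"delay_ms": 80.0, "jitter_ms": 2.0, "loss_pct": 0.5}  # made-up values
    violations = sla_violations(measured)
    print(violations, "->", decide_action(violations))
```

Keeping the decision logic separate from measurement and from the orchestrator interface makes the policy easy to review and change, which matters while automation is still being introduced gradually.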
Traditional network performance management approaches, embedded in the existing network architecture, are difficult or impossible to scale. To overcome that limitation, next-generation performance management must provide:
• Flexibility to support rapid, non-disruptive augmentation or replacement of MPLS networks with any form of broadband connectivity.
• Unified visibility and control across the entire application network and service chain, to centrally apply business intent policies in line with QoE requirements.
• Application performance management to significantly improve end-user satisfaction.
• Bandwidth management to reduce the cost of connectivity, equipment, and network administration.
• Real-time insight into every platform, link, application, and user across complex, multi-cloud deployments.
• Scalability for easy, dynamic setup.
9.2 Business model transformation
The ability to keep customers happy, increasingly the only competitive differentiator that really matters, depends on how well you understand their overall service experience. This means 1) using active and passive monitoring to measure Layer 2-7 performance and 2) analyzing the results and taking action, whether manual or automated, to sustain customer loyalty. New and sustained revenue streams depend on it.
Over 70%* of CSPs say that mobile and edge cloud assets give them a performance advantage over public cloud providers, especially in the enterprise and business services market. The challenge lies in selling 5G, IoT, SD-WAN, and related services with performance guarantees, and actually delivering on that promise. It’s simply not possible to do that using traditional performance management tools and methodologies.
Instead, the way forward is with next-generation performance management that can handle virtualization, automation, network slicing, and other complex aspects of the service-oriented architecture that defines the future. CSPs say that the most important aspects* of this are:
• Unified performance management visibility across all network and application layers
• Real-time analytics for root-cause identification, closed-loop QoE automation, and network edge management
• Continuous monitoring of circuits for full visibility into QoE-impacting performance issues
*Source: Heavy Reading 5G service provider survey Q3 2018
© 2019 Accedian Networks Inc. All rights reserved. Accedian, the Accedian logo and Skylight are trademarks or registered trademarks of Accedian Networks Inc. To view a list of Accedian trademarks visit: accedian.com/legal/trademarks
About Accedian
Accedian is the leader in performance analytics and end user experience solutions, dedicated to providing our customers with the ability to assure their digital infrastructure, while helping them to unlock the full productivity of their users.
We are committed to empowering our customers with the ability to see far and wide across their IT and network infrastructure and a microscopic ability to dive deep and understand the experience of every user, helping them to delight their own customers each and every time.
Accedian has been delivering solutions to high profile customers globally for over 15 years.
Learn more at Accedian.com
2351 Blvd. Alfred Nobel, N-410 Saint-Laurent, QC H4S 2A9 1 866-685-8181 accedian.com