expectations to machine learning for network … l various network logs are gathered by nmss and...
TRANSCRIPT
Copyright©2016NTTcorp.AllRightsReserved.
ExpectationstoMachineLearningforNetworkManagement
NML-RGmeeting,IRTF/IETF97(Soeul)November,2016
KoheiShiomoto(NTT)
1Copyright©2016NTTcorp.AllRightsReserved.
Outline
n Networkmanagementissues(3min)n Data-drivenapproach(3min)n Twoexamplesofourpractice(9min)
1. SYSLOGanalytics2. TroubleTicketanalytics
ExplainonlyourgoalandkeyideaSkipthroughseveralslidesonmath
n Discussion
2Copyright©2016NTTcorp.AllRightsReserved.
ChallengesinICTManagement
n Diversifiedapplications&services/ModernWebtrafficn Streaming,browsing,SNS,e-commerce,e-Health,…
n HighexpectationsforAvailability&Qualityn 99.9999%availability,highresolution,lownoise,
stallingfree,quickresponse,…n ComplexICTsystemstructure
n Devices,software,protocols,…n Interactionofplayers
n Customer,ISPs,CDN,…n Communicationsgetencrypted.https://…n ...
3Copyright©2016NTTcorp.AllRightsReserved.
Disaggregation-whywepursue?
n Verticallyintegratedsystemn Networkdevices(router,switch,middle-box,
etc.)havebeenverticallyintegratedsystem.n Thoseverticallyintegratedsystemsconsistof
manyhardwareandsoftwarecomponentsprovidedbydifferentcomponentvendors.
n Theendoflifeofsomecomponentscouldriskthelifeofentiresystem.
n Addingnewfunctionalitiesisundercontrolofsystemvendor.
n Disaggregationcouldsolvetheissues.
4Copyright©2016NTTcorp.AllRightsReserved.
SDN&NFV
Software-drivenControlSoftware-DefinedNetworking(SDN)
• Routing&Signaling,trafficengineering/steering,• CommodityL2/L3Switch-Routerhardware
NetworkFunctionVirtualization(NFV)• Middlebox (NAT,FW,IDS/IPS,VPN,CDN,…)• CommodityX86machinehardware
DKMMx86NIC DK
MMx86NIC
TCAMNPNICNIC
HA middle HA middle HA middle
DKMMx86NIC DK
MMx86NIC
TCAMNPNICNIC
HA middle HA middle HA middle
5Copyright©2016NTTcorp.AllRightsReserved.
Costofdisaggregation-Increaseofcomplexity
n Disaggregationbringsaboutanumberofbenefits:costreduction,elasticcapacity,quickdeploymentofnewfunctions.
n Butthosebenefitscouldnotbeobtainedwithoutsacrifice.
n Networkdevice,whichisdisaggregatedintocomponents,introducesfurthercomplexitycausedbyinteractionsamongcomponents.
n Eachcomponentisfrequentlyandquicklyreplacedwithneweroneasnewfeaturesaredevelopedandreleased.
6Copyright©2016NTTcorp.AllRightsReserved.
Howtodealwithcomplexsystem?
n Powerfulnetworkmanagementparadigmisrequiredtodealwithcomplexity.
n Wecouldnotrelyontraditionalmechanism-drivenapproach,whichpiecestogetheraccuratemechanismsofindividualcomponents.
n Wehavetorelyonaholisticdata-drivenapproachtomodelanentiresystembyanalyzingrelationshipbetweeninputsandoutputs.n ConventionalMechanism-drivenapproach
nGivenunderstandingprecisemechanismsofcomponents,buildupamodelofentiresystem.
n Towards Data-drivenapproachnGivendata,infertherelationshipbetweeninputsandoutputs.nMachinelearningisakey.
7Copyright©2016NTTcorp.AllRightsReserved.
Varietiesofdatacanbeused.
n Trafficloadn Performancen Syslogn Troubleticketsn SNSmessages(e.g.Twitter)n …
Numerical,text,…
Copyright©2016NTTcorp.AllRightsReserved.
Spatio-TemporalFactorizationofLogDataforUnderstandingNetworkEvents
Tatsuaki Kimura,Keisuke Ishibashi,Tatsuya Mori,HiroshiSawada,Tsuyoshi Toyono,Ken Nishimatsu,AkioWatanabe,Akihiro Shimoda,Kohei Shiomoto
SYSLOGAnalytics
Background
l VariousnetworklogsaregatheredbyNMSsandmonitored– Switch,router,RADIUSsever,…– syslog,serverlog,alarm,SNMPtrap,…– logscontainusefulinformationforNWtroubleshooting
ISPIPTV VoIP
Syslog Serverlog
Example[router,switchsyslog](Cisco):%TRACKING-5-STATE:1interfaceFa0/0line-protocolUp->Down%LINK-3-UPDOWN:InterfaceFastEthernet 0/9,changedstatetodown%SYS-5-CONFIGI:Configuredfromconsolebyvty2(10.11.11.11)
[NMSalarm]:[エラータイプ 100]リンクダウンが起こりました.ホスト名:hostA IPアドレス:10.11.1.1プロトコル3[エラータイプ 100]パケットロスを検出しました.type:xxxx,host:hostB,IPアドレス:10.11.11.11<優先度:低>オペレータがログインしました.fromyyyy,host:hostC,2011/11/11
Background
l VariousnetworklogsaregatheredbyNMSsandmonitored– Switch,router,RADIUSsever,…– syslog,serverlog,alarm,SNMPtrap,…– logscontainusefulinformationforNWtroubleshooting
l Diverseandmassiveamountsoflogs– multiplevenders,multipleservices,complexnetworkevents– over1,000,000ofmessages/day
⇒ Analyzinglogshasbecomeseriousproblem
ISPIPTV VoIP
Syslog Serverlog
11Copyright©2016NTTcorp.AllRightsReserved.
Ourresearchgoal
Miningnetwork(NW)eventinformation fromlargeanddiverseNWlogdataNetworkevents=spatialandtemporalpatternsoflogmessagesmessagesassociatedwithinitializationofvariousprocesscausedbyrouterrebooteventmultiplelayerflapscausedbyL-1flap(L-2,OSPFre-convergence,BGPflap,...)virtualpathdis-connectionrelatedtophysicalmachinefailure
linkdown
L-1flaprooterreboot
HardwareFailure
MemoryError
IFflap
2012-1-1T00:00:00 %TRACKING-5-STATE: 1 interface Fa0/0 line-protocol Up->Down 2012-1-1T00:00:00 %LINK-3-UPDOWN: Interface FastEthernet 0/9, changed state to down 2012-1-1T00:00:00 %SYS-5-CONFIG I: Configured from console by vty2 (10.11.11.11)2012-1-1T01:11:00 msg [100]: STP: VLAN 1 Port 38 STP State -> DISABLED (PortDown)2012-1-1T01:11:00 msg [101]: System: Interface ethernet 38, state down2012-1-1T03:00:00 msg [200] : STP: VLAN 100 Port 22 STP State -> DISABLED (PortDown)2012-1-1T03:00:00 msg [201] : System: Interface ethernet 22, state down2012-1-1T00:00:00 %SYS-5-CONFIG I: Configured from console by vty2 (10.11.11.11)2012-1-1T10:30:00 System: Interface ethernet 1, state down2012-1-1T10:30:00 System: Interface ethernet 1, state up2012-1-1T10:30:00 System: Interface ethernet 2, state down2012-1-1T12:00:00 init: alarm-control (PID 111) terminate signal sent2012-1-1T12:00:00 init: bslockd (PID 124 ) terminate signal sent2012-1-1T12:00:00 init: ce-l2tp-service (PID 123 ) terminate signal sent2012-1-1T12:00:00 init: chassis-control (PID 1111 ) terminate signal sent2012-1-1T12:00:00 init: disk-monitoring (PID 7082 ) terminate signal sent2012-1-1T00:00:00 %SYS-5-CONFIG I: Configured from console by vty2 (10.11.11.11)2012-1-1T15:45:10 msg [200] : STP: VLAN 100 Port 22 STP State -> DISABLED (PortDown)2012-1-1T15:45:10 msg [201] : System: Interface ethernet 22, state down2012-1-1T16:12:40 System: Interface ethernet 1, state down2012-1-1T16:12:40 System: Interface ethernet 1, state up2012-1-1T16:12:40 System: Interface ethernet 2, state down2012-1-1T20:30:00 init: alarm-control (PID 111) terminate signal sent2012-1-1T20:30:00 init: bslockd (PID 124 ) terminate signal sent2012-1-1T20:30:00 init: ce-l2tp-service (PID 123 ) terminate signal sent2012-1-1T20:30:00 init: chassis-control (PID 1111 ) terminate signal sent2012-1-1T20:30:00 init: class-of-service (PID 11112) terminate signal sent
12Copyright©2016NTTcorp.AllRightsReserved.
WhyminingNWevents?
Automatically constructingdomainknowledgeofNWoperators(withoutrequiringskillsandexperience)
Manypossibleapplications:ObtainingnewalarmrulesQuickunderstandingofrootcauseofproblemsMayhelpindetectionof“silentfailure”
linkdown
L-1flaprooterreboot
HardwareFailure
MemoryError
IFflap
13Copyright©2016NTTcorp.AllRightsReserved.
Keyidea(observation)
Logdatacanbeconsideredasrank-3tensortypesoflogmessages(templates),hostname,time
Observedlogdata= ‘superposition’ofNWevents
ExtractionofNWeventsproblem⇒ TensorFactorization problem
linkflapevent
linkflapevent
cronjob
cronjobA
B
C
msg1
msg2
msg1
msg2
msg3
msg4
msg5
msg6
msg4
msg5
msg6
msg1 msg2 msg3 msg4 msg5 msg6
A ◯ ◯ △ △ △
B ◯ ◯ ◯
C △ △ △
H: hosts
J: time-windowsI: templates
14Copyright©2016NTTcorp.AllRightsReserved.
Challenges&SummaryofContributions[1]Unstructuredandmassivelogmessages
Morethan1,000,000lines/dayFormatsoflogmessagesdependonvendororservice
⇒Wepresent StatisticalTemplateExtraction(STE)・ automaticallyextractprimarytemplates fromlargelogdata
[2]Logdataare verycomplexUnderlyingnetworkeventsoccurallaroundnetworkNetworkeventsspanacrossseverallocations,networklayers,andservices
⇒Wepresent LogTensorFactorization(LTF)• extractspatialandtemporalpatterns oflogdata• basedonNonngegativeTensorFactorization(NTF)approach
15Copyright©2016NTTcorp.AllRightsReserved.
STEMotivation
Rawlogmessagescannot beuseddirectlylogmessagescontainvariousparameters(IPaddress,hostname,PID,…,etc.)Tocorrelatelogmessages,weneedtoknowlogtemplates =messageswithoutparameters
%TRACKING-5-STATE: 1 interface Fa0/0 line-protocol Up->Down %LINK-3-UPDOWN: Interface FastEthernet 0/9, changed state to down %SYS-5-CONFIG I: Configured from console by vty2 (10.11.11.11)
%TRACKING-5-STATE: * interface * line-protocol Up->Down %LINK-3-UPDOWN: Interface FastEthernet *, changed state to down %SYS-5-CONFIG I: Configured from console by * (*)
16Copyright©2016NTTcorp.AllRightsReserved.
STEMethodology
1.Scoringfrequencyofwords amongsimilarmessagesparameterwords appearinfrequentlycomparedtotemplatewords ineachposition
2.Clusteringscore,anddetermineparameterwords foreachmessagethresholdsforscoreofparameter wordsdifferdependingonlogmessagesdensity-based clusteringalgorithm(DBSCAN)
raw log messages: <189> security telnet connection 15720 with 10.7.11.11 broken<189> security telnet connection 18340 with 10.8.9.123 broken
①Scoring: If word appears in P-th position in log that contains L words:
Score(word,P,L) = Pr(word | P,L)1 2 3 4 5 6 7
<189> security telnet 15720 with 10.7.11.11 broken
<189> security telnet 18340 with 10.8.9.123 broken
log template:<189> security telnet connection * with * broken
Remove
②Clustering scores (DBSCAN): Distance between each cluster is > δ
0 1
17Copyright©2016NTTcorp.AllRightsReserved.
Spatial&temporalpatternswewanttoextract
Definition 1 [Template Group]Ø groupoftemplatesthattendtoco-occur:l = (i1, i2, ...)
ü representseventatindividualhostü e.g.linkflap:linkdown +linkup
reboot:processinitializationmessages
Definition 2 [Network Event]Ø setoftuples<host,templategroups>thattendtoco-occur
e = {(h1, l1), (h2, l2), ...} (h1, h2∈H, l1, l2∈L)ü spatialextension oftemplategroupsü e.g.linkdowneventamongneighboringhosts
host h1
template groups
host h2
host h3
tem
plat
eA
tem
plat
eB
tem
plat
eC
tem
plat
eD
l1 l2 l3 l4
Hierarchicalcorrelationsareobservedinlogdata
18Copyright©2016NTTcorp.AllRightsReserved.
LTF
LogTensor X (I × H × J)xihj: # of occurrences of template i at host h, time window j
• templates:1,..., i,..., I• hosts: 1,..., h,..., H• timewindows: 1,..., j,..., J (logdataare partitioned)
• LTFfactorizesX into1. Log template matrix V = [vl] (I × L)
2. Network event tensor Z = [zlk] (L × K × H)
3. Weight matrix W = [wk] (K × J)* K: # of network events. L: # of template groups (given)
H: hosts
J: time-windowsI: templates
X Z WV≅
I
L: templategroups
JKH
LK: network events
19Copyright©2016NTTcorp.AllRightsReserved.
IntuitiveinterpretationsofLTF
template group matrix Vl-th row of matrix V represents l-th template group ⇒ corresponds to Template Group
network event tensor Zk-th slice of Z represents which template groups occur at which hosts⇒corresponds to Network Event
VI: templates
L: template groups
tem
plat
eA
tem
plat
eB
tem
plat
eC
tem
plat
eD
1.0
0.0
Z
L: templategroups
K: network events
host A
L: template groups
H: hostshost Bhost C
20Copyright©2016NTTcorp.AllRightsReserved.
LTFProblemFormulation
Optimizationproblemwithnonnegativeconstraintsoneachtensor
Objectivefunction=KL-divergencebetweenX andV, Z, W
Algorithm(multiplicativeupdaterules;typeofEMalgorithm)
simpleanditerativeform
Formulation&Algorithm
LTFProblem
21Copyright©2016NTTcorp.AllRightsReserved.
DataSet5millionlinesoflogs(capturedinsmallnetwork,5months)
Evaluationmetrics=effectivenesscalculate#ofextractedtemplates5,000,000logs⇒ 2,000~3,000templates(lessthan1%)
Falsepositivewords⇒ 0.07~0.08%of1,000,000words
STE Evaluation
22Copyright©2016NTTcorp.AllRightsReserved.
DataSetover600,000linesof1-daylogdatadimensionsareroughly 100 (I) × 150 (H) × 150 (J)
EvaluationMetricsexpressivepower:HowwelldoesLTFfittorealdata?⇒ usewellknownmeasure‘averagetestlog-likelihood’
• predictionpowerofrandomlymaskedelements• highervaluemeansbetterfittodata
LTF Evaluation Metrics
23Copyright©2016NTTcorp.AllRightsReserved.
LTF Evaluation Results
averagetestlog-likelihoodwithdifferentKandLweusednormalNTFmodelasbaseline*L:#oftemplategroups,K:#ofnetworkevents
LTFfitsbettertorealdatathancurrentNTF
24Copyright©2016NTTcorp.AllRightsReserved.
Casestudyresults(ExamplesofoutputofLTF)1/2
Neighboringlinkflapevent
CoreRouter A
EdgeRouter B(neighbor)
Interfaceup/down
Interfaceup/downlineprotocolup/down
zikh vil
25Copyright©2016NTTcorp.AllRightsReserved.
Tunnelingpathdisconnectionevent
CoreRouter
EdgeRouter AModulereloadeventvirtualpathdisconnection
EdgeRouter A
Tunnelingpathdisconnection
...
Casestudyresults(ExamplesofoutputofLTF)2/2
Morethan50connections
26Copyright©2016NTTcorp.AllRightsReserved.
SummaryofSYLOGAnalytics
PresentedSTEExtractingprimarytemplatesfromnoisylogmessagesCompressing5,000,000linesto3000templates
Presented LTFModelinggenerationoflogsasrank-3tensorandfactorizingintotemplategroupsandnetworkeventsMuchbetterthancurrentNTFCancorrectlyextracthiddencomplexnetworkevents
Copyright©2016NTTcorp.AllRightsReserved.
WorkflowExtractionforServiceOperationusingMultipleUnstructuredTroubleTickets
AkioWatanabe,KeisukeIshibashi,TsuyoshiToyono,Tatsuaki Kimura,Keishiro Watanabe,YoichiMatsuo,Kohei ShiomotoNTTNetworkTechnologyLaboratoriesApril25,NOMS2016
TroubleTicketAnalytics
Background
• Wewouldliketospecifytroubleshootingprocess– MTTRstronglydependsonthetimeuntildecidingtheresolution
• Whyprocessisnotdefined?– variousprocessforthousandsoffailures– requiringtacitknowledge– includingdomainrule
Detection Isolation Resolutiondecision Resolution
Troubleshootingprocedureforsystemmanagement
Research targets: Specify Troubleshooting
What is the cause?
What should we do?
Present understanding of process
• Operators search the resolution from trouble tickets– amount of valuable knowledge about failures
• Much information are written by natural language
Ourgoal
• Automatically specifying troubleshooting process from trouble tickets
• Generating Workflow; graphical flowchart of actions for each failure
summarizingmultipletroubletickets
clarifyingdeciderules
Two challenges
• Finding action sequences for each trouble ticket
• Finding isolating action; that have multiple transitions to resolutions
Seeingdetectederrors
CheckingNetworkConnectivity
Checkingsystemlog
Senderrorlogstovendor
CloseTicket
Replacingmodule
Seeingdetectederrors
CheckingNetworkConnectivity
Checkingsystemlog
Senderrorlogstovendor
CloseTicket
ReplacingmoduleCloseTicket
ReplacingCable
Seeingdetectederrors
CheckingNetworkConnectivity
Checkingsystemlog
Senderrorlogstovendor
CloseTicket
ReplacingmoduleCloseTicket
ReplacingCable
Seeingdetectederrors
Checking Network Connectivity
Approach overview
• 3 steps for extract workflow from multiple trouble tickets– 1. extract only action sentences from trouble tickets
– 2. align the same messages in different tickets
– 3. find operational change as a branch
wedetectederror
verifiedconnectivity
systemdetectederror
connectivityisOK
1.labeling2.alignment
werebootedserverwereplacedmodule
3.findingisolatingaction
1. Action Sentence Labeling
• Extracting sentences about (operator/system’s) actions– Append sentences to labels indicating if is written about actions or
not
• Supervised learning from labeled texts
– Naive Bayes are used as classifier
– Character-2gram is used as features
10:00Managementsystemdetectedthefollowingerror
10:00:00Routerxxxwentdown
ConnectionOkay
Wecheckedtherouterlog.
#loginr001#showsystem
11:30Rebootmessagewasdetected.
12:00WesentsystemlogstoAInc.2/210:00receivedreportMailfrom...
Action
ActionAction
ActionAction
Other
OtherOther
Other
Errorcountinuing,wedecidedonreplacement.
11:15Replacementbegan.ActionOther
11:30Replacementdone.ActionAction
2. Action Alignment
• Aligning sentences describing the same action
consider aligned numbers as action sequence
Action sentences1 Actionsentences2 Actionsentences31 Systemdetected temporary
errorError isdetected. Error: Nodedown
2 Weverified connectivity.3 PingisOK Ping NG4 Manyerrors wereinlog Thereisnoerror inlog
1 1 1
3
4
2 3
Formulation as maximal matching Problem
• Corresponding similar sentences
– maximize the sum of similarities of aligned sentences
• Solving by multiple sequence alignment (MSA) method
– MUSCLE algorithm [18] is used for aligning multiple sequences of sentences
– dice coefficient as similarity of sentence pair
3. Isolating Action Searching
• Finding isolating action i.e. branch of transitions to two resolutions
• Problem: many imitate branches caused by loss or noise in action sequence
checkIProute
checksession
checkIProutereplacemodule poweroff
pulloutcable
insertnewmodule
bootmodule
checkconnectivity
checkifaddresscanresolve checkIProute
findisolatingaction=branchoftransitions
imitatebranchbynoise imitatebranchbylossofaction
Key observation
• the following actions of isolating action can be clustered by resolutions
followingactions
isolating action
seems to have two kinds of resolution
imitate branch
Method for finding isolating action
• 1. dividing the following actions using Spectral Clusteringfor each actions
• 2. choosing the action that has the best clustering score
+
similarity of clustered sentences
dissimilarity of not clustered sentences
Scoreoftheaction
Experiments
• Dataset: practical trouble tickets for network system– Written by Japanese
• primitive linguistic preprocessing are executed
– Separated into subsets by detected error
• given parameter– the threshold of similarity for alignment– the number of isolating actions
thenumberoftickets resolutions/oftickets(i) 5 (A) turnonbreaker(x2)
(B) wait& see(x3)
(ii) 4 (A)replacemodule(x2)(B)replaceportinterface(x1)(C)replacecable(x1)
(iii) 3 (A) poweroutage(x2)(B) replace ONU(x1)
(iv) 29 (A) wait&see(x12)(B) detail loganalysis(x15)(C) replacemodule(x2)
Quantitative evaluation
• We compared obtained result with ground truth• Comparison of
– alignment result & manually appended action ID– clustering result & manually checked true resolution for each ticket– words of extracted isolating actions & isolating action in true operation
document
• miss of alignment is limited the case in
– exchange of order
• miss of isolating action searching is caused by
– the loss of isolating action
– multiple (three or more) isolations
Case study result
• frequent actions are the same with the actions in manual document
• causes are described into the next actions of isolating actions
Summary of Trouble Ticket Analytics
• Proposed extracting method for workflow automatically from multiple trouble tickets
– Action Sentence Labeling
– Action Alignment
– Isolating Action Searching
• Future works
– relaxing the limitation of order of actions
– finding multiple isolating actions
43Copyright©2016NTTcorp.AllRightsReserved.
ExpectationstoMachineLearning
n CorrelationandCausalityInferencen AnomalyDetectionn RootCauseAnalysisn KnowledgeDiscovery
44Copyright©2016NTTcorp.AllRightsReserved.
Diagnosing&Trouble-shooting
n Predictionn Detectionn RootCauseAnalysisn Recovery
45Copyright©2016NTTcorp.AllRightsReserved.
Concludingremarks- Data-drivenapproach
n Disaggregateverticallyintegratedsystemintocomponentstoachievesustainablehealthygrowth.
n Hardtounderstandprecisemechanismsofeverycomponentofentiresystem.
n Measureandcollectbigdataoninputsandoutputsofthesystemtoinfertherelationshipbetweenthem.
n Mathematicaltools,e.g.,machinelearningareavailablehere.
n Keytosuccessisinter-playbetweenmathematicsandnetworkengineering.