Problem Identification by Mining Trouble Tickets - ?· Problem Identification by Mining Trouble Tickets…

Download Problem Identification by Mining Trouble Tickets - ?· Problem Identification by Mining Trouble Tickets…

Post on 27-Jul-2018




0 download

Embed Size (px)


<ul><li><p>76</p><p>Problem Identification by Mining Trouble Tickets</p><p>Vikrant Shimpi, Maitreya Natu, Vaishali Sadaphal, Vaishali KulkarniTata Research Development and Design Centre</p><p>{vikrant.shimpi, maitreya.natu, vaishali.sadaphal, vaishali.kulkarni}</p><p>ABSTRACTIT systems of todays enterprises are continuously monitoredand managed by a team of resolvers. Any problem in thesystem is reported in the form of trouble-tickets. A ticketcontains various details of the observed problem. However,the knowledge of the actual problem is hidden in the ticketdescription along with other information. Knowledge of is-sues helps the service providers to better plan to improvecost and quality of operations. In this paper, we address theproblem of extracting the issues from ticket descriptions. Wediscuss various challenges in issue extraction and present al-gorithms to handle different scenarios. We demonstrate theeffectiveness of the proposed algorithms through two real-world case-studies.</p><p>1. INTRODUCTIONWith the increasing reliance of business on IT, the health</p><p>of IT systems is continuously monitored. The IT serviceproviders manage these systems with a team of resolvers.These resolvers act on the reported problems and take cor-rective actions to ensure smooth functioning of IT and busi-ness. The system problems are reported to the resolvers intwo ways:</p><p> System-generated issues: Components such as busi-ness applications, processes, CPU, disk, and networkinterfaces are monitored to detect anomalies. Themonitoring tools capture and report any abnormal be-havior of system components in the form of alerts.</p><p> User-generated issues: The users of IT systems reportproblems through calls, emails, or chats.</p><p>Both system-generated and user-generated issues are re-ported in the form of tickets to a team of resolvers. A ticketcontains various details of the reported problem such as thereporting time, system-of-origin, severity, and other details.In addition to this, each ticket also contains a descriptionfield that contains the details of the observed problem. In</p><p>Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.The 20th International Conference on Management of Data (COMAD),17th-19th Dec 2014 at Hyderabad, India.Copyright c2014 Computer Society of India (CSI).</p><p>case of system-generated tickets, this description is auto-matically generated by the monitoring and alerting tools.This description is often structured and is based on howthe alerting tools have been configured to report a problem.However, in case of user-generated tickets, this descriptioncontains free-form text written by users. This description isunstructured and contains variations, ill-formed sentences,and spelling and grammar mistakes.</p><p>While the ticket description contains many details, it isimportant to extract the information of the actual problemreferred by the ticket. We refer to the problem referred bythe ticket as issue. For instance, consider the following ticketdescription</p><p>PATROL alert for filesystem apps-orabase is 90% full ond-2hh4knn</p><p>The issue referred by this description is</p><p>PATROL alert for filesystem full</p><p>Extraction of issues occurring in the IT system can pro-vide crucial information to better understand and controlthe IT operations. It can assist in improving both the ITsystem and the human system involved in the operations.The IT system consists of the business functions, applica-tions, and the infrastructure. The human system consists ofthe teams of resolvers that manage the IT systems. Beloware some examples of how the knowledge of issues can assistin improving these systems.</p><p>1. Improving the IT system: (a) The knowledge of is-sues observed in the IT system helps in inferring prob-lem trends and frequent problematic areas in the sys-tem. (b) The issues referred by the high-severity tick-ets can provide insights into the critical areas thatcause customer unrest or business instability. Theseissues can be prioritized for taking corrective actionsby problem management. (c) Knowledge of frequentissues observed in specific domains can assist the ser-vice providers to prepare a service catalog. A serviceprovider uses the service catalog for knowledge acquisi-tion, generating training plans, developing automationscripts, etc.</p><p>2. Improving the human system: (a) The knowledge of is-sues can be used to identify issues that consume max-imum effort of the resolvers. These issues can be con-sidered for full or partial automation. (b) The knowl-edge of frequent issues can be used to prepare trainingplans to train resolvers with appropriate skills. (c) Theknowledge of arrival patterns of the issues can be used</p></li><li><p>77</p><p>to plan shifts with right number of resolvers with rightskills.</p><p>Mining issue from the ticket descriptions presents severalchallenges. (a) The same problem is represented in differentways in different tickets. (b) The organizations customizethe structure of the description according to their own pref-erences. (c) The descriptions are domain-dependent andcontain domain-specific words. For example, Oracle issueshave words such as tables, SQL query, filesystem, disk, andOracle-specific error codes. (d) The descriptions in user-generated tickets are written in free-form text. They areoften ambiguous, and contain spelling and grammar errors.</p><p>Currently, service providers lack a systematic approachto capture the knowledge of issues. They either rely on thecoarse-grained information from the ticketing tools or on theintuition-driven inputs provided by the resolvers.</p><p> Ticketing tools such as ServiceNow [1] and BMC Rem-edy [2] allow resolvers to classify each ticket in variouscategories and subcategories. These categories are of-ten referred as Category, Type, Item, and Summary(CTIS) [3]. However, most often these tools are notconfigured to sufficient level of details. For instance,various disk-related issues such as disk full, disk fail-ure, or disk corruption are all classified as Category= Infrastructure, Type = Hardware, Item = Storage,Summary = Disk issue. In addition to that, resolversand users make mistakes in rightly classifying the is-sue. As a result, the analysis based simply on thisclassification can often lead to incorrect and insuffi-cient knowledge of the issues.</p><p> Another approach that is adopted to infer the issuesis by analyzing the fixlogs used for ticket resolution.A fixlog of an issue is a document that contains thedetails of the resolution steps to act on the issue. Aresolver after acting on an issue makes an entry inthe ticketing tool of the fixlog used to resolve a ticket.However, there are several practical challenges facedin this approach. Often the fixlogs are not availablefor all issues. Many times, the fixlogs are too genericand are used for many issues. In addition to this, theresolvers often do not make an entry of the fixlog inthe ticketing system.</p><p> The most common approach used to infer issues isthe manual intuition-driven approach. The serviceproviders infer the issue information either from themonthly reports manually produced by the teams orby talking to the domain experts. This approach car-ries the risk of being incomplete, inaccurate, and ofvariable quality.</p><p>In this paper, we propose an analytics-driven approach tomine the textual descriptions of tickets to extract issues. Wepropose the following approach of issue extraction.</p><p> We propose to use information retrieval techniquesalong with domain knowledge to extract the issuesfrom tickets.</p><p> We propose two different algorithms to extract is-sues from system-generated and user-generated tick-ets. There is an inherent difference in the struc-ture and heterogeneity in the descriptions of system-</p><p>Figure 1: Functional diagram of the proposed ap-proach.</p><p>generated and user-generated tickets. The system-generated tickets have structured and consistent de-scriptions. We propose a clustering-based techniquefor system-generated tickets. On the other hand, theuser-generated tickets have verbose and ambiguous de-scriptions. We propose a keyword-discovery-based ap-proach for user-generated tickets.</p><p> We capture the knowledge of IT and business domainsin the form of dictionaries of include and exclude list ofwords and regular expressions. The knowledge consistsof words and patterns pertaining to (a) the technologydomain such as Oracle, Windows, and Linux, (b) busi-ness domain such as banking, retail, and finance, (c)monitoring tools such as ControlM, Autosys, and BMCPatrol, and (d) organization-specific naming conven-tions</p><p> Finally, we present validation of the effectiveness ofour techniques through two real word case-studies.</p><p>In Section 2, we present design rationale of the proposedissue extraction techniques. In Section 3, we present the pro-posed approach for extracting issues for system-generatedand user-generated tickets. In Section 4, we demonstratethe effectiveness of the proposed approach using two real-world case-studies. We present related work in Section 5and conclude in Section 6.</p><p>2. DESIGN RATIONALEIn this section, we present the proposed approach for ex-</p><p>tracting issues from ticket descriptions. Various factors needto be addressed while extracting issues from ticket descrip-tions.</p><p>Consider the following description:</p><p>ControlM Alert job=orahAAAA04 node=aaaa04 msg=sev3 jobfailed AAAA04</p><p>The issue referred by this description is as follows:</p><p>ControlM Alert job failed</p><p>The description contains other details such as name of theapplication job=orahAAAA04, name of the script, and severity level msg=sev3. To ac-curately extract the issues, it is critical to separate the fac-tual knowledge of the issue from the specific details of itsindividual manifestations.</p><p>Different tickets describe the same issue in different ways.For instance, the above example can be reported as</p><p>ControlM Sev 3 Alert on Job Failure:job=orahAAAA04, script</p></li><li><p>78</p><p>In case of user-generated tickets, this problem becomes morechallenging because of lack of structure, too many variations,and spelling and grammar errors. For instance, consider thefollowing issue.</p><p>application not responding</p><p>This issue is addressed by users in a variety of ways, suchas,</p><p>application hung,application dead,No response from application since last 5 mins,application not responding for the fifth time sincemorning.</p><p>While extracting issues, it is important to address such vari-ations and generate a uniform issue for its various manifes-tations.</p><p>As discussed in Section 1, there is an inherent difference inthe structure and heterogeneity in the descriptions of user-generated and system-generated tickets. Hence, we proposetwo different approaches to extract issues. Figure 1 presentsthe functional diagram of the proposed approach. The inputis a set of ticket descriptions and the output is the issueextracted for each description.</p><p>Below we present the details of different functional ele-ments of the proposed approach:</p><p> Data preparation: The first step is to clean the descrip-tions, remove unwanted details, and tokenize. Thisstep needs to be customized by knowledge of specificdomains to ensure that all domain-specific words, pat-terns, and phrases are retained during data prepara-tion.</p><p> Mapping to service catalog: The commonly occurringissues in a domain are often captured in the form ofa service catalog. For instance, service catalog of atechnology domain of Oracle contains the list of com-monly occurring issues in Oracle such as tablespacefull, query running slow, index not built, etc. We useservice catalog to mine these commonly occurring is-sues from the ticket descriptions. For the remainingticket descriptions, we adopt the following approach.</p><p> System-generated tickets: System-generated descrip-tions have a fixed structure and limited variations.Hence, we use clustering to group similar issues. Weform label that best represents the descriptions in acluster.</p><p> User-generated tickets: On the other hand, user-generated descriptions demonstrate too many varia-tions. Here, clustering is not effective. Hence, wefirst apply various techniques to address these vari-ations such as use of stemming, synonym detection,and spelling corrections. We then extract keywordsand group similar issues based on the commonality ofkeywords. We represent each group using an n-gramof keywords.</p><p> Knowledge cartridge: Note that, the proposed ap-proach is independent of any specific technology do-main, and is designed to be configurable to extractissues for ticket descriptions of any domain. The effec-tiveness of the proposed approach can be significantlyincreased by domain-specific customizations of data</p><p>preparation and service catalog mapping. We proposeto make cartridges for technology domains (Unix, Ora-cle, Windows, etc.), business domains (banking, retail,finance, etc.), and tool domains (BMC Patrol, Con-trol M, Tivolli TWS, Autosys, etc.) For each domainwe maintain a dictionary of domain-specific words andpatterns to include or exclude while extracting issues.We also maintain a service catalog of known issueswithin the domain.</p><p>In the rest of paper, we present details of these functionalelements.</p><p>3. PROPOSED APPROACHIn this section, we present the approach of issue extraction</p><p>from descriptions in system-generated and user-generatedtickets.</p><p>3.1 Data preparationAs discussed earlier, issue is the problem reported in the</p><p>ticket description. Ticket description contains lot of addi-tional information such as timestamp, location, threshold,system-of-origin, severity, etc. This additional informationchanges as the same issue occurs at different timestamp,location, system-of-origin, etc. Though this information isimportant, it misleads issue extraction and hence must beseparated. For example, consider a description such as</p><p>Unix server dt2n1g1 down.</p><p>Here, the actual issue is</p><p>Unix server down.</p><p>The name of system-of-origin dt2n1g1 is additional informa-tion. Similarly, consider a description such as</p><p>filesystem 01hw4884 80% full.</p><p>Here, the actual issue is</p><p>filesystem full.</p><p>Here, the name of the system-of-origin and the current valueof occupancy of space in the filesystem is additional infor-mation.</p><p>The challenge is to find the right set of words that shouldbe removed or retained. Removing and retaining of wordscould have significant impact on subsequent issue extrac-tion steps. We approach this problem by making followingobservations.</p><p> Issue is made up of dictionary words: Consider follow-ing examples of description.</p><p>1. A-IM002930432.Pl reset the Unix password for login idbchho02 @,</p><p>2. 27488732 CREATED unable to backup data after up-grade to windows 7,</p><p>3. A-IM002971223Need to revoke admin access,</p><p>4. Domain Dr...</p></li></ul>


View more >