itil v3 story

www.differ.cz Story of support and maintenance according to ITIL v3, part I.

© 2011 Jaroslav Procházka, www.differ.cz

Story of support and maintenance according to ITIL v3, Part I: Operational activities

Jaroslav Procházka www.differ.cz

version 1.0 August 2011




Story motivation

Nowadays we are driven by strong rationality (logical, scientific, verifiable facts matter) and we forget the irrational aspects and emotions in human decision making. If humans are rational, why the hell do they buy Apple products? ;) The same holds for stories and their power. Stories have been part of our cultures for thousands of years and are the best way to transfer knowledge; see sociological, psychological or cognitive studies, e.g. Campbell: The Hero with a Thousand Faces or Turner: The Literary Mind. All the old epics, the Bible or the story of Buddha are stories that attract us; we like to hear variations of the same hero’s journey again and again. And it is stories that can differentiate us, our service, product or company from the many other vendors providing the same thing. Stories matter.

Another application of stories in business is in knowledge management and sharing. Be honest, how often do you use your logical, structured, fact-based knowledge base? How easy is it to remember the content of such a record, its steps and outcomes, in the longer term? Now compare it with a colleague’s dramatic story describing the same situation (you could hear it in the kitchen, during lunch or in the pub). Which one is easier to remember and follow? Big and respected companies like Xerox [1], 3M or NASA use stories to store and share knowledge inside the company. Storytelling is also part of modern leadership.

The next motivation that triggered me to write this e-book is how hard process frameworks like IBM RUP® or ITIL® are to understand. Such misunderstanding causes problems with support, operations and maintenance of IT infrastructure, leading to weak quality and revenue and to dissatisfied teams and customers. The goal of this e-book is to spread the service-driven (ITSM) philosophy and service thinking using stories. The story focuses on the principles and concepts described by ITIL v3.

We start with an end user affected by an issue and also solve the hidden root cause (in ITIL terms, Incident and Problem Management). Proactive investigation of root causes is a weak point of many teams and companies. We’ll emphasize the key ideas of this approach, no matter whether you call it Problem Management, Kaizen, TQM or CMMI. The story also covers Configuration Management, which takes care of IT infrastructure items, Change Management, which processes change requests, and Release and Deploy Management, which builds the change. The second part of this story, which will follow soon, focuses on the tactical and strategic activities of ITSM, namely service thinking, connection to the business and its scenarios, predictions, proactive thinking, contracting and service measurement (the so-called SLA). This is the core of ITSM/ITIL thinking.

[1] For more, see http://choo.fis.utoronto.ca/mgt/KM.xeroxCase.html and http://www.kmworld.com/Articles/Editorial/Feature/Best-Practices-Eureka!-Xerox-discovers-way-to-grow-community-knowledge.-.-And-customer-satisfaction-9140.aspx




Note: This e-book does not replace ITIL training or certification, nor will it be enough for you to set up the right environment. Its goal is to raise awareness of ITIL version 3: what it is, how it can help with solving my issues and how it differs from version 2. This material could give busy people an insight and a reason to study ITIL, typically sales people, customer representatives, customers, architects, higher managers and other key people. If you have any comments, improvement proposals or ideas for this e-book, or if you would like your domain to be covered in the story as well (only application management as an incident contributor is covered), please send them to me via email ([email protected]). I also hope that this short story inspires you to write your team or unit knowledge base in the form of short stories! Stories are more memorable and writing them is fun, so they bring better value to their creators and consumers ;)





A short introduction

ITSM means IT Service Management, so the story mostly introduces the concept of IT service thinking and the operational activities connected to it. The best known and most used ITSM framework nowadays is ITIL (IT Infrastructure Library), a library guiding us in the management of IT infrastructure covering software, hardware, networking, people etc. ITIL brings a process approach to ITSM, and its key benefit is the definition of a common terminology, which is very important for communication between IT and business and among the different vendors in the chain. At the time of writing (July 2011), ITIL exists in version 3, but a refresh is being prepared for release and will be called ITIL 2011. The key difference between versions 2 and 3 is the newly introduced lifecycle of the service (see picture), starting with its idea and strategy (the Service Strategy phase) and ending with daily use and support (Service Operation). The story in this e-book covers concepts and processes of version 3, specifically the Service Transition and Service Operation parts.

Daily operations and support deal with necessary activities such as monitoring, data backups, implementing legislative amendments (e.g. in ERP applications) or reflecting changes in the assembly process of production and assembly lines. Recording, assessing, implementing, testing and integrating a change or new functionality is one part of those necessary activities. But specific actions also need to be performed when an application or service incident affects end users and thus the value we provide to the customer. Depending on the number of users affected or the importance of the application, the cost of an incident can be really huge:

e.g. 30 minutes of a stopped assembly line can mean 20 cars not assembled and delivered: 20 cars x 15,000 EUR per car = 300,000 EUR of losses in just 30 minutes!

A specific domain that needs our attention and automation is physical change in the IT infrastructure: hardware or network. If not secured and automated properly, such changes can cause severe incidents with a huge financial impact. The box “Simple incident cost calculation” below shows a more detailed example calculation of the cost of an incident.




Because of this, we also need early identification and detection of incidents with a high level of automation. The yang to this yin is built-in proactive root cause identification and solution (so-called Problem Management). The necessary back-end functionality supporting efficient monitoring and problem management provides knowledge about the infrastructure: hardware and software configurations, software versions, licenses, people’s locations, access right policies etc. Advanced teams use a (semi)automated knowledge base that stores and proposes already solved issues, incidents, problems or complicated changes with many dependencies.

Starring:
Mary ....... business user affected by a system incident
Pete ....... application programmer
John ....... system administrator
Adam ....... Service Desk support specialist
And other stars…

Simple incident cost calculation:
Employee cost ....... 100 EUR / h
Headcount ........... 200 in total
Incident length ..... 3 h
30 people cannot work for 3 hours because of a system incident. The cost of the impact can be calculated simply: 30 employees x 3 h x 100 EUR/h = 9,000 EUR of costs in one day that generate no value for the customer! If you multiply this number by the total number of incidents per year, you can get a pretty high number that could cover e.g. a yearly IT budget or the cost of a totally new system or assembly line.
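The arithmetic in the box, and in the assembly line example before it, can be written as two small helpers. This is only a minimal sketch in Python; the function and parameter names are assumptions made for illustration, while the figures are the ones used in the text.

```python
def labor_cost_of_incident(affected_employees: int, hours: float, hourly_cost_eur: float) -> float:
    """Cost of people who cannot work while the service is down."""
    return affected_employees * hours * hourly_cost_eur


def lost_production_value(units_not_produced: int, unit_price_eur: float) -> float:
    """Value of products not assembled and delivered during the outage."""
    return units_not_produced * unit_price_eur


if __name__ == "__main__":
    # 30 people blocked for 3 hours at 100 EUR per hour
    print(labor_cost_of_incident(30, 3, 100))   # 9000
    # 20 cars not assembled at 15,000 EUR each
    print(lost_production_value(20, 15_000))    # 300000
```

Multiplying the first result by the number of incidents per year gives the yearly figure the box talks about.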




Typical scenario of daily operations

Until now, Mary used paper records of incoming orders. Although her company had implemented an information system for the assembly line and the economic agenda, order processing was not part of the project. Paper records are not very efficient and cause trouble when an order needs to be found quickly. Archiving is also a bit problematic: orders fade and become harder and harder to read. Mary is happy, because order processing was recently automated by a software program and integrated into the assembly line information system. Rework, searching and archiving issues are reduced to almost zero now and Mary can enjoy her work.

It’s Monday morning. Mary uses an application called WarehouseAndOrders v1.1 to process orders for the assembly line, but after one hour of work the software client crashes and she is not able to start it again. So she calls her friend Pete, an application programmer, to help her solve this incident [2]. Pete is employed by the IT company delivering and operating this application and knows Mary from their university days. They are still friends and meet regularly. Pete is happy to hear from Mary again, so they have a little chat and, by the way, Mary also mentions the incident. Pete makes a note, but forgets about it immediately because of the heavy load caused by an upcoming release. Mary awaits a resolution from Pete and performs some unimportant tasks so as not to be bored.

She brings the issue up again at Monday’s lunch, which they usually have with the whole old university group. So as not to feel ashamed, Pete asks about the symptoms Mary observed in the morning (any error message, behavior of the system etc.). But after a few of Mary’s comments Pete starts talking about something else. It has been a few hours since the incident happened, so Mary doesn’t remember anything significant, and Pete is annoyed by that. It is no surprise that Pete continues with his development tasks after lunch and forgets about Mary and her issue. Mary still cannot use the application and process incoming orders.

[2] From here on we differentiate between the key terms incident and problem, because they have totally different meanings in ITIL terminology. An incident is an event, perceived by the end user, that causes availability or quality problems of an IT service or its part. It could concern response time, the number of processed transactions, volume, inaccessibility of the service etc. An incident can usually be solved by a so-called workaround (typically a server or application restart), but this solution does not remove the hidden root cause! It only allows the service to operate again at the agreed quality. A problem is the hidden root cause of one or more incidents, which may or may not be evident yet. A problem can be solved only by a structural solution, e.g. a change in the IT infrastructure or a bugfix in the source code of a software application.




It is Monday afternoon and Mary calls Pete again to hear about the progress. Pete gets angry because Mary keeps interrupting him. He needs to finish build testing for the upcoming deployment. Pete wants to get rid of Mary, so he stops testing and starts investigating the incident. Mary is still waiting and performing unimportant tasks, and incoming orders are not being processed. Mary realizes that this day will not bring a solution and goes home early. Pete stays until 8 pm, busy identifying the infrastructure: What are the parts of this nasty program? Which servers are used to operate it? What middleware, databases and other connectors does it use?

Tuesday morning is an important deadline for Pete: he needs to finish the new release package for deployment. This is why he comes in early even though he finished late the day before. Mary arrives later this morning, to be sure that the incident is already solved and her time is not wasted. Pete focuses on finishing build testing and packaging. When ready, he continues with the incident investigation. Finally he finds out which servers are used to operate the WarehouseAndOrders application, and both are Linux servers! Thank the IT God, Pete is a Linux fan and a skilled Linux programmer, so he wants to start the investigation, but a missing account stops him immediately. Pete is proactive and calls John (the system admin) to get any account that lets him in. John, as a good friend, shares the root account, assuming that Pete will create a personal account and upload some new movies and mp3s there. Why the hell would he otherwise ask for access to this server?

Pete skips lunch today because of his heavy load. Tuesday afternoon brings the following steps. Pete logs in to the Linux server as root and searches for the WarehouseAndOrders program directory and the other underlying applications and database servers. He plans to investigate the logs to learn more about the situation, but when starting MC (Midnight Commander) he accidentally notices that the server hard drive is full. Because Pete is busy but wants to help Mary at the same time, he does not bother with creating his own account and setting the rights; he just deletes some temp and log files as root. He calls John to restart the Oracle DB and the Apache Tomcat web server; both were down and both are used by the WarehouseAndOrders application. In fact, Pete does not want to waste time looking for the admin interface. What is more, it’s John’s responsibility anyway. John is confused by this request (Pete does not usually work for the customer using those servers), but he does what he is asked without any notice to end users. After the restart, John tells Pete to check that everything works as expected.




Pete can now call Mary to tell her that the WarehouseAndOrders application v1.1 is running again. Mary is very grateful, thanks Pete and starts to process the orders waiting in the queue. Pete forgets the whole story and continues with his assignment. The build needs to be tested and packaged for tomorrow’s deployment. Pete again stays in the office until 8 pm to finish all the required steps.

Wednesday morning looks like an ordinary day, with Mary processing the orders in the queue. After 2 hours the same incident occurs again, and it makes Mary angry. She calls Pete to ask whether he knows anything about the issue; maybe he’s improving the application, she assumes. But nobody answers the office phone. The reason is obvious to our reader, but not to Mary: Pete is traveling to the customer’s premises to install the new release, because it cannot be done remotely. Mary is not doomed to wait, though; she calls Pete’s mobile phone and explains the situation. Pete contacts John and quickly briefs him on the issue and its context. John finally understands why Pete asked for the Linux account and the server restart. The reason was not mp3s or new movies but an incident! Thanks to this, John now knows some context, the servers used and the symptoms of the issue. Pete continues with the installation of the customer release, finally without any disturbance. John starts investigating the IT environment and notices the full server disk. He backs up selected log files to a different server for further analysis, deletes the originals and restarts the Oracle DB and only the failed instances of the Apache Tomcat web server. He tries the WarehouseAndOrders application and sees everything working, but he does not contact Mary yet, not until he is sure the incident will not occur again.

John, as system admin, is surprised by the full server disk. There cannot be that many movies and mp3s stored on the server, he thinks out loud… He postpones lunch and starts investigating the deeper root cause of the incident. How can the server disk be full? He writes a workaround script that regularly backs up selected log and temp files to a different server and removes the originals afterwards. John wants to download the log files to his computer for further investigation and analysis, and he accidentally notices a very big Oracle DB log (only because of the long download time)! How the heck can today’s Oracle log have almost 3 GB? He opens the log in its original server folder and after a few minutes of investigation notices programmer’s error reports. After this finding he updates the workaround script to cover this log as well. The script is then quickly tested with the expected result, so nothing hinders its deployment. Only after this does John call Mary to tell her she can use the application again. John still wonders what error can cause such a huge Oracle log and whether it is the only contributor to the full disk. He searches Internet forums to see if somebody has already tackled a similar issue, but finds nothing. He reports the defect to Oracle Corporation and waits for a reply. Finally he can go for his Wednesday lunch.




Scenario conclusion: Although some actions described in this scenario may seem striking and funny, many IT and non-IT organizations work this way. And if you discuss the topic with them and point out some anti-patterns, they are not aware of anything weird and are surprised by your statements about efficiency and potential risks. Moreover, this story is our personal experience from previous assignments. Let’s conclude the story:

- Mary could not process orders for almost 2 days. It could affect the company’s cash flow and reputation or even generate losses, but nobody cared.
- Pete was frequently disturbed, switched context and was overloaded.
- John, as system administrator, started to investigate the hidden root cause of the incident (doing his job) only 2 days after the incident was first discovered.
- Due to the disturbances and Pete’s tiredness, the build could contain unnoticed defects.
- Pete accessed restricted production servers as root and deleted files there as root.
- The same incident occurred again within a short time and affected the end user.
- The hidden root cause generating the incident is still not uncovered and resolved.




ITIL v3 scenario

Let’s discuss the same story following ITSM principles. This is how it looked after 3 months of implementation effort. The same stars perform this story, but the approach to incident resolution is different. Again we focus on Service Transition and Service Operation activities.

The situation with IT systems is the same as in the first scenario. We start the story on Monday morning again, when Mary enters the office and starts to use the Orders&Warehousing IT service [3], no longer the WarehouseAndOrders application. She does not need to care about the different parts of the service, start a program client or renew licenses. She just uses her browser and a link to run the Orders&Warehousing IT service. The IT service works as expected and no warning symptoms occur. At the same time, standard monitoring and event reporting [4] is set up and working. Business users, Mary among them, do not even know about this monitoring. IT specialists together with Service Desk specialists set the thresholds for specific components, servers and their events. These events can trigger a deeper investigation by a specialist or can automatically report an incident. This morning the monitoring system started to report several “lack of free disk space” events on the Orders&Warehousing IT service server [5]. Service Desk specialists started to investigate those events, but meanwhile the Orders&Warehousing IT service froze and stopped responding.
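A threshold check of the kind described here could look like the following minimal Python sketch. The threshold values, level names and function names are assumptions chosen for illustration; the story does not say which monitoring tool or limits were used.

```python
import shutil

# Hypothetical thresholds; real values are agreed per component.
WARNING_PCT = 80.0
CRITICAL_PCT = 95.0


def classify_disk_usage(used_pct: float) -> str:
    """Map a disk usage percentage to an event level."""
    if used_pct >= CRITICAL_PCT:
        return "incident"   # report an incident automatically
    if used_pct >= WARNING_PCT:
        return "warning"    # event for a specialist to look into
    return "ok"


def check_disk(path: str = "/") -> str:
    """Read the real disk usage for `path` and classify it."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    return classify_disk_usage(used_pct)


if __name__ == "__main__":
    print(check_disk("."))
```

Keeping the classification separate from the measurement makes the thresholds easy to review with the Service Desk and easy to test without a full disk.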

Mary reports an incident using the Service Desk (SD) tool. The SD is the single point of contact (SPOC), by tool or by phone, for communication with the IT service vendor. The incident record reported this way contains the incident description (observed symptoms), the priority for the user (e.g. only one user of the service is affected, a department or team is affected, or the whole company is affected) and the name of the service, chosen from the list of provided services. This action automatically notifies the Incident Manager of the relevant service (and/or customer). The Incident Manager does the first check of the incident record and assigns the expected category (e.g. hardware, network, application, premises, licenses) and a priority that reflects the end user’s perception but also other services and the business impact. The resulting priority in this case is high, although only Mary uses the service: the service supports the processing of incoming orders, and its unavailability can stop the assembly line and affect the company’s business and cash flow. Adam is assigned to this incident because he is marked as free in the Service Desk dashboard; he is automatically notified about it, and so is Mary. All these steps happen in just a few minutes, approximately the time it takes to read this page.
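The incident record and the way its priority is derived can be sketched as a small data structure. This is an illustrative Python sketch only: the field names, the two scales and the rule "take the higher of user scope and business impact" are assumptions, not something ITIL or the Service Desk tool in the story prescribes.

```python
from dataclasses import dataclass

# Hypothetical scales, chosen for this sketch only.
USER_SCOPE = {"single user": 1, "team": 2, "company": 3}
BUSINESS_IMPACT = {"low": 1, "medium": 2, "high": 3}


@dataclass
class Incident:
    description: str      # observed symptoms
    service: str          # chosen from the list of provided services
    user_scope: str       # priority as seen by the reporting user
    category: str = ""    # assigned by the Incident Manager
    priority: str = ""    # assigned by the Incident Manager


def resulting_priority(user_scope: str, business_impact: str) -> str:
    """Take the higher of user scope and business impact."""
    score = max(USER_SCOPE[user_scope], BUSINESS_IMPACT[business_impact])
    return {1: "low", 2: "medium", 3: "high"}[score]


incident = Incident(
    description="Service froze and does not respond",
    service="Orders&Warehousing",
    user_scope="single user",
)
# Only Mary is affected, but order processing can stop the assembly line.
incident.priority = resulting_priority(incident.user_scope, "high")
print(incident.priority)  # high
```

The example reproduces the outcome in the story: a single affected user, yet a high resulting priority because of the business impact.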

[3] An IT service is a means of delivering value to the customer using IT resources. The customer gets the specific outcomes needed to run the business without owning and managing the costs and risks connected to IT. The customer does not care about software, hardware, networks, licenses, premises, people, upgrades and patches, or monitoring. The customer just buys the IT service as a commodity, and an external or internal vendor takes care of operations, support and maintenance.
[4] Typically log changes, state monitoring or user events are processed to trigger incidents.
[5] These activities are performed as part of Event Management and are tightly connected with monitoring and monitoring systems.




Incident reporting example using Outlook




Incident reporting example using Jira tool

Adam reads the notification and immediately starts investigating the incident. The first steps are the following checks:

- Checking the Knowledge Base (KB): it contains solutions to existing problems and incidents. If it is well structured, readable and user friendly, the KB can ease and speed up incident resolution as well as knowledge sharing within the team.
- Checking the Configuration Management System (CMS): it contains the description, version, location and bindings of IT infrastructure components (end user stations, servers, accessories). Such a system can significantly help with incident localization (which server or station is used by this service, and in what configuration and versions) and with root cause identification.
- Checking the records and events of the automated monitoring tool (Event Management records). These functions are often performed by a specific team or department called the Control Desk.

The tools mentioned allow quicker incident resolution and also mean that Service Desk specialists need less technical skill (the needed information is stored in the tool and does not have to be dug out in a complicated way). Thanks to the CMS and the IT service catalogue, Adam knows which components are used to operate the Orders&Warehousing IT service (see the following table and figure).




| IT service name | Users | Responsibilities | Configuration Items (CI) |
|---|---|---|---|
| Orders&Warehousing | Mary, Management | Users: reporting incidents using Service Desk (tool or phone); participating in regular monthly SLA reviews; … | WarehouseAndOrders v1.1; Tomcat 6; Oracle 9i; Red Hat Enterprise Linux 5; HW Server Prague; Net Switch S1; Net Switch S2; Intranet |
| Internet | All users | See internal rules for using Internet (link to intranet document) | Internet Service Provider; Firewall Zone v3.2 |

Example IT service catalogue records. The Configuration Items column is visible only to the IT vendor.

CMS part: visual information about the IT service infrastructure (basically the visualized Configuration Items column of the table above)
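The lookups Adam does against the service catalogue and the CMS can be sketched with a plain dictionary. This is a minimal Python sketch: the records are taken from the table, but the dictionary layout and the function names are assumptions, and a real CMS would also hold versions, locations and bindings between items.

```python
# Service catalogue records from the table; the layout is an assumption.
SERVICE_CATALOGUE = {
    "Orders&Warehousing": [
        "WarehouseAndOrders v1.1", "Tomcat 6", "Oracle 9i",
        "Red Hat Enterprise Linux 5", "HW Server Prague",
        "Net Switch S1", "Net Switch S2", "Intranet",
    ],
    "Internet": ["Internet Service Provider", "Firewall Zone v3.2"],
}


def components_of(service: str) -> list[str]:
    """Configuration items used to operate the given IT service."""
    return SERVICE_CATALOGUE.get(service, [])


def services_using(ci: str) -> list[str]:
    """IT services that depend on the given configuration item."""
    return sorted(name for name, cis in SERVICE_CATALOGUE.items() if ci in cis)


if __name__ == "__main__":
    print(components_of("Orders&Warehousing"))
    print(services_using("Oracle 9i"))  # ['Orders&Warehousing']
```

The first lookup tells the Service Desk which logs to inspect; the second shows which services are at risk when one component misbehaves.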

Events in the IT infrastructure show insufficient (in fact no) free disk space on the core operational server of the IT service. Adam backs up temp and log files and starts to investigate only the Oracle database and Tomcat web server logs, because he knows from the IT service catalogue that these are used by the service. Thanks to the monitoring tools, Adam also knows that only this service is down. He notices an overly big Oracle DB log consuming several GB of disk space. He backs up and deletes the Oracle log, restarts the service and tries its functionality. At the same time he also creates an automatic script that backs up and deletes the original Oracle DB log file at a regular interval (a so-called workaround solution). He verifies and installs the script, restarts the Oracle DB and the relevant Tomcat instance, and checks the monitoring tools, the IT service functionality and the backed-up file. Everything works, so Adam creates a problem record in the SD tool and assigns it, together with a link to the Oracle log file, to the Oracle group that solves Oracle-related problems. The problem ticket is raised to solve the deeper root cause. Adam has only used an interim workaround solution for the incident, which allows the IT service to run again. But why is the Oracle log so huge? What causes this? How can it be fixed? These questions are still not answered.

As a final step, Adam updates the incident record (work log and solution) and closes it. Mary is notified about the solved incident via e-mail, so she knows she can start to use the service again. Mary needs to try the service and, if the solution is OK, accept the incident solution (or this can be done automatically after some period of time, so as not to annoy the end user). Mary accepts the solution because everything works well.
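A workaround script like the one Adam installs could be as simple as the following Python sketch. It is an assumption-laden illustration: the function name, the compressed backup format and the paths in the usage note are hypothetical, and the regular scheduling (for example by cron) is left out.

```python
import gzip
import shutil
import time
from pathlib import Path


def backup_and_delete(log_file: Path, backup_dir: Path) -> Path:
    """Copy the log to a compressed backup, then delete the original."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = backup_dir / f"{log_file.name}.{stamp}.gz"
    # Write the backup first, so the original is removed only after the copy succeeded.
    with open(log_file, "rb") as src, gzip.open(backup, "wb") as dst:
        shutil.copyfileobj(src, dst)
    log_file.unlink()
    return backup
```

A call such as `backup_and_delete(Path("/path/to/oracle.log"), Path("/path/to/backups"))` (both paths are placeholders) frees the disk space while keeping the evidence that the problem team needs. It treats the symptom only; the cause of the log growth stays open.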

It’s still Monday, but already after lunch. Adam creates a Knowledge Base record describing this incident and its symptoms and encloses the workaround (script). This KB record is linked to the original incident record and to the created problem record too. The goal of the KB record is to speed up the solution of similar incidents in the future.

We used the Service Desk function (or tool) and the Event and Incident Management processes to register and process the incident record. Only the incident was solved in the story so far; the root cause is still unclear. The reader may have noticed how appropriate tools and monitoring can make the incident management process much quicker and more efficient. Thanks to this, the incident is processed in several minutes and resolved in tens of minutes. Mary can continue with her work and there is no significant impact on the company’s business (certainly not the days lost in the previous story). But our job is not done yet. We need to uncover and solve the problem (the ITIL term for an unknown root cause) causing the incident. So let’s continue with the story to uncover the hidden problem using the Problem Management process and implement the change using Change Management and Release and Deploy Management. The whole lifecycle and the process relations are depicted in the following figure:

Relations of ITSM Service Operation and Service Transition processes introduced in our story

Since Adam created a problem ticket in the Service Desk assigned to the Oracle database group, a Problem Management team is formed on demand. This team consists of skilled and experienced administrators and database programmers who are involved only in more complicated issues (Level 2 and 3 in the Service Desk hierarchy model). The reason is the labor cost of those professionals. Rachel, an Oracle specialist, is notified as Problem Manager and starts to investigate the problem record as well as the incident record with its workaround, the Knowledge Base description and, mainly, the linked Oracle log file. Thanks to her knowledge of a “standard” Oracle log, she quickly uncovers programmer’s error reports in this log. She is surprised that this could happen, because she has never experienced it before. Rachel logs in to the Oracle defect reporting tool (the maintenance fee grants access to this database) and searches for this issue, but finds nothing. She is allowed to create a defect in the Oracle defect tool, so she does: she describes the log issue and attaches a log snapshot to demonstrate it. After several days Rachel receives a reply from Oracle informing her about a new patch released by Oracle to fix this defect. Rachel creates a request for change (RfC) to implement this patch in the operational environment. The RfC includes the description, reason, importance and impact of the new patch.

Now we get to the moment when the root cause has been identified and a solution exists. Before releasing it to the production environment we need to approve the request (there could be upcoming conflicting or dependent changes), test it (there can be other contributors to this root cause) and finally deploy it. The Change, Release and Deploy Management processes and roles are responsible for these steps. Change assessment, testing and deployment could look like the activities described in the following paragraphs.

The change request is assessed and approved by Change Manager Mike because no conflict or dependency with upcoming changes was found, implementation costs are very low and we save backup disk space when we remove the workaround. The proposed solution to the uncovered root cause is a structural one: it solves the issue at low cost and allows the workaround to be removed. The Oracle patch is first installed and tested in the testing environment (a mirror copy of the production environment) and is ready for production deployment only after all tests are finished and no other symptoms are observed. It seems that the Release and Deploy team can now finally distribute and deploy the patch to the production environment. But before they proceed with this step, they need to prepare a contingency plan called a rollback plan. The Orders&Warehousing IT service is so important that the IT vendor cannot afford another incident in a row (it would definitely affect the SLA [6]). The rollback plan gives the team a strategy to use if the patch deployment fails. If that happens, we have to be able to restore the previous working version and configuration. A necessary input for the rollback plan is again the CMS, which contains information about the current versions of software and hardware systems and their configurations, and provides information about the authorized storage of source, configuration and executable files.
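The idea behind a rollback plan, apply the change, verify it and restore the previous state if anything fails, can be sketched in a few lines of Python. The function and the toy `state` dictionary are hypothetical and stand in for the real patch installation, verification tests and restore procedure.

```python
from typing import Callable


def deploy_with_rollback(
    apply_change: Callable[[], None],
    verify: Callable[[], bool],
    restore_previous: Callable[[], None],
) -> bool:
    """Apply a change; restore the previous version if it fails verification."""
    try:
        apply_change()
        if verify():
            return True
    except Exception as error:
        print(f"Deployment failed: {error}")
    restore_previous()
    return False


# Toy example: the "environment" is just a dictionary.
state = {"oracle_patch": "not installed"}
deployed = deploy_with_rollback(
    apply_change=lambda: state.update(oracle_patch="installed"),
    verify=lambda: False,  # pretend the verification tests failed
    restore_previous=lambda: state.update(oracle_patch="not installed"),
)
print(deployed, state)  # False {'oracle_patch': 'not installed'}
```

The restore step is only as good as the information behind it, which is why the CMS with its recorded versions and configurations is a necessary input.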

Now we are finally ready to deploy the patch to the production environment (really done-done). Deployment is done during an agreed, so-called maintenance window. The IT vendor can make changes and stop services for maintenance purposes only during this time. In this case it is from 2:00 am to 3:00 am. When the team has deployed the patch and run the production verification tests, they remove the existing workaround (the backup and delete script) together with Rachel. After a check, Mike closes the RfC as successfully implemented. Rachel now updates the problem solution (the Oracle patch) and closes the problem as successfully resolved as well. She still needs to update the Knowledge Base record to keep all information synchronized. After that she’s done.
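A deployment tool can guard against changes outside the agreed maintenance window with a check like the one below. This is a minimal Python sketch; the constant and function names are assumptions, and the 2:00 to 3:00 am window is the one from the story.

```python
from datetime import datetime, time

WINDOW_START = time(2, 0)  # 2:00 am, from the story
WINDOW_END = time(3, 0)    # 3:00 am, from the story


def in_maintenance_window(moment: datetime) -> bool:
    """True if `moment` falls inside the agreed window (end excluded)."""
    return WINDOW_START <= moment.time() < WINDOW_END


if __name__ == "__main__":
    print(in_maintenance_window(datetime(2011, 8, 1, 2, 30)))  # True
    print(in_maintenance_window(datetime(2011, 8, 1, 14, 0)))  # False
```

This simple comparison only works for a window that does not cross midnight; a window such as 11 pm to 1 am would need a different check.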

But this is not the end of the story yet. Now there is a discrepancy between the real production environment (Oracle database patch, a micro version change) and the information about it in the CMS. We need to update this information in the CMS and the IT service catalogue to keep these tools useful.

[6] SLA (Service Level Agreement) defines the agreed quality parameters and the conditions under which the service is provided. It is usually an appendix to the contract, because it is not a formal contract itself.




The update can be done manually [7] or using an automated tool [8], depending on the vendor’s automation maturity.
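An automated or semi-automated audit boils down to comparing what the CMS records with what is actually found in the infrastructure. The following Python sketch shows that comparison only; the function name is an assumption, and the version strings in the example are hypothetical placeholders, since the story does not name the patched Oracle version.

```python
def find_discrepancies(cms: dict[str, str], discovered: dict[str, str]) -> dict[str, tuple]:
    """Return items whose recorded version differs from the discovered one."""
    differences = {}
    for item in sorted(set(cms) | set(discovered)):
        recorded = cms.get(item)        # None if the CMS does not know the item
        actual = discovered.get(item)   # None if the item was not found
        if recorded != actual:
            differences[item] = (recorded, actual)
    return differences


if __name__ == "__main__":
    cms_records = {"Oracle": "9i", "Tomcat": "6"}
    # Hypothetical result of an infrastructure audit after the patch.
    audit_result = {"Oracle": "9i + patch", "Tomcat": "6"}
    print(find_discrepancies(cms_records, audit_result))
    # {'Oracle': ('9i', '9i + patch')}
```

Each reported difference is either an authorized change that still has to be recorded in the CMS, as in the story, or an unauthorized change that deserves a closer look.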

Simple Rollback plan example

If we compare the first and the second scenario, we can see a big difference. Using more formal ITSM/ITIL procedures supported by automated tools allowed us to process all necessary activities more efficiently and without needless emotions. We also solved the deeper root cause with a structural solution, not just an interim one that makes the IT infrastructure and its support and maintenance more complex. But do not take these statements as a rule or the only truth. There is a hidden trap when shifting our way of working from informal and ad hoc to process oriented. The trap is omitting or suppressing the human aspect and becoming only a ticket-driven machine, so commonly seen in big corporations.

Anyway, we can conclude the scenario as follows:

- The incident was closed much earlier than in the first scenario.
- The people responsible for incident solving did the job; no other IT roles, e.g. programmers, were disturbed.
- The people involved in ITSM activities knew what to do and how to do it (this was also boosted by proper process automation).
- Incident root cause investigation and structural solution design (not just accepting the workaround) started on the very first day, with the aim of preventing recurring incidents.
- Proper automated tools (have you noticed, no Excel was mentioned ;)) sped up diagnosis, information gathering and the incident resolution process.
- Every step is recorded in the Service Desk tool, so it’s easy to track or report all steps and actions performed.
- An updated, user-friendly knowledge base (KB) can help with solving similar incidents and problems.

[7] We recommend a simple checklist as part of the change record (or work log) that enforces the manual update or reminds people of it.
[8] The update can happen without any manual action (the monitoring system reports the change in the infrastructure and updates the information) or semi-automatically (a manual trigger for an automatic IT infrastructure audit).




The whole story following the ITSM/ITIL v3 processes is depicted in the following picture:

Flow starting with the identified incident (and reported events) and ending with the implementation of a structural solution to the identified problem (the workaround is not the final destination)

Story conclusion

As a result of this significant incident affecting the Orders&Warehousing IT service, an extra SLA review meeting between IT and business was held. The scope of this meeting was not only to review the thresholds and actual values of service quality attributes but also the possible financial losses caused by this incident. It triggered additional actions on the IT vendor’s side that should lead to a better understanding of the business, improved capacity and load predictions, and proactive steps uncovering potential problems (in the ITIL sense of the term). But these steps are already a trailer for the upcoming second part of this ITIL v3 story ;)




Change history

| Version | Date | Author | Change history |
|---|---|---|---|
| V1.0 | August 2011 | Jarek Procházka | First English version created |






