Posts tagged incident management
Practical problem & incident management
I found this a few days ago while surfing the net and really liked it. In all honesty, it is nothing but the truth about problem management and incident management today.
Sometimes this is exactly what happens!
I am unsure who the author is, but it is cool work!

Managing self-inflicted incidents – why it's important and how to avoid them
I wrote quite a few articles recently on Incident Management and you can find all of them here. In this article I am putting down my thoughts on incidents that are categorized as “self-inflicted” or “invited” incidents (henceforth SIIs) and how to prevent them from occurring.
Self-inflicted incidents that result in a service outage or disruption are normally followed by remedies claimed against the vendor providing the application support services. Customers nowadays are careful to put remedy clauses in their contracts, so it is all the more important to keep incidents, let alone self-inflicted ones, away by putting additional monitoring and proactive measures in place.
I wrote earlier about the DDR framework for managing incidents and how important it is for major incidents to be detected earlier, diagnosed quicker and resolved sooner. It is worth reading that article if you have not done so yet.
While you are doing incident management, it is important to understand whether the incident could be classified as self-inflicted or not. The sooner you identify the type of incident (self-inflicted or situational), the better your chances of “managing” the incident appropriately and avoiding heavy fines / remedies against your organization.
The most frequent cause of SIIs is manual oversight or carelessness while making a change to the production system. Any change made to the production system without understanding its implications could be really harmful and could come back and bite you hard. Hence it is really important for application support teams to understand each change going on the platform and around the platform, and then line up the implementation steps and the pre- and post-implementation checks accordingly to safeguard against potential SIIs.
Once you identify an incident as an SII, it is very important to “manage” it properly. Two key lessons to keep in mind while managing an SII are:
- Never hide an SII from the customer
- Never lie about the facts around an SII
Most of the time, I think support team management takes a political route in handling the aftermath of an SII to save themselves from potential remedies & fines. While in some cases it makes sense to do so, more often than not, with a wiser and slightly smarter customer, it falls flat on its face. After all, your reputation is on the line!
If you have a very good working relationship with the customer, try to speak with the customer and explain the situation in full honesty. While you do that, it is equally important to learn the lesson and take steps so that the incident does not repeat. There is no point in making false promises to the customer if your team cannot keep them. If there are circumstances that forced your team into an SII, then explain them to the customer and see how they could be overcome. In most cases, where the customer is reasonably sensible (rather than horrible), this approach will prevail.
Remember! It is always important to keep the customer informed and not keep them in the dark over the investigation. After all, you are the service provider and the customer is paying you for your services.
Now, moving on to tips for avoiding SIIs. There is no defined process or guaranteed path that ensures there will never be an SII while you are providing application support, but there are enough tips and tricks to help you reduce the probability.
First of all, find out the most common root causes of the incidents that happened in the past year. More often than not, if an incident happened in the past year, its root cause was found and a fix was applied, there is a lesson you can take from that experience.
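As a rough illustration of that first tip, here is a minimal sketch that ranks root causes by how often they occurred. The incident records and field names are made up for the example; in practice they would come from whatever ticketing tool or RCA archive your team uses.

```python
from collections import Counter

# Hypothetical incident records for the past year (illustrative data only).
incidents = [
    {"id": "INC-001", "root_cause": "config change"},
    {"id": "INC-002", "root_cause": "disk full"},
    {"id": "INC-003", "root_cause": "config change"},
    {"id": "INC-004", "root_cause": "expired certificate"},
    {"id": "INC-005", "root_cause": "disk full"},
    {"id": "INC-006", "root_cause": "config change"},
]


def top_root_causes(records, n=3):
    """Return the n most frequent root causes as (cause, count) pairs."""
    counts = Counter(record["root_cause"] for record in records)
    return counts.most_common(n)


for cause, count in top_root_causes(incidents):
    print(f"{cause}: {count}")
```

The causes at the top of such a list are the natural candidates for extra pre- and post-implementation checks.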
Have a very good checklist for doing the health check of the system. Automate the monitoring of the components and potential failure points as much as you can, so that in case of an incident the monitoring data is available as evidence.
Have a useful incident checklist handy. You can read about how to prepare an incident checklist in my previous post here. You need to take all your understanding of the platform, its connection points and its failure points into consideration when you create the incident checklist to detect and diagnose the incident.
Most importantly, for all scheduled / unscheduled changes on the platform, ensure that they are thoroughly checked, their implications are understood and the risk is flagged accordingly. There is no point in keeping quiet if you know that a network change might cause an outage to your portal when it is switched over. You are better off giving the customer a heads-up and seeking approval prior to such a change than explaining later on why you allowed it on production! If the application support team is able to detect and predict which changes are potentially harmful to the system before they are approved for implementation, more than half of your job is done.
While implementing the change, obviously be very careful about what you are working on. Even if the change sounds simple and neither intrusive nor disruptive to the service, there is no point in being careless about it. I have experience of managing an incident where one of my colleagues (a few years ago) had deleted production database tables instead of the reference database tables, and the system went down for a full 3 days!
There are a lot of things you can do as an application support team to avoid potential SIIs and eventually ensure you maintain a stable system. I have noted a few of them above; you might want to let me know if there are more and share your knowledge with me too!
Cheers !
Incident checklists – why one should have them & how one should prepare them
I wrote a few weeks ago about how you can manage incidents effectively. The post is available here for your reading. I mentioned there the Detect Diagnose Resolve framework and how you can use it to manage incidents effectively.
It is very important to quickly locate the cause of the incident once you are in the incident management process. Unless you find out where the cause lies, it will take a long time to actually diagnose & resolve the issue.
The ‘Incident Checklist’ is the most important tool every application support analyst should have in order to quickly start the incident analysis and rule out the obvious causes of errors.
However, what is the ideal way to prepare and subsequently maintain the incident checklists?
Why
I guess I do not need to convince the application support community about the need to have incident checklists prepared for their use. They are handy documents / tools that give essential information you can use during an incident, such as:
- Important phone numbers & emails
- Stakeholder lists
- Technical task list for carrying out health check
- Quick tips to help make decisions
- Escalation paths
- Other support group contacts
Once you have a good checklist containing the above details, you should review & update it as often as needed to ensure that it stays useful.
How
The most important aspects of the incident checklist are that it should:
- Not be overly cluttered
- Be simple and easy to understand
- Give clear instructions on dos & don'ts
- Not contain any sensitive data, e.g., passwords, user IDs etc.
You might be wondering what the best format for the incident checklist is. Should it be an MS Word document, Excel, PowerPoint, an image, a PDF or an online tool?
In my opinion, “flight safety cards” are the best example of incident checklists. They give all the necessary information on how one should react to an emergency situation: important information such as the nearest exits, dos & don'ts during the crisis and so on.
Support teams should take this example and prepare their incident checklists in a way that satisfies the criteria I mentioned earlier.
Cheers
Application support & Web 2.0
Normally, application support and related teams stay away from the jazzy world of technical innovation and are more into daily maintenance tasks and routine work. It feels weird to relate the work done by the core support team to a concept like Web 2.0 and related technologies, doesn’t it?
As per the definition on Wikipedia,
“Web 2.0 is the business revolution in the computer industry caused by the move to the Internet as a platform, and an attempt to understand the rules for success on that new platform.”
Or, in simpler terms, Web 2.0 provides a platform for achieving greater collaboration between people (read resources) via technology tools such as wikis, blogs, online automation etc., that promote knowledge sharing, information retrieval & automation of daily jobs.
Now consider this simple definition in terms of the daily job done by a typical support analyst. The application support team needs a knowledge management tool, a reference and error database, quick retrieval of information, and automation tools for proactive monitoring & related work.
Let's have a quick roundup of some daily scenarios showing how a typical application support analyst (Josh) works and how he can make his life easier by using Web 2.0 tools.
- Incident Management – Josh is told about an incident by the helpdesk and immediately looks up the support wiki to find out whether there have been related incidents, and finds related information with a proper resolution path. Josh executes the steps in the resolution plan and quickly restores the service. He adds his experience of dealing with the incident to the wiki and makes sure that the wiki is up to date.
- Service Management – An important task of service management is key stakeholder management and keeping stakeholders aware of progress. Mr Service Manager uses a pre-defined mailing list (e.g., Mailman) and email templates to keep the stakeholders aware of the issue. That way he saves time and ensures consistency in the communications sent. The RCA is verified over the wiki, approved and stored in the knowledge base for future use.
- Automation of daily tasks & automated reports – Josh does many tasks every day, such as conducting the health check of the systems, generating the health report and sending it out. He uses custom-built automated tools to make these tasks simpler and integrates them with the reporting system that is available to stakeholders. A simple example is an automated report that retrieves the order status per hour and displays it on a near-real-time graph, so the business people can track and retrieve the information as and when they need it, rather than raising a request with ASG and waiting for it to be fulfilled. Mind you, one of the important aspects of Web 2.0 is to make information available socially to a community and empower the users. Josh has done exactly that by using the automated reporting system and making the information available to the community.
- Sharing information & updates in the project – Josh also uses blogs to update the community about project happenings, the latest reports and other updates. The community participants subscribe via RSS or Atom feeds and keep themselves updated.
- Conducting training – All the training material is uploaded to intranet video streaming servers, so all users who need the training can go straight to the intranet site, register themselves and take the training at their convenience.
- Making a product implementation or release on production – While doing the work on production to release software or implement a product, Josh can use a tool like coveritlive.com to cover the progress of the work while the rest of the team tracks it online, letting Josh concentrate on the work rather than talking on a conference call.
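To make the hourly order status report from the automation scenario a little more concrete, here is a minimal sketch of the aggregation step only. It assumes each order is available as a (timestamp, status) pair; the sample data, the status names and the function name are all illustrative, and feeding the result into a graph is left out.

```python
from collections import Counter
from datetime import datetime


def orders_per_hour(orders):
    """Count orders per (hour bucket, status).

    `orders` is an iterable of (placed_at, status) pairs, where placed_at
    is a datetime. Each timestamp is truncated to the start of its hour.
    """
    counts = Counter()
    for placed_at, status in orders:
        hour = placed_at.replace(minute=0, second=0, microsecond=0)
        counts[(hour, status)] += 1
    return dict(counts)


# Hypothetical sample data standing in for a query against the order system.
sample_orders = [
    (datetime(2024, 1, 1, 9, 5), "shipped"),
    (datetime(2024, 1, 1, 9, 40), "shipped"),
    (datetime(2024, 1, 1, 10, 1), "pending"),
]

for (hour, status), count in sorted(orders_per_hour(sample_orders).items()):
    print(hour.strftime("%Y-%m-%d %H:00"), status, count)
```

Run on a schedule and published to a shared page, a table like this gives the business users the self-service view described in that scenario.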
Above are some of the scenarios where Web 2.0 tools and technology could be put to good use to increase effectiveness and efficiency within the project.
Obviously, as I said in my earlier posts, the usage of the above and how well you can combine the two depends entirely on the project you are working on and its demands. The simple fact is you don’t want to land in a situation where your work is worth $100 and your spend on technology adoption is more than $100!
Cheerio !
Effective incident management – Detect – Diagnose – Resolve framework
In most IT organizations that specialize in support and maintenance contracts for the customer’s IT and software estate, most of the emphasis is given to achieving 100% availability and adhering to Service Level Agreements.
However, one of the most important aspects of Service Management that helps achieve the above is having an effective incident management system.
The ITIL definition of an Incident is as follows,
"any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service"
A fair enough definition, simple to understand and crystal clear!
In my opinion, to make the incident management process as effective in operations as possible, you need to have a basic framework ready that obviously involves the right people, the right processes and the right tools. In this blog post, I will concentrate on the process part of these three.
There are three main parts of the incident management process.
- Detect – detecting the occurrence of an incident and understanding its nature & implications
- Diagnose – diagnosing the cause and carrying out the investigation to find a solution
- Resolve – resolving the incident by putting either a permanent solution or a workaround in place

I would like to go a little deeper in explaining the tasks that would typically be done by the support team in the above phases of incident management.
Detect
First of all, it is very important to establish whether an incident has actually occurred. What I have seen many times is that someone reports an incident and the team starts investigating, and potentially wastes time investigating a non-issue.
In this phase, it is really important to understand and establish the nature of the incident, define the context in terms of impact & urgency, assign a priority to the incident and carry out a quick impact analysis.
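One common way to turn impact & urgency into a priority is a simple lookup matrix. The sketch below assumes three levels for each and a 1 to 5 priority scale; the exact values are an assumption for illustration and should be replaced by whatever your contract or SLA defines.

```python
# Assumed matrix: (impact, urgency) -> priority, where 1 is the most severe.
# These values are illustrative, not taken from any particular standard.
PRIORITY_MATRIX = {
    ("high", "high"): 1,
    ("high", "medium"): 2,
    ("high", "low"): 3,
    ("medium", "high"): 2,
    ("medium", "medium"): 3,
    ("medium", "low"): 4,
    ("low", "high"): 3,
    ("low", "medium"): 4,
    ("low", "low"): 5,
}


def assign_priority(impact, urgency):
    """Look up the priority for an incident; raise on unknown levels."""
    key = (impact.strip().lower(), urgency.strip().lower())
    if key not in PRIORITY_MATRIX:
        raise ValueError(f"unknown impact/urgency combination: {key}")
    return PRIORITY_MATRIX[key]


print("Priority:", assign_priority("High", "medium"))
```

Keeping the matrix as plain data makes it easy to review with the customer and to change without touching the logic.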
Use of technology & tools is very important in this phase. It is always recommended to have proactive monitoring in place to keep an eye on the system components, with enough alarms and alerts to inform the respective teams of any potential issues / incidents within the system components and associated services.
Some of the suggested tasks that need to be done immediately after establishing the incident context are as follows:
- Understand whether this is a manual report or an automated trap. In all cases, if it is a manual report of an issue, you have already made some customer unhappy with your service!
- If this is an automated alert then you have done a good job. Now establish the source of the alert and in what scenarios the alert or alarm is triggered.
- Establish whether this alert or alarm has turned into an incident. Sometimes proactive monitoring raises one-off traps and the system goes back to a stable state.
- If an incident has been detected, establish the nature of the incident, do a quick impact analysis and understand the business implications of the incident
- Inform key stakeholders of the incident occurrence
- Refer to the known error database and knowledge base for any information / clues that might help you in diagnosing the cause and resolving it.
- Proceed with diagnosis
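For the step about telling a one-off trap apart from a real incident, one simple approach is to require several consecutive failed checks before raising an incident. The sketch below assumes the monitoring tool can hand you a chronological list of pass/fail results; the threshold of 3 is an arbitrary illustrative value.

```python
def is_incident(check_results, threshold=3):
    """Return True if there are at least `threshold` consecutive failures.

    `check_results` is a chronological iterable of booleans, where True
    means the check passed and False means it failed.
    """
    streak = 0
    for passed in check_results:
        streak = 0 if passed else streak + 1
        if streak >= threshold:
            return True
    return False


# A single blip followed by recovery versus a sustained failure.
one_off_trap = [True, False, True, True]
sustained_failure = [True, False, False, False]
print(is_incident(one_off_trap), is_incident(sustained_failure))
```

A higher threshold cuts false alarms at the cost of slower detection, so it is worth tuning per component.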
Diagnose
Once you have established the context of the incident and informed the key stakeholders of its nature, it is really important to proceed with diagnosing the cause of the incident and doing a thorough investigation to resolve it.
Some of the key tasks that need to be done as a part of diagnosis are as follows:
- It is expected that the support team will have a checklist for a health check of the system that helps them understand the cause of the incident and whether the cause lies within the supported components or not. If you have one, the first thing to do is run through the checklist and perform the most common tasks.
- Most probably the common tasks will be checking your servers individually, machine-to-machine connectivity, network checks via telnet & ping, and HTTP checks for the web pages. Do them and see if you can find the cause. If the cause is obvious, you are most likely to find it within 5-10 minutes if you have a good health check plan and a supporting checklist for managing incidents.
- Visit the log files and error files – in all likelihood they contain information about the errors and misbehaviour of the system
- Check the components individually for errors and establish whether the cause can be isolated to a component / system / software piece or any other entity.
- Prepare a resolution plan – mind you, during incident management the topmost priority is quick service restoration. The RCA can always be done later, as long as the information and evidence are kept secure during the resolution process.
- Inform the stakeholders of the progress of the incident.
- Proceed to resolution
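The connectivity and HTTP checks mentioned above are easy to script so they can be run in seconds during an incident. The following is a minimal sketch using only the Python standard library: a TCP connect stands in for the manual telnet test, and a plain GET stands in for the web page check. Ping (ICMP) is left out because it usually needs elevated privileges. The host names and URL in the usage comment are placeholders, not real systems.

```python
import http.client
import socket
import urllib.error
import urllib.request


def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_http(url, timeout=5.0):
    """Return True if a GET on the URL succeeds with a 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return False


def run_health_check(tcp_targets, http_targets):
    """Run all checks and return a {target: passed} dictionary."""
    results = {}
    for host, port in tcp_targets:
        results[f"tcp://{host}:{port}"] = check_tcp(host, port)
    for url in http_targets:
        results[url] = check_http(url)
    return results


# Example usage with placeholder targets (replace with your own components):
# results = run_health_check(
#     tcp_targets=[("db.example.internal", 5432), ("app.example.internal", 8080)],
#     http_targets=["https://portal.example.com/health"],
# )
# for target, passed in results.items():
#     print("OK  " if passed else "FAIL", target)
```

The target lists are the part that belongs in your incident checklist; the script only removes the typing.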
Resolve
This phase makes the incident manager's life easy only if, and it is a big if, you did your work diligently in the earlier two phases. When you enter this phase it is expected that you have already found the cause of the incident, you have a plan in place to resolve it and you are ready to implement the resolution.
Some of the most common tasks expected within this phase are as follows. They are mostly process-oriented and obviously vary from project to project / team to team.
- Gather evidence, back up log files & anything else that would help the later RCA
- Obtain the necessary approvals & sign-offs for the incident resolution plan, e.g., if the resolution involves bouncing a server, can this be done in the daytime? Does the business manager agree to a daytime bounce?
- Implement the resolution on production
- Inform the stakeholders of the resolution and give an update on the expected RCA completion timeline
- Update the knowledge base with the resolution steps & common symptoms of the problem, which will help you detect it quicker if it occurs again
- Progress with the RCA on the basis of the gathered evidence and complete the analysis. Add it to the knowledge base and, if required, proceed to the permanent resolution, e.g., code fix, patch upgrade, software upgrade etc.
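The evidence gathering step is worth scripting in advance, because it is the one most likely to be skipped under pressure. Here is a minimal sketch that copies a list of log files into a timestamped folder before the fix is applied. The function name, folder layout and incident ID format are assumptions for illustration.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def preserve_evidence(log_paths, evidence_root, incident_id):
    """Copy existing log files into <evidence_root>/<incident_id>_<UTC stamp>/.

    Missing files are skipped. Files with the same name from different
    directories would overwrite each other in this simple version.
    Returns the list of copied destination paths.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = Path(evidence_root) / f"{incident_id}_{stamp}"
    target.mkdir(parents=True, exist_ok=False)
    copied = []
    for source in map(Path, log_paths):
        if source.is_file():
            destination = target / source.name
            shutil.copy2(source, destination)  # copy2 keeps file timestamps
            copied.append(destination)
    return copied


# Example usage with placeholder paths:
# preserve_evidence(["/var/log/app/app.log"], "/var/evidence", "INC-123")
```

Copying before a server bounce matters because some logs are truncated or rotated on restart.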
In a real-world scenario, where teams are working under the pressure of supporting business-critical systems such as banks, share trading sites & financial transaction sites, it is really important to have thorough checklists, proactive monitoring and very strong processes based on the above points to ensure that you resolve your incident to a satisfactory level.
Beyond having a strong incident management process based on the above three parts, it is equally essential to complete a thorough RCA of the incident and not to allow a repeat incident of the same nature.
In the next post in this series on incident management, I would like to cover various RCA techniques and how you should put them into practice.
Huh .. it's well over midnight now and my small daughter is crying a bit in her sleep. So time to go back and execute the duty of a father … !
Cheerio