Posts Tagged inlife

Incident checklists – why one should have it & how one should prepare it

A few weeks ago I wrote about how you can manage incidents effectively. The post is available here for your reading. In it I described the Detect, Diagnose, Resolve framework and how you can use it to manage incidents.

Once you are in the incident management process, it is very important to find the cause of the incident quickly. Until you know where the cause lies, diagnosing & resolving the issue will take a long time.

An ‘incident checklist’ is the most important tool every application support analyst should have, so that they can start the incident analysis quickly and rule out the obvious causes of the errors.

However, what is the ideal way to prepare, and subsequently maintain, the incident checklists?

Why

I guess I do not need to convince the application support community of the need to have incident checklists prepared for their use. They are handy documents / tools that give you essential information to use during an incident, such as:

  • Important phone numbers & emails
  • Stakeholder lists
  • Technical task list for carrying out health check
  • Quick tips to help make decisions
  • Escalation paths
  • Other support group contacts

Once you have a good checklist containing the above details, review & update it as often as needed to ensure that it stays useful.

How

The most important aspects of the incident checklist are that it should:

  • Not be overly cluttered
  • Be simple and easy to understand
  • Give clear instructions on do’s & don’ts
  • Not contain any sensitive data, e.g., passwords, user IDs etc.

You might be wondering what the best format for the incident checklist is. Should it be an MS Word document, an Excel sheet, a PowerPoint deck, an image, a PDF or an online tool?

In my opinion, “flight safety cards” are the best example of incident checklists. They give all the necessary information on how one should react in an emergency: the nearest exits, the do’s & don’ts during the crisis, and so on.

Support teams should follow this example and prepare their incident checklists so that they satisfy the criteria I mentioned earlier.

Cheers


Effective incident management – Detect – Diagnose – Resolve framework

In most IT organizations that specialize in support and maintenance contracts for a customer’s IT and software estate, most of the emphasis is on achieving 100% availability and adhering to Service Level Agreements.

However, one of the most important aspects of service management that helps achieve this is an effective incident management system.

The ITIL definition of an incident is as follows:

"any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service"

A fair enough definition: simple to understand and crystal clear!

In my opinion, to make the incident management process as effective as possible in operations, you need a basic framework that involves the right people, the right processes and the right tools. In this blog post, I will concentrate on the process part of these three.

There are three main parts of the incident management process.

  • Detect – detecting the occurrence of an incident and understanding its nature & implications
  • Diagnose – diagnosing the cause and carrying out the investigation to find a solution
  • Resolve – resolving the incident by putting either a permanent solution or a workaround in place

[Figure: Detect – Diagnose – Resolve (DDR) incident management framework]

I would like to go a little deeper and explain the tasks that the support team would typically carry out in each of the above phases of incident management.

Detect

First of all, it is very important to establish whether an incident has actually occurred. Many a time I have seen someone report an incident and the team start investigating straight away, potentially wasting time on a non-issue.

In this phase, it is really important to understand and establish the nature of the incident, define its context in terms of impact & urgency, assign a priority to the incident and carry out a quick impact analysis.
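As a rough illustration, the impact & urgency assessment can be turned into a priority with a simple lookup. This is a minimal Python sketch; the three levels, the P1 to P5 labels and the `assign_priority` helper are all assumptions made for the example, so substitute whatever matrix your own service contract defines.

```python
# Hypothetical 3x3 impact/urgency matrix. The level names and the P1..P5
# labels are assumptions for illustration only.
LEVELS = ("high", "medium", "low")

PRIORITY_MATRIX = {
    ("high", "high"): "P1",
    ("high", "medium"): "P2",
    ("high", "low"): "P3",
    ("medium", "high"): "P2",
    ("medium", "medium"): "P3",
    ("medium", "low"): "P4",
    ("low", "high"): "P3",
    ("low", "medium"): "P4",
    ("low", "low"): "P5",
}


def assign_priority(impact: str, urgency: str) -> str:
    """Look up the incident priority for an (impact, urgency) pair."""
    key = (impact.lower(), urgency.lower())
    if key not in PRIORITY_MATRIX:
        raise ValueError(f"impact and urgency must each be one of {LEVELS}")
    return PRIORITY_MATRIX[key]
```

Keeping the matrix as plain data means the support team can review and change it without touching the lookup logic.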

The use of technology & tools is very important in this phase. It is always recommended to have proactive monitoring in place that keeps an eye on the system components, with enough alarms and alerts to inform the respective teams of any potential issues / incidents within the system components and associated services.

Some of the suggested tasks that need to be done immediately after establishing the incident context are as follows:

  • Understand whether this is a manual report or an automated trap. If it is a manual report of an issue, you have in all likelihood already made some customer unhappy with your service!
  • If this is an automated alert then you have done a good job. Now establish the source of the alert and the scenarios in which the alert or alarm is triggered.
  • Establish whether this alert or alarm has turned into an incident. Sometimes proactive monitoring raises one-off traps and the system goes back to a stable state.
  • If an incident has been detected, establish the nature of the incident, do a quick impact analysis and understand the business implications of the incident
  • Inform key stakeholders of the incident occurrence
  • Refer to the known error database and knowledge base for any information / clues that might help you diagnose the cause and resolve it.
  • Proceed to diagnosis
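The point above about one-off traps can be sketched as a simple rule: treat an alert as an incident only when the check keeps failing. The following is a minimal Python sketch, assuming a hypothetical `is_incident` helper and an arbitrary threshold of three consecutive failures; the right threshold depends on your monitoring tool and on how critical the service is.

```python
from typing import Iterable


def is_incident(check_results: Iterable[bool], threshold: int = 3) -> bool:
    """Return True once `threshold` consecutive checks have failed.

    Each item in check_results is True if that check passed and False if
    it failed. A single failure followed by a pass is treated as a one-off
    trap and resets the counter.
    """
    consecutive_failures = 0
    for passed in check_results:
        if passed:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= threshold:
                return True
    return False
```

For example, `is_incident([True, False, True, True])` is `False` (a one-off trap), while `is_incident([False, False, False])` is `True`.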

 

Diagnose

Once you have established the context of the incident and informed the key stakeholders of its nature, it is really important to proceed with diagnosing the cause of the incident and carrying out a thorough investigation in order to resolve it.

Some of the key tasks that need to be done as part of the diagnosis are as follows:

  • It is expected that the support team will have a checklist for doing a health check of the system, which helps them understand the cause of the incident and whether or not the cause lies within the supported components. If you have one, the first thing to do is run through the checklist and perform the most common tasks.
  • Most probably the common tasks will be checking your servers individually, machine-to-machine connectivity, network checks via telnet & ping, and HTTP checks for the web pages. Do them and see if you can find the cause. If the cause is obvious, you are most likely to find it within 5-10 minutes, provided you have a good health check plan and a supporting checklist for managing the incidents.
  • Visit the log files and error files. In all likelihood they contain information about the errors and the misbehaviour of the system
  • Check the components individually for errors and establish whether the cause can be isolated to a component / system / software piece or any other entity.
  • Prepare a resolution plan. Mind you, during incident management the topmost priority is a quick service restoration. The RCA can always be done later, as long as the information and evidence are kept secure during the resolution process.
  • Inform the stakeholders of the progress of the incident.
  • Proceed to resolution
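The common connectivity checks mentioned above can be scripted so that the health check part of the checklist runs in seconds. Below is a minimal Python sketch using only the standard library. The helper names (`tcp_port_open`, `http_ok`, `run_health_check`) are hypothetical, and the targets you pass in are your own. Ping is left out on purpose, because ICMP usually needs elevated privileges or an external command; the TCP connection test stands in for the telnet-style check.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (telnet-style check)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # HTTP 4xx/5xx, network errors and malformed URLs all count as failures.
        return False


def run_health_check(tcp_targets, http_targets):
    """Run every check and return a dict mapping check name to True/False."""
    results = {}
    for host, port in tcp_targets:
        results[f"tcp {host}:{port}"] = tcp_port_open(host, port)
    for url in http_targets:
        results[f"http {url}"] = http_ok(url)
    return results
```

Calling `run_health_check` with a list of `(host, port)` pairs and a list of URLs returns one pass/fail entry per check, which is easy to paste into the incident record.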

 

Resolve

This phase makes the incident manager’s life easy, but only if (and it is a big if) you did your work diligently in the earlier two phases. When you enter this phase, it is expected that you have already found the cause of the incident, have a plan in place to resolve it and are ready to implement that plan.

Some of the most common tasks expected within this phase are as follows. They are mostly process oriented and obviously vary from project to project / team to team.

  • Gather evidence, back up log files & anything else that would help the later RCA
  • Obtain the necessary approvals & sign-offs for the incident resolution plan, e.g., if the resolution involves bouncing a server, can this be done in the daytime? Does the business manager agree to a daytime bounce?
  • Implement the resolution on production
  • Inform the stakeholders of the resolution and update them on the expected RCA completion timeline
  • Update the knowledge base with the resolution steps & the common symptoms of the problem, so that you can detect it quicker if it occurs again
  • Progress with the RCA on the basis of the gathered evidence and complete the analysis. Add it to the knowledge base and, if required, proceed to the permanent resolution, e.g., code fix, patch upgrade, software upgrade etc.
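The first task in the list, gathering evidence before the fix changes anything, is easy to automate. Here is a minimal Python sketch, assuming a hypothetical `preserve_evidence` helper and a folder layout of my own choosing (incident ID plus a UTC timestamp); adapt the paths and naming to your own environment.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def preserve_evidence(log_paths, evidence_root, incident_id):
    """Copy each existing log file into <evidence_root>/<incident_id>_<UTC timestamp>/.

    Returns the list of copied file paths. Missing files are skipped.
    Files that share a name would overwrite each other in this simple sketch.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target_dir = Path(evidence_root) / f"{incident_id}_{stamp}"
    target_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for log_path in map(Path, log_paths):
        if log_path.is_file():
            # copy2 also preserves modification times, which help when
            # reconstructing the timeline during the RCA.
            copied.append(Path(shutil.copy2(log_path, target_dir / log_path.name)))
    return copied
```

Running this before a server bounce gives the later RCA a snapshot of the logs as they were at the time of the incident.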

In a real-world scenario, where teams work under the pressure of supporting business-critical systems such as banks, share trading sites & financial transaction sites, it is really important to have thorough checklists, proactive monitoring and very strong processes based on the above points to ensure that you resolve your incidents to a satisfactory level.

In addition to having a strong incident management process based on the above three parts, it is equally essential to complete a thorough RCA of the incident so as not to allow a repeat incident of the same nature.

In the next post in this series on incident management, I would like to cover various RCA techniques and how you should put them into practice.

Huh .. it’s well over midnight now and my small daughter is crying a bit in her sleep. So time to go back and execute the duty of a father … !

Cheerio


Agile ready support processes – Key AIS recommendations

 

Yesterday I posted an article about how to carry out Acceptance into Service (AIS) testing for a product that is being developed using the agile framework. I am continuing my thoughts on the same today and want to point out a few key recommendations for following the process I put forward yesterday.

The agile development framework supports a proven, collaborative way of working. It encourages close interaction between the various teams that design & develop the product, test it, support it and so on, so that the end product is as complete as possible in every aspect, has as few bugs as possible and is delivered with a fast turnaround time.

I would like to put forward the following recommendations to help support teams carry out the AIS as effectively as possible and ensure better serviceability of the product when it comes in-life:

  • The support teams should act as an internal customer to the product / project development teams and have a very close working relationship with the individuals in the development team
  • The support team should provide the acceptance criteria upfront and make sure that all the demands and recommendations on which the sign-off is based are clearly understood by the development team.
  • Have a daily interaction with the development teams to ensure that all the recommendations are being incorporated in the actual code development.
  • Do a weekly verification cycle by acceptance testing the development package on an agreed environment / platform. The development team should be able to demonstrate evidence that the recommendations have been implemented.
  • The acceptance testing does not necessarily need to be done on the pre-production environment. In all probability most of the test cases can be carried out on the development environments, and the results should be accepted as credible proof.
  • The final verification should be a quick one, based on the evidence observed in the development testing, so that the turnaround time is shorter and the agile framework can release the product to live
  • Share the responsibility with the development team and try to understand the application / product while it is being developed. Understand the key components and document them as part of the support manual. This can later be referenced during incident and problem management.

The following lists summarize the last recommendation, about shared responsibility, in three categories to make it easy to understand. The example covers the documentation involved in the process.

Sample list of documentation from Delivery

  • Design overview
  • e2e journey
  • Agreed user stories
    Functional

Sample list of documentation written / contributed by Support teams

  • Acceptance criteria
  • Various error conditions
  • Sample of logs coded
  • Platform protection information
  • Monitoring requirements etc

Format in which documents can be delivered / maintained / created

  • WIKI site
  • Word document
  • Excel template
  • Any readable format of information :-)


Agile ready support processes – Acceptance Into Service

There is currently a big push within IT organizations to use the ‘agile development’ methodology for delivering new products & projects. Customers demand changes rapidly, and that is precisely what the agile development framework offers to service providers.

But, and it is a big but, the ‘agile development framework’ talks only about the development of software. It does not talk about the processes that are required post go-live, or about the related testing effort needed to accept the product into service.

Yeah, I am talking precisely about the support processes that need to be aligned with the development effort and must be completed before the go-live of a product. In fact, I am talking about the “Acceptance into Service” testing that needs to be done before the product is accepted into service.

In the traditional world, where development follows the waterfall model, a window for acceptance into service is normally reserved a week before the scheduled go-live to complete activities such as knowledge handover to the ASG teams, creating support documents and the related testing. However, the agile framework does not allow such a window for acceptance testing at the end of the development cycle and expects the teams to collaborate while the development is being carried out.

The following lists summarize how I perceive the difference between the way Acceptance into Service works in the agile way as against the traditional way.

Waterfall Model

  • Delivery team develops the application
  • Hands over to support team
  • Support group tests, e.g., buster testing, resilience testing, log sampling & analysis etc
  • Feeds back the issues
  • Delivery team fixes in next release

Agile Model

  • Delivery team still develops the application, but the support team gives requirements upfront that will help them support the application better post-live
  • Collaborative working rather than hand over; the testing is done jointly, daily, against the acceptance criteria
  • Weekly checkpoint to ensure & track the progress against the pre-decided acceptance criteria
  • The AIS phase itself becomes a verification phase that checks whether the application works as desired on a production-level environment
  • All the issues found will be fixed in the next development iteration

 

In a nutshell, my proposed agile method for performing acceptance into service is as follows:

  • AIS is no longer a separate phase but a joint effort to deliver a bug-free, support-ready & self-aware application on production
  • High level AIS process
    • Agree on the acceptance criteria upfront
    • Verify compliance
    • Understand the application as it evolves
    • Build the support repository as the application evolves
    • Build the support document together
    • AIS tasks are now part of the delivery sprint
    • Resilience testing is done as a separate phase if need be
  • AIS is a process to help delivery teams deliver better software, rather than one for pointing out mistakes / finding errors

Hope you find the above useful as a summary.

This is of course my first post in this series, and I will try to post more information on the agile processes and related testing mechanisms, and on how they help you test applications thoroughly before they are accepted into service for support.


Top 5 myths related to application support domain

Have you ever noticed that in a multi-service IT organization that delivers new projects / products as well as IT support services to customers, more focus is normally given to delivering products than to providing high-quality maintenance & support services? Well, if you have, then you are not the only one!

What I have seen, at least in Indian IT companies, is that professionals working in the IT service support domain are compared unfavourably with professionals delivering new functionality to customers. I suppose the roots of such unfair comparisons lie in a few myths that persist in the IT area.

  1. Call centre perception – Yeah, most of the people I interviewed had no clear idea of what working in support projects involves. I asked them, ‘What do you think working in support means for you?’ The quickest answer I got back was, ‘I think it’s like working in a call centre; you need to be on the phone all the time and answer customer queries. I am not interested in becoming a phone operator!’ Well, to a certain extent they are right. If you are providing support services from an off-site location (offshore or nearshore), then yes, it does involve a lot of speaking on the phone. But if you look at the way IT works today, speaking on the phone, joining conference calls etc. has simply become the way of working.
  2. You do not learn much when you work in support – This is the most common excuse given by professionals (at least young professionals) when they are asked to work on support projects. They feel that working in support does not offer them opportunities to enrich their technical skills. I certainly do not agree with this! In my view, working in the support function gives you an opportunity to look at things differently, and a closer, practical view of how projects work on the production platform and how a product is used in practice. You get to know the projects and products from close up, and you are always on your toes to keep the software / service working. Imagine what it would be like to have an outage on a banking site!
  3. You do not feel challenged enough while working in support – This is another common myth that has taken root in the minds of fellow software professionals. They feel that working in support projects does not offer enough to challenge their intellect. They feel that the work is repetitive and routine, and that they would get bored after some time on the project. I certainly do not think so! Put this question to the people who support online trading on a bank website or a stock market website, where even a minute’s outage may mean losing millions of dollars in business. Is it not a challenge to keep them running?
  4. My professional growth will stagnate – Lots of people whom I interviewed for projects complained to me that if they joined a support project, their growth opportunities would be limited and they would be stereotyped for the rest of their career. To a certain extent I agree with the latter part (i.e., being stereotyped) but I definitely do not agree with the first argument. The growth of an individual in an IT organization depends on how the person senses an opportunity and makes use of it, not on what project he or she works on. After all, you are likely to get better recognition for saving a bank site from falling over than for writing a piece of software!
  5. My value amongst my peers will go down! – This is one of the silliest excuses I have ever come across, and I really did come across it. The person who told me this said that he would feel inferior compared to a person working in delivery and writing Java code. Fuf.. ! I did not have any answer for this; all I would say is that inferiority is in the mind and depends on one’s confidence rather than on natural capability.

Interesting enough, right? Let me know if you have come across any more myths that make professionals say no to working in support.

As before, I am pasting here a fantastic Dilbert comic strip that gives a perfect example of how the world perceives Tech / Application support!

[Image: Dilbert comic strip on tech support]
