A Manager's Guide to Handling Major IT Disasters

 
 
By Chris Preimesberger  |  Posted 2015-08-28

    1 - A Manager's Guide to Handling Major IT Disasters

    To effectively handle disaster incidents and outages, IT managers need to identify major incident response workflows and have a reliable response team in place.

    2 - Clearly Define a Major Incident

    When an issue has a severe business impact on users, it can be categorized as a major incident: one that forces an organization to deviate from its existing incident management processes. High-priority incidents are often wrongly perceived as major incidents, probably because of the absence of clear ITIL guidelines. To avoid any confusion, a manager must define a major incident clearly based on factors such as urgency, impact and severity.
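
    As a rough illustration of such a definition, the sketch below separates major incidents from high-priority ones. The three-level scale, the `Incident` fields and the rule that all three factors must be high are assumptions made for this example; neither ITIL nor this slide show prescribes them, and each organization would set its own thresholds.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale for each factor."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Incident:
    title: str
    urgency: Level
    impact: Level
    severity: Level


def classify(incident: Incident) -> str:
    """Return 'major', 'high-priority' or 'standard'.

    Assumed rule: an incident is major only when urgency, impact and
    severity are all HIGH. A single HIGH urgency or impact makes it
    high-priority, which is not the same thing as major.
    """
    factors = (incident.urgency, incident.impact, incident.severity)
    if all(level == Level.HIGH for level in factors):
        return "major"
    if incident.urgency == Level.HIGH or incident.impact == Level.HIGH:
        return "high-priority"
    return "standard"
```

    Under these assumed rules, an airline check-in outage rated high on all three factors is classified as major, while an urgent fault on one executive's laptop stays high-priority.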

    3 - Have Exclusive Workflows

    Implementing a robust workflow helps you restore a disrupted service quickly, and a separate workflow for major incidents helps resolve them smoothly. Focus on automating and simplifying the following when you formulate a workflow for major incidents: identifying the major incident; communicating with the impacted stakeholders; assigning the right people to solve it; tracking the major incident throughout its life cycle; escalating upon breach of SLAs; resolving and closing the incident; and generating and analyzing reports.
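
    One way to picture a dedicated workflow is as a small state machine that tracks the incident through its life cycle. The stage names and allowed transitions below are assumptions chosen to mirror the steps listed above, not a standard; reporting is left out because it happens after closure.

```python
from enum import Enum


class Stage(Enum):
    IDENTIFIED = "identified"
    COMMUNICATED = "communicated"
    ASSIGNED = "assigned"
    IN_PROGRESS = "in_progress"
    ESCALATED = "escalated"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Hypothetical transition table: which stages may follow each stage.
ALLOWED = {
    Stage.IDENTIFIED: {Stage.COMMUNICATED},
    Stage.COMMUNICATED: {Stage.ASSIGNED},
    Stage.ASSIGNED: {Stage.IN_PROGRESS},
    Stage.IN_PROGRESS: {Stage.ESCALATED, Stage.RESOLVED},
    Stage.ESCALATED: {Stage.IN_PROGRESS, Stage.RESOLVED},
    Stage.RESOLVED: {Stage.CLOSED},
    Stage.CLOSED: set(),
}


class MajorIncidentWorkflow:
    """Tracks one major incident and rejects out-of-order steps."""

    def __init__(self) -> None:
        self.stage = Stage.IDENTIFIED
        self.history = [Stage.IDENTIFIED]

    def advance(self, target: Stage) -> None:
        if target not in ALLOWED[self.stage]:
            raise ValueError(
                f"cannot move from {self.stage.value} to {target.value}"
            )
        self.stage = target
        self.history.append(target)
```

    Because every step is recorded in `history`, the same object can later feed the tracking and reporting steps; an attempt to jump straight from identification to resolution raises an error instead of silently skipping stakeholder communication.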

    4 - Reel in the Right Resources

    Ensure that your best people are working on major incidents. Because these incidents have such a high impact on the business, clearly define each person's roles and responsibilities. You could have a dedicated or a temporary team, depending on how often major incidents occur. Some organizations have a dedicated major incident team headed by a major incident manager, while others have an ad hoc team of experts from various departments. Your primary objective must be to keep your resources engaged and avoid conflicts of time and priorities.

    5 - Train Your Personnel, Equip Them With the Right Tools

    You don't know when a major incident will strike your IT, but the first step in handling it is being prepared. Divide your major incident management team into sub-teams and train them in major incident management. Assign responsibilities by mapping skills to requirements. Run simulation tests on a regular basis to identify strengths, evaluate performance and address gaps as needed. This will also help your team cope with stress and be prepared when facing real incidents. Equip your team with the right tools, such as smartphones and tablets with seamless connectivity, so that they can work from anywhere during an emergency.

    6 - Configure Stringent SLAs and Hierarchical Escalations

    Define stringent SLAs (service-level agreements) for major incidents. Set up separate response and resolution SLAs with clear escalation points for any breach of the process. In addition, follow a manual escalation process if the assigned technician lacks the expertise to resolve the incident. Moreover, ensure that a backup technician is always available.
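
    The separate response and resolution targets can be checked mechanically. The following is a minimal sketch: the `SlaPolicy` fields, the escalation chain and its order are illustrative assumptions, not values taken from any help desk product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class SlaPolicy:
    response_within: timedelta    # time allowed before first response
    resolution_within: timedelta  # time allowed before resolution


# Hypothetical escalation chain, lowest level first.
ESCALATION_CHAIN = [
    "assigned technician",
    "backup technician",
    "major incident manager",
]


def breaches(
    policy: SlaPolicy,
    opened_at: datetime,
    now: datetime,
    responded_at: Optional[datetime] = None,
    resolved_at: Optional[datetime] = None,
) -> List[str]:
    """Return which SLAs are breached as of `now`.

    A milestone that has not happened yet is measured up to `now`.
    """
    found = []
    if (responded_at or now) - opened_at > policy.response_within:
        found.append("response")
    if (resolved_at or now) - opened_at > policy.resolution_within:
        found.append("resolution")
    return found


def escalate_to(breach_count: int) -> str:
    """Pick the escalation point; the chain is capped at its last entry."""
    level = min(max(breach_count, 0), len(ESCALATION_CHAIN) - 1)
    return ESCALATION_CHAIN[level]
```

    With a 15-minute response target and a four-hour resolution target, an incident that has gone 30 minutes without a response reports a `response` breach, and `escalate_to(1)` hands it to the backup technician in this assumed chain.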

    7 - Keep Your Stakeholders Informed

    Throughout the life cycle of a major incident, send announcements, notifications and status updates to the stakeholders. Announcements in the self-service portal will prevent end users from raising duplicate tickets and overloading the help desk. Also, send hourly or bi-hourly updates during any service downtime caused by a major incident. Have a dedicated line to respond to major incidents immediately and offer support to stakeholders. Use the fastest means of communication, such as telephone calls, direct walk-ins, live chat and remote desktop control, instead of relying on email.

    8 - Tie Major Incidents With Other ITIL Processes

    After a major incident is resolved, perform a root-cause analysis using problem management methods. Then, following the change management process, implement organization-wide changes to prevent similar incidents from occurring in the future. Speed up the entire incident, problem and change management process by using asset management to provide detailed information about the assets involved.

    9 - Improve Your Knowledge Base

    Formulate simple knowledge base article templates that capture critical details such as the type of major incident the article relates to, the latest issue resolved using the article, the owner of the article and the resources needed to implement the solution. Create and track solutions for major incidents separately so that you can access them quickly and with very little effort.
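
    A template of this kind can be as simple as a record with one field per detail. The field names and the lookup helper below are hypothetical, chosen only to mirror the details listed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KnowledgeBaseArticle:
    """Hypothetical template for a major incident solution article."""
    title: str
    incident_type: str            # type of major incident it relates to
    latest_issue_resolved: str    # most recent issue fixed with it
    owner: str                    # person responsible for the article
    required_resources: List[str] = field(default_factory=list)
    major_incident: bool = True   # flag used to track these separately


def find_solutions(
    articles: List[KnowledgeBaseArticle], incident_type: str
) -> List[KnowledgeBaseArticle]:
    """Return only major incident articles matching the given type."""
    return [
        article
        for article in articles
        if article.major_incident and article.incident_type == incident_type
    ]
```

    Keeping the `major_incident` flag on the record is one way to track these solutions separately without maintaining a second knowledge base.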

    10 - Review, Report on Major Incidents

    Document and analyze all major incidents so that you can identify areas for improvement. This will help your team handle similar issues efficiently in the future. Also, generate reports specific to major incidents for analysis, evaluation and decision-making. You could generate the following reports: 1) number of major incidents raised and closed each month; 2) average resolution time for major incidents; 3) percentage of downtime caused by major incidents; and 4) problems and changes linked to major incidents.
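
    The first three reports reduce to simple arithmetic over incident records. The sketch below assumes a hypothetical `MajorIncidentRecord` with open and close timestamps; a real help desk tool would supply its own data model and reporting features.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class MajorIncidentRecord:
    opened_at: datetime
    closed_at: Optional[datetime] = None  # None while still open


def monthly_counts(
    records: List[MajorIncidentRecord],
) -> Tuple[Counter, Counter]:
    """Report 1: incidents raised and closed, keyed by 'YYYY-MM'."""
    raised = Counter(r.opened_at.strftime("%Y-%m") for r in records)
    closed = Counter(
        r.closed_at.strftime("%Y-%m") for r in records if r.closed_at
    )
    return raised, closed


def average_resolution_time(
    records: List[MajorIncidentRecord],
) -> Optional[timedelta]:
    """Report 2: mean time from open to close, ignoring open incidents."""
    durations = [r.closed_at - r.opened_at for r in records if r.closed_at]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def downtime_share(
    major_downtime: timedelta, total_downtime: timedelta
) -> float:
    """Report 3: percentage of all downtime caused by major incidents."""
    if total_downtime == timedelta():
        return 0.0
    return 100.0 * (major_downtime / total_downtime)
```

    For instance, two closed incidents that took two and four hours average three hours, and six hours of major incident downtime out of eight hours of total downtime is a 75 percent share.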

    11 - Document These Processes for Continuous Service Improvement

    It is a best practice to document major incident processes and workflows for ready reference. The documentation could capture details such as the number of personnel involved, their roles and responsibilities, communication channels, tools used for the fix, approval and escalation workflows, and the overall strategy, along with baseline metrics for response and resolution. Top management must evaluate processes on a regular basis to check whether targeted performance levels in major incident management are being met. This can help rectify flaws and support continual service improvement.
 

Major IT outages and disaster incidents can hit organizations large or small; the size and scope of an organization are hardly relevant. Unplanned outages, such as bank transaction server crashes, airline check-in software crashes and stock market outages, hurt everybody, company and customers alike. Under such circumstances, help desks are slammed with calls, adding to the panic and chaos. It becomes a race against time to find a fix, since every hour of outage could translate to thousands, if not millions, of lost dollars. IT staff often find themselves answering calls and replying to emails rather than trying to find a fix. What does it take for a manager to keep a cool head and steer the organization out of the situation? The 10 best practices in this slide show can be used as a guide to deal with major incidents that can happen at any time, night or day. Bookmark this or print it out and keep it in a place where you can find it in case of such an outage. This slide show was compiled using eWEEK archives, ITIL (formerly known as the Information Technology Infrastructure Library) and industry insight from Prithiv Rajkumar Rajan, marketing analyst at ManageEngine.

 
 
 
 
 
 
 
 
 
 
 
