Skip to main content

Incident Response Process

This document describes our incident handling process (inspired by work elsewhere in MOJ like the Cloud Platform).

Confirm that an event constitutes an incident

We define an incident as an event which:

  • Requires immediate response to return normal service
  • Severely degrades user experience of the service
  • Compromises application security, resulting in a breach or potential breach
  • Has the potential for loss or compromise of data

Examples:

  • Live service is unreachable due to denial of service
  • Connectivity to a database is down and so services cannot be accessed
  • A problem in one service means another is no longer functioning e.g. api gateway in Sirius causing issues in Use An LPA
  • Misconfiguration means that secrets are exposed on a live service
  • Unauthorized access to an API which can result in attackers accessing data

Declare an incident

From your teams channel you can use the OPG incidents Slack tool to declare an incident.

/opg-incident Something's happened

Slack will pop open a form to to fill in extra information. At this stage, you only need to select whether its a live incident or something that’s already happened that you’d like to report. All the other information can be filled in later.

The Slack bot will automatically post a message into the #opg-incident channel and create an entry on the incident response site. This marks the offical start of an incident

Immediate first steps

As soon as the tool has posted a message into #opg-incident you should do these steps immediatly

  1. Use the create communtications channel button in the message (only available for active incidents)
  2. Page the on-call incident lead
  3. If appropriate, start a conference call for further discussion

Ensure roles have been assigned

There are two roles that must be filled for every incident. They are the Incident Lead and Incident Scribe

In rare cases, the same person might fill both roles but this is discouraged.

Incident Reporter

This is whoever discovered the issue and declared the incident.

Incident Lead

The incident reporter should call in a designated incident lead, typically a Technical Architect, Lead or Senior WebOps or Lead Dev.

The incident lead is the lead coordination role and should be someone who has experience running incidents in OPG. It is preferable if they are not in a team affected by the incident so that they can provide an unbiased view.

Responsibilities

  • Coordinate our response to the incident
  • Decide on any additional roles required (e.g. a communications lead may be required)
  • Ensure that all required roles are filled
  • Ensure that all tasks which need to be handled are being done
  • Make the final decision whenever we need to choose a course of action
  • Set the schedule for any regular team check-ins, if those are deemed necessary
  • Declare the incident closed, when appropriate
  • Ensure that the post-incident process is followed

NB: The incident lead needs to ensure that things are being done, they do not need to do them

Once appointed, the incident lead should update the following information using the Slack bot in the dedicated communications channel.

  • incident summary @opg-incident-response summary <describe the incident>
  • incident severity @opg-incident-response sev <Critical, Major, Trivial>
  • incident impact @opg-incident-response impact <describe the impact>

Incident Scribe

The incident scribe can be someone in a team affected by the incident. Ask for volunteers verbally or via Slack.

Responsibilities

The scribe is responsible for keeping a log of the incident, including:

  • Important events
  • Discussion topics
  • Decisions
  • Actions
  • Results of actions/investigations

NB: this log is not intended to be a verbatim transcript of discussions. Rather, things like “xxx suggested the disk might be full. yyy to investigate and report back”

Once appointed, the scribe should update the incident header at the top of the channel with a link to the living document on the incident response site.

The scribe can pin important message in the incident channel and the Slack tool will automatically add those into the timeline summary. They can also add actions to the incident log using the command @opg-incident-response action <action_description>

When conversation happens verbally, it is the scribe’s responsibility to ensure anything that needs to be logged is written up in the incident channel.

Communications Lead

Depending on the impact/duration of an incident, it may be desirable to appoint a communications lead.

It is usually best if this is a member of the product profession to facilitate clear communications with the business.

Responsibilities

  • Communicate the incident out to those who are impacted
  • Decide what, how and how often to issue updates
  • Give updates at regular intervals
  • Field enquiries so the team can focus on fixing the incident without interruptions
  • Disseminate a post-incident report

You can update the Statuspage via the following command in the dedicated incident channel

  • @opg-incident-response statuspage

Transferring roles

It may be necessary to transfer roles from one team member to another, e.g. during long-running incidents. In this case, it is the responsibility of whoever is in a role to ensure that someone else takes it over.

Whoever assumes a role should announce it in the incident channel, so that the team is aware and updated

Fixing the problem

Please bear in mind that not every incident requires the whole team to be involved.

Things you may want to consider:

  • Start a conference call to hold ongoing discussions
  • Swarm on the problem
  • Step back if you are not contributing and you don’t have one of the key roles
  • Bring in members of other teams with relevant knowledge
  • Ask in #opg-all-team-digital, #opg-webops-community or #opg-developers for relevant experience from outside your team

End the incident

When work on the problem has ceased, either because the problem has been resolved or because resolution is blocked until a later date, the incident lead should update the Slack tool to close the incident.

@opg-incident-response close

This marks the official end of the incident. If users were notified of the incident, the communications lead should send an appropriate message via the same channels to tell them it’s over.

Post Incident

After the incident is resolved:

  • Schedule a root cause analysis (RCA) session to identify where we can improve
  • Document the outcome of the RCA in the OPG Security confluence space
  • Share the RCA outcomes with the delivery team and wider OPG Digital so they can learn from it too

Records and retention

The dedicated incident channel should be archived once the issue is fixed.

The incident response site will keep an archive of the messages pinned by the incident scribe.

Where appropriate, the root cause analysis may shared with the Amazon Technical Account Managers.

This page was last reviewed on 14 August 2024. It needs to be reviewed again on 13 November 2024 by the page owner #opg-webops-community .
This page was set to be reviewed before 13 November 2024 by the page owner #opg-webops-community. This might mean the content is out of date.