Incident Response Process

This document describes our incident handling process. It is inspired by work elsewhere in MOJ, such as the Cloud Platform.

Confirm that an event constitutes an incident

We define an incident as an event which:

Requires immediate response to return normal service
Severely degrades user experience of the service
Compromises application security, resulting in a breach or potential breach
Has the potential for loss or compromise of data

Examples:

Live service is unreachable due to denial of service
Connectivity to a database is down and so services cannot be accessed
A problem in one service means another is no longer functioning, e.g. the API gateway in Sirius causing issues in Use An LPA
Misconfiguration means that secrets are exposed on a live service
Unauthorized access to an API which can result in attackers accessing data
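The criteria listed above can be read as a checklist. The following is a minimal Python sketch of that reading, assuming that meeting any one criterion is enough to treat an event as an incident. The `Event` class and its field names are hypothetical and exist only for illustration; they are not part of the OPG incidents tool.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical record of an observed event; field names are illustrative."""

    needs_immediate_response: bool = False
    severely_degrades_user_experience: bool = False
    compromises_application_security: bool = False
    risks_data_loss_or_compromise: bool = False


def is_incident(event: Event) -> bool:
    """Return True if the event meets at least one of the four criteria.

    Assumption: the criteria are treated as "any of", not "all of".
    """
    return any(
        (
            event.needs_immediate_response,
            event.severely_degrades_user_experience,
            event.compromises_application_security,
            event.risks_data_loss_or_compromise,
        )
    )


# Example: misconfiguration exposes secrets on a live service.
exposed_secrets = Event(compromises_application_security=True)
print(is_incident(exposed_secrets))  # prints True
```

If in doubt, it is usually safer to declare an incident and downgrade it later than to wait for certainty.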

Declare an incident

From your team's channel, you can use the OPG incidents Slack tool to declare an incident:

/opg-incident Something's happened

Slack will open a form for you to fill in extra information. At this stage, you only need to select whether it's a live incident or something that has already happened that you'd like to report. All the other information can be filled in later.

The Slack bot will automatically post a message into the #opg-incident channel and create an entry on the incident response site. This marks the official start of an incident.

Immediate first steps

As soon as the tool has posted a message into #opg-incident, you should take these steps immediately:

Use the create communications channel button in the message (only available for active incidents)
Page the on-call incident lead
If appropriate, start a conference call for further discussion

Ensure roles have been assigned

There are two roles that must be filled for every incident: the Incident Lead and the Incident Scribe.

In rare cases, the same person might fill both roles but this is discouraged.

Incident Reporter

This is whoever discovered the issue and declared the incident.

Incident Lead

The incident reporter should call in a designated incident lead, typically a Technical Architect, a Lead or Senior WebOps, or a Lead Dev.

The incident lead is the lead coordination role and should be someone who has experience running incidents in OPG. It is preferable if they are not in a team affected by the incident so that they can provide an unbiased view.

Responsibilities

Coordinate our response to the incident
Decide on any additional roles required (e.g. a communications lead may be required)
Ensure that all required roles are filled
Ensure that all tasks which need to be handled are being done
Make the final decision whenever we need to choose a course of action
Set the schedule for any regular team check-ins, if those are deemed necessary
Declare the incident closed, when appropriate
Ensure that the post-incident process is followed

NB: The incident lead needs to ensure that things are being done. They do not need to do them personally.

Once appointed, the incident lead should update the following information using the Slack bot in the dedicated communications channel:

Incident summary: @opg-incident-response summary <describe the incident>
Incident severity: @opg-incident-response sev <Critical, Major, Trivial>
Incident impact: @opg-incident-response impact <describe the impact>
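As an illustration of how the three update commands are composed, here is a minimal Python sketch that builds the message text to paste into the incident channel. The helper names are hypothetical, and the assumption that the bot expects exactly the spellings Critical, Major or Trivial is taken from the placeholder above rather than from the bot's own documentation.

```python
BOT = "@opg-incident-response"

# Severity values as listed in this document; the exact spelling and case
# accepted by the bot is an assumption.
SEVERITIES = ("Critical", "Major", "Trivial")


def summary_command(text: str) -> str:
    """Build the message that sets the incident summary."""
    return f"{BOT} summary {text}"


def severity_command(level: str) -> str:
    """Build the message that sets the severity, rejecting unknown levels."""
    if level not in SEVERITIES:
        raise ValueError(f"severity must be one of {SEVERITIES}, got {level!r}")
    return f"{BOT} sev {level}"


def impact_command(text: str) -> str:
    """Build the message that describes the incident impact."""
    return f"{BOT} impact {text}"


# Hypothetical example values, for illustration only.
print(summary_command("Use An LPA is unreachable"))
print(severity_command("Major"))
print(impact_command("Users cannot sign in"))
```

Validating the severity before posting avoids sending a value that the placeholder list does not include.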

Incident Scribe

The incident scribe can be someone in a team affected by the incident. Ask for volunteers verbally or via Slack.

Responsibilities

The scribe is responsible for keeping a log of the incident, including:

Important events
Discussion topics
Decisions
Actions
Results of actions/investigations

NB: This log is not intended to be a verbatim transcript of discussions. Rather, it should capture things like “xxx suggested the disk might be full. yyy to investigate and report back”.

Once appointed, the scribe should update the incident header at the top of the channel with a link to the living document on the incident response site.

The scribe can pin important messages in the incident channel, and the Slack tool will automatically add those to the timeline summary. They can also add actions to the incident log using the command @opg-incident-response action <action_description>
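To make the shape of a useful log entry concrete, the following Python sketch models the five kinds of entry listed above and builds the action command text. The `LogEntry` class, the kind names and the helper functions are hypothetical and only illustrate the level of detail expected; the real log lives in the incident channel and on the incident response site.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

BOT = "@opg-incident-response"

# Hypothetical short names for the five kinds of entry the scribe records.
ENTRY_KINDS = ("event", "discussion", "decision", "action", "result")


@dataclass
class LogEntry:
    """One short, non-verbatim note in the incident log."""

    kind: str
    note: str
    at: datetime


def make_entry(kind: str, note: str, at: Optional[datetime] = None) -> LogEntry:
    """Create a log entry, defaulting the timestamp to the current UTC time."""
    if kind not in ENTRY_KINDS:
        raise ValueError(f"kind must be one of {ENTRY_KINDS}, got {kind!r}")
    return LogEntry(kind=kind, note=note, at=at or datetime.now(timezone.utc))


def action_command(description: str) -> str:
    """Build the message that adds an action to the incident log."""
    return f"{BOT} action {description}"


# The example note from this document, split into a discussion and an action.
discussion = make_entry("discussion", "xxx suggested the disk might be full")
print(discussion.kind, "-", discussion.note)
print(action_command("yyy to investigate and report back"))
```

Keeping each note to a single sentence with a timestamp is usually enough to reconstruct the timeline afterwards.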

When conversation happens verbally, it is the scribe’s responsibility to ensure anything that needs to be logged is written up in the incident channel.

Communications Lead

Depending on the impact/duration of an incident, it may be desirable to appoint a communications lead.

It is usually best if this is a member of the product profession to facilitate clear communications with the business.

Responsibilities

Communicate the incident out to those who are impacted
Decide what, how and how often to issue updates
Give updates at regular intervals
Field enquiries so the team can focus on fixing the incident without interruptions
Disseminate a post-incident report

You can update the Statuspage using the following command in the dedicated incident channel:

@opg-incident-response statuspage

Transferring roles

It may be necessary to transfer roles from one team member to another, e.g. during long-running incidents. In this case, it is the responsibility of whoever is in a role to ensure that someone else takes it over.

Whoever assumes a role should announce it in the incident channel so that the team is aware of the change.

Fixing the problem

Please bear in mind that not every incident requires the whole team to be involved.

Things you may want to consider:

Start a conference call to hold ongoing discussions
Swarm on the problem
Step back if you are not contributing and you don’t have one of the key roles
Bring in members of other teams with relevant knowledge
Ask in #opg-all-team-digital, #opg-webops-community or #opg-developers for relevant experience from outside your team

End the incident

When work on the problem has ceased, either because the problem has been resolved or because resolution is blocked until a later date, the incident lead should close the incident using the Slack tool:

@opg-incident-response close

This marks the official end of the incident. If users were notified of the incident, the communications lead should send an appropriate message via the same channels to tell them it’s over.

Post Incident

After the incident is resolved:

Schedule a root cause analysis (RCA) session to identify where we can improve
Document the outcome of the RCA in the OPG Security Confluence space
Share the RCA outcomes with the delivery team and wider OPG Digital so they can learn from it too

Records and retention

The dedicated incident channel should be archived once the issue is fixed.

The incident response site will keep an archive of the messages pinned by the incident scribe.

Where appropriate, the root cause analysis may be shared with the Amazon Technical Account Managers.

This page was last reviewed on 14 August 2024. It needs to be reviewed again on 13 November 2024 by the page owner #opg-webops-community.
