Incident Response Process
This document describes our incident handling process (inspired by work elsewhere in MOJ like the Cloud Platform).
Confirm that an event constitutes an incident
We define an incident as an event which:
- Requires immediate response to return normal service
- Severely degrades user experience of the service
- Compromises application security, resulting in a breach or potential breach
- Has the potential for loss or compromise of data
Examples:
- Live service is unreachable due to denial of service
- Connectivity to a database is down and so services cannot be accessed
- A problem in one service means another is no longer functioning e.g. api gateway in Sirius causing issues in Use An LPA
- Misconfiguration means that secrets are exposed on a live service
- Unauthorized access to an API which can result in attackers accessing data
Declare an incident
From your teams channel you can use the OPG incidents Slack tool to declare an incident.
/opg-incident Something's happened
Slack will pop open a form to to fill in extra information. At this stage, you only need to select whether its a live incident or something that’s already happened that you’d like to report. All the other information can be filled in later.
The Slack bot will automatically post a message into the #opg-incident
channel and create an entry on the incident response site. This marks the offical start of an incident
Immediate first steps
As soon as the tool has posted a message into #opg-incident
you should do these steps immediatly
- Use the create communtications channel button in the message (only available for active incidents)
- Page the on-call incident lead
- If appropriate, start a conference call for further discussion
Ensure roles have been assigned
There are two roles that must be filled for every incident. They are the Incident Lead and Incident Scribe
In rare cases, the same person might fill both roles but this is discouraged.
Incident Reporter
This is whoever discovered the issue and declared the incident.
Incident Lead
The incident reporter should call in a designated incident lead, typically a Technical Architect, Lead or Senior WebOps or Lead Dev.
The incident lead is the lead coordination role and should be someone who has experience running incidents in OPG. It is preferable if they are not in a team affected by the incident so that they can provide an unbiased view.
Responsibilities
- Coordinate our response to the incident
- Decide on any additional roles required (e.g. a communications lead may be required)
- Ensure that all required roles are filled
- Ensure that all tasks which need to be handled are being done
- Make the final decision whenever we need to choose a course of action
- Set the schedule for any regular team check-ins, if those are deemed necessary
- Declare the incident closed, when appropriate
- Ensure that the post-incident process is followed
NB: The incident lead needs to ensure that things are being done, they do not need to do them
Once appointed, the incident lead should update the following information using the Slack bot in the dedicated communications channel.
- incident summary
@opg-incident-response summary <describe the incident>
- incident severity
@opg-incident-response sev <Critical, Major, Trivial>
- incident impact
@opg-incident-response impact <describe the impact>
Incident Scribe
The incident scribe can be someone in a team affected by the incident. Ask for volunteers verbally or via Slack.
Responsibilities
The scribe is responsible for keeping a log of the incident, including:
- Important events
- Discussion topics
- Decisions
- Actions
- Results of actions/investigations
NB: this log is not intended to be a verbatim transcript of discussions. Rather, things like “xxx suggested the disk might be full. yyy to investigate and report back”
Once appointed, the scribe should update the incident header at the top of the channel with a link to the living document on the incident response site.
The scribe can pin important message in the incident channel and the Slack tool will automatically add those into the timeline summary. They can also add
actions to the incident log using the command @opg-incident-response action <action_description>
When conversation happens verbally, it is the scribe’s responsibility to ensure anything that needs to be logged is written up in the incident channel.
Communications Lead
Depending on the impact/duration of an incident, it may be desirable to appoint a communications lead.
It is usually best if this is a member of the product profession to facilitate clear communications with the business.
Responsibilities
- Communicate the incident out to those who are impacted
- Decide what, how and how often to issue updates
- Give updates at regular intervals
- Field enquiries so the team can focus on fixing the incident without interruptions
- Disseminate a post-incident report
You can update the Statuspage via the following command in the dedicated incident channel
@opg-incident-response statuspage
Transferring roles
It may be necessary to transfer roles from one team member to another, e.g. during long-running incidents. In this case, it is the responsibility of whoever is in a role to ensure that someone else takes it over.
Whoever assumes a role should announce it in the incident channel, so that the team is aware and updated
Fixing the problem
Please bear in mind that not every incident requires the whole team to be involved.
Things you may want to consider:
- Start a conference call to hold ongoing discussions
- Swarm on the problem
- Step back if you are not contributing and you don’t have one of the key roles
- Bring in members of other teams with relevant knowledge
- Ask in #opg-all-team-digital, #opg-webops-community or #opg-developers for relevant experience from outside your team
End the incident
When work on the problem has ceased, either because the problem has been resolved or because resolution is blocked until a later date, the incident lead should update the Slack tool to close the incident.
@opg-incident-response close
This marks the official end of the incident. If users were notified of the incident, the communications lead should send an appropriate message via the same channels to tell them it’s over.
Post Incident
After the incident is resolved:
- Schedule a root cause analysis (RCA) session to identify where we can improve
- Document the outcome of the RCA in the OPG Security confluence space
- Share the RCA outcomes with the delivery team and wider OPG Digital so they can learn from it too
Records and retention
The dedicated incident channel should be archived once the issue is fixed.
The incident response site will keep an archive of the messages pinned by the incident scribe.
Where appropriate, the root cause analysis may shared with the Amazon Technical Account Managers.