Introduction to incidents

See also: Incident process

Incidents sound scary, but the term “incident” just means that something is going wrong or not working as expected. Incidents can happen anywhere, and we will face technology outages that impact our users.

We can’t stop incidents from happening, but we can make sure we are ready to deal with them.

What is incident response?

Incident response is the broad term for the processes we follow when something goes wrong. Apart from actually fixing the issue, these are the things we should be thinking about:

Communicating with ourselves
Communicating with users
Clearly defining roles
Getting the right people involved
Tracking what’s happened

Incident priority

Assign a priority level to each incident based on its complexity, urgency and resolution time. The priority level also determines response times and support level.

Incident priority table

Classification	Type	Example	Response time	Update frequency
P1	Critical	Complete outage, or ongoing unauthorised access	20 minutes	30 minutes
P2	Major	Substantial degradation of service	60 minutes	1 hour
P3	Significant	Users experiencing intermittent or degraded service due to platform issue	2 hours	Once after 2 business days
P4	Minor	Component failure that does not immediately impact a service, or an unsuccessful DoS attempt	1 business day	Once after 5 business days
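
As an illustration only, the table above could be encoded as a small lookup so that tooling (for example a chat bot or alert handler) can work out when a first response is due. This is a minimal sketch, not part of the documented process: the names PriorityLevel, PRIORITIES and respond_by are hypothetical, and targets expressed in business days are deliberately left uncalculated because the table does not define business hours.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class PriorityLevel:
    """One row of the incident priority table (hypothetical structure)."""
    classification: str
    type: str
    response_time: str
    update_frequency: str
    # Clock-time response target. None where the table uses business days,
    # which this sketch does not attempt to calculate.
    response_delta: Optional[timedelta] = None


# Values copied from the incident priority table.
PRIORITIES = {
    "P1": PriorityLevel("P1", "Critical", "20 minutes", "30 minutes",
                        timedelta(minutes=20)),
    "P2": PriorityLevel("P2", "Major", "60 minutes", "1 hour",
                        timedelta(minutes=60)),
    "P3": PriorityLevel("P3", "Significant", "2 hours",
                        "Once after 2 business days", timedelta(hours=2)),
    "P4": PriorityLevel("P4", "Minor", "1 business day",
                        "Once after 5 business days"),
}


def respond_by(classification: str, reported_at: datetime) -> Optional[datetime]:
    """Return the latest time a first response is due.

    Returns None when the target is expressed in business days (P4).
    Raises KeyError for an unknown classification.
    """
    level = PRIORITIES[classification]
    if level.response_delta is None:
        return None
    return reported_at + level.response_delta


if __name__ == "__main__":
    reported = datetime(2024, 2, 7, 9, 0)
    print(respond_by("P1", reported))  # 2024-02-07 09:20:00
    print(respond_by("P4", reported))  # None
```

Keeping the human-readable strings alongside the timedelta values means the same structure can be used both for displaying the table and for calculating deadlines.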

Guidance for products

The main incident response process is stored here in the opg-technical-guidance repository. Each product should have a product runbook, also located in this repository. Link to this runbook from the README.md in your repository.

This page was last reviewed on 7 February 2024. It needs to be reviewed again on 7 August 2024 by the page owner #opg-webops-community.
