Introduction to incidents

See also: Incident process

Incidents sound scary, but the term “incident” just means something is going wrong or not working as expected. Incidents can happen anywhere, and we will face technology outages that impact our users.

We can’t stop incidents from happening, but we can make sure we are ready to deal with them.

What is incident response?

Incident response is the broad term for the processes we follow when an incident happens. Apart from actually fixing the issue, these are the things we should be thinking about:

  • Communicating with each other
  • Communicating with users
  • Clearly defining roles
  • Getting the right people involved
  • Tracking what’s happened

Incident priority

Assign each incident a priority level based on its complexity, urgency and resolution time. The priority level also determines the response time and the level of support.

Incident priority table

Classification | Type | Example | Response time | Update frequency
P1 | Critical | Complete outage, or ongoing unauthorised access | 20 minutes | Every 30 minutes
P2 | Major | Substantial degradation of service | 60 minutes | Every 1 hour
P3 | Significant | Users experiencing intermittent or degraded service due to platform issue | 2 hours | Once after 2 business days
P4 | Minor | Component failure that does not immediately impact a service, or an unsuccessful DoS attempt | 1 business day | Once after 5 business days
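
The table above can also be written as data, which may help if a team wants alerting or a chat bot to work out deadlines. The following is a minimal sketch, not part of the incident process itself. The names `Priority`, `PRIORITIES`, `respond_by` and `first_update_by` are hypothetical. It assumes that targets are measured from the time the incident is raised, and that business days run Monday to Friday with no public holidays.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Priority:
    kind: str
    response: Optional[timedelta] = None              # clock-time response target
    response_business_days: Optional[int] = None      # business-day response target
    update_every: Optional[timedelta] = None          # recurring update interval
    update_after_business_days: Optional[int] = None  # single update after N business days


# Values copied from the incident priority table.
PRIORITIES = {
    "P1": Priority("Critical", response=timedelta(minutes=20),
                   update_every=timedelta(minutes=30)),
    "P2": Priority("Major", response=timedelta(minutes=60),
                   update_every=timedelta(hours=1)),
    "P3": Priority("Significant", response=timedelta(hours=2),
                   update_after_business_days=2),
    "P4": Priority("Minor", response_business_days=1,
                   update_after_business_days=5),
}


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by whole business days, skipping Saturday and Sunday.

    Assumption: public holidays are not taken into account.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def respond_by(code: str, raised_at: datetime) -> datetime:
    """Latest time by which the incident should receive a response."""
    priority = PRIORITIES[code]
    if priority.response is not None:
        return raised_at + priority.response
    return add_business_days(raised_at, priority.response_business_days)


def first_update_by(code: str, raised_at: datetime) -> datetime:
    """Latest time by which the first status update is due."""
    priority = PRIORITIES[code]
    if priority.update_every is not None:
        return raised_at + priority.update_every
    return add_business_days(raised_at, priority.update_after_business_days)
```

For example, `respond_by("P1", datetime.now())` gives the time 20 minutes from now, while a P4 raised on a Friday afternoon would not be due a response until the same time on the following Monday under the weekend assumption above.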

Guidance for products

The main incident response process is stored here in the opg-technical-guidance repository. Each product should have a product runbook, also located in this repository. Link to this runbook from the README.md in your product's repository.

This page was last reviewed on 7 February 2024. It needs to be reviewed again on 7 August 2024 by the page owner #opg-webops-community.