ADR-007 Healthcheck Definitions
Date: 14-02-2023
Status
✅ Accepted
Context
Across our teams there are currently slight differences in how we use the various health checks within services.
This is mainly in the fidelity of the checks and what they connect to. For example, one service might use a health check just to check that its containers are running. Another might check that in-app connectivity is working, such as front end to internal API to DB. Another might check that the service can connect out to related services it depends on, such as Notify or other API endpoints.
This means that an engineer moving between products, either for temporary support or into a new role, could be confused about the intent of a healthcheck or make assumptions based on past experience that turn out not to be true. This is particularly likely if the healthcheck is used for multiple purposes which could be at odds, for example if a healthcheck connecting to an external service caused containers to scale or cycle when they didn’t need to.
Decision
We need to be clear on the purpose of each healthcheck within a service and be consistent across services in how we name, use and check them. Therefore we should categorise them as container health checks, service health checks and dependency health checks, so we are clear about what type we implement and how they are intended to be used. The URLs given below are our standard suggestions, which teams should move towards.
Container Health Checks
These are used to determine if an individual container is running as expected. The check reports that the container has finished its startup and is ready to handle requests, and it should start to fail if the container needs to be cycled out.
/health-check
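As an illustration only, a container health check could be as small as the sketch below. The `ContainerState` flags, the JSON body and the use of a 503 status are assumptions made for this example, not part of the decision; teams would wire the function into whatever web framework they already use.

```python
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ContainerState:
    """Hypothetical in-process flags that the application sets itself."""
    startup_complete: bool = False
    needs_cycling: bool = False


def container_health_check(state: ContainerState) -> Tuple[int, str]:
    """Return (HTTP status, JSON body) for GET /health-check.

    Only local state is inspected: no database, API or external call.
    """
    if state.startup_complete and not state.needs_cycling:
        return 200, json.dumps({"status": "ok"})
    # Assumption: 503 signals that the container is not ready or should be replaced.
    return 503, json.dumps({"status": "unavailable"})
```

Keeping this check free of network calls means a slow database or third party cannot cause otherwise healthy containers to be cycled.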
Service Health Checks
These are used to determine if a service is “up” and working, both for the purposes of uptime calculation and to prove that the end-to-end path within a service boundary is functioning normally. That is, the front end can talk to the API and the API can talk to the DB. Typically a front end service health check will call the API’s version of the same health check.
/health-check/service
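A minimal sketch of a service health check follows, assuming each component inside the service boundary is represented by a probe callable. The probe names ("api", "db"), the response shape and the 503 status on failure are hypothetical choices for this example.

```python
import json
from typing import Callable, Dict, Tuple


def service_health_check(probes: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Return (HTTP status, JSON body) for GET /health-check/service.

    `probes` maps a component inside the service boundary (for example
    "api" or "db") to a callable that returns True when it is reachable.
    """
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = "ok" if probe() else "fail"
        except Exception:
            # A probe that raises is treated the same as one that fails.
            results[name] = "fail"
    healthy = all(value == "ok" for value in results.values())
    body = {"status": "ok" if healthy else "fail", "checks": results}
    # Assumption: any failed component makes the whole service report 503.
    return (200 if healthy else 503), json.dumps(body)
```

A front end could register a single "api" probe that requests the API's own /health-check/service, which in turn probes the DB.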
Dependency Health Checks
These are used to determine if a service’s external dependencies (external APIs like Notify or the Sirius integrations) are down and may cause service degradation. They should return details of each external service’s status. A more developed version will also indicate what impact that service being down will have on the system. Teams should take time to work out the likely impacts; for example, Notify being down could just mean email is queued until it is working again.
/health-check/dependencies
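One possible shape for a dependency health check is sketched below. The `Dependency` type, the response format and the choice to always return 200 (with degradation reported in the body) are assumptions for illustration; the impact text is something each team would agree for its own service.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Dependency:
    """Hypothetical description of one external dependency."""
    name: str
    probe: Callable[[], bool]  # returns True when the dependency is reachable
    impact_if_down: str        # impact statement agreed by the team


def dependency_health_check(dependencies: List[Dependency]) -> Tuple[int, str]:
    """Return (HTTP status, JSON body) for GET /health-check/dependencies."""
    report = []
    for dep in dependencies:
        try:
            up = bool(dep.probe())
        except Exception:
            # A probe that raises counts as the dependency being down.
            up = False
        entry = {"name": dep.name, "status": "up" if up else "down"}
        if not up:
            entry["impact"] = dep.impact_if_down
        report.append(entry)
    # Assumption: always 200, so an external outage is reported in the body
    # rather than through a failing status that might cycle containers.
    return 200, json.dumps({"dependencies": report})
```

For example, a Notify entry might carry the impact "Email is queued until Notify recovers", which appears in the response only while Notify is down.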
Monitoring Health Checks
It is worth noting that AWS treats a Route 53 HTTP or HTTPS health check as failed if it takes longer than 4 seconds to establish a connection, or longer than 2 seconds after connecting to respond with a 2xx or 3xx status code. This means any health check that connects outside the service boundary (dependency health checks) is not recommended for Route 53. See AWS documentation.
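To make the timing limits concrete, the sketch below classifies a single observation against the Route 53 thresholds for HTTP and HTTPS checks. The function and constant names are hypothetical, and the sketch does not model failure thresholds across multiple checkers or string matching.

```python
CONNECT_LIMIT_SECONDS = 4.0   # time allowed to establish the TCP connection
RESPONSE_LIMIT_SECONDS = 2.0  # time allowed to respond after connecting


def route53_would_pass(connect_seconds: float,
                       response_seconds: float,
                       status_code: int) -> bool:
    """Return True if one observation falls within the Route 53 limits."""
    connected_in_time = connect_seconds <= CONNECT_LIMIT_SECONDS
    responded_in_time = response_seconds <= RESPONSE_LIMIT_SECONDS
    acceptable_status = 200 <= status_code < 400  # any 2xx or 3xx
    return connected_in_time and responded_in_time and acceptable_status
```

A dependency check that waits 3 seconds on a slow third party would fail here even with a 200 response, which is why such checks are a poor fit for Route 53.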
Teams are encouraged to create other alerts consistent with the nature of their service to alert on broader service healthcheck failures. Some teams may want to use Pingdom, but this is not suitable for internal-only services behind an allowlist.
Consequences
- Work will need to happen to normalise the various health checks across the estate.
- Teams will need to identify what a dependency failure means for a given service and make that explicit with Product colleagues.
- We will report Service Health Checks in any request for uptime stats.