Health checks

Many Arvados services expose a health check endpoint at /_health/ping. The health check offers a simple way to determine whether a service can be reached and lets the service self-report any problems, which makes it suitable for integration into operational alerting systems.

To access health check endpoints, services must be configured with a management token.

Health check endpoints return a JSON object with a health field whose value is either OK or ERROR. On error, the object may also include an error field with additional information. Examples:

{
  "health": "OK"
}
{
  "health": "ERROR",
  "error": "Inverted polarity in the warp core"
}
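
The response format above can be consumed by a small monitoring probe. The following Python sketch is illustrative only: the function names interpret_health and ping are hypothetical, and it assumes the management token is sent in an Authorization: Bearer header, which this page does not state. Adjust it to match your deployment.

```python
import json
import urllib.error
import urllib.request
from typing import Tuple


def interpret_health(body: str) -> Tuple[bool, str]:
    """Return (healthy, detail) for a health check response body."""
    try:
        doc = json.loads(body)
    except json.JSONDecodeError as exc:
        return False, f"response is not valid JSON: {exc}"
    if not isinstance(doc, dict):
        return False, "response is not a JSON object"
    if doc.get("health") == "OK":
        return True, ""
    # On error, the optional "error" field carries additional information.
    return False, str(doc.get("error", "no error detail reported"))


def ping(base_url: str, token: str, timeout: float = 5.0) -> Tuple[bool, str]:
    """Query base_url + /_health/ping and interpret the result."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/_health/ping",
        # Assumption: the management token is passed as a Bearer token.
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return interpret_health(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as exc:
        # Non-2xx status: the body may still contain a health object.
        return interpret_health(exc.read().decode("utf-8", "replace"))
    except (urllib.error.URLError, OSError) as exc:
        return False, f"service unreachable: {exc}"
```

A call such as ping("http://localhost:8003", token) returns (True, "") for a healthy service, or False together with the reported error text otherwise. An unreachable service is also reported as unhealthy rather than raising an exception.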

Health check aggregator

The service arvados-health performs health checks on all configured services and returns a single value of OK or ERROR for the entire cluster. It exposes the endpoint /_health/all.

The health check aggregator uses the Services section of the cluster-wide config.yml configuration file.

Health check command

The arvados-server check command performs the same health checks as the health check aggregator service, but does not depend on that service.

If all checks pass, it writes health check OK to stderr (unless the -quiet flag is used) and exits 0. Otherwise, it writes error messages to stderr and exits with a non-zero status.
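
Because the result is conveyed through the exit status and stderr, the command is straightforward to wrap in a script or alerting hook. The Python sketch below is one possible wrapper, not an official integration. The function name run_check is hypothetical, and it assumes arvados-server is on the PATH of the user running it.

```python
import subprocess
import sys
from typing import Sequence, Tuple


def run_check(
    command: Sequence[str] = ("arvados-server", "check", "-quiet"),
) -> Tuple[bool, str]:
    """Run the health check command; return (passed, stderr text)."""
    try:
        proc = subprocess.run(
            list(command), capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        # Assumption: a missing binary should count as a failed check.
        return False, f"command not found: {command[0]}"
    # Exit status 0 means every check passed; any other status is a
    # failure, with the error messages available on stderr.
    return proc.returncode == 0, proc.stderr.strip()
```

A caller can treat passed == False as an alert condition and forward the captured stderr text as the alert body.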

arvados-server check -yaml outputs a YAML document on stdout with additional details about each service endpoint that was checked.

Checks:
  "arvados-api-server+http://localhost:8004/_health/ping":
    ClockTime: "2022-11-16T16:08:57Z"
    ConfigSourceSHA256: e2c086ae3dd290cf029cb3fe79146529622279b6280cf6cd17dc8d8c30daa57f
    ConfigSourceTimestamp: "2022-11-07T18:08:24.539545Z"
    HTTPStatusCode: 200
    Health: OK
    Response:
      health: OK
    ResponseTime: 0.017159
    Server: nginx/1.14.0 + Phusion Passenger(R) 6.0.15
    Version: 2.5.0~dev20221116141533
  "arvados-controller+http://localhost:8003/_health/ping":
    ClockTime: "2022-11-16T16:08:57Z"
    ConfigSourceSHA256: e2c086ae3dd290cf029cb3fe79146529622279b6280cf6cd17dc8d8c30daa57f
    ConfigSourceTimestamp: "2022-11-07T18:08:24.539545Z"
    HTTPStatusCode: 200
    Health: OK
    Response:
      health: OK
    ResponseTime: 0.004748
    Server: ""
    Version: 2.5.0~dev20221116141533 (go1.18.8)
# ...
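
Once parsed, a report like the one above can be reduced to a short summary. The sketch below takes the document as an already-parsed Python dict, for example from a YAML library such as PyYAML if one is installed. The function name summarize is hypothetical. It also assumes that ConfigSourceSHA256 is a hash of the configuration each service loaded, so that more than one distinct value would suggest services running with different configuration files. This page does not state that interpretation.

```python
from typing import Any, Dict, List


def summarize(report: Dict[str, Any]) -> Dict[str, List[str]]:
    """Summarize a parsed `arvados-server check -yaml` document."""
    checks = report.get("Checks") or {}
    # Endpoints whose Health field is anything other than OK.
    unhealthy = sorted(
        name for name, result in checks.items()
        if result.get("Health") != "OK"
    )
    # Distinct configuration hashes reported across all endpoints
    # (assumed to identify the config each service loaded).
    config_hashes = sorted({
        result["ConfigSourceSHA256"]
        for result in checks.values()
        if result.get("ConfigSourceSHA256")
    })
    return {"unhealthy": unhealthy, "config_hashes": config_hashes}
```

For the two endpoints shown in the sample output, the unhealthy list would be empty and config_hashes would contain a single entry, since both report the same hash.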


The content of this documentation is licensed under the Creative Commons Attribution-Share Alike 3.0 United States licence.
Code samples in this documentation are licensed under the Apache License, Version 2.0.