Metrics

Some Arvados services publish Prometheus/OpenMetrics-compatible metrics at /metrics. Metrics can help you understand how components perform under load, find performance bottlenecks, and detect and diagnose problems.

To access metrics endpoints, services must be configured with a management token. When accessing a metrics endpoint, prefix the management token with "Bearer " and supply it in the Authorization request header.

curl -sfH "Authorization: Bearer your_management_token_goes_here" "https://0.0.0.0:25107/metrics"
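
The same request can be made from a script. The following is a minimal sketch using only the Python standard library; the function names build_metrics_request and fetch_metrics are hypothetical helpers for illustration, not part of any Arvados SDK, and the base URL and token are placeholders you would replace with your own.

```python
import urllib.request


def build_metrics_request(base_url: str, management_token: str) -> urllib.request.Request:
    """Build a GET request for <base_url>/metrics with the token in the Authorization header."""
    url = base_url.rstrip("/") + "/metrics"
    # The management token is prefixed with "Bearer ", as in the curl example.
    headers = {"Authorization": "Bearer " + management_token}
    return urllib.request.Request(url, headers=headers)


def fetch_metrics(base_url: str, management_token: str, timeout: float = 10.0) -> str:
    """Return the plain text metrics export as a string.

    urlopen raises urllib.error.HTTPError for error statuses (for example
    when the token is rejected), so failures are not silently ignored.
    """
    request = build_metrics_request(base_url, management_token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Splitting request construction from the network call keeps the header logic easy to check without a running service.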

The plain text export format includes “help” messages with a description of each reported metric.
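
To illustrate how the “help” messages sit alongside the reported values, here is a rough sketch of a parser for the plain text format. It is deliberately simplified (it ignores timestamps and does not decode labels), the function name parse_metrics_text is hypothetical, and the metric names in the usage note are made up for the example rather than taken from any Arvados service.

```python
def parse_metrics_text(text: str):
    """Return (help_by_name, samples) from a Prometheus-style plain text export.

    help_by_name maps a metric name to its "# HELP" description.
    samples is a list of (series, value) pairs, where series is the metric
    name including any {label="..."} part, kept as an opaque string.
    """
    help_by_name = {}
    samples = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("# HELP "):
            # Format: "# HELP <metric_name> <description>"
            parts = line[len("# HELP "):].split(" ", 1)
            help_by_name[parts[0]] = parts[1] if len(parts) > 1 else ""
        elif line.startswith("#"):
            continue  # "# TYPE" lines and other comments are skipped
        else:
            if "{" in line:
                # Label values may contain spaces, so split after the last "}".
                end = line.rindex("}")
                series, rest = line[: end + 1], line[end + 1:].split()
            else:
                series, *rest = line.split()
            samples.append((series, float(rest[0])))
    return help_by_name, samples
```

For example, given a line "# HELP example_requests_total Hypothetical request counter." followed by a sample line "example_inflight 3", the first return value maps example_requests_total to its description and the second contains the pair ("example_inflight", 3.0).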

When configuring Prometheus, use a bearer_token or bearer_token_file option to authenticate requests.

scrape_configs:
  - job_name: keepstore
    bearer_token: your_management_token_goes_here
    static_configs:
    - targets:
      - "keep0.ClusterID.example.com:25107"

Component	Metrics endpoint
arvados-api-server
arvados-controller
arvados-dispatch-cloud
arvados-git-httpd
arvados-node-manager
arvados-ws
composer
keepproxy
keepstore
keep-balance
keep-web
workbench1
workbench2

Node manager

The node manager does not export Prometheus-style metrics, but its /status.json endpoint provides a snapshot of its internal status as of the most recent wishlist update.

curl -sfH "Authorization: Bearer your_management_token_goes_here" "http://0.0.0.0:8989/status.json"

Attribute         Type  Description
nodes_booting     int   Number of nodes in booting state
nodes_unpaired    int   Number of nodes in unpaired state
nodes_busy        int   Number of nodes in busy state
nodes_idle        int   Number of nodes in idle state
nodes_fail        int   Number of nodes in fail state
nodes_down        int   Number of nodes in down state
nodes_shutdown    int   Number of nodes in shutdown state
nodes_wish        int   Number of nodes in the current wishlist
node_quota        int   Current node count ceiling due to cloud quota limits
config_max_nodes  int   Configured max node count

Example

{
  "actor_exceptions": 0,
  "idle_times": {
    "compute1": 0,
    "compute3": 0,
    "compute2": 0,
    "compute4": 0
  },
  "create_node_errors": 0,
  "destroy_node_errors": 0,
  "nodes_idle": 0,
  "config_max_nodes": 8,
  "list_nodes_errors": 0,
  "node_quota": 8,
  "Version": "1.1.4.20180719160944",
  "nodes_wish": 0,
  "nodes_unpaired": 0,
  "nodes_busy": 4,
  "boot_failures": 0
}
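
A snapshot like the one above can be condensed into a few numbers for a dashboard or alert. The sketch below is one possible way to do that, not an official tool: the function name summarize_status is hypothetical, it assumes that a nodes_* attribute missing from the response (as several are in the example) can be treated as zero, and it assumes the effective ceiling is the lower of node_quota and config_max_nodes.

```python
import json

# State names taken from the nodes_* attributes in the table above.
NODE_STATES = ("booting", "unpaired", "busy", "idle", "fail", "down", "shutdown")


def summarize_status(status: dict) -> dict:
    """Summarize a parsed /status.json snapshot.

    Assumption: a missing nodes_<state> key means zero nodes in that state.
    Assumption: the effective node ceiling is min(node_quota, config_max_nodes).
    """
    counts = {state: status.get("nodes_" + state, 0) for state in NODE_STATES}
    limits = [
        status[key]
        for key in ("node_quota", "config_max_nodes")
        if status.get(key) is not None
    ]
    return {
        "counts": counts,
        "total": sum(counts.values()),
        "wish": status.get("nodes_wish", 0),
        "ceiling": min(limits) if limits else None,
    }


if __name__ == "__main__":
    # A subset of the example response shown above.
    snapshot = json.loads(
        '{"nodes_idle": 0, "config_max_nodes": 8, "node_quota": 8,'
        ' "nodes_wish": 0, "nodes_unpaired": 0, "nodes_busy": 4}'
    )
    print(summarize_status(snapshot))
```

With the example snapshot, the summary reports 4 nodes in total (all busy) against a ceiling of 8.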


The content of this documentation is licensed under the Creative Commons Attribution-Share Alike 3.0 United States licence.
Code samples in this documentation are licensed under the Apache License, Version 2.0.