Metrics
Some Arvados services publish Prometheus/OpenMetrics-compatible metrics at /metrics
, and some provide additional runtime status at /status.json
. Metrics can help you understand how components perform under load, find performance bottlenecks, and detect and diagnose problems.
To access metrics endpoints, services must be configured with a management token. When accessing a metrics endpoint, prefix the management token with "Bearer "
and supply it in the Authorization
request header.
curl -sfH "Authorization: Bearer your_management_token_goes_here" "https://0.0.0.0:25107/status.json"
Keep-web
Keep-web exports metrics at /metrics
— e.g., https://collections.zzzzz.arvadosapi.com/metrics
.
Name |
Type |
Description |
request_duration_seconds |
summary |
elapsed time between receiving a request and sending the last byte of the response body (segmented by HTTP request method and response status code) |
time_to_status_seconds |
summary |
elapsed time between receiving a request and sending the HTTP response status code (segmented by HTTP request method and response status code) |
Metrics in the arvados_keepweb_collectioncache
namespace report keep-web’s internal cache of Arvados collection metadata.
Name |
Type |
Description |
arvados_keepweb_collectioncache_requests |
counter |
cache lookups |
arvados_keepweb_collectioncache_api_calls |
counter |
outgoing API calls |
arvados_keepweb_collectioncache_permission_hits |
counter |
collection-to-permission cache hits |
arvados_keepweb_collectioncache_pdh_hits |
counter |
UUID-to-PDH cache hits |
arvados_keepweb_collectioncache_hits |
counter |
PDH-to-manifest cache hits |
arvados_keepweb_collectioncache_cached_manifests |
gauge |
number of collections in the cache |
arvados_keepweb_collectioncache_cached_manifest_bytes |
gauge |
memory consumed by cached collection manifests |
Keepstore
Keepstore exports metrics at /status.json
— e.g., http://keep0.zzzzz.arvadosapi.com:25107/status.json
.
Root
volumeStatusEnt
VolumeStatus
Attribute |
Type |
Description |
MountPoint |
string |
|
DeviceNum |
uint64 |
|
BytesFree |
uint64 |
|
BytesUsed |
uint64 |
|
ioStats
Attribute |
Type |
Description |
Errors |
uint64 |
|
Ops |
uint64 |
|
CompareOps |
uint64 |
|
GetOps |
uint64 |
|
PutOps |
uint64 |
|
TouchOps |
uint64 |
|
InBytes |
uint64 |
|
OutBytes |
uint64 |
|
PoolStatus
Attribute |
Type |
Description |
BytesAllocatedCumulative |
uint64 |
|
BuffersMax |
int |
|
BuffersInUse |
int |
|
WorkQueueStatus
Attribute |
Type |
Description |
InProgress |
int |
|
Queued |
int |
|
Example response
{
"Volumes": [
{
"Label": "[UnixVolume /var/lib/arvados/keep0]",
"Status": {
"MountPoint": "/var/lib/arvados/keep0",
"DeviceNum": 65029,
"BytesFree": 222532972544,
"BytesUsed": 435456679936
},
"InternalStats": {
"Errors": 0,
"InBytes": 1111,
"OutBytes": 0,
"OpenOps": 1,
"StatOps": 4,
"FlockOps": 0,
"UtimesOps": 0,
"CreateOps": 0,
"RenameOps": 0,
"UnlinkOps": 0,
"ReaddirOps": 0
}
}
],
"BufferPool": {
"BytesAllocatedCumulative": 67108864,
"BuffersMax": 20,
"BuffersInUse": 0
},
"PullQueue": {
"InProgress": 0,
"Queued": 0
},
"TrashQueue": {
"InProgress": 0,
"Queued": 0
},
"RequestsCurrent": 1,
"RequestsMax": 40,
"Version": "dev"
}
Keep-balance
Keep-balance exports metrics at /metrics
— e.g., http://keep.zzzzz.arvadosapi.com:9005/metrics
.
Name |
Type |
Description |
arvados_keep_total_{replicas,blocks,bytes} |
gauge |
stored data (stored in backend volumes, whether referenced or not) |
arvados_keep_garbage_{replicas,blocks,bytes} |
gauge |
garbage data (unreferenced, and old enough to trash) |
arvados_keep_transient_{replicas,blocks,bytes} |
gauge |
transient data (unreferenced, but too new to trash) |
arvados_keep_overreplicated_{replicas,blocks,bytes} |
gauge |
overreplicated data (more replicas exist than are needed) |
arvados_keep_underreplicated_{replicas,blocks,bytes} |
gauge |
underreplicated data (fewer replicas exist than are needed) |
arvados_keep_lost_{replicas,blocks,bytes} |
gauge |
lost data (referenced by collections, but not found on any backend volume) |
arvados_keep_dedup_block_ratio |
gauge |
deduplication ratio (block references in collections ÷ distinct blocks referenced) |
arvados_keep_dedup_byte_ratio |
gauge |
deduplication ratio (block references in collections ÷ distinct blocks referenced, weighted by block size) |
arvados_keepbalance_get_state_seconds |
summary |
time to get all collections and keepstore volume indexes for one iteration |
arvados_keepbalance_changeset_compute_seconds |
summary |
time to compute changesets for one iteration |
arvados_keepbalance_send_pull_list_seconds |
summary |
time to send pull lists to all keepstore servers for one iteration |
arvados_keepbalance_send_trash_list_seconds |
summary |
time to send trash lists to all keepstore servers for one iteration |
arvados_keepbalance_sweep_seconds |
summary |
time to complete one iteration |
Each arvados_keep_
storage state statistic above is presented as a set of three metrics:
*_blocks |
distinct block hashes |
*_bytes |
bytes stored on backend volumes |
*_replicas |
objects/files stored on backend volumes |
Node manager
The node manager status end point provides a snapshot of internal status at the time of the most recent wishlist update.
Attribute |
Type |
Description |
nodes_booting |
int |
Number of nodes in booting state |
nodes_unpaired |
int |
Number of nodes in unpaired state |
nodes_busy |
int |
Number of nodes in busy state |
nodes_idle |
int |
Number of nodes in idle state |
nodes_fail |
int |
Number of nodes in fail state |
nodes_down |
int |
Number of nodes in down state |
nodes_shutdown |
int |
Number of nodes in shutdown state |
nodes_wish |
int |
Number of nodes in the current wishlist |
node_quota |
int |
Current node count ceiling due to cloud quota limits |
config_max_nodes |
int |
Configured max node count |
Example
{
"actor_exceptions": 0,
"idle_times": {
"compute1": 0,
"compute3": 0,
"compute2": 0,
"compute4": 0
},
"create_node_errors": 0,
"destroy_node_errors": 0,
"nodes_idle": 0,
"config_max_nodes": 8,
"list_nodes_errors": 0,
"node_quota": 8,
"Version": "1.1.4.20180719160944",
"nodes_wish": 0,
"nodes_unpaired": 0,
"nodes_busy": 4,
"boot_failures": 0
}
Previous: Health checks
Next: Management token