Computing with Crunch

Crunch is the Arvados system for managing computation. It provides an abstract API to various clouds and to HPC resource allocation and scheduling systems, and integrates closely with Keep storage and the Arvados permission system.

Container API

  1. To submit work, create a container request in the Committed state.
  2. The system will fulfill the container request by creating or reusing a Container object and assigning its UUID to the container_uuid field. If the same request has been submitted in the past, the system may reuse an existing container. The reuse behavior can be suppressed with use_existing: false in the container request.
  3. The dispatcher process will notice a new container in Queued state and submit a container executor to the underlying work queuing system (such as SLURM).
  4. The container executes. Upon termination the container goes into the Complete state. If the container execution was interrupted or lost due to system failure, it will go into the Cancelled state.
  5. When the container associated with the container request is completed, the container request will go into the Final state.
  6. The output_uuid field of the container request contains the UUID of the output collection produced by the container request.
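
The steps above can be sketched from the client side. This is a minimal illustration, not a definitive recipe: it assumes the container request field names container_image, command, mounts, output_path, runtime_constraints and priority, with ram given in bytes. The helper build_container_request, the image name, and the command are hypothetical placeholders. The commented-out lines assume the Arvados Python SDK and are not executed here.

```python
def build_container_request(image, command, ram_bytes, vcpus=1, use_existing=True):
    """Build a container request body (hypothetical helper, not part of any SDK)."""
    return {
        "state": "Committed",          # step 1: submit in the Committed state
        "use_existing": use_existing,  # step 2: False suppresses container reuse
        "container_image": image,
        "command": command,
        "output_path": "/out",
        # Assumed mount layout: a scratch directory that becomes the output.
        "mounts": {"/out": {"kind": "tmp", "capacity": 1 << 30}},
        "runtime_constraints": {"vcpus": vcpus, "ram": ram_bytes},
        "priority": 1,
    }


body = build_container_request(
    image="example/image",       # placeholder image name
    command=["echo", "hello"],   # placeholder command
    ram_bytes=3 * 1024**3,       # 3 GiB of working RAM
    use_existing=False,
)

# Submitting and reading the result might look roughly like this
# (assumes the Arvados Python SDK is installed and configured):
#
#   import arvados
#   api = arvados.api("v1")
#   cr = api.container_requests().create(body=body).execute()
#   ... wait until the request reaches the Final state (step 5) ...
#   cr = api.container_requests().get(uuid=cr["uuid"]).execute()
#   output_uuid = cr["output_uuid"]   # step 6
```

Keeping the body construction separate from the API call makes it easy to inspect or log the request before committing it.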

Understanding RAM requests for containers

The runtime_constraints section of a container specifies working RAM (ram) and Keep cache (keep_cache_ram). If not specified, containers get a default Keep cache (container_default_keep_cache_ram, default 256 MiB). The total RAM requested for a container is the sum of working RAM, Keep cache, and an additional RAM reservation configured by the admin (ReserveExtraRAM in the dispatcher configuration, default zero).

The total RAM request is used to schedule containers onto compute nodes. RAM allocation limits are enforced using kernel controls such as cgroups. A container which requests 1 GiB RAM will only be permitted to allocate up to 1 GiB of RAM, even if scheduled on a 4 GiB node. On HPC systems, a multi-core node may run multiple containers at a time.

When running on the cloud, the memory request (along with CPU and disk) is used to select (and possibly boot) an instance type with adequate resources to run the container. Instance type RAM is derated by 5% from the published specification to accommodate virtual machine, kernel, and system services overhead.
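
As a rough sketch of how the derating affects instance selection, the following considers RAM only (CPU and disk are omitted for brevity). The InstanceType tuple, the two example instance types, their prices, and the rule of picking the cheapest type that fits are all assumptions made for illustration, not a description of the dispatcher's actual algorithm.

```python
from typing import NamedTuple


class InstanceType(NamedTuple):
    name: str
    ram_mib: int   # published RAM size
    price: float   # hypothetical hourly price


def usable_ram_mib(published_ram_mib):
    """Published RAM derated by 5% for VM, kernel and system overhead."""
    return published_ram_mib * 95 // 100


def choose_instance_type(types, total_request_mib):
    """Return the cheapest type whose derated RAM covers the request, or None."""
    candidates = [t for t in types if usable_ram_mib(t.ram_mib) >= total_request_mib]
    return min(candidates, key=lambda t: t.price, default=None)


# Hypothetical instance types, for illustration only.
TYPES = [
    InstanceType("small", 3840, 0.10),
    InstanceType("large", 7680, 0.20),
]

# 3 GiB working RAM + 256 MiB Keep cache, no extra reservation.
chosen = choose_instance_type(TYPES, 3072 + 256)
print(chosen.name)  # small
```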

Calculate minimum instance type RAM for a container

(RAM request + Keep cache + ReserveExtraRAM) * (100/95)

For example, for a 3 GiB request, default Keep cache, and no extra RAM reserved:

(3072 + 256) * (100/95) ≈ 3503.2 MiB

To run this container, the instance type must have a published RAM size of at least 3504 MiB (rounding up to a whole MiB).
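
The calculation can be written as a small helper. This is a sketch under the assumptions stated in this section (256 MiB default Keep cache, 5% derating); the function name is hypothetical, and the result is rounded up to a whole MiB so that the derated instance RAM still covers the request.

```python
def min_instance_ram_mib(ram_mib, keep_cache_mib=256, reserve_extra_mib=0):
    """Smallest published instance RAM (MiB) whose 95% covers the total request."""
    total = ram_mib + keep_cache_mib + reserve_extra_mib
    # Integer ceiling of total * 100 / 95, avoiding floating-point rounding.
    return (total * 100 + 94) // 95


# 3 GiB request, default Keep cache, no extra RAM reserved:
# 3328 * 100 / 95 is about 3503.2, which rounds up to 3504.
print(min_instance_ram_mib(3072))  # 3504
```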

Calculate the maximum requestable RAM for an instance type

(Instance type RAM * (95/100)) – Keep cache – ReserveExtraRAM

For example, for a 3.75 GiB node, default Keep cache, and no extra RAM reserved:

(3840 * 0.95) – 256 = 3392 MiB

To run on this instance type, the container can request at most 3392 MiB of working RAM.
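
The inverse calculation can be sketched the same way. The function name is hypothetical, and the defaults mirror the values used in this section (256 MiB Keep cache, no extra reservation).

```python
def max_requestable_ram_mib(instance_ram_mib, keep_cache_mib=256, reserve_extra_mib=0):
    """Largest working RAM (MiB) a container can request on this instance type."""
    usable = instance_ram_mib * 95 // 100  # published RAM derated by 5%
    return usable - keep_cache_mib - reserve_extra_mib


# 3.75 GiB node, default Keep cache, no extra RAM reserved.
print(max_requestable_ram_mib(3840))  # 3392
```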

Job API (deprecated)

  1. To submit work, create a job. If the same job has been submitted in the past, the system will return the existing job in the Completed state.
  2. The dispatcher process will notice a new job in Queued state and attempt to allocate nodes to run the job.
  3. The job executes.
  4. Retrieve the job's output field, which holds the portable data hash of the collection containing the job's output files.


The content of this documentation is licensed under the Creative Commons Attribution-Share Alike 3.0 United States licence.
Code samples in this documentation are licensed under the Apache License, Version 2.0.