crunch-dispatch-slurm is only relevant for on-premises clusters that will spool jobs to Slurm. Skip this section if you use LSF or if you are installing a cloud cluster.
This assumes you already have a Slurm cluster, and have set up all of your compute nodes with Docker or Singularity. Slurm packages are available on all distributions supported by Arvados; please see your distribution's package repositories. For information on installing Slurm from source, see the Slurm install guide.
The Arvados Slurm dispatcher can run on any node that can submit requests to both the Arvados API server and the Slurm controller (via sbatch). It is not resource-intensive, so you can run it on the API server node.
Crunch-dispatch-slurm reads the common configuration file at /etc/arvados/config.yml. Add a DispatchSLURM entry to the Services section, using the hostname where crunch-dispatch-slurm will run, and an available port:
Services:
  DispatchSLURM:
    InternalURLs:
      "http://hostname.zzzzz.arvadosapi.com:9007": {}
The following configuration parameters are optional.
Each Arvados container that runs on your HPC cluster will bring up a long-lived connection to the Arvados controller and keep it open for the entire duration of the container. This connection is used to access real-time container logs from Workbench, and to enable the container shell feature. Set the MaxGatewayTunnels config entry high enough to accommodate the maximum number of containers you expect to run concurrently on your HPC cluster, plus incoming container shell sessions.
API:
  MaxGatewayTunnels: 2000
Also, configure Nginx (and any other HTTP proxies or load balancers running between the HPC and Arvados controller) to allow the expected number of connections, i.e., MaxConcurrentRequests + MaxQueuedRequests + MaxGatewayTunnels.
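As a rough sketch (the numbers below are illustrative, not recommendations), the corresponding Nginx setting is worker_connections in the events block, which must exceed the total connection count you computed above:

```nginx
# Illustrative only: size this above
# MaxConcurrentRequests + MaxQueuedRequests + MaxGatewayTunnels,
# with headroom for other traffic through this proxy.
events {
    worker_connections 4096;
}
```

Remember that each proxied request can consume two connections (client side and upstream side), so leave generous headroom.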
crunch-dispatch-slurm polls the API server periodically for new containers to run. The PollInterval option controls how often this poll happens. Set this to a string of numbers suffixed with one of the time units ns, us, ms, s, m, or h. For example:
Containers:
  PollInterval: 3m30s
The ReserveExtraRAM option specifies extra RAM to reserve (in bytes) on each Slurm job submitted by Arvados, in addition to the amount specified in the container's runtime_constraints. If not provided, the default value is zero. This is helpful when using -cgroup-parent-subsystem, where crunch-run and arv-mount share the control group memory limit with the user process. In this situation, at least 256MiB is recommended to accommodate each container's crunch-run and arv-mount processes.
Supports suffixes KB, KiB, MB, MiB, GB, GiB, TB, TiB, PB, PiB, EB, EiB (where KB is 10³, KiB is 2¹⁰, MB is 10⁶, MiB is 2²⁰, and so forth).
Containers:
  ReserveExtraRAM: 256MiB
If Slurm is unable to run a container, the dispatcher will submit it again after the next PollInterval. If PollInterval is very short, this can be excessive. If MinRetryPeriod is set, the dispatcher will avoid submitting the same container to Slurm more than once in the given time span.
Containers:
  MinRetryPeriod: 30s
Some Arvados installations run a local keepstore on each compute node to handle all Keep traffic. To override Keep service discovery and access the local keep server instead of the global servers, set ARVADOS_KEEP_SERVICES in SbatchEnvironmentVariables:
Containers:
  SLURM:
    SbatchEnvironmentVariables:
      ARVADOS_KEEP_SERVICES: "http://127.0.0.1:25107"
crunch-dispatch-slurm adjusts the "nice" values of its Slurm jobs to ensure containers are prioritized correctly relative to one another. The PrioritySpread option tunes this adjustment mechanism, and tuning it can help avoid reaching Slurm's job priority limit. The smallest usable value is 1. The default value of 10 is used if this option is zero or negative. Example:
Containers:
  SLURM:
    PrioritySpread: 1000
When crunch-dispatch-slurm invokes sbatch, you can add arguments to the command by specifying SbatchArgumentsList. You can use this to send the jobs to specific cluster partitions or add resource requests. Set SbatchArgumentsList to an array of strings. For example:

Containers:
  SLURM:
    SbatchArgumentsList:
      - "--partition=PartitionName"
Note: If an argument is supplied multiple times, Slurm uses the value of the last occurrence of the argument on the command line. Arguments specified through Arvados are added after the arguments listed in SbatchArgumentsList. This means, for example, that an Arvados container that specifies partitions in scheduling_parameters will override an occurrence of --partition in SbatchArgumentsList. As a result, for container parameters that can be specified through Arvados, SbatchArgumentsList can be used to specify defaults but not to enforce specific policy.
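To illustrate (the partition names here are hypothetical), a cluster-wide default partition can be set in SbatchArgumentsList, and a container whose scheduling parameters name a different partition will still land there, because its --partition argument is appended later on the sbatch command line:

```yaml
Containers:
  SLURM:
    SbatchArgumentsList:
      # Default partition for all Arvados jobs (illustrative name).
      - "--partition=general"
# A container that requests the "gpu" partition produces a command line like
#   sbatch --partition=general ... --partition=gpu ...
# and Slurm honors the last occurrence, so "gpu" wins.
```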
If your Slurm cluster uses the task/cgroup TaskPlugin, you can configure Crunch's Docker containers to be dispatched inside Slurm's cgroups. This provides consistent enforcement of resource constraints. To do this, use a crunch-dispatch-slurm configuration like the following:
Containers:
  CrunchRunArgumentsList:
    - "-cgroup-parent-subsystem=memory"
When using cgroups v1, the choice of subsystem ("memory" in this example) must correspond to one of the resource types enabled in Slurm's cgroup.conf. The specified subsystem is singled out only to let Crunch determine the name of the cgroup provided by Slurm; limits for other resource types will also be respected. When doing this, you should also set ReserveExtraRAM.
Some versions of Docker (at least 1.9), when run under systemd, require the cgroup parent to be specified as a systemd slice. This causes an error when specifying a cgroup parent created outside systemd, such as those created by Slurm. You can work around this issue by disabling the Docker daemon's systemd integration. This makes it more difficult to manage Docker services with systemd, but Crunch does not require that functionality, and it will be able to use Slurm's cgroups as container parents. To do this, configure the Docker daemon on all compute nodes to run with the option --exec-opt native.cgroupdriver=cgroupfs.
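One way to apply this persistently (a sketch; adjust if your distribution configures Docker differently) is to set exec-opts in /etc/docker/daemon.json on each compute node:

```json
{
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
```

After editing the file, restart the daemon (e.g., systemctl restart docker) for the change to take effect.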
Older Linux kernels (prior to 3.18) have bugs in network namespace handling which can lead to compute node lockups. This is indicated by blocked kernel tasks in "Workqueue: netns cleanup_net". If you are experiencing this problem, as a workaround you can disable use of network namespaces by Docker across the cluster. Be aware this reduces container isolation, which may be a security risk.
Containers:
  CrunchRunArgumentsList:
    - "-container-enable-networking=always"
    - "-container-network-mode=host"
On Red Hat, AlmaLinux, and Rocky Linux distributions:
# dnf install crunch-dispatch-slurm
On Debian and Ubuntu distributions:
# apt install crunch-dispatch-slurm
Enable and start the service, then confirm it is running:
# systemctl enable --now crunch-dispatch-slurm
# systemctl status crunch-dispatch-slurm
[...]
If systemctl status indicates it is not running, use journalctl to check logs for errors:
# journalctl --since -5min -u crunch-dispatch-slurm
Make sure the cluster config file is up to date on the API server host, then restart the API server and controller processes to ensure the configuration changes are visible to the whole cluster.
# systemctl restart nginx arvados-controller
# arvados-server check
The content of this documentation is licensed under the Creative Commons Attribution-Share Alike 3.0 United States licence. Code samples in this documentation are licensed under the Apache License, Version 2.0.