Arvados CWL Extensions

Arvados provides several extensions to CWL for workflow optimization, site-specific configuration, and to enable access the Arvados API.

To use Arvados CWL extensions, add the following $namespaces section at the top of your CWL file:

$namespaces:
  arv: "http://arvados.org/cwl#"
  cwltool: "http://commonwl.org/cwltool#"

For portability, most Arvados extensions should go into the hints section of your CWL file. This makes it possible for your workflows to run other CWL runners that do not recognize Arvados hints. The difference between hints and requirements is that hints are optional features that can be ignored by other runners and still produce the same output, whereas requirements will fail the workflow if they cannot be fulfilled. For example, arv:IntermediateOutput should go in hints as it will have no effect on non-Arvados platforms, however if your workflow explicitly accesses the Arvados API and will fail without it, you should put arv:APIRequirement in requirements.

RunInSingleContainer
SeparateRunner
RuntimeConstraints
PartitionRequirement
APIRequirement
IntermediateOutput
Secrets
WorkflowRunnerResources
ClusterTarget
OutputStorageClass
ProcessProperties
OutputCollectionProperties
CUDARequirement
UsePreemptible
PreemptionBehavior
OutOfMemoryRetry

hints:
  arv:RunInSingleContainer: {}

  arv:SeparateRunner:
    runnerProcessName: $(inputs.sample_id)

  arv:RuntimeConstraints:
    keep_cache: 123456
    outputDirType: keep_output_dir

  arv:PartitionRequirement:
    partition: dev_partition

  arv:APIRequirement: {}

  arv:IntermediateOutput:
    outputTTL: 3600

  cwltool:Secrets:
    secrets: [input1, input2]

  arv:WorkflowRunnerResources:
    ramMin: 2048
    coresMin: 2
    keep_cache: 512

  arv:ClusterTarget:
    cluster_id: clsr1
    project_uuid: clsr1-j7d0g-qxc4jcji7n4lafx

  arv:OutputStorageClass:
    intermediateStorageClass: fast_storage
    finalStorageClass: robust_storage

  arv:ProcessProperties:
    processProperties:
      property1: value1
      property2: $(inputs.value2)

  arv:OutputCollectionProperties:
    outputProperties:
      property1: value1
      property2: $(inputs.value2)

  cwltool:CUDARequirement:
    cudaVersionMin: "11.0"
    cudaComputeCapability: "9.0"
    cudaDeviceCountMin: 1
    cudaDeviceCountMax: 1

  arv:UsePreemptible:
    usePreemptible: true

  arv:PreemptionBehavior:
    resubmitNonPreemptible: true

  arv:OutOfMemoryRetry:
    memoryRetryMultiplier: 2
    memoryErrorRegex: "custom memory error"

arv:RunInSingleContainer

Apply this to a workflow step that runs a subworkflow. Indicates that all the steps of the subworkflow should run together in a single container and not be scheduled separately. If you have a sequence of short-running steps (less than 1-2 minutes each) this enables you to avoid scheduling and data transfer overhead by running all the steps together at once. To use this feature, cwltool must be installed in the container image.

arv:SeparateRunner

Apply this to a workflow step that runs a subworkflow. Indicates that Arvados should launch a new workflow runner to manage that specific subworkflow instance. If used on a scatter step, each scatter item is launched separately. Using this option has three benefits:

Better organization in the “Subprocesses” table of the main workflow, including the ability to provide a custom name for the step
When re-running a batch that has run before, an entire subworkflow may be reused as a unit, which is faster than determining reuse for each step.
Significantly faster submit rate compared to invoking arvados-cwl-runner to launch individual workflow instances separately.

The disadvantage of this option is that because it does launch an additional workflow runner, that workflow runner consumes more compute resources compared to having all the steps managed by a single runner.

Field	Type	Description
runnerProcessName	optional string	Name to assign to the subworkflow process. May be an expression with an input context of the post-scatter workflow step invocation.

arv:RuntimeConstraints

Set Arvados-specific runtime hints.

Field	Type	Description
keep_cache	int	Size of file data buffer for Keep mount in MiB. Default is 256 MiB. Increase this to reduce cache thrashing in situations such as accessing multiple large (64+ MiB) files at the same time, or performing random access on a large file.
outputDirType	enum	Preferred backing store for output staging. If not specified, the system may choose which one to use. One of local_output_dir or keep_output_dir

local_output_dir: Use regular file system local to the compute node. There must be sufficient local scratch space to store entire output; specify this with outdirMin of ResourceRequirement. Files are batch uploaded to Keep when the process completes. Most compatible, but upload step can be time consuming for very large files.

keep_output_dir: Use writable Keep mount. Files are streamed to Keep as they are written. Does not consume local scratch space, but does consume RAM for output buffers (up to 192 MiB per file simultaneously open for writing.) Best suited to processes which produce sequential output of large files (non-sequential writes may produced fragmented file manifests). Supports regular files and directories, does not support special files such as symlinks, hard links, named pipes, named sockets, or device nodes.|

arv:PartitionRequirement

Select preferred compute partitions on which to run jobs.

Field	Type	Description
partition	string or array of strings

arv:APIRequirement

For CWL v1.1 scripts, if a step requires network access but not specifically access to the Arvados API server, prefer the standard feature NetworkAccess . In the future, these may be differentiated by whether ARVADOS_API_HOST and ARVADOS_API_TOKEN is injected into the container or not.

Indicates that process wants to access to the Arvados API. Will be granted network access and have ARVADOS_API_HOST and ARVADOS_API_TOKEN set in the environment. Tools which rely on the Arvados API being present should put arv:APIRequirement in the requirements section of the tool (rather than hints) to indicate that that it is not portable to non-Arvados CWL runners.

Use arv:APIRequirement in hints to enable general (non-Arvados-specific) network access for a tool.

arv:IntermediateOutput

Specify desired handling of intermediate output collections.

Field	Type	Description
outputTTL	int	If the value is greater than zero, consider intermediate output collections to be temporary and should be automatically trashed. Temporary collections will be trashed `outputTTL` seconds after creation. A value of zero means intermediate output should be retained indefinitely (this is the default behavior). Note: arvados-cwl-runner currently does not take workflow dependencies into account when setting the TTL on an intermediate output collection. If the TTL is too short, it is possible for a collection to be trashed before downstream steps that consume it are started. The recommended minimum value for TTL is the expected duration of the entire workflow.

cwltool:Secrets

Indicate that one or more input parameters are “secret”. Must be applied at the top level Workflow. Secret parameters are not stored in keep, are hidden from logs and API responses, and are wiped from the database after the workflow completes.

Note: currently, workflows with secrets must be submitted on the command line using arvados-cwl-runner. Workflows with secrets submitted through Workbench will not properly obscure the secret inputs.

Field	Type	Description
secrets	array	Input parameters which are considered “secret”. Must be strings.

arv:WorkflowRunnerResources

Specify resource requirements for the workflow runner process (arvados-cwl-runner) that manages a workflow run. Must be applied to the top level workflow. Will also be set implicitly when using --submit-runner-ram on the command line along with --create-workflow or --update-workflow. Use this to adjust the runner’s allocation if the workflow runner is getting “out of memory” exceptions or being killed by the out-of-memory (OOM) killer.

Field	Type	Description
ramMin	int	RAM, in mebibytes, to reserve for the arvados-cwl-runner process. Default 1 GiB
coresMin	int	Number of cores to reserve to the arvados-cwl-runner process. Default 1 core.
keep_cache	int	Size of collection metadata cache for the workflow runner, in MiB. Default 256 MiB. Will be added on to the RAM request when determining node size to request.

arv:ClusterTarget

Specify which Arvados cluster should execute a container or subworkflow, and the parent project for the container request.

Field	Type	Description
cluster_id	string	The five-character alphanumeric cluster id (uuid prefix) where a container or subworkflow will execute. May be an expression.
project_uuid	string	The uuid of the project which will own container request and output of the container. May be an expression.

arv:OutputStorageClass

Specify the storage class to use for intermediate and final outputs.

Field	Type	Description
intermediateStorageClass	string or array of strings	The storage class for output of intermediate steps. For example, faster “hot” storage.
finalStorageClass_uuid	string or array of strings	The storage class for the final output.

arv:ProcessProperties

Specify extra properties that will be set on container requests created by the workflow. May be set on a Workflow or a CommandLineTool. Setting custom properties on a container request simplifies queries to find the workflow run later on.

Field	Type	Description
processProperties	key-value map, or list of objects with the fields {propertyName, propertyValue}	The properties that will be set on the container request. May include expressions that reference `$(inputs)` of the current workflow or tool.

arv:OutputCollectionProperties

Specify custom properties that will be set on the output collection of the workflow step.

Field	Type	Description
outputProperties	key-value map, or list of objects with the fields {propertyName, propertyValue}	The properties that will be set on the output collection. May include expressions that reference `$(inputs)` of the current workflow or tool.

cwltool:CUDARequirement

Request support for Nvidia CUDA GPU acceleration in the container. Assumes that the CUDA runtime (SDK) is installed in the container, and the host will inject the CUDA driver libraries into the container (equal or later to the version requested).

Field	Type	Description
cudaVersionMin	string	Required. The CUDA SDK version corresponding to the minimum driver version supported by the container (generally, the SDK version ‘X.Y’ the application was compiled against).
cudaComputeCapability	string	Required. The minimum CUDA hardware capability (in ‘X.Y’ format) required by the application’s PTX or C++ GPU code (will be JIT compiled for the available hardware).
cudaDeviceCountMin	integer	Minimum number of GPU devices to allocate on a single node. Required.
cudaDeviceCountMax	integer	Maximum number of GPU devices to allocate on a single node. Optional. If not specified, same as `cudaDeviceCountMin`.

arv:UsePreemptible

Specify whether a workflow step should request preemptible (e.g. AWS Spot market) instances. Such instances are generally cheaper, but can be taken back by the cloud provider at any time (preempted) causing the step to fail. When this happens, Arvados will automatically re-try the step, up to the configuration value of Containers.MaxRetryAttempts (default 3) times.

Field	Type	Description
usePreemptible	boolean	Required, true to opt-in to using preemptible instances, false to opt-out.

arv:PreemptionBehavior

This option determines the behavior when arvados-cwl-runner detects that a workflow step was cancelled because the preemptible (spot market) instance it was running on was reclaimed by the cloud provider. If ‘true’, instead of the retry behavior described above in ‘UsePreemptible’, on the first failure the workflow step will be re-submitted with preemption disabled, so it will be scheduled to run on non-preemptible (on-demand) instances.

When preemptible instances are reclaimed, this is a signal that the cloud provider has restricted capacity for low priority preemptible instance. As a result, the default behavior of turning around and rescheduling or launching on another preemptible instance has higher risk of being preempted a second or third time, spending more time and money but making no progress. This option provides an alternate fallback behavior, by attempting to run the step on a preemptible instance the first time (saving money), but re-running the step as non-preemptible if the first attempt was preempted (ensuring continued progress).

This behavior applied to each step individually. If a step is preempted, then successfully re-run as non-preemptible, it does not affect the behavior of the next step, which will first be launched as preemptible, and so forth.

Field	Type	Description
resubmitNonPreemptible	boolean	Required. If true, then when a workflow step is cancelled because the instance was preempted, re-submit the step with preemption disabled.

arv:OutOfMemoryRetry

Specify that when a workflow step appears to have failed because it did not request enough RAM, it should be re-submitted with more RAM. Out of memory conditions are detected either by the container being unexpectedly killed (exit code 137) or by matching a pattern in the container’s output (see memoryErrorRegex). Retrying will increase the base RAM request by the value of memoryRetryMultiplier. For example, if the original RAM request was 10 GiB and the multiplier is 1.5, then it will re-submit with 15 GiB.

Containers are only re-submitted once. If it fails a second time after increasing RAM, then the worklow step will still fail.

Also note that expressions that use $(runtime.ram) (such as dynamic command line parameters) are not reevaluated when the container is resubmitted.

Field	Type	Description
memoryRetryMultiplier	float	Optional, default value is 2. The retry will multiply the base memory request by this factor to get the retry memory request.
memoryErrorRegex	string	Optional, a custom regex that, if found in the stdout, stderr or crunch-run logging of a program, will trigger a retry with greater RAM. If not provided, the default pattern matches “out of memory” (with or without spaces), “memory error” (with or without spaces), “bad_alloc” and “container using over 90% of memory”.

arv:dockerCollectionPDH

This is an optional extension field appearing on the standard DockerRequirement. It specifies the portable data hash of the Arvados collection containing the Docker image. If present, it takes precedence over dockerPull or dockerImageId.

requirements:
  DockerRequirement:
    dockerPull: "debian:10"
    arv:dockerCollectionPDH: "feaf1fc916103d7cdab6489e1f8c3a2b+174"

Deprecated extensions

The following extensions are deprecated because equivalent features are part of the CWL v1.1 standard.

hints:
  cwltool:LoadListingRequirement:
    loadListing: shallow_listing
  arv:ReuseRequirement:
    enableReuse: false
  cwltool:TimeLimit:
    timelimit: 14400

cwltool:LoadListingRequirement

For CWL v1.1 scripts, this is deprecated in favor of loadListing or LoadListingRequirement

In CWL v1.0 documents, the default behavior for Directory objects is to recursively expand the listing for access by parameter references an expressions. For directory trees containing many files, this can be expensive in both time and memory usage. Use cwltool:LoadListingRequirement to change the behavior for expansion of directory listings in the workflow runner.

Field	Type	Description
loadListing	string	One of `no_listing`, `shallow_listing`, or `deep_listing`

no_listing: Do not expand directory listing at all. The listing field on the Directory object will be undefined.

shallow_listing: Only expand the first level of directory listing. The listing field on the toplevel Directory object will contain the directory contents, however listing will not be defined on subdirectories.

deep_listing: Recursively expand all levels of directory listing. The listing field will be provided on the toplevel object and all subdirectories.

arv:ReuseRequirement

For CWL v1.1 scripts, this is deprecated in favor of WorkReuse .

Enable/disable work reuse for current process. Default true (work reuse enabled).

Field	Type	Description
enableReuse	boolean	Enable/disable work reuse for current process. Default true (work reuse enabled).

cwltool:TimeLimit

For CWL v1.1 scripts, this is deprecated in favor of ToolTimeLimit

Set an upper limit on the execution time of a CommandLineTool or ExpressionTool. A tool execution which exceeds the time limit may be preemptively terminated and considered failed. May also be used by batch systems to make scheduling decisions.

Field	Type	Description
timelimit	int	Execution time limit in seconds. If set to zero, no limit is enforced.

Previous: CWL Resources Next: CWL version support