Arvados provides several extensions to CWL for workflow optimization, site-specific configuration, and access to the Arvados API.
To use Arvados CWL extensions, add the following $namespaces section at the top of your CWL file:
$namespaces:
  arv: "http://arvados.org/cwl#"
  cwltool: "http://commonwl.org/cwltool#"
For portability, most Arvados extensions should go into the hints section of your CWL file. This makes it possible for your workflows to run on other CWL runners that do not recognize Arvados hints. The difference between hints and requirements is that hints are optional features that can be ignored by other runners and still produce the same output, whereas requirements will fail the workflow if they cannot be fulfilled. For example, arv:IntermediateOutput should go in hints as it will have no effect on non-Arvados platforms; however, if your workflow explicitly accesses the Arvados API and will fail without it, you should put arv:APIRequirement in requirements.
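As a minimal sketch of the distinction above, the tool below declares arv:APIRequirement under requirements (it cannot work without the Arvados API) and arv:IntermediateOutput under hints (other runners may ignore it). The script name list_collections.py is hypothetical and only stands in for a program that calls the Arvados API.

```yaml
cwlVersion: v1.0
class: CommandLineTool
$namespaces:
  arv: "http://arvados.org/cwl#"
requirements:
  # Hard dependency: the tool fails on runners without the Arvados API.
  arv:APIRequirement: {}
hints:
  # Optional: ignored by runners that do not recognize it.
  arv:IntermediateOutput:
    outputTTL: 3600
# Hypothetical script that talks to the Arvados API.
baseCommand: [python3, list_collections.py]
inputs: []
outputs: []
```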
hints:
  arv:RunInSingleContainer: {}
  arv:RuntimeConstraints:
    keep_cache: 123456
    outputDirType: keep_output_dir
  arv:PartitionRequirement:
    partition: dev_partition
  arv:APIRequirement: {}
  cwltool:LoadListingRequirement:
    loadListing: shallow_listing
  arv:IntermediateOutput:
    outputTTL: 3600
  arv:ReuseRequirement:
    enableReuse: false
  cwltool:Secrets:
    secrets: [input1, input2]
  cwltool:TimeLimit:
    timelimit: 14400
  arv:WorkflowRunnerResources:
    ramMin: 2048
    coresMin: 2
    keep_cache: 512
  arv:ClusterTarget:
    cluster_id: clsr1
    project_uuid: clsr1-j7d0g-qxc4jcji7n4lafx
Apply arv:RunInSingleContainer to a workflow step that runs a subworkflow. It indicates that all the steps of the subworkflow should run together in a single container and not be scheduled separately. If you have a sequence of short-running steps (less than 1-2 minutes each), this lets you avoid scheduling and data transfer overhead by running all the steps together at once. To use this feature, cwltool must be installed in the container image.
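The sketch below shows where the hint would sit: on the step whose run target is a subworkflow. The file name short-steps-subworkflow.cwl and the input and output names are hypothetical.

```yaml
cwlVersion: v1.0
class: Workflow
$namespaces:
  arv: "http://arvados.org/cwl#"
requirements:
  SubworkflowFeatureRequirement: {}
inputs:
  sample: File
outputs:
  result:
    type: File
    outputSource: short_steps/result
steps:
  short_steps:
    # Hypothetical subworkflow made up of several short-running steps.
    run: short-steps-subworkflow.cwl
    in:
      sample: sample
    out: [result]
    hints:
      # Run every step of the subworkflow inside one container.
      arv:RunInSingleContainer: {}
```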
arv:RuntimeConstraints sets Arvados-specific runtime hints.
Field | Type | Description |
---|---|---|
keep_cache | int | Size of file data buffer for Keep mount in MiB. Default is 256 MiB. Increase this to reduce cache thrashing in situations such as accessing multiple large (64+ MiB) files at the same time, or performing random access on a large file. |
outputDirType | enum | Preferred backing store for output staging. If not specified, the system may choose which one to use. One of local_output_dir or keep_output_dir |
local_output_dir: Use the regular file system local to the compute node. There must be sufficient local scratch space to store the entire output; specify this with outdirMin of ResourceRequirement. Files are batch uploaded to Keep when the process completes. Most compatible, but the upload step can be time consuming for very large files.
keep_output_dir: Use a writable Keep mount. Files are streamed to Keep as they are written. Does not consume local scratch space, but does consume RAM for output buffers (up to 192 MiB per file simultaneously open for writing). Best suited to processes which produce sequential output of large files (non-sequential writes may produce fragmented file manifests). Supports regular files and directories; does not support special files such as symlinks, hard links, named pipes, named sockets, or device nodes.
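For illustration, a tool that writes its output to local scratch space might pair local_output_dir with an outdirMin reservation, as in this sketch. The sizes are arbitrary example values, not recommendations.

```yaml
hints:
  arv:RuntimeConstraints:
    # Larger Keep cache (MiB) for reading several large files at once.
    keep_cache: 1024
    # Stage output on node-local disk; uploaded to Keep on completion.
    outputDirType: local_output_dir
  ResourceRequirement:
    # Scratch space reserved for the output directory, in MiB (example value).
    outdirMin: 20000
```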
arv:PartitionRequirement selects the preferred compute partitions on which to run jobs.
Field | Type | Description |
---|---|---|
partition | string or array of strings |
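Since partition accepts either a single string or an array of strings, the array form might look like the following sketch. Both partition names are hypothetical and depend on how the cluster is configured.

```yaml
hints:
  arv:PartitionRequirement:
    # Hypothetical partition names; use the ones defined on your cluster.
    partition: [dev_partition, highmem_partition]
```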
For CWL v1.1 scripts, if a step requires network access but not specifically access to the Arvados API server, prefer the standard feature NetworkAccess. In the future, these may be differentiated by whether ARVADOS_API_HOST and ARVADOS_API_TOKEN are injected into the container or not.
arv:APIRequirement indicates that the process needs access to the Arvados API. The process will be granted network access and have ARVADOS_API_HOST and ARVADOS_API_TOKEN set in the environment. Tools which rely on the Arvados API being present should put arv:APIRequirement in the requirements section of the tool (rather than hints) to indicate that the tool is not portable to non-Arvados CWL runners.
Use arv:APIRequirement in hints to enable general (non-Arvados-specific) network access for a tool.
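For a CWL v1.1 tool that only needs general network access, the standard NetworkAccess requirement can be written as in this sketch. The command and URL are placeholders chosen for illustration.

```yaml
cwlVersion: v1.1
class: CommandLineTool
requirements:
  # Standard CWL v1.1 feature; no Arvados-specific extension needed.
  NetworkAccess:
    networkAccess: true
# Placeholder command that needs outbound network access.
baseCommand: [curl, "https://example.com/"]
inputs: []
outputs: []
```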
arv:IntermediateOutput specifies the desired handling of intermediate output collections.
Field | Type | Description |
---|---|---|
outputTTL | int | If the value is greater than zero, intermediate output collections are considered temporary and are automatically trashed outputTTL seconds after creation. A value of zero means intermediate output should be retained indefinitely (this is the default behavior). Note: arvados-cwl-runner currently does not take workflow dependencies into account when setting the TTL on an intermediate output collection. If the TTL is too short, it is possible for a collection to be trashed before downstream steps that consume it are started. The recommended minimum value for TTL is the expected duration of the entire workflow. |
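As a worked example of the recommendation above, assume a workflow that is expected to take about 6 hours from start to finish. A TTL of 8 hours (8 × 3600 = 28800 seconds) leaves some margin; both durations are assumptions made for this sketch.

```yaml
hints:
  arv:IntermediateOutput:
    # 8 hours = 8 * 3600 seconds; chosen to exceed an assumed 6-hour run.
    outputTTL: 28800
```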
cwltool:Secrets indicates that one or more input parameters are “secret”. It must be applied at the top-level Workflow. Secret parameters are not stored in Keep, are hidden from logs and API responses, and are wiped from the database after the workflow completes.
Note: currently, workflows with secrets must be submitted on the command line using arvados-cwl-runner. Workflows with secrets submitted through Workbench will not properly obscure the secret inputs.
Field | Type | Description |
---|---|---|
secrets | array | Input parameters which are considered “secret”. Must be strings. |
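A fragment of a top-level workflow marking one input as secret could look like the sketch below. The input names are hypothetical, and the outputs and steps sections are omitted for brevity.

```yaml
cwlVersion: v1.0
class: Workflow
$namespaces:
  cwltool: "http://commonwl.org/cwltool#"
hints:
  cwltool:Secrets:
    # Names of the input parameters to treat as secret.
    secrets: [db_password]
inputs:
  # Secret inputs must be strings (hypothetical name).
  db_password: string
  # Ordinary, non-secret input (hypothetical name).
  db_host: string
# outputs and steps omitted from this fragment
```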
arv:WorkflowRunnerResources specifies resource requirements for the workflow runner process (arvados-cwl-runner) that manages a workflow run. It must be applied to the top-level workflow. It will also be set implicitly when using --submit-runner-ram on the command line along with --create-workflow or --update-workflow. Use this to adjust the runner’s allocation if the workflow runner is getting “out of memory” exceptions or being killed by the out-of-memory (OOM) killer.
Field | Type | Description |
---|---|---|
ramMin | int | RAM, in mebibytes, to reserve for the arvados-cwl-runner process. Default 1 GiB. |
coresMin | int | Number of cores to reserve for the arvados-cwl-runner process. Default 1 core. |
keep_cache | int | Size of collection metadata cache for the workflow runner, in MiB. Default 256 MiB. Will be added on to the RAM request when determining node size to request. |
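For instance, a runner that is being killed for lack of memory at the default allocation might be given more RAM as in this sketch. The numbers are illustrative only; because keep_cache is added on to the RAM request, the node would be sized for 4096 + 512 = 4608 MiB.

```yaml
hints:
  arv:WorkflowRunnerResources:
    ramMin: 4096      # MiB reserved for arvados-cwl-runner (example value)
    coresMin: 2
    keep_cache: 512   # MiB, added on to the RAM request
```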
arv:ClusterTarget specifies which Arvados cluster should execute a container or subworkflow, and the parent project for the container request.
Field | Type | Description |
---|---|---|
cluster_id | string | The five-character alphanumeric cluster id (uuid prefix) where a container or subworkflow will execute. May be an expression. |
project_uuid | string | The uuid of the project which will own the container request and the output of the container. May be an expression. |
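Because both fields may be expressions, the target cluster can be chosen at run time. The sketch below assumes the expression is evaluated against the inputs of the process carrying the hint; the input name target_cluster is hypothetical.

```yaml
cwlVersion: v1.0
class: CommandLineTool
$namespaces:
  arv: "http://arvados.org/cwl#"
hints:
  arv:ClusterTarget:
    # Assumption: parameter references resolve against this tool's inputs.
    cluster_id: $(inputs.target_cluster)
baseCommand: [hostname]
inputs:
  # Hypothetical input holding a five-character cluster id such as clsr1.
  target_cluster: string
outputs: []
```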
arv:dockerCollectionPDH is an optional extension field appearing on the standard DockerRequirement. It specifies the portable data hash of the Arvados collection containing the Docker image. If present, it takes precedence over dockerPull or dockerImageId.
requirements:
  DockerRequirement:
    dockerPull: "debian:10"
    arv:dockerCollectionPDH: "feaf1fc916103d7cdab6489e1f8c3a2b+174"
The following extensions are deprecated because equivalent features are part of the CWL v1.1 standard.
For CWL v1.1 scripts, cwltool:LoadListingRequirement is deprecated in favor of loadListing or LoadListingRequirement.
In CWL v1.0 documents, the default behavior for Directory objects is to recursively expand the listing for access by parameter references and expressions. For directory trees containing many files, this can be expensive in both time and memory usage. Use cwltool:LoadListingRequirement to change the behavior for expansion of directory listings in the workflow runner.
Field | Type | Description |
---|---|---|
loadListing | string | One of no_listing, shallow_listing, or deep_listing |
no_listing: Do not expand the directory listing at all. The listing field on the Directory object will be undefined.
shallow_listing: Only expand the first level of the directory listing. The listing field on the top-level Directory object will contain the directory contents; however, listing will not be defined on subdirectories.
deep_listing: Recursively expand all levels of the directory listing. The listing field will be provided on the top-level object and all subdirectories.
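In a CWL v1.1 document, the standard equivalent can be written without the cwltool namespace, as in this sketch:

```yaml
cwlVersion: v1.1
requirements:
  # Standard CWL v1.1 replacement for cwltool:LoadListingRequirement.
  LoadListingRequirement:
    loadListing: shallow_listing
```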
For CWL v1.1 scripts, arv:ReuseRequirement is deprecated in favor of WorkReuse.
Enable or disable work reuse for the current process. Default true (work reuse enabled).
Field | Type | Description |
---|---|---|
enableReuse | boolean | Enable/disable work reuse for current process. Default true (work reuse enabled). |
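The CWL v1.1 form of the same setting, shown here as a sketch that disables reuse for one process:

```yaml
cwlVersion: v1.1
requirements:
  # Standard CWL v1.1 replacement for arv:ReuseRequirement.
  WorkReuse:
    enableReuse: false
```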
For CWL v1.1 scripts, cwltool:TimeLimit is deprecated in favor of ToolTimeLimit.
Set an upper limit on the execution time of a CommandLineTool or ExpressionTool. A tool execution which exceeds the time limit may be preemptively terminated and considered failed. May also be used by batch systems to make scheduling decisions.
Field | Type | Description |
---|---|---|
timelimit | int | Execution time limit in seconds. If set to zero, no limit is enforced. |
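The CWL v1.1 form, sketched with a four-hour limit (4 × 3600 = 14400 seconds):

```yaml
cwlVersion: v1.1
requirements:
  # Standard CWL v1.1 replacement for cwltool:TimeLimit.
  ToolTimeLimit:
    timelimit: 14400
```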
The content of this documentation is licensed under the Creative Commons Attribution-Share Alike 3.0 United States licence.
Code samples in this documentation are licensed under the Apache License, Version 2.0.