To get the best performance from your workflows, be aware of the following Arvados features, behaviors, and best practices.
Use cwltool:CUDARequirement to request nodes with GPUs.
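For example, a minimal sketch of a GPU request (the version, compute capability, and command are illustrative; check your tool's actual requirements):

cwlVersion: v1.0
class: CommandLineTool
$namespaces:
  cwltool: "http://commonwl.org/cwltool#"
requirements:
  cwltool:CUDARequirement:
    cudaVersionMin: "11.0"          # minimum CUDA version the tool needs
    cudaComputeCapability: "3.0"    # minimum compute capability
    cudaDeviceCountMin: 1           # request at least one GPU
baseCommand: nvidia-smi
inputs: []
outputs: []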
Try using preemptible (spot) instances.
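For example, a minimal sketch that marks a step as eligible to run on a preemptible instance using the arv:UsePreemptible extension (tool1.cwl is illustrative):

cwlVersion: v1.0
class: Workflow
$namespaces:
  arv: "http://arvados.org/cwl#"
inputs:
  inp: File
outputs: []
steps:
  step1:
    in: {inp: inp}
    out: []
    run: tool1.cwl
    hints:
      arv:UsePreemptible:
        usePreemptible: true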
If you have a sequence of short-running steps (less than 1-2 minutes each), use the Arvados extension arv:RunInSingleContainer to avoid scheduling and data transfer overhead by running all the steps together in the same container on the same node. To use this feature, cwltool must be installed in the container image. Example:
class: Workflow
cwlVersion: v1.0
$namespaces:
  arv: "http://arvados.org/cwl#"
inputs:
  file: File
outputs: []
requirements:
  SubworkflowFeatureRequirement: {}
steps:
  subworkflow-with-short-steps:
    in:
      file: file
    out: [out]
    # This hint indicates that the subworkflow should be bundled and
    # run in a single container, instead of the normal behavior, which
    # is to run each step in a separate container.  This greatly
    # reduces overhead if you have a series of short jobs, without
    # requiring any changes to the CWL definition of the subworkflow.
    hints:
      - class: arv:RunInSingleContainer
    run: subworkflow-with-short-steps.cwl
Avoid declaring InlineJavascriptRequirement or ShellCommandRequirement unless you specifically need them. Don't include them "just in case" because they change the default behavior and may add extra overhead.
When combining a parameter value with a string, such as adding a filename extension, write $(inputs.file.basename).ext instead of $(inputs.file.basename + 'ext'). The first form is evaluated as a simple text substitution; the second form (using the + operator) is evaluated as an arbitrary Javascript expression and requires that you declare InlineJavascriptRequirement.
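For example, a minimal sketch using the simple substitution form, which needs no InlineJavascriptRequirement (the gzip command and .gz extension are illustrative):

cwlVersion: v1.0
class: CommandLineTool
baseCommand: [gzip, -c]
inputs:
  file:
    type: File
    inputBinding: {position: 1}
# simple text substitution; no InlineJavascriptRequirement needed
stdout: $(inputs.file.basename).gz
outputs:
  out:
    type: File
    outputBinding:
      glob: $(inputs.file.basename).gz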
Use ExpressionTool to efficiently rearrange input files between steps of a Workflow. For example, the following expression accepts a directory containing files paired by _R1_ and _R2_ and produces an array of Directories containing each pair.
class: ExpressionTool
cwlVersion: v1.0
inputs:
  inputdir: Directory
outputs:
  out: Directory[]
requirements:
  InlineJavascriptRequirement: {}
expression: |
  ${
    var samples = {};
    for (var i = 0; i < inputs.inputdir.listing.length; i++) {
      var file = inputs.inputdir.listing[i];
      var groups = file.basename.match(/^(.+)(_R[12]_)(.+)$/);
      if (groups) {
        if (!samples[groups[1]]) {
          samples[groups[1]] = [];
        }
        samples[groups[1]].push(file);
      }
    }
    var dirs = [];
    for (var key in samples) {
      dirs.push({"class": "Directory",
                 "basename": key,
                 "listing": samples[key]});
    }
    return {"out": dirs};
  }
Available compute node types vary over time and across different cloud providers, so it is important to limit the RAM requirement to what the program actually needs. However, if you need to target a specific compute node type, see this discussion on calculating RAM request and choosing instance type for containers.
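For example, a minimal sketch that sizes a tool to its actual needs (the command and numbers are illustrative):

cwlVersion: v1.0
class: CommandLineTool
baseCommand: [sort]
inputs:
  file:
    type: File
    inputBinding: {position: 1}
stdout: sorted.txt
outputs:
  out:
    type: stdout
hints:
  ResourceRequirement:
    ramMin: 8000       # MiB; request only what the program actually uses
    coresMin: 1
    tmpdirMin: 45000   # MiB of scratch space for temporary files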
Instead of a scatter step that feeds into another scatter step, prefer to scatter over a subworkflow.
With the following pattern, step1 has to wait for all samples to complete before step2 can start computing on any samples. This means a single long-running sample can prevent the rest of the workflow from moving on:
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  inp: File[]
outputs:
  out:
    type: File[]
    outputSource: step3/out
steps:
  step1:
    in: {inp: inp}
    scatter: inp
    out: [out]
    run: tool1.cwl
  step2:
    in: {inp: step1/out}
    scatter: inp
    out: [out]
    run: tool2.cwl
  step3:
    in: {inp: step2/out}
    scatter: inp
    out: [out]
    run: tool3.cwl
Instead, scatter over a subworkflow. In this pattern, a sample can proceed to step2 as soon as step1 is done, independently of any other samples. Example (note: the subworkflow can also be put in a separate file):
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}
inputs:
  inp: File[]
outputs:
  out:
    type: File[]
    outputSource: step1/out
steps:
  step1:
    in: {inp: inp}
    scatter: inp
    out: [out]
    run:
      class: Workflow
      inputs:
        inp: File
      outputs:
        out:
          type: File
          outputSource: step3/out
      steps:
        step1:
          in: {inp: inp}
          out: [out]
          run: tool1.cwl
        step2:
          in: {inp: step1/out}
          out: [out]
          run: tool2.cwl
        step3:
          in: {inp: step2/out}
          out: [out]
          run: tool3.cwl
To write workflows that are easy to modify and portable across CWL runners (in the event you need to share your workflow with others), there are several best practices to follow:
Workflows should always provide DockerRequirement in the hints or requirements section.
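For example (the container image is illustrative):

cwlVersion: v1.0
class: CommandLineTool
hints:
  DockerRequirement:
    dockerPull: ubuntu:22.04
baseCommand: [cat]
inputs:
  file:
    type: File
    inputBinding: {position: 1}
stdout: out.txt
outputs:
  out:
    type: stdout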
Build a reusable library of components. Share tool wrappers and subworkflows between projects. Make use of and contribute to community-maintained workflows and tools, and tool registries such as Dockstore.
CommandLineTools wrapping custom scripts should represent the script as an input parameter with the script file as a default value. Use secondaryFiles for scripts that consist of multiple files. For example:
cwlVersion: v1.0
class: CommandLineTool
baseCommand: python
inputs:
  script:
    type: File
    inputBinding: {position: 1}
    default:
      class: File
      location: bclfastq.py
      secondaryFiles:
        - class: File
          location: helper1.py
        - class: File
          location: helper2.py
  inputfile:
    type: File
    inputBinding: {position: 2}
outputs:
  out:
    type: File
    outputBinding:
      glob: "*.fastq"
You can get the designated temporary directory using $(runtime.tmpdir) in your CWL file, or from the $TMPDIR environment variable in your script.

Similarly, you can get the designated output directory using $(runtime.outdir), or from the $HOME environment variable in your script.
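For example, a minimal sketch that just prints both locations (purely illustrative):

cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
arguments:
  - $(runtime.tmpdir)   # same value a script would see in $TMPDIR
  - $(runtime.outdir)   # the designated output directory
inputs: []
stdout: dirs.txt
outputs:
  out:
    type: stdout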
Avoid specifying resources in the requirements section of a CommandLineTool; put them in the hints section instead. This enables you to override the tool's resource hint with a workflow step level requirement:
cwlVersion: v1.0
class: Workflow
inputs:
  inp: File
outputs:
  out:
    type: File
    outputSource: step2/out
steps:
  step1:
    in: {inp: inp}
    out: [out]
    run: tool1.cwl
  step2:
    in: {inp: step1/out}
    out: [out]
    run: tool2.cwl
    requirements:
      ResourceRequirement:
        ramMin: 2000
        coresMin: 2
        tmpdirMin: 90000