Best Practices for writing CWL

  • To run on Arvados, a workflow should provide a DockerRequirement in the hints section.
  • When combining a parameter value with a string, such as adding a filename extension, write $(inputs.file.basename).ext instead of $(inputs.file.basename + '.ext'). The first form is evaluated as a simple text substitution; the second form (using the + operator) is evaluated as an arbitrary JavaScript expression and requires that you declare InlineJavascriptRequirement.
  • Avoid declaring InlineJavascriptRequirement or ShellCommandRequirement unless you specifically need them. Don’t include them “just in case” because they change the default behavior and may imply extra overhead.
  • Don’t write CWL scripts that access the Arvados SDK. This is non-portable; a script that accesses Arvados directly won’t work with cwltool or crunch v2.
  • CommandLineTools wrapping custom scripts should represent the script as an input parameter with the script file as a default value. Use secondaryFiles for scripts that consist of multiple files. For example:
cwlVersion: v1.0
class: CommandLineTool
baseCommand: python
inputs:
  script:
    type: File
    inputBinding: {position: 1}
    default:
      class: File
      location: bclfastq.py
      secondaryFiles:
        - class: File
          location: helper1.py
        - class: File
          location: helper2.py
  inputfile:
    type: File
    inputBinding: {position: 2}
outputs:
  out:
    type: File
    outputBinding:
      glob: "*.fastq"
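As a minimal sketch of the filename-extension advice earlier in this list, the hypothetical tool below builds a new filename with a plain parameter reference, so it does not declare InlineJavascriptRequirement. The cp command, the .bak suffix, and the parameter name file are illustrative assumptions, not part of any existing workflow:

```yaml
cwlVersion: v1.0
class: CommandLineTool
# Hypothetical example: copy the input file into the output directory
# under the same name with ".bak" appended.
baseCommand: cp
inputs:
  file:
    type: File
    inputBinding: {position: 1}
arguments:
  # Parameter reference followed by literal text. This is evaluated as
  # string interpolation, not as a JavaScript expression.
  - valueFrom: $(inputs.file.basename).bak
    position: 2
outputs:
  out:
    type: File
    outputBinding:
      glob: $(inputs.file.basename).bak
```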
  • You can get the designated temporary directory using $(runtime.tmpdir) in your CWL file, or from the $TMPDIR environment variable in your script.
  • Similarly, you can get the designated output directory using $(runtime.outdir) in your CWL file, or from the $HOME environment variable in your script.
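The two runtime references above can be sketched in a single tool. This is only an illustration: sort is used as a familiar command that accepts a temporary-directory option (-T) and an output-file option (-o), and the output name sorted.txt is an assumption:

```yaml
cwlVersion: v1.0
class: CommandLineTool
# Hypothetical example: sort a file, keeping scratch data in the
# designated temporary directory and writing the result to the
# designated output directory.
baseCommand: sort
arguments:
  - prefix: -T
    valueFrom: $(runtime.tmpdir)
    position: 1
  - prefix: -o
    valueFrom: $(runtime.outdir)/sorted.txt
    position: 2
inputs:
  inputfile:
    type: File
    inputBinding: {position: 3}
outputs:
  out:
    type: File
    outputBinding:
      glob: sorted.txt
```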
  • Use ExpressionTool to efficiently rearrange input files between steps of a Workflow. For example, the following expression accepts a directory containing files paired by _R1_ and _R2_ and produces an array of Directories containing each pair.
class: ExpressionTool
cwlVersion: v1.0
inputs:
  inputdir: Directory
outputs:
  out: Directory[]
requirements:
  InlineJavascriptRequirement: {}
expression: |
  ${
    var samples = {};
    for (var i = 0; i < inputs.inputdir.listing.length; i++) {
      var file = inputs.inputdir.listing[i];
      var groups = file.basename.match(/^(.+)(_R[12]_)(.+)$/);
      if (groups) {
        if (!samples[groups[1]]) {
          samples[groups[1]] = [];
        }
        samples[groups[1]].push(file);
      }
    }
    var dirs = [];
    for (var key in samples) {
      dirs.push({"class": "Directory",
                 "basename": key,
                 "listing": samples[key]});
    }
    return {"out": dirs};
  }
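A workflow might use the ExpressionTool above roughly as follows. This is a sketch under stated assumptions: the filenames group-pairs.cwl and process-pair.cwl, the step names, and the pair input are all hypothetical, and process-pair.cwl stands for any tool that accepts one Directory and produces one File:

```yaml
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  inputdir: Directory
outputs:
  out:
    type: File[]
    outputSource: process/out
steps:
  group:
    # Hypothetical filename holding the ExpressionTool shown above.
    in: {inputdir: inputdir}
    out: [out]
    run: group-pairs.cwl
  process:
    # Hypothetical tool that runs once per paired-read Directory.
    in: {pair: group/out}
    scatter: pair
    out: [out]
    run: process-pair.cwl
```

Because the regrouping happens in an expression, no container has to be started just to rearrange the files.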
  • Avoid specifying resource requirements in CommandLineTool. Prefer to specify them in the workflow. You can provide a default resource requirement in the top level hints section, and individual steps can override it with their own resource requirement.
cwlVersion: v1.0
class: Workflow
inputs:
  inp: File
outputs:
  out:
    type: File
    outputSource: step2/out
hints:
  ResourceRequirement:
    ramMin: 1000
    coresMin: 1
    tmpdirMin: 45000
steps:
  step1:
    in: {inp: inp}
    out: [out]
    run: tool1.cwl
  step2:
    in: {inp: step1/out}
    out: [out]
    run: tool2.cwl
    hints:
      ResourceRequirement:
        ramMin: 2000
        coresMin: 2
        tmpdirMin: 90000
  • Instead of scattering separate steps, prefer to scatter over a subworkflow.

With the following pattern, step2 cannot start computing on any sample until step1 has completed for all samples. This means a single long-running sample can prevent the rest of the workflow from moving on:

cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  inp: File[]
steps:
  step1:
    in: {inp: inp}
    scatter: inp
    out: [out]
    run: tool1.cwl
  step2:
    in: {inp: step1/out}
    scatter: inp
    out: [out]
    run: tool2.cwl
  step3:
    in: {inp: step2/out}
    scatter: inp
    out: [out]
    run: tool3.cwl

Instead, scatter over a subworkflow. In this pattern, a sample can proceed to step2 as soon as step1 is done, independently of any other samples.
For example (the subworkflow can also be placed in a separate file):

cwlVersion: v1.0
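class: Workflow
requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}
inputs:
  inp: File[]
outputs:
  out:
    type: File[]
    outputSource: step1/out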
steps:
  step1:
    in: {inp: inp}
    scatter: inp
    out: [out]
    run:
      class: Workflow
      inputs:
        inp: File
      outputs:
        out:
          type: File
          outputSource: step3/out
      steps:
        step1:
          in: {inp: inp}
          out: [out]
          run: tool1.cwl
        step2:
          in: {inp: step1/out}
          out: [out]
          run: tool2.cwl
        step3:
          in: {inp: step2/out}
          out: [out]
          run: tool3.cwl
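
If the subworkflow is kept in a separate file, the outer workflow reduces to a single scattered step. The sketch below assumes the inner three-step Workflow has been saved under the hypothetical name per-sample.cwl, and that the outer workflow takes an array of files and returns an array of results:

```yaml
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}
inputs:
  inp: File[]
outputs:
  out:
    type: File[]
    outputSource: step1/out
steps:
  step1:
    in: {inp: inp}
    scatter: inp
    out: [out]
    # Hypothetical file containing the inner Workflow (step1 to step3).
    run: per-sample.cwl
```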


The content of this documentation is licensed under the Creative Commons Attribution-Share Alike 3.0 United States licence.
Code samples in this documentation are licensed under the Apache License, Version 2.0.