Guidelines for Writing High-Performance Portable Workflows

Performance

  1. Does your application support NVIDIA GPU acceleration?
  2. Trying to reduce costs?
  3. You have a sequence of short-running steps
  4. Avoid declaring InlineJavascriptRequirement or ShellCommandRequirement
  5. Prefer text substitution to Javascript
  6. Use ExpressionTool to efficiently rearrange input files
  7. Limit RAM requests to what you really need
  8. Avoid scattering step by step

Portability

  1. Always provide DockerRequirement
  2. Build a reusable library of components
  3. Supply scripts as input parameters
  4. Getting the temporary and output directories
  5. Specifying ResourceRequirement

Data import

  1. Importing data into Keep from HTTP
  2. Importing data into Keep from S3

Performance

To get the best performance from your workflows, be aware of the following Arvados features, behaviors, and best practices.

Does your application support NVIDIA GPU acceleration?

Use cwltool:CUDARequirement to request nodes with GPUs.
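
A minimal sketch of a tool requesting a GPU node follows; the CUDA version, compute capability, and device count values are illustrative, so check the cwltool extensions documentation for the fields your runner supports.

cwlVersion: v1.0
class: CommandLineTool
$namespaces:
  cwltool: "http://commonwl.org/cwltool#"
requirements:
  cwltool:CUDARequirement:
    # Minimum CUDA runtime version, compute capability, and GPU count
    # needed by the tool (values here are illustrative)
    cudaVersionMin: "11.0"
    cudaComputeCapability: "8.0"
    cudaDeviceCountMin: 1
baseCommand: nvidia-smi
inputs: []
outputs: []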

Trying to reduce costs?

Try using preemptible (spot) instances.
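
As a sketch, assuming your cluster is configured to offer spot instances and the arv:UsePreemptible extension is available, a workflow-level hint like the following requests them:

$namespaces:
  arv: "http://arvados.org/cwl#"
hints:
  arv:UsePreemptible:
    # Run this workflow's containers on preemptible (spot) instances
    usePreemptible: true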

You have a sequence of short-running steps

If you have a sequence of short-running steps (less than 1-2 minutes each), use the Arvados extension arv:RunInSingleContainer to avoid scheduling and data transfer overhead by running all the steps together in the same container on the same node. To use this feature, cwltool must be installed in the container image. Example:

class: Workflow
cwlVersion: v1.0
$namespaces:
  arv: "http://arvados.org/cwl#"
inputs:
  file: File
outputs: []
requirements:
  SubworkflowFeatureRequirement: {}
steps:
  subworkflow-with-short-steps:
    in:
      file: file
    out: [out]
    # This hint indicates that the subworkflow should be bundled and
    # run in a single container, instead of the normal behavior, which
    # is to run each step in a separate container.  This greatly
    # reduces overhead if you have a series of short jobs, without
    # requiring any changes to the CWL definition of the subworkflow.
    hints:
      - class: arv:RunInSingleContainer
    run: subworkflow-with-short-steps.cwl

Avoid declaring InlineJavascriptRequirement or ShellCommandRequirement

Avoid declaring InlineJavascriptRequirement or ShellCommandRequirement unless you specifically need them. Don’t include them “just in case” because they change the default behavior and may add extra overhead.

Prefer text substitution to Javascript

When combining a parameter value with a string, such as adding a filename extension, write $(inputs.file.basename).ext instead of $(inputs.file.basename + '.ext'). The first form is evaluated as a simple text substitution; the second form (using the + operator) is evaluated as an arbitrary Javascript expression and requires that you declare InlineJavascriptRequirement.
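
For example, in a hypothetical tool's arguments section, the first line below is a plain text substitution, while the commented-out alternative would force you to declare InlineJavascriptRequirement:

# Plain text substitution; no Javascript engine needed
arguments: ["--output-name", $(inputs.file.basename).ext]
# Equivalent Javascript form; requires InlineJavascriptRequirement
# arguments: ["--output-name", $(inputs.file.basename + '.ext')]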

Use ExpressionTool to efficiently rearrange input files

Use ExpressionTool to efficiently rearrange input files between steps of a Workflow. For example, the following expression accepts a directory containing files paired by _R1_ and _R2_ and produces an array of Directories containing each pair.

class: ExpressionTool
cwlVersion: v1.0
inputs:
  inputdir: Directory
outputs:
  out: Directory[]
requirements:
  InlineJavascriptRequirement: {}
expression: |
  ${
    var samples = {};
    for (var i = 0; i < inputs.inputdir.listing.length; i++) {
      var file = inputs.inputdir.listing[i];
      var groups = file.basename.match(/^(.+)(_R[12]_)(.+)$/);
      if (groups) {
        if (!samples[groups[1]]) {
          samples[groups[1]] = [];
        }
        samples[groups[1]].push(file);
      }
    }
    var dirs = [];
    for (var key in samples) {
      dirs.push({"class": "Directory",
                 "basename": key,
                 "listing": [samples[key]]});
    }
    return {"out": dirs};
  }

Limit RAM requests to what you really need

Available compute node types vary over time and across different cloud providers, so it is important to limit the RAM requirement to what the program actually needs. However, if you need to target a specific compute node type, see this discussion on calculating RAM requests and choosing an instance type for containers.
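
For example, a tool known to peak at around 4 GB of RAM might declare a modest request in its hints section (the numbers here are illustrative):

hints:
  ResourceRequirement:
    # ramMin is in mebibytes; request only what the program actually uses
    ramMin: 4000
    coresMin: 1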

Avoid scattering step by step

Instead of a scatter step that feeds into another scatter step, prefer to scatter over a subworkflow.

With the following pattern, step1 has to wait for all samples to complete before step2 can start computing on any samples. This means a single long-running sample can prevent the rest of the workflow from moving on:

cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  inp: File[]
outputs:
  out:
    type: File[]
    outputSource: step3/out
steps:
  step1:
    in: {inp: inp}
    scatter: inp
    out: [out]
    run: tool1.cwl
  step2:
    in: {inp: step1/out}
    scatter: inp
    out: [out]
    run: tool2.cwl
  step3:
    in: {inp: step2/out}
    scatter: inp
    out: [out]
    run: tool3.cwl

Instead, scatter over a subworkflow. In this pattern, a sample can proceed to step2 as soon as step1 is done, independently of any other samples.
Example (note that the subworkflow can also be put in a separate file):

cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}
inputs:
  inp: File[]
outputs:
  out:
    type: File[]
    outputSource: step1/out
steps:
  step1:
    in: {inp: inp}
    scatter: inp
    out: [out]
    run:
      class: Workflow
      inputs:
        inp: File
      outputs:
        out:
          type: File
          outputSource: step3/out
      steps:
        step1:
          in: {inp: inp}
          out: [out]
          run: tool1.cwl
        step2:
          in: {inp: step1/out}
          out: [out]
          run: tool2.cwl
        step3:
          in: {inp: step2/out}
          out: [out]
          run: tool3.cwl

Portability

To write workflows that are easy to modify and portable across CWL runners (in the event you need to share your workflow with others), there are several best practices to follow:

Always provide DockerRequirement

Workflows should always provide DockerRequirement in the hints or requirements section.
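
For example (the image name below is a placeholder for your own published image):

hints:
  DockerRequirement:
    # Pin a specific image tag so runs are reproducible
    dockerPull: quay.io/example/mytool:1.0.0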

Build a reusable library of components

Share tool wrappers and subworkflows between projects. Make use of, and contribute to, community-maintained workflows, tools, and tool registries such as Dockstore.

Supply scripts as input parameters

CommandLineTools wrapping custom scripts should represent the script as an input parameter with the script file as a default value. Use secondaryFiles for scripts that consist of multiple files. For example:

cwlVersion: v1.0
class: CommandLineTool
baseCommand: python
inputs:
  script:
    type: File
    inputBinding: {position: 1}
    default:
      class: File
      location: bclfastq.py
      secondaryFiles:
        - class: File
          location: helper1.py
        - class: File
          location: helper2.py
  inputfile:
    type: File
    inputBinding: {position: 2}
outputs:
  out:
    type: File
    outputBinding:
      glob: "*.fastq"

Getting the temporary and output directories

You can get the designated temporary directory using $(runtime.tmpdir) in your CWL file, or from the $TMPDIR environment variable in your script.

Similarly, you can get the designated output directory using $(runtime.outdir), or from the HOME environment variable in your script.
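
As a sketch, assuming a hypothetical my_tool that accepts --scratch-dir and --output-dir options, the runtime directories can be passed on the command line like this:

cwlVersion: v1.0
class: CommandLineTool
baseCommand: my_tool
arguments:
  # Pass the designated scratch and output directories to the tool
  - {prefix: "--scratch-dir", valueFrom: $(runtime.tmpdir)}
  - {prefix: "--output-dir", valueFrom: $(runtime.outdir)}
inputs:
  inputfile:
    type: File
    inputBinding: {position: 1}
outputs:
  out:
    type: File
    outputBinding:
      glob: "*.result"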

Specifying ResourceRequirement

Avoid specifying resources in the requirements section of a CommandLineTool; put them in the hints section instead. This enables you to override the tool's resource hint with a workflow step level requirement:

cwlVersion: v1.0
class: Workflow
inputs:
  inp: File
outputs:
  out:
    type: File
    outputSource: step2/out
steps:
  step1:
    in: {inp: inp}
    out: [out]
    run: tool1.cwl
  step2:
    in: {inp: step1/out}
    out: [out]
    run: tool2.cwl
    requirements:
      ResourceRequirement:
        ramMin: 2000
        coresMin: 2
        tmpdirMin: 90000

Data import

Importing data into Keep from HTTP

You can use HTTP URLs as File input parameters and arvados-cwl-runner will download them to Keep for you:

fastq1:
  class: File
  location: https://example.com/genomes/sampleA_1.fastq
fastq2:
  class: File
  location: https://example.com/genomes/sampleA_2.fastq

Files are downloaded and stored in Keep collections with HTTP header information stored in metadata. If a file was previously downloaded, arvados-cwl-runner uses HTTP caching rules to decide if a file should be re-downloaded or not.

The default behavior is to transfer the files on the client, prior to submitting the workflow run. This guarantees the data is available when the workflow is submitted. However, you can use the --defer-download option to have the data transfer performed by the workflow runner process on a compute node, after the workflow is submitted. There are a couple of reasons you may want to do this:

  1. You are submitting from a workstation, but expect the file will be downloaded faster by the compute node
  2. You are submitting multiple workflow runs in a row and want to parallelize downloads

arvados-cwl-runner provides two additional options to control caching behavior.

  • --varying-url-params will ignore the listed URL query parameters from any HTTP URLs when checking if a URL has already been downloaded to Keep.
  • --prefer-cached-downloads will search Keep for the previously downloaded URL and use that if found, without checking the upstream resource. This means changes in the upstream resource won’t be detected, but it also means the workflow will not fail if the upstream resource becomes inaccessible.

One use of this is to import files from AWS S3 signed URLs (but note that you can also import from S3 natively, see below).

Here is an example usage. The use of --varying-url-params=AWSAccessKeyId,Signature,Expires is especially relevant: it removes these parameters from the cached URL, which means that if a new signed URL for the same object is generated later, it can be found in the cache.

arvados-cwl-runner --defer-download \
                   --varying-url-params=AWSAccessKeyId,Signature,Expires \
                   --prefer-cached-downloads \
                   workflow.cwl params.yml

Importing data into Keep from S3

You can use S3 URLs as File input parameters and arvados-cwl-runner will download them to Keep for you:

fastq1:
  class: File
  location: s3://examplebucket/genomes/sampleA_1.fastq
fastq2:
  class: File
  location: s3://examplebucket/genomes/sampleA_2.fastq

Files are downloaded and stored in Keep collections. If the bucket is versioned, it will make note of the object version and last modified time. If a file was previously downloaded, arvados-cwl-runner will use the object version and/or last modified time to decide if a file should be re-downloaded or not. The --prefer-cached-downloads option will search Keep for the previously downloaded URL and use that if found, without checking the upstream resource. This means changes in the upstream resource won’t be detected, but it also means the workflow will not fail if the upstream resource becomes inaccessible.

The default behavior is to transfer the files on the client, prior to submitting the workflow run. This guarantees the data is available when the workflow is submitted. However, you can use the --defer-download option to have the data transfer performed by the workflow runner process on a compute node, after the workflow is submitted. There are several reasons you may want to do this:

  1. You are submitting from a workstation, but expect the file will be downloaded faster by the compute node
  2. You are submitting multiple workflow runs in a row and want to parallelize downloads
  3. You don’t have credentials to access the S3 bucket locally but do have read access to AWS credentials registered with Arvados

When using the --defer-download option, arvados-cwl-runner uses the following process to choose which AWS credentials to use to access the S3 bucket.

  1. Arvados will first check to see if you have access to a credential with credential_class: aws_access_key where the S3 bucket is present in scopes (the bucket name must be formatted as s3://bucketname).
  2. Otherwise, Arvados will look for a credential with credential_class: aws_access_key where scopes is empty.

In either case, if more than one credential matches, arvados-cwl-runner will report an error, and you must provide --use-credential on the command line with the name or UUID of the credential to specify precisely which one to use.

If no AWS credentials are registered with Arvados, but you have AWS credentials available locally (for example, in ~/.aws/credentials), you can use --enable-aws-credential-capture. This instructs arvados-cwl-runner to capture the active AWS credentials from your environment and pass them to the workflow runner container as a secret file. In this case, these credentials are only stored in Arvados for the duration of the workflow run and are discarded when the workflow finishes. Arvados uses the boto3 library to access S3, which has a list of locations where it will search for credentials.

If the source S3 bucket is a public bucket, you can download from it without AWS credentials by providing --s3-public-bucket on the command line.
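
For example, the following sketch defers the transfer to the compute node and captures your local AWS credentials for the duration of the run, combining the options described above:

arvados-cwl-runner --defer-download \
                   --enable-aws-credential-capture \
                   workflow.cwl params.yml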


