Writing a pipeline template

Note:

Arvados pipeline templates are deprecated. The recommended way to develop new workflows for Arvados is the Common Workflow Language (CWL).

This tutorial demonstrates how to construct a two-stage pipeline template that uses the bwa mem tool to produce a Sequence Alignment/Map (SAM) file, then uses the Picard SortSam tool to produce a coordinate-sorted BAM (Binary Alignment/Map) file.

Note:

This tutorial assumes that you are logged into an Arvados VM instance (instructions for Webshell or Unix or Windows) or you have installed the Arvados FUSE Driver and Python SDK on your workstation and have a working environment.

Use arv create pipeline_template to create an empty template:

~$ arv create pipeline_template

This will open the template record in an interactive text editor (as specified by $EDITOR or $VISUAL, otherwise nano by default). Replace the contents of the editor with the following:

{
    "name": "Tutorial align using bwa mem and SortSam",
    "components": {
        "bwa-mem": {
            "script": "run-command",
            "script_version": "master",
            "repository": "arvados",
            "script_parameters": {
                "command": [
                    "$(dir $(bwa_collection))/bwa",
                    "mem",
                    "-t",
                    "$(node.cores)",
                    "-R",
                    "@RG\\\tID:group_id\\\tPL:illumina\\\tSM:sample_id",
                    "$(glob $(dir $(reference_collection))/*.fasta)",
                    "$(glob $(dir $(sample))/*_1.fastq)",
                    "$(glob $(dir $(sample))/*_2.fastq)"
                ],
                "reference_collection": {
                    "required": true,
                    "dataclass": "Collection"
                },
                "bwa_collection": {
                    "required": true,
                    "dataclass": "Collection",
                    "default": "39c6f22d40001074f4200a72559ae7eb+5745"
                },
                "sample": {
                    "required": true,
                    "dataclass": "Collection"
                },
                "task.stdout": "$(basename $(glob $(dir $(sample))/*_1.fastq)).sam"
            },
            "runtime_constraints": {
                "docker_image": "bcosc/arv-base-java",
                "arvados_sdk_version": "master"
            }
        },
        "SortSam": {
            "script": "run-command",
            "script_version": "847459b3c257aba65df3e0cbf6777f7148542af2",
            "repository": "arvados",
            "script_parameters": {
                "command": [
                    "java",
                    "-Xmx4g",
                    "-Djava.io.tmpdir=$(tmpdir)",
                    "-jar",
                    "$(dir $(picard))/SortSam.jar",
                    "CREATE_INDEX=True",
                    "SORT_ORDER=coordinate",
                    "VALIDATION_STRINGENCY=LENIENT",
                    "INPUT=$(glob $(dir $(input))/*.sam)",
                    "OUTPUT=$(basename $(glob $(dir $(input))/*.sam)).sort.bam"
                ],
                "input": {
                    "output_of": "bwa-mem"
                },
                "picard": {
                    "required": true,
                    "dataclass": "Collection",
                    "default": "88447c464574ad7f79e551070043f9a9+1970"
                }
            },
            "runtime_constraints": {
                "docker_image": "bcosc/arv-base-java",
                "arvados_sdk_version": "master"
            }
        }
    }
}
  • "name" is a human-readable name for the pipeline.
  • "components" is a set of scripts or commands that make up the pipeline. Each component is given an identifier ("bwa-mem" and "SortSam") in this example).
    • Each entry in components "components" is an Arvados job submission. For more information about individual jobs, see the job resource reference.
  • "repository", "script_version", and "script" indicate that we intend to use the external "run-command" tool wrapper that is part of the Arvados. These parameters are described in more detail in Writing a script.
  • "runtime_constraints" describes runtime resource requirements for the component.
    • "docker_image" specifies the Docker runtime environment in which to run the job. The Docker image "bcosc/arv-base-java" supplied here has the Java runtime environment, bwa, and samtools installed.
    • "arvados_sdk_version" specifies a version of the Arvados SDK to load alongside the job’s script. The example uses ‘master’. If you would like to use a specific version of the sdk, you can find it in the Arvados Python sdk repository under Latest revisions.
  • "script_parameters" describes the component parameters.
    • "command" is the actual command line to invoke the bwa and then SortSam. The notation $() denotes macro substitution commands evaluated by the run-command tool wrapper.
    • "task.stdout" indicates that the output of this command should be captured to a file.
    • $(node.cores) evaluates to the number of cores available on the compute node at time the command is run.
    • $(tmpdir) evaluates to the local path for temporary directory the command should use for scratch data.
    • $(reference_collection) evaluates to the script_parameter "reference_collection"
    • $(dir $(...)) constructs a local path to a directory representing the supplied Arvados collection.
    • $(file $(...)) constructs a local path to a given file within the supplied Arvados collection.
    • $(glob $(...)) searches the specified path based on a file glob pattern and evalutes to the first result.
    • $(basename $(...)) evaluates to the supplied path with leading path portion and trailing filename extensions stripped
  • "output_of" indicates that the output of the bwa-mem component should be used as the "input" script parameter of SortSam. Arvados uses these dependencies between components to automatically determine the correct order to run them.

When using run-command, the tool should write its output to the current working directory. The output will be automatically uploaded to Keep when the job completes.
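
For example, once the job finishes you can list the files in its output collection with arv keep ls (the collection locator below is a placeholder; use the output locator reported for your job):

~$ arv keep ls ab12cd34ef56ab12cd34ef56ab12cd34+1234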

See the run-command reference for more information about using run-command.

Note: To get job reproducibility without re-computation, you need to pin these parameters to specific hashes. Using a symbolic version such as master in "arvados_sdk_version" resolves to the latest commit hash, so Arvados will re-compute your job whenever the SDK is updated.

  • "arvados_sdk_version" : The latest version can be found on the Arvados Python sdk repository under Latest revisions.
  • "script_version" : The current version of your script in your git repository can be found by using the following command:
~$ git rev-parse HEAD
  • "docker_image" : The docker image hash used is found on the Collection page as the Content address.

Running your pipeline

Your new pipeline template should appear at the top of the Workbench pipeline templates page. You can run your pipeline using Workbench or the command line.
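
For example, a command-line run looks something like the following sketch. The template UUID is a placeholder for the UUID Workbench shows for your new template, and required script parameters such as "sample" must also be supplied (or filled in through Workbench):

~$ arv pipeline run --template qr1hi-p5p6p-xxxxxxxxxxxxxxx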

Test data is available in the Arvados Tutorial project.

For more information and examples of writing pipelines, see the pipeline template reference.

Re-using your pipeline run

Arvados allows users to re-use jobs that have the same inputs in order to save computing time and resources. Users are able to change a downstream job without re-computing earlier jobs. This section shows which version-control parameters must be pinned to make sure Arvados will not re-compute your jobs.

Note: Job reuse can only happen if none of the input collections have changed.

  • "arvados_sdk_version" : The arvados_sdk_version parameter is used to download the specific version of the Arvados sdk into the docker image. The latest version can be found in the Arvados Python sdk repository under Latest revisions. Make sure you set this to the same version as the previous run that you are trying to reuse.
  • "script_version" : The script_version is the commit hash of the git branch that the crunch script resides in. This information can be found in your git repository by using the following command:
~$ git rev-parse HEAD
  • "docker_image" : This specifies the Docker runtime environment where jobs run their scripts. Docker version control is similar to git, and you can commit and push changes to your images. You must re-use the docker image hash from the previous run to use the same image. It can be found on the Collection page as the Content address or the docker_image_locator in a job’s metadata.

The content of this documentation is licensed under the Creative Commons Attribution-Share Alike 3.0 United States license.
Code samples in this documentation are licensed under the Apache License, Version 2.0.