Tools for writing Crunch pipelines

Note:

Arvados pipeline templates are deprecated. The recommended way to develop new workflows for Arvados is using the Common Workflow Language.

Arvados includes a number of tools to help you develop pipelines and jobs for Crunch. This overview explains each tool’s intended use to help you choose the right one.

Use the arv-run command-line utility

arv-run is an interactive command-line tool. You prefix a traditional Unix shell command line with arv-run, and it converts that work into an Arvados pipeline. It automatically uploads any required data to Arvados and dispatches work in parallel when possible. This lets you migrate analysis work that you already do on the command line to Arvados compute nodes.

arv-run is best suited to complement work you already do on the command line. If you write a shell one-liner that generates useful data, you can then call it with arv-run to parallelize it across a larger data set and save the results in Arvados. For example, this run searches multiple FASTQ files in parallel, and saves the results to Keep through shell redirection:

$ cd ~/keep/by_id/3229739b505d2b878b62aed09895a55a+142
$ ls *.fastq
$ arv-run grep -H -n ATTGGAGGAAAGATGAGTGAC \< *.fastq \> output.txt
[...]
 1 stderr run-command: grep -H -n ATTGGAGGAAAGATGAGTGAC < /keep/3229739b505d2b878b62aed09895a55a+142/HWI-ST1027_129_D0THKACXX.1_1.fastq > output.txt
 2 stderr run-command: grep -H -n ATTGGAGGAAAGATGAGTGAC < /keep/3229739b505d2b878b62aed09895a55a+142/HWI-ST1027_129_D0THKACXX.1_2.fastq > output.txt
 2 stderr run-command: completed with exit code 0 (success)
 2 stderr run-command: the following output files will be saved to keep:
 2 stderr run-command: 121 ./output.txt
 2 stderr run-command: start writing output to keep
 1 stderr run-command: completed with exit code 0 (success)
 1 stderr run-command: the following output files will be saved to keep:
 1 stderr run-command: 363 ./output.txt
 1 stderr run-command: start writing output to keep
 2 stderr upload wrote 121 total 121
 1 stderr upload wrote 363 total 363
[...]
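To make the per-file fan-out concrete, here is a minimal local analogue in Python. It is not how arv-run is implemented and it does not talk to Arvados; it only mirrors the idea visible in the log above, where each FASTQ file is searched as an independent task. The function names grep_file and parallel_grep are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def grep_file(path, pattern):
    """Return 'name:lineno:line' strings for lines containing pattern,
    roughly what `grep -H -n` prints for a single file."""
    matches = []
    with open(path, "r") as handle:
        for lineno, line in enumerate(handle, start=1):
            if pattern in line:
                matches.append("{}:{}:{}".format(
                    Path(path).name, lineno, line.rstrip("\n")))
    return matches


def parallel_grep(paths, pattern, workers=4):
    """Search every file as its own task (local threads here, not
    Arvados compute nodes) and return the matches keyed by file name."""
    paths = list(paths)  # the list is walked twice below
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: grep_file(p, pattern), paths)
        return {Path(p).name: found for p, found in zip(paths, results)}
```

Calling parallel_grep(sorted(Path(".").glob("*.fastq")), "ATTGGAGGAAAGATGAGTGAC") would return one list of matches per file, much as the run above produces one output.txt per task.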

arv-run neither generates pipeline templates nor implements higher-level shell constructs such as flow control. If you want to rerun your pipeline easily with different data later, or adapt it to different inputs, it’s best to write your own template.

Refer to the arv-run documentation for details.

Write a pipeline template

Pipeline templates describe a set of analysis programs that should be run, and the inputs they require. You can provide a high-level description of how data flows through the pipeline—for example, the outputs of programs A and B are provided as input to program C—and let Crunch take care of the details of starting the individual programs at the right time with the inputs you specified.

Pipeline templates are written in JSON. Once you save a pipeline template in Arvados, you run it by creating a pipeline instance that lists the specific inputs you’d like to use. Arvados Workbench and the arv pipeline run command-line tool both provide high-level interfaces to do this easily. The pipeline’s final output(s) will be saved in a project you specify.
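As a rough sketch of the data flow described above (the outputs of programs A and B feeding program C), the following Python builds such a template as a dictionary and serializes it to JSON. The script names, the repository name, and the parameter names are placeholders. The "output_of" parameter form is assumed here to be the way one component refers to another's output, so check it against the User Guide before relying on it.

```python
import json


def component(script, parameters):
    """Build one pipeline component. The repository and version are
    placeholder values, not real defaults."""
    return {
        "script": script,
        "script_version": "master",
        "repository": "your_repo",  # hypothetical repository name
        "script_parameters": parameters,
    }


def build_template():
    """Template in which program_c consumes the outputs of A and B."""
    collection_input = {"required": True, "dataclass": "Collection"}
    return {
        "name": "A and B feed C (sketch)",
        "components": {
            "program_a": component("program-a.py",
                                   {"input": dict(collection_input)}),
            "program_b": component("program-b.py",
                                   {"input": dict(collection_input)}),
            # Assumed syntax: "output_of" names the upstream component.
            "program_c": component("program-c.py", {
                "input_a": {"output_of": "program_a"},
                "input_b": {"output_of": "program_b"},
            }),
        },
    }


def upstream_of(template, name):
    """Return the sorted names of components whose output feeds `name`."""
    params = template["components"][name]["script_parameters"]
    return sorted(value["output_of"] for value in params.values()
                  if isinstance(value, dict) and "output_of" in value)


def to_json(template):
    """Serialize the template as indented JSON text."""
    return json.dumps(template, indent=4)
```

Here upstream_of(build_template(), "program_c") lists program_a and program_b, which is the dependency information Crunch needs in order to start program C only after both inputs exist.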

See the User Guide topic to learn how to write and run your own pipelines. The rest of this page suggests specific tools to use in your templates.

The run-command Crunch script

run-command is a Crunch script that is included with Arvados. It builds a command line from its input parameters. It runs that command on files in Collections using the Keep mount provided by Crunch. Output files created by the command are saved in a new collection, which is considered the program’s final output. It can run the command in parallel on a list of inputs, and introspect arguments so you can, for example, generate output filenames based on input filenames.

run-command is a great way to use an existing analysis tool inside an Arvados pipeline. You might use one or two tools in a larger pipeline, or convert a simple series of tool invocations into a pipeline to benefit from Arvados’ provenance tracking and job reuse. For example, here’s a one-step pipeline that uses run-command with bwa to align a single paired-end read FASTQ sample:

{
    "name":"run-command example pipeline",
    "components":{
        "bwa-mem": {
            "script": "run-command",
            "script_version": "master",
            "repository": "arvados",
            "script_parameters": {
                "command": [
                    "$(dir $(bwa_collection))/bwa",
                    "mem",
                    "-t",
                    "$(node.cores)",
                    "-R",
                    "@RG\\\tID:group_id\\\tPL:illumina\\\tSM:sample_id",
                    "$(glob $(dir $(reference_collection))/*.fasta)",
                    "$(glob $(dir $(sample))/*_1.fastq)",
                    "$(glob $(dir $(sample))/*_2.fastq)"
                ],
                "reference_collection": {
                    "required": true,
                    "dataclass": "Collection"
                },
                "bwa_collection": {
                    "required": true,
                    "dataclass": "Collection",
                    "default": "39c6f22d40001074f4200a72559ae7eb+5745"
                },
                "sample": {
                    "required": true,
                    "dataclass": "Collection"
                },
                "task.stdout": "$(basename $(glob $(dir $(sample))/*_1.fastq)).sam"
            }
        }
    }
}
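The "task.stdout" parameter in the template above derives the output filename from the input filename. The Python below is a toy illustration of that naming rule only, not run-command's actual code. It assumes that $(basename ...) drops both the leading directory and the final file extension, which is worth confirming in the run-command reference. The helper names are hypothetical.

```python
import os.path


def strip_dir_and_extension(path):
    """Toy stand-in for $(basename ...): drop the directory part and
    the final extension (assumed behavior, see the note above)."""
    return os.path.splitext(os.path.basename(path))[0]


def stdout_name(first_read_path):
    """Name the SAM output after the first-read FASTQ file, mirroring
    the "task.stdout" expression in the template."""
    return strip_dir_and_extension(first_read_path) + ".sam"
```

Under that assumption, a first-read file named HWI-ST1027_129_D0THKACXX.1_1.fastq would yield an output file named HWI-ST1027_129_D0THKACXX.1_1.sam.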

run-command is limited to manipulating the tool’s command-line arguments, and can only parallelize on simple lists of inputs. If you need to preprocess input, or dispatch work differently based on those inputs, consider writing your own Crunch script.

Refer to the run-command reference for details.

Writing your own Crunch script with the Python SDK

Arvados includes a Python SDK designed to help you write your own Crunch scripts. It provides a native Arvados API client; Collection classes that provide file-like objects to interact with data in Keep; and utility functions to work within Crunch’s execution environment. Using the Python SDK, you can efficiently dispatch work with however much sophistication you require.

Writing your own Crunch script is the best way to do analysis in Arvados when an existing tool does not meet your needs. By interacting directly with Arvados objects, you’ll have full power to introspect and adapt to your input, introduce minimal overhead, and get very direct error messages in case there’s any trouble. As a simple example, here’s a Crunch script that checksums each file in a collection in parallel, saving the results in Keep:

#!/usr/bin/env python

import hashlib      # Compute the MD5 digest.
import os           # Basic path manipulation.
import arvados      # The Arvados Python SDK.

# Automatically parallelize this job by running one task per file.
# If the input consists of many files, each file is processed in
# parallel on a different node, so the job completes more quickly.
arvados.job_setup.one_task_per_input_file(if_sequence=0, and_end_task=True,
                                          input_as_path=True)

# Get object representing the current task
this_task = arvados.current_task()

# Create the message digest object that will compute the MD5 hash
digestor = hashlib.new('md5')

# Get the input file for the task
input_id, input_path = this_task['parameters']['input'].split('/', 1)

# Open the input collection
input_collection = arvados.CollectionReader(input_id)

# Open the input file for reading
with input_collection.open(input_path) as input_file:
    for buf in input_file.readall():  # Iterate the file's data blocks
        digestor.update(buf)          # Update the MD5 hash object

# Write a new collection as output
out = arvados.CollectionWriter()

# Write an output file with one line: the MD5 value and input path
with out.open('md5sum.txt') as out_file:
    out_file.write("{} {}/{}\n".format(digestor.hexdigest(), input_id,
                                       os.path.normpath(input_path)))

# Commit the output to Keep.
output_locator = out.finish()

# Use the resulting locator as the output for this task.
this_task.set_output(output_locator)

# Done!

There’s no limit to what you can do with your own Crunch script. The downside is the amount of time and effort you’re required to invest to write and debug new code. If you have to do that anyway, writing a Crunch script will give you the most benefit from using Arvados.

Refer to the User Guide topic on writing Crunch scripts and the Python SDK reference for details.

Combining run-command and custom Crunch scripts in a pipeline

Just because you need to write some new code to do some work doesn’t mean that you have to do all the work in your own Crunch script. You can combine your custom steps with existing tools in a pipeline, passing data between them. For example, maybe there’s a third-party tool that does most of the analysis work you need, but you often need to massage the tool’s data. You could write your own preprocessing script that creates a new collection to use as the input of a run-command job, or a postprocessing script to create a final output after the tool is done, and tie them all together in a pipeline. Just like Unix pipes, Arvados pipelines let you combine smaller tools to maximize utility.

Using run-command with your legacy scripts

Perhaps you’ve already written your own analysis program that you want to run inside Arvados. Currently, the easiest way to do that is to copy run-command from the Arvados source code to your own Arvados git repository, along with your internal tool. Then your pipeline can call run-command from your own repository to execute the internal tool alongside it.

This approach has the downside that you’ll have to copy and push run-command again any time there’s an update you’d like to use. Future Arvados development will make it possible to get code from multiple git repositories, so your job can use the latest run-command in the Arvados source, as well as the latest tool in your own git repository. Follow Arvados issue #4561 for updates.

Alternatively, you can build a Docker image that includes your program, add it to Arvados, then run the Arvados run-command script inside that Docker image.
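The two options in this section can be sketched as small edits to a run-command component, written here in Python as operations on the component dictionary. The repository name, image name, and tool name are placeholders. Selecting the image through a "docker_image" key under "runtime_constraints" is an assumption about the job format that should be checked against the API reference.

```python
import copy

# Baseline: run-command taken from the Arvados source repository,
# as in the bwa example earlier on this page.
BASE_COMPONENT = {
    "script": "run-command",
    "script_version": "master",
    "repository": "arvados",
    "script_parameters": {"command": ["your-tool"]},  # placeholder tool
}


def from_own_repository(component, repository):
    """Option 1: call the copy of run-command pushed to your own
    repository, next to your internal tool."""
    updated = copy.deepcopy(component)
    updated["repository"] = repository
    return updated


def inside_docker_image(component, image):
    """Option 2: keep the stock run-command but run it inside a Docker
    image that contains your program (assumed key name)."""
    updated = copy.deepcopy(component)
    updated.setdefault("runtime_constraints", {})["docker_image"] = image
    return updated
```

Both helpers return a modified copy and leave BASE_COMPONENT untouched, so the two variants can be compared side by side before either is saved in a template.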
