Running on an Arvados cluster

Note:

Arvados pipeline templates are deprecated. The recommended way to develop new workflows for Arvados is to use the Common Workflow Language.

This tutorial demonstrates how to create a pipeline to run your crunch script on an Arvados cluster. Cluster jobs can scale out to multiple nodes, and use Git and Docker to store the complete system snapshot required to achieve reproducibility.

Note:

This tutorial assumes that you are logged into an Arvados VM instance (see the instructions for Webshell, Unix, or Windows) or that you have installed the Arvados FUSE Driver and Python SDK on your workstation and have a working environment.

This tutorial uses $USER to denote your username. Replace $USER with your username in all the following examples.

Also, this tutorial uses the tutorial arvados repository created in Adding a new arvados repository as the example repository.

Clone Arvados repository

Please clone the tutorial repository using the instructions from Working with Arvados git repository, if you have not already done so.

Creating a Crunch script

Start by entering the tutorial directory created by git clone. Next, create a subdirectory called crunch_scripts and change to that directory:

~$ cd tutorial
~/tutorial$ mkdir crunch_scripts
~/tutorial$ cd crunch_scripts

Next, using nano or your favorite Unix text editor, create a new file called hash.py in the crunch_scripts directory.

~/tutorial/crunch_scripts$ nano hash.py

Add the following code to compute the MD5 hash of each file in a collection. (If you already completed Writing a Crunch script, you can just copy the hash.py file you created previously.)

#!/usr/bin/env python

import hashlib      # Import the hashlib module to compute MD5.
import os           # Import the os module for basic path manipulation
import arvados      # Import the Arvados sdk module

# Automatically parallelize this job by running one task per file.
# This means that if the input consists of many files, each file will
# be processed in parallel on different nodes enabling the job to
# be completed quicker.
arvados.job_setup.one_task_per_input_file(if_sequence=0, and_end_task=True,
                                          input_as_path=True)

# Get object representing the current task
this_task = arvados.current_task()

# Create the message digest object that will compute the MD5 hash
digestor = hashlib.new('md5')

# Get the input file for the task
input_id, input_path = this_task['parameters']['input'].split('/', 1)

# Open the input collection
input_collection = arvados.CollectionReader(input_id)

# Open the input file for reading
with input_collection.open(input_path) as input_file:
    for buf in input_file.readall():  # Iterate the file's data blocks
        digestor.update(buf)          # Update the MD5 hash object

# Write a new collection as output
out = arvados.CollectionWriter()

# Write an output file with one line: the MD5 value and input path
with out.open('md5sum.txt') as out_file:
    out_file.write("{} {}/{}\n".format(digestor.hexdigest(), input_id,
                                       os.path.normpath(input_path)))

# Commit the output to Keep.
output_locator = out.finish()

# Use the resulting locator as the output for this task.
this_task.set_output(output_locator)

# Done!
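
The hashing logic in hash.py can be tried out without a cluster. The following is a minimal sketch that uses only the Python standard library. It assumes the file data arrives as an iterable of byte blocks (a stand-in for what the script iterates over), and the helper names split_input, md5_of_blocks, and format_output_line are hypothetical: they are not part of the Arvados SDK.

```python
import hashlib
import os


def split_input(task_input):
    """Split 'collection_id/path/in/collection' into (id, path), as hash.py does."""
    input_id, input_path = task_input.split('/', 1)
    return input_id, input_path


def md5_of_blocks(blocks):
    """Feed each block of bytes to one MD5 digest object and return the hex digest."""
    digestor = hashlib.new('md5')
    for buf in blocks:
        digestor.update(buf)
    return digestor.hexdigest()


def format_output_line(digest, input_id, input_path):
    """Build the single line that hash.py writes to md5sum.txt."""
    return "{} {}/{}\n".format(digest, input_id, os.path.normpath(input_path))


if __name__ == "__main__":
    # "example-collection-id" is a placeholder, not a real Keep locator.
    input_id, input_path = split_input("example-collection-id/subdir/file.txt")
    digest = md5_of_blocks([b"hello ", b"world"])
    print(format_output_line(digest, input_id, input_path), end="")
```

Because the digest object is updated block by block, the result is the same as hashing the whole file at once, but the file never has to fit in memory.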

Make the file executable:

~/tutorial/crunch_scripts$ chmod +x hash.py
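
If you prefer to set the executable bit from Python, for example in a setup script, the same effect can be sketched with the standard library. The helper name make_executable is hypothetical, and this version adds the execute bit for user, group, and other without consulting the umask, which is a simplification of what chmod +x does.

```python
import os
import stat


def make_executable(path):
    """Add execute permission for user, group, and other; return the new mode bits."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    # Assumes hash.py exists in the current directory.
    print(oct(make_executable("hash.py")))
```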

Next, add the file to the staging area. This tells git that the file should be included on the next commit.

~/tutorial/crunch_scripts$ git add hash.py

Next, commit your changes. All staged changes are recorded into the local git repository:

~/tutorial/crunch_scripts$ git commit -m "my first script"
[master (root-commit) 27fd88b] my first script
 1 file changed, 45 insertions(+)
 create mode 100755 crunch_scripts/hash.py

Finally, upload your changes to the Arvados server:

~/tutorial/crunch_scripts$ git push origin master
Counting objects: 4, done.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (4/4), 682 bytes, done.
Total 4 (delta 0), reused 0 (delta 0)
To git@git.qr1hi.arvadosapi.com:$USER/tutorial.git
 * [new branch]      master -> master

Create a pipeline template

Next, create a new template using arv create pipeline_template:

~$ arv create pipeline_template

In the editor, enter the following template:

{
  "name":"My md5 pipeline",
  "components":{
    "do_hash":{
      "repository":"$USER/tutorial",
      "script":"hash.py",
      "script_version":"master",
      "runtime_constraints":{
        "docker_image":"arvados/jobs"
      },
      "script_parameters":{
        "input":{
          "required": true,
          "dataclass": "Collection"
        }
      }
    }
  }
}
  • "repository" is the name of a git repository to search for the script version. You can access a list of available git repositories on the Arvados Workbench in the Repositories page using the top navigation menu icon.
  • "script_version" specifies the version of the script that you wish to run. This can be in the form of an explicit Git revision hash, a tag, or a branch (in which case it will use the HEAD of the specified branch). Arvados logs the script version that was used in the run, enabling you to go back and re-run any past job with the guarantee that the exact same code will be used as was used in the previous run.
  • "script" specifies the filename of the script to run. Crunch expects to find this in the crunch_scripts/ subdirectory of the Git repository.
  • "runtime_constraints" describes the runtime environment required to run the job. These are described in the job record schema.
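
To make the fields in this list concrete, the sketch below parses a template like the one in this tutorial, checks that each component carries the fields discussed here, and reports which script parameters are marked as required. The helper names check_component and required_parameters are hypothetical: they are not part of the Arvados SDK or CLI, and the set of checked keys mirrors this tutorial rather than the full job record schema.

```python
import json

# The template from this tutorial; "$USER/tutorial" stands for the
# repository that hash.py was pushed to earlier.
TEMPLATE_JSON = """
{
  "name": "My md5 pipeline",
  "components": {
    "do_hash": {
      "repository": "$USER/tutorial",
      "script": "hash.py",
      "script_version": "master",
      "runtime_constraints": {"docker_image": "arvados/jobs"},
      "script_parameters": {
        "input": {"required": true, "dataclass": "Collection"}
      }
    }
  }
}
"""

# Keys this tutorial describes for locating the code to run (an assumption
# for this sketch, not an authoritative schema).
EXPECTED_KEYS = ("repository", "script", "script_version")


def check_component(name, component):
    """Raise ValueError if a component lacks one of the expected keys."""
    missing = [key for key in EXPECTED_KEYS if key not in component]
    if missing:
        raise ValueError("component {!r} is missing: {}".format(
            name, ", ".join(missing)))


def required_parameters(component):
    """Return the sorted names of script parameters marked "required": true."""
    params = component.get("script_parameters", {})
    return sorted(name for name, spec in params.items()
                  if isinstance(spec, dict) and spec.get("required"))


if __name__ == "__main__":
    template = json.loads(TEMPLATE_JSON)
    for name, component in template["components"].items():
        check_component(name, component)
        print(name, "requires:", required_parameters(component))
```

Running a check like this before saving the template can catch a misspelled key early; it does not verify that the repository or script version actually exists on the cluster.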

Running your pipeline

Your new pipeline template should appear at the top of the Workbench pipeline templates page. You can run your pipeline using Workbench or the command line.

For more information and examples for writing pipelines, see the pipeline template reference.



The content of this documentation is licensed under the Creative Commons Attribution-Share Alike 3.0 United States licence.
Code samples in this documentation are licensed under the Apache License, Version 2.0.