Writing a Crunch script

Note:

Arvados pipeline templates are deprecated. The recommended way to develop new workflows for Arvados is using the Common Workflow Language.

This tutorial demonstrates how to write a script using Arvados Python SDK. The Arvados SDK supports access to advanced features not available using the run-command wrapper, such as scheduling concurrent tasks across nodes.

Note:

This tutorial assumes that you are logged into an Arvados VM instance (instructions for Webshell or Unix or Windows) or you have installed the Arvados FUSE Driver and Python SDK on your workstation and have a working environment.

This tutorial uses $USER to denote your username. Replace $USER with your user name in all the following examples.

Start by creating a directory called tutorial in your home directory. Next, create a subdirectory called crunch_scripts and change to that directory:

~$ cd $HOME
~$ mkdir -p tutorial/crunch_scripts
~$ cd tutorial/crunch_scripts

Next, using nano or your favorite Unix text editor, create a new file called hash.py in the crunch_scripts directory.

~/tutorial/crunch_scripts$ nano hash.py

Add the following code to compute the MD5 hash of each file in a collection:

#!/usr/bin/env python
#<Liquid::Comment:0x000055ed8b6b42b0>

import hashlib      # Import the hashlib module to compute MD5.
import os           # Import the os module for basic path manipulation
import arvados      # Import the Arvados sdk module

# Automatically parallelize this job by running one task per file.
# This means that if the input consists of many files, each file will
# be processed in parallel on different nodes enabling the job to
# be completed quicker.
arvados.job_setup.one_task_per_input_file(if_sequence=0, and_end_task=True,
                                          input_as_path=True)

# Get object representing the current task
this_task = arvados.current_task()

# Create the message digest object that will compute the MD5 hash
digestor = hashlib.new('md5')

# Get the input file for the task
input_id, input_path = this_task['parameters']['input'].split('/', 1)

# Open the input collection
input_collection = arvados.CollectionReader(input_id)

# Open the input file for reading
with input_collection.open(input_path) as input_file:
    for buf in input_file.readall():  # Iterate the file's data blocks
        digestor.update(buf)          # Update the MD5 hash object

# Write a new collection as output
out = arvados.CollectionWriter()

# Write an output file with one line: the MD5 value and input path
with out.open('md5sum.txt') as out_file:
    out_file.write("{} {}/{}\n".format(digestor.hexdigest(), input_id,
                                       os.path.normpath(input_path)))

# Commit the output to Keep.
output_locator = out.finish()

# Use the resulting locator as the output for this task.
this_task.set_output(output_locator)

# Done!

Make the file executable:

~/tutorial/crunch_scripts$ chmod +x hash.py

Next, create a submission job record. This describes a specific invocation of your script:

~/tutorial/crunch_scripts$ cat >~/the_job <<EOF
{
 "repository":"",
 "script":"hash.py",
 "script_version":"$HOME/tutorial",
 "script_parameters":{
   "input":"c1bad4b39ca5a924e481008009d94e32+210"
 }
}
EOF

You can now run your script on your local workstation or VM using arv-crunch-job:

~/tutorial/crunch_scripts$ arv-crunch-job --job "$(cat ~/the_job)"
2014-08-06_15:16:22 qr1hi-8i9sb-qyrat80ef927lam 14473  check slurm allocation
2014-08-06_15:16:22 qr1hi-8i9sb-qyrat80ef927lam 14473  node localhost - 1 slots
2014-08-06_15:16:23 qr1hi-8i9sb-qyrat80ef927lam 14473  start
2014-08-06_15:16:23 qr1hi-8i9sb-qyrat80ef927lam 14473  script hash.py
2014-08-06_15:16:23 qr1hi-8i9sb-qyrat80ef927lam 14473  script_version $HOME/tutorial
2014-08-06_15:16:23 qr1hi-8i9sb-qyrat80ef927lam 14473  script_parameters {"input":"c1bad4b39ca5a924e481008009d94e32+210"}
2014-08-06_15:16:23 qr1hi-8i9sb-qyrat80ef927lam 14473  runtime_constraints {"max_tasks_per_node":0}
2014-08-06_15:16:23 qr1hi-8i9sb-qyrat80ef927lam 14473  start level 0
2014-08-06_15:16:23 qr1hi-8i9sb-qyrat80ef927lam 14473  status: 0 done, 0 running, 1 todo
2014-08-06_15:16:23 qr1hi-8i9sb-qyrat80ef927lam 14473 0 job_task qr1hi-ot0gb-lptn85mwkrn9pqo
2014-08-06_15:16:23 qr1hi-8i9sb-qyrat80ef927lam 14473 0 child 14478 started on localhost.1
2014-08-06_15:16:23 qr1hi-8i9sb-qyrat80ef927lam 14473  status: 0 done, 1 running, 0 todo
2014-08-06_15:16:24 qr1hi-8i9sb-qyrat80ef927lam 14473 0 stderr crunchstat: Running [stdbuf --output=0 --error=0 /home/$USER/tutorial/crunch_scripts/hash.py]
2014-08-06_15:16:24 qr1hi-8i9sb-qyrat80ef927lam 14473 0 child 14478 on localhost.1 exit 0 signal 0 success=true
2014-08-06_15:16:24 qr1hi-8i9sb-qyrat80ef927lam 14473 0 success in 1 seconds
2014-08-06_15:16:24 qr1hi-8i9sb-qyrat80ef927lam 14473 0 output
2014-08-06_15:16:25 qr1hi-8i9sb-qyrat80ef927lam 14473  wait for last 0 children to finish
2014-08-06_15:16:25 qr1hi-8i9sb-qyrat80ef927lam 14473  status: 1 done, 0 running, 1 todo
2014-08-06_15:16:25 qr1hi-8i9sb-qyrat80ef927lam 14473  start level 1
2014-08-06_15:16:25 qr1hi-8i9sb-qyrat80ef927lam 14473  status: 1 done, 0 running, 1 todo
2014-08-06_15:16:25 qr1hi-8i9sb-qyrat80ef927lam 14473 1 job_task qr1hi-ot0gb-e3obm0lv6k6p56a
2014-08-06_15:16:25 qr1hi-8i9sb-qyrat80ef927lam 14473 1 child 14504 started on localhost.1
2014-08-06_15:16:25 qr1hi-8i9sb-qyrat80ef927lam 14473  status: 1 done, 1 running, 0 todo
2014-08-06_15:16:26 qr1hi-8i9sb-qyrat80ef927lam 14473 1 stderr crunchstat: Running [stdbuf --output=0 --error=0 /home/$USER/tutorial/crunch_scripts/hash.py]
2014-08-06_15:16:35 qr1hi-8i9sb-qyrat80ef927lam 14473 1 child 14504 on localhost.1 exit 0 signal 0 success=true
2014-08-06_15:16:35 qr1hi-8i9sb-qyrat80ef927lam 14473 1 success in 10 seconds
2014-08-06_15:16:35 qr1hi-8i9sb-qyrat80ef927lam 14473 1 output 8c20281b9840f624a486e4f1a78a1da8+105+A234be74ceb5ea31db6e11b6be26f3eb76d288ad0@54987018
2014-08-06_15:16:35 qr1hi-8i9sb-qyrat80ef927lam 14473  wait for last 0 children to finish
2014-08-06_15:16:35 qr1hi-8i9sb-qyrat80ef927lam 14473  status: 2 done, 0 running, 0 todo
2014-08-06_15:16:35 qr1hi-8i9sb-qyrat80ef927lam 14473  release job allocation
2014-08-06_15:16:35 qr1hi-8i9sb-qyrat80ef927lam 14473  Freeze not implemented
2014-08-06_15:16:35 qr1hi-8i9sb-qyrat80ef927lam 14473  collate
2014-08-06_15:16:36 qr1hi-8i9sb-qyrat80ef927lam 14473  collated output manifest text to send to API server is 105 bytes with access tokens
2014-08-06_15:16:36 qr1hi-8i9sb-qyrat80ef927lam 14473  output hash c1b44b6dc41ef334cf1136033ca950e6+54
2014-08-06_15:16:37 qr1hi-8i9sb-qyrat80ef927lam 14473  finish
2014-08-06_15:16:38 qr1hi-8i9sb-qyrat80ef927lam 14473  log manifest is 7fe8cf1d45d438a3ca3ac4a184b7aff4+83

Although the job runs locally, the output of the job has been saved to Keep, the Arvados file store. The “output hash” line (third from the bottom) provides the portable data hash of the Arvados collection where the script’s output has been saved. Copy the output hash and use arv-ls to list the contents of your output collection, and arv-get to download it to the current directory:

~/tutorial/crunch_scripts$ arv-ls c1b44b6dc41ef334cf1136033ca950e6+54
./md5sum.txt
~/tutorial/crunch_scripts$ arv-get c1b44b6dc41ef334cf1136033ca950e6+54/ .
0 MiB / 0 MiB 100.0%
~/tutorial/crunch_scripts$ cat md5sum.txt
44b8ae3fde7a8a88d2f7ebd237625b4f c1bad4b39ca5a924e481008009d94e32+210/var-GS000016015-ASM.tsv.bz2

Running locally is convenient for development and debugging, as it permits a fast iterative development cycle. Your job run is also recorded by Arvados, and will appear in the Recent jobs and pipelines panel on the Workbench Dashboard. This provides limited provenance, by recording the input parameters, the execution log, and the output. However, running locally does not allow you to scale out to multiple nodes, and does not store the complete system snapshot required to achieve reproducibility; to do that you need to submit a job to the Arvados cluster.

Previous: Tools for writing Crunch pipelines Next: Running on an Arvados cluster