Working with an Arvados git repository

This tutorial describes how to work with a new Arvados git repository, which behaves much like any other remote git repository. It shows how to upload custom scripts to a remote Arvados repository so that you can use them in Arvados pipelines.

Note:

This tutorial assumes either that you are logged into an Arvados VM instance (instructions for Webshell or Unix or Windows) or that you have installed the Arvados FUSE Driver and Python SDK on your workstation and have a working environment.

Note:

This tutorial assumes that you have a working Arvados repository. If you have not created one yet, follow the instructions on the Adding a new repository page. We will use the $USER/tutorial repository created on that page as the example.

Note:

For more information about using Git, try

$ man gittutorial

or search Google for Git tutorials.

Cloning an Arvados repository

Before you start using Git, you should do some basic configuration (you only need to do this the first time):

~$ git config --global user.name "Your Name"
~$ git config --global user.email $USER@example.com

On the Arvados Workbench, click the dropdown menu icon in the upper right corner of the top navigation menu to open the user settings menu, then click the Repositories menu item. On the Repositories page, the $USER/tutorial repository should be listed in the name column. Copy the value from the URL column next to it. It should look like https://git.pirca.arvadosapi.com/$USER/tutorial.git. Alternatively, you can use the SSH form, git@git.pirca.arvadosapi.com:$USER/tutorial.git.
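
The two URL forms differ only in their prefix and separator. As a minimal sketch, assuming the pattern in the two example URLs holds for your cluster and that Python 3 is available, a small helper can derive the second form from the first. The name ssh_form is hypothetical (it is not part of Arvados or git), and example_user is a placeholder user name.

```python
from urllib.parse import urlsplit


def ssh_form(https_url):
    """Rewrite https://HOST/PATH as git@HOST:PATH, following the examples above."""
    parts = urlsplit(https_url)
    if parts.scheme != "https" or not parts.hostname:
        raise ValueError("expected an https:// URL, got %r" % https_url)
    # Drop the leading slash so the path follows the colon directly.
    return "git@{}:{}".format(parts.hostname, parts.path.lstrip("/"))


# Placeholder user name; substitute your own repository URL.
print(ssh_form("https://git.pirca.arvadosapi.com/example_user/tutorial.git"))
```

If the URL shown in Workbench does not follow this pattern, copy the value from the URL column rather than deriving it.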

Next, on the Arvados virtual machine, clone your Git repository:

~$ cd $HOME # (or wherever you want to install)
~$ git clone https://git.pirca.arvadosapi.com/$USER/tutorial.git
Cloning into 'tutorial'...

This creates a Git repository in a directory called tutorial in your home directory. If you are prompted to continue connecting, answer yes. You can ignore any warning that you are cloning an empty repository.

Note: If you are prompted for a username and password when you run git clone, you may first need to update your git configuration by running the following commands:

~$ git config --global 'credential.https://git.pirca.arvadosapi.com/.username' none
~$ git config --global 'credential.https://git.pirca.arvadosapi.com/.helper' '!cred(){ cat >/dev/null; if [ "$1" = get ]; then echo password=$ARVADOS_API_TOKEN; fi; };cred'
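
The helper in the second command is compact, so a rough Python equivalent may make it easier to read. Git calls a credential helper with an action argument (get, store, or erase) and sends the request details on standard input. The shell function discards that input and prints a password line only for get. The sketch below mirrors that decision only. The name credential_response is hypothetical, the token value is a made-up placeholder, and Python 3 is assumed.

```python
import os


def credential_response(action, environ=None):
    """Return the lines the helper would print for a given git action."""
    environ = os.environ if environ is None else environ
    if action != "get":
        # "store" and "erase" requests produce no output, as in the shell helper.
        return []
    # An unset variable yields an empty password, matching the shell expansion.
    token = environ.get("ARVADOS_API_TOKEN", "")
    return ["password=" + token]


# Placeholder token, for illustration only.
fake_env = {"ARVADOS_API_TOKEN": "example-token"}
print(credential_response("get", fake_env))
print(credential_response("store", fake_env))
```

The username is supplied separately by the first command, which sets it to none, so the helper only needs to answer with the API token as the password.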

Creating a git branch in an Arvados repository

Create a git branch named tutorial_branch in the tutorial Arvados git repository.

~$ cd tutorial
~/tutorial$ git checkout -b tutorial_branch

Adding scripts to an Arvados repository

Arvados crunch scripts must be placed in a crunch_scripts subdirectory of the repository. If this subdirectory does not exist, first create it in the local repository and change to that directory:

~/tutorial$ mkdir crunch_scripts
~/tutorial$ cd crunch_scripts

Next, using nano or your favorite Unix text editor, create a new file called hash.py in the crunch_scripts directory.

~/tutorial/crunch_scripts$ nano hash.py

Add the following code to compute the MD5 hash of each file in a collection:

#!/usr/bin/env python

import hashlib      # Import the hashlib module to compute MD5.
import os           # Import the os module for basic path manipulation.
import arvados      # Import the Arvados SDK module.

# Automatically parallelize this job by running one task per file.
# If the input consists of many files, each file is processed in
# parallel on different nodes, so the job completes more quickly.
arvados.job_setup.one_task_per_input_file(if_sequence=0, and_end_task=True,
                                          input_as_path=True)

# Get object representing the current task
this_task = arvados.current_task()

# Create the message digest object that will compute the MD5 hash
digestor = hashlib.new('md5')

# Get the input file for the task
input_id, input_path = this_task['parameters']['input'].split('/', 1)

# Open the input collection
input_collection = arvados.CollectionReader(input_id)

# Open the input file for reading
with input_collection.open(input_path) as input_file:
    for buf in input_file.readall():  # Iterate the file's data blocks
        digestor.update(buf)          # Update the MD5 hash object

# Write a new collection as output
out = arvados.CollectionWriter()

# Write an output file with one line: the MD5 value and input path
with out.open('md5sum.txt') as out_file:
    out_file.write("{} {}/{}\n".format(digestor.hexdigest(), input_id,
                                       os.path.normpath(input_path)))

# Commit the output to Keep.
output_locator = out.finish()

# Use the resulting locator as the output for this task.
this_task.set_output(output_locator)

# Done!
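
The core of hash.py is the loop that feeds data blocks into the digest object. A minimal sketch of the same idea for a local file, assuming Python 3 and no Arvados SDK, can help you check the hashing logic before pushing the script. The names md5_of_file and md5sum_line are hypothetical helpers, and the 1 MiB chunk size is an arbitrary choice.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a local file, read in fixed-size chunks."""
    digestor = hashlib.new('md5')
    with open(path, 'rb') as input_file:
        # iter() with a sentinel keeps reading until read() returns b''.
        for buf in iter(lambda: input_file.read(chunk_size), b''):
            digestor.update(buf)
    return digestor.hexdigest()


def md5sum_line(path):
    """Format one 'digest path' line, similar in shape to what hash.py writes."""
    return "{} {}\n".format(md5_of_file(path), os.path.normpath(path))
```

Calling md5sum_line on any local file should give the same digest as running md5sum on that file. Unlike hash.py, the path in the output line is a local path, not a collection identifier followed by a path.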

Make the file executable:

~/tutorial/crunch_scripts$ chmod +x hash.py

Next, add the file to the git repository. This tells git that the file should be included on the next commit.

~/tutorial/crunch_scripts$ git add hash.py

Next, commit your changes. This records all staged changes in the local git repository:

~/tutorial/crunch_scripts$ git commit -m "my first script"

Finally, upload your changes to the remote repository:

~/tutorial/crunch_scripts$ git push origin tutorial_branch

Although this tutorial shows how to add a Python script to Arvados, the same steps can be used to add any of your custom Bash, R, or Python scripts to an Arvados repository.
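
If you add scripts often, the add, commit, and push steps above can also be scripted. The following is only a sketch, assuming Python 3.5 or later and git available on the PATH. The names publish_commands and publish are hypothetical, and the remote name origin matches the one used in this tutorial.

```python
import subprocess


def publish_commands(path, message, branch, remote="origin"):
    """Build the git command lines for staging, committing, and pushing one file."""
    return [
        ["git", "add", path],
        ["git", "commit", "-m", message],
        ["git", "push", remote, branch],
    ]


def publish(path, message, branch, cwd=None):
    """Run each command in order; check=True stops at the first failure."""
    for argv in publish_commands(path, message, branch):
        subprocess.run(argv, cwd=cwd, check=True)


# Show the commands without running them.
for argv in publish_commands("hash.py", "my first script", "tutorial_branch"):
    print(" ".join(argv))
```

Calling publish("hash.py", "my first script", "tutorial_branch", cwd="crunch_scripts") from the repository root would run the same three commands shown earlier. Building the command lists separately from running them makes it easy to review what will be executed first.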
