This tutorial describes how to work with a new Arvados git repository. Working with an Arvados git repository is analogous to working with other public git repositories. The tutorial shows you how to upload custom scripts to a remote Arvados repository so that you can use them in Arvados pipelines.
This tutorial assumes that you are logged into an Arvados VM instance (instructions for Webshell, Unix, or Windows), or that you have installed the Arvados FUSE Driver and Python SDK on your workstation and have a working environment.
This tutorial also assumes that you have a working Arvados repository. If you have not created one yet, follow the instructions on the Adding a new repository page. We will use the $USER/tutorial repository created on that page as the example.
For more information about using Git, try
$ man gittutorial
or search Google for Git tutorials.
Before you start using Git, you should do some basic configuration (you only need to do this the first time):
~$ git config --global user.name "Your Name"
~$ git config --global user.email $USER@example.com
On the Arvados Workbench, click the dropdown menu icon in the upper right corner of the top navigation menu to open the user settings menu, then click the Repositories menu item. On the Repositories page, you should see the $USER/tutorial repository listed in the name column. Next to the name column is the URL column. Copy the URL value associated with your repository. It should look like https://git.pirca.arvadosapi.com/$USER/tutorial.git (the HTTPS form). Alternatively, you can use the SSH form, git@git.pirca.arvadosapi.com:$USER/tutorial.git
Next, on the Arvados virtual machine, clone your Git repository:
~$ cd $HOME # (or wherever you want to install)
~$ git clone https://git.pirca.arvadosapi.com/$USER/tutorial.git
Cloning into 'tutorial'...
This will create a Git repository in a directory called tutorial in your home directory. Say yes when prompted to continue with the connection.
Ignore any warning that you are cloning an empty repository.
Note: If you are prompted for username and password when you try to git clone using this command, you may first need to update your git configuration. Execute the following commands to update your git configuration.
~$ git config 'credential.https://git.pirca.arvadosapi.com/.username' none
~$ git config 'credential.https://git.pirca.arvadosapi.com/.helper' '!cred(){ cat >/dev/null; if [ "$1" = get ]; then echo password=$ARVADOS_API_TOKEN; fi; };cred'
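The second command registers an inline shell function as a Git credential helper. Git calls a helper with an action (get, store, or erase) as its argument and sends the request details on standard input. This helper discards that input and, for the get action only, prints the value of ARVADOS_API_TOKEN as the password. The same logic can be sketched in Python for illustration. The function name credential_response is hypothetical and is not part of Arvados or Git.

```python
import os


def credential_response(action, request_text, environ=None):
    """Mimic the inline shell helper: ignore the request, answer only 'get'.

    Hypothetical helper for illustration; not an Arvados or Git API.
    """
    environ = os.environ if environ is None else environ
    # The shell helper runs `cat >/dev/null`: it reads and discards
    # whatever Git sends on standard input.
    del request_text
    if action == "get":
        # Git reads key=value lines from the helper's standard output.
        return "password={}\n".format(environ.get("ARVADOS_API_TOKEN", ""))
    # For 'store' and 'erase' the helper prints nothing.
    return ""


if __name__ == "__main__":
    fake_env = {"ARVADOS_API_TOKEN": "example-token"}  # placeholder value
    request = "protocol=https\nhost=git.pirca.arvadosapi.com\n"
    print(credential_response("get", request, fake_env), end="")
```

Because the username is set to none by the first command and the password comes from the environment, the clone only works in a shell where ARVADOS_API_TOKEN is set.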
Create a git branch named tutorial_branch in the tutorial Arvados git repository.
~$ cd tutorial
~/tutorial$ git checkout -b tutorial_branch
Arvados crunch scripts need to be added in a crunch_scripts subdirectory in the repository. If this subdirectory does not exist, first create it in the local repository and change to that directory:
~/tutorial$ mkdir crunch_scripts
~/tutorial$ cd crunch_scripts
Next, using nano or your favorite Unix text editor, create a new file called hash.py in the crunch_scripts directory.
~/tutorial/crunch_scripts$ nano hash.py
Add the following code, which computes the MD5 hash of each file in a collection:
#!/usr/bin/env python

import hashlib      # Import the hashlib module to compute MD5.
import os           # Import the os module for basic path manipulation
import arvados      # Import the Arvados sdk module

# Automatically parallelize this job by running one task per file.
# This means that if the input consists of many files, each file will
# be processed in parallel on different nodes enabling the job to
# be completed quicker.
arvados.job_setup.one_task_per_input_file(if_sequence=0, and_end_task=True,
                                          input_as_path=True)

# Get object representing the current task
this_task = arvados.current_task()

# Create the message digest object that will compute the MD5 hash
digestor = hashlib.new('md5')

# Get the input file for the task
input_id, input_path = this_task['parameters']['input'].split('/', 1)

# Open the input collection
input_collection = arvados.CollectionReader(input_id)

# Open the input file for reading
with input_collection.open(input_path) as input_file:
    for buf in input_file.readall():  # Iterate the file's data blocks
        digestor.update(buf)          # Update the MD5 hash object

# Write a new collection as output
out = arvados.CollectionWriter()

# Write an output file with one line: the MD5 value and input path
with out.open('md5sum.txt') as out_file:
    out_file.write("{} {}/{}\n".format(digestor.hexdigest(), input_id,
                                       os.path.normpath(input_path)))

# Commit the output to Keep.
output_locator = out.finish()

# Use the resulting locator as the output for this task.
this_task.set_output(output_locator)

# Done!
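The hashing step of the script can be tried without an Arvados cluster. The sketch below is a local stand-in and does not use the Arvados SDK. It reads an ordinary file in fixed-size blocks and feeds each block to the digest object, in the same way the script's loop over the input file's data blocks does. The helper name md5_of_file and the 1 MiB block size are assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, block_size=1024 * 1024):
    """Return the hex MD5 digest of a local file, read block by block.

    Hypothetical helper for illustration; not part of the Arvados SDK.
    """
    digestor = hashlib.new("md5")
    with open(path, "rb") as input_file:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for buf in iter(lambda: input_file.read(block_size), b""):
            digestor.update(buf)
    return digestor.hexdigest()


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello\n")
    try:
        # Same line layout as md5sum.txt, with a placeholder collection id.
        print("{} {}/{}".format(md5_of_file(tmp.name), "<input_id>",
                                os.path.basename(tmp.name)))
    finally:
        os.remove(tmp.name)
```

Updating the digest block by block gives the same result as hashing the whole file at once, which is why the script can process large files without loading them into memory.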
Make the file executable:
~/tutorial/crunch_scripts$ chmod +x hash.py
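The chmod +x command sets the execute permission bits so that the file can be run as a program. A rough Python equivalent is sketched below for illustration. The helper names are hypothetical, and unlike chmod +x this sketch does not consult the umask.

```python
import os
import stat


def make_executable(path):
    """Add execute permission for user, group, and others to path."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


def is_executable(path):
    """Return True if the owner's execute bit is set on path."""
    return bool(os.stat(path).st_mode & stat.S_IXUSR)
```

For example, is_executable("hash.py") can be used to confirm the change before committing the file.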
Next, add the file to the git repository. This tells git that the file should be included in the next commit.
~/tutorial/crunch_scripts$ git add hash.py
Next, commit your changes. This records all staged changes in the local git repository:
~/tutorial/crunch_scripts$ git commit -m "my first script"
Finally, upload your changes to the remote repository:
~/tutorial/crunch_scripts$ git push origin tutorial_branch
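The add, commit, and push steps above can also be scripted. The following is a minimal sketch, assuming git is installed and on the PATH. The helper names publish_commands and run_all are hypothetical. The first only builds the argument lists for the three commands, and the second runs them in order.

```python
import subprocess


def publish_commands(path, message, branch, remote="origin"):
    """Build the git add, commit, and push commands as argument lists.

    Hypothetical helper for illustration; nothing is executed here.
    """
    return [
        ["git", "add", path],
        ["git", "commit", "-m", message],
        ["git", "push", remote, branch],
    ]


def run_all(commands, cwd):
    """Run each command in the directory cwd, stopping at the first failure."""
    for argv in commands:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling run_all(publish_commands("hash.py", "my first script", "tutorial_branch"), cwd="crunch_scripts") from the tutorial directory would repeat the three steps shown in this tutorial.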
Although this tutorial shows how to add a Python script to Arvados, the same steps can be used to add any of your custom Bash, R, or Python scripts to an Arvados repository.
The content of this documentation is licensed under the Creative Commons Attribution-Share Alike 3.0 United States licence.
Code samples in this documentation are licensed under the Apache License, Version 2.0.