This tutorial describes how to access Arvados collections on GNU/Linux with traditional filesystem tools by mounting Keep as a file system using arv-mount.
This tutorial assumes that you are either logged into an Arvados VM instance (instructions are available for Webshell, Unix, and Windows) or have installed the Arvados FUSE Driver and Python SDK on your workstation and have a working environment.
arv-mount provides several features:
The default mode permits browsing any collection in Arvados as a subdirectory under the mount directory. To avoid having to fetch a potentially large list of all collections, collection directories only come into existence when explicitly accessed by UUID or portable data hash. For instance, a collection may be found by its content hash in the keep/by_id directory.
~$ mkdir -p keep
~$ arv-mount keep
~$ cd keep/by_id/c1bad4b39ca5a924e481008009d94e32+210
~/keep/by_id/c1bad4b39ca5a924e481008009d94e32+210$ ls
var-GS000016015-ASM.tsv.bz2
~/keep/by_id/c1bad4b39ca5a924e481008009d94e32+210$ md5sum var-GS000016015-ASM.tsv.bz2
44b8ae3fde7a8a88d2f7ebd237625b4f var-GS000016015-ASM.tsv.bz2
~/keep/by_id/c1bad4b39ca5a924e481008009d94e32+210$ cd ../..
~$ fusermount -u keep
The last line unmounts Keep. Subdirectories will no longer be accessible.
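Because collection directories only appear when they are accessed by identifier, a mistyped hash simply fails at cd time. The following is a minimal sketch of a helper that checks an identifier before building the by_id path. The function name pdh_path is hypothetical, and the accepted shape (32 lowercase hex digits, a plus sign, then a decimal number) is an assumption inferred from the example hash in the session above. UUIDs are not handled by this sketch.

```shell
#!/bin/sh
# Hypothetical helper: print the by_id path for a portable data hash.
# $1: mount directory (for example "keep"); $2: portable data hash.
# Assumed shape, inferred from the example above:
# 32 lowercase hex digits, "+", then a decimal number.
pdh_path() {
    mount_dir=$1
    pdh=$2
    if printf '%s\n' "$pdh" | grep -Eq '^[0-9a-f]{32}\+[0-9]+$'; then
        printf '%s/by_id/%s\n' "$mount_dir" "$pdh"
    else
        printf 'not a portable data hash: %s\n' "$pdh" >&2
        return 1
    fi
}
```

With Keep mounted as in the session above, this could be used as cd "$(pdh_path keep c1bad4b39ca5a924e481008009d94e32+210)".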
In the top-level directory of each collection, arv-mount provides a special file called .arvados#collection that contains a JSON-formatted API record for the collection. This can be used to determine the collection’s portable_data_hash, uuid, etc. This file does not show up in ls or ls -a.
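Since the file is hidden from directory listings, it has to be opened by its exact name. The sketch below reads one top-level field from the record. The function name collection_field is hypothetical, and the sketch assumes python3 is on the PATH (plausible where the Python SDK is installed) and that the requested field is a top-level key of the JSON record.

```shell
#!/bin/sh
# Hypothetical helper: print one top-level field of a collection's API record.
# $1: collection directory inside the mount; $2: field name (e.g. uuid).
# The file name is quoted so the "#" is always taken literally.
collection_field() {
    python3 -c '
import json, sys
with open(sys.argv[1]) as f:
    record = json.load(f)
print(record[sys.argv[2]])
' "$1/.arvados#collection" "$2"
}
```

For example, collection_field keep/by_id/c1bad4b39ca5a924e481008009d94e32+210 uuid would print the uuid field of that collection's record, provided Keep is mounted at keep.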
By default, all files in the Keep mount are read-only. However, arv-mount --read-write enables you to perform the following operations using normal Unix command line tools (touch, mv, rm, mkdir, rmdir) and your own programs using standard POSIX file system APIs:
mkdir and rmdir in a project directory)
Not supported:
If multiple clients (separate instances of arv-mount or other Arvados applications) modify the same file in the same collection within a short time interval, a conflict may result. In this case, the most recent commit wins, and the “loser” is renamed to a conflict file of the form name~YYYYMMDD-HHMMSS~conflict~.
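Conflict files can be located by their name pattern. The following is a minimal sketch using find. The function name find_conflicts is hypothetical, and the glob assumes that the date and time parts of the documented form are always eight and six decimal digits.

```shell
#!/bin/sh
# Hypothetical helper: list files under a directory whose names match
# the documented conflict form name~YYYYMMDD-HHMMSS~conflict~.
# $1: directory to search, for example a collection directory in the mount.
find_conflicts() {
    find "$1" -type f \
        -name '*~[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]~conflict~'
}
```

Each path printed is a "loser" copy that can be compared against the file of the same base name before deciding which version to keep.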
Please note this feature is in beta testing. In particular, the conflict mechanism is itself currently subject to race conditions with potential for data loss when a collection is being modified simultaneously by multiple clients. This issue will be resolved in future development.
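With those caveats in mind, the command line tools named earlier for read-write mounts (touch, mv, mkdir, rmdir, rm) can be exercised with a short script. This is a sketch, not a statement of which operations a given mount accepts. The function name rw_smoke_test and the file names it creates are hypothetical, and it takes any writable directory so that it can be tried on a scratch directory before being pointed at a collection in a mount started with arv-mount --read-write.

```shell
#!/bin/sh
# Hypothetical smoke test: run the basic file operations in a directory.
# $1: a writable directory. Stops at the first operation that fails and
# returns that operation's exit status; leaves the directory as it found it
# when every step succeeds.
rw_smoke_test() {
    dir=$1
    touch "$dir/notes.txt" &&                           # create a file
    mv "$dir/notes.txt" "$dir/notes-renamed.txt" &&     # rename it
    mkdir "$dir/scratch" &&                             # create a subdirectory
    rmdir "$dir/scratch" &&                             # remove the subdirectory
    rm "$dir/notes-renamed.txt"                         # delete the file
}
```

A nonzero exit status indicates that one of the operations was refused in that directory.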
The content of this documentation is licensed under the Creative Commons Attribution-Share Alike 3.0 United States licence.
Code samples in this documentation are licensed under the Apache License, Version 2.0.