collections

API endpoint base: https://pirca.arvadosapi.com/arvados/v1/collections

Object type: 4zz18

Example UUID: zzzzz-4zz18-0123456789abcde

Resource

Collections describe sets of files in terms of data blocks stored in Keep. See Keep – Content-Addressable Storage and using collection versioning for details.

Each collection has, in addition to the Common resource fields:

Attribute	Type	Description	Example
name	string
description	text	Free text description of the collection. Allows HTML formatting.
properties	hash	User-defined metadata, may be used in queries using subproperty filters
portable_data_hash	string	The MD5 sum of the manifest text stripped of block hints other than the size hint.
manifest_text	text	The manifest describing how to assemble blocks into files, in the Arvados manifest format
replication_desired	number	Minimum storage replication level desired for each data block referenced by this collection. A value of `null` signifies that the site default replication level (typically 2) is desired.	`2`
replication_confirmed	number	Replication level most recently confirmed by the storage system. This field is null when a collection is first created, and is reset to null when the manifest_text changes in a way that introduces a new data block. An integer value indicates the replication level of the least replicated data block in the collection.	`2`, null
replication_confirmed_at	datetime	When `replication_confirmed` was confirmed. If `replication_confirmed` is null, this field is also null.
storage_classes_desired	list	An optional list of storage class names where the blocks should be saved. If not provided, the cluster’s default storage class(es) will be set.	`['archival']`
storage_classes_confirmed	list	Storage classes most recently confirmed by the storage system. This field is an empty list when a collection is first created.	`['archival']`, `[]`
storage_classes_confirmed_at	datetime	When `storage_classes_confirmed` was confirmed. If `storage_classes_confirmed` is `[]`, this field is null.
trash_at	datetime	If `trash_at` is non-null and in the past, this collection will be hidden from API calls. May be untrashed.
delete_at	datetime	If `delete_at` is non-null and in the past, the collection may be permanently deleted.
is_trashed	boolean	True if `trash_at` is in the past, false if not.
current_version_uuid	string	UUID of the collection’s current version. On new collections, it’ll be equal to the `uuid` attribute.
version	number	Version number, starting at 1 on new collections. This attribute is read-only.
preserve_version	boolean	When set to true on a current version, it will be persisted. When passing `true` as part of a bigger update call, both current and newly created versions are persisted.
file_count	number	The total number of files in the collection. This attribute is read-only.
file_size_total	number	The sum of the file sizes in the collection. This attribute is read-only.

Conditions of creating a Collection

If a new portable_data_hash is specified when creating or updating a Collection, it must match the cryptographic digest of the supplied manifest_text.
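
The digest check above can be reproduced on the client before sending a request. The following is a minimal sketch, not the server's implementation: the helper names are hypothetical, the locator pattern is a simplification of the manifest format, and the `+<n>` suffix is assumed to be the byte length of the stripped manifest (consistent with the `...+45` hashes used elsewhere on this page).

```python
import hashlib
import re

# Block locator: 32 hex digits, an optional size hint (+digits), then any
# other hints (+ capital letter followed by hint data, e.g. +A<signature>).
_LOCATOR = re.compile(r"^([0-9a-f]{32})(\+\d+)?(?:\+[A-Z][^+\s]*)*$")


def strip_block_hints(manifest_text: str) -> str:
    """Drop every locator hint except the size hint."""
    def strip_token(token: str) -> str:
        match = _LOCATOR.match(token)
        if match is None:
            return token  # stream name or file token: leave untouched
        return match.group(1) + (match.group(2) or "")

    return "\n".join(
        " ".join(strip_token(tok) for tok in line.split(" "))
        for line in manifest_text.split("\n")
    )


def portable_data_hash(manifest_text: str) -> str:
    stripped = strip_block_hints(manifest_text).encode("utf-8")
    # Assumption: the "+<n>" suffix is the byte length of the stripped manifest.
    return f"{hashlib.md5(stripped).hexdigest()}+{len(stripped)}"
```

Because signatures are stripped before hashing, a signed and an unsigned copy of the same manifest yield the same value.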

Side effects of creating a Collection

Referenced blocks are protected from garbage collection in Keep.

Data can be shared with other users via the Arvados permission model.

Trashing collections

Collections can be trashed by updating the record and setting the trash_at field, or with the delete method. The delete method sets trash_at to “now”.

The value of trash_at can be set to a time in the future as a feature to automatically expire collections.

When trash_at is set, delete_at will also be set. Normally delete_at = trash_at + Collections.DefaultTrashLifetime. When the trash_at time has passed but delete_at is in the future, the trashed collection is invisible to most API calls unless the include_trash parameter is true. Collections in the trashed state can be untrashed as long as delete_at has not passed. Collections are also trashed if they are contained in a trashed group.

Once delete_at is past, the collection and all of its previous versions will be deleted permanently and can no longer be untrashed.
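
The timeline described in this section can be summarized as a small state function. This is an illustrative sketch only: `DEFAULT_TRASH_LIFETIME` is a stand-in for the cluster's Collections.DefaultTrashLifetime setting (the 14-day value is an assumption), and the state names are hypothetical labels, not API values.

```python
from datetime import datetime, timedelta, timezone

# Assumption: stand-in for the cluster's Collections.DefaultTrashLifetime.
DEFAULT_TRASH_LIFETIME = timedelta(days=14)


def default_delete_at(trash_at, lifetime=DEFAULT_TRASH_LIFETIME):
    """delete_at = trash_at + Collections.DefaultTrashLifetime."""
    return trash_at + lifetime


def trash_state(trash_at, delete_at, now):
    """Classify a collection as 'visible', 'trashed', or 'deletable'."""
    if trash_at is None or now < trash_at:
        return "visible"    # not trashed, or scheduled to expire later
    if delete_at is not None and now >= delete_at:
        return "deletable"  # may be permanently deleted; cannot be untrashed
    return "trashed"        # hidden unless include_trash is true; can be untrashed
```

For example, a collection trashed on 1 January with this lifetime stays in the "trashed" state until 15 January.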

Using “replace_files” to create or update a collection

The replace_files option can be used with the create and update APIs to efficiently and atomically copy individual files and directory trees from other collections, copy/rename/delete items within an existing collection, and add new items to a collection.

replace_files keys indicate target paths in the new collection, and values specify sources that should be copied to the target paths.

Each target path must be an absolute canonical path beginning with /. It must not contain . or .. components, consecutive / characters, or a trailing / after the final component.
Each source must be one of the following:
- an empty string (signifying that the target path is to be deleted),
- <PDH>/<path> where <PDH> is the portable data hash of a collection on the cluster and <path> is a file or directory in that collection,
- manifest_text/<path> where <path> is an existing file or directory in a collection supplied in the manifest_text attribute in the request, or
- current/<path> where <path> is an existing file or directory in the collection being updated.

In an update request, sources may reference the current portable data hash of the collection being updated. However, in many cases it is more appropriate to use a current/<path> source instead, to ensure the latest content is used even if the collection has been updated since the PDH was last retrieved.
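
The target and source rules above lend themselves to a client-side syntax check before a request is sent. The sketch below is a hedged illustration with hypothetical helper names; it checks syntax only and cannot tell whether a referenced collection or path actually exists.

```python
import re

# <PDH>/<path>: 32 hex digits, "+", a decimal size, then "/".
_PDH_SOURCE = re.compile(r"^[0-9a-f]{32}\+\d+/")


def valid_target(path: str) -> bool:
    if path == "/":
        return True  # the collection root, as used in the examples on this page
    if not path.startswith("/") or path.endswith("/"):
        return False
    # An empty component means consecutive "/" characters.
    return all(c not in ("", ".", "..") for c in path[1:].split("/"))


def valid_source(source: str) -> bool:
    if source == "":
        return True  # empty string: delete the target path
    if source.startswith(("manifest_text/", "current/")):
        return True
    return _PDH_SOURCE.match(source) is not None
```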

Delete a file

Delete foo.txt.

"replace_files": {
  "/foo.txt": ""
}

Rename a file

Rename foo.txt to bar.txt.

"replace_files": {
  "/foo.txt": "",
  "/bar.txt": "current/foo.txt"
}

Swap files

Swap contents of files foo and bar.

"replace_files": {
  "/foo": "current/bar",
  "/bar": "current/foo"
}
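
The rename and swap patterns above can be generated from a single old-to-new mapping. The following is a sketch with a hypothetical helper name, `rename_files`; it only builds the replace_files value and does not call the API.

```python
def rename_files(renames: dict) -> dict:
    """Build a replace_files mapping from {old_path: new_path}."""
    replace_files = {}
    for old, new in renames.items():
        # old is an absolute path such as "/foo.txt", so this gives "current/foo.txt".
        replace_files[new] = "current" + old
    for old in renames:
        # Delete the old path unless another rename writes to it (as in a swap).
        replace_files.setdefault(old, "")
    return replace_files
```

With `{"/foo.txt": "/bar.txt"}` this yields the rename example, and with `{"/foo": "/bar", "/bar": "/foo"}` it yields the swap example, in which neither path is deleted.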

Add a file

"replace_files": {
  "/new_directory/new_file.txt": "manifest_text/new_file.txt"
},
"collection": {
  "manifest_text": ". acbd18db4cc2f85cedef654fccc4a4d8+3+A82740cd577ff5745925af5780de5992cbb25d937@668efec4 0:3:new_file.txt\n"
}

Replace all content with new content

Note this is equivalent to omitting the replace_files argument.

"replace_files": {
  "/": "manifest_text/"
},
"collection": {
  "manifest_text": "./new_directory acbd18db4cc2f85cedef654fccc4a4d8+3+A82740cd577ff5745925af5780de5992cbb25d937@668efec4 0:3:new_file.txt\n"
}

Atomic rename and replace

Rename current_file.txt to old_file.txt and replace current_file.txt with new content, all in a single atomic operation.

"replace_files": {
  "/current_file.txt": "manifest_text/new_file.txt",
  "/old_file.txt": "current/current_file.txt"
},
"collection": {
  "manifest_text": ". acbd18db4cc2f85cedef654fccc4a4d8+3+A82740cd577ff5745925af5780de5992cbb25d937@668efec4 0:3:new_file.txt\n"
}

Combine collections

Delete all current content, then copy content from other collections into new subdirectories.

"replace_files": {
  "/": "",
  "/copy of collection 1": "1f4b0bc7583c2a7f9102c395f4ffc5e3+45/",
  "/copy of collection 2": "ea10d51bcf88862dbcc36eb292017dfd+45/"
}

Extract a subdirectory

Replace all current content with a copy of a subdirectory from another collection.

"replace_files": {
  "/": "1f4b0bc7583c2a7f9102c395f4ffc5e3+45/subdir"
}

Usage restrictions

A target path with a non-empty source cannot be the ancestor of another target path in the same request. For example, the following request is invalid:

"replace_files": {
  "/foo": "fa7aeb5140e2848d39b416daeef4ffc5+45/",
  "/foo/this_will_return_an_error": ""
}

It is an error to supply a non-empty manifest_text that is unused, i.e., the replace_files argument does not contain any values beginning with "manifest_text/". For example, the following request is invalid:

"replace_files": {
  "/foo": "current/bar"
},
"collection": {
  "manifest_text": ". acbd18db4cc2f85cedef654fccc4a4d8+3+A82740cd577ff5745925af5780de5992cbb25d937@668efec4 0:3:new_file.txt\n"
}
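
Both restrictions above can be checked on the client before submitting a request. This is a minimal sketch under the rules as stated here; `check_restrictions` is a hypothetical helper, and the server remains the authority on what is accepted.

```python
def check_restrictions(replace_files: dict, manifest_text: str = "") -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    errors = []
    for parent, source in replace_files.items():
        if source == "":
            continue  # a deleted path may be the ancestor of other targets
        prefix = "/" if parent == "/" else parent + "/"
        for child in replace_files:
            if child != parent and child.startswith(prefix):
                errors.append(f"{parent} is an ancestor of {child}")
    uses_manifest = any(
        source.startswith("manifest_text/") for source in replace_files.values()
    )
    # Only relevant when replace_files is actually in use.
    if replace_files and manifest_text and not uses_manifest:
        errors.append("manifest_text is supplied but unused")
    return errors
```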

Collections on other clusters in a federation cannot be used as sources. Each source must exist on the current cluster and be readable by the current user.

Similarly, if manifest_text is provided, it must only reference data blocks that are stored on the current cluster. This API does not copy data from other clusters in a federation.

Using “replace_segments” to repack file data

The replace_segments option can be used with the create or update API to atomically apply a new file packing, typically with the goal of replacing a number of small blocks with one larger block. The repacking is specified in terms of block segments: a block segment is a portion of a stored block that is referenced by a file in a manifest.

replace_segments keys indicate existing block segments in the collection, and values specify replacement segments.

Each segment is specified as space-separated tokens: "locator offset length" where locator is a signed block locator and offset and length are decimal-encoded integers specifying a portion of the block that is referenced in the collection.
Each replacement block locator must be properly signed (just as if it appeared in a manifest_text).
Each existing block segment must correspond to an entire contiguous portion of a block referenced by a single file (splitting existing segments is not supported).
If a segment to be replaced does not match any existing block segment in the manifest, that segment and all other replace_segments entries referencing the same replacement block will be skipped. Other replacements will still be applied. Replacements that are skipped for this reason do not cause the request to fail. This rule ensures that when concurrent clients compute different repackings and request similar replacements such as a,b,c,d,e → X and a,b,c,d,e,f → Y, the resulting manifest references X or Y but not both. Otherwise, the effect could be a,b,c,d,e → X, f → Y where Y is just an inefficient way to reference the same data as f.

The replace_files and manifest_text options, if present, are applied before replace_segments. This means replace_segments can apply to blocks from manifest_text and/or other collections referenced by replace_files.

In the following example, two files were originally saved by writing two small blocks (c410 and 693e). After concatenating the two small blocks and writing a single larger block ca9c, the manifest is being updated to reference the larger block.

"collection": {
  "manifest_text": ". c4103f122d27677c9db144cae1394a66+2+A3d02f1f3d8a622b2061ad5afe4853dbea42039e2@674dd351 693e9af84d3dfcc71e640e005bdc5e2e+3+A6528480b63d90a24b60b2ee2409040f050cc5d0c@674dd351 0:2:file1.txt 2:3:file2.txt\n"
},
"replace_segments": {
  "c4103f122d27677c9db144cae1394a66+2+A3d02f1f3d8a622b2061ad5afe4853dbea42039e2@674dd351 0 2": "ca9c491ac66b2c62500882e93f3719a8+5+A312fea6de5807e9e77d844450d36533a599c40f1@674dd351 0 2",
  "693e9af84d3dfcc71e640e005bdc5e2e+3+A6528480b63d90a24b60b2ee2409040f050cc5d0c@674dd351 0 3": "ca9c491ac66b2c62500882e93f3719a8+5+A312fea6de5807e9e77d844450d36533a599c40f1@674dd351 2 3"
}

Resulting manifest:

. ca9c491ac66b2c62500882e93f3719a8+5+A312fea6de5807e9e77d844450d36533a599c40f1@674dd351 0:2:file1.txt 2:3:file2.txt
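
The replace_segments value in this example follows a simple pattern: each small block maps to a slice of the new block at a running offset. The sketch below assumes every small block is referenced in full as a single segment and that the blocks were concatenated in the order given; `repack` is a hypothetical helper that only builds the mapping.

```python
def repack(small_blocks, new_locator):
    """Build a replace_segments mapping.

    small_blocks: list of (signed_locator, size) in concatenation order.
    new_locator:  signed locator of the block holding the concatenated data.
    """
    replace_segments = {}
    offset = 0
    for locator, size in small_blocks:
        # "locator offset length" -> "replacement_locator offset length"
        replace_segments[f"{locator} 0 {size}"] = f"{new_locator} {offset} {size}"
        offset += size
    return replace_segments
```

Called with the two small blocks (sizes 2 and 3) and the ca9c locator, this reproduces the mapping shown above, with the second block landing at offset 2.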

Methods

See Common resource methods for more information about create, delete, get, list, and update.


Supports federated get only, which may be called with either a uuid or a portable data hash. When requesting a portable data hash which is not available on the home cluster, the query is forwarded to all the clusters listed in RemoteClusters and returns the first successful result.

create

Create a new Collection.

Arguments:

Argument	Type	Description	Location
collection	object		query
replace_files	object	Initialize files and directories with new content and/or content from other collections	query
replace_segments	object	Repack the collection by substituting data blocks	query

The new collection’s content can be initialized by providing a manifest_text key in the provided collection object, or by using the replace_files option.

An alternative file packing can be applied atomically using the replace_segments option.
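
As an illustration, a create request carrying these options might be assembled as follows. This is a sketch under assumptions not stated on this page: that the arguments can be sent as a JSON request body and that the API token is passed in an `Authorization: Bearer` header. `build_create_request` is a hypothetical helper, and the request is only constructed, not sent.

```python
import json
import urllib.request

API_BASE = "https://pirca.arvadosapi.com/arvados/v1/collections"


def build_create_request(token, collection, replace_files=None, replace_segments=None):
    """Construct (but do not send) a POST request that creates a collection."""
    body = {"collection": collection}
    if replace_files is not None:
        body["replace_files"] = replace_files
    if replace_segments is not None:
        body["replace_segments"] = replace_segments
    return urllib.request.Request(
        API_BASE,
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumption: bearer-token authentication with a JSON body.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit it; that step is left out here so the sketch has no side effects.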

delete

Put a Collection in the trash. This sets the trash_at field to now and the delete_at field to now + Collections.DefaultTrashLifetime. A trashed collection is invisible to most API calls unless the include_trash parameter is true.

Arguments:

Argument	Type	Description	Location	Example
uuid	string	The UUID of the Collection in question.	path

get

Gets a Collection’s metadata by UUID or portable data hash. When making a request by portable data hash, attributes other than portable_data_hash, manifest_text, and trash_at are not returned, even when requested explicitly using the select parameter.

Arguments:

Argument	Type	Description	Location	Example
uuid	string	The UUID or portable data hash of the Collection in question.	path

list

List collections.

See common resource list method.

Argument	Type	Description	Location	Example
include_trash	boolean (default false)	Include trashed collections.	query
include_old_versions	boolean (default false)	Include past versions of the collection(s) being listed, if any.	query

Note: Because adding access tokens to manifests can be computationally expensive, the manifest_text field is not included in results by default. If you need it, pass a select parameter that includes manifest_text.
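
A query string for a list call that does request manifest_text might be built like this. The sketch assumes that filters and select are passed as JSON-encoded query parameters; `list_query` is a hypothetical helper, not part of any SDK.

```python
import json
from urllib.parse import urlencode


def list_query(filters, select=None, include_trash=False, include_old_versions=False):
    """Encode list arguments as a URL query string."""
    params = {"filters": json.dumps(filters)}
    if select:
        # manifest_text is only returned when it is named here.
        params["select"] = json.dumps(select)
    if include_trash:
        params["include_trash"] = "true"
    if include_old_versions:
        params["include_old_versions"] = "true"
    return urlencode(params)
```

The result can be appended to the API endpoint base after a `?`.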

Searching Collections for names of files or directories

You can search collections for specific file or directory names (whole or part) using the following filter in a list query.

filters: [["file_names", "ilike", "%sample1234.fastq%"]]

Note: file_names is a hidden field used for indexing. It is not returned by any API call. On the client, you can programmatically enumerate all the files in a collection using arv-ls, the Python SDK Collection class, Go SDK FileSystem struct, the WebDAV API, or the S3-compatible API.

As of this writing (Arvados 2.4), you can also search for directory paths, but not complete file paths.

In other words, this will work (when dir3 is a directory):

filters: [["file_names", "ilike", "%dir1/dir2/dir3%"]]

However, this will not return the desired results (where sample1234.fastq is a file):

filters: [["file_names", "ilike", "%dir1/dir2/dir3/sample1234.fastq%"]]

As a workaround, you can search for both the directory path and file name separately, and then filter on the client side.

filters: [["file_names", "ilike", "%dir1/dir2/dir3%"], ["file_names", "ilike", "%sample1234.fastq%"]]
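
The client-side half of this workaround can be sketched as follows, assuming each candidate collection was listed with manifest_text included in the select parameter. The parser is a simplified reading of the manifest format (stream name, block locators, then `position:size:name` file tokens, with octal escapes such as `\040` for spaces); the helper names are hypothetical.

```python
import re

_FILE_TOKEN = re.compile(r"^\d+:\d+:(\S+)$")
_OCTAL_ESCAPE = re.compile(r"\\([0-7]{3})")


def _unescape(name: str) -> str:
    """Decode octal escapes such as \\040 (space) in manifest names."""
    return _OCTAL_ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), name)


def file_paths(manifest_text: str) -> list:
    """Return the file paths named in a manifest, without the leading './'."""
    paths = []
    for line in manifest_text.splitlines():
        tokens = line.split(" ")
        stream = _unescape(tokens[0])
        for token in tokens[1:]:
            match = _FILE_TOKEN.match(token)
            if match is None:
                continue  # block locator, not a file token
            path = f"{stream}/{_unescape(match.group(1))}"
            if path.startswith("./"):
                path = path[2:]
            if path not in paths:  # a file may span several segments
                paths.append(path)
    return paths


def contains_file(manifest_text: str, wanted: str) -> bool:
    return wanted in file_paths(manifest_text)
```

For example, `contains_file(manifest_text, "dir1/dir2/dir3/sample1234.fastq")` keeps only the collections from the list results that hold the file at exactly that path.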

update

Update attributes of an existing Collection.

Arguments:

Argument	Type	Description	Location
uuid	string	The UUID of the Collection in question.	path
collection	object		query
replace_files	object	Add, delete, and replace files and directories with new content and/or content from other collections	query
replace_segments	object	Repack the collection by substituting data blocks	query

The collection’s existing content can be replaced entirely by providing a manifest_text key in the provided collection object, or updated in place by using the replace_files option.

An alternative file packing can be applied atomically using the replace_segments option.

untrash

Remove a Collection from the trash. This sets the trash_at and delete_at fields to null.

Arguments:

Argument	Type	Description	Location	Example
uuid	string	The UUID of the Collection to untrash.	path
ensure_unique_name	boolean (default false)	Rename collection uniquely if untrashing it would fail with a unique name conflict.	query

provenance

Returns a list of objects in the database that directly or indirectly contributed to producing this collection, such as the container request that produced this collection as output.

The general algorithm is:

Visit the container request that produced this collection (via output_uuid or log_uuid attributes of the container request)
Visit the input collections to that container request (via mounts and container_image of the container request)
Iterate until there are no more objects to visit
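
The traversal above amounts to a breadth-first search. The sketch below runs over simplified, hypothetical in-memory records (plain dicts with `uuid`, `output_uuid`, `log_uuid`, `container_image`, and `mounts` keys) rather than the real database, and is meant only to illustrate the order of visits, not the server's implementation.

```python
from collections import deque


def provenance(collection_uuid, container_requests):
    """Return identifiers visited while walking back from a collection."""
    visited = []
    seen = {collection_uuid}
    queue = deque([collection_uuid])
    while queue:
        current = queue.popleft()
        visited.append(current)
        for cr in container_requests:
            # Step 1: container requests that produced the current collection.
            if current not in (cr.get("output_uuid"), cr.get("log_uuid")):
                continue
            # Step 2: input collections, via mounts and container_image.
            inputs = [
                mount["uuid"]
                for mount in cr.get("mounts", {}).values()
                if mount.get("kind") == "collection" and mount.get("uuid")
            ]
            if cr.get("container_image"):
                inputs.append(cr["container_image"])
            for item in [cr["uuid"], *inputs]:
                if item not in seen:
                    seen.add(item)
                    queue.append(item)  # Step 3: keep going until nothing is new
    return visited
```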

Arguments:

Argument	Type	Description	Location	Example
uuid	string	The UUID of the Collection to get provenance for.	path

used_by

Returns a list of objects in the database that this collection directly or indirectly contributed to, such as containers that take this collection as input.

The general algorithm is:

Visit containers that take this collection as input (via mounts or container_image of the container)
Visit collections produced by those containers (via output or log of the container)
Iterate until there are no more objects to visit

Arguments:

Argument	Type	Description	Location	Example
uuid	string	The UUID of the Collection to get usage for.	path
