Data lifecycle

Overview

Arvados collections consist of a manifest and the data blocks referenced in that manifest. Manifests are stored in the PosgreSQL database, data blocks are stored by a keepstore.

Data blocks are frequently shared between collections. Each collection has its own manifest. Collection manifests and data blocks have a separate lifecycle, which is described in detail below.

Collection lifecycle

During its lifetime, a collection can be in various states. These states are persisted, expiring, trashed and permanently deleted.

The nominal state is persisted which means the data can be can be accessed normally and will be retained indefinitely.

A collection is expiring when it has a trash_at time in the future. An expiring collection can be accessed as normal, but is scheduled to be trashed automatically at the trash_at time.

A collection is trashed when it has a trash_at time in the past. The is_trashed attribute will also be “true”. The delete operation immediately puts the collection in the trash by setting the trash_at time to “now”, and delete_at defaults to “now” + Collections.DefaultTrashLifetime. Once trashed, the collection is no longer readable through normal data access APIs. The collection will have delete_at set to some time in the future. The trashed collection is recoverable until the delete_at time passes, at which point the collection is permanently deleted.

See Recovering trashed collections for instructions to recover trashed collections.

Collection lifecycle attributes

As listed above the attributes that are used to manage a collection lifecycle are is_trashed, trash_at, and delete_at. The table below lists the values of these attributes and how they influence the state of a collection and its accessibility.

collection state is_trashed trash_at delete_at get list list?include_trash=true can be modified
persisted collection false null null yes yes yes yes
expiring collection false future future yes yes yes yes
trashed collection true past future no no yes only is_trashed, trash_at and delete_at attributes
deleted collection true past past no no no no

Block lifecycle

During its lifetime, a data block can be in various states. These states are persisted, unreferenced, trashed and permanently deleted.

The nominal state is persisted which means the block can be can be retrieved normally from a keepstore process.

A block is unreferenced when there are no collection manifests in the PostgreSQL collections table that reference it. The block can still be retrieved normally from a keepstore process, e.g. by creating a new collection with a manifest that references the hash of the block. Unreferenced blocks will be moved to the trashed state by keep-balance after BlobSigningTTL, if BlobTrash is enabled and keep-balance is running and configured to send trash lists to the keepstores.

A block is trashed when keep-balance has asked a keepstore to move it to its trash and BlobTrash is enabled. It will stay there for a period of time, subject to the BlobTrashLifetime settings.

A block is permanently deleted on the first wakeup of its keepstore trash process after the block has spent BlobTrashLifetime in that keepstore’s trash. The trash process wakes up with a frequency defined by the BlobTrashCheckInterval.

block state duration retrievable via Keep can be recovered
persisted block indefinitely yes n/a
unreferenced block BlobSigningTTL + up to BalancePeriod + duration of keep-balance run yes n/a
trashed block BlobTrashLifetime + up to BlobTrashCheckInterval no yes
deleted block no no

Previous: Keep clients Next: Manifest format

The content of this documentation is licensed under the Creative Commons Attribution-Share Alike 3.0 United States licence.
Code samples in this documentation are licensed under the Apache License, Version 2.0.