Keep components overview

Keep has a number of components. This page describes each component and the role it plays.

Keep clients for data access

In order to access data in Keep, a client is needed to store data in and retrieve data from Keep. Different types of Keep clients exist:

a command line client like arv-get or arv-put
a FUSE mount provided by arv-mount
a WebDAV mount provided by keep-web
an S3-compatible endpoint provided by keep-web
programmatic access via the Arvados SDKs

In essense, these clients all do the same thing: they translate file and directory references into requests for Keep blocks and collection manifests. How Keep clients work, and how they use rendezvous hashing, is described in greater detail in the next section.

For example, when a request comes in to read a file from Keep, the client will

request the collection object (including its manifest) from the API server
look up the file in the collection manifest, and retrieve the hashes of the block(s) that contain its content
ask the keepstore(s) for the block hashes
return the contents of the file to the requestor

All of those steps are subject to access control, which applies at the level of the collection: in the example above, the API server and the keepstore daemons verify that the client has permission to read the collection, and will reject the request if it does not.

API server

The API server stores collection objects and all associated metadata. That includes data about where the blocks for a collection are to be stored, e.g. when storage classes are configured, as well as the desired and confirmed replication count for each block. It also stores the ACLs that control access to the collections. Finally, the API server provides Keep clients with time-based block signatures for access.

Keepstore

The keepstore daemon is Keep’s workhorse, the storage server that stores and retrieves data from an underlying storage system. Keepstore exposes an HTTP REST API. Keepstore only handles requests for blocks. Because blocks are content-addressed, they can be written and deleted, but there is no update operation: blocks are immutable.

So what happens if the content of a file changes? When a client changes a file, it first writes any new blocks to the keepstore(s). Then, it updates the manifest for the collection the file belongs to with the references to the new blocks.

A keepstore can store its blocks in object storage (S3 or an S3-compatible system, or Azure Blob Storage). It can also store blocks on a POSIX file system. A keepstore can be configured with multiple storage volumes. Each keepstore volume is configured with a replication number; e.g. a POSIX file system backed by a single disk would have a replication factor of 1, while an Azure ‘LRS’ storage volume could be configured with a replication factor of 3 (that is how many copies LRS stores under the hood, according to the Azure documentation).

By default, Arvados uses a replication factor of 2. See the DefaultReplication configuration parameter in the configuration reference. Additionally, each collection can be configured with its own replication factor. It’s worth noting that it is the responsibility of the Keep clients to make sure that all blocks are stored subject to their desired replica count, which is derived from the collections the blocks belong to. keepstore itself does not provide replication; all it does is store blocks on the volumes it knows about. The keepproxy and keep-balance processes (see below) make sure that blocks are replicated properly.

The maximum block size for keepstore is 64 MiB, and keep clients typically combine small files into larger blocks. In a typical Arvados installation, the majority of blocks stored in Keep will be 64 MiB, though some fraction will be smaller.

Keepproxy

The keepproxy server is a gateway into your Keep storage. Unlike the Keepstore servers, which are only accessible on the local LAN, Keepproxy is suitable for clients located elsewhere on the internet. A client writing through Keepproxy only writes one copy of each block; the Keepproxy server will write additional copies of the data to the Keepstore servers, to fulfill the requested replication factor. Keepproxy also checks API token validity before processing requests.

Keep-web

The keep-web server provides read/write access to files stored in Keep using the HTTP, WebDAV and S3 protocols. This makes it easy to access files in Keep from a browser, or mount Keep as a network folder using WebDAV support in various operating systems. It serves public data to unauthenticated clients, and serves private data to clients that supply Arvados API tokens.

Keep-balance

Keep is a garbage-collected system. When a block is no longer referenced in any collection manifest in the system, it becomes eligible for garbage collection. When the desired replication factor for a block (derived from the default replication factor, in addition to the replication factor of any collection(s) the block belongs to) does not match reality, the number of copies stored in the available Keepstore servers needs to be adjusted.

The keep-balance program takes care of these things. It runs as a service, and wakes up periodically to do a scan of the system and send instructions to the Keepstore servers. That process is described in more detail at Balancing Keep servers.

Previous: Introduction to Keep Next: Keep clients