Keep has a number of components. This page describes each component and the role it plays.
In order to access data in Keep, a client is needed to store data in and retrieve data from Keep. Different types of Keep clients exist:
In essense, these clients all do the same thing: they translate file and directory references into requests for Keep blocks and collection manifests. How Keep clients work, and how they use rendezvous hashing, is described in greater detail in the next section.
For example, when a request comes in to read a file from Keep, the client will
All of those steps are subject to access control, which applies at the level of the collection: in the example above, the API server and the keepstore daemons verify that the client has permission to read the collection, and will reject the request if it does not.
The API server stores collection objects and all associated metadata. That includes data about where the blocks for a collection are to be stored, e.g. when storage classes are configured, as well as the desired and confirmed replication count for each block. It also stores the ACLs that control access to the collections. Finally, the API server provides Keep clients with time-based block signatures for access.
keepstore daemon is Keep’s workhorse, the storage server that stores and retrieves data from an underlying storage system. Keepstore exposes an HTTP REST API. Keepstore only handles requests for blocks. Because blocks are content-addressed, they can be written and deleted, but there is no update operation: blocks are immutable.
So what happens if the content of a file changes? When a client changes a file, it first writes any new blocks to the keepstore(s). Then, it updates the manifest for the collection the file belongs to with the references to the new blocks.
A keepstore can store its blocks in object storage (S3 or an S3-compatible system, or Azure Blob Storage). It can also store blocks on a POSIX file system. A keepstore can be configured with multiple storage volumes. Each keepstore volume is configured with a replication number; e.g. a POSIX file system backed by a single disk would have a replication factor of 1, while an Azure ‘LRS’ storage volume could be configured with a replication factor of 3 (that is how many copies LRS stores under the hood, according to the Azure documentation).
By default, Arvados uses a replication factor of 2. See the
DefaultReplication configuration parameter in the configuration reference. Additionally, each collection can be configured with its own replication factor. It’s worth noting that it is the responsibility of the Keep clients to make sure that all blocks are stored subject to their desired replica count, which is derived from the collections the blocks belong to.
keepstore itself does not provide replication; all it does is store blocks on the volumes it knows about. The
keep-balance processes (see below) make sure that blocks are replicated properly.
The maximum block size for
keepstore is 64 MiB, and keep clients typically combine small files into larger blocks. In a typical Arvados installation, the majority of blocks stored in Keep will be 64 MiB, though some fraction will be smaller.
keepproxy server is a gateway into your Keep storage. Unlike the Keepstore servers, which are only accessible on the local LAN, Keepproxy is suitable for clients located elsewhere on the internet. A client writing through Keepproxy only writes one copy of each block; the Keepproxy server will write additional copies of the data to the Keepstore servers, to fulfill the requested replication factor. Keepproxy also checks API token validity before processing requests.
keep-web server provides read/write access to files stored in Keep using the HTTP, WebDAV and S3 protocols. This makes it easy to access files in Keep from a browser, or mount Keep as a network folder using WebDAV support in various operating systems. It serves public data to unauthenticated clients, and serves private data to clients that supply Arvados API tokens.
Keep is a garbage-collected system. When a block is no longer referenced in any collection manifest in the system, it becomes eligible for garbage collection. When the desired replication factor for a block (derived from the default replication factor, in addition to the replication factor of any collection(s) the block belongs to) does not match reality, the number of copies stored in the available Keepstore servers needs to be adjusted.
keep-balance program takes care of these things. It runs as a service, and wakes up periodically to do a scan of the system and send instructions to the Keepstore servers. That process is described in more detail at Balancing Keep servers.
The content of this documentation is licensed under the
Commons Attribution-Share Alike 3.0 United States licence.
Code samples in this documentation are licensed under the Apache License, Version 2.0.