Configure S3 object storage

Keepstore can store data in object storage compatible with the S3 API, such as Amazon S3, Google Cloud Storage, Ceph RADOS, NetApp StorageGRID, and others.

Volumes are configured in the Volumes section of the cluster configuration file.

  1. Configuration example
  2. IAM Policy

Configuration example

Note that each volume has a UUID, like zzzzz-nyw5e-0123456789abcde. You assign these manually: replace zzzzz with your Cluster ID, and replace 0123456789abcde with an arbitrary unique string of 15 alphanumerics. Once assigned, UUIDs should not be changed.
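
The UUID format described above can be sketched in Python as follows. This is only an illustration: the helper name `new_volume_uuid` is hypothetical and not part of Arvados, the five-character cluster ID check is an assumption based on the `zzzzz` example, and restricting the suffix to lowercase letters and digits is also an assumption, since the text only asks for 15 alphanumerics.

```python
import secrets
import string

# Hypothetical helper, not part of Arvados.
# Assumption: the random suffix uses lowercase letters and digits only.
ALPHABET = string.ascii_lowercase + string.digits


def new_volume_uuid(cluster_id: str) -> str:
    """Return '<cluster_id>-nyw5e-<15 random alphanumerics>'."""
    # Assumption: cluster IDs are 5 alphanumeric characters, like "zzzzz".
    if len(cluster_id) != 5 or not cluster_id.isalnum():
        raise ValueError("cluster ID is assumed to be 5 alphanumeric characters")
    suffix = "".join(secrets.choice(ALPHABET) for _ in range(15))
    return f"{cluster_id}-nyw5e-{suffix}"


if __name__ == "__main__":
    # Prints a different UUID on every run; generate it once and keep it.
    print(new_volume_uuid("zzzzz"))
```

Because UUIDs should not change once assigned, a value generated this way would be produced once and then pasted into the configuration file, not regenerated on each deployment.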

Essential configuration values are highlighted in red. The remaining parameters are shown with their default values, for documentation.

    Volumes:
      ClusterID-nyw5e-000000000000000:
        AccessViaHosts:
          # This section determines which keepstore servers access the
          # volume. In this example, keep0 has read/write access, and
          # keep1 has read-only access.
          #
          # If the AccessViaHosts section is empty or omitted, all
          # keepstore servers will have read/write access to the
          # volume.
          "http://keep0.ClusterID.example.com:25107": {}
          "http://keep1.ClusterID.example.com:25107": {ReadOnly: true}

        Driver: S3
        DriverParameters:
          # Bucket name.
          Bucket: example-bucket-name

          # IAM role name to use when retrieving credentials from
          # instance metadata. It can be omitted, in which case the
          # role name itself will be retrieved from instance metadata
          # -- but setting it explicitly may protect you from using
          # the wrong credentials in the event of an
          # installation/configuration error.
          IAMRole: ""

          # If you are not using an IAM role for authentication,
          # specify access credentials here instead.
          AccessKeyID: ""
          SecretAccessKey: ""

          # Storage provider region. If Endpoint is specified, the
          # region determines the request signing method, and defaults
          # to "us-east-1".
          Region: us-east-1

          # Storage provider endpoint. For Amazon S3, use "" or
          # omit. For Google Cloud Storage, use
          # "https://storage.googleapis.com".
          Endpoint: ""

          # Change to true if the region requires a LocationConstraint
          # declaration.
          LocationConstraint: false

          # Use V2 signatures instead of the default V4. Amazon S3
          # supports V4 signatures in all regions, but this option
          # might be needed for other S3-compatible services.
          V2Signature: false

          # Use the AWS S3 v2 Go driver instead of the goamz driver.
          UseAWSS3v2Driver: false

          # By default keepstore stores data using the MD5 checksum
          # (32 hexadecimal characters) as the object name, e.g.,
          # "0123456abc...". Setting PrefixLength to 3 changes this
          # naming scheme to "012/0123456abc...". This can improve
          # performance, depending on the S3 service being used. For
          # example, PrefixLength 3 is recommended to avoid AWS
          # limitations on the number of read/write operations per
          # second per prefix (see
          # https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/).
          #
          # Note that changing PrefixLength on an existing volume is
          # not currently supported. Once you have started using a
          # bucket as an Arvados volume, you should not change its
          # configured PrefixLength, or configure another volume using
          # the same bucket and a different PrefixLength.
          PrefixLength: 0

          # Requested page size for "list bucket contents" requests.
          IndexPageSize: 1000

          # Maximum time to wait while making the initial connection
          # to the backend before failing the request.
          ConnectTimeout: 1m

          # Maximum time to wait for a complete response from the
          # backend before failing the request.
          ReadTimeout: 2m

          # Maximum eventual consistency latency
          RaceWindow: 24h

        # How much replication is provided by the underlying bucket.
        # This is used to inform replication decisions at the Keep
        # layer.
        Replication: 2

        # If true, do not accept write or trash operations, even if
        # AccessViaHosts.*.ReadOnly is false.
        #
        # If false or omitted, enable write access (subject to
        # AccessViaHosts.*.ReadOnly, where applicable).
        ReadOnly: false

        # Storage classes to associate with this volume.  See "Storage
        # classes" in the "Admin" section of doc.arvados.org.
        StorageClasses: null
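
The object naming scheme described in the PrefixLength comment can be illustrated with a short Python sketch. This is not keepstore's own code: the function name `object_name` is hypothetical, and the sketch only mirrors the mapping stated in the comment (the MD5 checksum as the object name, optionally preceded by its first PrefixLength characters and a slash).

```python
import hashlib


def object_name(data: bytes, prefix_length: int = 0) -> str:
    """Hypothetical illustration of the naming scheme in the PrefixLength comment."""
    md5_hex = hashlib.md5(data).hexdigest()  # 32 hexadecimal characters
    if prefix_length <= 0:
        # Default scheme: the checksum alone is the object name.
        return md5_hex
    # With a prefix: e.g. "012/0123456abc..." for PrefixLength 3.
    return f"{md5_hex[:prefix_length]}/{md5_hex}"


if __name__ == "__main__":
    print(object_name(b"example block", 0))
    print(object_name(b"example block", 3))
```

The sketch also shows why changing PrefixLength on an existing bucket is a problem: the same block maps to a different object name under each setting, so objects written under one value would not be found under another.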

Two S3 drivers are available. Historically, Arvados has used the goamz driver to talk to S3-compatible services. More recently, support for the aws-sdk-go-v2 driver was added. This driver can be activated by setting the UseAWSS3v2Driver flag to true.

The aws-sdk-go-v2 driver does not support the old S3 V2 signing algorithm. This does not affect AWS S3, but it might be an issue when Keep is backed by a very old version of a third-party S3-compatible service.

The aws-sdk-go-v2 driver can improve read performance by 50-100% over the goamz driver, but it has not had as much production use. See the wiki for details.

IAM Policy

On Amazon, VMs that access the S3 bucket (including keepstore and compute nodes) need an IAM policy that permits reading, writing, listing, and deleting objects in the bucket. Here is an example policy:

{
    "Id": "arvados-keepstore policy",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                  "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::xarv1-nyw5e-000000000000000-volume",
                "arn:aws:s3:::xarv1-nyw5e-000000000000000-volume/*"
            ]
        }
    ]
}
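
The example policy covers both the bucket itself and every object under it. When several volumes each use their own bucket, a policy of this shape could be rendered per bucket with a small Python sketch like the one below. The function name `keepstore_bucket_policy` is hypothetical and not part of Arvados or any AWS tooling; the sketch only produces the JSON text and does not attach it to a role.

```python
import json


def keepstore_bucket_policy(bucket: str) -> dict:
    """Hypothetical helper: build a policy like the example for one bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Id": "arvados-keepstore policy",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:*"],
                # The bucket ARN covers listing; the "/*" ARN covers objects.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


if __name__ == "__main__":
    policy = keepstore_bucket_policy("xarv1-nyw5e-000000000000000-volume")
    print(json.dumps(policy, indent=4))
```

Building the policy as a dictionary and serializing it with `json.dumps` avoids hand-editing mistakes such as a missing comma or a repeated key.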

The content of this documentation is licensed under the Creative Commons Attribution-Share Alike 3.0 United States licence.
Code samples in this documentation are licensed under the Apache License, Version 2.0.