This multi-host installer is the recommended way to set up a production Arvados cluster. These instructions include specific details for installing on Amazon Web Services (AWS), which are marked as “AWS specific”. However, with additional customization the installer can be used as a template for deployment on other cloud providers or HPC systems.
Choose a 5-character cluster identifier that will represent the cluster. Here are guidelines on choosing a cluster identifier. Only lowercase letters and digits 0-9 are allowed. Examples will use xarv1 or ${CLUSTER}; you should substitute the cluster id you have selected.
Determine the base domain for the cluster. This will be referred to as ${DOMAIN}
.
For example, if DOMAIN is xarv1.example.com
, then controller.${DOMAIN}
means controller.xarv1.example.com
.
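The naming rules above can be checked mechanically. The following is a minimal sketch, not part of the installer: the function name valid_cluster_id is hypothetical, and example.com stands in for your own parent domain.

```shell
#!/bin/sh
# Hypothetical helper: succeed only for an id of exactly 5 characters,
# each a lowercase letter or a digit. Runs in a subshell so that the
# locale override does not leak into the caller.
valid_cluster_id() (
    LC_ALL=C
    case "$1" in
        [a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9]) exit 0 ;;
        *) exit 1 ;;
    esac
)

CLUSTER=xarv1
DOMAIN="${CLUSTER}.example.com"   # example.com is a placeholder

if valid_cluster_id "$CLUSTER"; then
    # Service names are formed by prefixing the base domain.
    echo "controller.${DOMAIN}"
fi
```

With the values shown, the last command prints controller.xarv1.example.com.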
You will need a DNS entry for each service. When using the Terraform script to set up your infrastructure, these domains will be created automatically using AWS Route 53.
In the default configuration these are:
controller.${DOMAIN}
ws.${DOMAIN}
keep0.${DOMAIN}
keep1.${DOMAIN}
keep.${DOMAIN}
download.${DOMAIN}
*.collections.${DOMAIN} (important: this must be a wildcard DNS entry, resolving to the keepweb service)
workbench.${DOMAIN}
workbench2.${DOMAIN}
webshell.${DOMAIN}
shell.${DOMAIN}
prometheus.${DOMAIN}
grafana.${DOMAIN}
For more information, see DNS entries and TLS certificates.
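If you create the DNS entries yourself, it can help to generate the list of names from ${DOMAIN} and check that each one resolves. The sketch below is an illustration only: service_hostnames and check_dns are hypothetical helpers, the label "test" under collections is an arbitrary name used to exercise the wildcard record, and check_dns assumes a system that provides the getent command.

```shell
#!/bin/sh
# Print the default service hostnames for the given base domain.
service_hostnames() {
    domain=$1
    for name in controller ws keep0 keep1 keep download workbench \
                workbench2 webshell shell prometheus grafana; do
        printf '%s.%s\n' "$name" "$domain"
    done
    # Any label under "collections" should match the wildcard entry.
    printf 'test.collections.%s\n' "$domain"
}

# Report every name that does not resolve (assumes getent is available).
check_dns() {
    service_hostnames "$1" | while read -r host; do
        getent hosts "$host" >/dev/null || echo "unresolved: $host"
    done
}

# Usage (needs working DNS): check_dns xarv1.example.com
```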
This is a package-based installation method; however, the installation script is currently distributed in source form via git. We recommend checking out the git tree on your local workstation, not directly on the target(s) where you want to install and run Arvados.
git clone https://github.com/arvados/arvados.git
cd arvados
git checkout main
cd tools/salt-install
The install.sh
and provision.sh
scripts will help you deploy Arvados by preparing your environment to be able to run the installer, then running it. The actual installer is located in the arvados-formula git repository and will be cloned during the running of the provision.sh
script. The installer is built using Saltstack and provision.sh
performs the install using masterless mode.
Replace “xarv1” with the cluster id you selected earlier.
This creates a git repository in ~/setup-arvados-xarv1
. The installer.sh
will record all the configuration changes you make, as well as using git push
to synchronize configuration edits if you have multiple nodes.
Important! Once you have initialized the installer directory, all further commands must be run with ~/setup-arvados-${CLUSTER}
as the current working directory.
If you are going to use Terraform to set up the infrastructure on AWS, you first need to install the Terraform CLI and the AWS CLI tool. Then you can initialize the installer.
CLUSTER=xarv1
./installer.sh initialize ~/setup-arvados-${CLUSTER} multiple_hosts multi_host/aws terraform/aws
cd ~/setup-arvados-${CLUSTER}
If you are not using Terraform, initialize the installer without the Terraform template:
CLUSTER=xarv1
./installer.sh initialize ~/setup-arvados-${CLUSTER} multiple_hosts multi_host/aws
cd ~/setup-arvados-${CLUSTER}
We provide a set of Terraform code files that you can run to create the necessary infrastructure on Amazon Web Services.
These files are located in the terraform
installer directory and are divided in three sections:
The terraform/vpc/ subdirectory controls the network-related infrastructure of your cluster, including firewall rules and split-horizon DNS resolution.
The terraform/data-storage/ subdirectory controls the stateful part of your cluster. Currently it only sets up the S3 bucket for holding the Keep blocks; in the future it will also manage the database service.
The terraform/services/ subdirectory controls the hosts that will run the different services on your cluster, and makes sure that they have the required software for the installer to do its job.
The Terraform state files (which keep crucial infrastructure information from the cloud) will be saved inside each subdirectory, under the terraform.tfstate name. These will be committed to the git repository used to coordinate deployment. It is very important to keep this git repository secure: only sysadmins who will be responsible for maintaining your Arvados cluster should have access to it.
Each section described above contains a terraform.tfvars file with some configuration values that you should set before applying each configuration. You should at least set the AWS region, cluster prefix and domain name in terraform/vpc/terraform.tfvars:
# Copyright (C) The Arvados Authors. All rights reserved.
#
# SPDX-License-Identifier: CC-BY-SA-3.0
# Main cluster configurations. No sensible defaults provided for these:
# region_name = "us-east-1"
# cluster_name = "xarv1"
# domain_name = "xarv1.example.com"
# Uncomment this to create a non-publicly accessible Arvados cluster
# private_only = true
# Optional networking options. Set existing resources to be used instead of
# creating new ones.
# NOTE: We only support fully managed or fully custom networking, not a mix of both.
#
# vpc_id = "vpc-aaaa"
# sg_id = "sg-bbbb"
# public_subnet_id = "subnet-cccc"
# private_subnet_id = "subnet-dddd"
#
# RDS related parameters:
# use_rds = true
# additional_rds_subnet_id = "subnet-eeee"
# Optional custom tags to add to every resource. Default: {}
# custom_tags = {
# environment = "production"
# project = "Phoenix"
# owner = "jdoe"
# }
# Optional cluster service nodes configuration:
#
# List of node names which either will be hosting user-facing or internal
# services. Defaults:
# user_facing_hosts = [ "controller", "workbench" ]
# internal_service_hosts = [ "keep0", "shell" ]
#
# Map assigning each node name an internal IP address. Defaults:
# private_ip = {
# controller = "10.1.1.11"
# workbench = "10.1.1.15"
# shell = "10.1.2.17"
# keep0 = "10.1.2.13"
# }
#
# Map assigning DNS aliases for service node names. Defaults:
# dns_aliases = {
# workbench = [
# "ws",
# "workbench2",
# "webshell",
# "keep",
# "download",
# "prometheus",
# "grafana",
# "loki",
# "*.collections"
# ]
# }
If you don’t set the main configuration variables in the vpc/terraform.tfvars file, you will be asked to re-enter these parameters every time you run Terraform.
The data-storage/terraform.tfvars
and services/terraform.tfvars
let you configure additional details, including the SSH public key for deployment, instance & volume sizes, etc. All these configurations are provided with sensible defaults:
# Copyright (C) The Arvados Authors. All rights reserved.
#
# SPDX-License-Identifier: CC-BY-SA-3.0
# Set to true if the database server won't be running in any service instance.
# Default: false
# use_external_db = true
# Copyright (C) The Arvados Authors. All rights reserved.
#
# SPDX-License-Identifier: CC-BY-SA-3.0
# SSH public key path to use by the installer script. It will be installed in
# the home directory of the 'deploy_user'. Default: ~/.ssh/id_rsa.pub
# pubkey_path = "/path/to/pub.key"
# Set the instance type for your nodes. Default: m5a.large
# instance_type = {
# default = "m5a.xlarge"
# controller = "c5a.4xlarge"
# }
# Set the volume size (in GiB) per service node.
# Default: 100 for controller, 20 the rest.
# NOTE: The service node will need to be rebooted after increasing its volume's
# size.
# instance_volume_size = {
# default = 20
# controller = 300
# }
# Use an RDS instance for database. For this to work, make sure to also set
# 'use_rds' to true in '../vpc/terraform.tfvars'.
# use_rds = true
#
# Provide custom values if needed.
# rds_username = ""
# rds_password = ""
# rds_instance_type = "db.m5.xlarge"
# rds_postgresql_version = "16.3"
# rds_allocated_storage = 200
# rds_max_allocated_storage = 1000
# rds_backup_retention_period = 30
# rds_backup_before_deletion = false
# rds_final_backup_name = ""
# AWS secret's name which holds the SSL certificate private key's password.
# Default: "arvados-ssl-privkey-password"
# ssl_password_secret_name_suffix = "some-name-suffix"
# User for software deployment. Depends on the AMI's distro.
# Default: "admin"
# deploy_user = "ubuntu"
# Instance AMI to use for service nodes. Default: latest from Debian 11
# instance_ami = "ami-0481e8ba7f486bd99"
# Customer-managed Key to use for volume encryption.
# cmk_arn = "arn:aws:kms:...."
You will need an AWS access key and secret key to create the infrastructure.
export AWS_ACCESS_KEY_ID="anaccesskey"
export AWS_SECRET_ACCESS_KEY="asecretkey"
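Before running Terraform it can be useful to confirm that both variables are actually set in the current shell. This is a small sketch under that assumption; require_env is a hypothetical helper, not something the installer provides.

```shell
#!/bin/sh
# Hypothetical helper: fail if any of the named environment variables
# is unset or empty. Only pass trusted variable names, since the name
# is expanded through eval.
require_env() {
    for var in "$@"; do
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "missing environment variable: $var" >&2
            return 1
        fi
    done
}

# Usage: require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY && ./installer.sh terraform
```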
Build the infrastructure by running ./installer.sh terraform. The last stage will output the information needed to set up the cluster’s domain and continue with the installer. For example:
./installer.sh terraform
...
Apply complete! Resources: 16 added, 0 changed, 0 destroyed.
Outputs:
arvados_sg_id = "sg-02f999a99973999d7"
arvados_subnet_id = "subnet-01234567abc"
cluster_int_cidr = "10.1.0.0/16"
cluster_name = "xarv1"
compute_subnet_id = "subnet-abcdef12345"
deploy_user = "admin"
domain_name = "xarv1.example.com"
letsencrypt_iam_access_key_id = "AKAA43MAAAWAKAADAASD"
loki_iam_access_key_id = "AKAABCDEFGJKLMNOP1234"
private_ip = {
"controller" = "10.1.1.1"
"keep0" = "10.1.1.3"
"keep1" = "10.1.1.4"
"keepproxy" = "10.1.1.2"
"shell" = "10.1.1.7"
"workbench" = "10.1.1.5"
}
public_ip = {
"controller" = "18.235.116.23"
"keep0" = "34.202.85.86"
"keep1" = "38.22.123.98"
"keepproxy" = "34.231.9.201"
"shell" = "44.208.155.240"
"workbench" = "52.204.134.136"
}
region_name = "us-east-1"
route53_dns_ns = tolist([
"ns-1119.awsdns-11.org",
"ns-1812.awsdns-34.co.uk",
"ns-437.awsdns-54.com",
"ns-809.awsdns-37.net",
])
ssl_password_secret_name = "xarv1-arvados-ssl-privkey-password"
vpc_id = "vpc-0999994998399923a"
letsencrypt_iam_secret_access_key = "XXXXXSECRETACCESSKEYXXXX"
database_password = <not set>
loki_iam_secret_access_key = "YYYYYYSECRETACCESSKEYYYYYYY"
Once Terraform has completed, the infrastructure for your Arvados cluster is up and running. One last piece of DNS configuration is required.
The domain names for your cluster (e.g.: controller.xarv1.example.com) are managed via Route 53 and the TLS certificates will be issued using Let’s Encrypt.
You need to configure the parent domain to delegate to the newly created zone. For example, you need to configure “example.com” to delegate the subdomain “xarv1.example.com” to the nameservers for the Arvados hostname records created by Terraform. You do this by creating an NS record on the parent domain that refers to the name servers listed in the Terraform output parameter route53_dns_ns.
If your parent domain is also controlled by Route 53, the process will be like this:
Create a record of type “NS - Name servers for a hosted zone” in the parent zone, and set its value to the contents of route53_dns_ns, one hostname per line, with punctuation (quotes and commas) removed.
If the parent domain is controlled by some other service, follow the guide for the appropriate service.
The certificates will be requested from Let’s Encrypt when you run the installer.
cluster_int_cidr will be used to set CLUSTER_INT_CIDR.
compute_subnet_id and arvados_sg_id will be used to set COMPUTE_SUBNET and COMPUTE_SG in local.params and when you create a compute image.
You can now proceed to edit local.params* files.
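The mapping from Terraform outputs to installer parameters can be sketched as a small filter. This is an illustration, not part of the installer: tf_outputs_to_params is a hypothetical helper that reads lines in the key = "value" form shown in the sample output and prints the corresponding assignments. COMPUTE_SG is described elsewhere in this guide as a list, so the exact syntax expected in local.params may differ from what is printed here.

```shell
#!/bin/sh
# Hypothetical helper: read Terraform output lines on stdin and print
# the matching local.params variable names with their values.
tf_outputs_to_params() {
    sed -n \
        -e 's/^cluster_int_cidr *= *"\(.*\)"$/CLUSTER_INT_CIDR="\1"/p' \
        -e 's/^compute_subnet_id *= *"\(.*\)"$/COMPUTE_SUBNET="\1"/p' \
        -e 's/^arvados_sg_id *= *"\(.*\)"$/COMPUTE_SG="\1"/p'
}

# Usage, assuming the output was saved to a (hypothetical) file:
#   tf_outputs_to_params < terraform-output.txt
```

Lines are emitted in the order they appear in the input, and all other outputs are ignored.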
If you will be setting up infrastructure without using the provided Terraform script, here are the recommendations you will need to consider.
We recommend setting Arvados up in its own Virtual Private Cloud (VPC).
When you do so, you need to configure a couple of additional things:
We recommend creating an S3 bucket for data storage named ${CLUSTER}-nyw5e-000000000000000-volume
. We recommend creating an IAM role called ${CLUSTER}-keepstore-00-iam-role
with a policy that can read, write, list and delete objects in the bucket. With the example cluster id xarv1
the bucket would be called xarv1-nyw5e-000000000000000-volume
and the role would be called xarv1-keepstore-00-iam-role
.
These names are recommended because they are default names used in the configuration template. If you use different names, you will need to edit the configuration template later.
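As a quick illustration, the recommended names can be derived from the cluster id. The variable names bucket_name and role_name below are hypothetical and used only for this sketch.

```shell
#!/bin/sh
CLUSTER=xarv1

# Recommended default names, built from the cluster identifier.
bucket_name="${CLUSTER}-nyw5e-000000000000000-volume"
role_name="${CLUSTER}-keepstore-00-iam-role"

echo "bucket: ${bucket_name}"
echo "role: ${role_name}"
```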
You will need to allocate several hosts (physical or virtual machines) for the fixed infrastructure of the Arvados cluster. These machines should have at least 2 cores and 8 GiB of RAM, running a supported Linux distribution.
Supported Linux Distributions |
---|
AlmaLinux 8 (since 8.4) |
Red Hat Enterprise Linux 8 (since 8.4) |
Rocky Linux 8 (since 8.4) |
Debian 12 (“bookworm”) |
Debian 11 (“bullseye”) |
Ubuntu 24.04 (“noble”) |
Ubuntu 22.04 (“jammy”) |
Ubuntu 20.04 (“focal”) |
Allocate the following hosts as appropriate for your site. On AWS you may choose to do it manually with the AWS console, or using a DevOps tool such as CloudFormation or Terraform. With the exception of “keep0” and “keep1”, all of these hosts should have external (public) IP addresses if you intend for them to be accessible outside of the private network or VPC.
The installer will set up the Arvados services on your machines. Here is the default assignment of services to machines:
controller (controller.${DOMAIN})
keepstore (keep0.${DOMAIN} and keep1.${DOMAIN})
workbench (workbench.${DOMAIN})
workbench2 (workbench2.${DOMAIN})
webshell (webshell.${DOMAIN})
websocket (ws.${DOMAIN})
keepproxy (keep.${DOMAIN})
keepweb (download.${DOMAIN} and *.collections.${DOMAIN})
shell (shell.${DOMAIN})
When using the database installed by Arvados (and not an external database), the database is stored under /var/lib/postgresql. Arvados logs are also kept in /var/log and /var/www/arvados-api/shared/log. Accordingly, you should ensure that the disk partition containing /var has adequate storage for your planned usage. We suggest starting with 50 GiB of free space on the database host.
Working ssh access to each machine, with a key listed in ~/.ssh/authorized_keys on each node.
Passwordless sudo access on the account on each machine you will ssh in to. This can be done by adding the account to the sudo group and having a rule like this in /etc/sudoers.d/arvados_passwordless that allows members of group sudo to execute any command without entering a password:
%sudo ALL=(ALL:ALL) NOPASSWD:ALL
git installed on each machine.
(AWS specific) The machine that runs the arvados cloud dispatcher will need an IAM role that allows it to manage EC2 instances.
If your infrastructure differs from the setup proposed above (i.e., different hostnames), you can still use the installer, but additional customization may be necessary.
local.params* files
The cluster configuration parameters are included in two files: local.params and local.params.secrets. These files can be found wherever you choose to initialize the installation files (e.g., ~/setup-arvados-xarv1 in these examples).
The local.params.secrets
file is intended to store security-sensitive data such as passwords, private keys, tokens, etc. Depending on the security requirements of the cluster deployment, you may wish to store this file in a secrets store like AWS Secrets Manager or Jenkins credentials.
Parameters in local.params:
Set CLUSTER to the 5-character cluster identifier (e.g. “xarv1”).
Set DOMAIN to the base DNS domain of the environment (e.g. “xarv1.example.com”).
Set the *_INT_IP variables with the internal (private) IP addresses of each host. Since services share hosts, some hosts are the same. See the note about /etc/hosts.
Set CLUSTER_INT_CIDR to the CIDR of the private network that Arvados is running on, e.g. the VPC. If you used Terraform, this is emitted as cluster_int_cidr.
Set INITIAL_USER_EMAIL to your email address, as you will be the first admin user of the system.
Parameters in local.params.secrets:
Set each KEY / TOKEN / PASSWORD to a random string. You can use installer.sh generate-tokens:
./installer.sh generate-tokens
BLOB_SIGNING_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
MANAGEMENT_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
SYSTEM_ROOT_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ANONYMOUS_USER_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
DATABASE_PASSWORD=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
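Set DATABASE_PASSWORD to a random string (unless you already have a database, in which case you should set it to that database’s password). Special characters in the password must be backslash-escaped. For example, the password Lq&MZ<V']d?j is written as:
DATABASE_PASSWORD="Lq\&MZ\<V\'\]d\?j"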
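The escaped form in the example above can be produced mechanically. The sketch below is an assumption about what is acceptable, not a statement of the installer's parsing rules: escape_password is a hypothetical helper that puts a backslash in front of every character other than ASCII letters, digits and underscore, which reproduces the example.

```shell
#!/bin/sh
# Hypothetical helper: backslash-escape every character that is not an
# ASCII letter, digit or underscore.
escape_password() {
    printf '%s' "$1" | LC_ALL=C sed 's/[^A-Za-z0-9_]/\\&/g'
}

# Example with the password from this guide:
escape_password "Lq&MZ<V']d?j"
echo
```

For the input shown, the helper prints Lq\&MZ\<V\'\]d\?j, matching the quoted value above.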
Set the LE_AWS_* credentials to allow Let’s Encrypt to do authentication through Route 53.
Set the LOKI_AWS_* credentials to enable the Loki service to store centralized logs on its dedicated S3 bucket.
Set DISPATCHER_SSH_PRIVKEY to an SSH private key that arvados-dispatch-cloud will use to connect to the compute nodes:
DISPATCHER_SSH_PRIVKEY="-----BEGIN OPENSSH PRIVATE KEY-----
b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAABlwAAAAdzc2gtcn
...
s4VY40kNxs6MsAAAAPbHVjYXNAaW5zdGFsbGVyAQIDBA==
-----END OPENSSH PRIVATE KEY-----"
You can create one by following the steps described on the building a compute node documentation page.
Note on /etc/hosts
Because Arvados services are typically accessed by external clients, they are likely to have both a public IP address and an internal IP address.
On cloud providers such as AWS, sending internal traffic to a service’s public IP address can incur egress costs and throttling. Thus it is very important for internal traffic to stay on the internal network. The installer implements this by updating /etc/hosts
on each node to associate each service’s hostname with the internal IP address, so that when Arvados services communicate with one another, they always use the internal network address. This is NOT a substitute for DNS: you still need to set up DNS names for all of the services that have public IP addresses (it does, however, avoid a complex “split-horizon” DNS configuration).
It is important to be aware of this because if you mistype the IP address for any of the *_INT_IP
variables, hosts may unexpectedly fail to be able to communicate with one another. If this happens, check and edit as necessary the file /etc/hosts
on the host that is failing to make an outgoing connection.
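When checking a node by hand, a small lookup over the hosts file shows which address a service name maps to. This is a sketch only: hosts_lookup is a hypothetical helper, and it assumes the usual hosts file layout of an IP address followed by one or more names, with # starting a comment.

```shell
#!/bin/sh
# Hypothetical helper: print the IP address that a hosts-format file
# assigns to the given name (first match wins).
hosts_lookup() {
    file=$1
    name=$2
    awk -v name="$name" '
        { sub(/#.*/, "") }    # strip comments before matching
        {
            for (i = 2; i <= NF; i++) {
                if ($i == name) { print $1; exit }
            }
        }
    ' "$file"
}

# Usage on a node: hosts_lookup /etc/hosts controller.xarv1.example.com
```

The printed address can then be compared with the corresponding *_INT_IP value in local.params.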
The multi_host/aws
template uses S3 for storage. Arvados also supports filesystem storage and Azure blob storage. Keep storage configuration can be found in the arvados.cluster.Volumes
section of local_config_dir/pillars/arvados.sls
.
If you followed the recommended naming scheme for both the bucket and role (or used the provided Terraform script), you’re done.
If you did not follow the recommended naming scheme for either the bucket or role, you’ll need to update these parameters in local.params:
Set KEEP_AWS_S3_BUCKET to the name of the keepstore bucket you created earlier.
Set KEEP_AWS_IAM_ROLE to the keepstore role you created earlier.
You can also configure a specific AWS Region for the S3 bucket by setting KEEP_AWS_REGION.
Arvados requires a valid TLS certificate to work correctly. This installer supports these options:
lets-encrypt: automatically obtain and install SSL certificates for your hostnames
bring-your-own: supply your own certificates in the certs directory
In the default configuration, this installer gets a valid certificate via Let’s Encrypt. If you have the CLUSTER.DOMAIN domain in a Route 53 zone, you can set USE_LETSENCRYPT_ROUTE53 to yes and supply appropriate credentials so that Let’s Encrypt can use dns-01 validation to get the appropriate certificates.
SSL_MODE="lets-encrypt"
USE_LETSENCRYPT_ROUTE53="yes"
LE_AWS_REGION="us-east-1"
LE_AWS_ACCESS_KEY_ID="AKIABCDEFGHIJKLMNOPQ"
LE_AWS_SECRET_ACCESS_KEY="thisistherandomstringthatisyoursecretkey"
Please note that when using AWS, EC2 instances can have a default hostname that ends with amazonaws.com. Let’s Encrypt has a blacklist of domain names for which it will not issue certificates, and that blacklist includes the amazonaws.com domain, which means the default hostname can not be used to get a certificate from Let’s Encrypt.
To supply your own certificates, change the configuration like this:
SSL_MODE="bring-your-own"
You will need certificates for each DNS name and DNS wildcard previously listed in the DNS hostnames for each service.
To simplify certificate management, we recommend creating a single certificate for all of the hostnames, or creating a wildcard certificate that covers all possible hostnames (with the following patterns in subjectAltName):
xarv1.example.com *.xarv1.example.com *.collections.xarv1.example.com
(Replacing xarv1.example.com
with your own ${DOMAIN}
)
Copy your certificates to the directory specified with the variable CUSTOM_CERTS_DIR
in the remote directory where you copied the provision.sh
script. The provision script will find the certificates there.
The script expects cert/key files with these basenames (matching the role except for keepweb, which is split in both download / collections):
balancer (optional on multi-node installations)
collections (part of keepweb; must be a wildcard for *.collections.${DOMAIN})
controller
download (part of keepweb)
grafana (service available by default on multi-node installations)
keepproxy (corresponds to default domain keep.${DOMAIN})
prometheus (service available by default on multi-node installations)
webshell
websocket (corresponds to default domain ws.${DOMAIN})
workbench
workbench2
For example, for the keepproxy
service the script will expect to find this certificate:
${CUSTOM_CERTS_DIR}/keepproxy.crt
${CUSTOM_CERTS_DIR}/keepproxy.key
Make sure that all the FQDNs that you will use for the public-facing applications (API/controller, Workbench, Keepproxy/Keepweb) are reachable.
Note: because the installer currently looks for a different certificate file for each service, if you use a single certificate, we recommend creating a symlink for each certificate and key file to the primary certificate and key, e.g.
ln -s xarv1.crt ${CUSTOM_CERTS_DIR}/controller.crt
ln -s xarv1.key ${CUSTOM_CERTS_DIR}/controller.key
ln -s xarv1.crt ${CUSTOM_CERTS_DIR}/keepproxy.crt
ln -s xarv1.key ${CUSTOM_CERTS_DIR}/keepproxy.key
...
All certificate files will be used by nginx. You may need to include intermediate certificates in your certificate files. See the nginx documentation for more details.
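The symlink approach for a single certificate can be sketched as a loop over the basenames listed earlier. This is an illustration under stated assumptions: link_single_cert is a hypothetical helper, and it assumes the primary certificate and key files already live inside the certificates directory, so the links can use relative targets as in the ln commands above.

```shell
#!/bin/sh
# Hypothetical helper: point every per-service certificate and key name
# at one primary certificate/key pair stored in the same directory.
#   $1 = certificates directory, $2 = primary cert file name,
#   $3 = primary key file name
link_single_cert() {
    dir=$1
    cert=$2
    key=$3
    for name in balancer collections controller download grafana \
                keepproxy prometheus webshell websocket workbench \
                workbench2; do
        ln -sf "$cert" "${dir}/${name}.crt"
        ln -sf "$key" "${dir}/${name}.key"
    done
}

# Usage: link_single_cert "${CUSTOM_CERTS_DIR}" xarv1.crt xarv1.key
```

Using -f lets the helper be re-run after replacing the primary files without failing on links that already exist.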
When using SSL_MODE=bring-your-own, you can keep your TLS certificate keys encrypted on the server nodes. This reduces the risk of certificate leaks from node disk volume snapshots or backups.
This feature is currently implemented in AWS by providing the certificate keys’ password via Amazon’s Secrets Manager service, and installing appropriate services on the nodes that provide this password to nginx via a file that only lives in system RAM.
If you use the installer’s Terraform code, the secret and related permission cloud resources are created automatically, and you can customize the secret’s name by editing terraform/services/terraform.tfvars
and setting its suffix in ssl_password_secret_name_suffix
.
In local.params
you need to set SSL_KEY_ENCRYPTED
to yes
and change the default values for SSL_KEY_AWS_SECRET_NAME
and SSL_KEY_AWS_REGION
if necessary.
Then, if your certificate key file is not yet encrypted, you can generate an encrypted version of it by running the openssl command as follows:
openssl rsa -aes256 -in your.key -out your.encrypted.key
(this will ask you to type the encryption password)
This encrypted key file will be the one needed to be copied to the ${CUSTOM_CERTS_DIR}
directory, instead of the plain key file.
In order to allow the appropriate nodes to decrypt the key file, you should set the password on Amazon Secrets Manager. There are a couple of ways this can be done, for example with the AWS CLI:
aws secretsmanager put-secret-value --secret-id pkey-pwd --secret-string "p455w0rd" --region us-east-1
Here pkey-pwd should match what is set in SSL_KEY_AWS_SECRET_NAME, and us-east-1 should match what is set in SSL_KEY_AWS_REGION.
Note that the AWS secret should be set before running installer.sh deploy, to avoid failures when trying to start the nginx servers.
If you ever need to change the encryption password on a running cluster, you should first change the secret’s value on AWS, and only then copy the newly encrypted key file to ${CUSTOM_CERTS_DIR}
and re-run the deploy command.
By default, the installer will use the “Test” provider, which is a list of usernames and cleartext passwords stored in the Arvados config file. This is a low-security configuration and you are strongly advised to configure one of the other supported authentication methods.
The standard behavior of the installer is to install and configure PostgreSQL for use by Arvados. You can optionally configure it to use a separately managed database instead.
Arvados requires a database that is compatible with PostgreSQL 9.5 or later. For example, Arvados is known to work with Amazon Aurora (note: even idle, Arvados services will periodically poll the database, so we strongly advise using “provisioned” mode).
In local.params, remove ‘database’ from the list of roles assigned to the controller node:
NODES=(
[controller.${DOMAIN}]=controller,websocket,dispatcher,keepbalance
...
)
In local.params, set DATABASE_INT_IP to an empty string and DATABASE_EXTERNAL_SERVICE_HOST_OR_IP to the database endpoint (can be a hostname, does not have to be an IP address).
DATABASE_INT_IP=""
...
DATABASE_EXTERNAL_SERVICE_HOST_OR_IP="arvados.xxxxxxx.eu-east-1.rds.amazonaws.com"
In local.params.secrets, set DATABASE_PASSWORD to the correct value. See the previous section describing correct quoting.
In local.params you may need to adjust the database name and user.
If you are installing on AWS and have followed all of the naming conventions recommended in this guide, you probably don’t need to do any further customization.
If you are installing on a different cloud provider or on HPC, other changes may require editing the Saltstack pillars and states files found in local_config_dir
. In particular, local_config_dir/pillars/arvados.sls
contains the template (in the arvados.cluster
section) used to produce the Arvados configuration file that is distributed to all the nodes. Consult the Configuration reference for a comprehensive list of configuration keys.
Any extra Salt “state” files you add under local_config_dir/states
will be added to the Salt run and applied to the hosts.
If you will use fixed compute nodes with an HPC scheduler such as SLURM or LSF, you will need to Set up your compute nodes with Docker or Set up your compute nodes with Singularity.
On cloud installations, containers are dispatched in Docker daemons running in the compute instances, which need some additional setup.
Follow the instructions to build a cloud compute node image using the compute image builder script found in arvados/tools/compute-images
in your Arvados clone from step 3.
Once the image has been created, open local.params
and edit as follows (AWS specific settings described here, you will need to make custom changes for other cloud providers):
Set COMPUTE_AMI to the AMI produced by Packer.
Set COMPUTE_AWS_REGION to the appropriate AWS region.
Set COMPUTE_USER to the admin user account on the image.
Set the COMPUTE_SG list to the VPC security group which you set up to allow SSH connections to these nodes.
Set COMPUTE_SUBNET to the value of SubnetId of your VPC.
Update arvados.cluster.InstanceTypes in local_config_dir/pillars/arvados.sls as necessary. The example instance types are for AWS; other cloud providers will of course have different instance types with different names and specifications.
At this point, you are ready to run the installer script in deploy mode, which will conduct all of the Arvados installation.
Run this in the ~/setup-arvados-xarv1 directory:
./installer.sh deploy
This will install and configure Arvados on all the nodes. It will take a while and produce a lot of logging. If it runs into an error, it will stop.
When everything has finished, you can run the diagnostics. There are a couple of ways of doing this, listed below.
The requirements to run diagnostics are having arvados-client
and docker
installed. If this is not possible you can run them on your Arvados shell node as explained in the next section.
Depending on where you are running the installer, you need to provide -internal-client
or -external-client
. If you are running the installer from a host connected to the Arvados private network, use -internal-client
. Otherwise, use -external-client
.
./installer.sh diagnostics (-internal-client|-external-client)
You can run the diagnostics from the cluster’s shell node. This has the advantage that you don’t need to manage any software on your local system, but might not be a possibility if your Arvados cluster doesn’t include a shell node.
./installer.sh diagnostics-internal
The installer records log files for each deployment.
Most service logs go to /var/log/syslog
.
The logs for Rails API server can be found in /var/www/arvados-api/current/log/production.log
on the appropriate instance(s).
Workbench 2 is a client-side JavaScript application. If you are having trouble loading Workbench 2, check the browser’s developer console (this can be found in “Tools → Developer Tools”).
You can iterate on the config and maintain the cluster by making changes to local.params
and local_config_dir
and running installer.sh deploy
again.
If you are debugging a configuration issue on a specific node, you can speed up the cycle a bit by deploying just one node:
./installer.sh deploy keep0.xarv1.example.com
However, once you have a final configuration, you should run a full deploy to ensure that the configuration has been synchronized on all the nodes.
The arvados-api-server package sets up the database as a post-install script. If the database host or password wasn’t set correctly (or quoted correctly) at the time that package is installed, it won’t be able to set up the database.
This will manifest as an error like this:
#<ActiveRecord::StatementInvalid: PG::UndefinedTable: ERROR: relation \"api_clients\" does not exist
If this happens, you need to
1. Correct the database information.
2. Run ./installer.sh deploy xarv1.example.com to update the configuration on the API/controller node.
3. Log in to the API/controller server node, then run this command to re-run the post-install script, which will set up the database:
dpkg-reconfigure arvados-api-server
4. Run ./installer.sh deploy again to synchronize everything, and so that the install steps that need to contact the API server are run successfully.
If the AMI wasn’t built with ENA (enhanced networking) support and the instance type requires it, it’ll fail to start. You’ll see an error in syslog on the node that runs arvados-dispatch-cloud. The solution is to build a new AMI with --aws-ena-support true.
At this point you should be able to log into the Arvados cluster. The initial URL will be
https://workbench.${DOMAIN}
If you did not configure a different authentication provider you will be using the “Test” provider, and the provision script creates an initial user for testing purposes. This user is configured as administrator of the newly created cluster. It uses the values of INITIAL_USER
and INITIAL_USER_PASSWORD
from the local.params*
file.
If you did configure a different authentication provider, the first user to log in will automatically be given Arvados admin privileges.
You can monitor the health and performance of the system using the admin dashboard:
https://grafana.${DOMAIN}
To log in, use username “admin” and ${INITIAL_USER_PASSWORD}
from local.params.secrets
.
Once logged in, you will want to add the dashboards to the front page.
In order to handle high loads and perform rolling upgrades, the controller service can be scaled to a number of hosts, and the installer makes this a fairly simple task.
First, you should take care of the infrastructure deployment: if you use our Terraform code, you will need to set up the terraform.tfvars
in terraform/vpc/
so that in addition to the node named controller
(the load-balancer), a number of controllerN
nodes (backends) are defined as needed, and added to the internal_service_hosts
list.
We suggest that the backend nodes just hold the controller service and nothing else, so they can be easily created or destroyed as needed without other service disruption.
The following is an example terraform/vpc/terraform.tfvars
file that describes a cluster with a load-balancer, 2 backend nodes, a separate database node, a shell node, a keepstore node and a workbench node that will also hold other miscellaneous services:
region_name = "us-east-1"
cluster_name = "xarv1"
domain_name = "xarv1.example.com"
# Include controller nodes in this list so instances are assigned to the
# private subnet. Only the balancer node should be connecting to them.
internal_service_hosts = [ "keep0", "shell", "database", "controller1", "controller2" ]
# Assign private IPs for the controller nodes. These will be used to create
# internal DNS resolutions that will get used by the balancer and database nodes.
private_ip = {
controller = "10.1.1.11"
workbench = "10.1.1.15"
database = "10.1.2.12"
controller1 = "10.1.2.21"
controller2 = "10.1.2.22"
shell = "10.1.2.17"
keep0 = "10.1.2.13"
}
Once the infrastructure is deployed, you’ll then need to define which node will be using the balancer
role and which will be the controller
nodes in local.params
, as shown in this partial example:
NODES=(
[controller.${DOMAIN}]=balancer
[controller1.${DOMAIN}]=controller
[controller2.${DOMAIN}]=controller
[database.${DOMAIN}]=database
...
)
Note that we also set the database
role to its own node instead of just leaving it in a shared controller node.
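The NODES map above uses bash associative array syntax, so role assignments can be inspected with a few lines of bash. The following is a sketch for illustration, assuming bash 4 or later: nodes_with_role is a hypothetical helper, and the array here is a stand-alone copy of the partial example rather than the installer's own data.

```shell
#!/usr/bin/env bash
# Requires bash 4 or later (associative arrays).
DOMAIN=xarv1.example.com

declare -A NODES=(
  [controller.${DOMAIN}]=balancer
  [controller1.${DOMAIN}]=controller
  [controller2.${DOMAIN}]=controller
  [database.${DOMAIN}]=database
)

# Hypothetical helper: list, in sorted order, the hosts whose
# comma-separated role list contains the given role.
nodes_with_role() {
  local role=$1 host
  for host in "${!NODES[@]}"; do
    case ",${NODES[$host]}," in
      *",${role},"*) printf '%s\n' "$host" ;;
    esac
  done | sort
}

nodes_with_role controller
```

Wrapping the role list in commas before matching keeps a role such as controller from matching a longer role name that merely contains it.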
Each time you run installer.sh deploy
, the system will automatically do rolling upgrades. This means it will make changes to one controller node at a time, after removing it from the balancer so that there’s no downtime.
As part of the operation of installer.sh
, it automatically creates a git
repository with your configuration templates. You should retain this repository but be aware that it contains sensitive information (passwords and tokens used by the Arvados services as well as cloud credentials if you used Terraform to create the infrastructure).
As described in Iterating on config changes you may use installer.sh deploy
to re-run Salt to deploy configuration changes and upgrades. However, be aware that the configuration templates created for you by installer.sh
are a snapshot and are not automatically kept up to date.
When deploying upgrades, consult the Arvados upgrade notes to see if changes need to be made to the configuration file template in local_config_dir/pillars/arvados.sls
. To specify the version to upgrade to, set the VERSION
parameter in local.params
.
See also Maintenance and upgrading for more information.
The content of this documentation is licensed under the
Creative
Commons Attribution-Share Alike 3.0 United States licence.
Code samples in this documentation are licensed under the
Apache License, Version 2.0.