Many useful references and data sets are available on the web and in S3. To help you work with this data, Arvados lets you specify workflow inputs with URL paths like:
example_web_input:
  class: File
  path: "https://HOST_NAME/FILE_PATH"
example_s3_input:
  class: File
  path: "s3://BUCKET_NAME/FILE_PATH"
When Arvados starts this workflow, before it starts any workflow steps, it will automatically download each input URL to an Arvados collection. This ensures you retain a complete record of the analysis you ran. Arvados stores details about the data source as collection metadata and can avoid re-downloading inputs it has downloaded before.
External inputs have some limitations you should be aware of before you start. These limitations may be lifted in a future release of Arvados.
External input URLs can only refer to a single file. You cannot specify an entire S3 bucket or subdirectory as an input. You must list each file you want to work with as a separate input.
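For example, a hypothetical inputs file that reads two files from the same bucket must name each one individually (the input names and file paths here are placeholders):

fastq_r1:
  class: File
  path: "s3://BUCKET_NAME/samples/sample1_R1.fastq.gz"
fastq_r2:
  class: File
  path: "s3://BUCKET_NAME/samples/sample1_R2.fastq.gz"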
Arvados only knows how to work with one S3 access key at a time. If you need to work with data sets that require different credentials, first transfer them to Keep, then analyze them from there.
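For example, one way to stage such data is to copy it locally with the aws CLI, then upload it to Keep with arv-put. This sketch assumes the aws tool on your system is configured with the second set of credentials; the bucket and file names are placeholders:

$ aws s3 cp s3://OTHER_BUCKET/FILE_PATH ./FILE_PATH
$ arv-put ./FILE_PATH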
If your inputs refer to public S3 buckets and don't require an access key, run arvados-cwl-runner with the --s3-public-bucket option. For example:
$ arvados-cwl-runner --s3-public-bucket [… other options…] --submit WORKFLOW.cwl PUBLIC-S3-FILES.yml
If you want to access data in an S3 bucket that requires an access key, you can register the access key with Arvados. Workflows will be able to find and use the correct access key automatically.
In the left-hand navigation, open “External Credentials.” In the upper right, use the blue + New External Credential button to add an S3 access key.
Fill out the New External Credential dialog as follows for S3 credentials:

* Enter a name, and optionally a description, to help you identify the credential.
* The credential class must be aws_access_key.
* The external ID is the access key ID (the alphanumeric string that usually starts with "AKIA").
* The secret is the secret access key (the random string).
* Set the expiration date no later than the expiry of the underlying access key. Arvados will automatically stop using access keys after they have expired.
* Enter a scope in the format s3://BUCKET_NAME for each bucket the access key can access. Each scope is listed under the input as you add it and can be removed if you enter a scope incorrectly.

For illustration, an Example Credential that grants access to the arvados-example and arvados-doc S3 buckets would list the scopes s3://arvados-example and s3://arvados-doc.
After you create the credential, you can control who is allowed to use it by sharing it with other users and groups, granting them at least Read access. In the left-hand navigation, open "External Credentials." Find your credential in the listing and right-click it. Select "Share" from the context menu. Use the dialog to add and remove permissions.
Once you have finished setting up access keys, you can run a workflow with S3 inputs.
S3 access keys are stored in Arvados as credentials. Below is a body you could use with either the command-line tool arv credential create --credential=… or the Python SDK, like:

arv_client.credentials().create(
    body={'credential': ...}
).execute()
* Give the credential a name and optionally a description.
* credential_class must be exactly "aws_access_key".
* external_id is the access key ID (the alphanumeric string that usually starts with "AKIA").
* secret is the secret access key (the random string).
* Set the expires_at timestamp to a date no later than the expiry of the underlying access key. Arvados will automatically stop using access keys after they have expired.
* scopes identify which S3 bucket(s) this access key should be used for. List a scope in the format "s3://BUCKET_NAME" for each bucket the access key can access.

The example credential below can be used to access the arvados-example and arvados-doc S3 buckets.

{
  "name": "Example Credential",
  "description": "<p>This is an example credential for the Arvados documentation.</p>",
  "credential_class": "aws_access_key",
  "external_id": "AKIAS3ABCDEFGHIJKLMN",
  "secret": "ZYXWVUTSRQPONMLKJIHGFEDCBA",
  "expires_at": "2038-01-19T03:14:07Z",
  "scopes": [
    "s3://arvados-example",
    "s3://arvados-doc"
  ]
}
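Putting it together, here is a minimal sketch of creating this credential with the Python SDK, assuming ARVADOS_API_HOST and ARVADOS_API_TOKEN are set in your environment (the key values are the placeholders from the example body above):

import arvados

# Connect using ARVADOS_API_HOST and ARVADOS_API_TOKEN from the environment.
arv_client = arvados.api('v1')

credential = arv_client.credentials().create(
    body={'credential': {
        'name': 'Example Credential',
        'description': '<p>This is an example credential for the Arvados documentation.</p>',
        'credential_class': 'aws_access_key',
        'external_id': 'AKIAS3ABCDEFGHIJKLMN',
        'secret': 'ZYXWVUTSRQPONMLKJIHGFEDCBA',
        'expires_at': '2038-01-19T03:14:07Z',
        'scopes': ['s3://arvados-example', 's3://arvados-doc'],
    }},
).execute()

# The returned record includes the new credential's UUID, which you can
# pass to arvados-cwl-runner --use-credential later if needed.
print(credential['uuid'])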
After you create the credential, you can control who is allowed to use it by creating permission links to other users and groups. For more information, refer to the Working with permissions section of the Python SDK code cookbook and links API reference.
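For illustration, here is a sketch of sharing the example credential by creating a can_read permission link with the Python SDK (both UUIDs below are placeholders):

import arvados

arv_client = arvados.api('v1')

# A permission link grants the object named by tail_uuid (a user or group)
# the given level of access to the object named by head_uuid.
arv_client.links().create(
    body={'link': {
        'link_class': 'permission',
        'name': 'can_read',
        'tail_uuid': 'zzzzz-tpzed-1234567890abcde',  # user or group to share with
        'head_uuid': 'zzzzz-oss07-abcde12345fghij',  # the credential's UUID
    }},
).execute()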
After you register an access key for an S3 bucket, you can use an S3 URL for that bucket in place of any workflow file input. When Arvados starts this workflow, before it starts any workflow steps, it will automatically find the right credentials to download the file from the bucket. For example, you can submit the workflow from the command line by running:
$ arvados-cwl-runner [… other options…] --defer-downloads --submit WORKFLOW.cwl PRIVATE-S3-FILES.yml
Note that you must use --defer-downloads in this case.
If you submit a workflow to Arvados and it reports this error:
WARNING Download error: Multiple credentials found for bucket 's3://BUCKET_NAME' in Arvados, use --use-credential to specify which one to use.
You can run arvados-cwl-runner with the --use-credential option to specify the UUID of the credential to use:
$ arvados-cwl-runner --use-credential=zzzzz-oss07-abcde12345fghij [… other options…] --defer-downloads --submit WORKFLOW.cwl PRIVATE-S3-FILES.yml
If you submit a workflow to Arvados and it fails with logs like this:
WARNING Download error: boto3 did not find any local AWS credentials to use to download from S3. If you want to use credentials registered with Arvados, use --defer-downloads. If the bucket is public, use --s3-public-bucket.
ERROR Workflow error, try again with --debug for more information:
Can't handle 's3://example-bucket/example-file'
Container exited with status code 1
This means Arvados did not find an access key to use for this bucket. Double-check:

* Does one of your registered credentials include s3://example-bucket in its list of scopes? Check both the credential scope and the workflow input to make sure the bucket name matches and doesn't have any typos.

If you are running arvados-cwl-runner on a system that already has credentials to access your S3 input files, you can run it with the --enable-aws-credential-capture option to have Arvados download inputs with the same credentials that the aws tool would use. This can be useful to run one-off workflows where you don't plan to reuse an access key. For example:
$ arvados-cwl-runner --enable-aws-credential-capture [… other options…] --submit WORKFLOW.cwl PRIVATE-S3-FILES.yml
When Arvados downloads external input data, the default behavior is designed to prioritize the predictability and reproducibility of your workflows. Several options are available to customize this behavior.
By default, arvados-cwl-runner downloads input data from the system where you launch it. This aims to let you know about any problems with the input sources as soon as possible. However, the system where you run arvados-cwl-runner may not be best suited to downloading very large input files. If you prefer to download input files from the Arvados compute node that runs your workflow, run arvados-cwl-runner with the --defer-downloads option.
$ arvados-cwl-runner --defer-downloads [… other options…] --submit WORKFLOW.cwl EXTERNAL-INPUTS.yml
By default, arvados-cwl-runner checks the headers of every input URL to determine whether the external data has been updated and needs to be re-downloaded. These checks take a little time and will fail if the external data is no longer accessible. You can run arvados-cwl-runner with the --prefer-cached-downloads option to skip these checks and use any available collection caches. This will let you run the workflow even if the external data is no longer accessible, but it means the workflow may not be reproducible on Arvados clusters that don't have the collection cache.
$ arvados-cwl-runner --prefer-cached-downloads [… other options…] --submit WORKFLOW.cwl EXTERNAL-INPUTS.yml
By default, arvados-cwl-runner expects that every unique URL may refer to a unique resource and downloads each one to a new collection cache. However, some HTTP/S URLs include time-sensitive signatures or tokens in their query parameters even though they refer to the same underlying resource. You can identify those parameters to Arvados by running arvados-cwl-runner with the --varying-url-params option. This option takes a comma-separated list of parameter names. Arvados will ignore the values of these parameters in the URL when determining whether a resource has already been downloaded, so you can avoid redundant downloads. For example:
$ arvados-cwl-runner --varying-url-params="NAME1,NAME2,…" [… other options…] --submit WORKFLOW.cwl WEB-FILES.yml
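As an illustration, AWS pre-signed URLs embed time-sensitive values such as X-Amz-Expires and X-Amz-Signature in the query string (the host, object path, and parameter values below are placeholders):

example_web_input:
  class: File
  path: "https://example-bucket.s3.amazonaws.com/example-file?X-Amz-Expires=3600&X-Amz-Signature=abc123"

$ arvados-cwl-runner --varying-url-params="X-Amz-Expires,X-Amz-Signature" [… other options…] --submit WORKFLOW.cwl WEB-FILES.yml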