Using external data sources in workflows

Introduction
Limitations
Accessing S3 data in public buckets
Accessing S3 data with an access key
Controlling when Arvados downloads external data

Introduction

Many useful references and data sets are available on the web and in S3. To help you work with this data, Arvados lets you specify workflow inputs with URL paths like:

example_web_input:
  class: File
  path: "https://HOST_NAME/FILE_PATH"
example_s3_input:
  class: File
  path: "s3://BUCKET_NAME/FILE_PATH"

When Arvados starts this workflow, before it starts any workflow steps, it will automatically download each input URL to an Arvados collection. This ensures you retain a complete record of the analysis you ran. Arvados stores details about the data source as collection metadata and can avoid re-downloading inputs it has downloaded before.

Limitations

External inputs have some limitations you should be aware of before you start. These limitations may be lifted in a future release of Arvados.

External input URLs can only refer to a single file. You cannot specify an entire S3 bucket or subdirectory as an input. You must list each file you want to work with as a separate input.

Arvados only knows how to work with one S3 access key at a time. If you need to work with data sets that require different credentials, first transfer them to Keep, then analyze them from there.

Accessing S3 data in public buckets

If your inputs refer to public S3 buckets and don’t require an access key, run arvados-cwl-runner with the --s3-public-bucket option. For example:

$ arvados-cwl-runner --s3-public-bucket [… other options…] --submit WORKFLOW.cwl PUBLIC-S3-FILES.yml

Accessing S3 data with an access key

If you want to access data in an S3 bucket that requires an access key, you can register the access key with Arvados. Workflows will be able to find and use the correct access key automatically.

Adding an S3 access key in Arvados Workbench

In the left-hand navigation, open “External Credentials.” In the upper right, use the blue ＋ New External Credential button to add an S3 access key.

Screenshot of the Arvados Workbench External Credentials listing.

Fill out the New External Credential dialog as follows for S3 credentials:

Give the credential a useful name, and optionally a description.
The Credential Class must be exactly arv:aws_access_key.
The External ID is the access key ID (the alphanumeric string that usually starts with “AKIA”).
The Secret is the secret access key (the random string).
Set the “Expires at” date to a date no later than the expiry of the underlying access key. Arvados will automatically stop using access keys after they have expired.
The applicable scopes identify which S3 bucket(s) this access key should be used for. Enter each scope in the format s3://BUCKET_NAME for each bucket the access key can access. Each scope is listed under the input as you add it and can be removed if you enter a scope incorrectly. You can enter the special value s3://* if you want this credential to be used for any S3 inputs.

For illustration, the Example Credential being filled out below can be used to access the arvados-example and arvados-doc S3 buckets.

Screenshot of the Arvados Workbench New External Credential dialog with fields filled in with sample values.

After you create the credential, you can control who is allowed to use it by sharing it with other users and groups with at least Read access. In the left-hand navigation, open “External Credentials.” Find your credential in the listing and right-click it. Select “Share” from the context menu. Use the dialog to add and remove permissions.

Once you have finished setting up access keys, you can run a workflow with S3 inputs.

Adding an S3 access key via the Arvados API

S3 access keys are stored in Arvados as credentials. Below is a body you could use with either the command-line tool arv credential create --credential=… or an SDK like:

arv_client.credentials().create(
    body={'credential': ...}
).execute()

Give the credential a useful name and optionally a description.
The credential_class must be exactly "arv:aws_access_key".
The external_id is the access key ID (the alphanumeric string that usually starts with “AKIA”).
The secret is the secret access key (the random string).
Set the expires_at timestamp to a date no later than the expiry of the underlying access key. Arvados will automatically stop using access keys after they have expired.
The scopes identify which S3 bucket(s) this access key should be used for. Enter each scope in the format "s3://BUCKET_NAME" for each bucket the access key can access. The example credential below can be used to access the arvados-example and arvados-doc S3 buckets. You can enter the special value s3://* if you want this credential to be used for any S3 inputs.

{
  "name": "Example Credential",
  "description": "<p>This is an example credential for the Arvados documentation.</p>",
  "credential_class": "arv:aws_access_key",
  "external_id": "AKIAS3ABCDEFGHIJKLMN",
  "secret": "ZYXWVUTSRQPONMLKJIHGFEDCBA",
  "expires_at": "2038-01-19T03:14:07Z",
  "scopes": [
    "s3://arvados-example",
    "s3://arvados-doc"
  ]
}

After you create the credential, you can control who is allowed to use it by creating permission links to other users and groups. For more information, refer to the Working with permissions section of the Python SDK code cookbook and links API reference.

Running a workflow using a stored access key

After you register an access key for an S3 bucket, you can use an S3 URL for that bucket in the place of any workflow file input. When Arvados starts this workflow, before it starts any workflow steps, it will automatically find the right credentials to download the file from the bucket. For example, you can submit the workflow from the command line by running:

$ arvados-cwl-runner [… other options…] --defer-downloads --submit WORKFLOW.cwl PRIVATE-S3-FILES.yml

Note that you must use --defer-downloads in this case.

Troubleshooting download errors using stored access keys

“Multiple credentials found”

If you submit a workflow to Arvados and it reports this error:

WARNING Download error: Multiple AWS access keys with scope 's3://BUCKET_NAME' found in Arvados. Run `arvados-cwl-runner` with the `--use-credential` option to provide the UUID of the credential to use.

You can run arvados-cwl-runner with the --use-credential option to specify the UUID of the credential to use:

$ arvados-cwl-runner --use-credential=zzzzz-oss07-abcde12345fghij [… other options…] --defer-downloads --submit WORKFLOW.cwl PRIVATE-S3-FILES.yml

“boto3 did not find any local AWS credentials to use to download from S3”

If you submit a workflow to Arvados and it fails with logs like this:

WARNING Download error: boto3 did not find any local AWS credentials to use to download from S3. If you want to use credentials registered with Arvados, use --defer-downloads. If the bucket is public, use --s3-public-bucket.
ERROR Workflow error, try again with --debug for more information:
Can't handle 's3://example-bucket/example-file'
Container exited with status code 1

This means Arvados did not find an access key to use for this bucket. Double-check:

Is the access key registered as a credential in Arvados?
Does the user running the workflow have permission to use that credential? Can they see it in Workbench or get it from the API?
Does the credential have s3://example-bucket or s3://* in its list of scopes? Check both the credential scope and the workflow input to make sure the bucket name matches and doesn’t have any typos.
Is the external credential expired?

Running a workflow using a local access key

If you are running arvados-cwl-runner on a system that already has credentials to access your S3 input files, you can run it with the --enable-aws-credential-capture option to have Arvados download inputs with the same credentials that the aws tool would use. This can be useful to run one-off workflows where you don’t plan to reuse an access key. For example:

$ arvados-cwl-runner --enable-aws-credential-capture [… other options…] --submit WORKFLOW.cwl PRIVATE-S3-FILES.yml

Controlling when Arvados downloads external data

When Arvados downloads external input data, the default behavior is designed to prioritize the predictability and reproducibility of your workflows. Several options are available to customize this behavior.

Download data from the Arvados compute node

By default, arvados-cwl-runner downloads input data from the system where you launch it. This aims to let you know about any problems with the input sources as soon as possible. However, the system where you run arvados-cwl-runner may not be the best suited to download very large input files. If you prefer to download input files from the Arvados compute node that runs your workflow, run arvados-cwl-runner with the --defer-downloads option.

$ arvados-cwl-runner --defer-downloads [… other options…] --submit WORKFLOW.cwl EXTERNAL-INPUTS.yml

Prioritize cached data collections

By default, arvados-cwl-runner checks the headers of every input URL to determine whether the external data has been updated and needs to be re-downloaded. These checks take a little time and will fail if the external data is no longer accessible. You can run arvados-cwl-runner with the --prefer-cached-downloads option to skip these checks and use any available collection caches. This will let you run the workflow even if the external data is no longer accessible, but that means the workflow may not be reproducible on Arvados clusters that don’t have the collection cache.

$ arvados-cwl-runner --prefer-cached-downloads [… other options…] --submit WORKFLOW.cwl EXTERNAL-INPUTS.yml

Ignore varying URL parameters when caching collections

By default, arvados-cwl-runner expects that every unique URL may refer to a unique resource and downloads each to a new collection cache. Some HTTP/S URLs include time-sensitive signatures or tokens in their query parameters that refer to the same underlying resource. You can identify those parameters to Arvados by running arvados-cwl-runner with the --varying-url-params option. This option takes a comma-separated list of parameter name(s). Arvados will ignore the values of these parameters in the URL when determining whether or not a resource has already been downloaded so you can avoid redundant downloads. For example:

$ arvados-cwl-runner --varying-url-params="NAME1,NAME2,…" [… other options…] --submit WORKFLOW.cwl WEB-FILES.yml

Previous: Debugging workflows - shell access Next: Running web services in Arvados containers