Federated Multi-Cluster Workflows

To support running analysis on geographically dispersed data (avoiding expensive data transfers by sending the computation to the data), and “hybrid cloud” configurations where an on-premise cluster can expand its capabilities by delegating work to a cloud-hosted cluster, Arvados supports federated workflows. In a federated workflow, different steps of a workflow may execute on different clusters. Arvados manages data transfer and delegation of credentials, so that all that is required is adding arv:ClusterTarget hints to your existing workflow.

For more information, visit the architecture and admin sections about Arvados federation.

Get the example files

The tutorial files are located in the documentation section of the Arvados source repository: or see below

~$ git clone https://github.com/arvados/arvados
~$ cd arvados/doc/user/cwl/federated

Run example

Note:

At this time, remote steps of a workflow on Workbench are not displayed. As a workaround, you can find the UUIDs of the remote steps in the live logs of the workflow runner (the “Logs” tab). You may visit the remote cluster’s workbench and enter the UUID into the search box to view the details of the remote step. This will be fixed in a future version of workbench.

Run it like any other workflow:

~$ arvados-cwl-runner feddemo.cwl shards.cwl

You can also run a workflow on a remote federated cluster .

Federated scatter/gather example

In this following example, an analysis task is executed on three different clusters with different data, then the results are combined to produce the final output.

# Demonstrate Arvados federation features.  This example searches a
# list of CSV files that are hosted on different Arvados clusters.
# For each file, send a task to the remote cluster which will scan
# file and extracts the rows where the column "select_column" has one
# of the values appearing in the "select_values" file.  The home
# cluster then runs a task which pulls the results from the remote
# clusters and merges the results to produce a final report.

cwlVersion: v1.0
class: Workflow
$namespaces:
  # When using Arvados extensions to CWL, must declare the 'arv' namespace
  arv: "http://arvados.org/cwl#"

requirements:
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  StepInputExpressionRequirement: {}

  DockerRequirement:
    # Replace this with your own Docker container
    dockerPull: arvados/jobs

  # Define a record type so we can conveniently associate the input
  # file and the cluster where the task should run.
  SchemaDefRequirement:
    types:
      - $import: FileOnCluster.yml

inputs:
  select_column: string
  select_values: File

  datasets:
    type:
      type: array
      items: FileOnCluster.yml#FileOnCluster

  intermediate_projects: string[]

outputs:
  # Will produce an output file with the results of the distributed
  # analysis jobs merged together.
  joined:
    type: File
    outputSource: gather-results/out

steps:
  distributed-analysis:
    in:
      select_column: select_column
      select_values: select_values
      dataset: datasets
      intermediate_projects: intermediate_projects

    # Scatter over shards, this means creating a parallel job for each
    # element in the "shards" array.  Expressions are evaluated for
    # each element.
    scatter: [dataset, intermediate_projects]
    scatterMethod: dotproduct

    # Specify the cluster target for this task.  This means each
    # separate scatter task will execute on the cluster that was
    # specified in the "cluster" field.
    #
    # Arvados handles streaming data between clusters, for example,
    # the Docker image containing the code for a particular tool will
    # be fetched on demand, as long as it is available somewhere in
    # the federation.
    hints:
      arv:ClusterTarget:
        cluster_id: $(inputs.dataset.cluster)
        project_uuid: $(inputs.intermediate_projects)

    out: [out]
    run: extract.cwl

  # Collect the results of the distributed step and join them into a
  # single output file.  Arvados handles streaming inputs,
  # intermediate results, and outputs between clusters on demand.
  gather-results:
    in:
      dataset: distributed-analysis/out
    out: [out]
    run: merge.cwl

Example input document:

select_column: color
select_values:
  class: File
  location: colors_to_select.txt

datasets:
  - cluster: clsr1
    file:
      class: File
      location: keep:0dcf9310e5bf0c07270416d3a0cd6a43+56/items1.csv

  - cluster: clsr2
    file:
      class: File
      location: keep:12707d325a3f4687674b858bd32beae9+56/items2.csv

  - cluster: clsr3
    file:
      class: File
      location: keep:dbff6bb7fc43176527af5eb9dec28871+56/items3.csv

intermediate_projects:
  - clsr1-j7d0g-qxc4jcji7n4lafx
  - clsr2-j7d0g-e7r20egb8hlgn53
  - clsr3-j7d0g-vrl00zoku9spnen

Previous: Analyzing workflow cost (cloud only) Next: arvados-cwl-runner options