This tutorial describes how to copy Arvados objects from one cluster to another by using arv-copy.
This tutorial assumes that you have access to the Arvados command line tools, have configured your API token, and have confirmed a working environment.
arv-copy allows users to copy collections, workflow definitions and projects from one cluster to another. You can also use arv-copy to import resources from HTTP URLs into Keep.
For projects, arv-copy will copy all the collections and workflow definitions owned by the project, and recursively copy subprojects.
For workflow definitions, arv-copy will recursively go through the workflow and copy all associated dependencies (input collections and Docker images).
For example, let’s copy from the Arvados playground, also known as pirca, to dstcl. The names pirca and dstcl are interchangeable with any cluster id. You can find the cluster name from the prefix of the uuid of the object you want to copy. For example, in zzzzz-4zz18-tci4vn4fa95w0zx, the cluster name is zzzzz.
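The prefix rule above can be sketched with plain shell parameter expansion. This is only an illustration of the naming convention described in this tutorial; the variable names are hypothetical and nothing here calls an Arvados tool.

```shell
# Extract the cluster name (the text before the first hyphen) from an object uuid.
uuid="zzzzz-4zz18-tci4vn4fa95w0zx"

# ${var%%-*} removes the longest suffix that starts with "-",
# which leaves only the leading cluster prefix.
cluster_id="${uuid%%-*}"

echo "$cluster_id"
```

The same expansion works for any uuid in this tutorial, for example a project uuid passed to --project-uuid.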
In order to communicate with both clusters, you must create custom configuration files for each cluster. The Getting an API token page describes how to get a token and create a configuration file. However, instead of “settings.conf” in ~/.config/arvados you need two configuration files, one for each cluster, with filenames in the format of ClusterID.conf.
In this example, navigate to the Current token page on each of pirca and dstcl to get the ARVADOS_API_HOST and ARVADOS_API_TOKEN.
The config file consists of two lines, one for ARVADOS_API_HOST and one for ARVADOS_API_TOKEN:
ARVADOS_API_HOST=zzzzz.arvadosapi.com
ARVADOS_API_TOKEN=v2/zzzzz-gj3su-xxxxxxxxxxxxxxx/123456789abcdefghijkl
Copy your ARVADOS_API_HOST and ARVADOS_API_TOKEN into the config files, in the format shown above, in the shell account from which you are executing the commands. In our example, you need two files, ~/.config/arvados/pirca.conf and ~/.config/arvados/dstcl.conf.
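As a minimal sketch of creating the two files, the snippet below writes one ClusterID.conf per cluster in the two-line format described above. The helper name write_cluster_conf, the host names, and the token strings are all placeholders assumed for illustration; substitute the values from each cluster's Current token page. The demo writes to a scratch directory so it is safe to run as-is.

```shell
# Hypothetical helper (not part of Arvados): write DIR/CLUSTERID.conf
# containing the ARVADOS_API_HOST and ARVADOS_API_TOKEN lines.
write_cluster_conf() {
    conf_dir=$1; cluster_id=$2; api_host=$3; api_token=$4
    mkdir -p "$conf_dir"
    conf_file="$conf_dir/$cluster_id.conf"
    printf 'ARVADOS_API_HOST=%s\nARVADOS_API_TOKEN=%s\n' \
        "$api_host" "$api_token" > "$conf_file"
    chmod 600 "$conf_file"   # the token is a credential, so keep the file private
}

# Demo against a scratch directory; for real use pass "$HOME/.config/arvados".
# Host names and tokens below are placeholders, not working credentials.
demo_dir=$(mktemp -d)
write_cluster_conf "$demo_dir" pirca pirca.arvadosapi.com "v2/pirca-gj3su-xxxxxxxxxxxxxxx/placeholder"
write_cluster_conf "$demo_dir" dstcl dstcl.arvadosapi.com "v2/dstcl-gj3su-xxxxxxxxxxxxxxx/placeholder"

cat "$demo_dir/pirca.conf"
```

Restricting the file mode is a design choice of this sketch rather than something the tutorial requires.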
Now you’re ready to copy between pirca and dstcl!
First, determine the uuid or portable data hash of the collection you want to copy from the source cluster. The uuid can be found on the collection display page in the collection summary area (top left box), or in the URL bar (the part after collections/...).
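To tell the two kinds of identifier apart before copying, a rough check like the following may help. The patterns are assumptions inferred from the examples in this tutorial (three hyphen-separated groups of 5, 5 and 15 characters for a uuid; 32 hexadecimal digits, a plus sign and a size for a portable data hash), and classify_id is a hypothetical helper, not an Arvados command.

```shell
# Hypothetical helper: report whether an identifier looks like a uuid
# or a portable data hash, based on patterns assumed from the examples.
classify_id() {
    if printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{32}\+[0-9]+$'; then
        echo "portable data hash"
    elif printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{5}-[a-z0-9]{5}-[a-z0-9]{15}$'; then
        echo "uuid"
    else
        echo "unknown"
    fi
}

classify_id jutro-4zz18-tv416l321i4r01e
classify_id 2463fa9efeb75e099685528b3b9071e0+438
```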
Now copy the collection from pirca to dstcl. We will use the uuid jutro-4zz18-tv416l321i4r01e as an example. You can find this collection on playground.arvados.org.
~$ arv-copy --src pirca --dst dstcl jutro-4zz18-tv416l321i4r01e
jutro-4zz18-tv416l321i4r01e: 6.1M / 6.1M 100.0%
arvados.arv-copy[1234] INFO: Success: created copy with uuid dstcl-4zz18-xxxxxxxxxxxxxxx
You can also copy by content address:
~$ arv-copy --src pirca --dst dstcl 2463fa9efeb75e099685528b3b9071e0+438
2463fa9efeb75e099685528b3b9071e0+438: 6.1M / 6.1M 100.0%
arvados.arv-copy[1234] INFO: Success: created copy with uuid dstcl-4zz18-xxxxxxxxxxxxxxx
The output of arv-copy displays the uuid of the collection generated in the destination cluster. By default, the copy is placed in your home project in the destination cluster. If you want to place your collection in an existing project, specify that project with the --project-uuid option followed by the project uuid.
For example, this will copy the collection to project dstcl-j7d0g-a894213ukjhal12 in the destination cluster.
~$ arv-copy --src pirca --dst dstcl --project-uuid dstcl-j7d0g-a894213ukjhal12 jutro-4zz18-tv416l321i4r01e
Additionally, if you need to specify the storage classes in which the copied data is saved on the destination cluster, use the --storage-classes LIST argument, where LIST is a comma-separated list of storage class names.
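The two options above can be combined in one invocation. The sketch below only assembles and prints the command line rather than running it, so it can be tried without Arvados installed. The storage class names "default" and "archival" are hypothetical; use names that exist on your destination cluster.

```shell
# Assemble (but do not run) an arv-copy command that sets both the
# destination project and the storage classes.
# "default" and "archival" are hypothetical storage class names.
storage_classes="default,archival"
project_uuid="dstcl-j7d0g-a894213ukjhal12"
collection_uuid="jutro-4zz18-tv416l321i4r01e"

cmd="arv-copy --src pirca --dst dstcl --project-uuid $project_uuid --storage-classes $storage_classes $collection_uuid"

# Print the command for inspection before running it in a configured shell.
echo "$cmd"
```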
Copying workflows requires arvados-cwl-runner to be available in your $PATH.
We will use the uuid jutro-7fd4e-mkmmq53m1ze6apx as an example workflow.
arv-copy will infer that the source cluster is jutro from the object uuid, and that the destination cluster is pirca from --project-uuid.
~$ arv-copy --project-uuid pirca-j7d0g-ecak8knpefz8ere jutro-7fd4e-mkmmq53m1ze6apx
ae480c5099b81e17267b7445e35b4bc7+180: 23M / 23M 100.0%
2463fa9efeb75e099685528b3b9071e0+438: 156M / 156M 100.0%
jutro-4zz18-vvvqlops0a0kpdl: 94M / 94M 100.0%
2020-08-19 17:04:13 arvados.arv-copy[4789] INFO:
2020-08-19 17:04:13 arvados.arv-copy[4789] INFO: Success: created copy with uuid pirca-7fd4e-s0tw9rfbkpo2fmx
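The inference used in this example can be mimicked in the shell to preview which clusters a command will involve. This is only a sketch of the uuid prefix convention, not arv-copy's own logic; the check for ClusterID.conf files assumes the ~/.config/arvados layout described earlier in this tutorial.

```shell
# Preview the source and destination clusters from the uuid prefixes.
object_uuid="jutro-7fd4e-mkmmq53m1ze6apx"
project_uuid="pirca-j7d0g-ecak8knpefz8ere"

src_cluster="${object_uuid%%-*}"    # text before the first hyphen
dst_cluster="${project_uuid%%-*}"

echo "source=$src_cluster destination=$dst_cluster"

# Report whether a config file is present for each cluster
# (assumed location: ~/.config/arvados/ClusterID.conf).
for cluster in "$src_cluster" "$dst_cluster"; do
    conf="$HOME/.config/arvados/$cluster.conf"
    if [ -f "$conf" ]; then
        echo "found $conf"
    else
        echo "missing $conf"
    fi
done
```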
The name, description, and workflow definition from the original workflow will be used for the destination copy. In addition, any collections and Docker images referenced in the source workflow definition will also be copied to the destination.
If you would like to copy the object without dependencies, you can use the --no-recursive flag.
We will use the uuid jutro-j7d0g-xj19djofle3aryq as an example project.
arv-copy will infer that the source cluster is jutro from the source project uuid, and that the destination cluster is pirca from --project-uuid.
~$ arv-copy --project-uuid pirca-j7d0g-lr8sq3tx3ovn68k jutro-j7d0g-xj19djofle3aryq
2021-09-08 21:29:32 arvados.arv-copy[6377] INFO:
2021-09-08 21:29:32 arvados.arv-copy[6377] INFO: Success: created copy with uuid pirca-j7d0g-ig9gvu5piznducp
The name and description of the original project will be used for the destination copy. If a project with the same name already exists at the destination, collections and workflow definitions will be copied into that existing project.
If you would like to copy the project but not its subprojects, you can use the --no-recursive flag.
You can also use arv-copy to copy the contents of an HTTP URL into Keep. When you do this, Arvados keeps track of the original URL the resource came from. This allows you to refer to the resource by its original URL in workflow inputs, but actually read from the local copy in Keep.
~$ arv-copy --project-uuid tordo-j7d0g-lr8sq3tx3ovn68k https://example.com/index.html
tordo-4zz18-dhpb6y9km2byb94
2023-10-06 10:15:36 arvados.arv-copy[374147] INFO: Success: created copy with uuid tordo-4zz18-dhpb6y9km2byb94
In addition, when importing from HTTP URLs, you may provide a different cluster than the destination in --src. This tells arv-copy to search the other cluster for a collection associated with that URL, and if found, copy the collection from that cluster instead of downloading from the original URL.
The following arv-copy command line options affect the behavior of HTTP import.
Option | Description |
---|---|
--varying-url-params VARYING_URL_PARAMS | A comma-separated list of URL query parameters that should be ignored when storing HTTP URLs in Keep. |
--prefer-cached-downloads | If an HTTP URL is found in Keep, skip the upstream URL freshness check (arv-copy will not notice if the upstream has changed, but will also not error if the upstream is unavailable). |
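As an illustration of how these options might be combined, the sketch below assembles and prints an import command without running it. The URL and the query parameter names "token" and "expires" are hypothetical, chosen to represent parameters that change between requests for the same resource.

```shell
# Assemble (but do not run) an HTTP import that ignores two query
# parameters and prefers an existing copy in Keep.
# The URL and the parameter names "token" and "expires" are hypothetical.
url="https://example.com/index.html?token=abc123&expires=1700000000"
project_uuid="tordo-j7d0g-lr8sq3tx3ovn68k"

# The URL is single-quoted inside the command so that "&" is not
# interpreted by the shell when the printed command is pasted and run.
cmd="arv-copy --project-uuid $project_uuid --varying-url-params token,expires --prefer-cached-downloads '$url'"

echo "$cmd"
```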
The content of this documentation is licensed under the Creative Commons Attribution-Share Alike 3.0 United States licence.
Code samples in this documentation are licensed under the Apache License, Version 2.0.