Metadata vocabulary

Many Arvados objects (like collections and projects) can store metadata as properties that in turn can be used in searches allowing a flexible way of organizing data inside the system.

Arvados enables the site administrator to set up a formal metadata vocabulary definition so that users can select from predefined key/value pairs of properties, offering the possibility to add different terms for the same concept on clients’ UI such as workbench2.

The Controller service loads and caches the configured vocabulary file in memory at startup time, exporting it on a particular endpoint. From time to time, it’ll check for updates in the local copy and refresh its cache if validation passes.

Configuration

The site administrator should place the JSON vocabulary file on the same host as the controller service and set up the config file as follows:

Cluster:
  zzzzz:
    API:
      VocabularyPath: /etc/arvados/vocabulary.json

Definition format

The JSON file describes the available keys and values and if the user is allowed to enter free text not defined by the vocabulary.

Keys and values are indexed by identifiers so that the concept of a term is preserved even if vocabulary labels are changed.

The following is an example of a vocabulary definition:

{
    "strict_tags": false,
    "tags": {
        "IDTAGANIMALS": {
            "strict": false,
            "labels": [{"label": "Animal" }, {"label": "Creature"}, {"label": "Species"}],
            "values": {
                "IDVALANIMALS1": { "labels": [{"label": "Human"}, {"label": "Homo sapiens"}] },
                "IDVALANIMALS2": { "labels": [{"label": "Dog"}, {"label": "Canis lupus familiaris"}] },
                "IDVALANIMALS3": { "labels": [{"label": "Elephant"}, {"label": "Loxodonta"}] },
                "IDVALANIMALS4": { "labels": [{"label": "Eagle"}, {"label": "Haliaeetus leucocephalus"}] }
            }
        },
        "IDTAGCOMMENT": {
            "labels": [{"label": "Comment"}, {"label": "Suggestion"}]
        },
        "IDTAGIMPORTANCES": {
            "strict": true,
            "labels": [{"label": "Importance"}, {"label": "Priority"}],
            "values": {
                "IDVALIMPORTANCES1": { "labels": [{"label": "Critical"}, {"label": "Urgent"}, {"label": "High"}] },
                "IDVALIMPORTANCES2": { "labels": [{"label": "Normal"}, {"label": "Moderate"}] },
                "IDVALIMPORTANCES3": { "labels": [{"label": "Low"}] }
            }
        }
    }
}

For clients to be able to query the vocabulary definition, a special endpoint is exposed on the controller service: /arvados/v1/vocabulary. This endpoint doesn’t require authentication and returns the vocabulary definition in JSON format.

If the strict_tags flag at the root level is true, it will restrict the users from saving property keys other than the ones defined in the vocabulary. This restriction is enforced at the backend level to ensure consistency across different clients.

Inside the tags member, IDs are defined (IDTAGANIMALS, IDTAGCOMMENT, IDTAGIMPORTANCES) and can have any format that the current application requires. Every key will declare at least a labels list with zero or more label objects.

The strict flag inside a tag definition operates the same as the strict_tags root member, but at the individual tag level. When strict is true, a tag’s value options are limited to those defined by the vocabulary.

The values member is optional and is used to define valid key/label pairs when applicable. In the example above, IDTAGCOMMENT allows open-ended text by only defining the tag’s ID and labels and leaving out values.

When any key or value has more than one label option, Workbench2’s user interface will allow the user to select any of the options. But because only the IDs are saved in the system, when the property is displayed in the user interface, the label shown will be the first of each group defined in the vocabulary file. For example, the user could select the property key Species and Homo sapiens as its value, but the user interface will display it as Animal: Human because those labels are the first in the vocabulary definition.

Internally, Workbench2 uses the IDs to do property based searches, so if the user searches by Animal: Human or Species: Homo sapiens, both will return the same results.

Definition validation

Because the vocabulary definition is prone to syntax or logical errors, the controller service needs to do some validation before answering requests. If the vocabulary validation fails, the service won’t start.
The site administrator can make sure the vocabulary file is correct before even trying to start the controller service by running arvados-server config-check. When the vocabulary definition isn’t correct, the administrator will get a list of issues like the one below:

# arvados-server config-check -config /etc/arvados/config.yml
Error loading vocabulary file "/etc/arvados/vocabulary.json" for cluster zzzzz:
duplicate JSON key "tags.IDTAGFRUITS.values.IDVALFRUITS1"
tag key "IDTAGCOMMENT" is configured as strict but doesn't provide values
tag value label "Banana" for pair ("IDTAGFRUITS":"IDVALFRUITS8") already seen on value "IDVALFRUITS4"
exit status 1

NOTE: These validation checks are performed only on the node that hosts the vocabulary file defined on the configuration. As the same configuration file is shared between different nodes, those who don’t host the file won’t produce spurious errors when running arvados-server config-check.

Live updates

Sometimes it may be necessary to modify the vocabulary definition in a running production environment.
When a change is detected, the controller service will automatically attempt to load the new vocabulary and check its validity before making it active.
If the new vocabulary has some issue, the last valid one will keep being active. The service will export any errors on its health endpoint so that a monitoring solution can send an alert appropriately.
With the above mechanisms in place, no outages should occur from making typos or other errors when updating the vocabulary file.

Health status

To be able for the administrator to guarantee the system’s metadata integrity, the controller service exports a specific health endpoint for the vocabulary at /_health/vocabulary.
As a first measure, the service won’t start if the vocabulary file is incorrect. Once running, if there are updates (that may even be periodical), the service needs to keep running while notifying the operator that some fixing is in order.
An example of a vocabulary health error is included below:

$ curl --silent -H "Authorization: Bearer xxxtokenxxx" https://controller/_health/vocabulary | jq .
{
  "error": "while loading vocabulary file \"/etc/arvados/vocabulary.json\": duplicate JSON key \"tags.IDTAGSIZES.values.IDVALSIZES3\"",
  "health": "ERROR"
}

Client support

Workbench2 currently takes advantage of this vocabulary definition by providing an easy-to-use interface for searching and applying metadata to different objects in the system. Because the definition file only resides on the controller node, and Workbench2 is just a static web application run by every users’ web browser, there’s a mechanism in place that allows Workbench2 and any other client to request the active vocabulary.

The controller service provides an unauthenticated endpoint at /arvados/v1/vocabulary where it exports the contents of the vocabulary JSON file:

$ curl --silent https://controller/arvados/v1/vocabulary | jq .
{
  "kind": "arvados#vocabulary",
  "strict_tags": false,
  "tags": {
    "IDTAGANIMALS": {
      "labels": [
        {
          "label": "Animal"
        },
        {
          "label": "Creature"
        }
      ],
      "strict": false,
...
}

Although the vocabulary enforcement is done on the backend side, clients can use this information to provide helping features to users, like doing ID-to-label translations, preemptive error checking, etc.

Properties migration

After installing the new vocabulary definition, it may be necessary to migrate preexisting properties that were set up using literal strings. This can be a big task depending on the number of properties on the vocabulary and the amount of collections and projects on the cluster.

To help with this task we provide below a migration example script that accepts the new vocabulary definition file as an input, and uses the ARVADOS_API_TOKEN and ARVADOS_API_HOST environment variables to connect to the cluster, search for every collection and group that has properties with labels defined on the vocabulary file, and migrates them to the corresponding identifiers.

This script will not run if the vocabulary file has duplicated labels for different keys or for different values inside a key, this is a failsafe mechanism to avoid migration errors.

Please take into account that this script requires admin credentials. It also offers a --dry-run flag that will report what changes are required without applying them, so it can be reviewed by an administrator.

Also, take into consideration that this example script does case-sensitive matching on labels.

#!/usr/bin/env python3
#
# Copyright (C) The Arvados Authors. All rights reserved.
#
# SPDX-License-Identifier: CC-BY-SA-3.0

import argparse
import copy
import json
import logging
import os
import sys

import arvados
import arvados.util

logger = logging.getLogger('arvados.vocabulary_migrate')
logger.setLevel(logging.INFO)

class VocabularyError(Exception):
    pass

opts = argparse.ArgumentParser(add_help=False)
opts.add_argument('--vocabulary-file', type=str, metavar='PATH', required=True,
                  help="""
Use vocabulary definition file at PATH for migration decisions.
""")
opts.add_argument('--dry-run', action='store_true', default=False,
                  help="""
Don't actually migrate properties, but only check if any collection/project
should be migrated.
""")
opts.add_argument('--debug', action='store_true', default=False,
                  help="""
Sets logging level to DEBUG.
""")
arg_parser = argparse.ArgumentParser(
    description='Migrate collections & projects properties to the new vocabulary format.',
    parents=[opts])

def parse_arguments(arguments):
    args = arg_parser.parse_args(arguments)
    if args.debug:
        logger.setLevel(logging.DEBUG)
    if not os.path.isfile(args.vocabulary_file):
        arg_parser.error("{} doesn't exist or isn't a file.".format(args.vocabulary_file))
    return args

def _label_to_id_mappings(data, obj_name):
    result = {}
    for obj_id, obj_data in data.items():
        for lbl in obj_data['labels']:
            obj_lbl = lbl['label']
            if obj_lbl not in result:
                result[obj_lbl] = obj_id
            else:
                raise VocabularyError('{} label "{}" for {} ID "{}" already seen at {} ID "{}".'.format(obj_name, obj_lbl, obj_name, obj_id, obj_name, result[obj_lbl]))
    return result

def key_labels_to_ids(vocab):
    return _label_to_id_mappings(vocab['tags'], 'key')

def value_labels_to_ids(vocab, key_id):
    if key_id in vocab['tags'] and 'values' in vocab['tags'][key_id]:
        return _label_to_id_mappings(vocab['tags'][key_id]['values'], 'value')
    return {}

def migrate_properties(properties, key_map, vocab):
    result = {}
    for k, v in properties.items():
        key = key_map.get(k, k)
        value = value_labels_to_ids(vocab, key).get(v, v)
        result[key] = value
    return result

def main(arguments=None):
    args = parse_arguments(arguments)
    vocab = None
    with open(args.vocabulary_file, 'r') as f:
        vocab = json.load(f)
    arv = arvados.api('v1')
    if 'tags' not in vocab or vocab['tags'] == {}:
        logger.warning('Empty vocabulary file, exiting.')
        return 1
    if not arv.users().current().execute()['is_admin']:
        logger.error('Admin privileges required.')
        return 1
    key_label_to_id_map = key_labels_to_ids(vocab)
    migrated_counter = 0

    for key_label in key_label_to_id_map:
        logger.debug('Querying objects with property key "{}"'.format(key_label))
        for resource in [arv.collections(), arv.groups()]:
            objs = arvados.util.keyset_list_all(
                resource.list,
                order='created_at',
                select=['uuid', 'properties'],
                filters=[['properties', 'exists', key_label]]
            )
            for o in objs:
                props = copy.copy(o['properties'])
                migrated_props = migrate_properties(props, key_label_to_id_map, vocab)
                if not args.dry_run:
                    logger.debug('Migrating {}: {} -> {}'.format(o['uuid'], props, migrated_props))
                    arv.collections().update(uuid=o['uuid'], body={
                        'properties': migrated_props
                    }).execute()
                else:
                    logger.info('Should migrate {}: {} -> {}'.format(o['uuid'], props, migrated_props))
                migrated_counter += 1
                if not args.dry_run and migrated_counter % 100 == 0:
                    logger.info('Migrating {} objects...'.format(migrated_counter))

    if args.dry_run and migrated_counter == 0:
        logger.info('Nothing to do.')
    elif not args.dry_run:
        logger.info('Done, total objects migrated: {}.'.format(migrated_counter))
    return 0

if __name__ == "__main__":
    sys.exit(main())

Previous: Logs table management Next: Configuring storage classes