Measuring deduplication

The arvados-client tool can be used to generate a deduplication report across an arbitrary number of collections. It can be installed from packages (apt install arvados-client or yum install arvados-client).

Syntax

~$ arvados-client deduplication-report -h
Usage:
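  arvados-client deduplication-report [options ...] <collection-uuid> <collection-uuid> ...

  arvados-client deduplication-report [options ...] <collection-pdh>,<collection-uuid> \
     <collection-pdh>,<collection-uuid> ...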

  This program analyzes the overlap in blocks used by 2 or more collections. It
  prints a deduplication report that shows the nominal space used by the
  collections, as well as the actual size and the amount of space that is saved
  by Keep's deduplication.

  The list of collections may be provided in two ways. A list of collection
  uuids is sufficient. Alternatively, the PDH for each collection may also be
  provided. This will greatly speed up operation when the list contains
  multiple collections with the same PDH.

  Exit status will be zero if there were no errors generating the report.

Example:

  Use the 'arv' and 'jq' commands to get the list of the 100
  largest collections and generate the deduplication report:

  arv collection list --order 'file_size_total desc' --limit 100 | \
    jq -r '.items[] | [.portable_data_hash,.uuid] |@csv' | \
    sed -e 's/"//g'|tr '\n' ' ' | \
    xargs arvados-client deduplication-report

Options:
  -log-level string
      logging level (debug, info, ...) (default "info")

The usual environment variables (ARVADOS_API_HOST and ARVADOS_API_TOKEN) need to be set for the deduplication report to be generated. To get cluster-wide results, an admin token must be supplied. Non-admin users can also run this report, but it will only include collections that their token is able to read.
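As a minimal sketch of that setup, the snippet below exports the two variables and checks that both are non-empty before the report is run. The host name and token are placeholders, and check_arvados_env is a hypothetical helper written for this example; it is not part of arvados-client.

```shell
# Hypothetical helper (not part of arvados-client): succeeds only if both
# variables required by the report are set and non-empty.
check_arvados_env() {
  missing=0
  if [ -z "${ARVADOS_API_HOST:-}" ]; then
    echo "ARVADOS_API_HOST is not set" >&2
    missing=1
  fi
  if [ -z "${ARVADOS_API_TOKEN:-}" ]; then
    echo "ARVADOS_API_TOKEN is not set" >&2
    missing=1
  fi
  return "$missing"
}

# Placeholder values: substitute your cluster's API host and your own token.
export ARVADOS_API_HOST="ClusterID.example.com"
export ARVADOS_API_TOKEN="replace-with-your-token"

check_arvados_env && echo "environment looks ready for arvados-client"
```

The check only confirms that the variables are present. It does not validate the token or tell you whether it has admin rights.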

Example output (with uuids and portable data hashes obscured) from a small Arvados cluster:

~$ arv collection list --order 'file_size_total desc' --limit 10 | jq -r '.items[] | [.portable_data_hash,.uuid] |@csv' |sed -e 's/"//g'|tr '\n' ' ' |xargs arvados-client deduplication-report
Collection _____-_____-_______________: pdh ________________________________+5003343; nominal size 7382073267640 (6.7 TiB); file count 2796
Collection _____-_____-_______________: pdh ________________________________+4961919; nominal size 6989909625775 (6.4 TiB); file count 5592
Collection _____-_____-_______________: pdh ________________________________+1903643; nominal size 2677933564052 (2.4 TiB); file count 2796
Collection _____-_____-_______________: pdh ________________________________+1903643; nominal size 2677933564052 (2.4 TiB); file count 2796
Collection _____-_____-_______________: pdh ________________________________+137710; nominal size 191858151583 (179 GiB); file count 201
Collection _____-_____-_______________: pdh ________________________________+137636; nominal size 191858101962 (179 GiB); file count 200
Collection _____-_____-_______________: pdh ________________________________+135350; nominal size 191715427388 (178 GiB); file count 201
Collection _____-_____-_______________: pdh ________________________________+135276; nominal size 191715384167 (178 GiB); file count 200
Collection _____-_____-_______________: pdh ________________________________+135350; nominal size 191707276684 (178 GiB); file count 201
Collection _____-_____-_______________: pdh ________________________________+135276; nominal size 191707233463 (178 GiB); file count 200

Collections:                              10
Nominal size of stored data:  20878411596766 bytes (19 TiB)
Actual size of stored data:   17053104444050 bytes (16 TiB)
Saved by Keep deduplication:   3825307152716 bytes (3.5 TiB)
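
The summary figures are related by simple subtraction: the space saved is the nominal size minus the actual size. The sketch below reproduces that arithmetic for the sample report above and also derives the saved fraction. The percentage is not printed by the tool; it is computed here only for illustration. The variable names are arbitrary, and a shell with 64-bit integer arithmetic is assumed.

```shell
# Figures copied from the sample report above.
nominal=20878411596766   # Nominal size of stored data, in bytes
actual=17053104444050    # Actual size of stored data, in bytes

# Space saved by deduplication is the difference between the two.
saved=$((nominal - actual))

# Integer-only arithmetic: compute tenths of a percent, truncated.
permille=$((saved * 1000 / nominal))

echo "saved: $saved bytes"                                       # saved: 3825307152716 bytes
echo "saved fraction: $((permille / 10)).$((permille % 10))%"    # saved fraction: 18.3%
```

The computed difference matches the "Saved by Keep deduplication" line in the report.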
