Test Slurm dispatch

Note: crunch-dispatch-slurm is only relevant for on-premises clusters that will spool jobs to Slurm. Skip this section if you are installing a cloud cluster.

Test compute node setup

You should now be able to submit Slurm jobs that run in Docker containers. On the node where you’re running the dispatcher, you can test this by running:

~$ sudo -u crunch srun -N1 docker run busybox echo OK

If it works, this command should print OK (it may also show some status messages from Slurm and/or Docker). If it does not print OK, double-check your compute node setup, and that the crunch user can submit Slurm jobs.
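Because Slurm and Docker may print status messages around the expected output, it can help to check for the OK line mechanically. The following is a minimal sketch; the helper name check_ok is hypothetical and not part of Arvados or Slurm.

```shell
# Hypothetical helper: succeeds only if some line on stdin is exactly "OK".
# Status messages from Slurm or Docker on other lines are ignored.
check_ok() {
  grep -qx 'OK'
}

# Usage on the dispatch node (not run here):
#   sudo -u crunch srun -N1 docker run busybox echo OK 2>&1 | check_ok \
#     && echo "compute node test passed" \
#     || echo "compute node test FAILED"
```

The exact-line match avoids a false pass on messages that merely contain the letters "OK".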

Test the dispatcher

On the dispatch node, start monitoring the crunch-dispatch-slurm logs:

~$ sudo journalctl -o cat -fu crunch-dispatch-slurm.service

Submit a simple container request:

shell:~$ arv container_request create --container-request '{
  "name":            "test",
  "state":           "Committed",
  "priority":        1,
  "container_image": "arvados/jobs:latest",
  "command":         ["echo", "Hello, Crunch!"],
  "output_path":     "/out",
  "mounts": {
    "/out": {
      "kind":        "tmp",
      "capacity":    1000
    }
  },
  "runtime_constraints": {
    "vcpus": 1,
    "ram": 8388608
  }
}'

This command should return a record with a container_uuid field. Once crunch-dispatch-slurm polls the API server for new containers to run, you should see it dispatch that same container. It will log messages like:

2016/08/05 13:52:54 Monitoring container zzzzz-dz642-hdp2vpu9nq14tx0 started
2016/08/05 13:53:04 About to submit queued container zzzzz-dz642-hdp2vpu9nq14tx0
2016/08/05 13:53:04 sbatch succeeded: Submitted batch job 8102
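To follow one container through the dispatcher logs, you need the container_uuid from the record that the create command returned. The sketch below pulls it out with a plain-text match. It assumes the record is printed as JSON with a quoted string value, and the helper name is hypothetical.

```shell
# Hypothetical helper: print the value of the "container_uuid" field from a
# JSON record read on stdin. This is a text match, not a real JSON parser,
# and it prints nothing if the field is missing or null.
container_uuid_from_record() {
  sed -n 's/.*"container_uuid"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage (not run here), with the request body from above saved in a
# hypothetical variable $request_json:
#   uuid=$(arv container_request create --container-request "$request_json" \
#            | container_uuid_from_record)
#   sudo journalctl -o cat -u crunch-dispatch-slurm.service | grep "$uuid"
```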

Before the container finishes, Slurm’s squeue command will show the new job in the list of queued and running jobs. For example, you might see:

~$ squeue --long
Fri Aug  5 13:57:50 2016
  JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
   8103   compute zzzzz-dz   crunch  RUNNING       1:56 UNLIMITED      1 compute0

The job’s name corresponds to the container’s UUID. You can get more information about it by running, e.g., scontrol show job Name=UUID.
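Note that the squeue --long listing above truncates the NAME column (zzzzz-dz), so it cannot be matched against a full UUID. One way around this, sketched below, is to ask squeue for untruncated fields with --format and filter on the name. The helper name job_for_container is hypothetical; the %i, %j, and %T format fields are squeue's job ID, job name, and job state.

```shell
# Hypothetical helper: read "JOBID NAME STATE" lines on stdin and print the
# job ID and state of every job whose name equals the container UUID in $1.
job_for_container() {
  awk -v name="$1" '$2 == name { print $1, $3 }'
}

# Usage (not run here):
#   squeue --noheader --format='%i %j %T' \
#     | job_for_container zzzzz-dz642-hdp2vpu9nq14tx0
```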

When the container finishes, the dispatcher logs its final state:

2016/08/05 13:53:14 Container zzzzz-dz642-hdp2vpu9nq14tx0 now in state "Complete" with locked_by_uuid ""
2016/08/05 13:53:14 Monitoring container zzzzz-dz642-hdp2vpu9nq14tx0 finished
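If you want the final state without reading the log by eye, a small filter can extract it. This sketch assumes the dispatcher's log lines have the same shape as the example above; the message format may differ between versions, and the helper name is hypothetical.

```shell
# Hypothetical helper: read dispatcher log lines on stdin and print the most
# recent state reported for the container UUID given in $1.
# Assumes lines of the form: Container UUID now in state "STATE" ...
container_state_from_log() {
  sed -n "s/.*Container $1 now in state \"\([^\"]*\)\".*/\1/p" | tail -n 1
}

# Usage (not run here):
#   sudo journalctl -o cat -u crunch-dispatch-slurm.service \
#     | container_state_from_log zzzzz-dz642-hdp2vpu9nq14tx0
```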

After the container finishes, you can get the container record by UUID from a shell server to see its results:

shell:~$ arv get zzzzz-dz642-hdp2vpu9nq14tx0
{
 ...
 "exit_code":0,
 "log":"a01df2f7e5bc1c2ad59c60a837e90dc6+166",
 "output":"d41d8cd98f00b204e9800998ecf8427e+0",
 "state":"Complete",
 ...
}

You can use standard Keep tools to view the container’s output and logs from their corresponding fields. For example, to see the logs from the collection referenced in the log field:

~$ arv keep ls a01df2f7e5bc1c2ad59c60a837e90dc6+166
./crunch-run.txt
./stderr.txt
./stdout.txt
~$ arv-get a01df2f7e5bc1c2ad59c60a837e90dc6+166/stdout.txt
2016-08-05T13:53:06.201011Z Hello, Crunch!
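Each line in stdout.txt carries a timestamp prefix, as the example shows. To compare the container's output against what the command should have printed, you can strip that prefix. This is a sketch that assumes every line starts with a UTC timestamp in the format shown above followed by one space; the helper name is hypothetical.

```shell
# Hypothetical helper: remove a leading "YYYY-MM-DDTHH:MM:SS.ffffffZ "
# timestamp from each line on stdin. Lines without one pass through as-is.
strip_log_timestamps() {
  sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+Z //'
}

# Usage (not run here):
#   arv-get a01df2f7e5bc1c2ad59c60a837e90dc6+166/stdout.txt | strip_log_timestamps
```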

If the container does not dispatch successfully, refer to the crunch-dispatch-slurm logs for information about why it failed.
