The foundation of QNIBTerminal is an image that holds consul and glues everything together. I used the Easter break to refine my qnib/slurm images - this blog post gives a quick intro.

SLURM is a resource scheduler that frees your mind from having to figure out how to use the resources within a cluster - hence the name Simple Linux Utility for Resource Management. To spin up a cluster, three daemons are necessary:

- **Munge** (MUNGE Uid 'N' Gid Emporium) creates and validates credentials for authentication.
- **SLURM Controller**: The SLURM controller (slurmctld) is in charge of the actual scheduling. It gathers all information and dispatches the jobs.
- **SLURM Daemon**: The slurmd daemon runs on all nodes within the cluster, connects to the slurmctld and reports the node ready for duty.

## Docker Images

For this simple version of a SLURM cluster, three docker containers are needed:

- **qnib/consul** bundles everything together by providing service discovery and a key/value store
- **qnib/slurmctld** provides the SLURM Controller
- **qnib/slurmd** acts as the compute node

## SLURM config file

The two SLURM daemons use a common configuration file, slurm.conf. This bugger has to be identical on all daemons, otherwise they will complain. Within my first iteration of QNIBTerminal I used etcd and some bash scripts to keep everyone in sync - this time around I use the power of consul. :)

## consul services

The containers report all their services back to consul, among them slurmctld and slurmd. Within consul, these nodes can be queried using the consul HTTP API:

```
$ export CHOST="http://consul.service.consul:8500"
$ curl -s ${CHOST}/v1/catalog/service/slurmd | python -m json.tool
[
    {
        "Address": "172.17.0.217",
        "Node": "001a982e8af1",
        "ServiceAddress": "",
        "ServiceID": "slurmd",
        "ServiceName": "slurmd",
        "ServicePort": 6818,
        "ServiceTags": null
    }
]
```

And if that were not enough, the guys behind consul provide consul-template to make using this information painlessly simple.

## consul-template

As the name suggests, consul-template uses consul and creates config files out of it. Since our SLURM cluster is going to be fairly dynamic (the cluster should grow and shrink if we feel like it), it has to be configured dynamically as well. The hard part within the SLURM config file is getting the node list created dynamically.

```
$ tail -n5 /usr/local/etc/slurm.conf

NodeName=001a982e8af1 NodeAddr=172.17.0.217

PartitionName=qnib Nodes=001a982e8af1 Default=YES MaxTime=INFINITE State=UP
```

This information derives from this template:

```
$ tail -n6 /etc/consul-template/templates/slurm.conf.tmpl

{{range service "slurmd" "any"}}
NodeName={{.Node}} NodeAddr={{.Address}}{{end}}

PartitionName=qnib Nodes={{range $i, $e := service "slurmd" "any"}}{{if ne $i 0}},{{end}}{{$e.Node}}{{end}} Default=YES MaxTime=INFINITE State=UP
```

consul-template gets a list of all nodes providing slurmd (by default it only takes services into account that are up'n'running; "any" gets them all). Supervisord holds the service that listens for changes, recreates the configuration and restarts slurmd - that's it. :)

```
$ cat /etc/supervisord.d/slurmd_update.ini
[program:slurmd_update]
command=consul-template -consul consul.service.consul:8500 -template "/etc/consul-template/templates/slurm.conf.tmpl:/usr/local/etc/slurm.conf:supervisorctl restart slurmd"
redirect_stderr=true
stdout_logfile=syslog
```
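To check what consul-template will render without touching the live slurm.conf or restarting slurmd, a one-off dry run helps. This is just a sketch, assuming consul-template's `-dry` and `-once` flags and the paths from the ini file above:

```
# Sketch: render the template a single time to stdout instead of writing
# slurm.conf; assumes the -dry (print to stdout) and -once (single run) flags.
$ consul-template -consul consul.service.consul:8500 \
    -template "/etc/consul-template/templates/slurm.conf.tmpl:/usr/local/etc/slurm.conf" \
    -dry -once
```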
## fig

I must admit I am still stuck on fig - I should update to docker-compose, but still... The following fig file spins up the stack:

```
consul:
  image: qnib/consul
  ports:
   - "8500:8500"
  environment:
   - DC_NAME=dc1
   - ENABLE_SYSLOG=true
  dns: 127.0.0.1
  hostname: consul
  privileged: true
slurmctld:
  image: qnib/slurmctld
  ports:
   - "6817:6817"
  links:
   - consul:consul
  environment:
   - DC_NAME=dc1
   - SERVICE_6817_NAME=slurmctld
   - ENABLE_SYSLOG=true
  dns: 127.0.0.1
  hostname: slurmctld
  privileged: true
slurmd:
  image: qnib/slurmd
  links:
   - consul:consul
   - slurmctld:slurmctld
  environment:
   - DC_NAME=dc1
   - ENABLE_SYSLOG=true
  dns: 127.0.0.1
  #hostname: slurmd
  privileged: true
```

I do not set a hostname for the slurmd service, so that each slurmd container gets its own dynamic hostname. Logging into the first node, I can use the SLURM commands:

```
$ docker exec -ti dockerslurmd_slurmd_1 bash
bash-4.2# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
qnib*        up   infinite      1   idle 001a982e8af1
bash-4.2# srun hostname
001a982e8af1
```

Using fig scale, the cluster can be expanded...

```
$ fig scale slurmd=5
Starting dockerslurmd_slurmd_2...
Starting dockerslurmd_slurmd_3...
Starting dockerslurmd_slurmd_4...
Starting dockerslurmd_slurmd_5...
$ docker exec -ti dockerslurmd_slurmd_1 bash
bash-4.2# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
qnib*        up   infinite      5   idle 001a982e8af1,9d8960b0d3ae,46e07712f89e,988187e8255a,e10c39a5ea12
bash-4.2# srun -N 5 hostname
001a982e8af1
e10c39a5ea12
46e07712f89e
988187e8255a
9d8960b0d3ae
```

And that's it...
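Beyond the interactive srun calls, batch jobs work as well. A minimal sketch, run inside one of the slurmd containers and assuming the client tools sbatch and squeue are present in the image (the script path, job name and -N 2 are illustrative):

```
# Sketch: submit a small batch job from inside e.g. dockerslurmd_slurmd_1.
# /tmp/hello.sh, the job name and -N 2 are illustrative, not part of the images.
cat > /tmp/hello.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
srun hostname
EOF
sbatch -N 2 /tmp/hello.sh   # run the script as a batch job across two nodes
squeue                      # watch the job queue
cat slurm-*.out             # by default the job output lands in the submit directory
```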