I was asked twice recently how I would transform the stacks I am using into an off-the-shelf Docker HPC cluster. For starters I will go with a pretty minimalistic approach: leverage the blog post about Docker networking I did and expand it onto physical machines. Since I do not have a cluster under my fingertips, I will mock up the setup with docker-machines.

```
$ for x in login node0 node1 node2; do machine create -d virtualbox ${x}; done
```

If you see a command like `eval $(machine env login)`, it's just me pointing DOCKER_HOST at the target (here the login node); in the physical world you point the DOCKER_HOST variable at the corresponding IP address.

## Login Node

You might call it head node, master node or whatever. It could even be one of the compute nodes; the point is that this fella will hold the key/value store for Docker networking and is not going to be part of the cluster itself. Install Docker on it and run the compose file from the 'Docker Networking 101' post.

```
$ git clone https://github.com/ChristianKniep/orchestra.git
$ cd orchestra/docker-networking/
$ eval $(machine env login)
login $ docker-compose up -d
Pulling consul (qnib/consul:latest)...
latest: Pulling from qnib/consul
*snip*
Status: Downloaded newer image for qnib/consul:latest
Creating consul
login $
```

The login node should present a nice Consul WebUI at `<login_ip>:8500`.

## Compute nodes

The compute nodes must run Docker 1.9 (or higher) to be able to use Docker networking.

### Setup Docker Engine

In order to use Docker's networking capabilities we are going to add the following option to the Docker Engines on the compute hosts.

```
--cluster-store=consul://<login_ip>:8500/network
```

Since boot2docker uses eth0 for NAT and eth1 as the host-only network, the following option has to be set in a boot2docker environment (on my MacBook, that is, not in the physical setup).

```
--cluster-advertise=eth1:2376
```

Depending on your Linux flavour it might be set in /etc/default/docker, /etc/sysconfig/docker or somewhere else. For the docker-machines it comes down to the following:

```
$ machine ssh node0
                        ##         .
                  ## ## ##        ==
               ## ## ## ## ##    ===
           /"""""""""""""""""\___/ ===
      ~~~ {~~ ~~~~ ~~~ ~~~~ ~~~ ~ /  ===- ~~~
           \______ o           __/
             \    \           __/
              \____\_______/
 _                 _   ____     _            _
| |__   ___   ___ | |_|___ \ __| | ___   ___| | _____ _ __
| '_ \ / _ \ / _ \| __| __) / _` |/ _ \ / __| |/ / _ \ '__|
| |_) | (_) | (_) | |_ / __/ (_| | (_) | (__|   <  __/ |
|_.__/ \___/ \___/ \__|_____\__,_|\___/ \___|_|\_\___|_|
Boot2Docker version 1.9.0, build master : 16e4a2a - Tue Nov 3 19:49:22 UTC 2015
Docker version 1.9.0, build 76d6bc9
docker@node0:~$ sudo vi /var/lib/boot2docker/profile
docker@node0:~$ grep 192 -B2 /var/lib/boot2docker/profile
EXTRA_ARGS='
--label provider=virtualbox
--cluster-store=consul://192.168.99.103:8500/network
--cluster-advertise=eth1:2376
docker@node0:~$ exit
$
```

After the change I restart the machine; restarting the Docker service would do as well.

```
$ machine restart node0
Restarted machines may have new IP addresses. You may need to re-run the `docker-machine env` command.
$
```

For the physical version just do:

```
service docker restart
```

As shown in the networking blog post, node0 (its IP address, that is) now appears in the KV store, hence it is part of the Docker networking family. After all nodes are treated, each of them shows up in the store.

## Overlay network

Now that all nodes are present, we add one global overlay network.

```
$ docker $(machine config node0) network create -d overlay global
67343b2a61b1617c847351b680de7fc2426d8113dba093c3812f4322a23003b6
```
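If you want to double-check that the overlay really carries traffic across hosts before piling SLURM on top, a minimal sanity check could look like the sketch below. This is not part of the original setup; it assumes the small alpine image (any image that ships ping will do) and the network name global from above:

```
# sketch (not from the post): cross-host check of the 'global' overlay
# start a throw-away container on node0 ...
$ docker $(machine config node0) run -d --name net_check --net global alpine sleep 300
# ... and ping it by name from a container on node1
$ docker $(machine config node1) run --rm --net global alpine ping -c 3 net_check
# clean up afterwards
$ docker $(machine config node0) rm -f net_check
```

If the name resolves and the replies come back, the KV store and the overlay data path between the hosts are both doing their job.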
Et voilà, the network shows up in each node's network list.

```
$ for x in node{0..2}; do echo ">> ${x}" ; docker $(machine config ${x}) network ls;done
>> node0
NETWORK ID          NAME                DRIVER
67343b2a61b1        global              overlay
b2f3267f8a1b        none                null
62765c4d843b        host                host
8f2be59bfe1d        bridge              bridge
>> node1
NETWORK ID          NAME                DRIVER
67343b2a61b1        global              overlay
b993e28a9485        none                null
69ce0ddd1d1c        host                host
11328d6e3578        bridge              bridge
>> node2
NETWORK ID          NAME                DRIVER
67343b2a61b1        global              overlay
ecb749ac7744        none                null
0839f1b397a6        host                host
81bc08e36dd4        bridge              bridge
$
```

## Spawn SLURM cluster

OK, so far I just rewrote the blog post about Docker networking. Now let's add some meat...

### Consul, slurmctld and the first compute node

Let's put the Consul and the slurmctld on node0. It would be nice if this container could also live on the login node, but mind you, the login node is not part of the overlay network.

```
node0 $ cd orchestra/multihost-slurm/node0/
node0 $ docker-compose up -d
*snip*
Status: Downloaded newer image for qnib/slurmctld:latest
Creating slurmctld
*snip*
Status: Downloaded newer image for qnib/slurmd:latest
Creating fd20_0
node0 $
```

After the services have settled and show up in Consul, SLURM should be up and running and sinfo reports one node.

```
node0 $ docker exec -ti fd20_0 sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      1   idle fd20_0
even         up   infinite      1   idle fd20_0
node0 $
```

### Additional Nodes

Now that the environment variable CONSUL_IP is set, we can start additional nodes.

```
node0 $ cd ../nodes/
node0 $ eval $(machine env node1)
node1 $ cd ../nodes/
node1 $ ./up.sh
Which SUFFIX should we provide fd20_x?
1
+ CNT=1
+ docker-compose up -d
Creating fd20_1
node1 $ docker exec -ti fd20_1 sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      2   idle fd20_[0-1]
odd          up   infinite      1   idle fd20_1
even         up   infinite      1   idle fd20_0
node1 $ docker exec -ti fd20_1 srun -N2 hostname
fd20_1
fd20_0
node1 $
```

Last but not least, let's bring up node2.

```
node1 $ eval $(machine env node2)
node2 $ cd ../nodes/
node2 $ ./up.sh
Which SUFFIX should we provide fd20_x?
2
+ CNT=2
+ docker-compose up -d
Creating fd20_2
node2 $ docker exec -ti fd20_2 sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      3   idle fd20_[0-2]
odd          up   infinite      1   idle fd20_1
even         up   infinite      2   idle fd20_[0,2]
node2 $ docker exec -ti fd20_2 srun -N3 hostname
fd20_1
fd20_0
fd20_2
node2 $
```
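Since sinfo lists the odd and even partitions next to all, you could also steer test jobs onto a subset of the containers. A small sketch along those lines (not something the post depends on, and the node names will differ if you picked other suffixes):

```
# sketch (not from the post): address the odd/even partitions reported by sinfo
node2 $ docker exec -ti fd20_2 sinfo -p odd,even
node2 $ docker exec -ti fd20_2 srun -p even -N2 hostname
node2 $ docker exec -ti fd20_2 srun -p odd -N1 hostname
```

srun -p picks the partition, and the requested node counts match what sinfo reports for each of them.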
## Future Work

OK, that was a fun ride so far, but what's missing?

### Shared FS

The containers do not share a volume or an underlying filesystem. On an HPC cluster something like that should be present, therefore just uncomment and adjust the volume part in base.yml.

### User Consolidation

I talked to a guy at DockerCon who was using an nscd socket to introduce the cluster users to the containers; I like that, I have to rediscover his mail address. No matter how, somehow the users have to be present within the containers. AFAIK the promised USER_NAMESPACE is not going to help, since it just defines a mapping of UID and GID from inside to outside of the container; to make this fun, all groups of a user have to be known within as well.

### Volumes

I haven't played around with volumes yet, but this might also be one way of having a shared filesystem. Or maybe providing access to the (read-only) input deck could be done like this. We'll see.

### MPI & Interconnect

Sure we want to run some distributed applications, therefore InfiniBand would be nice. But that's trivial once I get my hands on it again; I will share the setup to make it clear.

### More distributions

It would be fun to provide more distributions and let them compete with each other (as Robert mentioned, the reason why I am blogging about this). That's for another post.

### SWARM

Swarm would make the constant changing of Docker targets obsolete. As my Docker Swarm blog post showed, it's not even hard to use.

### Stuff I forgot

I am quite sure I forgot something, but the post is long enough already... Ahh, providing a stack to process all logs and metrics would be cool... Another time. :)

## Conclusion

It's not that hard to compose a simple HPC cluster that uses Docker containers. Enjoy!