Happy new year, y'all!

To kick of 2016 I am going to contribute to FOSDEM by talking about a Multi-host containerised HPC cluster. And since the HPC Advisory Council was so nice to provide an actual HPC resource I will present a 2.0 version of the talk (lessons learned after FOSDEM) at the HPC Advisory Council Workshop in Stanford.

The blog post are listed here:

The talks are limited in time (20min +5min Q&A / 45min), therefore I thought it would be nice to provide some information in a series of blog posts. This one should set the scene a bit. Subsequent posts focus on how everything was setup (spoiler: ansible), the way I monitor it (Sensu on an external server), how the docker networking performs and so on. We'll see what pops into my head. :)

All the code is going to be released on github; in fact it's already there, but I use an external cluster with RabbitMQ credentials. Have to sanitise it first...

The Cluster

I got access to the Venus cluster I used before: Containerized MPI Workloads

The cluster has eight SunFire 2250 nodes, with 32GB (even though some nodes are degenerated to 28GB) memory and dual socket Xeon Quad-core X5472 each. All of which are connected by Mellanox ConnectX-2 cards (QDR 40Gbit/s) using a Mellanox 36port switch.

All nodes were installed with CentOS 7.2, the installation I used back in the days.

Baselines

Computation

An initial HPCG benchmark came back with 5.3 GFLOP/s. Above 144 I run out of memory, since some of the nodes are degraded to 27GB.

$ export MPI_HOSTS=venus001,venus002,venus003,venus004,venus005,venus006,venus007,venus008
$ mpirun -np 64 --allow-run-as-root --host ${MPI_HOSTS} /scratch/jobs/xhpcg
$ egrep "(GFLOP/s rat|^\s+nx)" HPCG-Benchmark-3.0_2016.01.08.1*
HPCG-Benchmark-3.0_2016.01.08.13.53.05.yaml:  nx: 104
HPCG-Benchmark-3.0_2016.01.08.13.53.05.yaml:  HPCG result is VALID with a GFLOP/s rating of: 5.28937
HPCG-Benchmark-3.0_2016.01.08.14.10.46.yaml:  nx: 104
HPCG-Benchmark-3.0_2016.01.08.14.10.46.yaml:  HPCG result is VALID with a GFLOP/s rating of: 5.30873
HPCG-Benchmark-3.0_2016.01.08.14.59.14.yaml:  nx: 144
HPCG-Benchmark-3.0_2016.01.08.14.59.14.yaml:  HPCG result is VALID with a GFLOP/s rating of: 5.34622
HPCG-Benchmark-3.0_2016.01.08.17.13.03.yaml:  nx: 144
HPCG-Benchmark-3.0_2016.01.08.17.13.03.yaml:  HPCG result is VALID with a GFLOP/s rating of: 5.33844
HPCG-Benchmark-3.0_2016.01.08.18.03.10.yaml:  nx: 144
HPCG-Benchmark-3.0_2016.01.08.18.03.10.yaml:  HPCG result is VALID with a GFLOP/s rating of: 5.34506
$

Network

Ethernet provides close to 1GBit/s...

$ iperf -c venus001 -d -t 120
*snip* ID] Interval       Transfer     Bandwidth
[  5]  0.0-120.0 sec  13.1 GBytes   938 Mbits/sec
[  4]  0.0-120.0 sec  13.1 GBytes   937 Mbits/sec

IPoIB is much faster then that, even though not close to the 40GBit/s InfiniBand offers.

$ iperf -c 10.0.0.181 -d -t 120
*snip*
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-120.0 sec  88.9 GBytes  6.37 Gbits/sec
[  4]  0.0-120.0 sec  79.6 GBytes  5.69 Gbits/sec
$

InfiniBand itself provides the promised 40GBit/s.

$ qperf venus001 -t 120 rc_bi_bw
rc_bi_bw:
    bw  =  5.09 GB/sec
$

Setup

OK, now that we established some rough baselines...

To be able to use Docker Networking the docker-engines need to connect to a global Key/Value store. I showed a simple setup the other day. This time around I create a Consul cluster in which each server has its own Consul agent running in server mode. All of them form one Consul cluster.

[root@venus001 ~]# consul members
Node      Address              Status  Type    Build  Protocol  DC
venus001  192.168.12.181:8301  alive   server  0.6.0  2         vagrant
venus002  192.168.12.182:8301  alive   server  0.6.0  2         vagrant
venus003  192.168.12.183:8301  alive   server  0.6.0  2         vagrant
venus004  192.168.12.184:8301  alive   server  0.6.0  2         vagrant
venus005  192.168.12.185:8301  alive   server  0.6.0  2         vagrant
venus006  192.168.12.186:8301  alive   server  0.6.0  2         vagrant
venus007  192.168.12.187:8301  alive   server  0.6.0  2         vagrant
venus008  192.168.12.188:8301  alive   server  0.6.0  2         vagrant
[root@venus001 ~]#

The nice property of this is (at least in my view), that the docker-engine can connect to the local Consul agent and does not rely on another machine. A reboot of a node turns out nicely, since the Consul agent rejoins the cluster and everything is fine.

[root@venus001 jobs]# cat /etc/sysconfig/docker
OPTIONS="-H tcp://0.0.0.0:2376 --cluster-store=consul://127.0.0.1:8500/network --cluster-advertise=enp4s0f0:2376"
[root@venus001 jobs]#

By doing so, all docker-engines share the same overlay network via the ethernet interface.

[root@venus001 jobs]# docker network create  -d overlay global
8d2943e84f6d2f2ae9a9036f3ec25fd054d4e33bbc595afb1d7d0fdd702bb139
[root@venus001 ~]# docker network ls
NETWORK ID          NAME                DRIVER
8d2943e84f6d        global              overlay
6756095cb2c8        bridge              bridge
d9072e5b2492        none                null
e5b17208d64a        host                host
07342005436a        docker_gwbridge     bridge
[root@venus001 ~]#
[root@venus004 ~]# docker network ls
NETWORK ID          NAME                DRIVER
8d2943e84f6d        global              overlay
b0eaee7ded62        bridge              bridge
7bceab756aff        docker_gwbridge     bridge
6d46041083fa        none                null
941f89e50815        host                host
[root@venus004 ~]#

Container

For this first benchmark the qnib/ib-bench:cos7 image is used on each host. It introduces an ssh-server, InfiniBand libraries and two IB-Benchmarks: HPCG + OMB

In general I favour docker-compose for starting the container, but it seems docker-compose has some trouble with memlocks (github issue). Therefore I start the containers by hand...

[root@venus001 slurm]# cd ../hpcg
[root@venus001 hpcg]# ./start.sh up
> export DOCKER_HOST=tcp://venus001:2376
> docker run -d ... --ulimit memlock='68719476736' qnib/ib-bench:cos7 /opt/qnib/sshd/bin/start.sh
a863b176043faf560eb19ef13d351f1be42c6916fc16653cb08e303ac6135836
> export DOCKER_HOST=tcp://venus002:2376
> docker run -d ... --ulimit memlock='68719476736' qnib/ib-bench:cos7 /opt/qnib/sshd/bin/start.sh
473c412374e38c5b0d1e9d80b387c90ab597f829d697084e65bb3968f4afdf0c
> export DOCKER_HOST=tcp://venus003:2376
> docker run -d ... --ulimit memlock='68719476736' qnib/ib-bench:cos7 /opt/qnib/sshd/bin/start.sh
b4caa4977f7416014a0db1b24b561a757c10fdb23c4939348c9abe0a793b490b
> export DOCKER_HOST=tcp://venus004:2376
> docker run -d ... --ulimit memlock='68719476736' qnib/ib-bench:cos7 /opt/qnib/sshd/bin/start.sh
5f3aa0d85609692c5a8c8222ae12191a505fbf1c738df03ece2bf75c53026b57
> export DOCKER_HOST=tcp://venus005:2376
> docker run -d ... --ulimit memlock='68719476736' qnib/ib-bench:cos7 /opt/qnib/sshd/bin/start.sh
d616f65fdb9c704b2d388103f278a30167d77103d43a6151d5d751deb52e0f55
> export DOCKER_HOST=tcp://venus006:2376
> docker run -d ... --ulimit memlock='68719476736' qnib/ib-bench:cos7 /opt/qnib/sshd/bin/start.sh
07a475ccf3833458dc209fd1951b6c59f78a48af9e6a0044106da626ebd5c67e
> export DOCKER_HOST=tcp://venus007:2376
> docker run -d ... --ulimit memlock='68719476736' qnib/ib-bench:cos7 /opt/qnib/sshd/bin/start.sh
d5e3104cf2358772515628af90de505b4011295498ab5951be0b6d6d0a461108
> export DOCKER_HOST=tcp://venus008:2376
> docker run -d ... --ulimit memlock='68719476736' qnib/ib-bench:cos7 /opt/qnib/sshd/bin/start.sh
cbb9d04f0ca8cd63bdd7ad096f78eb422d6f2428df61a513d698807815bef958
[root@venus001 hpcg]#

The containers are started with the command /opt/qnib/sshd/bin/start.sh, which does not start all services, but just the sshd daemon.

[root@venus001 ~]# docker exec -ti hpcg1 bash
[root@hpcg1 /]# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 20:18 ?        00:00:00 /bin/bash /opt/qnib/sshd/bin/start.sh
root        17     1  0 20:18 ?        00:00:00 /sbin/sshd -D
root        18     0  6 20:23 ?        00:00:00 bash
root        58    18  0 20:23 ?        00:00:00 ps -ef
[root@hpcg1 /]#

Benchmarks

The Ethernet performance is slightly smaller...

[root@hpcg2 /]# iperf -c hpcg1 -d -t 120
*snip*
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-120.0 sec  10.4 GBytes   745 Mbits/sec
[  4]  0.0-120.0 sec  11.8 GBytes   847 Mbits/sec
[root@hpcg2 /]#

That's why we love RDMA and InfiniBand... :)

[root@hpcg2 /]# qperf hpcg1 -t 120 rc_bi_bw
rc_bi_bw:
    bw  =  5.07 GB/sec
[root@hpcg2 /]#

And compute-wise... This time with the user bob, since he is able to login password-less.

Furthermore I had to reduce the size to 104, due to a lack of memory. The services started need memory themselves.

[root@venus001 hpcg]# docker exec -ti hpcg1 bash
[root@hpcg1 jobs]# su - bob
[bob@hpcg1 ~]$ cd /scratch/jobs/
[bob@hpcg1 jobs]$ export MPI_HOSTS=hpcg1,hpcg2,hpcg3,hpcg4,hpcg5,hpcg6,hpcg7,hpcg8
[bob@hpcg1 jobs]$ mpirun -np 64 --host ${MPI_HOSTS} /opt/hpcg-3.0/Linux_MPI/bin/xhpcg
[bob@hpcg1 cos7]$ egrep "(GFLOP/s rat|^\s+nx)" HPCG-Benchmark-3.0_2016.01*
HPCG-Benchmark-3.0_2016.01.09.11.52.11.yaml:  nx: 104
HPCG-Benchmark-3.0_2016.01.09.11.52.11.yaml:  HPCG result is VALID with a GFLOP/s rating of: 5.39522
HPCG-Benchmark-3.0_2016.01.09.12.25.15.yaml:  nx: 104
HPCG-Benchmark-3.0_2016.01.09.12.25.15.yaml:  HPCG result is VALID with a GFLOP/s rating of: 5.38519
HPCG-Benchmark-3.0_2016.01.09.13.30.21.yaml:  nx: 104
HPCG-Benchmark-3.0_2016.01.09.13.30.21.yaml:  HPCG result is VALID with a GFLOP/s rating of: 5.4443
[bob@hpcg1 cos7]$

Compared to the bare-metal run - it's even a little bit faster. :)

Conlusion so far

With docker networking it's easy as pie to create a small cluster. I haven't mentioned it, but the hostnames are dynamically added to all /etc/hosts files, which make all containers addressable out of the box.

After I ran the benchmark within the containers I stopped docker-engine and consul and benchmarked the bare-metal again.

[root@venus001 ~]# cd /scratch/jobs/bare
[root@venus001 bare]# export MPI_HOSTS=venus001,venus002,venus003,venus004,venus005,venus006,venus007,venus008
[root@venus001 bare]# mpirun --allow-run-as-root -np 64 --host ${MPI_HOSTS} /scratch/jobs/bare/xhpcg
[root@venus001 bare]#  egrep "(GFLOP/s rat|^\s+nx)" HPCG-Benchmark-3.0_2016.01*
  nx: 104
  HPCG result is VALID with a GFLOP/s rating of: 5.31681
[root@venus001 bare]#

Still, a bit slower - most likely due to an installation I messed with. The container images are a bit cleaner; that's why I like them in the first place...

So long!

A little spoiler in regards of my monitoring setup, the IB traffic is not trustable - I use a super-simple approach. I did much better in the past. :)