After a quick Proof of Concept I pushed it to my little HPC setup with physical nodes and InfiniBand.

Next I aimed to run HPCG within containers instead of on bare metal.

The Plan

The big plan was to run a SLURM cluster on bare metal and instantiate the MPI tasks within containers.

All remaining nodes report in as SLURM clients (venus002 and venus003 did not come up again after an attempted update).

[root@venus001 ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
venus*       up   infinite      6   idle venus[001,004-008]
[root@venus001 ~]#

Images

The images are rather minimalistic.

[root@venus001 docker]# cat docker-fedora/Dockerfile
###### Updated version of fedora (22)
FROM fedora:22
MAINTAINER "Christian Kniep <christian@qnib.org>"

# Set the timezone (the base image defaults to UTC)
RUN ln -sf /usr/share/zoneinfo/Europe/Paris /etc/localtime

ADD etc/yum.conf /etc/yum.conf
# Migrate to DNF2, update (pinning systemd and iputils) and add basic tools;
# the echoed date acts as a cache buster to force a fresh update layer.
RUN dnf install -y python-dnf-plugins-extras-migrate && dnf-2 migrate && \
    echo "2015-03-24"; dnf clean all && \
    dnf update -y -x systemd -x systemd-libs -x iputils && \
    dnf install -y wget vim curl
[root@venus001 docker]# cat docker-openmpi/Dockerfile
### QNIBTerminal Image
FROM qnib/fedora

ENV PATH=/usr/lib64/openmpi/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
RUN dnf install -y openmpi bc libmlx4
ADD docker-latest /usr/local/bin/docker
[root@venus001 docker]#
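
To make them available on every node, the images get built and pushed to the local registry (192.168.12.11:5000, visible in the docker ps output further down). A rough sketch - the exact tags are my assumption based on that output:

docker build -t qnib/fedora docker-fedora/
docker build -t 192.168.12.11:5000/qnib/openmpi:fd22 docker-openmpi/
docker push 192.168.12.11:5000/qnib/openmpi:fd22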

No SLURM, no SSH; I only added the Docker client binary so that the Docker CLI can be called from within the container - and even that I could get rid of.
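
Since dssh itself is not shown in this post, here is a minimal sketch of what such an rsh-agent replacement could look like. Open MPI invokes the agent like ssh - 'dssh <host> <command...>' - and the orted line further down (docker -H venus007:2376 exec -i venus007-ompi ...) suggests it boils down to a docker exec against the remote daemon:

#!/bin/bash
# Sketch of /scratch/bin/dssh (assumed, not the original script).
# Translate 'dssh <host> <cmd...>' into a docker exec against the Docker
# daemon on the target node (TCP port 2376) and the container '<host>-ompi'.
set -x
HOST=${1}
shift
exec docker -H ${HOST}:2376 exec -i ${HOST}-ompi "$@"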

HPCG

The input data for HPCG is the following - the third line defines the local problem size per rank (104x104x104), the fourth the requested run time in seconds:

[root@venus001 ~]# cat /scratch/hpcg.dat
HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
104 104 104
300

A wrapper rotates through the list of nodes:

#!/bin/bash
HOSTLIST="venus004,venus005,venus006,venus007,venus008"

# Move the head of the comma-separated host list to its tail.
function shift_hostlist() {
    HEAD=$(echo ${1} | awk -F, '{print $1}')
    TAIL=$(echo ${1} | awk -F, '{$1=""; print $0}' | sed -e 's/^[[:space:]]*//' | tr ' ' ',')
    echo "${TAIL},${HEAD}"
}

EXE=${1}      # path to the xhpcg binary
RUNS=${2-3}   # number of runs, defaults to 3
DOCKER=${3}   # any third argument enables the dssh agent

if [ ! -z "${DOCKER}" ]; then
    MCA_OPTS="-mca plm_rsh_agent /scratch/bin/dssh"
fi

CNT=0
while true; do
    if [ ${CNT} -eq ${RUNS} ]; then
        break
    fi
    # Run on the first four hosts of the (rotated) list.
    SUBHOSTS=$(echo ${HOSTLIST} | cut -d, -f-4)
    echo "# --host ${SUBHOSTS} ${MCA_OPTS}"
    time mpirun --allow-run-as-root ${MCA_OPTS} --host ${SUBHOSTS} --np 32 ${EXE}
    HOSTLIST=$(shift_hostlist ${HOSTLIST})
    CNT=$((CNT + 1))
done
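
For illustration, one rotation simply moves the head of the list to the tail (assuming shift_hostlist is defined in the current shell):

$ shift_hostlist "venus004,venus005,venus006,venus007,venus008"
venus005,venus006,venus007,venus008,venus004

Each run then picks the first four hosts of the rotated list (cut -d, -f-4), which matches the host sets in the outputs below.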

The goal here is not to stage a complete and fair comparison. Since I did not pay attention to the OpenMP settings, the results should be taken with a grain of salt. As seen in the htop screenshots, the number of forked processes differs between the setups - the runs do not look all that similar. So don't pin me down on these numbers... :)

Bare-metal run

As a baseline I compiled HPCG on CentOS 7.2 (the bare-metal installation) and ran it.

[root@venus001 scratch]# ./bin/shuf_xhpcg.sh /scratch/src/hpcg-3.0/Linux_MPI/bin/xhpcg
# --host venus004,venus005,venus006,venus007
real    13m20.755s
# --host venus005,venus006,venus007,venus008
real    13m39.780s
# --host venus006,venus007,venus008,venus004
real    13m41.473s
[root@venus001 scratch]#
[root@venus001 scratch]# grep "a GFLOP" HPCG-Benchmark-3.0_2016.04*
HPCG-Benchmark-3.0_2016.04.03.13.33.06.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.76119
HPCG-Benchmark-3.0_2016.04.03.13.46.42.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.70451
HPCG-Benchmark-3.0_2016.04.03.14.00.40.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.68913
[root@venus001 scratch]#

Running with dssh

First I ran it within containers based on CentOS 7.2 (tag cos7).

[root@venus001 ~]# clush -w venus00[4-8] 'docker -H localhost:2376 ps |grep -v CONTAINER'|sort
venus004: 0d77a2f202f1        192.168.12.11:5000/qnib/openmpi:cos7   "tail -f /dev/null"   3 hours ago         Up 3 hours                              venus004-ompi
venus005: 0f84e9660f8e        192.168.12.11:5000/qnib/openmpi:cos7   "tail -f /dev/null"   3 hours ago         Up 3 hours                              venus005-ompi
venus006: 434e269e4177        192.168.12.11:5000/qnib/openmpi:cos7   "tail -f /dev/null"   3 hours ago         Up 3 hours                              venus006-ompi
venus007: 1e435510e91d        192.168.12.11:5000/qnib/openmpi:cos7   "tail -f /dev/null"   3 hours ago         Up 3 hours                              venus007-ompi
venus008: 345a1b9859d1        192.168.12.11:5000/qnib/openmpi:cos7   "tail -f /dev/null"   3 hours ago         Up 3 hours                              venus008-ompi
[root@venus001 ~]#
[root@venus001 scratch]# /scratch/bin/shuf_xhpcg.sh /scratch/src/hpcg-3.0/COS7_MPI/bin/xhpcg 5 docker
Sun Apr  3 16:09:44 CEST 2016 # --host venus004,venus005,venus006,venus007 -mca plm_rsh_agent /scratch/bin/dssh
+ docker -H venus007:2376 exec -i venus007-ompi orted --hnp-topo-sig 0N:2S:0L3:4L2:8L1:8C:8H:x86_64 -mca ess '"env"' -mca orte_ess_jobid '"1496711168"' -mca orte_ess_vpid 4 -mca orte_ess_num_procs '"5"' -mca orte_hnp_uri '"1496711168.0;tcp://192.168.12.181,10.0.0.181,172.18.0.1,172.17.0.1,172.19.0.1:52992"' --tree-spawn -mca plm_rsh_agent '"/scratch/bin/dssh"' -mca plm '"rsh"' --tree-spawn
*snip*
real    13m33.880s
*snip*
Sun Apr  3 16:23:23 CEST 2016 # --host venus005,venus006,venus007,venus008 -mca plm_rsh_agent /scratch/bin/dssh

This yields results similar to the bare-metal run, which is to be expected since it is the same user land.

[root@venus001 scratch]# grep "a GFLOP" HPCG-Benchmark-3.0_2016.04*
HPCG-Benchmark-3.0_2016.04.03.16.18.37.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.72324
HPCG-Benchmark-3.0_2016.04.03.16.32.13.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.70897
HPCG-Benchmark-3.0_2016.04.03.16.46.11.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.70728
HPCG-Benchmark-3.0_2016.04.03.17.00.07.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.69402
HPCG-Benchmark-3.0_2016.04.03.17.13.42.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.67581
[root@venus001 scratch]#

Second run with Fedora 22

Restart the containers using the Fedora 22 tag (fd22).
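
The restart_openmpi helper is not part of this post; judging from the output below (container name from the remove, pull status, new container ID), it could look roughly like this - the run flags needed for InfiniBand/networking are left out, since I can only guess them:

#!/bin/bash
# Sketch of /scratch/bin/restart_openmpi (assumed, not the original script).
TAG=${1-latest}
NODE=$(hostname -s)
# Remove the old keep-alive container (prints its name),
docker rm -f ${NODE}-ompi
# pull the requested tag from the local registry (prints the status line),
docker pull 192.168.12.11:5000/qnib/openmpi:${TAG}
# and start a fresh keep-alive container (prints the new container ID).
docker run -d --name ${NODE}-ompi 192.168.12.11:5000/qnib/openmpi:${TAG} tail -f /dev/null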

[root@venus001 docker-openmpi]# clush -w venus00[4-8] /scratch/bin/restart_openmpi fd22
venus008: venus008-ompi
venus007: venus007-ompi
venus004: venus004-ompi
venus006: venus006-ompi
venus005: venus005-ompi
venus004: Status: Downloaded newer image for 192.168.12.11:5000/qnib/openmpi:fd22
venus007: Status: Downloaded newer image for 192.168.12.11:5000/qnib/openmpi:fd22
venus008: Status: Downloaded newer image for 192.168.12.11:5000/qnib/openmpi:fd22
venus005: Status: Downloaded newer image for 192.168.12.11:5000/qnib/openmpi:fd22
venus007: f1bddbbd0cadac7a730afdfd0b2f154837518939857e4e3daab227e887278274
venus004: 32596bdc0983b205032bc9abb10ca756617bf0b5a821207bd0b3d56f62870650
venus008: 30b79640f8b9713445a2014254407931d7f0b25d3220346dfdaf8bcb4b0e8518
venus006: Status: Downloaded newer image for 192.168.12.11:5000/qnib/openmpi:fd22
venus005: 01cb8cee27341242db37998dc954cf3988368d98c8989c629c9c5892eeaf7828
venus006: d087242878903f2175fb7a49bf8afc144da82ffe2d11976fa1ec0eb51ecb245f
[root@venus001 docker-openmpi]#

And off we go...

[root@venus001 scratch]# /scratch/bin/shuf_xhpcg.sh /scratch/src/hpcg-3.0/FD22_MPI/bin/xhpcg 5 docker
Sun Apr  3 17:43:33 CEST 2016 # --host venus004,venus005,venus006,venus007 -mca plm_rsh_agent /scratch/bin/dssh
+ docker -H venus005:2376 exec -i venus005-ompi orted --hnp-topo-sig 0N:2S:0L3:4L2:8L1:8C:8H:x86_64 -mca ess '"env"' -mca orte_ess_jobid '"1418657792"' -mca orte_ess_vpid 2 -mca orte_ess_num_procs '"5"' -mca orte_hnp_uri '"1418657792.0;tcp://192.168.12.181,10.0.0.181,172.18.0.1,172.17.0.1,172.19.0.1:35529"' --tree-spawn -mca plm_rsh_agent '"/scratch/bin/dssh"' -mca plm '"rsh"' --tree-spawn
*snip*
real    13m13.720s
Sun Apr  3 17:56:51 CEST 2016 # --host venus005,venus006,venus007,venus008 -mca plm_rsh_agent /scratch/bin/dssh

It is a bit faster (~100 MFLOP/s, ~3%) than the CentOS 7 run.

[root@venus001 docker-login]# grep "a GFLOP" /scratch/HPCG-Benchmark-3.0_2016.04*
/scratch/HPCG-Benchmark-3.0_2016.04.03.17.52.05.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.78578
/scratch/HPCG-Benchmark-3.0_2016.04.03.18.05.20.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.79691
/scratch/HPCG-Benchmark-3.0_2016.04.03.18.18.58.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.77254
/scratch/HPCG-Benchmark-3.0_2016.04.03.18.32.30.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.78048
/scratch/HPCG-Benchmark-3.0_2016.04.03.18.45.39.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.79036
[root@venus001 docker-login]#

Ubuntu 15.10

For the fun of it I created an Ubuntu 15.10 image...

[root@venus001 docker-openmpi]# clush -w venus00[4-8] /scratch/bin/restart_openmpi u15.10
venus007: venus007-ompi
venus008: venus008-ompi
venus006: venus006-ompi
venus005: venus005-ompi
venus004: venus004-ompi
venus005: Status: Downloaded newer image for 192.168.12.11:5000/qnib/openmpi:u15.10
venus004: Status: Downloaded newer image for 192.168.12.11:5000/qnib/openmpi:u15.10
venus008: Status: Downloaded newer image for 192.168.12.11:5000/qnib/openmpi:u15.10
venus006: Status: Downloaded newer image for 192.168.12.11:5000/qnib/openmpi:u15.10
venus007: Status: Downloaded newer image for 192.168.12.11:5000/qnib/openmpi:u15.10
venus005: 65f8a28997be9902f69a0d35d871aa3bb486434ceb77242476827f1ba087a73f
venus006: b983a9b043d2935ca64ae61169f3387d0b69951125abb9240c69a93599c06817
venus004: e611123b77298776d221ca816ddb4509728a4ae2327b04463e52ef5918abdb29
venus008: adf7ebec267ee477d19052b4c534e233f28ebe9195c0065b6aff979481cb6f42
venus007: 0142e7a65a2d84242205e0b57f84c0673171052e983b6973b84e43c0a723b40d
[root@venus001 docker-openmpi]#

With a rather poor result... :)

[root@venus001 docker-openmpi]# grep "a GFLOP" /scratch/HPCG-Benchmark-3.0_2016.04*
/scratch/HPCG-Benchmark-3.0_2016.04.03.17.39.50.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.35045
/scratch/HPCG-Benchmark-3.0_2016.04.03.17.55.36.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.32825
/scratch/HPCG-Benchmark-3.0_2016.04.03.18.11.43.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.33404

Ubuntu 14.04

How about 14.04...?

[root@venus001 scratch]# grep "a GFLOP" /scratch/HPCG-Benchmark-3.0_2016.04*
/scratch/HPCG-Benchmark-3.0_2016.04.03.18.49.18.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.34876
/scratch/HPCG-Benchmark-3.0_2016.04.03.19.05.04.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.32608
/scratch/HPCG-Benchmark-3.0_2016.04.03.19.21.12.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.3275

OK, enough... :) I already made the point about different user lands - and how they can be tailored to gain performance compared to the bare-metal installation - back in December 2014: Containerized MPI Workloads

Here are the results as a chart (the y-axis starts at 2 GFLOP/s):

SLURM vs. mpirun

I have to admit that I was wrong before: if I am not mistaken, slurmd takes care of the remote execution of the MPI ranks, not sshd.

[root@venus001 scratch]# cat /scratch/bin/run_hpcg.sh
#!/bin/bash
#SBATCH --workdir /scratch/
#SBATCH --ntasks-per-node 8

mpirun --allow-run-as-root /scratch/bin/xhpcg
[root@venus001 scratch]# ln -s /scratch/src/hpcg-3.0/COS7_MPI/bin/xhpcg /scratch/bin/
[root@venus001 scratch]# sbatch -w venus004,venus005,venus006,venus007 /scratch/bin/run_hpcg.sh
Submitted batch job 752
[root@venus001 scratch]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
venus*       up   infinite      4  alloc venus[004-007]
venus*       up   infinite      2   idle venus[001,008]
[root@venus001 scratch]#
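
As an aside: with SLURM's PMI support the ranks could even be launched without mpirun at all - assuming Open MPI was built with PMI support, something like:

srun --mpi=pmi2 -n 32 -w venus004,venus005,venus006,venus007 /scratch/bin/xhpcg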

By contrast, when using dssh - which at the end of the day is a docker exec - orted is started as a child of the first process within the container.

My evil plan was to use the SLURM cluster on bare metal to tell dssh which nodes are available, spawn a container on each node, run the job within it, and remove the container afterwards.

But adding -mca plm_rsh_agent /scratch/bin/dssh within the SLURM batch script has no effect, since slurmstepd executes the remote orted. :(

I guess I would have to build Open MPI without SLURM support to get an effect. Hmm...
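
Untested here, but forcing Open MPI's rsh launcher explicitly might achieve the same without recompiling anything - -mca plm rsh is a standard MCA option:

#!/bin/bash
#SBATCH --workdir /scratch/
#SBATCH --ntasks-per-node 8

# Force the rsh plm so that the agent is honored even inside a SLURM allocation.
mpirun --allow-run-as-root -mca plm rsh -mca plm_rsh_agent /scratch/bin/dssh /scratch/bin/xhpcg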

Btw., here is the result of the SLURM run on bare metal...

[root@venus001 docker-login]# grep "a GFLOP" /scratch/HPCG-Benchmark-3.0_2016.04*
/scratch/HPCG-Benchmark-3.0_2016.04.03.19.01.32.yaml:  HPCG result is VALID with a GFLOP/s rating of: 2.73566
[root@venus001 docker-login]#

Appendix

The HPCG binaries were built for each distribution, each within the respective container - here the Ubuntu 15.10 one, judging by the prompt:

root@venus001:/# cd /scratch/src/hpcg-3.0/
root@venus001:/scratch/src/hpcg-3.0# mkdir U15.10_MPI
root@venus001:/scratch/src/hpcg-3.0# cd U15.10_MPI/
root@venus001:/scratch/src/hpcg-3.0/U15.10_MPI# ../configure Linux_MPI
root@venus001:/scratch/src/hpcg-3.0/U15.10_MPI# make -j2
mpicxx -c -DHPCG_NO_OPENMP -I./src -I./src/Linux_MPI  -O3 -ffast-math -ftree-vectorize -ftree-vectorizer-verbose=0 -I../src ../src/main.cpp -o src/main.o
*snip*
mpicxx -c -DHPCG_NO_OPENMP -I./src -I./src/Linux_MPI  -O3 -ffast-math -ftree-vectorize -ftree-vectorizer-verbose=0 -I../src ../src/init.cpp -o src/init.o
mpicxx -c -DHPCG_NO_OPENMP -I./src -I./src/Linux_MPI  -O3 -ffast-math -ftree-vectorize -ftree-vectorizer-verbose=0 -I../src ../src/finalize.cpp -o src/finalize.o
mpicxx -DHPCG_NO_OPENMP -I./src -I./src/Linux_MPI  -O3 -ffast-math -ftree-vectorize -ftree-vectorizer-verbose=0 src/main.o src/CG.o src/CG_ref.o src/TestCG.o src/ComputeResidual.o src/ExchangeHalo.o src/GenerateGeometry.o src/GenerateProblem.o src/GenerateProblem_ref.o src/CheckProblem.o src/MixedBaseCounter.o src/OptimizeProblem.o src/ReadHpcgDat.o src/ReportResults.o src/SetupHalo.o src/SetupHalo_ref.o src/TestSymmetry.o src/TestNorms.o src/WriteProblem.o src/YAML_Doc.o src/YAML_Element.o src/ComputeDotProduct.o src/ComputeDotProduct_ref.o src/mytimer.o src/ComputeOptimalShapeXYZ.o src/ComputeSPMV.o src/ComputeSPMV_ref.o src/ComputeSYMGS.o src/ComputeSYMGS_ref.o src/ComputeWAXPBY.o src/ComputeWAXPBY_ref.o src/ComputeMG_ref.o src/ComputeMG.o src/ComputeProlongation_ref.o src/ComputeRestriction_ref.o src/CheckAspectRatio.o src/GenerateCoarseProblem.o src/init.o src/finalize.o  -o bin/xhpcg
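
The builds themselves presumably happened in an interactive container with /scratch bind-mounted - something along these lines (assumed; the image would additionally need the compiler and MPI devel packages):

docker run -ti -v /scratch:/scratch 192.168.12.11:5000/qnib/openmpi:u15.10 bash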