HPC3: Expected Container Behaviour
The last blog post introduced the HPC Container Conformance (HPC3) project - a project to provide guidance on how to build and annotate container images for HPC use cases.
For HPC3 we'll need to cover two main parts (IMHO) first:
- `ENTRYPOINT`/`CMD` relationship: How do interactive users and batch systems expect a container to behave? We need to make sure that a container works with `docker run`, `singularity run` and `podman run` out of the box (engine configuration already done).
- Annotation baseline: Which annotations do we need and want; some are mandatory and some are optional.
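As a teaser for the annotation part, such a baseline might build on the standard OCI image-spec keys. The concrete key set is still to be defined by HPC3, so treat this Dockerfile fragment as a hypothetical sketch (the `source` URL is made up):

```dockerfile
# Hypothetical HPC3 annotation baseline using OCI image-spec label keys
LABEL org.opencontainers.image.title="gromacs" \
      org.opencontainers.image.version="2021.5" \
      org.opencontainers.image.source="https://quay.io/cqnib" \
      org.opencontainers.image.base.name="docker.io/library/amazonlinux:2"
```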
This blog post is going to set a baseline in terms of expected container behaviour to make sure that we can swap HPC3-conformant images of the same application and - ideally - have the job run in the same way.
Bioinformatics Paper from 2019
The bioinformatics folks published an excellent paper back in 2019 (link).
The paper hammers home a lot of points that are just as valid for us as an HPC community.
| # | Statement | # | Statement |
|---|-----------|---|-----------|
| 1 | A package first | 7 | Add functional testing logic |
| 2 | One tool, one container | 8 | Check the license of software |
| 3 | Versions should be explicit | 9 | Make your package discoverable |
| 4 | Avoid using ENTRYPOINT | 10 | Provide reproducible and documented builds |
| 5 | Reduce size as much as possible | 11 | Provide helpful usage message |
| 6 | Keep data outside of the container | | |
I pretty much agree with all of the above; we should strive to match these points - but I reckon we need to focus on the essentials first and tick the remaining boxes off as we go along.
Let's first go over the `ENTRYPOINT`/`CMD` relationship.
Entrypoint/Cmd
First, we are going to define how an HPC container is expected to behave when it's instantiated. Containers might be used in different ways:
- Interactively on a low(er) powered laptop to debug a workflow.
- Interactively on a compute instance to debug initial performance.
- In a batch script on a compute node, with different runtime/engines - say Sarus and Singularity
- Using a workflow manager like Nextflow
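In practice this boils down to the same image working verbatim under each engine. Sketched with the GROMACS image used below (Singularity pulls OCI images via the `docker://` prefix):

```shell
docker run -ti --rm quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi
podman run -ti --rm quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi
singularity run docker://quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi
```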
Interactively
Laptop
For all of the above we need all containers to act in a predictable, reproducible way. In our first iteration on HPC3 we propose that all containers drop into an interactive shell if you do not pass a command. With that I can do a quick check of a GROMACS container on my laptop.
$ docker run -ti --rm -v $(pwd):/input -v /scratch -w /scratch \
quay.io/cqnib/gromacs/gcc-7.3.1/2021.5/tmpi:multi-arch
bash-4.2# ls /input/
benchMEM.tpr benchRIB.tpr
bash-4.2#
Input data is available, let's do a 500 step simulation run.
bash-4.2# gmx mdrun -s /input/benchMEM.tpr -nsteps 500
:-) GROMACS - gmx mdrun, 2021.5-spack (-:
GROMACS: gmx mdrun, version 2021.5-spack
*snip*
Command line:
gmx mdrun -s /input/benchMEM.tpr -nsteps 500
*snip*
Using 1 MPI thread
Using 8 OpenMP threads
starting mdrun 'Great Red Oystrich Makes All Chemists Sane'
500 steps, 1.0 ps.
Writing final coordinates.
Core t (s) Wall t (s) (%)
Time: 39.166 4.896 799.9
(ns/day) (hour/ns)
Performance: 17.682 1.357
Awesome. Now we can move to a compute node with more compute power to check the performance of a small sample.
Compute Node
Now that we know the container works in general, let's log into an AWS g5.4xlarge instance with a GPU and run the container interactively.
$ docker run -ti --rm --gpus all \ # (1)
-v /shared/input/gromacs:/input \
-v cache:/cache -w /cache \ # (2)
quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi
- nvidia-docker is going to hand the GPUs of the node over to the container
- We are going to use a volume as cache, which keeps it outside of the container file system

Now that we have a shell, we can run the same command as above and watch it use the GPU.
$ docker run -ti --rm --gpus all \
-v /shared/input/gromacs:/input \
-v cache:/cache -w /cache \
quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi
bash-4.2# gmx mdrun -s /input/benchMEM.tpr -nsteps 500
:-) GROMACS - gmx mdrun, 2021.5 (-:
*snip*
1 GPU selected for this run. # (1)
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Using 1 MPI thread
Using 16 OpenMP threads
*snip*
Core t (s) Wall t (s) (%)
Time: 11.531 0.721 1599.7
(ns/day) (hour/ns)
Performance: 120.109 0.200
- GROMACS recognizes one GPU - awesome!
Batch Job
Nice, that works as well. Finally we are going to submit a SLURM job using an HPC runtime/engine.
Note
I am going to stick to nvidia-docker, since I want to keep this blog post focused on the container and not the engine.
The SLURM submit script is rather simplistic. It takes the image and executes the `docker run` command from above (minus `stdin`).
#!/bin/bash
#SBATCH --output=/shared/logs/%x_%j.out
mkdir -p /shared/jobs/${SLURM_JOBID} # (1)
cd /shared/jobs/${SLURM_JOBID}
docker run -t --rm --gpus all \
-v /shared/input/gromacs:/input \
-v /shared/jobs/${SLURM_JOBID}:/jobdir -w /jobdir \ # (2)
quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi \
gmx mdrun -s /input/benchMEM.tpr
- We'll create a job directory on the shared storage to keep the results around.
- Switch to `/jobdir` to store intermediate results.
When submitting the job to a partition with g5 instances, it is going to download the container if it is not already present and run the job - no surprises there.
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
c6a* up infinite 2 idle~ c6a-dy-c6a-4xl-[1-2]
g5 up infinite 1 idle~ g5-dy-g5-4xl-2
g5 up infinite 1 mix g5-dy-g5-4xl-1
Submit!
$ cat /shared/logs/gromacs-g5.sbatch_26.out
+ mkdir -p /shared/jobs/26
+ cd /shared/jobs/26
+ docker run -t --rm --gpus all -v /shared/input/gromacs:/input -v /shared/jobs/26:/jobdir -w /jobdir quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi gmx mdrun -s /input/benchMEM.tpr
:-) GROMACS - gmx mdrun, 2021.5 (-:
*snip*
1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Using 1 MPI thread
Using 16 OpenMP threads
*snip*
Core t (s) Wall t (s) (%)
Time: 321.190 20.078 1599.7
(ns/day) (hour/ns)
Performance: 86.071 0.279
- GROMACS recognizes one GPU - awesome!
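Swappability is the whole point of this exercise: if the images are conformant, only the engine invocation in the submit script should change. A hypothetical Singularity variant of the same script - untested here, flag names taken from the Singularity CLI - could look like this:

```shell
#!/bin/bash
#SBATCH --output=/shared/logs/%x_%j.out
mkdir -p /shared/jobs/${SLURM_JOBID}
cd /shared/jobs/${SLURM_JOBID}
# Same image, pulled via the docker:// prefix; --nv exposes the GPUs,
# --bind replaces docker's -v, --pwd sets the working directory.
singularity run --nv \
    --bind /shared/input/gromacs:/input \
    --bind /shared/jobs/${SLURM_JOBID}:/jobdir --pwd /jobdir \
    docker://quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi \
    gmx mdrun -s /input/benchMEM.tpr
```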
Other schedulers
Beyond what we already did, we might want to run the container image in Kubernetes, or in Nextflow for bioinformatics workflows. We need to keep that in mind when building and annotating HPC containers.
Some might ignore the `ENTRYPOINT` and expect the container not to rely on an environment being set up just beforehand. It would be great if we could get rid of any tweaking of the default user environment completely.
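To make that concrete, here is a minimal sketch (plain bash, no container involved; the file name is made up) of why profile-based setup is fragile: a snippet dropped into `/etc/profile.d` only takes effect when something actually sources it, which schedulers that bypass the `ENTRYPOINT` may never do.

```shell
# Simulate a profile.d snippet (hypothetical file name):
echo 'export HPC3_DEMO=from_profile' > /tmp/z10_demo_env.sh

# A non-login shell never sources it - the variable is missing:
bash --noprofile --norc -c 'echo "plain: ${HPC3_DEMO:-unset}"'

# Only after sourcing it explicitly does the tool see its environment:
bash --noprofile --norc -c '. /tmp/z10_demo_env.sh; echo "sourced: $HPC3_DEMO"'
```

The first invocation prints `plain: unset`, the second `sourced: from_profile`.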
Conclusion
I highly recommend reading the recommendations the bioinformatics community came up with. They are a good baseline for us to take into account mid-term.
Bioinformatics has an advantage though: a lot of their tooling and workflow systems are already based on containers, and with BioContainers they have a huge repository of all their tools. They need to herd the cats - we need to convince all the cats first.
Thus, we should focus on first providing a nucleus of a repository of HPC applications. Within the HPC3 project we'll start with GROMACS and PyTorch. All onboarded containers need to be swappable, so that an end-user can create a submission script once and 'just' switch to a different container while the rest of the logic stays the same.
Appendix
Spack
I used the `develop` branch of Spack and ran `spack containerize` using the `spack.yaml` below. I needed to tweak the resulting `Dockerfile` a bit; I'll annotate it below.
And since I blogged about Dockerfile frontends - there is a lot of cool caching stuff we can do here.
I built the image on a g5.4xlarge instance - after all, I want to run it on that exact instance type anyway. Building on a Skylake node gave me `x86_64_v4` binaries, which segfaulted on the g5 instance (`zen2`) - it took me some time to remember that.
spack:
view: true
specs:
- gromacs@2021.5 +cuda ~mpi
packages:
all:
target: [x86_64_v3]
concretizer:
unify: true
targets:
granularity: generic
container:
os_packages:
final:
- libgomp
extra_instructions:
final: RUN (echo /opt/view/lib && echo /opt/view/lib64) > /etc/ld.so.conf.d/spack-view.conf
&& ldconfig
images:
os: amazonlinux:2
spack:
ref: develop
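Generating the Dockerfile from that `spack.yaml` is a two-step affair (sketch; run from the directory containing the `spack.yaml`, and the image tag here is just an example):

```shell
spack containerize > Dockerfile   # render the environment as a multi-stage build
docker build -t gromacs-2021.5:x86_64_v3-cuda-tmpi .
```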
FROM amazonlinux:2 as bootstrap
ENV SPACK_ROOT=/opt/spack \
CURRENTLY_BUILDING_DOCKER_IMAGE=1 \
container=docker
RUN yum update -y \
&& yum groupinstall -y "Development Tools" \
&& yum install -y \
curl \
findutils \
gcc-c++ \
gcc \
gcc-gfortran \
git \
gnupg2 \
hostname \
iproute \
make \
patch \
python3 \
python3-pip \
python3-setuptools \
unzip \
&& pip3 install boto3 \
&& rm -rf /var/cache/yum \
&& yum clean all
RUN mkdir $SPACK_ROOT && cd $SPACK_ROOT && \
git clone https://github.com/spack/spack.git . && git fetch origin develop:container_branch && git checkout container_branch && \
mkdir -p $SPACK_ROOT/opt/spack
RUN ln -s $SPACK_ROOT/share/spack/docker/entrypoint.bash \
/usr/local/bin/docker-shell \
&& ln -s $SPACK_ROOT/share/spack/docker/entrypoint.bash \
/usr/local/bin/interactive-shell \
&& ln -s $SPACK_ROOT/share/spack/docker/entrypoint.bash \
/usr/local/bin/spack-env
RUN mkdir -p /root/.spack \
&& cp $SPACK_ROOT/share/spack/docker/modules.yaml \
/root/.spack/modules.yaml \
&& rm -rf /root/*.* /run/nologin $SPACK_ROOT/.git
# [WORKAROUND]
# https://superuser.com/questions/1241548/
# xubuntu-16-04-ttyname-failed-inappropriate-ioctl-for-device#1253889
RUN [ -f ~/.profile ] \
&& sed -i 's/mesg n/( tty -s \\&\\& mesg n || true )/g' ~/.profile \
|| true
WORKDIR /root
SHELL ["docker-shell"]
# Creates the package cache
RUN spack bootstrap now && spack spec hdf5+mpi
ENTRYPOINT ["/bin/bash", "/opt/spack/share/spack/docker/entrypoint.bash"]
CMD ["interactive-shell"]
# Build stage with Spack pre-installed and ready to be used
FROM bootstrap as builder
# What we want to install and how we want to install it
# is specified in a manifest file (spack.yaml)
RUN mkdir /opt/spack-environment \
&& (echo "spack:" \
&& echo " view: /opt/view" \
&& echo " specs:" \
&& echo " - gromacs@2021.5 +cuda ~mpi" \
&& echo " packages:" \
&& echo " all:" \
&& echo " target:" \
&& echo " - x86_64_v3" \
&& echo " concretizer:" \
&& echo " unify: true" \
&& echo " targets:" \
&& echo " granularity: generic" \
&& echo " config:" \
&& echo " install_tree: /opt/software") > /opt/spack-environment/spack.yaml
# Install the software, remove unnecessary deps
RUN cd /opt/spack-environment && spack env activate . && spack install --fail-fast && spack gc -y
# Strip all the binaries
RUN find -L /opt/view/* -type f -exec readlink -f '{}' \; | \
xargs file -i | \
grep 'charset=binary' | \
grep 'x-executable\|x-archive\|x-sharedlib' | \
grep -v "cuda-" |\ # (1)
awk -F: '{print $1}' | xargs strip -s
# Modifications to the environment that are necessary to run
RUN cd /opt/spack-environment && \
spack env activate --sh -d . >> /etc/profile.d/z10_spack_environment.sh
# Bare OS image to run the installed executables
FROM amazonlinux:2
COPY --from=builder /opt/spack-environment /opt/spack-environment
COPY --from=builder /opt/software /opt/software
COPY --from=builder /opt/._view /opt/._view
COPY --from=builder /opt/view /opt/view
COPY --from=builder /etc/profile.d/z10_spack_environment.sh /etc/profile.d/z10_spack_environment.sh
RUN yum update -y && amazon-linux-extras install epel -y \
&& yum install -y libgomp \
&& rm -rf /var/cache/yum && yum clean all
RUN (echo /opt/view/lib && echo /opt/view/lib64) > /etc/ld.so.conf.d/spack-view.conf && ldconfig
ENTRYPOINT ["/bin/bash", "--rcfile", "/etc/profile", "-l", "-c", "$*", "--" ]
CMD [ "/bin/bash" ]
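The generated `ENTRYPOINT` deserves a word: bash receives the literal string `$*` as its `-c` script, `--` serves as a dummy `$0`, and whatever `CMD` (or the user) appends becomes the positional parameters. Expanding `$*` then re-assembles them into the command to run, after `/etc/profile` has set up the environment. A container-free sketch of the mechanism:

```shell
# '--' is consumed as $0; 'echo hello from CMD' become $1..$4.
# The -c script '$*' expands to those words and runs them as a command.
bash -c '$*' -- echo hello from CMD
```

This prints `hello from CMD` - exactly how `docker run <image> gmx mdrun ...` ends up executing `gmx mdrun ...` inside a login-shell environment.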
- The cuda packages somehow do not get stripped; I sneaked in this little `grep -v` to exclude them.