
HPC3: Expected Container Behaviour

The last blog post introduced the HPC Container Conformance (HPC3) project - a project to provide guidance on how to build and annotate container images for HPC use cases.

For HPC3 we'll need to cover two main parts (IMHO) first:

  1. Entrypoint/Cmd relationship: How do interactive users and batch systems expect a container to behave? We need to make sure that a container works with docker run, singularity run and podman run out of the box (with the engine configuration already done) - see the sketch after this list.
  2. Annotation Baseline: Which annotations do we need and want? Some are mandatory and some are optional.
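
To make the first point concrete, here is a quick smoke test of what "out of the box" could mean - a hedged sketch, not part of any spec yet, using the GROMACS image that shows up later in this post:

# no command given -> the container should drop into an interactive shell
# (or print a helpful usage message) with every engine
$ docker run --rm -ti quay.io/cqnib/gromacs/gcc-7.3.1/2021.5/tmpi:multi-arch
$ podman run --rm -ti quay.io/cqnib/gromacs/gcc-7.3.1/2021.5/tmpi:multi-arch
$ singularity run docker://quay.io/cqnib/gromacs/gcc-7.3.1/2021.5/tmpi:multi-arch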

This blog post is going to set a baseline in terms of expected container behaviour, to make sure that we can swap HPC3-conformant images of the same application and - ideally - have the job run in the same way.

Bioinformatics Paper from 2019

The bioinformatics folks created an excellent paper back in 2019 (link)

The paper hammers home a lot of points that are valid for us as an HPC community.

  1. A package first
  2. One tool, one container
  3. Versions should be explicit
  4. Avoid using ENTRYPOINT
  5. Reduce size as much as possible
  6. Keep data outside of the container
  7. Add functional testing logic
  8. Check the license of software
  9. Make your package discoverable
  10. Provide reproducible and documented builds
  11. Provide helpful usage message

I pretty much agree with all of the above; we should strive to match these points - but I reckon we need to focus on the essentials first and, once we have those, start ticking the remaining boxes off as we go along.
Let's first go over the ENTRYPOINT/CMD relationship.

Entrypoint/Cmd

First, we are going to define how an HPC container is expected to behave when it's instantiated. Containers might be used in different ways:

  1. Interactively on a low(er)-powered laptop to debug a workflow.
  2. Interactively on a compute instance to debug initial performance.
  3. In a batch script on a compute node, with different runtimes/engines - say Sarus and Singularity.
  4. Using a workflow manager like Nextflow.

Interactively

Laptop

For all of the above we need containers to act in a predictable, reproducible way. In our first iteration on HPC3 we propose that all containers drop into an interactive shell if you do not pass a command. With that I can do a quick check of a GROMACS container on my laptop.

$ docker run -ti --rm -v $(pwd):/input -v /scratch -w /scratch \
         quay.io/cqnib/gromacs/gcc-7.3.1/2021.5/tmpi:multi-arch
bash-4.2# ls /input/
benchMEM.tpr  benchRIB.tpr
bash-4.2# 

Input data is available, let's do a 500 step simulation run.

bash-4.2# gmx mdrun -s /input/benchMEM.tpr -nsteps 500
                  :-) GROMACS - gmx mdrun, 2021.5-spack (-:
GROMACS:      gmx mdrun, version 2021.5-spack
*snip*
Command line:
gmx mdrun -s /input/benchMEM.tpr -nsteps 500
*snip*
Using 1 MPI thread
Using 8 OpenMP threads

starting mdrun 'Great Red Oystrich Makes All Chemists Sane'
500 steps,      1.0 ps.

Writing final coordinates.
               Core t (s)   Wall t (s)        (%)
      Time:       39.166        4.896      799.9
               (ns/day)    (hour/ns)
Performance:       17.682        1.357

Awesome. Now we can move on to a compute node with more compute power to check the performance of a small sample.

Compute Node

Now that we know the container works in general, let's log into an AWS g5.4xlarge instance with a GPU and run the container interactively on the node.

$ docker run -ti --rm --gpus all \ # (1)
         -v /shared/input/gromacs:/input \
         -v cache:/cache -w /cache \ # (2)
         quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi
  1. nvidia-docker is going to hand over the GPUs of the node
  2. We are going to use a volume as cache, which puts it outside of the container file system
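
Before firing up GROMACS, a quick sanity check (a suggestion of mine, not part of the original session) that the GPU is actually visible inside the container - this assumes the NVIDIA container toolkit injects nvidia-smi, which it should do by default:

# inside the container: nvidia-smi is mounted in by the NVIDIA container toolkit
bash-4.2# nvidia-smi
# a g5.4xlarge should show a single NVIDIA A10G; if nothing shows up, the
# engine setup on the host is the first thing to check, not the container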

Now that we have a bash, we can run the same command as above and watch it use the GPU.

$ docker run -ti --rm --gpus all \
   -v /shared/input/gromacs:/input \
   -v cache:/cache -w /cache \
   quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi
bash-4.2# gmx mdrun -s /input/benchMEM.tpr -nsteps 500
                     :-) GROMACS - gmx mdrun, 2021.5 (-:
*snip*
1 GPU selected for this run. # (1)
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Using 1 MPI thread
Using 16 OpenMP threads 
*snip*
                Core t (s)   Wall t (s)        (%)
      Time:       11.531        0.721     1599.7
               (ns/day)    (hour/ns)
Performance:      120.109        0.200
  1. GROMACS recognizes one GPU - awesome!

Batch Job

Nice, that works as well - finally we are going to submit a Slurm job, ideally using an HPC runtime/engine.

Note

I am going to stick to nvidia-docker since I do want to keep this blog post focused on the container and not the engine.

The Slurm submit script is rather simplistic. It takes the image and executes the docker run command from above (minus stdin).

gromacs-g5.sbatch
#!/bin/bash
#SBATCH --output=/shared/logs/%x_%j.out

mkdir -p /shared/jobs/${SLURM_JOBID} # (1)
cd /shared/jobs/${SLURM_JOBID}


docker run -t --rm --gpus all \
    -v /shared/input/gromacs:/input \
    -v /shared/jobs/${SLURM_JOBID}:/jobdir -w /jobdir \ # (2)
    quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi \
    gmx mdrun -s /input/benchMEM.tpr
  1. We'll create a job directory on the shared storage to keep the result around.
  2. Switch to the /jobdir to store intermediate results.

When submitting the job to a partition with g5 instances, the node is going to download the container image if it is not already present and run the job - no surprises there.

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
c6a*         up   infinite      2  idle~ c6a-dy-c6a-4xl-[1-2]
g5           up   infinite      1  idle~ g5-dy-g5-4xl-2
g5           up   infinite      1    mix g5-dy-g5-4xl-1

Submit!

$ sbatch -N1 -p g5 gromacs-g5.sbatch 
Submitted batch job 22
$ cat /shared/logs/gromacs-g5.sbatch_26.out 
+ mkdir -p /shared/jobs/26
+ cd /shared/jobs/26
+ docker run -t --rm --gpus all -v /shared/input/gromacs:/input -v /shared/jobs/26:/jobdir -w /jobdir quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi gmx mdrun -s /input/benchMEM.tpr
                :-) GROMACS - gmx mdrun, 2021.5 (-:
*snip*
1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Using 1 MPI thread
Using 16 OpenMP threads 
*snip*
               Core t (s)   Wall t (s)        (%)
      Time:      321.190       20.078     1599.7
               (ns/day)    (hour/ns)
Performance:       86.071        0.279
  1. GROMACS recognizes one GPU - awesome!
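
Just to sketch where this is headed - I did not run this variant here, and it assumes Singularity/Apptainer is available on the compute nodes and that the image converts cleanly - the same job with an HPC engine could look roughly like this:

#!/bin/bash
#SBATCH --output=/shared/logs/%x_%j.out

mkdir -p /shared/jobs/${SLURM_JOBID}
cd /shared/jobs/${SLURM_JOBID}

# --nv exposes the host GPUs, -B replaces the docker -v bind mounts;
# exec bypasses the runscript/ENTRYPOINT, so whether gmx is on the PATH
# without sourcing /etc/profile depends on the image - exactly the kind
# of behaviour HPC3 wants to pin down
singularity exec --nv \
    -B /shared/input/gromacs:/input \
    -B /shared/jobs/${SLURM_JOBID}:/jobdir --pwd /jobdir \
    docker://quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi \
    gmx mdrun -s /input/benchMEM.tpr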

Other scheduler

Beyond what we already did, we might want to run the container image in Kubernetes, or in Nextflow for bioinformatics workflows. We need to keep that in mind when building and annotating HPC containers. Some engines ignore the ENTRYPOINT and expect the container to work without any environment setup happening just before the command runs. It would be great if we could get rid of any tweaking of the default user environment completely.
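
You can simulate an engine that ignores the ENTRYPOINT with plain docker - a hedged illustration, not something I ran above:

# call the binary directly, skipping the ENTRYPOINT (and with it /etc/profile)
$ docker run --rm --entrypoint gmx \
    quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi --version
# if the image only sets up its PATH via /etc/profile, this is likely to fail
# with "executable file not found" - the kind of surprise we want to avoid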

Conclusion

I highly recommend reading the recommendations the bioinformatics community came up with. They are a good baseline for us to take into account mid-term.
The bioinformatics community has an advantage though; a lot of their tooling and workflow systems are based on containers, and with BioContainers they have a huge repository of all their tools. They need to herd the cats - we need to convince all the cats first. 😃

Thus, we should first focus on providing a nucleus of a repository of HPC applications. Within the HPC3 project we'll start with GROMACS and PyTorch. All containers onboarded need to be swappable, so that an end-user can create a submission script once and 'just' switch to a different container while the rest of the logic stays the same.

Appendix

Spack

I used the develop branch of Spack and ran spack containerize using the spack.yaml below. I needed to tweak the resulting Dockerfile a bit; I'll annotate it below. And since I blogged about Dockerfile frontends before: there is a lot of cool caching stuff we can do here.

I built the image on a g5.4xlarge instance - after all, I want to run it on that exact instance type anyway. Building on a Skylake node gave me x86_64_v4 binaries, which segfaulted on the g5 instance (zen2) - it took me some time to remember that. 😃
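
A quick way to double-check what Spack will target before kicking off a long build (a side note, not part of the original workflow):

$ spack arch            # the platform-os-target triplet Spack detects on this host
$ spack arch --target   # just the microarchitecture, e.g. zen2 on a g5 instance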

spack:
   view: true
   specs:
   - gromacs@2021.5 +cuda ~mpi
   packages:
      all:
         target: [x86_64_v3]
   concretizer:
      unify: true
      targets:
         granularity: generic
   container:
      os_packages:
         final:
         - libgomp
      extra_instructions:
         final: RUN (echo /opt/view/lib && echo /opt/view/lib64) > /etc/ld.so.conf.d/spack-view.conf && ldconfig
      images:
         os: amazonlinux:2
         spack:
            ref: develop
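
Generating the recipe from that environment is a one-liner - shown here as a rough sketch from memory, run in the directory that holds the spack.yaml:

$ spack containerize > Dockerfile
$ docker build -t quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi .

The tweaked Dockerfile:
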
FROM amazonlinux:2 as bootstrap

ENV SPACK_ROOT=/opt/spack \
   CURRENTLY_BUILDING_DOCKER_IMAGE=1 \
   container=docker

RUN yum update -y \
&& yum groupinstall -y "Development Tools" \
&& yum install -y \
      curl \
      findutils \
      gcc-c++ \
      gcc \
      gcc-gfortran \
      git \
      gnupg2 \
      hostname \
      iproute \
      make \
      patch \
      python3 \
      python3-pip \
      python3-setuptools \
      unzip \
&& pip3 install boto3 \
&& rm -rf /var/cache/yum \
&& yum clean all

RUN mkdir $SPACK_ROOT && cd $SPACK_ROOT && \
   git clone https://github.com/spack/spack.git . && git fetch origin develop:container_branch && git checkout container_branch  && \
   mkdir -p $SPACK_ROOT/opt/spack

RUN ln -s $SPACK_ROOT/share/spack/docker/entrypoint.bash \
         /usr/local/bin/docker-shell \
&& ln -s $SPACK_ROOT/share/spack/docker/entrypoint.bash \
         /usr/local/bin/interactive-shell \
&& ln -s $SPACK_ROOT/share/spack/docker/entrypoint.bash \
         /usr/local/bin/spack-env

RUN mkdir -p /root/.spack \
&& cp $SPACK_ROOT/share/spack/docker/modules.yaml \
      /root/.spack/modules.yaml \
&& rm -rf /root/*.* /run/nologin $SPACK_ROOT/.git

# [WORKAROUND]
# https://superuser.com/questions/1241548/
#     xubuntu-16-04-ttyname-failed-inappropriate-ioctl-for-device#1253889
RUN [ -f ~/.profile ]                                               \
&& sed -i 's/mesg n/( tty -s \\&\\& mesg n || true )/g' ~/.profile \
|| true


WORKDIR /root
SHELL ["docker-shell"]

# Creates the package cache
RUN spack bootstrap now && spack spec hdf5+mpi

ENTRYPOINT ["/bin/bash", "/opt/spack/share/spack/docker/entrypoint.bash"]
CMD ["interactive-shell"]

# Build stage with Spack pre-installed and ready to be used
FROM bootstrap as builder


# What we want to install and how we want to install it
# is specified in a manifest file (spack.yaml)
RUN mkdir /opt/spack-environment \
&&  (echo "spack:" \
&&   echo "  view: /opt/view" \
&&   echo "  specs:" \
&&   echo "  - gromacs@2021.5 +cuda ~mpi" \
&&   echo "  packages:" \
&&   echo "    all:" \
&&   echo "      target:" \
&&   echo "      - x86_64_v3" \
&&   echo "  concretizer:" \
&&   echo "    unify: true" \
&&   echo "    targets:" \
&&   echo "      granularity: generic" \
&&   echo "  config:" \
&&   echo "    install_tree: /opt/software") > /opt/spack-environment/spack.yaml

# Install the software, remove unnecessary deps
RUN cd /opt/spack-environment && spack env activate . && spack install --fail-fast && spack gc -y

# Strip all the binaries
RUN find -L /opt/view/* -type f -exec readlink -f '{}' \; | \
   xargs file -i | \
   grep 'charset=binary' | \
   grep 'x-executable\|x-archive\|x-sharedlib' | \
   grep -v "cuda-" |\ # (1)
   awk -F: '{print $1}' | xargs strip -s 

# Modifications to the environment that are necessary to run
RUN cd /opt/spack-environment && \
   spack env activate --sh -d . >> /etc/profile.d/z10_spack_environment.sh

# Bare OS image to run the installed executables
FROM amazonlinux:2

COPY --from=builder /opt/spack-environment /opt/spack-environment
COPY --from=builder /opt/software /opt/software
COPY --from=builder /opt/._view /opt/._view
COPY --from=builder /opt/view /opt/view
COPY --from=builder /etc/profile.d/z10_spack_environment.sh /etc/profile.d/z10_spack_environment.sh

RUN yum update -y && amazon-linux-extras install epel -y \
&& yum install -y libgomp \
&& rm -rf /var/cache/yum  && yum clean all

RUN (echo /opt/view/lib && echo /opt/view/lib64) > /etc/ld.so.conf.d/spack-view.conf && ldconfig
ENTRYPOINT ["/bin/bash", "--rcfile", "/etc/profile", "-l", "-c", "$*", "--" ]
CMD [ "/bin/bash" ]
  1. Somehow the CUDA packages should not be stripped; I sneaked in this little grep -v to exclude them.
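
A cheap functional check after the build - in the spirit of statement 7 ('Add functional testing logic') from the paper above - is to make sure the stripped binaries still start. A minimal sketch:

$ docker run --rm quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi gmx --version
# the argument list is wrapped by the ENTRYPOINT above, so /etc/profile gets
# sourced and gmx is found on the PATH provided by the Spack view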
