
February 2, 2023 • qnib • 15 min read

HPC3: Expected Container Behaviour

100% human-written — no AI tools are used to write these posts.

[Figure: hpc3 container]

The last blog post introduced the HPC Container Conformance (HPC3) project - a project to provide guidance on how to build and annotate container images for HPC use cases.

For HPC3 we’ll need to cover two main parts (IMHO) first:

  1. Entrypoint/Cmd relationship: How do interactive users and batch systems expect a container to behave? We need to make sure that a container works with docker run, singularity run and podman run out of the box (assuming the engine configuration is already done) - see the sketch below the list.
  2. Annotation Baseline: Which annotations do we need and want? Some are mandatory and some are optional.
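To make the first point concrete, here is a minimal sketch of the behaviour we are after. The image is one of the GROMACS builds used later in this post; the commands are illustrative rather than tested output:

```bash
# No arguments: the container should drop into an interactive shell
# with the application environment already set up.
$ docker run -ti --rm quay.io/cqnib/gromacs/gcc-7.3.1/2021.5/tmpi:multi-arch

# With arguments: the ENTRYPOINT should hand them over to that same environment,
# so batch scripts can call the tool directly.
$ docker run -t --rm quay.io/cqnib/gromacs/gcc-7.3.1/2021.5/tmpi:multi-arch gmx --version

# The same expectation holds for the other engines (untested here):
$ podman run -ti --rm quay.io/cqnib/gromacs/gcc-7.3.1/2021.5/tmpi:multi-arch
$ singularity run docker://quay.io/cqnib/gromacs/gcc-7.3.1/2021.5/tmpi:multi-arch
```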

This blog post is going to set a baseline in terms of Expected Container Behaviour to make sure that we can swap HPC3-conformant images of the same application and - ideally - have the job run in the same way.

Bioinformatics Paper from 2019

The bioinformatics folks published an excellent paper back in 2019 (link).

[Figure: bio paper]

The paper hammers home a lot of points that are valid for us as an HPC community.

  1. A package first
  2. One tool, one container
  3. Versions should be explicit
  4. Avoid using ENTRYPOINT
  5. Reduce size as much as possible
  6. Keep data outside of the container
  7. Add functional testing logic
  8. Check the license of software
  9. Make your package discoverable
  10. Provide reproducible and documented builds
  11. Provide helpful usage message

I pretty much agree with all of the above; we should strive to match these points - but I reckon we need to focus on the essentials first and, once we have those, start ticking the remaining boxes off as we go along.
Let’s first go over the ENTRYPOINT/CMD relationship.

Entrypoint/Cmd

First, we are going to define how an HPC container is expected to behave when it’s instantiated. Containers might be used in different ways:

  1. Interactively on a low(er)-powered laptop to debug a workflow.
  2. Interactively on a compute instance to debug initial performance.
  3. In a batch script on a compute node, with different runtimes/engines - say Sarus and Singularity.
  4. Using a workflow manager like Nextflow.

Interactively

Laptop

For all of the above we need all containers to act in a predictable, reproducible way. In our first iteration on HPC3 we propose that all containers drop into an interactive shell if you do not pass a command. With that I can do a quick check of a GROMACS container on my laptop.

```bash
$ docker run -ti --rm -v $(pwd):/input -v /scratch -w /scratch \
         quay.io/cqnib/gromacs/gcc-7.3.1/2021.5/tmpi:multi-arch
bash-4.2# ls /input/
benchMEM.tpr  benchRIB.tpr
bash-4.2#
```

Input data is available, let’s do a 500 step simulation run.

Snippet

  ```bash
  bash-4.2# gmx mdrun -s /input/benchMEM.tpr -nsteps 500
                    :-) GROMACS - gmx mdrun, 2021.5-spack (-:
  GROMACS:      gmx mdrun, version 2021.5-spack
  *snip*
  Command line:
  gmx mdrun -s /input/benchMEM.tpr -nsteps 500
  *snip*
  Using 1 MPI thread
  Using 8 OpenMP threads

  starting mdrun 'Great Red Oystrich Makes All Chemists Sane'
  500 steps,      1.0 ps.

  Writing final coordinates.
                 Core t (s)   Wall t (s)        (%)
        Time:       39.166        4.896      799.9
                 (ns/day)    (hour/ns)
  Performance:       17.682        1.357
  ```

Full Output

  ```bash
  bash-4.2# gmx mdrun -s /input/benchMEM.tpr -nsteps 500
                    :-) GROMACS - gmx mdrun, 2021.5-spack (-:

                             GROMACS is written by:
     Andrey Alekseenko              Emile Apol              Rossen Apostolov
           Paul Bauer           Herman J.C. Berendsen           Par Bjelkmar
        Christian Blau           Viacheslav Bolnykh             Kevin Boyd
     Aldert van Buuren           Rudi van Drunen             Anton Feenstra
     Gilles Gouaillardet             Alan Gray               Gerrit Groenhof
        Anca Hamuraru            Vincent Hindriksen          M. Eric Irrgang
        Aleksei Iupinov           Christoph Junghans             Joe Jordan
     Dimitrios Karkoulis            Peter Kasson                Jiri Kraus
        Carsten Kutzner              Per Larsson              Justin A. Lemkul
        Viveca Lindahl            Magnus Lundborg             Erik Marklund
        Pascal Merz             Pieter Meulenhoff            Teemu Murtola
        Szilard Pall               Sander Pronk              Roland Schulz
        Michael Shirts            Alexey Shvetsov             Alfons Sijbers
        Peter Tieleman              Jon Vincent              Teemu Virolainen
     Christian Wennberg            Maarten Wolf              Artem Zhmurov
                             and the project leaders:
        Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

  Copyright (c) 1991-2000, University of Groningen, The Netherlands.
  Copyright (c) 2001-2019, The GROMACS development team at
  Uppsala University, Stockholm University and
  the Royal Institute of Technology, Sweden.
  check out http://www.gromacs.org for more information.

  GROMACS is free software; you can redistribute it and/or modify it
  under the terms of the GNU Lesser General Public License
  as published by the Free Software Foundation; either version 2.1
  of the License, or (at your option) any later version.

  GROMACS:      gmx mdrun, version 2021.5-spack
  Executable:   /opt/software/linux-amzn2-graviton2/gcc-7.3.1/gromacs-2021.5-5cqqitxyudlekqp3psur4ciswtfiyxdt/bin/gmx
  Data prefix:  /opt/software/linux-amzn2-graviton2/gcc-7.3.1/gromacs-2021.5-5cqqitxyudlekqp3psur4ciswtfiyxdt
  Working dir:  /scratch
  Command line:
  gmx mdrun -s /input/benchMEM.tpr -nsteps 500

  Reading file /input/benchMEM.tpr, VERSION 4.6.3-dev-20130701-6e3ae9e (single precision)
  Note: file tpx version 83, software tpx version 122
  Overriding nsteps with value passed on the command line: 500 steps, 1 ps
  Changing nstlist from 10 to 80, rlist from 1 to 1.103


  Using 1 MPI thread
  Using 8 OpenMP threads

  starting mdrun 'Great Red Oystrich Makes All Chemists Sane'
  500 steps,      1.0 ps.

  Writing final coordinates.

                 Core t (s)   Wall t (s)        (%)
        Time:       39.166        4.896      799.9
                 (ns/day)    (hour/ns)
  Performance:       17.682        1.357

  GROMACS reminds you: "Working in the Burger Kings, Spitting on your Onion Rings" (Slim Shady)
  ```

Awesome. Now we can move on to a compute node with more compute power to check the performance on a small sample.

Compute Node

Now that we know the container works in general, let’s log into an AWS g5.4xlarge instance with a GPU and run the container interactively.

```bash
$ docker run -ti --rm --gpus all \ # (1)
         -v /shared/input/gromacs:/input \
         -v cache:/cache -w /cache \ # (2)
         quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi
```
  1. nvidia-docker is going to hand over the GPUs of the node
  2. We are going to use a volume as cache, which puts it outside of the container file system

Now that we have a bash prompt, we can run the same command as above and watch it use the GPU.

Snippet

  ```bash
  $ docker run -ti --rm --gpus all \
     -v /shared/input/gromacs:/input \
     -v cache:/cache -w /cache \
     quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi
  bash-4.2# gmx mdrun -s /input/benchMEM.tpr -nsteps 500
                       :-) GROMACS - gmx mdrun, 2021.5 (-:
  *snip*
  1 GPU selected for this run. # (1)
  Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
  PP tasks will do (non-perturbed) short-ranged interactions on the GPU
  PP task will update and constrain coordinates on the CPU
  PME tasks will do all aspects on the GPU
  Using 1 MPI thread
  Using 16 OpenMP threads
  *snip*
                  Core t (s)   Wall t (s)        (%)
        Time:       11.531        0.721     1599.7
                 (ns/day)    (hour/ns)
  Performance:      120.109        0.200
  ```

  1. GROMACS recognizes one GPU - awesome!

Full Output

  ```bash
  $ docker run -ti --rm --gpus all \
     -v /shared/input/gromacs:/input \
     -v cache:/cache -w /cache \
     quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi
  bash-4.2# gmx mdrun -s /input/benchMEM.tpr -nsteps 500
                       :-) GROMACS - gmx mdrun, 2021.5 (-:

                             GROMACS is written by:
     Andrey Alekseenko              Emile Apol              Rossen Apostolov
           Paul Bauer           Herman J.C. Berendsen           Par Bjelkmar
        Christian Blau           Viacheslav Bolnykh             Kevin Boyd
     Aldert van Buuren           Rudi van Drunen             Anton Feenstra
     Gilles Gouaillardet             Alan Gray               Gerrit Groenhof
        Anca Hamuraru            Vincent Hindriksen          M. Eric Irrgang
        Aleksei Iupinov           Christoph Junghans             Joe Jordan
     Dimitrios Karkoulis            Peter Kasson                Jiri Kraus
        Carsten Kutzner              Per Larsson              Justin A. Lemkul
        Viveca Lindahl            Magnus Lundborg             Erik Marklund
        Pascal Merz             Pieter Meulenhoff            Teemu Murtola
        Szilard Pall               Sander Pronk              Roland Schulz
        Michael Shirts            Alexey Shvetsov             Alfons Sijbers
        Peter Tieleman              Jon Vincent              Teemu Virolainen
     Christian Wennberg            Maarten Wolf              Artem Zhmurov
                             and the project leaders:
        Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

  Copyright (c) 1991-2000, University of Groningen, The Netherlands.
  Copyright (c) 2001-2019, The GROMACS development team at
  Uppsala University, Stockholm University and
  the Royal Institute of Technology, Sweden.
  check out http://www.gromacs.org for more information.

  GROMACS is free software; you can redistribute it and/or modify it
  under the terms of the GNU Lesser General Public License
  as published by the Free Software Foundation; either version 2.1
  of the License, or (at your option) any later version.

  GROMACS:      gmx mdrun, version 2021.5
  Executable:   /opt/software/linux-amzn2-x86_64_v3/gcc-7.3.1/gromacs-2021.5-qxjzajo7druqz4urg5nv7jtmdc53olmh/bin/gmx
  Data prefix:  /opt/software/linux-amzn2-x86_64_v3/gcc-7.3.1/gromacs-2021.5-qxjzajo7druqz4urg5nv7jtmdc53olmh
  Working dir:  /cache
  Command line:
  gmx mdrun -s /input/benchMEM.tpr -nsteps 500

  Reading file /input/benchMEM.tpr, VERSION 4.6.3-dev-20130701-6e3ae9e (single precision)
  Note: file tpx version 83, software tpx version 122
  Overriding nsteps with value passed on the command line: 500 steps, 1 ps
  Changing nstlist from 10 to 100, rlist from 1 to 1.125


  1 GPU selected for this run.
  Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
  PP tasks will do (non-perturbed) short-ranged interactions on the GPU
  PP task will update and constrain coordinates on the CPU
  PME tasks will do all aspects on the GPU
  Using 1 MPI thread
  Using 16 OpenMP threads

  starting mdrun 'Great Red Oystrich Makes All Chemists Sane'
  500 steps,      1.0 ps.

  Writing final coordinates.

                 Core t (s)   Wall t (s)        (%)
        Time:       11.574        0.723     1599.7
                 (ns/day)    (hour/ns)
  Performance:      119.659        0.201

  GROMACS reminds you: "All that glitters may not be gold, but at least it contains free electrons." (John Desmond Baernal)
  bash-4.2#
  ```

Batch Job

Nice, that works as well - finally we are going to submit a SLURM job. This is where an HPC runtime/engine would usually come in, but see the note below.

Note

I am going to stick to nvidia-docker since I want to keep this blog post focused on the container and not the engine.

The SLURM submit script is rather simple. It takes the image and executes the docker run command from above (minus stdin).

```bash
#!/bin/bash
#SBATCH --output=/shared/logs/%x_%j.out
set -x  # trace the commands into the job log (the '+ ...' lines in the output below)

mkdir -p /shared/jobs/${SLURM_JOBID} # (1)
cd /shared/jobs/${SLURM_JOBID}

docker run -t --rm --gpus all \
    -v /shared/input/gromacs:/input \
    -v /shared/jobs/${SLURM_JOBID}:/jobdir -w /jobdir \ # (2)
    quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi \
    gmx mdrun -s /input/benchMEM.tpr
```
  1. We’ll create a job directory on the shared storage to keep the result around.
  2. Switch to the /jobdir to store intermediate results.
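Since the goal is that the same image also works with an HPC runtime/engine, the docker run step above should translate without touching the rest of the script. A hedged sketch with Singularity (the SIF file name is hypothetical and the command is untested here):

```bash
# Same job step with Singularity instead of nvidia-docker (sketch, not benchmarked):
# --nv exposes the NVIDIA GPUs, -B bind-mounts the shared input and the job directory.
singularity exec --nv \
    -B /shared/input/gromacs:/input \
    -B /shared/jobs/${SLURM_JOBID}:/jobdir --pwd /jobdir \
    gromacs-2021.5_x86_64_v3-cuda-tmpi.sif \
    gmx mdrun -s /input/benchMEM.tpr
```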

When submitting the job to a partition with g5 instances, it is going to download the container if it is not already present and run the job - no surprises there.

```bash
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
c6a*         up   infinite      2  idle~ c6a-dy-c6a-4xl-[1-2]
g5           up   infinite      1  idle~ g5-dy-g5-4xl-2
g5           up   infinite      1    mix g5-dy-g5-4xl-1
```

Submit!

```bash
$ sbatch -N1 -p g5 gromacs-g5.sbatch
Submitted batch job 22
```

Snippet

  ```bash
  $ cat /shared/logs/gromacs-g5.sbatch_26.out
  + mkdir -p /shared/jobs/26
  + cd /shared/jobs/26
  + docker run -t --rm --gpus all -v /shared/input/gromacs:/input -v /shared/jobs/26:/jobdir -w /jobdir quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi gmx mdrun -s /input/benchMEM.tpr
                  :-) GROMACS - gmx mdrun, 2021.5 (-:
  *snip*
  1 GPU selected for this run. # (1)
  Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
  PP tasks will do (non-perturbed) short-ranged interactions on the GPU
  PP task will update and constrain coordinates on the CPU
  PME tasks will do all aspects on the GPU
  Using 1 MPI thread
  Using 16 OpenMP threads
  *snip*
                 Core t (s)   Wall t (s)        (%)
        Time:      321.190       20.078     1599.7
                 (ns/day)    (hour/ns)
  Performance:       86.071        0.279
  ```

  1. GROMACS recognizes one GPU - awesome!

Full Output

  ```bash
  $ cat /shared/logs/gromacs-g5.sbatch_26.out
  + mkdir -p /shared/jobs/26
  + cd /shared/jobs/26
  + docker run -t --rm --gpus all -v /shared/input/gromacs:/input -v /shared/jobs/26:/jobdir -w /jobdir quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi gmx mdrun -s /input/benchMEM.tpr
                  :-) GROMACS - gmx mdrun, 2021.5 (-:

                             GROMACS is written by:
     Andrey Alekseenko              Emile Apol              Rossen Apostolov
           Paul Bauer           Herman J.C. Berendsen           Par Bjelkmar
        Christian Blau           Viacheslav Bolnykh             Kevin Boyd
     Aldert van Buuren           Rudi van Drunen             Anton Feenstra
     Gilles Gouaillardet             Alan Gray               Gerrit Groenhof
        Anca Hamuraru            Vincent Hindriksen          M. Eric Irrgang
        Aleksei Iupinov           Christoph Junghans             Joe Jordan
     Dimitrios Karkoulis            Peter Kasson                Jiri Kraus
        Carsten Kutzner              Per Larsson              Justin A. Lemkul
        Viveca Lindahl            Magnus Lundborg             Erik Marklund
        Pascal Merz             Pieter Meulenhoff            Teemu Murtola
        Szilard Pall               Sander Pronk              Roland Schulz
        Michael Shirts            Alexey Shvetsov             Alfons Sijbers
        Peter Tieleman              Jon Vincent              Teemu Virolainen
     Christian Wennberg            Maarten Wolf              Artem Zhmurov
                             and the project leaders:
        Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

  Copyright (c) 1991-2000, University of Groningen, The Netherlands.
  Copyright (c) 2001-2019, The GROMACS development team at
  Uppsala University, Stockholm University and
  the Royal Institute of Technology, Sweden.
  check out http://www.gromacs.org for more information.

  GROMACS is free software; you can redistribute it and/or modify it
  under the terms of the GNU Lesser General Public License
  as published by the Free Software Foundation; either version 2.1
  of the License, or (at your option) any later version.

  GROMACS:      gmx mdrun, version 2021.5
  Executable:   /opt/software/linux-amzn2-x86_64_v3/gcc-7.3.1/gromacs-2021.5-qxjzajo7druqz4urg5nv7jtmdc53olmh/bin/gmx
  Data prefix:  /opt/software/linux-amzn2-x86_64_v3/gcc-7.3.1/gromacs-2021.5-qxjzajo7druqz4urg5nv7jtmdc53olmh
  Working dir:  /jobdir
  Command line:
  gmx mdrun -s /input/benchMEM.tpr

  Reading file /input/benchMEM.tpr, VERSION 4.6.3-dev-20130701-6e3ae9e (single precision)
  Note: file tpx version 83, software tpx version 122
  Changing nstlist from 10 to 100, rlist from 1 to 1.125

  1 GPU selected for this run.
  Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
  PP tasks will do (non-perturbed) short-ranged interactions on the GPU
  PP task will update and constrain coordinates on the CPU
  PME tasks will do all aspects on the GPU
  Using 1 MPI thread
  Using 16 OpenMP threads

  starting mdrun 'Great Red Oystrich Makes All Chemists Sane'
  10000 steps,     20.0 ps.

  Writing final coordinates.

                 Core t (s)   Wall t (s)        (%)
        Time:      321.190       20.078     1599.7
                 (ns/day)    (hour/ns)
  Performance:       86.071        0.279

  GROMACS reminds you: "You Own the Sun" (Throwing Muses)
  ```

Other schedulers

Beyond what we already did, we might want to run the container image in Kubernetes, or in Nextflow for bioinformatics workflows. We need to keep that in mind when building and annotating HPC containers. Some of these tools ignore the ENTRYPOINT and expect the container not to rely on an environment being set up just before the command runs. It would be great if we could get rid of any tweaking of the default user environment completely.
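A quick way to check whether an image relies on its ENTRYPOINT to set up the environment is to bypass it explicitly. This is only a sketch using the image from above; if it fails while the default invocation works, the image depends on login-shell tweaks that Kubernetes or Nextflow may not perform:

```bash
# Bypass the ENTRYPOINT (and with it /etc/profile) and call the binary directly.
$ docker run --rm --entrypoint gmx \
    quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi --version
```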

Conclusion

I highly recommend reading the recommendations the bioinformatics community came up with. They are a good baseline for us to take into account mid-term.
The bioinformatics folks have an advantage though: a lot of their tooling and workflow systems are based on containers, and with BioContainers they have a huge repository of all their tools. They need to herd the cats - we need to convince all the cats first. 😃

Thus, we should focus first on providing a nucleus of a repository of HPC applications. Within the HPC3 project we’ll start with GROMACS and PyTorch. All containers onboarded need to be swappable, so that an end-user can create a submission script once and ‘just’ switch to a different container while the rest of the logic stays the same.
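Swappability can be as simple as lifting the image reference out of the job logic. A minimal sketch along the lines of the submit script above (the IMAGE variable is my own naming, not part of HPC3):

```bash
#!/bin/bash
#SBATCH --output=/shared/logs/%x_%j.out

# The only line a user should have to touch when swapping HPC3-conformant
# images of the same application:
IMAGE="${IMAGE:-quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi}"

mkdir -p /shared/jobs/${SLURM_JOBID}
cd /shared/jobs/${SLURM_JOBID}

docker run -t --rm --gpus all \
    -v /shared/input/gromacs:/input \
    -v /shared/jobs/${SLURM_JOBID}:/jobdir -w /jobdir \
    "${IMAGE}" \
    gmx mdrun -s /input/benchMEM.tpr
```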

Appendix

Spack

I used the develop branch of Spack and ran spack containerize using the spack.yaml below. I needed to tweak the resulting Dockerfile a bit; I’ll annotate it below. And since I blogged about Dockerfile frontends before - there is a lot of cool caching stuff we could do here.

I built the image on a g5.4xlarge instance - after all, I want to run it on that exact instance anyway. Building on Skylake gave me x86_64_v4 binaries, which segfaulted on the g5 instance (zen2) - it took me some time to remember that. 😃
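For reference, the workflow was roughly the following (a sketch - the exact Spack output depends on the Spack version in use):

```bash
# In the directory that holds the spack.yaml below:
spack containerize > Dockerfile   # renders the multi-stage Dockerfile shown further down

# Sanity check on the build host - the g5 instance reports a zen2 target,
# so x86_64_v3 (and not v4) is the safe generic level:
spack arch

# Build directly on the g5.4xlarge instance:
docker build -t quay.io/cqnib/gromacs-2021.5_gcc-7.3.1:x86_64_v3-cuda-tmpi .
```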

spack.yaml

  ```yaml
  spack:
    view: true
    specs:
    - gromacs@2021.5 +cuda ~mpi
    packages:
      all:
        target: [x86_64_v3]
    concretizer:
      unify: true
      targets:
        granularity: generic
    container:
      os_packages:
        final:
        - libgomp
      extra_instructions:
        final: |
          RUN (echo /opt/view/lib && echo /opt/view/lib64) > /etc/ld.so.conf.d/spack-view.conf && ldconfig
      images:
        os: amazonlinux:2
        spack:
          ref: develop
  ```

Dockerfile

  ```dockerfile
  FROM amazonlinux:2 as bootstrap

  ENV SPACK_ROOT=/opt/spack \
     CURRENTLY_BUILDING_DOCKER_IMAGE=1 \
     container=docker

  RUN yum update -y \
  && yum groupinstall -y "Development Tools" \
  && yum install -y \
        curl \
        findutils \
        gcc-c++ \
        gcc \
        gcc-gfortran \
        git \
        gnupg2 \
        hostname \
        iproute \
        make \
        patch \
        python3 \
        python3-pip \
        python3-setuptools \
        unzip \
  && pip3 install boto3 \
  && rm -rf /var/cache/yum \
  && yum clean all

  RUN mkdir $SPACK_ROOT && cd $SPACK_ROOT && \
     git clone https://github.com/spack/spack.git . && git fetch origin develop:container_branch && git checkout container_branch  && \
     mkdir -p $SPACK_ROOT/opt/spack

  RUN ln -s $SPACK_ROOT/share/spack/docker/entrypoint.bash \
           /usr/local/bin/docker-shell \
  && ln -s $SPACK_ROOT/share/spack/docker/entrypoint.bash \
           /usr/local/bin/interactive-shell \
  && ln -s $SPACK_ROOT/share/spack/docker/entrypoint.bash \
           /usr/local/bin/spack-env

  RUN mkdir -p /root/.spack \
  && cp $SPACK_ROOT/share/spack/docker/modules.yaml \
        /root/.spack/modules.yaml \
  && rm -rf /root/*.* /run/nologin $SPACK_ROOT/.git

  # [WORKAROUND]
  # https://superuser.com/questions/1241548/
  #     xubuntu-16-04-ttyname-failed-inappropriate-ioctl-for-device#1253889
  RUN [ -f ~/.profile ]                                               \
  && sed -i 's/mesg n/( tty -s \\&\\& mesg n || true )/g' ~/.profile \
  || true


  WORKDIR /root
  SHELL ["docker-shell"]

  # Creates the package cache
  RUN spack bootstrap now && spack spec hdf5+mpi

  ENTRYPOINT ["/bin/bash", "/opt/spack/share/spack/docker/entrypoint.bash"]
  CMD ["interactive-shell"]

  # Build stage with Spack pre-installed and ready to be used
  FROM bootstrap as builder


  # What we want to install and how we want to install it
  # is specified in a manifest file (spack.yaml)
  RUN mkdir /opt/spack-environment \
  &&  (echo "spack:" \
  &&   echo "  view: /opt/view" \
  &&   echo "  specs:" \
  &&   echo "  - gromacs@2021.5 +cuda ~mpi" \
  &&   echo "  packages:" \
  &&   echo "    all:" \
  &&   echo "      target:" \
  &&   echo "      - x86_64_v3" \
  &&   echo "  concretizer:" \
  &&   echo "    unify: true" \
  &&   echo "    targets:" \
  &&   echo "      granularity: generic" \
  &&   echo "  config:" \
  &&   echo "    install_tree: /opt/software") > /opt/spack-environment/spack.yaml

  # Install the software, remove unnecessary deps
  RUN cd /opt/spack-environment && spack env activate . && spack install --fail-fast && spack gc -y

  # Strip all the binaries
  RUN find -L /opt/view/* -type f -exec readlink -f '{}' \; | \
     xargs file -i | \
     grep 'charset=binary' | \
     grep 'x-executable\|x-archive\|x-sharedlib' | \
     grep -v "cuda-" |\ # (1)
     awk -F: '{print $1}' | xargs strip -s

  # Modifications to the environment that are necessary to run
  RUN cd /opt/spack-environment && \
     spack env activate --sh -d . >> /etc/profile.d/z10_spack_environment.sh

  # Bare OS image to run the installed executables
  FROM amazonlinux:2

  COPY --from=builder /opt/spack-environment /opt/spack-environment
  COPY --from=builder /opt/software /opt/software
  COPY --from=builder /opt/._view /opt/._view
  COPY --from=builder /opt/view /opt/view
  COPY --from=builder /etc/profile.d/z10_spack_environment.sh /etc/profile.d/z10_spack_environment.sh

  RUN yum update -y && amazon-linux-extras install epel -y \
  && yum install -y libgomp \
  && rm -rf /var/cache/yum  && yum clean all

  RUN (echo /opt/view/lib && echo /opt/view/lib64) > /etc/ld.so.conf.d/spack-view.conf && ldconfig
  ENTRYPOINT ["/bin/bash", "--rcfile", "/etc/profile", "-l", "-c", "$*", "--" ]
  CMD [ "/bin/bash" ]
  ```

  1. Somehow the CUDA packages do not survive being stripped, so I sneaked in this little `grep -v` to exclude them.