In the previous post I explained how hardware-optimized images are used to get the best performance and functionality out of a node.

## ReCap

Running binaries compiled only for generic x86-64 does not give you all the nice CPU flags:

```
$ docker run --rm -ti --device=/dev/nvidia{0,ctl,-uvm} qnib/cv-tf-dev:1.12.0
Using TensorFlow backend.
Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
```

An image compiled for a Broadwell CPU does, but the image used here includes the CUDA toolkit for CUDA 9.0, while the host provides the CUDA driver for CUDA 9.2:

```
$ docker run --rm -ti --device=/dev/nvidia{0,ctl,-uvm} qnib/cv-nccl90-tf-dev:broadwell_1.12.0
Using TensorFlow backend.
libcuda reported version is: 390.30.0
kernel reported version is: 396.44.0
kernel version 396.44.0 does not match DSO version 390.30.0 -- cannot find working devices in this configuration
[]
```

An image built with CUDA 9.2 gets us one step closer. Unfortunately, TensorFlow is compiled with default flags, which require the latest GPUs (>NVIDIA P100):

```
$ docker run --rm -ti --device=/dev/nvidia{0,ctl,-uvm} qnib/cv-nccl92-tf-dev:broadwell_1.12.0
Using TensorFlow backend.
Ignoring visible gpu device (device: 0, name: Tesla M60, pci bus id: 0000:00:1e.0, compute capability: 5.2) with Cuda compute capability 5.2. The minimum required Cuda capability is 7.0.
[]
```

What is needed is an image compiled with `CUDA_COMPUTE_CAPABILITIES=5.2`:

```
$ docker run --rm -ti --device=/dev/nvidia{0,ctl,-uvm} qnib/cv-nccl92-tf-dev:broadwell_nvcap52_1.12.0
Using TensorFlow backend.
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6723 MB memory) -> physical GPU (device: 0, name: Tesla M60, pci bus id: 0000:00:1e.0, compute capability: 5.2)
['/job:localhost/replica:0/task:0/device:GPU:0']
```

## Naming Sucks

This has been possible for a long time now, and it did not fly because incorporating the target hardware into image names and tags just plain sucks. It does not really help when submitting a job that gets scheduled on a Broadwell with an M60 OR a Skylake with two V100s. For this to work, one needs to know in advance where the job is going to be scheduled, and thus the scheduling needs to be constrained so that the workload only gets scheduled on a node that matches the image.

## Platform FTW

This is not the first time this problem has been solved, though. Official base images designed to run on multiple platforms already work it out through ManifestLists. A ManifestList is just an index of images identified by a platform object. The tool manifest-tool (and `docker manifest`, by the way) allows this to be specified using a simple YAML file:

```yaml
image: myprivreg:5000/someimage:latest
manifests:
  - image: myprivreg:5000/someimage:ppc64le
    platform:
      architecture: ppc64le
      os: linux
  - image: myprivreg:5000/someimage:amd64
    platform:
      architecture: amd64
      os: linux
```
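Pushing such a spec to the registry is what actually creates the ManifestList. A minimal sketch, assuming the spec above is saved as `spec.yml` (the filename is my choice) and the CLI is logged in to the registry:

```
# manifest-tool reads the spec and pushes a ManifestList to the registry
$ manifest-tool push from-spec spec.yml

# Roughly the same with the experimental docker CLI:
$ docker manifest create myprivreg:5000/someimage:latest \
    myprivreg:5000/someimage:amd64 \
    myprivreg:5000/someimage:ppc64le
$ docker manifest push myprivreg:5000/someimage:latest
```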
On the consuming side, `docker pull` has an experimental `--platform` flag. Downloading the PowerPC version of ubuntu? Just do:

```
$ docker pull --platform=linux/ppc64le ubuntu
Using default tag: latest
latest: Pulling from library/ubuntu
2a9179d9b269: Pull complete
8fe609a92e3f: Pull complete
b726957e1026: Pull complete
42ba7c91fb87: Pull complete
Digest: sha256:7a47ccc3bbe8a451b500d2b53104868b46d60ee8f5b35a24b41a86077c650210
Status: Downloaded newer image for ubuntu:latest
```

Granted, that does not make much sense in the context of CPU architectures, as you won't be able to run this image on AMD64.

```
$ docker run -ti --rm ubuntu echo Huhu
standard_init_linux.go:207: exec user process caused "exec format error"
$ docker pull --platform=linux/amd64 ubuntu
Using default tag: latest
latest: Pulling from library/ubuntu
Digest: sha256:7a47ccc3bbe8a451b500d2b53104868b46d60ee8f5b35a24b41a86077c650210
Status: Downloaded newer image for ubuntu:latest
$ docker run -ti --rm ubuntu echo Huhu
Huhu
```

But still... :)

## Platform Applied

In the context of what is discussed here, I incorporate the different aspects of the images into one 'meta' image (a.k.a. ManifestList). I compacted the YAML a bit so it does not use too much space.

```yaml
image: qnib/cv-tf:1.12.0-rev9
manifests:
  - image: qnib/cv-tf-dev:1.12.0-rev11
    platform:
      architecture: amd64
      os: linux
  - image: qnib/cv-tf-dev:skylake_1.12.0-rev6
    platform:
      features:
      - skylake
  - image: qnib/cv-nccl90-tf-dev:1.12.0-rev1
    platform:
      features:
      - nvidia-390-30
  - image: qnib/cv-nccl92-tf-dev:1.12.0-rev11
    platform:
      features:
      - nvidia-396-44
  - image: qnib/cv-nccl90-tf-dev:broadwell_1.12.0-rev2
    platform:
      features:
      - broadwell
      - nvidia-390-30
  - image: qnib/cv-nccl92-tf-dev:broadwell_1.12.0-rev8
    platform:
      features:
      - broadwell
      - nvidia-396-44
  - image: qnib/cv-nccl92-tf-dev:skylake_1.12.0-rev6
    platform:
      features:
      - nvidia-396-44
      - skylake
  - image: qnib/cv-nccl92-tf-dev:skylake512_1.12.0-rev8
    platform:
      features:
      - nvidia-396-44
      - skylake512
  - image: qnib/cv-nccl92-tf-dev:nvcap52_1.12.0-rev3
    platform:
      features:
      - nv-compute-5-2
      - nvidia-396-44
  - image: qnib/cv-nccl92-tf-dev:nvcap37_1.12.0-rev4
    platform:
      features:
      - nv-compute-3-7
      - nvidia-396-44
  - image: qnib/cv-nccl92-tf-dev:broadwell_nvcap52_1.12.0-rev2
    platform:
      features:
      - broadwell
      - nv-compute-5-2
      - nvidia-396-44
```

This ManifestList can now be used to download the correct image via `--platform` (with a little change to the engine):

```
$ docker pull --platform=linux/amd64:broadwell:nv-compute-5-2:nvidia-396-44 qnib/cv-tf:1.12.0-rev9
1.12.0-rev9: Pulling from qnib/cv-tf
Digest: sha256:bb3ffb86b26892c03667544a7ec296ea0f8bc76842adb4d702bf32baacdc0221
Status: Downloaded newer image for qnib/cv-tf:1.12.0-rev9
```

This results in the same image as before:

```
$ docker image inspect -f '{{.Id}}' qnib/cv-nccl92-tf-dev:broadwell_nvcap52_1.12.0-rev2
sha256:21894c739c326d6f3942dfcf36cb9afb73f951f9aafab6b049d273323a0429e8
$ docker image inspect -f '{{.Id}}' qnib/cv-tf:1.12.0-rev9
sha256:21894c739c326d6f3942dfcf36cb9afb73f951f9aafab6b049d273323a0429e8
```

## Engine Configuration

To be practical, this needs to be configured at the engine level, so that my TensorFlow job specifies the generic name `qnib/cv-tf:1.12.0-rev9` and the engine downloads the correct image for the system it runs on. One possible idea is to put it into the `daemon.json`, like this:

```
$ sudo cat /etc/docker/daemon.json
{
  "debug": true,
  "tls": true,
  "tlscacert": "/etc/docker/ca.pem",
  "tlscert": "/etc/docker/cert.pem",
  "tlskey": "/etc/docker/key.pem",
  "tlsverify": true,
  "experimental": true,
  "platform-features": [
    "broadwell",
    "nv-compute-5-2",
    "nvidia-396-44"
  ]
}
```

With this in place, the engine adds the platform features automatically, quite like the manual download using `--platform=linux/amd64:broadwell:nv-compute-5-2:nvidia-396-44`. There is no need to specify different image names to make sure the correct image is scheduled.
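To sketch the intended effect (hypothetical, assuming the patched engine from above on a systemd-based host): after a daemon restart, a plain pull of the generic name should resolve to the matching image, no `--platform` flag needed.

```
# Restart the engine so the new daemon.json takes effect
$ sudo systemctl restart docker

# Plain pull of the generic name ...
$ docker pull qnib/cv-tf:1.12.0-rev9

# ... which should now resolve to the broadwell/nv-compute-5-2/nvidia-396-44
# image, i.e. report the same ID as the explicit tag above
$ docker image inspect -f '{{.Id}}' qnib/cv-tf:1.12.0-rev9
```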
The following K8s job will fetch the correct image depending on the engine it is scheduled on.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow
spec:
  backoffLimit: 1
  template:
    spec:
      containers:
      - name: tensorflow
        image: qnib/cv-tf:1.12.0-rev9
        resources:
          limits:
            qnib.org/gpu: 1
      restartPolicy: Never  # required for Jobs; the default "Always" is rejected
```

## Ongoing Discussions

IMHO this is going to improve the reproducible, deterministic, and reliable execution of images that have a dependency on the host, be it a GPU or a CPU. If you want to know more, visit the moby/moby issue and add a :thumbsup: to show your support. :)

## Next

The next blog post will show how I build these images using CI/CD and stay sane.