Docker caching with poetry
Posted by Jeremy Minton

Trying to travel more with my remote work has motivated me to search for ways to reduce network traffic in my workflow - and hopefully save some money on mobile data plans. This would have the additional, although less significant, benefits of reducing build times and reducing energy consumption.

As documented in previous posts, I work in Docker containers, and Docker best practices do help. Well-structured build files cache layers more efficiently; pinning to specific image versions prevents unexpected updates; and prudent choices of base image mean smaller downloads. General best practices, like including only necessary dependencies, can of course help too. But assuming we’re already doing quality work there, caching packages between container builds would be the next step.
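For example, a minimal sketch of a pinned, slim base image might look like the following (the slim variant and the digest placeholder are illustrative, not a recommendation for any particular project):

# Pin to a specific tag of a slim variant; pinning to a digest is stricter still,
# e.g. FROM python:3.10-slim@sha256:<digest>
FROM python:3.10-slim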

To achieve this, I will investigate Docker’s RUN cache. Specifically, I will look at setting this up with poetry for Python packaging, in the context of developing in containers with VSCode’s Dev Containers extension.

The poetry cache

poetry will cache packages by default, as any smart package manager would. The default cache locations are documented here; for example, ~/.cache/pypoetry on Linux.

While this is all we really need to know to start using it with the Docker cache, a few more commands for controlling the poetry cache are useful, and the details of how the cache is managed, although sparsely documented, are interesting.
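For reference, poetry exposes a couple of subcommands for inspecting and clearing this cache:

# List the caches poetry knows about
poetry cache list

# Clear every entry in the named cache (use a name shown by `poetry cache list`, typically PyPI)
poetry cache clear PyPI --all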

Docker’s RUN cache

While layer caching is used by default and is leveraged by the best practices above, a single step can still be quite substantial. Installing Python dependencies is one such case, where a large number of packages are downloaded and installed in one step. Breaking that step into smaller ones is possible with pip, although undesirable for maintainability, but not with poetry, since the entire environment is resolved in the lock process.

The RUN cache solves this problem by providing a cache mount whose contents persist between builds. Specifically, the docs say:

The RUN command supports a specialized cache, which you can use when you need a more fine-grained cache between runs. For example, when installing packages, you don’t always need to fetch all of your packages from the internet each time. You only need the ones that have changed.

To solve this problem, you can use RUN --mount type=cache. Using the explicit cache with the --mount flag keeps the contents of the target directory preserved between builds. When this layer needs to be rebuilt, then it’ll use the apt cache in /var/cache/apt.

Neither this short description nor the Dockerfile reference material gives a clear explanation of when the cache is invalidated. For that reason, we shall run some quick tests.

RUN cache experiments

The experimental setup will use this Dockerfile:

# syntax=docker/dockerfile:1.3
FROM python:3.10 as base

ARG CACHE=/cache
RUN mkdir $CACHE

COPY input.txt .

RUN --mount=type=cache,target=$CACHE \
    if [ -f "$CACHE/tmp.txt" ]; then \
        echo "Quickly reading cached result!"; \
    else  \
        echo "Slow generation of result from scratch."; \
        cat input.txt >> $CACHE/tmp.txt; \
    fi \
    && cat $CACHE/tmp.txt

It prints whether a cached file exists and, if there is no cached file, copies input.txt into the cache. In either case, it then prints the contents of the cache file, new or old.

A clean build of this:

echo "foo" > input.txt; docker build --progress=plain .
...
#12 [base 4/6] RUN --mount=type=cache,target=/cache     if [ -f "/cache/tmp.txt" ]; then         echo "Quickly reading cached result!";     else          echo "Slow generation of result from scratch.";         cat input.txt >> /cache/tmp.txt;     fi     && cat /cache/tmp.txt
#12 0.211 Slow generation of result from scratch.
#12 0.214 foo
#12 DONE 0.2s
...

and running the same command a second time produces,

...
#12 [base 4/6] RUN --mount=type=cache,target=/cache     if [ -f "/cache/tmp.txt" ]; then         echo "Quickly reading cached result!";     else          echo "Slow generation of result from scratch.";         cat input.txt >> /cache/tmp.txt;     fi     && cat /cache/tmp.txt
#12 CACHED

Not what we wanted! To avoid layer caching, change the contents of input.txt so that the COPY layer and all following layers will be rebuilt:

echo "bar" > input.txt; docker build --progress=plain .
...
#12 [base 4/6] RUN --mount=type=cache,target=/cache     if [ -f "/cache/tmp.txt" ]; then         echo "Quickly reading cached result!";     else          echo "Slow generation of result from scratch.";         cat input.txt >> /cache/tmp.txt;     fi     && cat /cache/tmp.txt
#12 0.233 Quickly reading cached result!
#12 0.234 foo
#12 DONE 0.2s
...

Now we can see that the RUN layer was rebuilt, but the value printed from tmp.txt is foo, the cached value, rather than the new value bar - along with, of course, our “Quickly reading cached result!” message.

We are now ready to pose some questions.

Is the RUN cache invalidated between builds?

No, obviously.

The basic procedure of changing the value in input.txt shows that the cached value is reused across build calls when dependencies change. Further, changing the Dockerfile itself, by replacing cat $CACHE/tmp.txt with echo "nada", still results in Quickly reading cached result!, i.e. the cached value being used.

You’d hope this would be the case though, otherwise this wouldn’t be a very useful feature.

Is the RUN cache invalidated between layers?

No. Appending

RUN --mount=type=cache,target=$CACHE ls $CACHE
RUN ls $CACHE

to the base Dockerfile returns

...
#13 [base 5/6] RUN --mount=type=cache,target=/cache ls /cache
#13 0.367 tmp.txt
#13 DONE 0.4s

#14 [base 6/6] RUN ls /cache
#14 DONE 0.3s
...

which shows that the same cache is mounted in every command where --mount is used. It also shows that the cache is not mounted into or copied to the resulting image, as discussed in more detail here.

Is the RUN cache invalidated between different Dockerfiles?

No, and this was a little surprising to me.

Renaming (or moving) the Dockerfile, with or without changing the base image, still results in Quickly reading cached result! and the cached value being printed. This shows that the cache is indexed only by its id, not by the image or Dockerfile. That is an understandable choice given that the RUN cache is designed for package managers, whose caches can safely be shared between builds as long as package ids/hashes are consistent. Without that consistency, though, the build becomes dependent on host state, hence my surprise.

Is the RUN cache invalidated with --no-cache?

Yes.

docker build --no-cache .

results in “Slow generation of result from scratch.” regardless of whether the contents of input.txt have changed.

Is the RUN cache invalidated if mounted in different locations?

No. Appending

RUN --mount=type=cache,target=$CACHE ls $CACHE
RUN --mount=type=cache,target=${CACHE}_foo ls ${CACHE}_foo

to the Dockerfile returns

...
#13 [base 5/6] RUN --mount=type=cache,target=/cache ls /cache
#13 0.367 tmp.txt
#13 DONE 0.4s

#14 [base 6/6] RUN --mount=type=cache,target=/cache_foo ls /cache_foo
#14 DONE 0.3s
...

This shows that a new cache is created for each mount location.

Exploring the id option to --mount

From the docs:

id: Optional ID to identify separate/different caches. Defaults to value of target.

This explains the results from the previous section: the two targets, /cache and /cache_foo, default to different ids and so result in two separate caches.

Explicitly setting the id option allows multiple independent caches to be used at the same location and, conversely, allows the same cache to be used at different locations.

For example, after changing the cache mount in the main RUN command to use id=foo and to write tmp_foo.txt, appending the Dockerfile with

RUN --mount=type=cache,target=$CACHE,id=foo ls $CACHE
RUN --mount=type=cache,target=$CACHE,id=bar ls $CACHE
RUN --mount=type=cache,target=${CACHE}_foo,id=foo ls ${CACHE}_foo
RUN --mount=type=cache,target=${CACHE}_foo,id=bar ls ${CACHE}_foo

produces

#12 [base  7/11] RUN --mount=type=cache,target=/cache,id=foo ls /cache
#12 0.357 tmp_foo.txt
#12 DONE 0.4s

#13 [base  8/11] RUN --mount=type=cache,target=/cache,id=bar ls /cache
#13 DONE 0.4s

#14 [base 10/11] RUN --mount=type=cache,target=/cache_foo,id=foo ls /cache_foo
#14 0.277 tmp_foo.txt
#14 DONE 0.3s

#15 [base 11/11] RUN --mount=type=cache,target=/cache_foo,id=bar ls /cache_foo
#15 DONE 0.4s

The mounts with id=foo show the tmp_foo.txt file regardless of mount location, and the mounts with id=bar show no files, again regardless of location.
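Putting this together for poetry, a minimal sketch of what an install step with a cache mount could look like is given below. This is a sketch rather than a definitive recipe: it assumes poetry is installed into the image with pip, uses the default Linux cache location from earlier, and the id poetry-cache is purely illustrative.

# syntax=docker/dockerfile:1.3
FROM python:3.10 as base

# Install into the system environment so packages end up in image layers,
# not inside a virtualenv under the (non-persisted) cache mount.
ENV POETRY_VIRTUALENVS_CREATE=false
RUN pip install poetry

COPY pyproject.toml poetry.lock ./

# Persist poetry's package cache between builds; id is optional (it defaults
# to the target path) but an explicit name makes the cache easier to identify.
RUN --mount=type=cache,target=/root/.cache/pypoetry,id=poetry-cache \
    poetry install --no-root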

Manually clearing RUN cache

Show Docker disk usage with

docker system df

or use the verbose form,

docker system df -v

which lists individual caches; the RUN caches appear with CACHE TYPE exec.cachemount.

To clear the cache, use docker builder prune. It has the following short help:

Usage:  docker buildx prune

Remove build cache

Options:
  -a, --all                  Include internal/frontend images
      --builder string       Override the configured builder instance
      --filter filter        Provide filter values (e.g., "until=24h")
  -f, --force                Do not prompt for confirmation
      --keep-storage bytes   Amount of disk space to keep for cache
      --verbose              Provide a more verbose output

How to clear caches by id seems to be an open question; maybe there is a filter for it.
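I have not verified it, but BuildKit’s prune filters appear to include type and id keys, so, as an untested sketch, something like the following might clear only the cache mounts:

# Untested assumption: restrict pruning to records whose cache type is exec.cachemount
docker builder prune --filter type=exec.cachemount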

Mounting a build cache as a volume in a running container

Using containers as development environments, it is common to want to install new packages. If the package manager performs full dependency resolution, then this will trigger a full download of all the packages. For this reason, it would be useful to be able to mount a RUN cache from the build into the container to avoid that full download.

This will no doubt be hacky, as it is quite a particular use case and, indeed, I have not found anything in the documentation to achieve it. The best I could find were some instructions in this Stack Overflow post.

In short, you can list Docker’s build cache records, including the anonymous volumes used for cache mounts, with docker buildx du --verbose. Grepping for “cached mount” is a useful way to shorten this list. Then use the record’s id to find the folder in the Docker overlays directory; the exact location is installation dependent, but it will be something like /var/lib/docker/overlay2/{id}. That folder can then be mounted into the container. Unfortunately, the use of hashed ids makes this a bit finicky and possibly not very stable.
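As a rough, untested sketch of those steps (the paths, the image name dev-image, and the mount target for poetry’s cache are assumptions about my setup, not a documented interface):

# Find cache-mount records; their descriptions contain "cached mount <target> ..."
docker buildx du --verbose | grep "cached mount"

# Copy the matching record's ID from the full output into a variable.
CACHE_ID="<id from the output above>"  # placeholder, replace with the real hash

# Bind-mount the corresponding folder (location is installation dependent) into
# a running container at poetry's default cache location.
docker run -it -v /var/lib/docker/overlay2/$CACHE_ID:/root/.cache/pypoetry dev-image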

Future work

  1. For reusing the cache in a running container, I would prefer to find a way to specify the host directory used as the cache to, for example, mount the corresponding package manager cache from the host machine. This may not be possible, though, if that’s where the line on host state dependence is drawn.
  2. The from and source options for RUN caches seem interesting and possibly quite powerful in combination with multi-stage builds. I can’t think of impactful use-cases for this with my build process though, so I would like to experiment with this feature to better understand how it could be used effectively.
  3. Ultimately, the next step change in ease-of-use will be using locally hosted package server proxies. This would separate the package caching and image building concerns, making each easier to achieve and manage.