Trying to travel more with my remote work has motivated me to search for ways to reduce network traffic in my workflow - and hopefully save some money on mobile data plans. This would have the additional, although less significant, benefits of reducing build times and energy consumption.
As documented in previous posts, I work in docker containers and docker best practices do help. Well-structured build files cache layers more efficiently; pinning to specific image versions prevents unexpected updates; and prudent choices of base images can mean smaller downloads. General best practices, like including only necessary dependencies, can of course help too. But assuming we're doing quality work, package caching between container builds would be the next step.
To achieve this, I will investigate Docker's RUN cache. Specifically, I will look at setting this up with poetry Python packaging and in the context of developing in containers with VSCode's Dev Containers extension.
The poetry cache
poetry will cache packages by default - as any smart package manager would. The default cache locations are documented here, for example ~/.cache/pypoetry on Linux.
While this is all we really need to know to start using it with the Docker cache, a few more commands for controlling the poetry cache are useful, and the details about how the cache is managed are interesting, although sparse.
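For example, a few of those commands, run on the host (pypi is the usual default cache name, but poetry cache list will confirm what exists on your setup):
# Where poetry keeps its cache on this machine.
poetry config cache-dir
# The caches poetry currently knows about.
poetry cache list
# Clear one of them entirely.
poetry cache clear pypi --all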
Docker's RUN cache
While layer caching is used by default and leveraged by best practices, single steps can still be quite substantial. Installing Python dependencies is one case where a large number of packages are downloaded and installed in one step. Breaking these steps down is possible with pip, although not desirable for maintainability, but not with poetry, given the entire environment is resolved by the lock process.
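To make the contrast concrete, "breaking the steps down" with pip would mean something like the following contrived sketch, where each package gets its own cached layer at the cost of maintaining the list by hand:
RUN pip install numpy
RUN pip install pandas
RUN pip install requests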
The RUN cache solves this problem by providing a cached volume that persists between builds. Specifically, the docs say:
The RUN command supports a specialized cache, which you can use when you need a more fine-grained cache between runs. For example, when installing packages, you don't always need to fetch all of your packages from the internet each time. You only need the ones that have changed. To solve this problem, you can use RUN --mount type=cache. Using the explicit cache with the --mount flag keeps the contents of the target directory preserved between builds. When this layer needs to be rebuilt, then it'll use the apt cache in /var/cache/apt.
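Applied to poetry, a minimal sketch might look like the following - assuming poetry's default Linux cache location for root, /root/.cache/pypoetry, and a project with a pyproject.toml and poetry.lock (untested here, and the poetry installation method and flags are just one choice):
# syntax=docker/dockerfile:1.3
FROM python:3.10
RUN pip install poetry
WORKDIR /app
COPY pyproject.toml poetry.lock ./
# Persist poetry's package cache between builds so only changed
# packages are downloaded when the lock file changes.
RUN --mount=type=cache,target=/root/.cache/pypoetry \
    poetry install --no-root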
Neither that short description nor the Dockerfile reference material gives a clear explanation of when the cache is invalidated. For that reason, we shall run some quick tests.
RUN cache experiments
The experimental setup will use this Dockerfile:
# syntax=docker/dockerfile:1.3
FROM python:3.10 as base
ARG CACHE=/cache
RUN mkdir $CACHE
COPY input.txt .
RUN --mount=type=cache,target=$CACHE \
    if [ -f "$CACHE/tmp.txt" ]; then \
        echo "Quickly reading cached result!"; \
    else \
        echo "Slow generation of result from scratch."; \
        cat input.txt >> $CACHE/tmp.txt; \
    fi \
    && cat $CACHE/tmp.txt
It prints whether a cache file exists or not and, if there is no cached file, it copies input.txt to a cache file. In either case, it prints the contents of the cache file, new or old.
A clean build of this:
echo "foo" > input.txt; docker build --progress=plain .
...
#12 [base 4/6] RUN --mount=type=cache,target=/cache if [ -f "/cache/tmp.txt" ]; then echo "Quickly reading cached result!"; else echo "Slow generation of result from scratch."; cat input.txt >> /cache/tmp.txt; fi && cat /cache/tmp.txt
#12 0.211 Slow generation of result from scratch.
#12 0.214 foo
#12 DONE 0.2s
...
and running the same command a second time produces,
...
#12 [base 4/6] RUN --mount=type=cache,target=/cache if [ -f "/cache/tmp.txt" ]; then echo "Quickly reading cached result!"; else echo "Slow generation of result from scratch."; cat input.txt >> /cache/tmp.txt; fi && cat /cache/tmp.txt
#12 CACHED
Not what we wanted! To avoid layer caching, update the value of input.txt so the copy command and all following layers will be rebuilt:
echo "bar" > input.txt; docker build --progress=plain .
...
#12 [base 4/6] RUN --mount=type=cache,target=/cache if [ -f "/cache/tmp.txt" ]; then echo "Quickly reading cached result!"; else echo "Slow generation of result from scratch."; cat input.txt >> /cache/tmp.txt; fi && cat /cache/tmp.txt
#12 0.233 Quickly reading cached result!
#12 0.234 foo
#12 DONE 0.2s
...
Now we can see that the command layer was rebuilt, but the printed value from tmp.txt is foo, the cached value, instead of the new value bar - plus, of course, our print statement.
We are now ready to pose some questions.
Is the RUN cache invalidated between builds?
No, obviously. The basic procedure of changing the value in input.txt shows that the cached value is reused between build calls even when the layer's inputs change.
Further, changes to the Dockerfile itself - replacing cat results.txt with echo "nada" - still result in Quickly reading cached result!, i.e. the cached value being used. You'd hope this would be the case though, otherwise this wouldn't be a very useful feature.
Is the RUN cache invalidated between layers?
No. Appending
RUN --mount=type=cache,target=$CACHE ls $CACHE
RUN ls $CACHE
to the base Dockerfile returns
...
#13 [base 5/6] RUN --mount=type=cache,target=/cache ls /cache
#13 0.367 tmp.txt
#13 DONE 0.4s
#14 [base 6/6] RUN ls /cache
#14 DONE 0.3s
...
which shows how the same cache is mounted during commands where --mount is used. It also shows the cache is not mounted or copied to the resulting image, as discussed in more detail here.
Is the RUN cache invalidated between different Dockerfiles?
No, and this was a little surprising to me. Renaming (or moving) the Dockerfile, with or without changing the base image, still results in Quickly reading cached result! and the cached value being printed. This shows that the cache is only indexed by the id and not by the image or Dockerfile. That is an understandable choice given the RUN cache is designed for package managers, whose packages should be common between builds as long as the id/hash of the packages is consistent. Without said consistency, the build becomes dependent on the host state, hence my surprise.
Is the RUN cache invalidated with --no-cache?
Yes.
docker build --no-cache .
results in "Slow generation of result from scratch." regardless of the content of input.txt: changed or unchanged.
Is the RUN cache invalidated if mounted in different locations?
No, appending
RUN --mount=type=cache,target=$CACHE ls $CACHE
RUN --mount=type=cache,target=${CACHE}_foo ls ${CACHE}_foo
to the Dockerfile returns
...
#13 [base 5/6] RUN --mount=type=cache,target=/cache ls /cache
#13 0.367 tmp.txt
#13 DONE 0.4s
#14 [base 6/6] RUN --mount=type=cache,target=/cache_foo ls /cache_foo
#14 DONE 0.3s
...
This shows that a new cache is created for each mount location.
Exploring the id option to --mount
From the docs:
id: Optional ID to identify separate/different caches. Defaults to value of target.
This explains the results from the previous section: the two targets, cache and cache_foo, result in two caches.
Explicitly setting the id option allows multiple independent caches to be used at the same location and, conversely, allows the same cache to be used at different locations.
For example, writing output to tmp_foo.txt using id=foo for the mount command cache, then appending the Dockerfile with
RUN --mount=type=cache,target=$CACHE,id=foo ls $CACHE
RUN --mount=type=cache,target=$CACHE,id=bar ls $CACHE
RUN --mount=type=cache,target=${CACHE}_foo,id=foo ls ${CACHE}_foo
RUN --mount=type=cache,target=${CACHE}_foo,id=bar ls ${CACHE}_foo
produces
#12 [base 7/11] RUN --mount=type=cache,target=/cache,id=foo ls /cache
#12 0.357 tmp_foo.txt
#12 DONE 0.4s
#13 [base 8/11] RUN --mount=type=cache,target=/cache,id=bar ls /cache
#13 DONE 0.4s
#14 [base 10/11] RUN --mount=type=cache,target=/cache_foo,id=foo ls /cache_foo
#14 0.277 tmp_foo.txt
#14 DONE 0.3s
#15 [base 11/11] RUN --mount=type=cache,target=/cache_foo,id=bar ls /cache_foo
#15 DONE 0.4s
id=foo shows the tmp_foo.txt file regardless of mount location, and id=bar shows no files regardless of location.
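As a more practical illustration (a sketch, not something tested here), an explicit id could let build steps that mount the cache at different paths - say, a root-based build step and a non-root dev user - share one poetry cache; uid is another option the cache mount accepts, here keeping the directory writable for that user:
RUN --mount=type=cache,target=/root/.cache/pypoetry,id=pypoetry \
    poetry install --no-root
RUN --mount=type=cache,target=/home/dev/.cache/pypoetry,id=pypoetry,uid=1000 \
    poetry install --no-root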
Manually clearing the RUN cache
Show docker disk usage with
docker system df
or, to show individual caches,
docker system df -v
where the build caches appear with CACHE_TYPE=exec.cachemount.
To clear the cache, use docker builder prune. It has the following short help:
Usage: docker buildx prune
Remove build cache
Options:
-a, --all Include internal/frontend images
--builder string Override the configured builder instance
--filter filter Provide filter values (e.g., "until=24h")
-f, --force Do not prompt for confirmation
--keep-storage bytes Amount of disk space to keep for cache
--verbose Provide a more verbose output
How to clear caches by id? That seems to be an open question. Maybe there is a filter for it.
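There does at least appear to be a filter on the cache record type - I have not verified this, and it clears all RUN cache mounts rather than a specific id:
# Untested sketch: prune only the RUN cache mounts, leaving layer caches alone.
docker builder prune --filter type=exec.cachemount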
Mount a build cache as a volume in a running container
When using containers as development environments, it is common to want to install new packages. If the package manager performs full dependency resolution, then this will trigger a full download of all the packages. For this reason, it would be useful to be able to mount a RUN cache from the build into the container to avoid that full download.
This will no doubt be hacky as it is quite a particular use-case and, indeed, I have not found anything in the documentation to achieve this. The best I could find are some instructions on this Stack Overflow post.
In short, you can find docker artifacts, including the anonymous volumes used for the cache mounts, with docker buildx du --verbose. Grepping on "cached mount" is a useful way to shorten this list. Then use the id to find the folder in the docker overlays directory; the exact path is installation dependent, but it will be something like /var/lib/docker/overlay2/{id}. That folder can then be mounted. Unfortunately, the use of hashed ids makes this a bit finicky and possibly not very stable.
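A rough sketch of that procedure - with a hypothetical id placeholder and the default overlay2 storage driver assumed - might look like:
# List build cache records; the ID fields a few lines above each
# "cached mount" description are the ones of interest.
docker buildx du --verbose | grep -B 6 "cached mount"
# Locate the backing directory for that record (path is installation dependent).
sudo ls /var/lib/docker/overlay2/<id>
# Mount it into a running container, for example as the poetry cache.
docker run -it -v /var/lib/docker/overlay2/<id>:/root/.cache/pypoetry python:3.10 bash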
Future work
- For reusing the cache in a running container, I would prefer to find a way to specify the host directory used as the cache to, for example, mount the corresponding package manager cache from the host machine. This may not be possible, though, if that’s where the line on host state dependence is drawn.
- The from and source options for RUN caches seem interesting and possibly quite powerful in combination with multi-stage builds. I can't think of impactful use-cases for this with my build process though, so I would like to experiment with this feature to better understand how it could be used effectively.
- Ultimately, the next step change in ease-of-use will be using locally hosted package server proxies. This would separate the package caching and image building concerns, making each easier to achieve and manage.
Posted Friday 9 June 2023