The CI Flake
2024-08-16 · Work · Programming

I analyzed a flaky test failure in our Materialize CI today:
I had seen this error already once or twice in the last year, but it was incredibly rare in our Continuous Integration (CI) runs, and never happened locally. As usual, there were more pressing product issues to debug, so I never looked into it. But last week I switched most of our CI tests to run on Hetzner Cloud instead of AWS to save some money. Suddenly this issue started occurring more often in CI, so my thinking was that it must somehow be timing-dependent.
Before investigating, my first instinct was that we were not writing the `docker-compose.yaml` file properly, which had already led to flaky calls to Docker Compose before (source code):
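That code isn’t quoted here; in spirit it does something like this (function and parameter names assumed):

```python
import yaml

def write_compose_file(file, definition: dict) -> None:
    # Serialize the generated Docker Compose definition into the
    # docker-compose.yaml that later `docker compose` calls will read.
    yaml.dump(definition, file)
```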
So yesterday I added a `file.flush()` after the `yaml.dump` and hoped to be done with it. This morning I woke up and the issue was still occurring!
Based on the logged `docker` call, this should be the code causing the problem (source code):
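I’m paraphrasing rather than quoting; a rough sketch of that code path (`composition.workflows` comes from the real code, the surrounding names are assumed):

```python
# If the requested name is a known workflow, call its Python function;
# otherwise hand the name to Docker Compose as a service. With an empty
# workflows dict, "default" falls into the first branch and Docker Compose
# answers with "no such service: default".
def run(composition, name, args):
    if name not in composition.workflows:
        composition.invoke("run", name, *args)  # roughly: docker compose run <name>
    else:
        composition.workflows[name](composition, args)
```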
Running the test in an endless loop locally did not succeed in reproducing the issue:
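Something along these lines (the exact `mzcompose` invocation here is assumed and may differ):

```bash
# Hammer the test in a loop until it fails; locally it never did.
while bin/mzcompose --find skip-version-upgrade run default; do
    echo "passed, trying again"
done
```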
I modified the conditional in the above Python code to `if True:` and, sure enough, that reproduces the issue:
So how is it possible that we sometimes land in this code path? I checked how `composition.workflows` are generated and it looks quite complex indeed (source code):
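I won’t copy the real logic; conceptually it boils down to importing the composition’s `mzcompose.py` and collecting every `workflow_*` function, as in this much-simplified sketch:

```python
import importlib.util
import inspect

def load_workflows(path: str) -> dict:
    # Import the composition's mzcompose.py as a module...
    spec = importlib.util.spec_from_file_location("mzcompose", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # ...and treat every `workflow_<name>` function as a workflow named <name>.
    return {
        name.removeprefix("workflow_").replace("_", "-"): func
        for name, func in inspect.getmembers(module, inspect.isfunction)
        if name.startswith("workflow_")
    }
```

An empty or truncated file imports without error but yields no workflows at all, which is exactly what sends us into the service fallback above.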
Since we are reading and analyzing the `mzcompose.py` file, something might be running in parallel and corrupting/overwriting it. If I empty the file I can indeed reproduce the issue with the original code:
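Roughly like this, with the path and flags assumed:

```bash
# Truncate the composition file, then ask for the default workflow: with
# nothing to parse there are no workflows left, and mzcompose ends up
# asking Docker Compose for a service called "default".
: > test/skip-version-upgrade/mzcompose.py
bin/mzcompose --find skip-version-upgrade run default
```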
It feels like I’m getting closer, but nothing should be running in parallel: even in CI each agent runs in isolation, and the `git pull` is long finished at this point according to the CI logs, so it should not interfere. As an additional frustration, I’ve been wondering why this flake mostly (but not only) occurs when running this particular test. The `mzcompose.py` file in question is one of the simplest ones around.
Up to this point it had felt like I was getting closer, but now I had no more leads. I went as far as searching for Python bugs around inheritance that could explain this issue, but if Python were buggy in such basic behavior, this code definitely wouldn’t be the first to run into it. Could it be a cosmic ray?!
Since this issue seemed unexplainable without the universe conspiring against me, I went back to basics and checked what else runs in CI. I found something extremely suspicious just before the actual test execution in our CI script (source code):
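The script isn’t quoted here; stripped down, the suspicious part has this shape (commands, flags, and variable names are assumed):

```bash
# If "$service" (here: "default") is not one of the composition's
# workflows, treat it as a plain Docker Compose service and bring it up
# directly; otherwise run it as a workflow.
if ! bin/mzcompose --find "$test" list-workflows | grep -q "$service"; then
    bin/mzcompose --find "$test" up -d "$service"
else
    bin/mzcompose --find "$test" run "$service"
fi
```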
Oh no! There is another call involving `$service` here; my entire investigation so far might have been useless! From the CI output it’s not clear which one is actually running, but the fact that I’ve never seen the flake locally points to the new suspect. After all, this code only runs in CI, and not locally, all in the name of cleaner CI logs! Checking the CI output, there is a strange line:
Well, there is a pipe between the `mzcompose` and `grep` processes, but why would it break? Let’s run the `mzcompose` command in isolation:
It lists all workflows, of which we have two. The `grep` command is only looking for the `default` line. Apparently, `grep` is smart enough to exit immediately when it finds a match when used with `-q` (manpage):
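The early exit is easy to see with a producer that never stops (nothing Materialize-specific here):

```bash
# Without -q this pipeline would run forever; with -q, grep exits on the
# first match and everything finishes immediately.
yes default | grep -q default
echo $?   # 0 (in a shell without pipefail)
```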
Easy enough to verify, I ran the `true` command, which exits instantly:
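The original command isn’t shown here, but a self-contained stand-in with the same shape behaves identically:

```bash
# `true` exits before the writer has produced anything, so the writer's
# writes hit a closed pipe. A Python writer reports this as a
# BrokenPipeError, which is where the "broken pipe" line comes from.
python3 -c '
import time
for _ in range(5):
    print("default", flush=True)
    time.sleep(0.1)
' | true
echo $?   # 0: without pipefail only the last command counts
```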
So this `broken pipe` output is indeed explained by `grep` exiting early. But why does the `mzcompose` process breaking lead to the code in the `if`’s body being executed? Normally a failure of the first command in a pipe would be ignored, since the second command succeeded, and I’m seeing a return code of 0 here. Checking the beginning of the script reveals that we use the `pipefail` setting:
Bash’s manpage explains:
The return status of a pipeline is the exit status of the last command, unless the pipefail option is enabled. If pipefail is enabled, the pipeline’s return status is the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands exit successfully.
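A two-line experiment shows the difference:

```bash
false | true; echo $?   # 0: only the last command's status counts
set -o pipefail
false | true; echo $?   # 1: the failing left-hand side now determines it
```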
Typically, Python buffers its output when it doesn’t write directly to a terminal, so I still can’t reproduce the issue with `grep` locally. Then I remembered that we had an issue with our Python tests’ stdout and stderr output being garbled, which made us enable `PYTHONUNBUFFERED` in our ci-builder Dockerfile:
With this revelation, the flake was easy enough to reproduce. Just running this would fail occasionally, still less than 1% of the time on my system:
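The exact command isn’t reproduced here; a stand-in with the same ingredients (`pipefail`, an unbuffered Python writer, `grep -q`) flakes in the same way:

```bash
set -o pipefail
PYTHONUNBUFFERED=1 python3 -c 'for _ in range(100): print("default")' |
    grep -q default
echo $?   # usually 0, occasionally non-zero when grep wins the race
```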
Adding a small `sleep` between each `print` leads to a reliable failure:
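Again as a stand-in rather than the original:

```bash
set -o pipefail
PYTHONUNBUFFERED=1 python3 -c '
import time
for _ in range(10):
    print("default")
    time.sleep(0.1)   # give grep plenty of time to match and exit
' | grep -q default
echo $?   # reliably non-zero: a later print hits the closed pipe
```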
Without `PYTHONUNBUFFERED` it would have taken us 8 KiB of output before we’d run into the broken pipe, and none of our tests has that many workflows.
That concludes this long investigation into what turned out to be an interesting flake that we’ve been seeing in CI! The trivial fix is removing the `-q` from the `grep` call, but we could also have disabled `pipefail` for that one pipe, or we could have unset `PYTHONUNBUFFERED`.
One lesson from this debugging session is that even simple behaviors can combine in unexpected ways to create strange flaky failures. In this case all of these conspired to make this CI flake possible:
- Misdirection: Run different code in CI compared to locally
- Switch to another kind of server with slightly different performance characteristics
- Disable Python’s output buffering
- Enable Bash’s `pipefail`
- Make `grep` run quietly
In hindsight I realized why this specific `skip-version-upgrade/mzcompose.py` was hitting the flake most often. It is one of the smallest test files:
But unlike the even smaller test definitions, which never had this `no such service: default` error, the `skip-version-upgrade/mzcompose.py` file contains multiple workflows; those smaller files don’t, so they could not run into the issue at all:
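A one-workflow composition writes a single line to `grep` and is done; with several workflows, `mzcompose` can still be writing when `grep -q` has already matched `default` and exited. The file itself isn’t reproduced here; structurally it simply defines more than one `workflow_*` function (names invented):

```python
# A composition with a single workflow is immune: its one line of
# `list-workflows` output is necessarily written before grep can see a
# match, so the pipe never breaks. With two or more workflows there can
# still be output left to write after grep -q has matched "default".
def workflow_default(c) -> None:
    # ... the actual test steps ...
    pass

def workflow_some_variant(c) -> None:  # hypothetical second workflow
    pass
```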