Has poetry install abruptly exited with error code 1 in Buildkite, yet you cannot reproduce it on your local machine? It took my team around half a day to track down the root cause, and I hope this post can save you some time.
For context, we use Docker to build our images for the production environment, and all tests are executed inside the Docker container in Buildkite. This gives us confidence that if we have an issue in our CI/CD pipeline, we can reproduce it locally.
Apparently, we took this confidence too far: the variety of base OS (Linux for Buildkite, macOS for local) plus other hardware disparities can easily throw us off. This time, it was the TTY.
When we run tests locally, we tend to use an interactive shell, whereas in the pipeline this is normally discouraged for the sake of performance and cost. It is also of little use there, since developers rarely connect to a build machine to give the build extra input. However, this difference can mask issues that could otherwise be exposed easily and earlier.
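To make the difference visible, here is a minimal sketch that reports whether each standard stream is attached to a TTY. The helper name tty_report is hypothetical and is not part of Poetry or Buildkite; it only relies on the isatty() method that Python's standard stream objects provide.

```python
import sys


def tty_report(streams=None):
    """Return a mapping of stream name -> True if the stream is attached to a TTY."""
    if streams is None:
        streams = {"stdin": sys.stdin, "stdout": sys.stdout, "stderr": sys.stderr}
    report = {}
    for name, stream in streams.items():
        # Streams can be None or replaced by objects without isatty();
        # treat those as "not a TTY".
        isatty = getattr(stream, "isatty", None)
        try:
            report[name] = bool(isatty()) if callable(isatty) else False
        except ValueError:
            # Raised when the underlying stream has already been closed.
            report[name] = False
    return report


if __name__ == "__main__":
    for name, attached in tty_report().items():
        print(f"{name}: {'tty' if attached else 'not a tty'}")
```

Running this once in your local interactive shell and once as a pipeline step should show whether the two environments differ in the way described above.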
We initially observed inconsistent build outcomes, caused by an unpinned Poetry version. Everything was fine up to Poetry v1.2.2. But once the cached Docker layer expired and the latest Poetry kicked in, poetry install broke. The failure is so abrupt that even if you turn on --verbose, you get no extra insight from the stack trace.
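One way to approximate the pipeline's conditions locally is to run a command with none of its standard streams attached to a terminal. The sketch below is an assumption about how you might do that with Python's subprocess module; run_without_tty is a hypothetical helper, and the probe command only prints what the child process sees.

```python
import subprocess
import sys


def run_without_tty(cmd):
    """Run cmd with stdin, stdout and stderr detached from any terminal."""
    return subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,   # no interactive input
        stdout=subprocess.PIPE,     # a pipe, not a terminal
        stderr=subprocess.STDOUT,   # merge stderr into the same pipe
        text=True,
        check=False,                # inspect the return code ourselves
    )


if __name__ == "__main__":
    # The child reports whether its own stdout is a TTY.
    probe = [sys.executable, "-c", "import sys; print(sys.stdout.isatty())"]
    result = run_without_tty(probe)
    print(result.returncode, result.stdout.strip())
```

Replacing the probe with ["poetry", "install"] inside the same container image may surface the failure on a local machine, although that depends on the rest of the environment matching the pipeline.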
This issue has been recorded here. Essentially, the stdlib method used in cleo behaves differently in the newer version, and the change sneaked into Poetry v1.3 without being caught by tests. Who would come up with a test case like that?
Anyway, if you run into the same issue and want to avoid it in your CI/CD pipeline, please do one of the following:
- use poetry --quiet or poetry --no-ansi if you still want the newer Poetry version
- pin your Poetry to version v1.2.2
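The first option can be sketched as a small wrapper that builds the install command for the pipeline. The function name poetry_install_command is hypothetical, and adding --no-ansi only when stdout is not a TTY is my own assumption; the post itself only says that either flag avoids the failure.

```python
import sys


def poetry_install_command(stdout_is_tty=None, quiet=False):
    """Build the argument list for `poetry install`, adapted to the environment."""
    if stdout_is_tty is None:
        # Detect the environment when the caller does not say.
        stdout_is_tty = sys.stdout is not None and sys.stdout.isatty()
    cmd = ["poetry", "install"]
    if quiet:
        # Suppress output entirely.
        cmd.append("--quiet")
    elif not stdout_is_tty:
        # Keep the output but disable ANSI formatting when there is no terminal.
        cmd.append("--no-ansi")
    return cmd


if __name__ == "__main__":
    print(" ".join(poetry_install_command()))
```

For the second option, installing the tool with an exact version, for example pip install poetry==1.2.2 in the Docker image, keeps an expired cache layer from silently pulling in a newer release.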
Discussion time
Considering that this caught us by surprise, what would you recommend to remediate similar issues in the future? For example, do you always recommend pinning Poetry and any other relevant tools in the CI/CD pipeline for a consistent build outcome, or would you advocate a fail-fast approach so that problems can be spotted earlier? Leave your thoughts in the comment section because I'm keen to know your opinion!