One of my favorite things to come out of the late 90s was Extreme Programming, which, despite being named as if it were an offshoot of the contemporary X Games 1, brought a surprising amount of good things to the more accident-prone field of software development. One of these is continuous integration, which few projects can do without these days, or at least should not.
The software that runs the pipeline is also shifting: some shops are still on Jenkins, while others are moving to GitLab Runner, to a plethora of third-party cloud solutions like CircleCI and Drone, to newer self-hosted solutions like GoCD, or to the upstart GitHub Actions.
They all have a rough strategy in common: set up an environment and run a series of script commands to build the software and run tests. Unfortunately, they share a lot of issues as well.
Continuous pains
Waiting for the CI is slow: in a sane development environment, a developer working on software with a non-trivial compile time has a powerful desktop machine that is at least as fast as any CI server setup. The latter might also be shared and will have jobs queued up during peak operating hours. To submit a job to the CI, the new code has to be pushed, the pipeline triggered, a new runner instantiated, the environment set up and ideally some caches restored.
But why are there issues only found later in CI? Sometimes a project has a test suite so large that running it locally is too expensive, but that does not hold true for most mid-size projects; on the contrary, the developer will be running tests as part of the regular development cycle. In my experience, most of these issues are due to a dirty working copy, a difference in testing procedure or an environment mismatch.
A common dirty working copy issue is forgetting to check in a source file. Liberal use of git status and keeping the local copy reasonably clean prevents most of these; luckily, they tend to fail quickly.
The story of preflight.sh
Testing procedure differences are much more expensive and happen when the local developer's testing workflow differs from the one run by the CI. It can be very frustrating having to wait for half an hour only for the CI to fail with a Clippy warning. Maybe I even executed Clippy beforehand, but ran into the long-standing Clippy bug where it does not lint files that have already been checked with cargo check before!
On many occasions I have written a little helper script called preflight.sh that attempts to approximate the CI procedure while making some trade-offs for speed:
#!/bin/sh -e
cargo fmt -- --check
# Fix for clippy bug:
find src/ -name \*.rs -exec touch {} +
cargo clippy --all-targets --all-features -- -D warnings
cargo build --all-targets --all-features
cargo test --all-targets --all-features
echo "✓ all good"
Steps are ordered with the cheapest ones first to minimize the time to first failure. Running ./preflight.sh before pushing to CI catches trivial mistakes quickly, and when run on the local machine, a lot of download and compile caches will be hit from the previous compilation.
Some issues will still slip through though, and unfortunately we cannot reuse preflight.sh, as most projects will differ ever so slightly in their setup here; an issue which we will address later.
Protect your local environment
Environment mismatches are initially felt most by the developer who sets up the CI, but sooner or later by everyone else. Even though some dependencies might be managed by your language's build tool, there will still be a script setting up the CI which installs whatever is required. This usually ends up being a lot when starting from a seven-megabyte alpine base image. There will be differences between the local and the CI environment, be it a different variant of tar or that one version of the command-line tool that is still missing a feature.
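To make this concrete, a hypothetical setup step for such a pipeline might look like the following sketch; the exact package list is an assumption and varies per project:

# illustrative CI setup step on an alpine base image
apk add --no-cache build-base curl git tar
# ...followed by installing whatever toolchain the project needs by hand

Every one of these lines is a small opportunity for the CI environment to drift away from what is installed on the developer machines.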
This can be somewhat alleviated by using Docker on the dev desktop as well, but now development is not fun anymore: shell scripts to set up all the mounts will be written, the image is either too large or missing development tools, and sharing the cache between your local IDE and the toolchain inside the image becomes a real challenge involving filesystem user ids and permission issues. Starting up the application used to take milliseconds; now it takes tens of seconds. It gets worse when the need for speed requires multi-stage builds, and soon your Dockerfiles themselves will introduce new problems. You can tell you have reached this point when docker system prune becomes a routine step during bug triangulation.
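As an illustration of where this tends to end up, a wrapper invocation for running the toolchain inside the CI image might look roughly like this; the image name is made up and the cache mount is only one of several you will end up needing:

# hypothetical wrapper for running tests inside the CI image
docker run --rm -it \
    -v "$(pwd)":/src \
    -v "$HOME/.cargo":/cargo \
    -e CARGO_HOME=/cargo \
    -w /src \
    my-project-ci:latest \
    cargo test
# files written by the container are owned by root on the host,
# which is where the user id and permission games usually begin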
There simply is no substitute for having all tooling installed locally to work efficiently.
A good indicator of how healthy your overall setup is in terms of complexity is the size of your ci.yml file. If it is large, a good deal of build or testing knowledge has probably migrated there quietly, which is an issue. Sometimes this results in desperate measures like installing gitlab-runner locally and instructing it to run the described job as a one-off, at which point GitLab CI has become your package manager and build system.
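For reference, that one-off invocation looks roughly like the following; the job name is hypothetical and has to match an entry in your .gitlab-ci.yml:

# run the hypothetical "test" job from .gitlab-ci.yml locally with the docker executor
gitlab-runner exec docker test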
Solving the environment issues with nix
There is a remedy for these issues using nix though. In the previous two articles of this series we laid out an example of a short and simple way of specifying an environment and building software inside of it. It again pays dividends, as our whole process of setting up the environment, building and running the tests can be summarized as
$ nix-shell --command ./build.sh
$ nix-shell --command ./test.sh
Throw in a --pure for good measure if you like.
We have solved two of our problems: the environment is completely captured inside shell.nix, and test.sh is what preflight.sh aspires to be, a way to run the CI pipeline directly.
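For reference, a minimal shell.nix in the spirit of the previous articles could look like the sketch below; the exact package set is an assumption for a Rust project, and your own shell.nix remains the source of truth:

with import <nixpkgs> {};

mkShell {
  # toolchain used by build.sh and test.sh
  buildInputs = [ rustc cargo clippy rustfmt ];
}

and test.sh can simply contain the same checks as preflight.sh, minus the local workarounds, for example:

#!/bin/sh -e
cargo fmt -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test --all-targets --all-features

The touch workaround is not needed here, as the CI always starts from a clean checkout.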
We can try a quick sanity check to see if our application works on an arbitrary CI pipeline. Since most of them use Docker internally, we can fire up a container and pretend it is a new CI instance, using nixos/nix as the base image:
$ docker run --rm -it -v $(pwd):/src nixos/nix
54218d186d03:/# cd src
54218d186d03:/src# nix-shell --command ./build.sh
...
54218d186d03:/src# nix-shell --command ./test.sh
This runs without issues, provided no private repositories need to be cloned2.
As an example, here is the complete GitHub Actions file to run it:
name: Run CI
on: [push]
jobs:
  build:
    name: Build application
    runs-on: ubuntu-latest
    container:
      image: nixos/nix
    steps:
      - uses: actions/checkout@v2
      - run: |
          nix-shell --command ./build.sh
          nix-shell --command ./test.sh
The short configuration shown above translates well to other CI systems, which all have a way of specifying a base container and a set of commands to run as a job description.
Getting rid of Docker
In the case of GitHub Actions, all CI runners run on a new virtual machine instantiated from a rather full-featured base image. We can cut Docker out of the loop completely by installing nix directly onto said image, which already cuts about ten seconds from our startup time.
Installing nix on the GitHub Actions CI comes with some pitfalls, but thankfully Cachix has prepared an action to cover this. Our final CI script thus looks like
name: Run CI
on: [push]
jobs:
  build:
    name: Build application
    runs-on: ubuntu-latest
    steps:
      - uses: cachix/install-nix-action@v10
      - uses: actions/checkout@v2
      - run: |
          nix-shell --command ./build.sh
          nix-shell --command ./test.sh
At this point there are no containers involved in our complete process from development to continuous integration.
Conclusion
We have functionally completed our build-test-CI cycle without having to write copious amounts of configuration or glue scripts, and ideally how the resulting setup works is fairly obvious to anyone joining the project even at this stage.
In the next step, we will look at caching, trading a bit of complexity to speed up our builds. Thankfully this is less of an urgent issue, as we can comfortably run the CI on every developer machine and thus do not rely on it for our core development process.
1. The natural successor to both was, of course, Tony Hawk’s Pro Skater 2 in the early aughts. ↩︎
2. If that were the case, we would need to apk add git openssh-client and set up some private keys, but that is out of the scope of this article. Most CIs will have some sort of mechanism for setting these up that differs from vendor to vendor. ↩︎