Sometimes ops-related issues turn into a veritable black hole. Things that should have worked from the get-go do not, eating up time and providing plenty of frustration instead. Turnaround time increases and sometimes one has to throw in the towel, chalk it up to fate and commit a workaround. Luckily, this is not one of those stories.
Kubernetes and its little cousin k3s, for all the good they offer, often slow down development when running a binary becomes building a container and scheduling a pod. The most frustrating variation is when things break at the very last minute, usually due to subtle differences in container runtime environments or builders. DNS failures are one such issue.
A curious DNS failure
Taking an example from the wild, imagine a container in a pod failing due to DNS issues. After replacing the command in the pod with /bin/sh (it is rarely wrong to include busybox in even the most bare containers) to stop it from crashing
command: ["/bin/sh", "-c", "sleep 3600"]
we can experiment:
$ kubectl exec -it example-pod -- /bin/sh
/ # hostname
example-pod
/ # hostname -i
hostname: example-pod: Unknown host
/ # ping example-pod
ping: bad address 'example-pod'
The pod knows its own hostname but cannot resolve it, even though it is listed in /etc/hosts! Swapping out the image entirely for busybox
image: busybox:latest
and repeating the experiment works:
/ # hostname
example-pod
/ # hostname -i
10.42.1.33
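To rule out a missing entry, the claim that the name really is listed in /etc/hosts can be checked in both containers with nothing but busybox tools. The following is a minimal sketch; the function name hosts_lookup is made up for this note, and it simplifies matters by comparing names case-sensitively.

```shell
#!/bin/sh
# hosts_lookup NAME [FILE] (hypothetical helper, not part of any tool):
# print the first address that a hosts-format FILE (default: /etc/hosts)
# maps NAME to. Returns non-zero if no entry mentions NAME.
hosts_lookup() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v name="$name" '
        { sub(/#.*/, "") }          # drop comments before splitting fields
        NF >= 2 {
            # field 1 is the address, every further field is a name or alias
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; found = 1; exit }
        }
        END { exit (found ? 0 : 1) }
    ' "$file"
}
```

Inside the pod, hosts_lookup "$(hostname)" should print the pod IP if the entry exists. If it does while hostname -i still fails, the file itself is fine and the resolver is simply not consulting it.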
Going local
At this point, we can stop our experiments on Kubernetes and try with local containers to gain insight faster. One caveat left untold so far is that the original image was built using the Nix dockerTools, so we build a minimal example with them for experimentation:
{ pkgs ? (import <nixpkgs>) { } }:
pkgs.dockerTools.buildImage {
  name = "nixtest";
  tag = "latest";
  contents = with pkgs; [ busybox strace ];
}
$ nix-build experiment.nix && docker load -i result
Loaded image: nixtest:latest
This image will fail in the same manner as our problematic one above. In contrast, a simple debian:stable will work just fine.
Conveniently, strace is included in our derivation, so after installing it using apt in the debian container as well, we can compare the output of an strace on hostname. For the broken container, part of the output is:
/ # strace -e file hostname -i
[...]
stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=79, ...}) = 0
openat(AT_FDCWD, "/etc/host.conf", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/resolv.conf", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/etc/nsswitch.conf", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[...]
Comparing with the debian output, we see that an attempt to open /etc/hosts is notably absent. The key difference [1] is the failing open of nsswitch.conf, whose secrets man nsswitch.conf [2] quickly spills:
The Name Service Switch (NSS) configuration file, /etc/nsswitch.conf, is used by the GNU C Library and certain other applications to determine the sources from which to obtain name-service information in a range of categories, and in what order.
[…]
hosts    Host names and numbers, used by gethostbyname(3) and related functions.
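Given that description, a quick way to see what a container is configured to do is to print the sources listed for the hosts database, or to learn that the file is missing altogether, in which case glibc falls back to built-in defaults. The sketch below assumes the common "hosts: source ..." layout with whitespace after the colon; the function name nss_hosts_sources is invented for this note.

```shell
#!/bin/sh
# nss_hosts_sources [FILE] (hypothetical helper): print the lookup sources
# configured for the "hosts" database in an nsswitch.conf-style FILE
# (default: /etc/nsswitch.conf).
# Returns 1 if FILE does not exist and 2 if it has no "hosts:" line.
nss_hosts_sources() {
    file=${1:-/etc/nsswitch.conf}
    [ -f "$file" ] || { echo "missing: $file" >&2; return 1; }
    awk '
        { sub(/#.*/, "") }          # drop comments
        $1 == "hosts:" {
            $1 = ""                 # remove the database name
            sub(/^ +/, "")          # and the separator left behind
            print
            found = 1
            exit
        }
        END { exit (found ? 0 : 2) }
    ' "$file"
}
```

In the broken nixtest container this should report the file as missing, while a stock debian image should print a list that starts with files.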
Solving the DNS issue
Manually adding a minimal nsswitch.conf instantly fixes our problem:
/ # echo 'hosts: files dns' > /etc/nsswitch.conf
/ # hostname -i
172.17.0.2
We can finalize this in our minimal image for good, although dockerTools unfortunately lacks an uncomplicated way to put arbitrary data onto paths inside the container in a declarative manner, so we resort to extraCommands:
{ pkgs ? (import <nixpkgs>) { } }:
pkgs.dockerTools.buildImage {
  name = "nixtest";
  tag = "latest";
  # Add `nsswitch.conf` to ensure DNS queries are resolved properly.
  extraCommands = ''
    mkdir -p etc
    echo 'hosts: files dns' > etc/nsswitch.conf
  '';
  contents = with pkgs; [ busybox ];
}
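The two shell lines in extraCommands use paths relative to the image root being assembled. The same idea can be tried against a scratch directory outside of Nix; the sketch below is one possible variant that leaves an existing hosts line untouched. The name ensure_nss_hosts and the ROOT argument are assumptions made for illustration and are not part of dockerTools.

```shell
#!/bin/sh
# ensure_nss_hosts ROOT (hypothetical helper): make sure that
# ROOT/etc/nsswitch.conf contains a "hosts:" line, appending the minimal
# 'hosts: files dns' only when none is present.
ensure_nss_hosts() {
    root=$1
    conf=$root/etc/nsswitch.conf
    mkdir -p "$root/etc"
    if [ -f "$conf" ] && grep -q '^[[:space:]]*hosts:' "$conf"; then
        return 0    # already configured, keep whatever is there
    fi
    echo 'hosts: files dns' >> "$conf"
}
```

Running it twice on the same directory leaves a single hosts line behind, which makes it safe to call from a build step that may be repeated.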
Conclusion
Containers are in many cases convenient, but the container ecosystem surprises the unsuspecting user quite often. Somewhat obscure features of a Linux system are cast into the limelight, prompting an informative but time-consuming search for answers.
While /etc/hosts is fairly common knowledge, I would wager nsswitch.conf is not; hopefully these notes will change that. As usual, in the end the simple fix in no way indicates how much of a ruckus it caused in the first place.
[1] Not shown here: A painful dive into the glibc sources that yielded no results.
[2] I will readily admit that I googled it first.