Readit News
nickysielicki · 5 months ago
Doesn’t surprise me at all that people who know what they’re doing are building their own images with nix for ML. Tens of millions of dollars have been wasted in the past 2 years by teams who are too stupid to upgrade from buggy software bundled into their “golden” docker container, or too slow to upgrade their drivers/kernels/toolkits. It’s such a shame. It’s not that hard.

Edit: see also the horrors that exist when you mix nvidia software versions: https://developer.nvidia.com/blog/cuda-c-compiler-updates-im...

CuriouslyC · 5 months ago
I use Nix and like it, but in terms of DX docker is still miles ahead. I liken it to Python vs Rust. Use the right tool.
dustbunny · 5 months ago
Can you be explicit about how the dollars are being wasted? Maybe it's obvious to you, but how does an old kernel waste money?
nickysielicki · 5 months ago
The modern ML cards are much more raw than people realize. This isn’t a highly mature ecosystem with stable software, there are horrible bugs. It’s gotten better, but there are still big problems, and the biggest problem is that so many people are too stupid to use new releases with the fixes. They stick to the versions they already have because of nothing other than superstition.

Go look at the llama 3 whitepaper and look at how frequently their training jobs died and needed to be restarted. Quoting:

> During a 54-day snapshot period of pre-training, we experienced a total of 466 job interruptions. Of these, 47 were planned interruptions due to automated maintenance operations such as firmware upgrades or operator-initiated operations like configuration or dataset updates. The remaining 419 were unexpected interruptions, which are classified in Table 5. Approximately 78% of the unexpected interruptions are attributed to confirmed hardware issues, such as GPU or host component failures, or suspected hardware-related issues like silent data corruption and unplanned individual host maintenance events.

[edit: to be clear, this is not meant to be a dig on the meta training team. They probably know what they’re doing. Rather, it’s meant to give an idea of how immature the nvidia ecosystem was when they trained llama 3 in early 2024. These are the failure rates you can expect if you opt into using the same outdated software they were forced to use at the time.]

The firmware and driver quality is not what people think it is. There’s also a lot of low-level software like NCCL and the toolkit that exacerbates issues in specific drivers and firmware versions. Grep for “workaround” in the NCCL source code and you’ll see some of them. It’s impossible to validate and test all permutations. It’s also worth mentioning that the drivers interact with a lot of other kernel subsystems. I’d point to HMM, the heterogeneous memory manager, which is hugely important for nvidia-uvm, which was only introduced in v6.1 and sees a lot of activity.

Or go look at the amount of patches being applied to mlx5. Not all of those patches get back ported into stable trees. God help you if your network stack uses an out of tree driver.

pluto_modadic · 5 months ago
New corollary: sometimes new tech gets built because you don't know how to correctly use existing tech.
dima55 · 5 months ago
Are you referring to this Nix effort or to Docker? Because that largely applies to most usages of Docker.
nine_k · 5 months ago
Saying that Docker could be replaced by a simple script that does chroot + ufw + nsenter is like saying that Dropbox could be a simple script using rsync and cron. That is, technically not wrong, but it completely misses the UX / product perspective.
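For what it's worth, the "simple script" half of that comparison really is only a few lines. A minimal sketch (paths illustrative; the actual container entry is left commented because it needs root), which handles none of the image format, layering, networking, or lifecycle work Docker does:

```shell
set -eu
# Build a throwaway root filesystem:
mkdir -p /tmp/miniroot/bin
cp /bin/sh /tmp/miniroot/bin/sh   # a real setup would also copy sh's shared libs
# Entering it needs root: fresh mount/PID namespaces via unshare, then chroot:
#   sudo unshare --mount --pid --fork chroot /tmp/miniroot /bin/sh
ls /tmp/miniroot/bin
```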
zenmac · 5 months ago
Great, can't wait for the systemd crew to come out with: Docker Was Too Slow, So We Replaced It: Systemd in Production [asciinema]
mianos · 5 months ago
No joke, it's already there, systemd-nspawn can run OCI containers.
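A minimal sketch of what that looks like: an OCI runtime bundle is just a rootfs/ directory plus a config.json, and systemd-nspawn can boot one via --oci-bundle= (see systemd-nspawn(1)). The config below is a stub for illustration, not a complete OCI spec document:

```shell
set -eu
mkdir -p bundle/rootfs
cat > bundle/config.json <<'EOF'
{
  "ociVersion": "1.0.0",
  "process": { "args": ["/bin/sh"], "cwd": "/" },
  "root": { "path": "rootfs" }
}
EOF
# Then, as root:
#   systemd-nspawn --oci-bundle=bundle/
ls bundle
```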
miladyincontrol · 5 months ago
Honestly I've been loving systemd-nspawn using mkosi to build containers, distroless ones too at that where sensible. Works a treat for building VMs too.

Scales wonderfully; fine-grained permissions and configuration are exactly what you'd hope for coming from systemd services. I appreciate that it leverages various Linux-isms like btrfs snapshots for faster read-only or ephemeral containers.

People still by and large have this weird assumption that you can only do OS containers with nspawn; never too sure where that idea came from.
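A hedged sketch of what that mkosi flow can look like; the section and option names follow the mkosi documentation, but the distribution and package choices are purely illustrative:

```ini
# Hypothetical mkosi.conf
[Distribution]
Distribution=fedora

[Output]
# a plain rootfs tree that systemd-nspawn can boot directly:
Format=directory

[Content]
Packages=systemd
```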

whateveracct · 5 months ago
funnily enough, I stopped using Docker and use NixOS-configured systemd services half a decade ago and never looked back
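For readers who haven't seen the pattern: a minimal sketch of a NixOS module declaring a systemd service (the service name is hypothetical; pkgs.hello stands in for a real application package):

```nix
{ pkgs, ... }: {
  systemd.services.myapp = {
    description = "Example app run directly as a systemd service";
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      ExecStart = "${pkgs.hello}/bin/hello";  # stand-in for a real package
      DynamicUser = true;                     # sandboxing without a container runtime
      Restart = "on-failure";
    };
  };
}
```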
flyer23 · 5 months ago
"half a decade ago" is the nix way of saying "5 years ago" :P
hamandcheese · 5 months ago
What does systemd have to do with the video?
amadio · 5 months ago
At CERN, software stacks are created centrally and software distribution for experiments is done with CVMFS (https://cernvm.cern.ch/fs/), an HTTP-based read-only FUSE filesystem.

EESSI (https://eessi.io) has taken this model further by using CVMFS, Gentoo Prefix (https://prefix.gentoo.org), and EasyBuild to create full HPC environments for various architectures.

CVMFS also has a docker driver to allow only used parts of a container image to be fetched on demand, which is very good for cases in which only a small part of a fat image is used in a job. EESSI has some documentation about it here: https://www.eessi.io/docs/tutorial/containers/

fouc · 5 months ago
Given that he calls their containers "pods", does anyone think he's using `podman --rootfs` to bypass overlayfs and the 128 layer limit? Or is it just a coincidence they're called "pods" ?

Or he's using NixOS as the image OS and using nixos-containers (which use systemd-nspawn)

0x6c6f6c · 5 months ago
Since no one responded, they're running on Kubernetes, where a unit of containers is called a "pod". A pod may be one or more containers, but it's the smallest deployable unit in the Kubernetes space.

Their Docker images were 11-35GB. Using the nix dockerTools approach would have resulted in 100-300MB layers. These also may not cache well between tags, though that's my intuition, not knowledge. Especially if that's true, it wouldn't have improved the overall pull-time issue they were having: 70-210s of image pull time on many new builds.

In their case they added what's called a sidecar but was actually an init container, which runs before the primary container of the pod. It used root privileges to bind-mount Nix store paths into the running container, which made the software provided in /nix/store available from those bind mounts. This also meant neither the Kubernetes hosts nor the containers required the Nix daemon: the nix-sidecar running within the pod orchestrated pulling derivations, binding them, and running garbage collection at low priority in the background to ensure the host SSDs don't run out of storage, while still allowing derivations referenced in the cluster to persist. That improves sync time when the SSD already contains all the derivations a new pod needs at startup.
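Roughly, the shape described above looks like this hypothetical pod spec (the names, images, and mount layout are invented for illustration; the real manifests aren't public, and a working version would also need mount-propagation settings):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-nix-pod
spec:
  initContainers:
    - name: nix-sidecar            # runs to completion before the main container
      image: example/nix-sidecar   # hypothetical image
      securityContext:
        privileged: true           # needed for the bind mounts
      volumeMounts:
        - { name: host-nix-store, mountPath: /host/nix/store }
        - { name: app-store, mountPath: /out/nix/store }
      # pulls needed derivations into the host store, then bind-mounts their
      # paths into the volume the main container will see as /nix/store
  containers:
    - name: app
      image: example/minimal-base  # no Nix daemon required inside
      volumeMounts:
        - { name: app-store, mountPath: /nix/store, readOnly: true }
  volumes:
    - { name: host-nix-store, hostPath: { path: /nix/store } }
    - { name: app-store, emptyDir: {} }
```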

forrestthewoods · 5 months ago
Alternative: just produce relocatable builds that don’t require all of this unnecessary extra infrastructure
hamandcheese · 5 months ago
Please elaborate. How does one "just" do that?
forrestthewoods · 5 months ago
Deploying computer programs isn't that hard. What you actually need to run is pretty straight forward. Depend on glibc, copypaste all your other shared lib dependencies and plop them in RPATH. Pretend `/lib` is locked at initial install. Remove `/usr/lib` from the path and include everything.
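A hedged sketch of that layout (file names illustrative): the binary carries an RPATH of `$ORIGIN/lib`, so the dynamic loader resolves vendored .so files relative to wherever the tree is copied:

```shell
set -eu
mkdir -p dist/myapp/lib
# Stand-in for a vendored shared library that ships with the app:
touch dist/myapp/lib/libfoo.so.1
# At link time the key flag is (single quotes so $ORIGIN reaches the linker):
#   gcc main.o -Lvendor -lfoo -Wl,-rpath,'$ORIGIN/lib' -o dist/myapp/myapp
# Or retrofit an already-built binary:
#   patchelf --set-rpath '$ORIGIN/lib' dist/myapp/myapp
# The dist/ tree can now be tarred up and run from any path:
ls dist/myapp/lib
```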

Docker was made because Linux sucks at running computer programs. Which is a very silly thing to be bad at. But here we are.

What has happened in more recent years is that CMake sucks ass so people have been abusing Docker and now Nix as build system. Blech.

The speaker does get it right at the end. A Bazel/Buck2 type solution is correct. An actual build system. They're abusing Nix and adding more layers to provide better caching. Sure, I guess.

If you step back and look at the final output of what needs to be produced and deployed it's not all that complicated. Especially if you get rid of the super complicated package dependency graph resolution step and simply vendor the versions of the packages you want. Which everyone should do, and a company like Anthropic should definitely do.

musicale · 5 months ago
Docker is overkill if all you really need is app packaging.

Docker containers may not be portable anyway when the CUDA version used in the container has to match the kernel driver and GPU firmware, etc.

imiric · 5 months ago
Some people, when confronted with a problem, think "I know, I'll use Nix." Now they have two problems.
k__ · 5 months ago
Seems like anti-intellectualism is spreading at HN, too.
imiric · 5 months ago
Oh, please.

The only anti-intellectualism is not accepting that every technology has tradeoffs.

HumanOstrich · 5 months ago
Yup, and there's a high correlation between people rewriting everything in Rust and converting everything else to Nix. It's like a complexity fetish.
nine_k · 5 months ago
It's removing complexity elsewhere, usually much more of it. Once you have invested in a relatively fixed bit of complexity, your other tasks become much easier to complete.

Once you have invested in understanding the Clifford algebra, your whole classical electrodynamics turns from 30 equations into two.

Once you have invested in writing a Fortran compiler, you can write numerical computations much easier and shorter than in assembly.

Once you have invested in learning Nix, your whole elaborate build infra, plus chef / puppet / saltstack suddenly shrinks to a few declarations.

Etc.
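Concretely, "a few declarations" can be as small as a shell.nix like this (the package choices are just examples):

```nix
{ pkgs ? import <nixpkgs> {} }:
pkgs.mkShell {
  # one pinned, reproducible dev environment instead of a pile of setup scripts:
  packages = [ pkgs.python3 pkgs.gcc pkgs.cmake ];
}
```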

justinrubek · 5 months ago
I see it more as a simplicity fetish. I don't want to have to deal with the complexity of software that is not packaged via nix. It's a big headache to deal with, I need it to be simple and work.
umvi · 5 months ago
"your entire static website is running on GitHub pages? Sounds like legacy tech debt. I need to replace it with kubernetes pronto"