Barco: Linux Containers from Scratch in C

As the maintainer of a Go container runtime (runc) who has also worked with Rust in various other projects, I find that while they can be better languages for building large projects, they make it harder to understand exactly what your program is doing when writing software like this.

One example that immediately comes to mind from Rust is a bug with O_PATH file descriptors I found a while ago[1], which would've made certain code we use in runc not work. And from Go, here is a bug I just found in their code for handling file descriptors for ForkExec[2] which is causing issues in a runc patch I'm working on. Neither of these issues exists in C programs. Though of course, C programs have their own issues. For better or worse, the Linux kernel APIs are easiest to use from C.

In runc we actually implement the core container setup code in C because Go doesn't allow you to do everything we need for setting up a container (it has gotten better though, in the past it was completely impossible to set up a container properly in pure Go -- now you can set one up but there are still certain configurations that are not possible to implement in pure Go, such as "docker exec"). You also cannot run Go in single-threaded mode, which means that certain kernel APIs (unshare(CLONE_NEWUSER) for instance) simply cannot be used from regular Go code.

[1]: https://github.com/rust-lang/rust/issues/62314
[2]: https://github.com/golang/go/issues/61751
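To make the unshare(CLONE_NEWUSER) point concrete, here is a minimal C sketch. It is not runc's actual code, and the helper name try_new_userns is made up for illustration. It forks first so that the caller of unshare() is guaranteed to have a single thread, which is the condition a regular Go program cannot easily meet. Whether the call succeeds also depends on system policy (some distributions restrict unprivileged user namespaces), so the helper reports refusal instead of treating it as fatal.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from runc or barco).
 * Returns 1 if a forked child could enter a new user namespace,
 * 0 if the kernel refused, and -1 if fork/wait itself failed. */
int try_new_userns(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* A freshly forked child has exactly one thread, so the
         * "caller must not be multithreaded" rule is satisfied. */
        if (unshare(CLONE_NEWUSER) != 0) {
            fprintf(stderr, "unshare(CLONE_NEWUSER): %s\n", strerror(errno));
            _exit(1);
        }
        _exit(0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : 0;
}
```

Calling try_new_userns() from a plain C main() is enough to try it out. A multithreaded caller would be expected to get EINVAL from the same unshare() call, which is why runtimes written in Go tend to do this step from a C stub that runs before the Go runtime starts its threads.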

"While most of the tools used in the Linux containers ecosystem are written in Go, I believe C is a better fit for a lower level tool like a container runtime. runc, the most used implementation of the OCI runtime specs written in Go, re-execs itself and use a module written in C for setting up the environment before the container process starts. crun aims to be also usable as a library that can be easily included in programs without requiring an external process for managing OCI containers."

> barco enforces a minimal set of restrictions to run untrusted code, which is not recommended for production use, where a more robust solution should be used.

Aren't containers never suitable for running untrusted code? You need AppArmor, bwrap, or similar AFAIK.

cyphar · 2 years ago

bwrap is a container and AppArmor is used by basically every container runtime if the system is using AppArmor (otherwise they use SELinux). Seccomp is also enabled by default, and I would argue it is a more significant protection against container breakouts because it protects against kernel 0-days as well and doesn't rely on LSM hooks to block operations. The real question is whether you are using user namespaces.
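As a rough illustration of why seccomp works without LSM hooks, the sketch below installs a tiny BPF filter that makes one syscall fail with EPERM before the kernel runs any of that syscall's code. It is a simplified assumption-laden example, not what Docker or runc ship: the function names are hypothetical, real runtimes use an allowlist of many syscalls, and a production filter should also check seccomp_data.arch before trusting the syscall number.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: make syscall number `nr` fail with EPERM for the
 * calling thread. Simplification: no architecture check is performed. */
int install_deny_filter(int nr)
{
    struct sock_filter filter[] = {
        /* Load the syscall number from struct seccomp_data. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* Match: fall through to the ERRNO return. No match: skip it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* Unprivileged processes must set no_new_privs before adding a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1UL, 0UL, 0UL, 0UL) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, (unsigned long)SECCOMP_MODE_FILTER, &prog);
}

/* Hypothetical demo: returns 1 if getppid was denied in a child,
 * 0 if the filter was installed but the call still succeeded,
 * -1 if the filter could not be installed or fork/wait failed. */
int getppid_denied_in_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (install_deny_filter(SYS_getppid) != 0)
            _exit(2);
        errno = 0;
        long r = syscall(SYS_getppid);
        _exit((r == -1 && errno == EPERM) ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 0)
        return 1;
    return (WEXITSTATUS(status) == 1) ? 0 : -1;
}
```

The filter is applied in a forked child because seccomp filters cannot be removed once installed. getppid is used only because denying it is harmless; the same shape applies to whatever syscalls a profile wants to block.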

Jessica Frazelle ran a public bug bounty to break out of a container image that is properly secured, and as far as I know nobody collected the bounty. The website isn't up at the moment, maybe she took it down. https://contained.af/

jppittma · 2 years ago

Sounds like free money to me. You just press Ctrl+D, and you're out.

creatonez · 2 years ago

If built to spec, then the various container technologies in the kernel used together are theoretically secure. Together they close all of the holes that we know about, aside from a few trivial things like the container spying on process id numbers on the host, and of course the vast potential to accidentally misconfigure them.

However, all this code is quite complex, and the kernel and the software ecosystem lack a layered approach to security that goes all the way down to the low-level nitty gritty stuff. For example, kernel memory structures are not robustly protected against the usual memory exploits, and there isn't as strong W^X protection as desired. Windows, in contrast, is able to provide layered security through a variety of approaches, including running the entire operating system in a virtual machine, with the host ensuring integrity of kernel memory. These sorts of layered approaches to security are desirable because there will always be defects in any complex software.

Side note: AppArmor and bwrap are distinct. Bubblewrap is a relatively simple userspace program that makes use of existing kernel containerization features (the same ones that Docker/Podman use), whereas AppArmor and SELinux are security features that are patched into the kernel itself. AppArmor and SELinux have made some progress in adding layered low-level security to the kernel, but it's not particularly impressive. Bubblewrap has done great work in exposing the kernel's existing tech to users, but that is not a fundamental improvement to the kernel itself.

viraptor · 2 years ago

> aside from a few trivial things like the container spying on process id numbers on the host

Containers with their own PID namespace can't spy on process IDs on the host though? Not sure what you mean here.
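A small C sketch of the PID namespace behaviour being discussed, under the assumption that unprivileged user namespaces are available (the helper name is made up): after unshare(CLONE_NEWUSER | CLONE_NEWPID), the next process forked is PID 1 in the new namespace and has no view of host PIDs through getpid().

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Exit codes used between the helper process and the caller. */
enum { NS_REFUSED = 100, NS_ERROR = 101 };

/* Hypothetical helper: returns 1 if the first process created in a new
 * PID namespace sees itself as PID 1, 0 if the kernel refused to create
 * the namespaces, and -1 on any other failure. */
int first_child_is_pid1(void)
{
    pid_t helper = fork();
    if (helper < 0)
        return -1;

    if (helper == 0) {
        /* The user namespace grants the privileges needed for the
         * PID namespace without requiring root on the host. */
        if (unshare(CLONE_NEWUSER | CLONE_NEWPID) != 0)
            _exit(NS_REFUSED);

        /* unshare(CLONE_NEWPID) does not move the caller itself; only
         * children created afterwards live in the new namespace. */
        pid_t inner = fork();
        if (inner < 0)
            _exit(NS_ERROR);
        if (inner == 0)
            _exit(getpid() == 1 ? 0 : 1);

        int st = 0;
        if (waitpid(inner, &st, 0) < 0 || !WIFEXITED(st))
            _exit(NS_ERROR);
        _exit(WEXITSTATUS(st));
    }

    int status = 0;
    if (waitpid(helper, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 0)
        return 1;
    return (WEXITSTATUS(status) == NS_REFUSED) ? 0 : -1;
}
```

This only shows the PID numbering. A process in the new namespace could still read host PIDs from a /proc mount inherited from the host, which is why container runtimes also mount a fresh procfs inside the new mount namespace.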

> and there isn't as strong W^X protection as desired

What level is desired? Bootup warnings for W^X got merged a while ago. Changes that try to include anything violating it are rejected (see bcachefs).

> Windows, in contrast, is able to provide layered security through a variety of approaches, including running the entire operating system in a virtual machine, with the host ensuring integrity of kernel memory.

What? Xen has existed for years, that's not "in contrast". Secure Boot and lockdown exist on Linux too. There are also per-service Firecracker microVMs.

> whereas AppArmor and SELinux are new security features that are patched into the kernel itself

That's very misleading. They're not new - SELinux is over 2 decades old. They're also not "patched in" - LSMs have been integrated into Linux for a very long time with multiple implementations available. SELinux had multi-level security created for gov use. It's quite impressive actually.

loeg · 2 years ago

I would probably point at a virtual machine for a convenient place to run untrusted code. It's not perfect -- there are VM escapes -- but it's more convenient than a dedicated, air-gapped machine.

CameronNemo · 2 years ago

GKE runs every kubelet in its own gvisor-like userspace hypervisor.

https://cloud.google.com/blog/products/containers-kubernetes...

viraptor · 2 years ago

Depends what you mean by suitable. If you run the service as a new user, it's more secure than running without a new namespace (you're isolated from other apps) and potentially less secure than running on host (one more layer of indirection for system resource access).

Since in reality most attacks will be against your app itself before the attacker has direct access to syscalls, I see namespaces/containers as extra protection.

charcircuit · 2 years ago

>Aren't containers never suitable for running untrusted code?

They are suitable provided the kernel is secure.

cyphar · 2 years ago

This is tautologically true -- "Is X secure? Yes, assuming the technology X uses is secure."

The more nuanced answer is that containers have several layers of protections (seccomp, LSMs, user namespaces, namespaces, cgroups, capabilities, and standard process permissions by running as an unprivileged user) which all act together to help protect against container attacks. It's not perfect, but most container breakout attacks we've had so far are related to when container runtimes have to operate on a container during process setup (IMHO because the process for creating a container process is far from atomic) -- some of these attacks were enabled by kernel bugs which we went and fixed as well. It is very difficult to break out of a container once it has been configured and left alone.
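One of those layers can be inspected from inside a process with a few lines of C. The sketch below (the helper name is hypothetical) counts how many capabilities remain in the capability bounding set, which is the upper limit on what the process could ever regain through exec. It is only a read-only probe, not how any particular runtime configures capabilities.

```c
#include <sys/prctl.h>

/* Hypothetical helper: counts the capabilities still present in the
 * calling thread's bounding set. The scan stops at the first capability
 * number the running kernel does not recognise, with 64 as a hard cap. */
int count_bounding_caps(void)
{
    int count = 0;

    for (int cap = 0; cap < 64; cap++) {
        int r = prctl(PR_CAPBSET_READ, (unsigned long)cap, 0UL, 0UL, 0UL);
        if (r < 0)
            break;      /* EINVAL: past the last capability known to this kernel */
        if (r == 1)
            count++;    /* capability is still in the bounding set */
    }
    return count;
}
```

Printing count_bounding_caps() on the host and again inside a container gives a quick feel for how much a runtime has trimmed. The exact numbers depend on the kernel version and the runtime's configuration.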

lucavallin · 2 years ago

barco is a project I worked on to learn more about Linux containers and the Linux kernel, based on other guides on the internet.

teleforce · 2 years ago

Looks like a good project to learn containers from scratch.

Just wondering about the main reason you're using C, since most container projects now seem to be using Go or Rust?

serf · 2 years ago

I can't answer for the developer, but the answer to that with most small one-person-show projects is familiarity/comfort/ability.

The head-space that adopting a new language for a specific project takes is immense compared to tackling it in a familiar language that you know you're already capable in; there is rarely a benefit to doing so outside of team environments where a certain level of on-boarding is expected, or because you have a really niche language requirement/feature that your project is begging for.

nazgulsenpai · 2 years ago

I came across this last week when reading about different container runtimes -- crun is implemented in C[0].

Their explanation:

[0]https://github.com/containers/crun

I haven't written much C since college and I felt nostalgic, so I went for it.

murphyslaw · 2 years ago

This could be quite useful in a CS course on containers.

metadat · 2 years ago

How did you come up with the name "Barco"?

lcvln · 2 years ago

It's Venetian (my native language) for "hay barrack": http://vec.wikipedia.org/wiki/Barco

voidmain0001 · 2 years ago

It’s Spanish for boat.

mlashcorp · 2 years ago

Also, Portuguese for boat.

29athrowaway · 2 years ago

Docker has to do with ships and barco means ship.

intelVISA · 2 years ago

awesome, thx for sharing this :)

nedt · 2 years ago

When I did a talk about Docker I also wanted to show a bit of what it does under the hood without going through all the layers and without too many details. These ~120 lines of shell script are really good at providing just an intro into what's needed for containers: https://github.com/p8952/bocker/blob/master/bocker (not mine)

favflam · 2 years ago

Question: for sandboxing untrusted code, should I invest time in learning more linux container stuff or switch to learning WASI? I am inclined towards WASI myself.

lmm · 2 years ago

I have more faith in WASI. Linux containers are inherently in a whack-a-mole position where they're trying to retrofit security onto something that was never built for it, which very rarely works.

jeroenhd · 2 years ago

There's a huge performance hit for many programming languages when you run them inside a WASM runtime. Memory also behaves very differently from normal applications. Autovectorization also isn't universally supported by WASM compilers yet, which can be costly for performance.

Properly configured WASI runtimes are great for security but they're worse than containers on most other fronts. I don't think the downsides make sense unless you're building a business that lets random customers upload WASM files you execute.

elcapitan · 2 years ago

Just came to say that this is a very nicely set up C project, understandable structure and simple Makefile, very beginner friendly, congrats :)

hippospark · 2 years ago

There is a similar project _Linux containers in 500 lines of code_ [1], the code is a bit old, but the procedure is quite simple. [1]: https://blog.lizzie.io/linux-containers-in-500-loc.html

shortrounddev2 · 2 years ago

Very cool, I was thinking of doing something similar in Windows

semiquaver · 2 years ago

What is the underlying isolation technology that would be used in Windows?

AkshitGarg · 2 years ago

Windows also supports containers: https://learn.microsoft.com/en-us/virtualization/windowscont...

yukIttEft · 2 years ago

Could this also be implemented as a Bash script?

Probably: https://github.com/p8952/bocker