The Debian group is admirable and has positively changed the standards for OS design several times. Reminds me I should donate to their coffee fund around tax time =3
I’ve said it many times and I’ll repeat it here - Debian will be one of the few Linux distros we have right now that will still exist 100 years from now.
Yea, it’s not as modern in terms of versioning and risk compared to the likes of Arch, but that’s also a feature!
I always want to donate more to open source projects but as far as I know there aren't any I can get tax credits for in Canada. My budget is strapped just enough that I can't quite afford to donate for nothing.
Any Canadian residents here know of any tax credit eligible software projects to donate to?
I don't get how someone achieves reproducibility of builds: what about file metadata like creation/modification timestamps? Do they forge them? Or is that metadata treated as not important enough (i.e., should 2 files with different metadata but identical contents have the same checksum when hashed)?
There's lots of info on the Debian site about their reproducibility efforts, and there's a story from 2024's DebConf that may be of interest: https://lwn.net/Articles/985739/
ASLR shouldn’t be an issue unless you intend to capture the entire memory state of the application. It affects the layout in memory at run time, not the output of any given step of a build.
Annoying edge cases come up around internal object serialization, such as having to sort JSON keys in generated config files.
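For the JSON case the usual fix is to make the serializer itself deterministic. A minimal Python sketch of the idea (the config contents here are made up):

    import json

    config = {"zeta": 1, "alpha": {"b": 2, "a": 1}}

    # Sorting the keys and pinning the separators makes the serialized form
    # depend only on the data, not on insertion order or serializer defaults.
    stable = json.dumps(config, sort_keys=True, separators=(",", ":"))
    print(stable)  # {"alpha":{"a":1,"b":2},"zeta":1}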
Since the build is reproducible, it should not matter when it was built. If you want to trace a build back to its source, there are much better ways than a timestamp.
C compilers offer __DATE__ and __TIME__ macros, which expand to string constants that describe the date and time that the preprocessor was invoked. Any code using these would have different strings each time it was built, and would need to be modified. I can't think of a good reason for them to be used in an actual production program, but for whatever reason, they exist.
This is not quite right. At least in Debian, only files that are newer than some standardised date are clamped to that standardised date. This "clamping" preserves any metadata in older files.
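A minimal Python sketch of that clamping rule, purely as an illustration (the paths and fallback date are made up, and the reference date would normally come from the SOURCE_DATE_EPOCH convention; this is not Debian's actual tooling):

    import os
    import tarfile

    # Reference date; reproducible-builds tooling conventionally takes it
    # from the SOURCE_DATE_EPOCH environment variable.
    SOURCE_DATE_EPOCH = int(os.environ.get("SOURCE_DATE_EPOCH", "1700000000"))

    def clamp_mtime(tarinfo):
        # Only timestamps newer than the reference date get clamped, so
        # genuinely old files keep their original metadata.
        if tarinfo.mtime > SOURCE_DATE_EPOCH:
            tarinfo.mtime = SOURCE_DATE_EPOCH
        return tarinfo

    with tarfile.open("artifact.tar.gz", "w:gz") as tar:
        tar.add("build-output/", filter=clamp_mtime)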
Maybe a dumb question, but why would this change the reproducibility? If you clone a git repo, do you not get the metadata as it is stored in git? Or would the files have the modification date of the cloning?
You clone the source from git, but then you use it to build some artifacts. The artifacts' build times may differ, yet with reproducible builds the artifacts should still match.
Those aren't needed to generate a hash of a file. And that metadata isn't part of the file itself (or at least doesn't need to be); it's part of the filesystem or OS.
That's an acceptable answer for the simple case when you distribute just a file, but what if your distribution is something more complex, like an archive with some sub-archives? Metadata in the internal files will affect the checksum of the resulting archive.
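It's easy to demonstrate that point: two archives whose members have identical contents, differing only in the stored mtime, hash differently. A small Python illustration (filenames and values are invented):

    import hashlib
    import io
    import tarfile

    def tar_bytes(mtime):
        buf = io.BytesIO()
        data = b"identical contents"
        info = tarfile.TarInfo(name="payload.txt")
        info.size = len(data)
        info.mtime = mtime  # the only difference between the two archives
        with tarfile.open(fileobj=buf, mode="w") as tar:
            tar.addfile(info, io.BytesIO(data))
        return buf.getvalue()

    h1 = hashlib.sha256(tar_bytes(0)).hexdigest()
    h2 = hashlib.sha256(tar_bytes(1_700_000_000)).hexdigest()
    print(h1 == h2)  # False: the embedded mtime changes the archive's checksum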
> ... what about files metadata like creation/modification timestamps? Do they forge them?
That's the least difficult part of reproducible builds to solve, but yes.
The real question is: why, in the past, was an entire ecosystem created where non-determinism was the norm and everybody thought it was somehow ok?
Instead of asking "how does one achieve reproducibility?" we may wonder "why did people go out of their way to make sure something as simple as a timestamp would screw determinism?".
For that's the anti-security mindset we have to fight. And Debian did.
TBH security is sometimes the source of the issues, as it often involves adding randomness. For example, replacing deterministic hashes with keyed hashes to protect against hash-flooding DoS led to deterministic output becoming nondeterministic (e.g. when displaying a hash table in its natural order).
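Python's string hashing is a handy illustration of this: it is keyed per process precisely to resist hash flooding, so any output that depends on hash order can change between runs unless you sort it first:

    # Run this twice: unless PYTHONHASHSEED is pinned, Python randomizes its
    # string hash per process (a keyed hash, to resist hash-flooding DoS),
    # so the set's iteration order can change from run to run.
    names = {"alice", "bob", "carol", "dave"}
    print(list(names))    # order may differ between runs
    print(sorted(names))  # sorting restores deterministic output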
It's my understanding that this is about generating the .iso file from the .deb files, not about generating the .deb files from source. Generating .deb from source in a reproducible way is still a work in progress.
Is the build infrastructure for Debian also reproducible? It seems like if someone wants to inject malware into Debian package binaries (without injecting it into the source), they have to target the build infrastructure (compilers, linkers and whatever wrapper code is written around them).
Also, is someone else also compiling these images, so we have evidence that the Debian compiling servers were not compromised?
I think there's also a similar thing for the images, but I might be wrong and I definitely don't have the link handy at the moment.
There's lots of documentation about all of the things on Debian's site at the links in the brief. And LWN also had a story last year about Holger Levsen's talk on the topic from DebConf: https://lwn.net/Articles/985739/
Working on it! But in general the answer is that for most purposes it's good enough to show that many independently produced pieces of hardware can reproduce the same results.
If the build is reproducible inside VMs, then the build can be done on different architectures: say x86 and ARM. If we end up with the same live image, then we're talking about something entirely different: either both x86 and ARM are backdoored in the same way, or the attack is in software. Or there's no backdoor (which is a possibility we have to entertain too).
For user space? No, you can definitely do a stage 0 build which depends only on about 364 bytes of x86_64 binary (though ironically I haven't managed to get this to work for me yet).
The liability is EFI underneath that, and the Intel ring -1 stuff (which we should be mandating is open source).
Reproducible: If Alice and Bob both download and compile the same source code, Alice's binary is byte-for-byte identical to Bob's binary.
Normal: Before Debian's initiative to handle this problem, most people didn't think hard about all the ways system-specific differences might wind up in binaries. For example: __DATE__ and __TIME__ macros in C; parallel builds finishing in a different order; anything that produces a tar file (or zip, etc.), which by default usually asks the OS for the input files' modification times and puts those into the bytes of the archive; filesystems listing files in a directory in different orders, which can also end up preserved in tar/zip files or other places (a small sketch of that last one follows after this comment)...
Why it's important: With reproducible builds, anyone can check the official binaries of Debian match the source code. This means going forward, any bad actors who want to sneak backdoors or other malware into Debian will have to find a way to put it in the source code, where it will be easier for people to spot.
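To illustrate the directory-ordering point from the list above, here is a minimal Python sketch (the directory name is made up) of hashing a source tree without depending on the order the filesystem happens to return entries:

    import hashlib
    import os

    def tree_digest(root):
        # Hash a directory's contents independently of filesystem listing order.
        digest = hashlib.sha256()
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # force a stable traversal order
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                digest.update(path.encode())
                with open(path, "rb") as fh:
                    digest.update(fh.read())
        return digest.hexdigest()

    print(tree_digest("src"))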
The important property, that anyone can verify the untainted relationship between the binary and the source (provided we do the same for the toolchains, not relying on a blessed binary at any point), is only useful if people actually do verify outside the Debian sphere.
I hope they promote tools to enable easy verification on systems external to Debian's build machines.
As the 'xz' backdoor was in the source code, and remained there for a while before anyone spotted it, this doesn't necessarily guarantee that backdoors/malware won't make their way into the source of a very-widely-redistributed project.
So how do those work in these Debian reproducible builds? Do they outlaw those directives? Or do they set those based on something other than the current date and time? Or something else?
Open source means "you can see the code for what you run". Except... how do you know that your executables were actually built from that code? You either trust your distro, or you build it yourself, which can be a hassle.
Now that the build is reproducible, you don't need to trust your distro alone. It's always exactly the same binary, which means it'll have one correct sha256sum. You can have 10 other trusted entities build the same binary with the same code and publish a signature of that sha256sum, confirming they got the same thing. You can check all ten of those. The likelihood that 10 different entities are colluding to lie to you is a lot lower than just your distro lying to you.
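In practice that check is just comparing hashes. A rough Python sketch of what client-side verification could look like, with entirely made-up filenames and published values:

    import hashlib

    # Hypothetical checksums of the same image as published by independent
    # rebuilders (names and values are invented for illustration).
    published = {
        "debian.org":  "aaaa1111...",
        "rebuilder-a": "aaaa1111...",
        "rebuilder-b": "aaaa1111...",
    }

    h = hashlib.sha256()
    with open("debian-live.iso", "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    local = h.hexdigest()

    for source, expected in published.items():
        print(source, "matches" if expected == local else "MISMATCH")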
Reproducible builds actually solve a lot of problems. (Whether these are real problems, who really knows, but people spend a lot of money to solve them.)
At my last job, some team spent forever making our software build in a special federal government build cluster for federal government customers. (Apparently a requirement for everything now? I didn't go to those meetings.) They couldn't just pull our Docker images from Docker Hub; the container had to be assembled on their infrastructure. Meanwhile, our builds were reproducible and required no external dependencies other than Bazel, so you could git checkout our release branch, "bazel build //oci" and verify that the sha256 of the containers is identical to what's on Docker Hub. No special infrastructure necessary. It even works across architectures and platforms, so while our CI machines were linux / x86_64, you can build on your darwin / aarch64 laptop and get the exact same bytes, every time.
In a world where everything is reproducible, you don't need special computers to do secure builds. You can just build on a bunch of normal computers and verify that they all generate the same bytes. That's neat!
(I'll also note that the government's requirements made no sense. The way the build ended up working was that our CI system built the binaries, and then the binaries were sent to the special cluster, where a special Dockerfile assembled the binaries into the image that the customers would use. As far as I can tell, this offers no guarantee that the code we said was in the image was in the image, but it checked their checkbox. I don't see that stuff getting any better over the next 4 years, so...)
It's a link in a chain that allows you to trust programs you run.
- At the start of the chain, developers write software they claim is secure. But very few people trust the word of just one developer.
- Over time other developers look at the code and also pronounce it secure. Once enough independent developers from different countries and backgrounds do this, people start to believe it really is secure. As a measure of security this isn't perfect, but it is verifiable and measurable in the sense that more is always better, so if you set the bar very high you can be very confident.
- Somebody takes that code, goes through a complex process to produce a binary, releases it, and pronounces it is secure because it is only based on code that you trust, because of the process above. You should not believe this. That somebody could have introduced malicious code and you would never know.
- Therefore, before reproducible builds, your only way to get a binary you knew was built from code you had some level of trust in was to build it yourself. But most people can't do that, so they have to trust that Debian, Google, Apple, Microsoft or whoever haven't added any backdoors. Maybe people do place their faith in those companies, but it is misplaced. It's misplaced because countries like Australia have laws that allow them to compel such companies to silently introduce malicious code and distribute it to you. Australia's law is called the "Assistance and Access Bill (2018)". Countries don't introduce such laws for no reason. It's almost certain it is being used now.
- But now the build can be reproducible. That means many developers can obtain the same trusted source code from the source the original builder claimed he used, build the binary themselves, and verify it is identical to the original, publicly validating the claim. Once enough independent developers from different countries and backgrounds do this, people start to believe it really was built from the trusted sources.
- Ergo, reproducible builds allow everyone, as opposed to just software developers, to run binaries they can be very confident were built just from code that has some measurable and verifiable level of trustworthiness.
It's a remarkable achievement for other reasons too. Although the ideas behind reproducible builds are very simple, it turned out executing them was about as simple as other straightforward ideas like "let's put a man on the moon". It seems building something as complex as an entire OS this way was beyond any company, or capitalism/socialism/communism, or a country. It's the product of something we've only seen arise in the last 40 years, open source, and it was built by a bunch of idealistic volunteers who weren't paid to do it. To wit: it wasn't done by commercial organisations like Red Hat or Ubuntu. It was done by Debian. That said, other similar efforts have since arisen, like F-Droid, but they aren't on this scale.
Should be trivial to put in, if not. Install the package and maybe prepare some datasource hints while reproducing the image. Depends on where you'll be using it.
The trick will be in the details, as usual. User data that both does useful work... and plays nicely with immutability.
I suspect it would be more sensible to skip the gymnastics of trying to manicure something inherently resistant, and instead, lean in on reproducibility. Make it as you want it, skip the extra work.
Want another? Great - they're freely reproducible :)
The hard things involve things like unstable hash orderings, non-sorted filesystem listing, parallel execution, address-space randomization, ...
Yes. All archive entries, date/time source-code macros, and any other timestamps are set to a standardised date (in the past).
I never actually checked that.
Sorting had to be added to that kind of output.
Software was more artisanal in nature…
Every country in the world should have the capability of producing "good enough" hardware.
"Fully Countering Trusting Trust through Diverse Double-Compiling (DDC) - Countering Trojan Horse attacks on Compilers"
https://dwheeler.com/trusting-trust/
You must ultimately root trust in some set of binaries and any hardware that you use.
https://wiki.debian.org/ReproducibleBuilds/About
It validates that publicly available downloads aren't different from what is claimed.