Readit News
dralley · 6 months ago
How realistic is it for the Trifecta Tech implementation to start displacing the "official" implementation used by Linux distros, which hasn't seen an upstream release since 2019?

Fedora recently swapped the original Adler zlib implementation for zlib-ng, so that sort of thing isn't impossible. You just need to provide a C ABI compatible with the original one.

Pesthuf · 6 months ago
If it hasn't seen an upstream release since 2019, doesn't that mean the implementation is just... finished? Maybe there are no more bugs to fix or features to add. And in that case, I don't see what's wrong with it.
LinusU · 6 months ago
Isn't 10-15% faster compression, and 5-10% faster decompression, a very nice "feature"?

> [...] doesn't that mean the implementation is just... finished?

I don't think it _necessarily_ means that - not all projects that haven't had a release since 2019 are finished. Probably most of them are simply abandoned.

On the other hand, a finished implementation is certainly a _possible_ explanation for why there have been no releases.

In this specific case, there are a handful of open bugs on their issue tracker. So that would indicate that the project isn't finished.

ref: https://sourceware.org/bugzilla/buglist.cgi?product=bzip2

wmf · 6 months ago
Ubuntu is using Rust sudo so it's definitely possible.
egorfine · 6 months ago
It's not. At least not yet. It's planned for 25.10, but thankfully sudo will be packaged and available for a few versions after that as promised [1].

[1] https://discourse.ubuntu.com/t/adopting-sudo-rs-by-default-i...

masfuerte · 6 months ago
They do provide a compatible C ABI. Someone "just" needs to do the work to make it happen.
tiffanyh · 6 months ago
I think that is the goal of uutils.

https://uutils.github.io/

coldpie · 6 months ago
1) This is a cool project and I wish them success. It would be really cool if these became the default utilities some day soon.

2) I think the MIT license was a mistake. These are often cloning GNU utilities, so referencing GNU source in its original language and then re-implementing it in Rust would be the obvious thing to do. But porting GPL-licensed code to an MIT licensed project is not allowed. Instead, the utilities must be re-implemented from scratch, which seems like a waste of effort. I would be interested in doing the work of porting GNU source to Rust, but I'm not interested in re-writing them all from scratch, so I haven't contributed to this project.

cocoa19 · 6 months ago
I hope some are improved too.

The performance boost in tools like ripgrep and tokei is insane compared to the tools they replace (grep and cloc respectively).

kpcyrd · 6 months ago
I briefly looked at this, and there's already a cargo-c configuration, which is good, but it's currently namespaced differently, so it won't get automatically detected by C programs as `libbz2`:

https://github.com/trifectatechfoundation/libbzip2-rs/blob/8...

I'm not familiar enough with the symbols of bzip2 to say anything about ABI compatibility.

I have a toy project to explore things like that, but it's difficult to set aside the amount of time needed to maintain an implementation of the GNU operating system. I would welcome pull requests though:

https://github.com/kpcyrd/platypos

rlpb · 6 months ago
> You just need to provide a C ABI compatible with the original one.

How does this interact with dynamic linking? Doesn't the current Rust toolchain mandate static linking?

alxhill · 6 months ago
The commenters below are confusing two things - Rust binaries can be dynamically linked, but because Rust doesn’t have a stable ABI you can’t do this across compiler versions the way you would with C. So in practice, everything is statically linked.
bluGill · 6 months ago
Rust cannot dynamically link to Rust. It can dynamically link to C and be dynamically linked by C. If you combine the two you can cheat, but it is still C that you are dealing with, not Rust, even if Rust is on both sides.
arcticbull · 6 months ago
Rust lets you generate dynamic C-linkage libraries.

Use crate-type=["cdylib"]
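
Roughly like this (a minimal sketch, not the actual libbzip2-rs code; the function name is made up, and on the 2024 edition the attribute is spelled #[unsafe(no_mangle)]):

```rust
// Cargo.toml needs:
//   [lib]
//   crate-type = ["cdylib"]   # build a .so/.dylib/.dll with C linkage

// Exported under an unmangled name with the platform's C calling convention,
// so a C program can link against it like any other C library.
#[no_mangle]
pub extern "C" fn bz_example_checksum(data: *const u8, len: usize) -> u32 {
    // Safety: the caller promises `data` is non-null and points to `len` readable bytes.
    let bytes = unsafe { std::slice::from_raw_parts(data, len) };
    // Toy checksum, just to have something to export.
    bytes.iter().fold(0u32, |acc, &b| acc.wrapping_add(b as u32))
}
```

A C header only needs `uint32_t bz_example_checksum(const uint8_t *data, size_t len);` and the symbol resolves as if it came from a C object file.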

nicoburns · 6 months ago
Dynamic linking works fine if you target the C ABI.
conradev · 6 months ago
Rust importing Rust must be statically linked, yes. You can statically link Rust into a dynamic library that other libraries link to, though!
timeon · 6 months ago
You can use dynamic linking in Rust with the C ABI, which means going through the `unsafe` keyword - also known as 'trust me bro'. Static linking directly against Rust source means it is checked by the compiler, so there is no need for unsafe.
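
On the calling side it looks roughly like this (a sketch; `BZ2_bzlibVersion` is part of the real bzlib.h API, and on the 2024 edition the block is spelled `unsafe extern "C"`):

```rust
use std::ffi::CStr;

// Declarations in an extern block are unchecked promises about what the C
// library exports, which is exactly why every call has to go through `unsafe`.
#[link(name = "bz2")]
extern "C" {
    fn BZ2_bzlibVersion() -> *const std::os::raw::c_char;
}

fn main() {
    // Safety: libbz2 returns a pointer to a static NUL-terminated string.
    let version = unsafe { CStr::from_ptr(BZ2_bzlibVersion()) };
    println!("linked against bzip2 {}", version.to_string_lossy());
}
```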
deknos · 6 months ago
I'm waiting for them to get to the hard stuff like awk, sed and grep.
GuB-42 · 6 months ago
ripgrep is one of the best grep replacements you can find, maybe even the best, and also one of the most famous Rust projects.

I don't know of a sed equivalent, but I guess that would be easy to implement as Rust has good regex support (see ripgrep), and 90%+ of sed usage is search-and-replace. The other commands don't look hard to implement and because they are not used as much, optimizing these is less of a priority.
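
Something like `sed 's/foo/bar/g'` is only a few lines with the regex crate (a rough sketch, nowhere near a full sed):

```rust
use std::io::{self, BufRead, Write};
use regex::Regex; // external crate: regex

// Minimal s/foo/bar/g over stdin -> stdout.
fn main() -> io::Result<()> {
    let re = Regex::new(r"foo").expect("invalid pattern");
    let stdout = io::stdout();
    let mut out = stdout.lock();
    for line in io::stdin().lock().lines() {
        let line = line?;
        // replace_all returns a Cow, so unchanged lines aren't reallocated.
        writeln!(out, "{}", re.replace_all(&line, "bar"))?;
    }
    Ok(())
}
```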

I don't know about awk, since it is a full programming language, but I guess it is far from impossible to implement.

Now the real hard part is making a true, bug-for-bug compatible replacement of the GNU version of these tools, but while good to have, it is not strictly necessary. For example, Busybox is very popular, maybe even more so than GNU in terms of number of devices, and it has its own (likely simplified) version of grep, sed and awk.

egorfine · 6 months ago
What would be the point?
rwaksmunski · 6 months ago
I use this crate to process hundreds of TB of Common Crawl data, so I appreciate the speedups.
viraptor · 6 months ago
What's the reason for using bz2 here? Wouldn't it be faster to do a one off conversion to zstd? It beats bzip2 in every metric at higher compression levels as far as I know.
rwaksmunski · 6 months ago
Common Crawl delivers the data as bz2. Indeed I store intermediate data in zstd with ZFS.
declan_roberts · 6 months ago
That assumes you're processing the data more than once.
anon-3988 · 6 months ago
Is this data available as torrents?
malux85 · 6 months ago
Yeah, came here to say that a 14% speedup in compression is pretty good!
aidenn0 · 6 months ago
bzip2 (particularly parallel implementations thereof) is already relatively competitive for compression. Decompression is where it lags behind, because LZ77-based algorithms can be incredibly fast at decompression.
koakuma-chan · 6 months ago
It's blazingly fast
firesteelrain · 6 months ago
Anyone know if this will by default resolve the 11 outstanding CVEs?

Ironically there is one CVE reported in the bzip2 crate

[1] https://app.opencve.io/cve/?product=bzip2&vendor=bzip2_proje...

tialaramex · 6 months ago
There's certainly a contrast between the "Oops a huge file causes a runtime failure" reported for that crate and a bunch of "Oops we have bounds misses" in C. I wonder how hard anybody worked on trying to exploit the bounds misses to get code execution. It may or may not be impossible to achieve that escalation.
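
To make the contrast concrete (a toy example, not the actual CVE): in safe Rust an out-of-range index is a guaranteed, immediate panic - a crash/DoS at worst - rather than a silent read or write past the buffer that someone might try to escalate into code execution.

```rust
fn main() {
    let buf = [0u8; 4];
    // Index computed at runtime, so the compiler can't reject it statically.
    let i = std::env::args().len() + 100;
    let _byte = buf[i]; // panics: "index out of bounds: the len is 4 but the index is ..."
}
```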
Philpax · 6 months ago
> The bzip2 crate before 0.4.4

They're releasing 0.6.0 today :>

wiz21c · 6 months ago
FTA:

> Why bother working on this algorithm from the 90s that sees very little use today?

What's in use nowadays? zstd?

ahh saw this: https://quixdb.github.io/squash-benchmark/

a-dub · 6 months ago
i'd be curious if they're using the same llvm codegen backend (with the same optimizations) for the c and rust versions. if so, where are the speedups coming from?

(ie, is it some kind of rust auto-simd thing, did they use the opportunity to hand-optimize other parts, is it making use of newer optimized libraries, or... other?)

eru · 6 months ago
Just speculating: Rust can hand over more hints to the code generator. Eg you don't have to worry about aliasing as much as with C pointers. See https://en.wikipedia.org/wiki/Aliasing_(computing)#Conflicts...
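
A toy example of the kind of thing this enables (not from the bzip2 code):

```rust
// Because `dst: &mut i32` is an exclusive reference, the compiler knows it
// cannot alias `src`, so `*src` can be loaded once and kept in a register.
// The equivalent C with plain `int *dst` / `const int *src` must reload
// `*src` after every store unless the programmer adds `restrict` by hand.
pub fn add_twice(dst: &mut i32, src: &i32) {
    *dst += *src;
    *dst += *src;
}
```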
MBCook · 6 months ago
This makes a lot of sense to me, though I don’t know the official answer so I’m just sort of guessing along too.

Linked from the article is another post on how they used c2rust to do the initial translation.

https://trifectatech.org/blog/translating-bzip2-with-c2rust/

For our purposes, it points out places where the code isn’t very optimal because the C code has no guarantees on the ranges of variables, etc.

It also points out a lot of people just use ‘int’ even when the number will never be very big.

But with the proper type the Rust compiler can decide to do something else if it will perform better.

So I suspect your idea that it allows unlocking better optimizations through more knowledge is probably the right answer.
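
A small example of what the "proper type" buys you (a sketch, not from the actual bzip2 code):

```rust
// Indexing a 256-entry table with a `u8` needs no runtime bounds check:
// the type alone proves the index is in range. A C `int` carries no such
// range information, and a Rust `usize` index here would need a check.
pub fn lookup(table: &[u32; 256], symbol: u8) -> u32 {
    table[symbol as usize]
}
```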

Too · 6 months ago
Ergonomics of using the right data structures and algorithms can also play a big role. In C, everything beyond a basic array is too much hassle.
adgjlsfhk1 · 6 months ago
C is honestly a pretty bad language for writing modern high-performance code. Between C99 and C23, there was a ~20-year gap where the language just didn't add the features needed to idiomatically target lots of the new instructions (without inline asm). Just getting good abstract-machine instructions for clz/popcnt/clmul/pdep etc. helps a lot for writing this kind of code.
zzo38computer · 6 months ago
Popcount, clz, and ctz are provided as nonstandard built-in functions in GCC (and clang might also support them in GNU mode, but I don't know for sure). PDEP and PEXT do not seem to be, but I think they should be (and PEXT is something that INTERCAL already had, anyway); PDEP and PEXT can be used via intrinsics with -mbmi2 on x86, but are not available for general use. The MOR and MXOR of MMIX are also something that I would want to be available as built-in functions.
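
For what it's worth, in Rust these are just methods on the integer types, e.g.:

```rust
fn main() {
    let x: u64 = 0b1011_0000;
    // These typically compile down to POPCNT / LZCNT / TZCNT (or cheap
    // fallbacks) without any compiler-specific builtins.
    println!("popcount = {}", x.count_ones());
    println!("clz      = {}", x.leading_zeros());
    println!("ctz      = {}", x.trailing_zeros());
    // PDEP/PEXT are exposed too, as core::arch::x86_64::_pdep_u64 / _pext_u64,
    // but those are `unsafe` and gated on the `bmi2` target feature.
}
```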
WhereIsTheTruth · 6 months ago
Any rewrite, in language X, Y, or Z, gives you the opportunity to speed things up; there is nothing inherent to Rust.
xvilka · 6 months ago
I hope they or Prossimo will also look at reimplementing, in a similar fashion, the core Internet protocols - BGP, OSPF, RIP and other routing implementations, DNS servers, and so on.
everfrustrated · 6 months ago
Check out

https://nlnet.nl/project/current.html
https://www.sovereign.tech/programs/fund

There's been good support over the last couple of years to fund rewriting critical internet & OS tools into safer languages like Rust.

Eg BGP in Rust https://www.nlnetlabs.nl/projects/routing/rotonda/

xvilka · 6 months ago
Thank you, precisely what I had in mind! Somehow I missed this project. As well as Holo[1] (routing)

[1] https://github.com/holo-routing/holo

dataking · 6 months ago
https://www.memorysafety.org/initiative/ - this page mentions TLS and DNS, which goes some way towards your suggestion.
throw10920 · 6 months ago
Is that domain actually about memory safety or about Rust?
nickpsecurity · 6 months ago
One guy did Ironsides DNS in SPARK Ada which has stronger proofs.
xvilka · 6 months ago
Nothing against Ada, it's a good language. The only problem would be finding contributors in that case.

broken_broken_ · 6 months ago
About not having perf on macOS: you can get quite far with dtrace for profiling. That's what the original flame graph script in Perl mentions using, and what the Rust flame graph reimplementation also uses. It doesn't have some metrics, like cache misses or micro-instructions retired, but it can still be very useful.
zoobab · 6 months ago
Lbzip2 had a much faster decompression speed, using all available CPU cores.

It's 2025, and most programs like Python are stuck at one CPU core.

guappa · 6 months ago
Thanks for showing us you have no understanding of python's situation.
pdimitar · 6 months ago
Feel free to enlighten future readers?