A byte string library for Rust

Burntsushi is a serious power contributor to the Rust ecosystem. Obviously ripgrep and the regex crate are top tier. Big thank you from someone who uses your work regularly :)

Also really cool to see more and more rust crates releasing their 1.0 versions. I feel like one of the big jumps for Rust (for me at least) will be when critical non-stdlib libraries are all 1.0 and there's less "which one do I choose". I understand competition is good in situations (more so opinionated frameworks) but I do love that I don't think about what regex library to choose since there's one obvious winner with tons of community support and resources.

m12k · 3 years ago

> I understand competition is good in situations

But also, consolidation at a given layer of an ecosystem allows that layer to be treated as a foundation so the next layer on top of it can be where the competition happens instead

bowsamic · 3 years ago

I don't think that's always a good thing. Every ecosystem I've seen that happen in has just bred an army of lazy developers who rely on libraries for basically everything.

dhosek · 3 years ago

I’ve always been a bit puzzled by the reluctance to release 1.0. I can understand if you’re flailing around about what the API should be or if you don’t think your code actually works, but otherwise, why not start at 1.0? The first public release of finl_unicode was 1.0 (there ended up being 1.0.1 and 1.0.2 to fix some issues with docs.rs requirements that could not be anticipated), but the API was predetermined and I have good tests, so I know my code is accurate. So why not release as 1.0?

burntsushi · 3 years ago

OK, so unfortunately, this issue gets really tangled. I could give you a short answer, but that will invite a question. And then my answer to that question is likely to invite another. I've had this conversation many times and it always goes the same way. So I will try to anticipate those questions, but... it's subtle.

> I’ve always been a bit puzzled

I'll respond with my reasons, but I want to emphasize that I am being descriptive, not prescriptive.

The main reason why I don't just start with 1.0 out of the gate is because I generally want 1.0 to indicate some level of maturity and stability. That is, once I publish 1.0, ideally, I won't publish a 2.0. Or if I do, that timeline will be measured in years. It takes a while to get that kind of confidence with a library's API. If I had started with bstr 1.0, then this blog post would be talking about bstr 3.0. Not 1.0. Empirically, bstr 1.0 would not have been the commitment to stability that I want 1.0 to mean.

So, first question at this point is usually: well why not just increase x in x.0.0 as needed? It's okay to have 1.0 and 2.0 and 3.0. We have semver after all!

What I say to that is, yes, absolutely, you can do that. But it's absolutely a preference with respect to how often you want to release breaking change releases. My preference is to do it very rarely. Or as rarely as I can manage. The main practical reason for it is that breaking change releases create churn, and they lead to transition periods where, in the best of cases, compilation times take a hit.

For example, if I released regex 2.0, no code would break. At some point, people would start migrating to it. And for some period of time, it's likely that many projects would be building both regex 1.0 and regex 2.0 in their dependency trees. regex is not exactly lightweight, and so people are going to hunt down these issues in their trees and get everyone to migrate to regex 2.0. It's work. It's tedious. It's annoying. If I start putting out new breaking change releases of the regex crate frequently, then I'm going to annoy people in a way that is proportional to the frequency of releases. By committing to a policy that 1.0 means "I'm unlikely to publish a breaking change release for at least a few years," then that 1.0 is going to be a signal to folks that they are signing up for a dependency that is probably not going to cause them churn.

It's also especially important for bstr, because folks want to use it as a public dependency. So if I'm releasing semver incompatible releases frequently, then that's going to cause a lot of painful churn for users of bstr. It no longer just becomes a matter of compilation times. But you'll need to get your entire dependency tree migrated over, or else you risk things not working if multiple crates try to interoperate via bstr's API.

I suppose the next question at this point is, "but it's just a version number, why attach special significance to it that isn't in semver?" semver is useful for communicating breaking changes. And I think it's also useful to use the version number to communicate stability as well. But to be totally clear here: I am (EDIT) NOT trying to be an advocate here. I'm not saying this is what you or what everyone should do. There are trade offs here. I tend to build library crates that others build on, so my bias is to move slowly. But if I built crates (and I do) that are closer to the application (or even an application itself), then I'm generally much happier to just push out breaking change releases at a higher frequency.

I think the last question is, "but anything goes in 0.x.y, so says semver, so now people never know if they're getting a breaking release or not." Indeed, that is what semver says, and if that were how Cargo implemented semver, I'd probably start with 1.0 releases. (Or at the very least, publish a 1.0 release much much sooner.) But Cargo does not implement semver that way. With Cargo, 0.x.y is semver incompatible with 0.(x+1).z. That is, incrementing the leftmost non-zero digit in a version creates a new semver incompatible release from Cargo's perspective. So I get all the benefits of semver when I use 0.x.y, without needing to publish 1.0.0. The main downside is that the 'minor' and 'patch' components of the version number get collapsed into one number. But I can live with that until I publish 1.0.
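To make the "leftmost non-zero digit" rule concrete, here is a minimal sketch of the compatibility check. It is an illustration only, not Cargo's actual resolver code: the `Version` alias and the `same_compat_range` function are names made up for this example, and pre-release tags and build metadata are ignored.

```rust
/// A (major, minor, patch) triple. Hypothetical type for this sketch;
/// pre-release and build metadata are deliberately left out.
type Version = (u64, u64, u64);

/// Returns true when `a` and `b` share the same leftmost non-zero component,
/// i.e. when moving from one to the other would not count as a breaking release.
fn same_compat_range(a: Version, b: Version) -> bool {
    if a.0 != 0 || b.0 != 0 {
        // 1.0.0 and later: only the major number decides compatibility.
        a.0 == b.0
    } else if a.1 != 0 || b.1 != 0 {
        // 0.x.y: the minor number plays the role of the breaking-change number.
        a.1 == b.1
    } else {
        // 0.0.z: every release is its own compatibility range.
        a.2 == b.2
    }
}

fn main() {
    // 1.2.0 -> 1.9.3 is compatible, 1.9.3 -> 2.0.0 is not.
    assert!(same_compat_range((1, 2, 0), (1, 9, 3)));
    assert!(!same_compat_range((1, 9, 3), (2, 0, 0)));
    // 0.2.1 -> 0.2.17 is compatible, 0.2.17 -> 0.3.0 is not.
    assert!(same_compat_range((0, 2, 1), (0, 2, 17)));
    assert!(!same_compat_range((0, 2, 17), (0, 3, 0)));
}
```

The second branch is where 'minor' and 'patch' collapse into one number: within 0.x, only the last component is left for non-breaking releases.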

loudmax · 3 years ago

I interpret a major version number release as a commitment that the API won't make breaking changes. So releasing a 1.0 version of a library is kind of like promising that you'll make some attempt not to drastically alter the behavior. If you're doing this as a hobby or side project you might not want to make that kind of commitment.

pizza234 · 3 years ago

Following semantic versioning, libraries are essentially allowed to change very liberally on 0.x, so it makes sense to reach 1.0 only once the crate/API is stable.

> Should byte strings be added to std?

> Some folks have expressed a desire for bstr or something like it to be put into the standard library. I’m not sure how I feel about wholesale adopting bstr as it is. bstr is somewhat opinionated in that it provides several Unicode operations (like grapheme, word and sentence segmentation), for example, that std has deliberately chosen to leave to the crate ecosystem.

Yes, ok, but could we -at least- have the same Debug impl bstr has ? I'd love to be able to print "human-readable" Vec<u8> :')

burntsushi · 3 years ago

That's what the very next paragraph addresses haha.

So the Debug impl for Vec<u8> is just the Debug impl for Vec<T>. Doing otherwise means specializing for Vec<u8>, and it's not totally clear to me that it makes sense to do that. Doing it effectively requires assuming that a Vec<u8> everywhere is UTF-8 or close to it.

I do mention that we could add a '[u8]::debug_utf8()' method that returns a type with a nice Debug impl for byte strings. Kind of like how we have 'Path::display()', but for the Display impl. But that is kind of annoying in a way that doesn't really apply to Display impls. It's very common to derive(Debug), and if the debug impl is only accessible via a method, then derive(Debug) doesn't work. So then you have to write your Debug impl by hand, which is... annoying.
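A rough sketch of what that looks like in practice. `DebugUtf8` is a hypothetical wrapper standing in for whatever a `debug_utf8()` method might return; it is not a real std or bstr type, and it uses lossy decoding for brevity, which is an assumption of this example rather than how bstr formats invalid bytes.

```rust
use std::fmt;

/// Hypothetical wrapper type (not part of std or bstr) with a readable Debug impl.
struct DebugUtf8<'a>(&'a [u8]);

impl fmt::Debug for DebugUtf8<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Lossy decoding substitutes U+FFFD for invalid UTF-8 sequences.
        write!(f, "{:?}", String::from_utf8_lossy(self.0))
    }
}

/// derive(Debug) goes through Vec<u8>'s own Debug impl, so `body` prints as numbers.
#[derive(Debug)]
struct Derived {
    body: Vec<u8>,
}

/// To get the readable form, the Debug impl has to be written by hand.
struct ByHand {
    body: Vec<u8>,
}

impl fmt::Debug for ByHand {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("ByHand")
            .field("body", &DebugUtf8(&self.body))
            .finish()
    }
}

fn main() {
    // prints: Derived { body: [104, 105] }
    println!("{:?}", Derived { body: b"hi".to_vec() });
    // prints: ByHand { body: "hi" }
    println!("{:?}", ByHand { body: b"hi".to_vec() });
}
```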

Anyway, point is, it's just not totally straightforward to bring bstr into std.

As I said in the blog, I think the highest value thing that could be brought into std is substring search that works on &[u8].
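For a sense of what people end up writing by hand today, here is a naive sketch of substring search on &[u8]. `find_bytes` is a name made up for this example, and the quadratic worst case is exactly what a proper implementation would avoid, so treat it as an illustration of the gap rather than a proposal.

```rust
/// Naive byte substring search, O(n * m) in the worst case.
/// `find_bytes` is a hypothetical helper, not a std or bstr function.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        // `windows(0)` panics, so handle the empty needle up front.
        return Some(0);
    }
    haystack
        .windows(needle.len())
        .position(|window| window == needle)
}

fn main() {
    // The haystack is not valid UTF-8, so `str::find` is not an option here.
    let haystack = b"foo \xFF bar";
    assert_eq!(find_bytes(haystack, b"bar"), Some(6));
    assert_eq!(find_bytes(haystack, b"baz"), None);
}
```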

zozbot234 · 3 years ago

> It's very common to derive(Debug), and if the debug impl is only accessible via a method, then derive(Debug) doesn't work.

The right way of doing this is to define custom attributes as part of derive(Debug) and derive(Display); the derive mechanism can already do this. There's no need for a wrapper type to be used.

fanf2 · 3 years ago

I am curious how bstr relates to OsString. Is the difference that OsString can be WTF16 on Windows?

dhosek · 3 years ago

I’ve toyed with seeing about adding a feature to finl_unicode to extend or replace the bstr implementations of segmentation, etc. but I don’t need it so I probably won’t. You’re welcome to steal my code though. (And hi from reddit-land!)

IshKebab · 3 years ago

That shouldn't be the default debug implementation. Plenty of people use `Vec<u8>` to store a list of numbers, not a string.

jjice · 3 years ago

vlmutolo · 3 years ago

I've been using this library in production to handle doing operations on domain names and it's been incredible. It's one of those things that's so easy to use it almost starts to seem simple. Like of course we need a library that looks just like this. It's obvious in hindsight, which speaks to great design.

It's especially helpful that the library doesn't require you to opt into its own dedicated types, and instead defines extension methods on existing types.

Thanks, Andrew!

> It's especially helpful that the library doesn't require you to opt into its own dedicated types, and instead defines extension methods on existing types.

Fun fact: bstr 0.1 went the route of defining its own dedicated types! See: https://docs.rs/bstr/0.1.4/bstr/

But it did indeed quickly prove to be pretty annoying. Because you still really want to use &[u8] in places because it's so ubiquitous. But to get access to the byte string methods, you had to explicitly convert it to another type.
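Roughly, the extension trait approach looks like the sketch below. `ByteSliceExt` and `split_once_byte` are made-up names for illustration, not bstr's real API; the point is only that the method becomes available on &[u8] and Vec<u8> directly, with no conversion step.

```rust
/// Hypothetical extension trait, shown only to illustrate the pattern.
trait ByteSliceExt {
    /// Splits at the first occurrence of `sep`, excluding the separator itself.
    fn split_once_byte(&self, sep: u8) -> Option<(&[u8], &[u8])>;
}

impl ByteSliceExt for [u8] {
    fn split_once_byte(&self, sep: u8) -> Option<(&[u8], &[u8])> {
        let i = self.iter().position(|&b| b == sep)?;
        Some((&self[..i], &self[i + 1..]))
    }
}

fn main() {
    // Works on an owned Vec<u8> through auto-deref...
    let owned: Vec<u8> = b"example.com".to_vec();
    assert_eq!(
        owned.split_once_byte(b'.'),
        Some((&b"example"[..], &b"com"[..]))
    );

    // ...and on a plain &[u8], without wrapping it in another type first.
    let borrowed: &[u8] = b"localhost";
    assert_eq!(borrowed.split_once_byte(b'.'), None);
}
```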

The reason why I went that route initially was so you'd always get the good Debug impl. But it ended up not being worth it sadly. This issue discusses it a bit more: https://github.com/BurntSushi/bstr/issues/5

epilys · 3 years ago

This should be in the standard library, but maybe I'm biased. In implementing ascii/utf8 plain text protocols like IMAP/SMTP just like the grep/ripgrep example in the article, I've had to limit myself to u8 slices just because the occasional byte might make a grapheme invalid.

chungy · 3 years ago

One thing I noticed in the middle, when concatenating Rust files for a demonstration:

> Note also that the files are sorted before concatenating, so that the result is guaranteed to be deterministic.

No locale was defined and the example sort command used cannot be considered deterministic. The results could vary wildly on different systems just through the locale alone!

Two solutions: define the locale ("LC_ALL=C" before the command should be sufficient), or use the -V flag on sort.
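For comparison, a small Rust sketch of the ordering the C locale gives you: plain byte-wise comparison, which cannot vary between systems. `sort_bytewise` and the file names are made up for this example.

```rust
/// Sorts names by their raw UTF-8 bytes. `str`'s ordering is already byte-wise,
/// so no locale setting can change the result. Hypothetical helper for this sketch.
fn sort_bytewise(mut names: Vec<&str>) -> Vec<&str> {
    names.sort();
    names
}

fn main() {
    let sorted = sort_bytewise(vec!["b.rs", "a.rs", "B.rs", "_z.rs"]);
    // Byte order: 'B' (66) < '_' (95) < 'a' (97) < 'b' (98).
    assert_eq!(sorted, ["B.rs", "_z.rs", "a.rs", "b.rs"]);
}
```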

Ooo, nice catch! Fixed: https://github.com/BurntSushi/blog/commit/a1b0e40cbfd293310b...

Awesome :)

Xeoncross · 3 years ago

If you haven't read https://blog.burntsushi.net/transducers/ yet, then IDK what you're doing with your life.

Fiahil · 3 years ago

baq · 3 years ago

as usual a reminder that ripgrep is the grep tool. if you're using vscode's Find in Files, you're using ripgrep.

gwbas1c · 3 years ago

I've always wanted something similar in C#. By convention, byte arrays are used for handling arbitrary blobs; but they are mutable and don't have the same kind of support that strings have. C# strings are immutable and have lots of supporting methods.

rednab · 3 years ago

These days C# has Span<byte> ¹) and ReadOnlySpan<byte> ²), which have a whole bunch of string-like methods, but the version of C# they require might be newer than you're happy with.

¹) https://docs.microsoft.com/en-us/dotnet/api/system.span-1

²) https://docs.microsoft.com/en-us/dotnet/api/system.readonlys...

> but the version of C# it requires might be newer than you're happy with

Oh, that's a solvable problem. Thanks!

db48x · 3 years ago

That is exactly what I need for one of my long–neglected projects. Thank you!