duckerude commented on RFC 9839 and Bad Unicode   tbray.org/ongoing/When/20... · Posted by u/Bogdanp
dcrazy · 3 days ago
The Unicode spec itself is designed around UTF-16: the block of code points that surrogate pairs would map to is reserved for that purpose and explicitly given “no interpretation” by the spec. [1] An implementation has to choose how to behave if it encounters one of these reserved code points in e.g. a UTF-8 string: Throw an encoding error? Silently drop the character? Convert it to a Replacement Character (U+FFFD)?

[1] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

duckerude · 3 days ago
RFC 3629 says surrogate codepoints are not valid in UTF-8. So if you're decoding/validating UTF-8 it's just another kind of invalid byte sequence like a 0xFF byte or an overlong encoding. AFAIK implementations tend to follow this. (You have to make a choice but you'd have to make that choice regardless for the other kinds of error.)

If you run into this when encoding to UTF-8 then your source data isn't valid Unicode, and what to do depends on what the data really is. If you can validate at other boundaries then you won't have to deal with it there.
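To illustrate the decoding side, here is a minimal sketch using only the Rust standard library. The helper name `is_valid_utf8` is made up for this example; the point is just that a surrogate encoded in UTF-8 style is rejected the same way as an overlong encoding or a stray 0xFF byte.

```rust
/// Hypothetical helper: true if `bytes` is well-formed UTF-8 per the standard library.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    // What U+D800 (a surrogate) would look like if UTF-8 allowed it.
    let surrogate: [u8; 3] = [0xED, 0xA0, 0x80];
    // Overlong two-byte encoding of U+0000.
    let overlong: [u8; 2] = [0xC0, 0x80];
    // A byte that can never appear in UTF-8.
    let stray: [u8; 1] = [0xFF];

    // All three are simply invalid byte sequences to the decoder.
    assert!(!is_valid_utf8(&surrogate));
    assert!(!is_valid_utf8(&overlong));
    assert!(!is_valid_utf8(&stray));

    // One possible policy: lossy decoding substitutes U+FFFD.
    assert!(String::from_utf8_lossy(&surrogate).contains('\u{FFFD}'));

    // On the encoding side, a lone surrogate is not a valid `char`,
    // and UTF-16 input containing one is not valid Unicode.
    assert!(char::from_u32(0xD800).is_none());
    assert!(String::from_utf16(&[0xD800]).is_err());
    assert_eq!(String::from_utf16_lossy(&[0xD800]), "\u{FFFD}");

    println!("all three byte sequences were rejected");
}
```

Whether to error out or substitute U+FFFD is still the caller's choice; the sketch only shows that the surrogate case needs no special treatment.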

duckerude commented on Go is still not good   blog.habets.se/2025/07/Go... · Posted by u/ustad
klodolph · 4 days ago
This is one of those things that kind of bugs me about, say, OsStr / OsString in Rust. In theory, it’s a very nice, principled approach to strings (must be UTF-8) and filenames (arbitrary bytes, almost, on Linux & Mac). In practice, the ergonomics around OsStr are horrible. They are missing most of the API that normal strings have… it seems like manipulating them is an afterthought, and it was assumed that people would treat them as opaque (which is wrong).

Go’s more chaotic approach to allow strings to have non-Unicode contents is IMO more ergonomic. You validate that strings are UTF-8 at the place where you care that they are UTF-8. (So I’m agreeing.)

duckerude · 4 days ago
The big problem isn't invalid UTF-8 but invalid UTF-16 (on Windows et al). AIUI Go had nasty bugs around this (https://github.com/golang/go/issues/59971) until it recently adopted WTF-8, an encoding that was actually invented for Rust's OsStr.

WTF-8 has some inconvenient properties. Concatenating two strings requires special handling. Rust's opaque types can patch over this but I bet Go's WTF-8 handling exposes some unintuitive behavior.
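A rough sketch of why concatenation is awkward, assuming the usual description of WTF-8 (lone surrogates are stored in the three-byte UTF-8 layout, but a high surrogate followed by a low one must be stored as a single four-byte sequence). `encode_three_bytes` is a helper invented for this example, not a standard library function.

```rust
/// Hypothetical helper: encode a 16-bit value >= 0x800 in the three-byte
/// UTF-8 layout without rejecting surrogates ("generalized UTF-8").
fn encode_three_bytes(unit: u16) -> [u8; 3] {
    [
        0xE0 | (unit >> 12) as u8,
        0x80 | ((unit >> 6) & 0x3F) as u8,
        0x80 | (unit & 0x3F) as u8,
    ]
}

fn main() {
    let high: u16 = 0xD83D; // lone high surrogate
    let low: u16 = 0xDE00; // lone low surrogate

    // Each half on its own is ill-formed UTF-16...
    assert!(String::from_utf16(&[high]).is_err());
    assert!(String::from_utf16(&[low]).is_err());

    // ...but side by side they form a valid pair for U+1F600.
    let joined = String::from_utf16(&[high, low]).unwrap();
    assert_eq!(joined, "\u{1F600}");

    // Naively appending the two three-byte encodings yields six bytes,
    // not the four-byte UTF-8 encoding of U+1F600.
    let mut naive = Vec::new();
    naive.extend_from_slice(&encode_three_bytes(high));
    naive.extend_from_slice(&encode_three_bytes(low));
    assert_eq!(naive, [0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x80]);
    assert_ne!(naive.as_slice(), joined.as_bytes());

    println!("naive: {:02X?}, expected: {:02X?}", naive, joined.as_bytes());
}
```

So a correct concatenation has to look at the seam between the two strings and merge a trailing high surrogate with a leading low surrogate.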

There is a desire to add a normal string API to OsStr but the details aren't settled. For example: should it be possible to split an OsStr on an OsStr needle? This can be implemented but it'd require switching to OMG-WTF-8 (https://rust-lang.github.io/rfcs/2295-os-str-pattern.html), an encoding with even more special cases. (I've thrown my own hat into this ring with OsStr::slice_encoded_bytes().)

The current state is pretty sad yeah. If you're OK with losing portability you can use the OsStrExt extension traits.
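For the non-portable route, a small Unix-only sketch with `std::os::unix::ffi::OsStrExt`. The function `split_once_byte` is a name made up here; it stands in for the kind of string API that `OsStr` itself doesn't offer yet.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: this will not compile on Windows

/// Hypothetical helper: split an OsStr at the first occurrence of `sep`.
fn split_once_byte(s: &OsStr, sep: u8) -> Option<(&OsStr, &OsStr)> {
    let bytes = s.as_bytes();
    let idx = bytes.iter().position(|&b| b == sep)?;
    Some((
        OsStr::from_bytes(&bytes[..idx]),
        OsStr::from_bytes(&bytes[idx + 1..]),
    ))
}

fn main() {
    // 0xE9 on its own is not valid UTF-8, so this can't be a &str.
    let raw = OsStr::from_bytes(b"name=caf\xE9.txt");
    assert!(raw.to_str().is_none());

    // On Unix an OsStr is just bytes, so splitting on an ASCII byte is safe.
    let (key, value) = split_once_byte(raw, b'=').unwrap();
    assert_eq!(key, OsStr::new("name"));
    assert_eq!(value.as_bytes(), b"caf\xE9.txt");

    println!("{:?} -> {:?}", key, value);
}
```

On Windows the equivalent extension trait works with UTF-16 code units instead, so code like this has to be written twice or gated behind `cfg`.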

duckerude commented on Writing simple tab-completions for Bash and Zsh   mill-build.org/blog/14-ba... · Posted by u/lihaoyi
oezi · 17 days ago
Isn't there a standard flag which programs can implement to avoid writing this bash script?

Ideally this could all be part of a library such as argparse for typical cases, right?

duckerude · 16 days ago
Rust has the clap_complete package for its most popular arg parsing library: https://crates.io/crates/clap_complete

ripgrep exposes its (bespoke) shell completion and man page generation through a --generate option: rg --generate=man, rg --generate=complete-bash, etcetera. In xh (clap-based) we provide the same but AFAIK we're the only one to copy that interface.
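As a sketch of what that interface looks like from the program's side: this is not ripgrep's or xh's actual code, just a std-only illustration with invented names (`Generate`, `parse_generate`) that recognizes the two values mentioned above and leaves the rest out.

```rust
/// Hypothetical enum of things a `--generate=` option could emit.
/// Only the two values named in the comment above are modelled.
#[derive(Debug, PartialEq)]
enum Generate {
    Man,
    CompleteBash,
}

/// Hypothetical parser: returns the requested kind if `arg` is a
/// recognized `--generate=<kind>` argument, otherwise None.
fn parse_generate(arg: &str) -> Option<Generate> {
    match arg.strip_prefix("--generate=")? {
        "man" => Some(Generate::Man),
        "complete-bash" => Some(Generate::CompleteBash),
        _ => None,
    }
}

fn main() {
    for arg in std::env::args().skip(1) {
        match parse_generate(&arg) {
            // A real tool would print the man page or completion script here.
            Some(kind) => println!("would generate: {:?}", kind),
            None => println!("ignoring argument: {}", arg),
        }
    }
}
```

The appeal of this shape is that packagers can produce completions at build time by running the binary, without the user needing a runtime hook.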

Symfony (for PHP) provides some kind of runtime completion generation but I don't know the details.

duckerude commented on In POSIX, you can theoretically use inode zero   utcc.utoronto.ca/~cks/spa... · Posted by u/mfrw
duckerude · 3 months ago
See also: https://internals.rust-lang.org/t/can-the-standard-library-s...

A file descriptor can't be -1 but it's not 100% clear whether POSIX bans other negative numbers. So Rust's stdlib only bans -1 (for a space optimization) while still allowing for e.g. -2.
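The space optimization can be observed directly, assuming a Unix-like target where `std::os::fd` is available. The function name `option_owned_fd_is_free` is made up for this sketch.

```rust
use std::mem::size_of;
use std::os::fd::{OwnedFd, RawFd};

/// Hypothetical helper: true if wrapping an OwnedFd in Option costs no space.
fn option_owned_fd_is_free() -> bool {
    size_of::<Option<OwnedFd>>() == size_of::<RawFd>()
}

fn main() {
    // -1 is never a valid OwnedFd, so the compiler can use it to represent None.
    assert!(option_owned_fd_is_free());

    // A plain integer has no spare value, so Option<RawFd> needs extra room.
    assert!(size_of::<Option<RawFd>>() > size_of::<RawFd>());

    println!(
        "Option<OwnedFd>: {} bytes, Option<RawFd>: {} bytes",
        size_of::<Option<OwnedFd>>(),
        size_of::<Option<RawFd>>()
    );
}
```

Only one forbidden value is needed for `Option`, which is why banning just -1 is enough for this purpose.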

duckerude commented on A surprising enum size optimization in the Rust compiler   jpfennell.com/posts/enum-... · Posted by u/returningfory2
NoTeslaThrow · 5 months ago
> This is a great way to see why invalid UTF-8 strings and unicode chars cause undefined behaviour in Rust.

What does "undefined behavior" mean without a spec? Wouldn't the behavior rustc produces today be de-facto defined behavior? It seems like the contention is violating some transmute constraint, but does this not result in reproducible runtime behavior? In what context are you framing "soundness"?

EDIT: I'm honestly befuddled why anyone would downvote this. I certainly don't think this is detracting from the conversation at all—how can you understand the semantics of the above comment without understanding what the intended meaning of "undefined behavior" or "soundness" is?

duckerude · 5 months ago
It means that anything strange that happens next isn't a language bug.

Whether something is a bug or not is sometimes hard to pin down because there's no formal spec. Most of the time it's pretty clear though. Most software doesn't have a formal spec and manages to categorize bugs anyway.
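A small sketch of the connection to the enum size optimization, using only the standard library (the helper name `option_char_is_free` is invented here). The compiler relies on some bit patterns never occurring in a `char`, which is why the safe constructors refuse them and why bypassing those checks leaves you outside what the language promises.

```rust
use std::mem::size_of;

/// Hypothetical helper: true if Option<char> is no bigger than char.
fn option_char_is_free() -> bool {
    size_of::<Option<char>>() == size_of::<char>()
}

fn main() {
    // Not every 32-bit value is a valid char, so a spare value can encode None.
    assert!(option_char_is_free());

    // The safe constructors refuse the values the compiler assumes never occur.
    assert!(char::from_u32(0xD800).is_none()); // surrogate
    assert!(char::from_u32(0x11_0000).is_none()); // beyond U+10FFFF
    assert_eq!(char::from_u32(0x41), Some('A'));

    // Same for strings: the checked conversion rejects invalid UTF-8.
    // The unchecked variants are `unsafe` precisely because skipping this
    // check breaks an assumption other code is allowed to rely on.
    assert!(std::str::from_utf8(&[0xFF]).is_err());

    println!("Option<char> is {} bytes", size_of::<Option<char>>());
}
```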

duckerude commented on argp: GNU-style command line argument parser for Go   github.com/tdewolff/argp... · Posted by u/networked
cb321 · 5 months ago
Literally any short option key that takes a string or a char that could legitimately start with '=' has the problem, though, not just `cut`. The '=' will be "eaten" by one tool and left in another. But you know that. You write a bit as if we disagree, but I don't see any real point of contention. :-) Also, uutils has a very strict "drop-in" agenda. So, as another e.g., if you want `cp -t=foo` or `cp -S=foo` to work the same, you're going to have trouble if that '=' is eaten.

So, in this case that would seem to imply a problem for any utilities with options taking strings-or-chars, not merely `cut -d`. If uutils really wants to be strictly drop-in compatible, they may well need to roll their own option parser or twist the arm of whoever's they use to provide a mode for them.

In the more general case, "cross compatibility" may just always be limited by the reality that people just disagree on this stuff "more than they seem to think they do" (at least in my experience) and definitely more than they wish they did. I surveyed my /usr/bin once and like half of thousands of commands did not work with --help (yes, running that took some confidence in backups! but anyone could replicate on a throwaway VM or something). Consistency is nice, but consistency with what? -l=foo is consistent with --long=foo, but not (some, but not other) historical things.

I'm not sure there will ever be a world in which you don't need to know which PLang/CL toolkit was used to make a CLI utility if you really want to know its syntax. The article's lib is going its own way from the Go stdlib. POSIX is pretty darn calcified. A 15 year old Python stdlib thing is unlikely to ever change in this regard. Python also allows "--beg" for "--beginning-long-option" if nothing else starts with "--beg" even back in its optparse days and that also tends to be controversial. cligen tools actually provide a --help-syntax. Maybe something like that could take off?

duckerude · 5 months ago
I can think of a lot of cases where it theoretically could be a problem, but `cut -d=` is the only one I've found so far where an end user ran into trouble because of this ambiguity, and I think it's the only one for which uutils bothers implementing a workaround. That's why I give it special attention.

> You write a bit as if we disagree, but I don't see any real point of contention. :-)

The `cut -d:=` spelling solves a different problem than the one I meant (and the one you're now talking about). But we're mostly on the same page!

duckerude commented on argp: GNU-style command line argument parser for Go   github.com/tdewolff/argp... · Posted by u/networked
cb321 · 5 months ago
Ha! I just said that. :-)

Anyway, one other alternative for the `cut` situation is to allow either ':' or '=' to optionally separate the key and the value. Then you can say `cut -d:=` or `cut -d=:` if you wanted to use either one. This is what https://github.com/c-blake/cligen does (for Nim, not Go).

duckerude · 5 months ago
The problem is existing shell scripts and muscle memory and command histories. `cut -d=` has always worked and works on all the other implementations so it should keep working if you switch to uutils.

duckerude commented on argp: GNU-style command line argument parser for Go   github.com/tdewolff/argp... · Posted by u/networked
nloomans · 5 months ago
I disagree with it being a minor issue. If I write a shell script around a program that accepts GNU-style arguments, I expect the following to be correct:

    ./cmd -a"$USER_CONTROLLED_DATA"

A program using this package would break that assumption, introducing a bug where this user-controlled data cannot start with an '='.

duckerude · 5 months ago
I researched this for my own argument parser (https://github.com/blyxxyz/lexopt/issues/13) and concluded that it's a minor issue.

This syntax is supported by argparse and clap, the most popular argument parsers for Python and Rust respectively, and it seems to have caused almost no problems for them. It's a problem for the uutils implementation of cut, since `cut -d=` is common, but that's the only instance I could find after a long time scouring search engines and bug trackers and asking for examples.

If anyone does know of other examples or other places this has been discussed I'd love to hear it though, maybe I just haven't found them.

(Also, the more reliable way to write this in general is `-a "$USER_CONTROLLED_DATA"`, since that'll behave correctly if $USER_CONTROLLED_DATA is empty. As will `-a="$USER_CONTROLLED_DATA"` if you know the command supports it.)
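To make the ambiguity concrete, here is a toy sketch of the two readings of an attached short-option value. It is not the code of lexopt, clap, argparse, or argp; `short_option_value` and its `eat_equals` switch are invented for illustration.

```rust
/// Hypothetical parser for a single `-xVALUE` argument.
/// With `eat_equals` set, a leading '=' in the value is treated as a
/// separator and dropped; otherwise it is kept as part of the value.
fn short_option_value(arg: &str, eat_equals: bool) -> Option<(char, &str)> {
    let rest = arg.strip_prefix('-')?;
    let mut chars = rest.chars();
    let name = chars.next()?;
    if name == '-' {
        return None; // long option or "--", not handled in this sketch
    }
    let value = chars.as_str();
    let value = if eat_equals {
        value.strip_prefix('=').unwrap_or(value)
    } else {
        value
    };
    Some((name, value))
}

fn main() {
    // Traditional reading: everything after the option letter is the value.
    assert_eq!(short_option_value("-d=", false), Some(('d', "=")));
    // Reading that accepts `-d=value`: the '=' is eaten and the value is empty.
    assert_eq!(short_option_value("-d=", true), Some(('d', "")));

    // Values that don't start with '=' are unaffected either way.
    assert_eq!(short_option_value("-d:", false), Some(('d', ":")));
    assert_eq!(short_option_value("-d:", true), Some(('d', ":")));

    println!("`cut -d=` is the case where the two readings disagree");
}
```

Passing the value as a separate argument (`-a "$USER_CONTROLLED_DATA"`) sidesteps the question, since no reading strips anything from a detached value.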

duckerude commented on A 10x Faster TypeScript   devblogs.microsoft.com/ty... · Posted by u/DanRosenwasser
tinco · 6 months ago
It's not just a pity, it's very surprising. In my eyes Go is a direct competitor of C#. Whenever you pick Go for a project, C# should have been a serious consideration. Hejlsberg designed C#, and it is astounding that a team in which he is an authority figure would opt for Go, a language that frankly I would not consider building a compiler in.

Not saying that in a judgemental way, I'm just genuinely surprised. What does this say about what Hejlsberg thinks of C# at the moment? I would assume one reason they don't pick C# is because it's deeply unpopular in the open source world. If Microsoft was so successful in making Typescript popular for open source work, why can't they do it for C#?

I have not opted to use C# for anything significant in the past decade or so. I am not 100% sure why, but there's always been something I'd rather use. Whether that's Go, Rust, Ruby or Haskell. I always enjoyed working in C#, I think it's a well designed and powerful language even if it never made the top of my list recently. I never considered that there might be something so fundamentally wrong with it that not even Hejlsberg himself would use it to build a Typescript compiler.

What's wrong with C#?

duckerude · 6 months ago
Anders Hejlsberg explains here: https://youtu.be/10qowKUW82U?t=1154. TL;DW:

- C# is bytecode-first, Go targets native code. While C# does have AOT capabilities nowadays this is not as mature as Go's and not all platforms support it. Go also has somewhat better control over data layout. They wanted to get as low-level as possible while still having garbage collection.

- This is meant to be something of a 1:1 port rather than a rewrite, and the old code uses plain functions and data structures without an OOP style. This suits Go well while a C# port would have required more restructuring.

duckerude commented on Fish 4   github.com/fish-shell/fis... · Posted by u/SteveHawk27
ivanjermakov · 6 months ago
I'm surprised Rust took more lines. Perhaps formatter makes shorter lines due to method chaining? Or improved test suite?

duckerude · 6 months ago
According to the blog post (https://fishshell.com/blog/rustport/#fn:formatting):

> A lot of the increase in line count can be explained by rustfmt’s formatting, as it likes to spread code out over multiple lines [...] The rest is additional features. Also note that our Rust code is in some places a straight translation of the C++, and fully idiomatic Rust might be shorter.

u/duckerude

Karma: 1848 · Cake day: December 10, 2017