It's probably worth noting that TySan currently only catches aliasing violations that LLVM would be able to exploit. For some types, e.g. unions, Clang doesn't emit accurate type-based aliasing information and therefore TySan won't catch these.
Which is fine I think, considering that union type punning is legal in C (and even in C++ where union type punning is UB I have never seen it break - theoretically it might of course).
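For anyone unsure what "union type punning" means here, a minimal sketch in C follows. The helper name `float_bits` is made up for illustration, and the code assumes a 32-bit `float`; the exact bit pattern you get back additionally depends on the platform's floating-point format.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Assumption: float is 32 bits wide, as on most current platforms. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Hypothetical helper: reinterpret the bytes of a float as an integer. */
uint32_t float_bits(float f)
{
    union { float f; uint32_t u; } pun;
    pun.f = f;      /* store through one member */
    return pun.u;   /* read through the other: allowed in C, UB in C++ */
}
```

On an IEEE 754 platform `float_bits(1.0f)` should give `0x3F800000`.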
The problem might be that Clang does not even implement type-based aliasing correctly. So I assume it checks its own broken rules, instead of the ones specified in the C standard.
Presumably this was converted from markdown or similar and the conversion partly failed or the input was broken.
From the PVI section onward it seems to recover, but if the author sees this please fix and re-convert your post.
[Edited, nope, there are more errors further in the text, this needed proper proofreading before it was posted, I can somewhat struggle through because I already know this topic but if this was intended to introduce newcomers it's probably very confusing]
In the section about the ambiguous provenance from synthesising pointers, it's explained that the compiler will infer the correct provenance from usage. Would it not be worth having some way for the programmer to inform the compiler directly, with something analogous to Rust's Strict Provenance ptr::with_addr?
To convert it to C syntax, it's a function with roughly this signature:
void* with_addr(void* ptr, uintptr_t addr)
Where the returned pointer has the address of `addr` and the provenance of `ptr`.
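A rough sketch of how one could approximate this in today's C, using pointer arithmetic so that the result is derived from `ptr`. This is only an illustration, not what a real builtin would do: it assumes a flat address space, it is only meaningful when `addr` lies inside the object `ptr` points into, and the cast of `ptr` to `uintptr_t` would itself count as an exposure under the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch only: returns a pointer with the address `addr`, derived from
 * `ptr` by pointer arithmetic so it carries ptr's provenance.
 * Assumes a flat address space and that `addr` is within ptr's object. */
void *with_addr(void *ptr, uintptr_t addr)
{
    assert(ptr != NULL);
    /* Signed distance from ptr's address to the requested address. */
    ptrdiff_t delta = (ptrdiff_t)(addr - (uintptr_t)ptr);
    return (char *)ptr + delta;
}
```

For example, with `int buf[4]`, the call `with_addr(buf, (uintptr_t)&buf[2])` should compare equal to `&buf[2]`.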
I'd also like to have builtin functions and/or function attributes for designating allocation and deallocation. malloc() and free() (and realloc()) should not be special because of their names -- they should be special because of their declared attributes or their derived attributes given their internals.
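As a partial illustration of the attribute-based approach: GCC and Clang already offer a `malloc` function attribute as an extension, which tells the optimiser that the returned pointer does not alias any other live pointer. The sketch below uses made-up names (`MY_ALLOC_ATTR`, `my_alloc`, `my_free`) and is not standard C; it does not cover everything asked for above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* The attribute is a GCC/Clang extension; expand to nothing elsewhere. */
#if defined(__GNUC__)
#define MY_ALLOC_ATTR __attribute__((malloc))
#else
#define MY_ALLOC_ATTR
#endif

/* Hypothetical allocator: special because of its attribute, not its name. */
MY_ALLOC_ATTR void *my_alloc(size_t n)
{
    return malloc(n);
}

/* Hypothetical matching deallocator. */
void my_free(void *p)
{
    free(p);
}
```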
The proposal is mostly designed this way to make sure existing code is valid. One could add something "with_addr", but I am not convinced that it is really worth it.
Does C allow Unicode identifiers now, or is that pseudo code? The code snippets also contain `&amp;`, so something definitely went wrong with the transcoding to HTML.
An identifier is an arbitrarily long sequence of digits, underscores, lowercase and uppercase Latin letters, and Unicode characters specified using \u and \U escape notation(since C99), of class XID_Continue(since C23). A valid identifier must begin with a non-digit character (Latin letter, underscore, or Unicode non-digit character(since C99)(until C23), or Unicode character of class XID_Start)(since C23)). Identifiers are case-sensitive (lowercase and uppercase letters are distinct). Every identifier must conform to Normalization Form C.(since C23)
Implementation-defined until C99, explicitly possible via UCNs since C99, possible with explicit encoding since C23, but literals are still implementation-defined.
I can't even view the post, I just get some kind of content-management-system-like view with the page as JSON or something, in pink-on-white. I'm super confused. :|
The answer to your question seems to (still) be "no".
At least at a skim, what this specifies for exposure/synthesis for reads/writes of the object representation is concerning. One of the consequences is that dead integer loads cannot be eliminated, as they may have an exposure side effect. I guess C might be able to get away with it due to the interaction with strict aliasing rules. Still, I'm quite surprised that they are going against consensus here (which also reduces the likelihood that these semantics will get adopted by implementers).
That type punning through memory does not expose or synthesize memory. There are some possible variations on this, but the most straightforward is that pointer to integer transmutes just return the address (without exposure) and integer to pointer transmutes return a pointer with nullary provenance.
> I guess C might be able to get away with it due to the interaction with strict aliasing rules.
But not for char-typed accesses. And even for larger types, I think you would have to worry about the combo of first memcpying from pointer-typed memory to integer-typed memory, then loading the integer. If you eliminate dead integer loads, then you would have to not eliminate the memcpy.
That's a great point. I initially thought we could assume no exposure for loads with non-pointer-compatible TBAA, but you are right that this is not correct if the memory has been laundered through memcpy.
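To make the memcpy case concrete, here is a small sketch. The function name `launder_and_read` is invented for illustration, and the code assumes `uintptr_t` has the same size as `int *`. The integer load is performed on memory declared as `uintptr_t`, so TBAA alone would not flag it as pointer-related, yet the bytes came from a pointer.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <string.h>

/* Assumption: the pointer representation fits a uintptr_t exactly. */
static_assert(sizeof(uintptr_t) == sizeof(int *), "pointer/integer size mismatch");

/* Hypothetical example: copy a pointer's bytes into integer-typed memory,
 * load the integer, then turn it back into a pointer and dereference it. */
int launder_and_read(int *p)
{
    uintptr_t bits;
    memcpy(&bits, &p, sizeof bits);  /* byte copy from pointer-typed memory */
    uintptr_t addr = bits;           /* the integer load in question */
    int *q = (int *)addr;            /* integer-to-pointer cast */
    return *q;
}
```

If the load of `bits` were dead and removed, the memcpy would be the only remaining operation that could carry the exposure.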
I don't imagine that the exposed state would need to be represented in the final compiler output, so the optimiser could mark the pointer as exposed, but still eliminate the dead integer load.
Or from a pragmatic viewpoint, perhaps if the optimiser eliminates a dead load, then don't mark the pointer as exposed? After all, the whole point is to keep track of whether a synthesised pointer might potentially refer to the exposed pointer's storage. There's zero danger of that happening if the integer load never actually occurs.
I guess the internal exposure state would be “wrong” if the compiler removes the dead load (e.g. in a pass that runs before provenance analysis).
However, if all of the program paths from that point onward behave the same as if the pointer was marked as exposed, that would be fine. It’s only “wrong” to track the incorrect abstract machine state when that would lead to a different behaviour in the abstract machine.
In that sense I suppose it’s no different from things like removing a variable initialisation if the variable is never used. That also has a side effect in the abstract machine, but it can still be optimised out if that abstract machine side effect is not observable.
(Never mind, I misread your comment at first.) Yes, the representation access needs to be discussed... I took a couple of years to publish this document. More important would be if the ptr2int exposure could be implemented.
I love Rust, but I miss C. If C can be updated to make it generally socially acceptable for new projects, I'd happily go back for some decent subset of things I do. However, there's a lot of anxiety and even angst around using C in production code.
> to make it generally socially acceptable for new projects...
Or better yet, don't let 'social pressure' influence your choice of programming language ;)
If your workplace has a clear rule to not use memory-unsafe languages for production code that's a different matter of course. But nothing can stop you from writing C code as a hobby - C99 and later is a very enjoyable and fun language.
Feels like Zig is starting to fill that role in some ways. Fewer sharp edges and a bit more safety than C, more modern approach, and even interops really well with C (even being possible to mix the two). Know a couple Rust devs that have said it seems to scratch that C itch while being more modern.
Of course it's still really nice to just have C itself being updated into something that's nicer to work with and easier to write safely, but Zig seems to be a decent other option.
(self-promotion) in principle one should be able to implement a fairly mature pointer provenance checker for zig, without changing the language. A basic proof of concept (don't use this, branches and loops have not been implemented yet):
How close are Zig's safety guarantees to Rust's? Honest question; I don't follow Zig development. I can't take C seriously because it hasn't even bothered to define provenance until now, but as far as I'm aware, Zig doesn't even try to touch these topics.
Does Zig document the precise mechanics of noalias? Does it provide a mechanism for controllably exposing or not exposing provenance of a pointer? Does it specify the provenance ABA problem in atomics on compare-exchange somehow or is that undefined? Are there any plans to make allocation optimizations sound? (This is still a problem even in Rust land; you can write a program that is guaranteed to exhibit OOM according to the language spec, but LLVM outputs code that doesn't OOM.) Does it at least have a sanitizer like Miri to make sure UB (e.g. data races, type confusion, or aliasing problems) is absent?
If the answer to most of the above is "Zig doesn't care", why do people even consider it better than C?
As usual, the remark that much of Zig's safety over C has been present since the late 1970s in languages like Modula-2, Object Pascal and Ada; sadly, they weren't born with curly brackets, nor did they bring a free OS to the uni party.
If you can stomach the occasional Begin and End, and a far less confusing pointer syntax, Pascal might be the language for you. Free Pascal has some great string handling, so you never have to worry about allocating and freeing them, and they can store gigabytes of text, even Unicode. ;-)
Fil-C is a modified version of Clang that makes C and C++ memory safe. It supports things you wouldn't expect to work like signal handling or setjmp/longjmp. It can compile real C projects like SQLite and OpenSSL with minimal to no changes, today. https://github.com/pizlonator/llvm-project-deluge/blob/delug...
Fil-C does seem like a quicker route if your existing idea was something like "rewrite it in Java" and it exists today whereas both C and C++ have only vague ambitions to deliver some future language which might meet your needs.
I will be very surprised if there's widespread adoption of Fil-C for many new projects though.
https://clang.llvm.org/docs/TypeSanitizer.html
https://www.phoronix.com/news/LLVM-Merge-TySan-Type-Sanitize...
> the functions `recip` and `recip⁺` and not equivalent
Several paragraphs after this got swallowed by the code block.
Edit: Oh, I didn't realize the article is by the author of the book, Modern C. I've seen it recommended in many places.
> The C23 edition of Modern C is now available for free download from https://hal.inria.fr/hal-02383654
This seems to have been fixed now.
https://github.com/protocolbuffers/protobuf/blob/ae0129fcd01...
https://godbolt.org/z/qKejzc1Kb
Whereas clang loudly complains,
https://godbolt.org/z/qWrccWzYW
In practice depends on the compiler.
Definitely a questionable choice to throw off readers with unicode weirdness in the very first code example.
It’s hard. Programming is a social discipline, and the more people who work in a language, the more love it gets.
https://www.youtube.com/watch?v=ZY_Z-aGbYm8