The provenance memory model for C

At least at a skim, what this specifies for exposure/synthesis on reads/writes of the object representation is concerning. One of the consequences is that dead integer loads cannot be eliminated, as they may have an exposure side effect. I guess C might be able to get away with it due to the interaction with strict aliasing rules. Still, I'm quite surprised that they are going against consensus here (which also reduces the likelihood that these semantics will get adopted by implementers).

ben0x539 · 7 months ago

Can you say more about what the consensus is that this is going against?

nikic · 7 months ago

That type punning through memory does not expose or synthesize memory. There are some possible variations on this, but the most straightforward is that pointer to integer transmutes just return the address (without exposure) and integer to pointer transmutes return a pointer with nullary provenance.
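To make the distinction concrete, here is a minimal sketch of the two ways to get an address out of a pointer. The function names are made up for illustration, and it assumes a mainstream platform where `uintptr_t` is pointer-sized and a cast yields the same bits as the object representation (the standard does not promise that).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(uintptr_t) == sizeof(int *),
              "sketch assumes a pointer-sized uintptr_t");

/* Cast: under the proposed C model this exposes the storage p points to. */
static uintptr_t addr_by_cast(int *p) {
    return (uintptr_t)p;
}

/* Transmute: copy the pointer's bytes into an integer object.
   Whether this should count as an exposure is the point under debate. */
static uintptr_t addr_by_transmute(int *p) {
    uintptr_t bits;
    memcpy(&bits, &p, sizeof bits);
    return bits;
}
```

Both functions return the same number on common targets; the models only disagree about what happens to the provenance bookkeeping along the way.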

comex · 7 months ago

> I guess C might be able to get away with it due to the interaction with strict aliasing rules.

But not for char-typed accesses. And even for larger types, I think you would have to worry about the combo of first memcpying from pointer-typed memory to integer-typed memory, then loading the integer. If you eliminate dead integer loads, then you would have to not eliminate the memcpy.
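A rough sketch of that combo, with a hypothetical function name and the assumption that `uintptr_t` is pointer-sized. The integer load at the end only ever sees `uintptr_t`-typed memory, so type-based reasoning alone cannot tell that pointer bytes are behind it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(uintptr_t) == sizeof(int *),
              "sketch assumes a pointer-sized uintptr_t");

static int launder_and_drop(int *p) {
    int *ptr_slot = p;      /* pointer-typed memory */
    uintptr_t int_slot;     /* integer-typed memory */

    /* Step 1: byte copy from the pointer object into the integer object. */
    memcpy(&int_slot, &ptr_slot, sizeof int_slot);

    /* Step 2: integer-typed load whose result is never used.
       If this load is an exposure, deleting it changes the abstract
       machine state unless the memcpy above is kept and treated as
       the exposing operation instead. */
    uintptr_t bits = int_slot;
    (void)bits;

    return *p;
}
```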

nikic · 7 months ago

That's a great point. I initially thought we could assume no exposure for loads with non-pointer-compatible TBAA, but you are right that this is not correct if the memory has been laundered through memcpy.

alextingle · 7 months ago

I don't imagine that the exposed state would need to be represented in the final compiler output, so the optimiser could mark the pointer as exposed, but still eliminate the dead integer load.

Or from a pragmatic viewpoint, perhaps if the optimiser eliminates a dead load, then don't mark the pointer as exposed? After all, the whole point is to keep track of whether a synthesised pointer might potentially refer to the exposed pointer's storage. There's zero danger of that happening if the integer load never actually occurs.

Hercuros · 7 months ago

I guess the internal exposure state would be “wrong” if the compiler removes the dead load (e.g. in a pass that runs before provenance analysis).

However, if all of the program paths from that point onward behave the same as if the pointer was marked as exposed, that would be fine. It’s only “wrong” to track the incorrect abstract machine state when that would lead to a different behaviour in the abstract machine.

In that sense I suppose it’s no different from things like removing a variable initialisation if the variable is never used. That also has a side effect in the abstract machine, but it can still be optimised out if that abstract machine side effect is not observable.
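The variable-initialisation analogy can be sketched like this (function names are illustrative only). The first function has an extra abstract-machine side effect, but no caller can observe it, so a compiler may treat the two as the same function.

```c
#include <assert.h>

static int sum_with_dead_init(const int *v, int n) {
    assert(n >= 0);
    int scratch = 123;  /* initialisation that nothing ever reads */
    (void)scratch;
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += v[i];
    }
    return total;
}

static int sum_plain(const int *v, int n) {
    assert(n >= 0);
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += v[i];
    }
    return total;
}
```

The open question for dead integer loads is whether the exposure they cause is equally unobservable on every later path.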

uecker · 7 months ago

(Never mind, I misread your comment at first.) Yes, the representation access needs to be discussed... I took a couple of years to publish this document. More important would be if the ptr2int exposure could be implemented.

I love Rust, but I miss C. If C can be updated to make it generally socially acceptable for new projects, I'd happily go back for some decent subset of things I do. However, there's a lot of anxiety and even angst around using C in production code.

flohofwoe · 7 months ago

> to make it generally socially acceptable for new projects...

Or better yet, don't let 'social pressure' influence your choice of programming language ;)

If your workplace has a clear rule to not use memory-unsafe languages for production code that's a different matter of course. But nothing can stop you from writing C code as a hobby - C99 and later is a very enjoyable and fun language.

Y_Y · 7 months ago

I don't want to summon WB, but honest-to-god, D is a good middle ground here.

TimorousBestie · 7 months ago

> Or better yet, don't let 'social pressure' influence your choice of programming language ;)

It’s hard. Programming is a social discipline, and the more people who work in a language, the more love it gets.

xxs · 7 months ago

I was about to reply that no amount of pressure can tell me how to program. C was totally fine for esp32.

bnferguson · 7 months ago

Feels like Zig is starting to fill that role in some ways. Fewer sharp edges and a bit more safety than C, more modern approach, and even interops really well with C (even being possible to mix the two). Know a couple Rust devs that have said it seems to scratch that C itch while being more modern.

Of course it's still really nice to just have C itself being updated into something that's nicer to work with and easier to write safely, but Zig seems to be a decent other option.

dnautics · 7 months ago

(self-promotion) in principle one should be able to implement a fairly mature pointer provenance checker for zig, without changing the language. A basic proof of concept (don't use this, branches and loops have not been implemented yet):

https://www.youtube.com/watch?v=ZY_Z-aGbYm8

purplesyringa · 7 months ago

How close are Zig's safety guarantees to Rust's? Honest question; I don't follow Zig development. I can't take C seriously because it hasn't even bothered to define provenance until now, but as far as I'm aware, Zig doesn't even try to touch these topics.

Does Zig document the precise mechanics of noalias? Does it provide a mechanism for controllably exposing or not exposing provenance of a pointer? Does it specify the provenance ABA problem in atomics on compare-exchange somehow or is that undefined? Are there any plans to make allocation optimizations sound? (This is still a problem even in Rust land; you can write a program that is guaranteed to exhibit OOM according to the language spec, but LLVM outputs code that doesn't OOM.) Does it at least have a sanitizer like Miri to make sure UB (e.g. data races, type confusion, or aliasing problems) is absent?

If the answer to most of the above is "Zig doesn't care", why do people even consider it better than C?

pjmlp · 7 months ago

As usual, the remark that much of Zig's safety over C has been present since the late 1970s in languages like Modula-2, Object Pascal and Ada, but sadly they weren't born with curly brackets, nor did they bring a free OS to the uni party.

mikewarot · 7 months ago

If you can stomach the occasional Begin and End, and a far less confusing pointer syntax, Pascal might be the language for you. Free Pascal has some great string handling, so you never have to worry about allocating and freeing them, and they can store gigabytes of text, even Unicode. ;-)

jvanderbot · 7 months ago

If my fellow devs cringe at C, imagine their reaction to Pascal

tgv · 7 months ago

Or try Ada.

modeless · 7 months ago

Fil-C is a modified version of Clang that makes C and C++ memory safe. It supports things you wouldn't expect to work like signal handling or setjmp/longjmp. It can compile real C projects like SQLite and OpenSSL with minimal to no changes, today. https://github.com/pizlonator/llvm-project-deluge/blob/delug...

tialaramex · 7 months ago

Fil-C does seem like a quicker route if your existing idea was something like "rewrite it in Java" and it exists today whereas both C and C++ have only vague ambitions to deliver some future language which might meet your needs.

I will be very surprised if there's widespread adoption of Fil-C for many new projects though.

bmn__ · 7 months ago

https://github.com/tsoding/crust

uecker · 7 months ago

Do you really love Rust, or do you feel pressured to say so?

grg0 · 7 months ago

He grew up in a very stringent household. Everybody was writing Rust and he was like, "damn, I wish I could write C."

gavinray · 7 months ago

Also of interest to folks looking at this might be TySan, the recently-merged LLVM Type-Based Aliasing sanitizer:

https://clang.llvm.org/docs/TypeSanitizer.html

https://www.phoronix.com/news/LLVM-Merge-TySan-Type-Sanitize...

aengelke · 7 months ago

It's probably worth noting that TySan currently only catches aliasing violations that LLVM would be able to exploit. For some types, e.g. unions, Clang doesn't emit accurate type-based aliasing information and therefore TySan won't catch these.

Which is fine I think, considering that union type punning is legal in C (and even in C++ where union type punning is UB I have never seen it break - theoretically it might of course).
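For reference, this is the kind of union punning in question, next to the `memcpy` spelling that is fine in both C and C++. It is only a sketch: the function names are made up, and the expected bit pattern assumes IEEE 754 single-precision `float`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t),
              "sketch assumes a 32-bit float");

/* Write one member, read another: allowed in C, UB in C++. */
static uint32_t bits_via_union(float f) {
    union { float f; uint32_t u; } pun;
    pun.f = f;
    return pun.u;
}

/* Byte copy: the portable alternative. */
static uint32_t bits_via_memcpy(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```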

The problem might be that Clang does not even implement type-based aliasing correctly. So I assume it checks its broken rules, instead of the one specified in the C standard.

lioeters · 7 months ago

Looks like a code block didn't get closed properly, before this phrase:

> the functions `recip` and `recip⁺` and not equivalent

Several paragraphs after this got swallowed by the code block.

Edit: Oh, I didn't realize the article is by the author of the book, Modern C. I've seen it recommended in many places.

> The C23 edition of Modern C is now available for free download from https://hal.inria.fr/hal-02383654

zmodem · 7 months ago

> Looks like a code block didn't get closed properly

This seems to have been fixed now.

perching_aix · 7 months ago

I still see it, even after clearing caches, visiting from a separate browser from a separate computer (even a separate network).

johnisgood · 7 months ago

It is a great book. I prefer the second edition though, not the latest one with what I call "bloated C".

laqq3 · 7 months ago

I'm wondering if you could elaborate? I'd be curious to hear more about "bloated C" and the differences between the 2nd and 3rd edition.

shakabrah · 7 months ago

It made immediate sense to me it was Jen once I saw the code samples given

Presumably this was converted from markdown or similar and the conversion partly failed or the input was broken.

From the PVI section onward it seems to recover, but if the author sees this please fix and re-convert your post.

[Edited, nope, there are more errors further in the text, this needed proper proofreading before it was posted, I can somewhat struggle through because I already know this topic but if this was intended to introduce newcomers it's probably very confusing]

gustedt · 7 months ago

The problem is that wordpress changes these things once you edit in some part. I will probably regenerate the whole.

Randomly introduced translation errors from markdown to wordpress-internal should be fixed now. Sorry for the inconvenience!

cryptonector · 7 months ago

There are some grammar errors here and there, but TFA is very nice. Thank you for your hard work!

Measter · 7 months ago

In the section about the ambiguous provenance from synthesising pointers, it's explained that the compiler will infer the correct provenance from usage. Would it not be worth having some way for the programmer to inform the compiler directly, with something analogous to Rust's Strict Provenance ptr::with_addr?

To convert it to C syntax, it's a function with roughly this signature:

    void* with_addr(void* ptr, uintptr_t addr)

Where the returned pointer has the address of `addr` and the provenance of `ptr`.
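A rough approximation in today's C, as a sketch only: derive the result from `ptr` by pointer arithmetic so that it keeps `ptr`'s provenance. Two caveats that are assumptions of this sketch rather than part of the proposal: the `(uintptr_t)ptr` cast is itself an exposing operation in the proposed model, and the result is only valid if `addr` lies within (or one past) the object `ptr` points into.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static void *with_addr(void *ptr, uintptr_t addr) {
    /* Signed byte distance from ptr to the requested address. */
    ptrdiff_t delta = (ptrdiff_t)(addr - (uintptr_t)ptr);
    /* Arithmetic on ptr, so the result is derived from ptr. */
    return (char *)ptr + delta;
}
```

For example, with `int a[4]`, calling `with_addr(a, (uintptr_t)&a[2])` yields a pointer equal to `&a[2]`.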

I'd also like to have builtin functions and/or function attributes for designating allocation and deallocation. malloc() and free() (and realloc()) should not be special because of their names -- they should be special because of their declared attributes or their derived attributes given their internals.
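Something in this direction already exists as a compiler extension. A minimal sketch using the GCC/Clang `malloc` and `alloc_size` function attributes; `pool_alloc` and `pool_free` are hypothetical wrappers, and the macro expands to nothing on other compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#if defined(__GNUC__) || defined(__clang__)
#  define ATTR_ALLOC(size_idx) __attribute__((malloc, alloc_size(size_idx)))
#else
#  define ATTR_ALLOC(size_idx)
#endif

/* Declared as an allocator: the result aliases nothing else,
   and argument 1 gives the size of the returned block. */
static void *pool_alloc(size_t n) ATTR_ALLOC(1);
static void pool_free(void *p);

static void *pool_alloc(size_t n) {
    return malloc(n ? n : 1);
}

static void pool_free(void *p) {
    free(p);
}
```

Newer GCC versions also accept a form of the `malloc` attribute that names the matching deallocator, which is closer to what is asked for here, but none of this is standard C.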

charleslmunger · 7 months ago

This is doable via this trick:

https://github.com/protocolbuffers/protobuf/blob/ae0129fcd01...

The proposal is mostly designed this way to make sure existing code is valid. One could add something like "with_addr", but I am not convinced that it is really worth it.

zombot · 7 months ago

Does C allow Unicode identifiers now, or is that pseudo code? The code snippets also contain `&amp;`, so something definitely went wrong with the transcoding to HTML.

Besides the sibling comment on C23, it does work fine on GCC.

https://godbolt.org/z/qKejzc1Kb

Whereas clang loudly complains,

https://godbolt.org/z/qWrccWzYW

qsort · 7 months ago

Quoting cppreference:

An identifier is an arbitrarily long sequence of digits, underscores, lowercase and uppercase Latin letters, and Unicode characters specified using \u and \U escape notation(since C99), of class XID_Continue(since C23). A valid identifier must begin with a non-digit character (Latin letter, underscore, or Unicode non-digit character(since C99)(until C23), or Unicode character of class XID_Start)(since C23)). Identifiers are case-sensitive (lowercase and uppercase letters are distinct). Every identifier must conform to Normalization Form C.(since C23)

In practice depends on the compiler.

dgrunwald · 7 months ago

But the source character set remains implementation-defined, so compilers do not have to directly support unicode names, only the escape notation.
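A small sketch of the escape notation, which is the only spelling a compiler has to accept. The function name is made up; `caf\u00E9` spells the identifier "café", and U+00E9 is an ordinary letter, so it should be acceptable under both the older rules and the C23 XID rules.

```c
#include <assert.h>

static int count_accented(void) {
    int caf\u00E9 = 2;       /* universal character name inside an identifier */
    caf\u00E9 += 1;          /* same object as above */
    assert(caf\u00E9 == 3);
    return caf\u00E9;
}
```

Whether the same identifier may also be typed directly as UTF-8 in the source file depends on the compiler and its input encoding settings.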

Definitely a questionable choice to throw off readers with unicode weirdness in the very first code example.

Implementation-defined until C99, explicitly possible via UCNs since C99, possible with explicit encoding since C23, but literals are still implementation-defined.

unwind · 7 months ago

I can't even view the post, I just get some kind of content management system-like with the page as JSON or something, in pink-on-white. I'm super confused. :|

The answer to your question seems to (still) be "no".