I write type-safe generic data structures in C

For your level 2 code, `uint64_t data[];` is wrong for types whose alignment is greater than `uint64_t`, and also wasteful for types whose alignment is smaller (for example, under an ilp32 ABI on 64-bit architectures).

For your level 3 code, it should be `int main() { List(Foo) foo_list = {NULL};`

Note that working around a lack of `typeof` means you can't return anything. Also, your particular workaround allows `const`ness errors since `==` is symmetrical.

You can't safely omit `payload` since you need it to know the correct size. Consider a `List(int64_t)` and you try to add an `int32_t` to it - this should be fine, but you can't `sizeof` the `int32_t`. Your code is actually lacking quite a bit to make this work.

=====

There are 2 major limitations to generics in C right now:

* Delegating to a vtable (internal or external) is limited in functionality, since structs cannot contain macros, only functions.

* Delegating to an external vtable (mandatory to avoid overhead) means that you have to forward-declare all of the types you'll ever use a vtable with. So far the best approach I've found is to declare (but not define) static functions in the same forwarding header I declare the typedefs in; note that GCC and Clang differ in what phase the "undefined static" warning appears in for the case where you don't actually include that particular type's header in a given TU.

(think about writing a function that accepts either `struct SizedBuffer {void *p; size_t len;};` or `struct BoundedBuffer {void *begin; void *end;};`, and also const versions thereof - all from different headers).

rectang · 7 months ago

> Delegating to an external vtable (mandatory to avoid overhead) means that you have to forward-declare all of the types you'll ever use a vtable with.

We went down the rabbit hole of writing a compiler for this as part of a project I used to work on (Apache Clownfish[1], a subproject of the retired Apache Lucy project). We started off parsing .h files, but eventually it made sense to create our own small header language (.cfh "Clownfish Header" files).

Here's some generated code for invoking the CharBuf version of the "Clone" method defined in parent class "Obj":

    typedef cfish_CharBuf*
    (*CFISH_CharBuf_Clone_t)(cfish_CharBuf* self);

    extern uint32_t CFISH_CharBuf_Clone_OFFSET;

    static inline cfish_CharBuf*
    CFISH_CharBuf_Clone(cfish_CharBuf* self) {
        const CFISH_CharBuf_Clone_t method
            = (CFISH_CharBuf_Clone_t)cfish_obj_method(
                self,
                CFISH_CharBuf_Clone_OFFSET
            );
        return method(self);
    }

Usage:

    cfish_CharBuf *charbuf = cfish_CharBuf_new();
    cfish_CharBuf *clone = CFISH_CharBuf_Clone(charbuf);

We had our reasons for going to these extremes: the point of Clownfish was to provide a least-common-denominator object model for bindings to multiple dynamic languages (similar problem domain to SWIG), and the .cfh files also were used to derive types for the binding languages. But there was truly an absurd amount of boilerplate being generated to get around the issue you identify.

This is why almost everybody just uses casts to void* for the invocant, skipping type safety.

[1] https://github.com/apache/lucy-clownfish

zem · 7 months ago

i am firmly of the opinion that compiling to c is a better route than doing clever c tricks to sort of get what you want. the compiler can be pretty minimal and as you note it pays for itself.

kccqzy · 7 months ago

> it should be `int main() { List(Foo) foo_list = {NULL};`

In C `int main()` means the function takes an unknown number of arguments. You need `int main(void)` to mean it doesn't take any arguments. This is a fact frequently forgotten by those who write C++.

flohofwoe · 7 months ago

That had been harmonized with C++ in C23 (e.g. func() is equivalent with func(void) now).

It's not really relevant for main() though, even in older C versions main() works fine and simply means "I don't need argc and argv".

tedunangst · 7 months ago

This is incorrect. In a function definition, an empty list means it takes no parameters. 6.7.5.3 Function declarators

> 14. An empty list in a function declarator that is part of a definition of that function specifies that the function has no parameters.

EPWN3D · 7 months ago

I would love for `union`s to be federated, that is, a type could declare itself as thought it was part of a union with another type, without having to pre-declare all possible types in one place.

Joker_vD · 7 months ago

Can't you just declare anonymous unions whenever? E.g.

    struct foo { ... };
    
    struct bar { ... };

    struct bar check_this_out(struct foo foo) {
        return ((union { struct foo foo; struct bar bar; }){ .foo = foo}).bar;
    }

    struct bar *also_this(struct foo *foo) {
        union { struct foo foo; struct bar bar; } *tmp = (void*)foo;
        return &tmp->bar;
    }

o11c · 7 months ago

For layout-compatible types, you can often just include a `_base` member in each child. Maybe twice (once named and once unnamed) to avoid excess typing - I don't understand the common-initial-subsequence rule but people do this enough that compilers have to allow it.

n_plus_1_acc · 7 months ago

This is also problematic, because there might be padding and the calculated size might be too small:

`malloc(sizeof(*node) + data_size);`

o11c · 7 months ago

There's no a problem with the author's current code, since the padding is already included in the node size, but it would be a problem after doing alignment more intelligently.

Deleted Comment

Dead Comment

The casting of the function type assumes that the item pointer type (e.g. Foo*) has the same representation as void*, which the C standard doesn’t guarantee (in standardese: the two types aren’t “compatible”). Calling the function with the converted type therefore constitutes undefined behavior. It also impacts aliasing analysis by compilers (see [0], incidentally), even if the pointer representation happens to be the same.

This casting of the functions to different argument types constitutes the core of the type safety of the generic invocations; I’m not sure it can be fixed.

[0] https://news.ycombinator.com/item?id=44421185

dhooper · 7 months ago

This is addressed in the footnotes. casting is not the core of the type safety. Read the whole article.

layer8 · 7 months ago

Ah, that’s what I get for not reading the footnotes. However, the alternative solution presented evaluates the item argument twice, which is problematic as well (but could probably be worked around by passing `(list)->payload` on instead). Secondly, the assignment for type-checking doesn’t work for read-only operations on a const List, or does it? And doesn’t the assignment overwrite the head? Lastly, the do-while construction means you can’t use it for operations that return a value (without compiler extensions).

I also don’t agree it’s “squeamish” to be wary of aliasing analysis going wrong. It’s not a clean abstraction and can hide subtle bugs down the road.

el_pollo_diablo · 7 months ago

Sure, but your alternative code incorrectly assigns to (list)->payload. You have many other options. Without typeof, you can if(0) the assignment, or check type compatibility with a ternary operator like 1 ? (item) : (list)->payload and pass that to _list_prepend, etc. With typeof, you can store item in a temporary variable with the same type as (list)->payload, or build a compound literal (typeof(*(list))){.payload=(item)}, etc.

Joker_vD · 7 months ago

Working with function pointers is always finicky. I once had MSVC fold together

    int f0(int x) { return x; }

    int f1(int x, int y) { return y; }

into a single function, so at runtime, (void(*)())f0 and (void(*)())f1 would compare equal. AFAIK, you are not guaranteed by the standard that functions of different signatures would, when their addresses are taken, result in different pointers, so it's not technically a bug... but it's quite surprising, and ruins certain tricks.

gritzko · 7 months ago

Hi. I object.

The trick#0 you mention is how I made an entire C dialect. Here is a generic binary heap, for example https://github.com/gritzko/librdx/blob/master/abc/HEAPx.h The syntax is a bit heavyweight, but a huge huge advantage is: you get regular C structs in the end, very plain, very predictable, very optimizable. Compiler would eat them like donuts.

In the other cases, it is void* and runtime memory sizing and you have to define macros anyway.

Author here. Binary heaps and linked lists are different use cases. A binary heap must read the data you put in it to store it correctly, but a linked list doesn't. If I were writing a generic binary heap, maybe I would weigh my options differently. I mentioned this in the footnotes.

wordglyph · 7 months ago

And that's why I like C++ templates

variadix · 7 months ago

I agree, there are actually several reasons to prefer the header impl. Debugging is better, both because you can step through the header code where you can’t with a macro function, and because the type information available to the debugger is better. There are more opportunities for compiler optimizations because each instantiation is monomorphized and you don’t pay a runtime cost with variable sizing, generic structures can also be placed on the stack because of the fixed sizing.

There are workarounds for at least two of the problems the author mentions. Naming can be changed from Bar_func(args…) to func(Bar)(args…) with a function name macro that just does name mangling. You can avoid some of the binary bloat by using weak symbols, letting the linker deduplicate functions shared between translation units at link time.

There are other problems for generic containers of pointer types however, you can work around them by using a typedef or a type alias.

Intrusive data structures are more convenient in C still, but working with them in a debugger is a pain.

Author here. It's worth noting that no work is being done in the macros of my article, they compile down to a normal c function call which you can step through in a debugger.

There is little benefit in monomorphizing the implementation of a data structure like a linked list where its behavior doesn't depend on the contents of the data it contains (compared to, say, a max heap)

knutwannheden · 7 months ago

> Compiler would eat them like donuts.

Made me laugh out loud!

cherryteastain · 7 months ago

Why would you jump through all these hoops instead of just writing C++ if you want "C with generics"

_proofs · 7 months ago

because i work on a legacy project that is coupled to safety regulations and other quality guarantees, and we cannot just simply roll out a solution ported to c++ on the next release, or even tenth, so perhaps we make it work until we can.

however we can set a standard and expectation for new projects to use c++, and we do and set an expectation to target a specific std.

i see this sentiment quite a lot on hackernews -- feels like a lot of people saying "git gud" -- i would expect a lot more nuance applied here.

pjmlp · 7 months ago

While using old compilers in certified domains is a common problem, unless we are talking about stuff like PIC, the choice of reaching out to C++ compilers is a given, even if what they might support is between C++98 and C++11.

gnulinux996 · 7 months ago

Out of curiosity, what is an example of a regulation that specifies using a specific programming language and/or toolchain out there?

Kranar · 7 months ago

Because for many of the use cases where C is used, switching to C++ involves jumping through even more hoops.

lionkor · 7 months ago

Do you have a couple of real world examples?

Because some people really hate C++ to their bones, hence this kind of stuff that keeps coming up.

I was really disappointed that Microsoft decided to backtrack on C++'s is the future, after the new found Linux and FOSS love.

https://herbsutter.com/2012/05/03/reader-qa-what-about-vc-an...

https://devblogs.microsoft.com/cppblog/c11-and-c17-standard-...

Not that it matters much, as nowadays C and C++ have a new policy at Microsoft due to goverments and cyberlaws.

https://azure.microsoft.com/en-us/blog/microsoft-azure-secur...

https://blogs.windows.com/windowsexperience/2024/11/19/windo...

petabyt · 7 months ago

The real answer: it's more fun this way.

brunker2 · 7 months ago

Why would you write C++ if you can get the same result by jumping through a few hoops with C?

zabzonk · 7 months ago

Templates in C++ require language support - you can't simply implement them with "a few hoops" in C.

uecker · 7 months ago

It is cool trick. I already use in my experimental library though ;-) https://github.com/uecker/noplate/blob/main/src/list.h

eqvinox · 7 months ago

I guess if anyone might know it might be you—do you see any way of doing this for intrusive data structures, embedding the node struct in the data (and as side effect supporting an object to be on multiple containers) rather than the data in the node like you're doing there?

You could put the dummy member into the embedded node. But for intrusive data structures you often want them to erase the type so that you write generic algorithms as regular functions. In this case, it makes more sense to have a run-time check do to down casts. I do this with my variadic type which has an intrusive "super" member: https://godbolt.org/z/ofdKe7Pfv The overhead is often completely removed by the compiler.

teo_zero · 7 months ago

> Structurally identical types will be considered the same type in GCC 15 and Clang later in 2025 thanks to a rule change

Beware that only tagged unions are considered the same type under the new rule, provided they have the same structrure and the same tag.

The List(T) macro should be changed to generate a different tag for each different T. Which is trivial (with ##) for simple one-word types, but impossible for even mildly complex ones like pointers to char (strings).

Of course you can force yourself to typedef any type before using it in a List, but it looses much of its versatility. Example:

  typedef char *str;
  List(str) my_list_of_str;
  List(str) tokenize(str input) {...}

alcover · 7 months ago

> Beware that only tagged unions are considered the same type

I don't get it. Tagged union is just a design pattern.

wahern · 7 months ago

By tagged they mean have an identifier. Compare

  > struct { ... } foo;

and

  > struct bar { ... } foo;

The latter has an identifier, bar; the former doesn't. The standard uses tag to refer to the identifier name, if any, in an enum, struct, or union declaration.

chuckadams · 7 months ago

I believe "type witness" is the generic (hah) term for "member that doesn't do anything but hold the type". Lot less literature out there about type witnesses than I had thought though...

gizmo686 · 7 months ago

There's a similar term "phantom type", for when you have some sort of type variable that is never used as the type of an actual variable.

I've mostly seen in done in Haskell, but have used it in it Scala as well to simulate type hierarchies not present in the actual type system.

In a way, this union trick is kind of like a phantom type, since the secondary type is never actually used.

Yah, I tend to think of type witnesses as actually existing at runtime and phantom types not, but in the union trick, they don't really exist either. So thinking on it some more, seems more of a way of expressing type parameters in the first place, and well, that's what the article was about.

Now I'm wondering what phantom types would look like in C...

8-/

mfuzzey · 7 months ago

There's also the method used in the Linux kernel to embed the list information (struct list_head) within the type specific struct. https://kernelnewbies.org/FAQ/LinkedLists

nixpulvis · 7 months ago

The naming of LIST_HEAD_INIT and INIT_LIST_HEAD is confusing to me.

The way I remember it is:

INIT_LIST_HEAD is of form VERB_NOUN so is called from within a function to programatically initialise the list.

LIST_HEAD_INIT is NOUN_VERB and is used within a structure initialiser not from a function.

But my main point was to show the "embed the list in the data" approach rather than "embed the data in the list" or "point to the data from the list" and not to discuss the naming details in the kernel implementation of the concept.

Not to mention that they insist on calling every entry of the list a "list head", which makes no sense (hysterical raisins, maybe?). The structure is made of a uniform loop of entries, one of which is used as the actual head & tail, or entry point into the structure.