_rend · 6 months ago
For completeness, this description of alignment is misleading:

> Well, dear reader, this padding is added because the CPU needs memory to be aligned in sets of 4 bytes because it’s optimized in that fashion.

> ...

> Remember: since structs are aligned to 4 bytes, any padding is therefore unnecessary if the size of the struct is a multiple of 4 without the padding.

Individual data types have their own alignment (e.g., `bool`/`char` may be 1, `short` may be 2, `int` may be 4, `long` may be 8, etc.), and the alignment of a compound type (like a struct) defaults to the maximum alignment of its constituent types.

In this article, `struct Monster` has an alignment of 4 because `int` and `float` have an alignment of 4 for the author's configuration. Expanding one of the `int`s to a `long` could increase the alignment to 8 on some CPUs, and removing the `int` and `float` fields would decrease the alignment to 1 for most CPUs.
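
A minimal way to check this (a sketch: the struct fields are illustrative, and the printed values are implementation-defined):

  #include <stdio.h>
  #include <stdalign.h>

  struct four { int hp; float speed; char tag; };  /* alignof 4 on typical targets */
  struct one  { char a; char b; };                 /* alignof 1 on most CPUs */

  int main(void) {
      /* alignof yields a size_t (C11's _Alignof via stdalign.h) */
      printf("%zu %zu\n", alignof(struct four), alignof(struct one));
      return 0;
  }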

sumtechguy · 6 months ago
Also keep in mind that this is all very CPU- and compiler-specific. I had one compiler that packed everything at 4/8, usually 8, not the 1/2/4/8 you would expect. That was because the CPU would just seg fault if you didn't play nice with the data access. The compiler hid a lot of it if you set the packing, with offsets and memory moves and shifting. It was clever but slow. So by default they picked a wide enough packing that removed the extra instructions at the cost of using more memory. x86 was by far the most forgiving of the architectures I was working with at the time. ARM was the least forgiving (at least on the platform I was using), with MIPS being OK in some cases but not others.
jandrewrogers · 6 months ago
Some of the Cray hardware was basically pure 64-bit. The systems largely didn’t recognize smaller granularity. I learned a lot of lessons about writing portable C by writing code for Cray systems.
neeeeees · 6 months ago
On one of these less forgiving architectures, how does one write programs that read some bytes off the network, bitcast them into a struct, and do something based on that?

On x86 you would use a packed struct that matches the wire protocol.

Wouldn’t this require extra copying if member reads were forced to be aligned?

gavinsyancey · 6 months ago
Um ... isn't alignment generally dictated by the platform ABI so that programs compiled by different compilers can be linked together?
flohofwoe · 6 months ago
AFAIK alignment doesn't even matter anymore (for CPU data at least) since the 'word size' of a modern CPU is the size of a cache line (32 or 64 bytes?), i.e. unaligned accesses within a 32- or 64-byte block are no different from aligned accesses.

(technically there is still an advantage of items aligned to their size in that such an item can never straddle adjacent cache lines though)

And there are also still tons of different alignment requirements when working with GPU data - and interestingly those alignment requirements may differ from C's alignment rules, so you may need to explicitly use packed structs (which are still not a standard C feature!) with manual padding.
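
(For example, a std140-style GLSL uniform block gives a vec3 a 16-byte base alignment, so the mirroring C struct needs manual padding; a sketch under that assumption:)

  /* GPU side (std140): vec3 pos; vec4 color;  ->  color starts at offset 16 */
  typedef struct {
      float pos[3];
      float _pad0;     /* manual padding to match the GPU layout */
      float color[4];
  } params_t;          /* 32 bytes, matching the uniform block */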

birn559 · 6 months ago
My understanding is that C++ compilers still add padding by default for performance reasons: the CPU would otherwise have to spend a few cycles reorganizing data that is not aligned in chunks of 4 bytes.
kazinator · 6 months ago
I.e. the author is wrong; `struct s { char x; };` is not required to be four-byte-aligned. It can have a size, and thus an alignment, of 1 byte.
bobmcnamara · 6 months ago
And bool may be 4, and char may be 2 (m68k)!
flohofwoe · 6 months ago
Not unless you're stuck in the 90s ;) These sizes have all been standardized to 1 byte since C99.
boomlinde · 6 months ago
> Individual data types have their own alignment (e.g., `bool`/`char` may be 1, `short` may be 2, `int` may be 4, `long` may be 8, etc.), and the alignment of a compound type (like a struct) defaults to the maximum alignment of its constituent types.

I will add that this is implementation-defined. IIRC the only restriction the standard imposes on the alignment of a struct is that a pointer to it, when converted, is also a pointer to its first member, meaning its alignment must in practice be a multiple of that of its first member.

NooneAtAll3 · 6 months ago
implementation-defined means your specialized platform can be supported without needing to conform - it does not mean that common knowledge is false for common users
david2ndaccount · 6 months ago
The author relies on the compiler fitting the bitfield in unused padding bytes after “speed”, but that is implementation-defined behavior (almost everything about bitfields is implementation-defined). MSVC will align the unsigned bitfield to alignof(unsigned), whereas GCC and clang will use the padding bytes. To portably pack the struct, use unsigned integers and use flags (monster.flags & CAN_FLY instead of monster.can_fly).

See https://c.godbolt.org/z/K7oY77KGj
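
(A sketch of the flags approach; the CAN_FLY name is from the comment above, the rest of the layout is illustrative:)

  #include <stdint.h>

  enum {
      CAN_FLY     = 1u << 0,
      CAN_SWIM    = 1u << 1,
      IS_POISONED = 1u << 2,
  };

  struct Monster {
      float   speed;
      uint8_t flags;  /* a plain byte: no implementation-defined bitfield layout */
  };

  /* test and set bits explicitly */
  int  can_fly(const struct Monster *m) { return (m->flags & CAN_FLY) != 0; }
  void set_can_fly(struct Monster *m)   { m->flags |= CAN_FLY; }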

cryptonector · 6 months ago
Bitfields have other problems. Say you have two bitfields, each of `uint8_t` type and totaling 16 bits: well, you might think that's _two_ fields, but the compiler is allowed to treat them as _one_ whenever it wants to, and _that_ can be a problem if you're accessing one with atomics and the other without, because the other can yield unsynchronized accesses to the one that needs synchronization.

Bitfields in C leave way too much behavior to the compiler or undefined. It's really intolerable.
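
(A sketch of the hazard described above; note that uint8_t bit-fields are themselves a common extension rather than standard C:)

  #include <stdint.h>

  struct s {
      uint8_t a : 8;  /* thread 1 writes this */
      uint8_t b : 8;  /* thread 2 writes this */
  };

  /* adjacent bit-fields count as a single memory location, so a compiler
     may implement a store to 'a' as a 16-bit read-modify-write that also
     rewrites 'b', racing with thread 2 unless both accesses are synchronized */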

atiedebee · 6 months ago
Even worse: bit-fields can only be applied to int, signed int, and unsigned int (maybe char as well, but I don't remember).

Even crazier is the fact that a plain int bit-field's signedness is implementation-defined.

arjvik · 6 months ago
Why can't there be a standard defined for bitfields in future C releases? This is a long-discussed drawback of a feature that I really really want to be able to use in production code.
ajross · 6 months ago
There is, but it's part of the platform ABI and not the C language standard. The latter specifies syntax and behavior, the former is what's concerned with interoperability details like memory layout.

I happen to have a copy of AAPCS64 in front of me, and you can find the specification of bit packing in section 10.1.8. The sysv ABI for x86/x86_64 has its own wording (but I'm pretty sure is compatible). MSVC does something else I believe, etc...

gizmo686 · 6 months ago
There is. I'm not sure when it was added, but looking at https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3220.pdf, page 106, note 13

> An implementation may allocate any addressable storage unit large enough to hold a bit-field. If enough space remains, a bit-field that immediately follows another bit-field in a structure shall be packed into adjacent bits of the same unit. If insufficient space remains, whether a bit-field that does not fit is put into the next unit or overlaps adjacent units is implementation-defined. The order of allocation of bit-fields within a unit (high-order to low-order or low-order to high-order) is implementation-defined. The alignment of the addressable storage unit is unspecified.

In practice, I've been using this feature for ages (along with __attribute__((packed))). There comes a point where you can start depending on compiler-specific features instead of relying solely on the standard.
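
(Illustrating the quoted rule; the resulting size is typical rather than guaranteed:)

  /* adjacent bit-fields are packed into the same addressable storage unit */
  struct flags {
      unsigned can_fly     : 1;
      unsigned can_swim    : 1;
      unsigned is_poisoned : 1;
  };  /* typically sizeof(struct flags) == sizeof(unsigned) */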

petermclean · 6 months ago
I've been working on an open source library that, while it doesn't solve bitfields, can provide convenient ergonomic access to bit-granular types:

https://github.com/PeterCDMcLean/BitLib

You can inherit from bit_array and create a pseudo bitfield:

  class mybits_t : public bit::bit_array<17> {
    public:
      using bit::bit_array<17>::bit_array;  // inherit the library's constructors

      // slice a 1-bit-sized field
      // returns a proxy reference to the single bit
      auto field1() {
        return (*this)(0, 1);
      }

      // slice a 3-bit-sized field
      // returns a proxy reference to the 3 bits
      auto field2() {
        return (*this)(1, 4);
      }
  };
  mybits_t mybits(7);
  assert(mybits.field1() == 1);
  mybits.field1() = 0;
  assert(static_cast<int>(mybits) == 6); // no implicit cast to integers, but explicit is supported

There is minimal overhead, depending on how aggressively the compiler inlines or optimizes.

munificent · 6 months ago
This is a good article, but it bugs me that it sort of only tells half the story.

Why are you making your structs smaller? Probably for "performance". If your goal is literally just reducing memory usage, then this is all fine. Maybe you're running on a small microcontroller and you just straight up don't have much RAM.

But these days, in most areas, memory is cheap and plentiful. What's expensive is CPU cache. If you're optimizing memory use for that, it's really about runtime speed and less about total memory usage. You're trying to make things small so you have fewer cache misses and the CPU doesn't get stuck waiting as much.

In that case, the advice here is only part of the story. Using bitfields and smaller integers saves memory. But in order to access those fields and do anything with them, they need to get loaded into registers. I'm not an expert here, but my understanding is that loading a bitfield or small int into a word-sized register can sometimes carry a bit of overhead, so you have to pay attention to the trade-off.

If I was optimizing for overall runtime speed and considering packing structs to do that, I'd really want to have good profiling and benchmarks in place first. Otherwise it's easy to think you're making progress when you're actually going backwards.

ack_complete · 6 months ago
> If you're optimizing memory use for that, it's really about runtime speed and less about total memory usage. You're trying to make things small so you have fewer cache misses and the CPU doesn't get stuck waiting as much.

The complication is that cache is a global resource. Code that has larger data structures, even if it runs faster, can contribute to a larger working set and a slightly higher cache miss rate in other code. This can lead to a death-by-a-thousand-cuts scenario, and it's hard to get solid data on this when profiling.

You're right, though, that there are a number of ways that smaller structs can be slower, either directly by execution time or indirectly by larger code causing more misses in the instruction cache. Some other factors include whether the compiler can coalesce accesses to multiple fields grouped together, or whether the struct hits a magic power-of-two size for faster array indexing or vectorization. Which means you can end up speeding up code occasionally by making the struct bigger.

ryao · 6 months ago
If you are using an intrusive data structure, keeping the key on the same cache line as the pointers can make traversals faster. Reducing padding can make this easier to do.

In general, keeping things that are commonly accessed together on the same cache line is a good idea, and the work needed to remove excessive padding overlaps with the work needed to do this.
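
(A sketch of the intrusive case, assuming a typical 64-bit target:)

  /* intrusive tree node: the key sits right after the links, so both
     likely land on the same cache line and a traversal's compare
     needs no extra line fill */
  struct node {
      struct node *left, *right;  /* 16 bytes */
      int key;                    /* compared on every step */
      /* rarely-touched payload fields can follow */
  };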

ljchen · 6 months ago
I mostly agree, except that perf is determined by a myriad of factors. Even if this piece of data fits into a cache line, it may have no effect, or a negative one, on the code that follows. For complex software, even if it's faster after the change, it's hard to tell what the contributing factor really is. I once read a systems paper (forgot the name) from a recent systems research conference on how changing the code layout, e.g., moving a block of code around at random, can have unexpected impacts on perf.

Personally, I only pay attention to the cache line and layout when the structure is a concurrent data structure and needs to sync across multiple cores.

bluGill · 6 months ago
I pack structs because the packed layout matches some hardware or network structure, and thus I get direct access. This isn't a common case, but it exists.
yellowapple · 6 months ago
That case is much more common in embedded/bare-metal programming. A decent chunk of the code I've written so far for getting Zig running on a Nintendo 64 is a bunch of packed structs representing various registers (and a bunch of methods on those structs representing common uses of those registers).

It's also a reasonably common case for parsing binary data formats (e.g. a file's header has fields at exact byte offsets, and it's much faster and easier to just grab all the bytes of that header and interpret those bytes as a packed struct).
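
(A C sketch of the file-header case; __attribute__((packed)) is the nonstandard but widely supported extension mentioned elsewhere in the thread, and the header layout is hypothetical:)

  #include <stdio.h>
  #include <stdint.h>

  /* hypothetical 10-byte on-disk header */
  struct __attribute__((packed)) file_header {
      uint32_t magic;
      uint16_t version;
      uint32_t length;
  };  /* sizeof == 10: no padding inserted */

  int read_header(FILE *f, struct file_header *h) {
      /* grab the header in one read and interpret the bytes in place;
         assumes the file and host agree on byte order */
      return fread(h, sizeof *h, 1, f) == 1;
  }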

Rayhem · 6 months ago
If you're really accessing those fields, it's overwhelmingly likely that you're accessing the same field for all of your instances. Your access pattern is going to be

    monsters[0].health = //calculation
    monsters[1].health = //calculation
    monsters[2].health = //calculation
and not

    monsters[0].health = //calculation
    monsters[1].name = //update
    monsters[2].is_poisoned = //check
Striding over your monsters array means you have to load each monster to access one field, which just blows your cache to hell. Much better is to pack all of your health values together in a structure-of-arrays layout, as opposed to the usual array-of-structures approach here. Entity-component-system architectures are a brilliant tool for managing this, with some pretty performant results.
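
(A sketch of the structure-of-arrays layout; the capacity and field choice are illustrative:)

  #define MAX_MONSTERS 1024

  /* array-of-structures: fields of each monster are interleaved in memory */
  struct Monster { float x, y, speed; int health; };
  struct Monster monsters_aos[MAX_MONSTERS];

  /* structure-of-arrays: each field is contiguous and cheap to scan */
  struct Monsters {
      float x[MAX_MONSTERS];
      float y[MAX_MONSTERS];
      float speed[MAX_MONSTERS];
      int   health[MAX_MONSTERS];
  } monsters_soa;

  /* a pass over health now touches only the health array,
     not every monster's other fields */
  void decay_all(int n) {  /* n <= MAX_MONSTERS */
      for (int i = 0; i < n; i++)
          monsters_soa.health[i] -= 1;
  }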

munksbeer · 6 months ago
Could you give some intuition on why this is likely to be the case more frequently than accessing different values from a single entity?

I'm not a games developer, so I don't have a good feel for why you'd need to iterate over every monster's health property in a tight loop vs. accessing each monster, calculating its position and what properties it has, and updating one or more of its values in an iteration.

zeusk · 6 months ago
The real issue with packed structs and bitfields happens when concurrency gets involved. The majority of modern CPU caches are private and only allow one core to hold a given cache line, so it creates more false dependencies when cores try to alter information that was compacted into a cache line held by another core.
charleslmunger · 6 months ago
Avoiding false sharing is a separate problem best solved by explicitly aligning the struct or relevant members to std::hardware_destructive_interference_size.
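
(In C terms that means an explicit alignas; 64 is an assumed line size here, which the C++ constant exists to avoid hardcoding:)

  #include <stdalign.h>

  /* give each counter its own (assumed 64-byte) cache line so two cores
     incrementing different counters never contend for the same line */
  struct counter {
      alignas(64) long value;
  };
  struct counter per_core[8];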
voidnap · 6 months ago
> memory is cheap and plentiful. What's expensive is CPU cache.

CPU cache is memory. Also memory isn't cheap, it is relatively expensive to access. CPU cache is way cheaper. You have it backwards.

mort96 · 6 months ago
The obvious and charitable read of their post is that they use "memory" to mean main memory/RAM, "CPU cache" to mean CPU cache, and "memory is cheap and plentiful" to mean "memory has a low dollar cost and most systems have a whole lot of it".

Your read -- that they use "memory" to mean "anything which stores bytes, including caches", and that they use "memory is cheap" to mean "accessing RAM has a low latency" -- is uncharitable, contrived and even literally self-contradictory ("CPU cache is memory" + "memory is expensive to access while CPU cache is cheap to access" is a contradiction which can only be resolved through using different definitions of "memory" in those two sentences).

writebetterc · 6 months ago
They're talking about dollars
jcelerier · 6 months ago
> But these days, in most areas, memory is cheap and plentiful

and that's why I am contemplating a freaking 128G of RAM for my next machine. I have 64 right now and OOM regularly; 32 is unusable

zahlman · 6 months ago
What software are you running? I'm humming along without a problem on 8.
wbl · 6 months ago
Everything is microarch-dependent, but typically the memory subsystem can handle subword loads and writes. What it usually can't do is bits.
rwmj · 6 months ago
No mention of pahole / dwarves? https://lwn.net/Articles/335942/ It's the standard tool used by the Linux kernel developers.
petergeoghegan · 6 months ago
Recent versions of clangd show pahole-like struct size/alignment padding overhead annotations. I find this built-in support a lot more convenient than using pahole.
uticus · 6 months ago
for somebody starting out, where do you go for a list of tools like this?
ryao · 6 months ago
pahole is an awesome tool.
kazinator · 6 months ago
Generally speaking, sort the members in descending order by alignment if you want the tightest packing without resorting to nonportable compiler extensions. This descending order eliminates padding between members, producing the smallest offset for the last byte of the last member, and that then minimizes the padding required at the end of the struct.

In C, alignment for basic types is linked to size. A type may have an alignment requirement as strict as its size: e.g. an 8-byte data type might have to be aligned to addresses divisible by 8.

The alignment of an array aggregate is that of its element type.

The alignment of a struct is that of its most strictly aligned member.

In portable coding, we don't exactly know, but we can make the worst-case assumption that every basic type has the strictest possible alignment---i.e. its size. We also don't know the sizes of all the types, but we know that integers of higher ranks are at least as large as lower ranks; i.e. sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long), etc.

Things are tricky when you have a mixture of floating-point, pointers, and integers. E.g. a pointer could be the same size as a float, or bigger.

Here is a trick you can use: combine clumps of like types, and try to make combinations whose sizes are powers of two. For instance, if your structure has two or four pointers to things, try to make them adjacent. E.g.:

  struct s {
    void *p1, *p2, *p3, *p4;
    short s1, s2, s3, s4;
    float f1, f2, f3, f4;
  };

Even though float is almost certainly larger than short and may have stricter alignment, it's moot because I have four of everything. Even if this is on a tiny system where pointers are 16 bits wide, four of them will make 8 bytes. So when the f1 member is being placed into the struct, the offset is at least 8-byte aligned: we don't expect padding.

If you aggregate enough members of the same type in a power-of-two sized clump, you can easily create alignment which meets or exceeds the strictest alignment in the system.

So those are the main principles: group together members of the same type. Then, keeping those groups together, sort by descending alignment strictness.
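
(A before/after illustration, assuming typical 64-bit sizes:)

  /* declaration order: char, pad(7), double, int, tail pad(4) = 24 bytes */
  struct unsorted { char c; double d; int i; };

  /* descending alignment: double, int, char, tail pad(3) = 16 bytes */
  struct sorted { double d; int i; char c; };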

o11c · 6 months ago
Note that although alignment and size are platform-specific, in practice padding can be minimized simply by using a fixed order:

  (various vector types)
  int128_t
  long double
  long long, int64_t, int_fast64_t, double
  off_t
  time_t (can be 64 bits on 32-bit platforms only if off_t also is)
  register_t, int_fast16_t, int_fast32_t (64-bit on e.g. x32)
  void *, void(*)(), intptr_t
  ptrdiff_t (theoretically wrong if fully aligned and bigger than pointer)
  size_t, ssize_t
  long (32-bit on win64; generally has 16-bit alignment if bigger than pointer)
  int32_t, float
  int (can be 16-bit)
  short, int16_t
  char, bool, int8_t, int_fast8_t
  (a trailing array member must come last; note when calculating combined size you should use offsetof instead of sizeof to not waste padding)
Besides this, `int_leastN_t` goes with `intN_t`, and all unsigned types go with their signed types. Note that types like `id_t` are hard to place however.

Although there do exist many platforms where this is not in order of size, the bigger types generally have smaller alignment on such platforms. Even so, I have occasionally split out a row without knowing of any platform that actually needs it.

variadix · 6 months ago
Some other things you can do to improve packing: use integer indices/offsets/relative pointers instead of absolute pointers (especially on 64-bit machines); overlap fields in unions (for example, a tag can share some bits with another field in each member of the union); use a struct's position in memory to encode information (for example, in a binary tree of array-allocated nodes, you can specify that the right child always follows the parent in memory, and then you only need to store the index of the left child); and of course structure packing pragmas/attributes.
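
(A sketch of the index tricks; the pool size and sentinel are hypothetical:)

  #include <stdint.h>

  #define NIL UINT32_MAX  /* sentinel: no child */

  /* nodes live in one array, so 32-bit indices replace 64-bit pointers;
     the right child, when present, occupies the slot after its parent,
     so only the left link needs storing at all */
  struct node {
      int32_t  key;
      uint32_t left;  /* pool index of the left child, or NIL */
  };
  struct node pool[1024];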
cogman10 · 6 months ago
Nice little article.

The next thing to really explore is how the struct is being used. Smaller is always nice, but sometimes you need it to be fast. To do that, you'll want to group together fields that are commonly read. For example, x, y, and speed are likely always read together. So, it'd make sense to group those fields together to optimize for cache reads.

The next thing to question is whether your Monster is being interacted with as an individual thing or as a group of things. If it's as a group of things, then it might make sense to create a "struct of arrays" of monsters. Imagine, for example, you have a system which removes all the dead monsters. It'd be faster to iterate through an `isAlive` array rather than an array of `Monster` structs.

And now you are on your way to reinventing an ECS :)