_rend · 6 months ago
For completeness, this description of alignment is misleading:

> Well, dear reader, this padding is added because the CPU needs memory to be aligned in sets of 4 bytes because it’s optimized in that fashion.

> ...

> Remember: since structs are aligned to 4 bytes, any padding is therefore unnecessary if the size of the struct is a multiple of 4 without the padding.

Individual data types have their own alignment (e.g., `bool`/`char` may be 1, `short` may be 2, `int` may be 4, `long` may be 8, etc.), and the alignment of a compound type (like a struct) defaults to the maximum alignment of its constituent types.

In this article, `struct Monster` has an alignment of 4 because `int` and `float` have an alignment of 4 for the author's configuration. Expanding one of the `int`s to a `long` could increase the alignment to 8 on some CPUs, and removing the `int` and `float` fields would decrease the alignment to 1 for most CPUs.
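
A minimal way to check this (a sketch: the struct fields are illustrative, and the printed values are implementation-defined):

  #include <stdio.h>
  #include <stdalign.h>

  struct four { int hp; float speed; char tag; };  /* alignof 4 on typical targets */
  struct one  { char a; char b; };                 /* alignof 1 on most CPUs */

  int main(void) {
      /* alignof yields a size_t (C11's _Alignof via stdalign.h) */
      printf("%zu %zu\n", alignof(struct four), alignof(struct one));
      return 0;
  }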

sumtechguy · 6 months ago
Also keep in mind that this is all very CPU- and compiler-specific. I had one compiler that packed everything at 4/8, usually 8, not the 1/2/4/8 you would expect. That was because the CPU would just seg fault if you didn't play nice with the data access. The compiler hid a lot of it if you set the packing, with offsets and memory moves and shifting. It was clever but slow. So by default they picked a wide enough packing that removed the extra instructions at the cost of using more memory. x86 was by far the most forgiving of the architectures I was working with at the time. ARM was the least forgiving (at least on the platform I was using), with MIPS being OK in some cases but not others.
jandrewrogers · 6 months ago
Some of the Cray hardware was basically pure 64-bit. The systems largely didn’t recognize smaller granularity. I learned a lot of lessons about writing portable C by writing code for Cray systems.
neeeeees · 6 months ago
On one of these less forgiving architectures, how does one write programs that read some bytes off the network, bitcast them into a struct, and do something based on that?

On x86 you would use a packed struct that matches the wire protocol.

Wouldn’t this require extra copying if member reads were forced to be aligned?

gavinsyancey · 6 months ago
Um ... isn't alignment generally dictated by the platform ABI so that programs compiled by different compilers can be linked together?
flohofwoe · 6 months ago
AFAIK alignment doesn't even matter anymore (for CPU data at least) since the 'word size' of a modern CPU is the size of a cache line (32 or 64 bytes?), i.e. unaligned accesses within a 32- or 64-byte block are no different from aligned accesses.

(technically there is still an advantage of items aligned to their size in that such an item can never straddle adjacent cache lines though)

And there are also still tons of different alignment requirements when working with GPU data - and interestingly those alignment requirements may differ from C's alignment rules, so you may need to explicitly use packed structs (which are still not a standard C feature!) with manual padding.
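
(For example, a std140-style GLSL uniform block gives a vec3 a 16-byte base alignment, so the mirroring C struct needs manual padding; a sketch under that assumption:)

  /* GPU side (std140): vec3 pos; vec4 color;  ->  color starts at offset 16 */
  typedef struct {
      float pos[3];
      float _pad0;     /* manual padding to match the GPU layout */
      float color[4];
  } params_t;          /* 32 bytes, matching the uniform block */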

birn559 · 6 months ago
My understanding is that C++ compilers still add padding by default for performance reasons: the CPU would otherwise have to spend a few cycles reorganizing data that is not aligned in chunks of 4 bytes.
kazinator · 6 months ago
I.e. the author is wrong; `struct s { char x; };` is not required to be four-byte-aligned. It can have a size, and thus an alignment, of 1 byte.
bobmcnamara · 6 months ago
And bool may be 4, and char may be 2 (m68k)!
flohofwoe · 6 months ago
Not unless you're stuck in the 90s ;) These sizes have all been standardized to 1 byte since C99.
boomlinde · 6 months ago
> Individual data types have their own alignment (e.g., `bool`/`char` may be 1, `short` may be 2, `int` may be 4, `long` may be 8, etc.), and the alignment of a compound type (like a struct) defaults to the maximum alignment of its constituent types.

I will add that this is implementation-defined. IIRC the only restriction the standard imposes on the alignment of a struct is that a pointer to it, when converted, is also a pointer to its first member, meaning its alignment must in practice be a multiple of that of its first member.

NooneAtAll3 · 6 months ago
implementation-defined means your specialized platform can be supported without needing to conform - it does not mean that common knowledge is false for common users
david2ndaccount · 6 months ago
The author relies on the compiler fitting the bitfield in unused padding bytes after “speed”, but that is implementation-defined behavior (almost everything about bitfields is implementation-defined). MSVC will align the unsigned bitfield to alignof(unsigned), whereas GCC and clang will use the padding bytes. To portably pack the struct, use unsigned integers and use flags (monster.flags & CAN_FLY instead of monster.can_fly).

See https://c.godbolt.org/z/K7oY77KGj
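
(A sketch of the flags approach; the CAN_FLY name is from the comment above, the rest of the layout is illustrative:)

  #include <stdint.h>

  enum {
      CAN_FLY     = 1u << 0,
      CAN_SWIM    = 1u << 1,
      IS_POISONED = 1u << 2,
  };

  struct Monster {
      float   speed;
      uint8_t flags;  /* a plain byte: no implementation-defined bitfield layout */
  };

  /* test and set bits explicitly */
  int  can_fly(const struct Monster *m) { return (m->flags & CAN_FLY) != 0; }
  void set_can_fly(struct Monster *m)   { m->flags |= CAN_FLY; }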

cryptonector · 6 months ago
Bitfields have other problems. Say you have two bitfields, each of `uint8_t` type and totaling 16 bits: well, you might think that's _two_ fields, but the compiler is allowed to treat them as _one_ whenever it wants to, and _that_ can be a problem if you're accessing one with atomics and the other without, because the other can yield unsynchronized accesses to the one that needs synchronization.

Bitfields in C leave way too much behavior to the compiler or undefined. It's really intolerable.
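
(A sketch of the hazard described above; note that uint8_t bit-fields are themselves a common extension rather than standard C:)

  #include <stdint.h>

  struct s {
      uint8_t a : 8;  /* thread 1 writes this */
      uint8_t b : 8;  /* thread 2 writes this */
  };

  /* adjacent bit-fields count as a single memory location, so a compiler
     may implement a store to 'a' as a 16-bit read-modify-write that also
     rewrites 'b', racing with thread 2 unless both accesses are synchronized */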

atiedebee · 6 months ago
Even worse: bit-fields can only be applied to int, signed int, and unsigned int (maybe char as well, but I don't remember).

Even crazier is the fact that a plain int bit-field's signedness is implementation-defined.

arjvik · 6 months ago
Why can't there be a standard defined for bitfields in future C releases? This is a long-discussed drawback of a feature that I really really want to be able to use in production code.
ajross · 6 months ago
There is, but it's part of the platform ABI and not the C language standard. The latter specifies syntax and behavior, the former is what's concerned with interoperability details like memory layout.

I happen to have a copy of AAPCS64 in front of me, and you can find the specification of bit packing in section 10.1.8. The sysv ABI for x86/x86_64 has its own wording (but I'm pretty sure is compatible). MSVC does something else I believe, etc...

gizmo686 · 6 months ago
There is. I'm not sure when it was added, but looking at https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3220.pdf, page 106, note 13

> An implementation may allocate any addressable storage unit large enough to hold a bit-field. If enough space remains, a bit-field that immediately follows another bit-field in a structure shall be packed into adjacent bits of the same unit. If insufficient space remains, whether a bit-field that does not fit is put into the next unit or overlaps adjacent units is implementation-defined. The order of allocation of bit-fields within a unit (high-order to low-order or low-order to high-order) is implementation-defined. The alignment of the addressable storage unit is unspecified.

In practice, I've been using this feature for ages (along with __attribute__((packed))). There comes a point where you can start depending on compiler-specific features instead of relying solely on the standard.
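
(Illustrating the quoted rule; the resulting size is typical rather than guaranteed:)

  /* adjacent bit-fields are packed into the same addressable storage unit */
  struct flags {
      unsigned can_fly     : 1;
      unsigned can_swim    : 1;
      unsigned is_poisoned : 1;
  };  /* typically sizeof(struct flags) == sizeof(unsigned) */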

petermclean · 6 months ago
I've been working on an open source library that, while it doesn't solve bitfields, can provide convenient ergonomic access to bit-granular types:

https://github.com/PeterCDMcLean/BitLib

You can inherit from bit_array and create a pseudo bitfield:

  class mybits_t : public bit::bit_array<17> {
    public:
      using bit::bit_array<17>::bit_array;  // inherit the library's constructors

      // slice a 1-bit-sized field
      // returns a proxy reference to the single bit
      auto field1() {
        return (*this)(0, 1);
      }

      // slice a 3-bit-sized field
      // returns a proxy reference to the 3 bits
      auto field2() {
        return (*this)(1, 4);
      }
  };
  mybits_t mybits(7);
  assert(mybits.field1() == 1);
  mybits.field1() = 0;
  assert(static_cast<int>(mybits) == 6); // no implicit cast to integers, but explicit is supported

There is minimal overhead, depending on how aggressively the compiler inlines or optimizes.

munificent · 6 months ago
This is a good article, but it bugs me that it sort of only tells half the story.

Why are you making your structs smaller? Probably for "performance". If your goal is literally just reducing memory usage, then this is all fine. Maybe you're running on a small microcontroller and you just straight up don't have much RAM.

But these days, in most areas, memory is cheap and plentiful. What's expensive is CPU cache. If you're optimizing memory use for that, it's really about runtime speed and less about total memory usage. You're trying to make things small so you have fewer cache misses and the CPU doesn't get stuck waiting as much.

In that case, the advice here is only part of the story. Using bitfields and smaller integers saves memory. But in order to access those fields and do anything with them, they need to get loaded into registers. I'm not an expert here, but my understanding is that loading a bitfield or small int into a word-sized register can sometimes carry a bit of overhead, so you have to pay attention to the trade-off.

If I was optimizing for overall runtime speed and considering packing structs to do that, I'd really want to have good profiling and benchmarks in place first. Otherwise it's easy to think you're making progress when you're actually going backwards.

ack_complete · 6 months ago
> If you're optimizing memory use for that, it's really about runtime speed and less about total memory usage. You're trying to make things small so you have fewer cache misses and the CPU doesn't get stuck waiting as much.

The complication is that cache is a global resource. Code that has larger data structures, even if it runs faster, can contribute to a larger working set and a slightly higher cache miss rate in other code. This can lead to a death-by-a-thousand-cuts scenario, and it's hard to get solid data on this when profiling.

You're right, though, that there are a number of ways that smaller structs can be slower, either directly by execution time or indirectly by larger code causing more misses in the instruction cache. Some other factors include whether the compiler can coalesce accesses to multiple fields grouped together, or whether the struct hits a magic power-of-two size for faster array indexing or vectorization. Which means you can end up speeding up code occasionally by making the struct bigger.

ryao · 6 months ago
If you are using an intrusive data structure, keeping the key on the same cache line as the pointers can make traversals faster. Reducing padding can make this easier to do.

In general, keeping things that are commonly accessed together on the same cache line is a good idea, and the work needed to remove excessive padding overlaps with the work needed to do this.
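
(A sketch of the intrusive case, assuming a typical 64-bit target:)

  /* intrusive tree node: the key sits right after the links, so both
     likely land on the same cache line and a traversal's compare
     needs no extra line fill */
  struct node {
      struct node *left, *right;  /* 16 bytes */
      int key;                    /* compared on every step */
      /* rarely-touched payload fields can follow */
  };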

ljchen · 6 months ago
I mostly agree, except that perf is determined by a myriad of factors. Even if this piece of data fits into a cache line, it may have no effect, or a negative one, on the code that follows. For complex software, even if it's faster after the change, it's hard to tell what the contributing factor really is. I once read a systems paper (forgot the name) from a recent systems research conference on how changing the code layout, e.g., moving a block of code around at random, can have unexpected impacts on perf.

Personally, I only pay attention to the cache line and layout when the structure is a concurrent data structure and needs to sync across multiple cores.

bluGill · 6 months ago
I pack structs because the packed layout matches some hardware or network structure, and thus I get direct access. This isn't a common case, but it exists.
yellowapple · 6 months ago
That case is much more common in embedded/bare-metal programming. A decent chunk of the code I've written so far for getting Zig running on a Nintendo 64 is a bunch of packed structs representing various registers (and a bunch of methods on those structs representing common uses of those registers).

It's also a reasonably common case for parsing binary data formats (e.g. a file's header has fields at exact byte offsets, and it's much faster and easier to just grab all the bytes of that header and interpret those bytes as a packed struct).
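
(A C sketch of the file-header case; __attribute__((packed)) is the nonstandard but widely supported extension mentioned elsewhere in the thread, and the header layout is hypothetical:)

  #include <stdio.h>
  #include <stdint.h>

  /* hypothetical 10-byte on-disk header */
  struct __attribute__((packed)) file_header {
      uint32_t magic;
      uint16_t version;
      uint32_t length;
  };  /* sizeof == 10: no padding inserted */

  int read_header(FILE *f, struct file_header *h) {
      /* grab the header in one read and interpret the bytes in place;
         assumes the file and host agree on byte order */
      return fread(h, sizeof *h, 1, f) == 1;
  }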

Rayhem · 6 months ago
If you're really accessing those fields, it's overwhelmingly likely that you're accessing the same field for all of your instances. Your access pattern is going to be

    monsters[0].health = //calculation
    monsters[1].health = //calculation
    monsters[2].health = //calculation
and not

    monsters[0].health = //calculation
    monsters[1].name = //update
    monsters[2].is_poisoned = //check
Striding over your monsters array means you have to load each monster to access one field, which just blows your cache to hell. Much better is to pack all of your health values together in a structure-of-arrays layout, as opposed to the usual array-of-structures approach here. Entity-component-system architectures are a brilliant tool for managing this, with some pretty performant results.
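
(A sketch of the structure-of-arrays layout; the capacity and field choice are illustrative:)

  #define MAX_MONSTERS 1024

  /* array-of-structures: fields of each monster are interleaved in memory */
  struct Monster { float x, y, speed; int health; };
  struct Monster monsters_aos[MAX_MONSTERS];

  /* structure-of-arrays: each field is contiguous and cheap to scan */
  struct Monsters {
      float x[MAX_MONSTERS];
      float y[MAX_MONSTERS];
      float speed[MAX_MONSTERS];
      int   health[MAX_MONSTERS];
  } monsters_soa;

  /* a pass over health now touches only the health array,
     not every monster's other fields */
  void decay_all(int n) {  /* n <= MAX_MONSTERS */
      for (int i = 0; i < n; i++)
          monsters_soa.health[i] -= 1;
  }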

munksbeer · 6 months ago
Could you give some intuition on why this is likely to be the case more frequently than accessing different values from a single entity?

I'm not a games developer, so I don't have a good feel for why you'd need to iterate over every monster's health property in a tight loop vs. accessing each monster, calculating its position and what properties it has, and updating one or more of its values in an iteration.

zeusk · 6 months ago
The real issue with packed structs and bitfields happens when concurrency gets involved. The majority of modern CPU caches are private and only allow one core to hold a given cache line, so it creates more false dependencies when cores try to alter information that was compacted into a cache line held by another core.
charleslmunger · 6 months ago
Avoiding false sharing is a separate problem best solved by explicitly aligning the struct or relevant members to std::hardware_destructive_interference_size.
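
(In C terms that means an explicit alignas; 64 is an assumed line size here, which the C++ constant exists to avoid hardcoding:)

  #include <stdalign.h>

  /* give each counter its own (assumed 64-byte) cache line so two cores
     incrementing different counters never contend for the same line */
  struct counter {
      alignas(64) long value;
  };
  struct counter per_core[8];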
voidnap · 6 months ago
> memory is cheap and plentiful. What's expensive is CPU cache.

CPU cache is memory. Also memory isn't cheap, it is relatively expensive to access. CPU cache is way cheaper. You have it backwards.

mort96 · 6 months ago
The obvious and charitable read of their post is that they use "memory" to mean main memory/RAM, "CPU cache" to mean CPU cache, and "memory is cheap and plentiful" to mean "memory has a low dollar cost and most systems have a whole lot of it".

Your read -- that they use "memory" to mean "anything which stores bytes, including caches", and that they use "memory is cheap" to mean "accessing RAM has a low latency" -- is uncharitable, contrived and even literally self-contradictory ("CPU cache is memory" + "memory is expensive to access while CPU cache is cheap to access" is a contradiction which can only be resolved through using different definitions of "memory" in those two sentences).

writebetterc · 6 months ago
They're talking about dollars
jcelerier · 6 months ago
> But these days, in most areas, memory is cheap and plentiful

and that's why I am contemplating a freaking 128G of RAM for my next machine. I have 64 right now and OOM regularly; 32 is unusable

zahlman · 6 months ago
What software are you running? I'm humming along without a problem on 8.
wbl · 6 months ago
Everything is microarch-dependent, but typically the memory subsystem can handle subword loads and writes. What it usually can't do is bits.
rwmj · 6 months ago
No mention of pahole / dwarves? https://lwn.net/Articles/335942/ It's the standard tool used by the Linux kernel developers.
petergeoghegan · 6 months ago
Recent versions of clangd show pahole-like struct size/alignment padding overhead annotations. I find this built-in support a lot more convenient than using pahole.
uticus · 6 months ago
for somebody starting out, where do you go for a list of tools like this?
ryao · 6 months ago
pahole is an awesome tool.
kazinator · 6 months ago
Generally speaking, sort the members in descending order by alignment if you want the tightest packing without resorting to nonportable compiler extensions. This descending order eliminates padding between members, producing the smallest offset for the last byte of the last member, and that then minimizes the padding required at the end of the struct.

In C, alignment for basic types is linked to size. A type may have an alignment requirement as strict as its size: e.g. an 8-byte data type might have to be aligned to addresses divisible by 8.

The alignment of an array aggregate is that of its element type.

The alignment of a struct is that of its most strictly aligned member.

In portable coding, we don't exactly know, but we can make the worst-case assumption that every basic type has the strictest possible alignment---i.e. its size. We also don't know the sizes of all the types, but we know that integers of higher ranks are at least as large as lower ranks; i.e. sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long), etc.

Things are tricky when you have a mixture of floating-point, pointers, and integers. E.g. a pointer could be the same size as a float, or bigger.

Here is a trick you can use: combine clumps of like types, and try to make combinations whose sizes are powers of two. For instance, if your structure has two or four pointers to things, try to make them adjacent. E.g.:

  struct s {
    void *p1, *p2, *p3, *p4;
    short s1, s2, s3, s4;
    float f1, f2, f3, f4;
  };

Even though float is almost certainly larger than short and may have stricter alignment, it's moot because I have four of everything. Even if this is on a tiny system where pointers are 16 bits wide, four of them will make 8 bytes. So when the f1 member is being placed into the struct, the offset is at least 8-byte aligned: we don't expect padding.

If you aggregate enough members of the same type in a power-of-two sized clump, you can easily create alignment which meets or exceeds the strictest alignment in the system.

So those are the main principles: group together members of the same type. Then, keeping those groups together, sort by descending alignment strictness.
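
(A before/after illustration, assuming typical 64-bit sizes:)

  /* declaration order: char, pad(7), double, int, tail pad(4) = 24 bytes */
  struct unsorted { char c; double d; int i; };

  /* descending alignment: double, int, char, tail pad(3) = 16 bytes */
  struct sorted { double d; int i; char c; };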

o11c · 6 months ago
Note that although alignment and size are platform-specific, in practice padding can be minimized simply by using a fixed order:

  (various vector types)
  int128_t
  long double
  long long, int64_t, int_fast64_t, double
  off_t
  time_t (can be 64 bits on 32-bit platforms only if off_t also is)
  register_t, int_fast16_t, int_fast32_t (64-bit on e.g. x32)
  void *, void(*)(), intptr_t
  ptrdiff_t (theoretically wrong if fully aligned and bigger than pointer)
  size_t, ssize_t
  long (32-bit on win64; generally has 16-bit alignment if bigger than pointer)
  int32_t, float
  int (can be 16-bit)
  short, int16_t
  char, bool, int8_t, int_fast8_t
  (a trailing array member must come last; note when calculating combined size you should use offsetof instead of sizeof to not waste padding)
Besides this, `int_leastN_t` goes with `intN_t`, and all unsigned types go with their signed types. Note that types like `id_t` are hard to place however.

Although there do exist many platforms where this is not in order of size, the bigger types generally have smaller alignment on such platforms. Even so, I have occasionally split out a row without knowing of any platform that actually needs it.

variadix · 6 months ago
Some other things you can do to improve packing: use integer indices/offsets/relative pointers instead of absolute pointers (especially on 64-bit machines); overlap fields in unions (for example, a tag can share some bits with another field in each member of the union); use a struct's position in memory to encode information (for example, in a binary tree of array-allocated nodes, you can specify that the right child always follows the parent in memory, and then you only need to store the index of the left child); and of course structure packing pragmas/attributes.
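
(A sketch of the index tricks; the pool size and sentinel are hypothetical:)

  #include <stdint.h>

  #define NIL UINT32_MAX  /* sentinel: no child */

  /* nodes live in one array, so 32-bit indices replace 64-bit pointers;
     the right child, when present, occupies the slot after its parent,
     so only the left link needs storing at all */
  struct node {
      int32_t  key;
      uint32_t left;  /* pool index of the left child, or NIL */
  };
  struct node pool[1024];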
cogman10 · 6 months ago
Nice little article.

The next thing to really explore is how the struct is being used. Smaller is always nice, but sometimes you need it to be fast. To do that, you'll want to group together fields that are commonly read. For example, x, y, and speed are likely always read together. So, it'd make sense to group those fields together to optimize for cache reads.

The next thing to question is whether your Monster is being interacted with as an individual thing or as a group of things. If it's as a group of things, then it might make sense to create a "struct of arrays" of monsters. Imagine, for example, you have a system which removes all the dead monsters. It'd be faster to iterate through an `isAlive` array rather than an array of `Monster` structs.

And now you are on your way to reinventing an ECS :)