alex7734 · 2 years ago
> When we free memory, we should make sure that if the block we return to the free list is next to any other free blocks, we combine them together. This is called "coalescing."

A little offtopic but the default Delphi 7 memory allocator did this, except that it also merged blocks that it obtained from different OS allocation calls.

This worked fine for regular usage, but if that memory was ever used for Bitmaps for UI stuff, it wouldn't work: Since Windows does some of its UI stuff in kernel mode, before doing anything Windows would attempt to lock the entire allocation's VAD entry to prevent you from messing with it in another thread while it was using it. If the Bitmap you were working on happened to belong to two different OS-level allocations, this lock function would fail since it wasn't meant to handle that case.

This would lead to random, extremely cryptic errors such as ERROR_NO_SCROLLBARS "The window does not have scroll bars." since the lock call was deep in the stack and the callers replaced the error with another one as it bubbled up.

In the end we had to replace the allocator to avoid that issue. That was a fun day I spent debugging that.
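A minimal sketch of that coalescing step (a toy address-ordered free list in C; the struct layout and names are made up, not Delphi's actual allocator):

```c
#include <stddef.h>

/* A free block header, kept in a singly linked list sorted by address. */
struct block {
    size_t size;        /* size of this block in bytes, header included */
    struct block *next; /* next free block, at a higher address */
};

/* Insert a freed block into the list and merge it with any
   free neighbour it touches: the "coalescing" step. */
struct block *coalesce_insert(struct block *head, struct block *b) {
    struct block *prev = NULL, *cur = head;
    while (cur && cur < b) { prev = cur; cur = cur->next; } /* keep address order */
    b->next = cur;
    if (prev) prev->next = b; else head = b;
    /* Merge with the following block if they are adjacent in memory. */
    if (b->next && (char *)b + b->size == (char *)b->next) {
        b->size += b->next->size;
        b->next  = b->next->next;
    }
    /* Merge with the preceding block if it ends where this one starts. */
    if (prev && (char *)prev + prev->size == (char *)b) {
        prev->size += b->size;
        prev->next  = b->next;
    }
    return head;
}
```

Freeing the middle one of three adjacent blocks merges all three into a single larger block.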

meekaaku · 2 years ago
You mean a bug this deep took one day to debug?
alex7734 · 2 years ago
Yes and no: it took me from 8am to 3am once we decided it needed to get fixed, but really it sat in the app for years. It only happened in a background process that sent print jobs on a timer; since it used Windows GDI to compose the image we sent to the printer, it was affected (our "frontend" should've been affected too but never was, I guess because it had a different memory usage pattern).

We just had it restart itself and try again whenever it got one of those errors when printing but eventually we wanted to add a feature that required the process to not die, and by that time I was already 99% sure that it wasn't something in our code and I had already ruled out threading issues.

I ended up putting it in a VM with a kernel debugger attached and having a script make a snapshot and make it print over and over until it errored, then following along in IDA until I saw what was going on.

Having a way to trigger it (by restoring the snapshot) on demand helped a lot, otherwise it would have taken forever to make sense of it as it could sit without crashing for nearly an hour.

ilyt · 2 years ago
I think the far weirder part of this was the kernel-side handling of scrollbars
alex7734 · 2 years ago
If I recall correctly the kernel part of things would return an out of memory error which the user mode DLL translated to that weird error (sometimes, other times it would just say "out of system resources", it depended on what widget the bitmap that overlapped the two memory regions belonged to).

Here's a 2003 forum post from someone else having the same problem: http://www.delphigroups.info/2/1/749064.html

derefr · 2 years ago
Until Windows 95, Windows was essentially just a DOS application that grabbed the framebuffer and ran an event loop where it drew "controls" (which includes windows, buttons, text views, and yes, scrollbars.) That was the whole point of it. It wasn't an "OS" per se; DOS was the OS. Windows was what a Linux-head would think of as a combination of an X server and window manager. And Windows loaded your "application" as essentially a DLL, with the Windows global event loop calling into your application's event-loop delegate handler (WndProc) whenever it has an interesting event that your application might like to react to.

(Your application wasn't even a "process" per se. Until Windows 95, everything was just happening in one shared address space, in real mode. In fact, it was only in Windows 3.1 where user applications stopped running in ring 0!)

If you think about it, this "the kernel is a game engine and your application is the game" approach isn't necessarily a bad design... for a single-tasking OS's library exokernel, like the Wii's https://wiibrew.org/wiki/IOS.

But, of course, Windows claimed to be a multitasking OS. But it actually wasn't! And I don't mean the obvious thing about it not having pre-emption. Lots of multitasking OSes didn't have pre-emption.

No, what I mean is that the concurrency primitive for the cooperative scheduling wasn't the "task" (i.e. the process or thread. Which, again, there weren't any of.) Instead, the concurrency primitive was the window. Until Windows 95, Windows was a multi-windowing OS.

Each control was owned by a window. Each window had a WndProc. If your Windows executable (i.e. application delegate module) set up two windows, then each window participated separately in the global Windows event loop, up-to-and-including things like having its own set of loaded fonts, its own clipboard state, and its own interned strings table. In OOP terms†, your application was just a dead "class object", running no logic of its own save for one-time load and unload hooks. It was the windows themselves that were the "instances" of your class.

This might make you realize why MDI (or Multiple Document Interface, where there are multiple small per-document "windows" inside one big window) was so popular back then. The MDI "windows" weren't actually windows — they didn't have their own WndProc. They were just controls, like a tab view is a control. Only the big container window was a real window, and so all the resources within that big window were shared between all the virtual windows. MDI was a memory-saving trick!

---

† The actual more interesting analogy is that Windows was essentially a (single-threaded, cooperatively-scheduled) actor system, where windows were the actors. There is a very close parallel between (Windows 3.1 executables, Windows 3.1 windows) and (Erlang modules, Erlang processes).

devit · 2 years ago
The article claims that an allocator that splits memory based on allocation size is called a "buddy allocator". That's misleading: an allocator that allocates an area for each size class is usually called a "slab allocator", while a "buddy allocator" is one that when needed subdivides a memory area with a power of two size into two half-sized areas that are "buddies", does so recursively to satisfy allocations, and coalesces them again when they are free.

E.g. the Linux kernel used (not sure if it's still like this) a buddy allocator to allocate pages and power-of-two blocks of pages and slab allocators to subdivide those pages and allocate data structures.
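The split-and-coalesce behaviour described above can be sketched as a toy buddy allocator over a tiny fixed region (offsets instead of pointers; the region size, order range, and names are all made up for illustration):

```c
#include <string.h>

#define MIN_ORDER 5                          /* smallest block: 32 bytes  */
#define MAX_ORDER 12                         /* whole region: 4096 bytes  */
#define NUM_ORDERS (MAX_ORDER - MIN_ORDER + 1)

static unsigned char heap[1u << MAX_ORDER];
static int free_head[NUM_ORDERS];            /* first free block per order, -1 if none */

/* Free blocks store the offset of the next free block in their first bytes. */
static int  get_next(int off)        { int n; memcpy(&n, heap + off, sizeof n); return n; }
static void set_next(int off, int n) { memcpy(heap + off, &n, sizeof n); }

void buddy_init(void) {
    for (int i = 0; i < NUM_ORDERS; i++) free_head[i] = -1;
    free_head[NUM_ORDERS - 1] = 0;           /* one max-order block at offset 0 */
    set_next(0, -1);
}

/* Allocate 2^order bytes; returns an offset into the region, or -1. */
int buddy_alloc(int order) {
    int i = order - MIN_ORDER, j = i;
    while (j < NUM_ORDERS && free_head[j] == -1) j++;  /* find a big-enough block */
    if (j == NUM_ORDERS) return -1;
    int off = free_head[j];
    free_head[j] = get_next(off);
    while (j > i) {                          /* split recursively down to size */
        j--;
        int half = off + (1 << (j + MIN_ORDER));       /* upper half becomes free */
        set_next(half, free_head[j]);
        free_head[j] = half;
    }
    return off;
}

/* Free 2^order bytes at `off`, coalescing buddies back up as far as possible. */
void buddy_free(int off, int order) {
    int i = order - MIN_ORDER;
    while (i < NUM_ORDERS - 1) {
        int buddy = off ^ (1 << (i + MIN_ORDER));      /* buddy differs in one bit */
        int prev = -1, cur = free_head[i];
        while (cur != -1 && cur != buddy) { prev = cur; cur = get_next(cur); }
        if (cur == -1) break;                /* buddy still in use: stop merging */
        if (prev == -1) free_head[i] = get_next(cur);  /* unlink the buddy... */
        else            set_next(prev, get_next(cur));
        if (buddy < off) off = buddy;        /* ...and merge into one bigger block */
        i++;
    }
    set_next(off, free_head[i]);
    free_head[i] = off;
}
```

The key trick is that a block's buddy is found by XORing its offset with its size, which is what makes coalescing cheap.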

Another thing that the article doesn't mention that is important is that most production allocators make use of thread-local storage and either have per-thread caches of free blocks or sometimes whole per-thread memory regions. This is to reduce lock contention and provide memory that is more likely to be in the current core's cache.
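The per-thread caching idea can be sketched in a few lines (hypothetical names and a single fixed size class; real allocators like tcmalloc and jemalloc are far more sophisticated, and this toy version never returns cached blocks to other threads):

```c
#include <stdlib.h>

#define CACHE_SIZE 16
#define BLOCK_SIZE 64

/* Each thread gets its own small stack of freed blocks,
   so the fast path takes no lock at all. */
static _Thread_local void *cache[CACHE_SIZE];
static _Thread_local int   cached = 0;

void *cached_alloc(void) {
    if (cached > 0)
        return cache[--cached]; /* fast path: lock-free, likely warm in this core's cache */
    return malloc(BLOCK_SIZE);  /* slow path: the shared allocator (locks inside) */
}

void cached_free(void *p) {
    if (cached < CACHE_SIZE) {
        cache[cached++] = p;    /* keep it for this thread's next allocation */
        return;
    }
    free(p);                    /* cache full: return to the shared allocator */
}
```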

samwho · 2 years ago
You're absolutely right, I've corrected this. Thank you!

I had originally written about threading and locality but it made the post too long and complicated, so I cut it out for the final draft. You can see remnants of it if you check the HTML comments in the post :D

sylware · 2 years ago
Linux had a bitmap-based "buddy allocator" (power of two); now it is not bitmap based anymore (the complexity wasn't worth it performance-wise, so simplicity was restored).

Then Linux has various slab allocators (slub/slob/slab) built on top of the "buddy allocator".

User-level code should use mmap-ed regions that are not virtual-address stable (slabs + offsets). Legacy "libc" services were built as virtual-address-stable services... which are kind of expensive to manage on a modern paginated system. Virtual-address-stable regions should be kept to a minimum (that horrible ELF static TLS). There is a workaround though (but the Linux overcommit default policy could kill your process): a user process would query the amount of RAM on the system and mmap a region of roughly (care of the overcommit policy) the same size, only once per process lifetime. Then you could have a virtual-address-stable region which could use most if not all the available RAM (excluding hot memory addition...). Should be very easy to manage with lists.

ogurechny · 2 years ago
In this case, one should start with implementing an -Xmx switch, then gradually add the rest.
celeritascelery · 2 years ago
This is absolute gold. When I use things like this, I am reminded how powerful digital learning can be. So much more capable than just text or video. But very little content is this well put together. Probably because it is so time intensive.
samwho · 2 years ago
This feedback has made my day. Thank you.

I'm inspired by Bartosz Ciechanowski and Julia Evans. The web is such a powerful toolbox. So many concepts are more complex than text alone can hope to explain. Those two are so creative and full of energy.

And you're right, it's incredibly time intensive to put these together. Thousands of lines of code, plus the text content, plus reaching out to domain experts for reviews (shout out Chris Down, kernel wizard extraordinaire).

spenczar5 · 2 years ago
Such great work. I learned things and have a clearer understanding that I think I will come back to in the future. And I say that as someone who has programmed for 15 years! I think your effort paid off, and the inspiration shows through.
samwho · 2 years ago
Also shout out Anton Verinov (https://anton.codes/): the only reason this web page doesn't drain your battery before you get to the end of it.
couchand · 2 years ago
One more name I'd add to this list is Mike Bostock. The care and attention he gives to data visualization examples comes through in the finished product.

Communicating complex subjects through interactive visual displays is very effective -- it's worth the effort. Thank you!

pvorb · 2 years ago
I think it's a nice little example for "Explorable Explanations", a term coined by Bret Victor in his eponymous essay[1] from 2011.

[1]: http://worrydream.com/ExplorableExplanations/

ww520 · 2 years ago
The writing is very clear and the concepts are explained well. Not too much information and not too little. Excellent post.
naillo · 2 years ago
Great job Sam
dmd · 2 years ago
If you're not yet familiar with it - https://ciechanow.ski/archives/ is for you (and everyone!)
samwho · 2 years ago
The master of this artform.
throwaway689236 · 2 years ago
Agreed, I wish I had more things like this growing up.
OliverJones · 2 years ago
Oh man, this brings back the days when I wrote special debug-version malloc and free code to try to track down heap corruption due to malloc / free misuse (in code I had contributed to). Stuff like kbyte-long boundary buffers with bit patterns in them, and all sorts of lookaside lists run in parallel with libc's default code. Those bug-detectors worked OK. Hard-nosed code inspection worked far better.

As an old-timer in writing code, I think my generation's most-challenging legacies (=== the things we f**ked up) are the non-robust malloc/free concept and null-terminated text strings. Bugs in code using those conventions have been exploitable to a really damaging extent. I learned to program in C from K&R. And getting data-structure code right, and safe to deploy, in that language and its derivatives is HARD.

The C inventors are (were, in Dennis Ritchie's case) brilliant and Bell Labs was great. Their ideas shaped the stuff the global internet runs on. But these parts of what they invented..... ouch. (All OSs had the same problems.)

I wish the wonderful article presented here carried a more prominent warning about this. Many will read it as they learn to code. The history of our profession can teach about what NOT to do as well as what to do.

thecoppinger · 2 years ago
All credit goes to my wonderful friend Sam Who: https://twitter.com/samwhoo

He recently published a similar guide on load balancing that is equally intuitive and insightful: https://samwho.dev/load-balancing/

I can't wait to see what he puts out next!

terrycody · 2 years ago
https://news.ycombinator.com/item?id=35588797

We never missed good things lol!

ModernMech · 2 years ago
This is wonderful! I'm definitely going to be sharing this with my students (college sophomores studying CS).

If I were to make some suggestions, based on how I know they would receive this:

- I would make explicit reference to heap and stack. Students who are learning this material are learning about the heap/stack dichotomy, and I think it would really improve the exposition to make clear that not all memory is allocated this way in a program.

- I would remove this line: "At the end of this post, you should know everything you need to know to write your own allocator." I can confidently say that my students will not be able to write an allocator after reading this. It's nothing to do with your piece; it's just that the intersection of people who don't understand hex and people who could build an allocator after a short blog post is very, very small. Students who read this post and at the end still feel like they can't build an allocator will end up discouraged, with a feeling that maybe they are missing something.

- Consider rethinking how you handle hex numbers throughout. You introduce them and say they are distinguished by a preceding "0x", but then you immediately omit the 0x to save space in the figure. In my experience, students have a lot of trouble with understanding that hex and dec have the same underlying representation. They will not be able to distinguish between hex and dec bases implicitly, so from a pedagogical standpoint, it's better to be consistent throughout and include the prefix.

- Finally, at the end I would mention that there are other strategies for memory management, to encourage further exploration. Other languages do it differently, and for students it's important to know which other avenues are out there. Otherwise they might think heap-based malloc/free is the only way to do things, the same way they often think imperative programming is the only way to program.

Anyway, thank you for creating this, and I'm sure it will really help my students. In a just world, "seeing" the memory like this would ideally be first-class tooling for languages like C.

samwho · 2 years ago
Really appreciate you taking the time to write this, thank you.

I tried a couple of different ways to introduce the stack and the heap but it always felt like it made the post too long and complicated. In the end I decided to take a pure, idealistic view of memory in order to focus on the algorithms used to pack allocations effectively. You can see some of my abandoned efforts as HTML comments in the post :D

Introducing the 0x prefix and immediately dropping it hurt me as well, but I didn't have a better way to make the visualisation work on mobile. I completely agree with you that it's not ideal.

I'd like to do a post of this style about garbage collection at some point.

ModernMech · 2 years ago
Maybe make this a series? I think another post just like this on the stack is in order. You could show allocating stack frames with the slider! Then you can put them both together in a third post and show the entire memory layout of a C program.

Would definitely like to see more thoughts from those cute corgis.

wrycoder · 2 years ago
Please do, this excellent post is already a good start on the issues involved in creating compacting garbage collectors.
xigoi · 2 years ago
Is it really necessary to use hexadecimal at all in the article? Decimal works just fine for pointers, it's just not conventionally used.
samsquire · 2 years ago
Thank you for this, this is helpful.

I wrote a JIT compiler and I didn't bother calling free much, I just let the operating system free up all allocated memory.

I got into this situation often:

   return_struct = do_something(mystruct);
   return_struct->inner_struct = malloc(sizeof(struct my_inner_struct));
Now, who owns inner_struct? Who is responsible for freeing it? Do I free it when I assign to it?

I feel this ownership complicates cross-language FFI API calls, because who is responsible for freeing structures depends on the application and the platform you're running under. For example, Rust code being called from Erlang. You have to be extra careful when memory is managed by a different language runtime.

I feel I am at the beginning of intuitively understanding what memory really is: memory is just a huge contiguous region of numbered locations. Bump allocators allocate in one direction and free all at once. Arena allocators allocate to a preallocated region, I think.

Memory is a logistical problem of how you arrange and allocate finite resources.
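A bump allocator really is that simple; a minimal sketch (hypothetical names, a fixed static region, no per-allocation free at all):

```c
#include <stddef.h>

/* Bump allocator: a region, a cursor that only moves forward,
   and a reset that "frees" everything at once. */
static unsigned char region[1024];
static size_t cursor = 0;

void *bump_alloc(size_t size) {
    size = (size + 7) & ~(size_t)7;    /* round up for 8-byte alignment */
    if (cursor + size > sizeof region)
        return NULL;                   /* out of space */
    void *p = region + cursor;
    cursor += size;
    return p;
}

void bump_reset(void) { cursor = 0; } /* free everything in one go */
```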

I am thinking of alternative visualizations of understanding memory, for example, I started writing an animation of binary search:

https://replit.com/@Chronological/ProgrammingRTS

The idea is that you see values and memory locations move around, with the final goal of being able to command memory to move around and be computed, like in a real-time strategy game.

I think if we could visualise memory as cars on a road, we would see obvious traffic jams.

secondcoming · 2 years ago
> Now, who owns inner_struct?

return_struct does since it is the only thing that knows the address.

> Who is responsible for freeing it?

return_struct is, unless you hand that responsibility over to something else.

> Do I free it when I assign to it?

Yes, unless you want leaks.

> I think if we could visualise memory as cars on a road, we would see obvious traffic jams.

That visualisation is helpful for threads, where the program is the road/map and the cars are the threads. I don't see how it's useful for memory.

jacobsenscott · 2 years ago
A struct can't own something - it isn't a class with a destructor or anything. So it isn't quite so obvious. There are only two lines of code here, but the implication is a function is running that code and then returning `return_struct`, which might get passed around to more functions, and even returned further up the call stack. Somewhere there needs to be code that knows "hey - nobody else is using return_struct, and by the way you need to free return_struct->inner_struct before freeing return_struct."
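In C that cleanup has to be spelled out by hand somewhere; a sketch with made-up struct names:

```c
#include <stdlib.h>

struct inner { int x; };
struct outer { struct inner *inner_struct; };

/* Whoever frees the outer struct must free the inner one first:
   once `o` is freed, `o->inner_struct` can no longer be read. */
void outer_free(struct outer *o) {
    if (o == NULL) return;
    free(o->inner_struct);  /* inner first... */
    free(o);                /* ...then the container */
}
```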
dahart · 2 years ago
> I feel I am at the beginning of intuitively understanding what memory really is: memory is just a huge contiguous region of numbered locations.

There might be an analogy here that could help you reason about your nested structure allocations…

Memory is an array of bytes owned by the OS. While there are all kinds of implementation details about addressing and storage and performance and paging and virtual memory, it’s really just an array. The OS gives you a way to reserve pieces of the array for your own use, and you’re responsible for giving them back if you want to play nice and/or run for a long time, otherwise (as a safety net) the OS will take them back as soon as you exit.

This is, in a sense, very similar to the question you posed. An outer routine owns the outer structure, and an inner routine allocates some inner structure. The simplest, most intuitive, and generally best advice is that whoever allocates is also responsible for freeing memory. In other words, one way to define ownership of memory is by who allocates it. Implicitly and automatically, the responsibility to free that memory belongs to the owner that allocated it. It's okay to explicitly transfer ownership, but that can easily get complicated and unintuitive.

You can also consider letting the parent free your struct to be similar to not calling free() in your JIT compiler. It's a 'lazy' optimization to have the parent clean up, and I don't mean that in a judgemental sense; I mean it's valid to let the parent handle it, if you know that it will, and this can be done without getting confused about who owns the memory and who was actually responsible for freeing it.

Note that when you leave the parent to clean it up, you are foregoing the ability to re-use the memory. This is true in your JIT compiler and it's true for malloc() and free() as well. If you let the OS handle it, you're in effect declaring that you believe you do not need to recycle the memory allocated in your program during its execution. (This might be true, and it might stay that way, but it's always worth asking if it will remain true, since lots of people have been eventually bitten and had to retroactively refactor for memory management when their requirements change.)

samwho · 2 years ago
Yeah, I hear you. I've not done a lot of FFI stuff directly, it scares me.

Arena allocators are cool, the idea is you allocate a large-ish region of memory and sub-allocate into it (often with a fast, simple allocator like a bump allocator) and then free the large-ish block when you're done. It's a way to take knowing how much memory you need as a whole and optimise that to a single call to malloc/free.
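A sketch of that idea (hypothetical names): one malloc for the region, a bump pointer to sub-allocate, and one free for everything:

```c
#include <stdlib.h>

struct arena {
    unsigned char *base;
    size_t used, cap;
};

struct arena arena_create(size_t cap) {
    struct arena a = { malloc(cap), 0, cap };
    return a;
}

/* Sub-allocate with a bump pointer: no per-allocation bookkeeping. */
void *arena_alloc(struct arena *a, size_t size) {
    size = (size + 7) & ~(size_t)7;   /* 8-byte alignment */
    if (a->base == NULL || a->used + size > a->cap)
        return NULL;
    void *p = a->base + a->used;
    a->used += size;
    return p;
}

/* One free() releases every allocation made from the arena. */
void arena_destroy(struct arena *a) { free(a->base); a->base = NULL; }
```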

You may enjoy looking through https://www.cs.usfca.edu/~galles/visualization/Algorithms.ht....

samsquire · 2 years ago
Thanks for the link to the animations.

I want an extremely performant deep copy solution, I've been thinking of using an allocator to implement it.

If we have a tree data structure or a nested hashmap and we want to copy it cheaply, there is copy-on-write. But most copies of hashmaps are slow because they instantiate every child object in a recursive loop.

So I want to be able to memcpy a complicated data structure for cheap copies.

duped · 2 years ago
> You have to be extra careful when memory is managed by a different language runtime.

While it would be nice to have next to no overhead for FFI, it's not always tractable. That's why you have to serialize across boundaries, the same as if you're serializing across processes or the network. At least in a single virtual memory space you can have a caller allocate a buffer and the callee fill it, with the caller being responsible for deserializing and freeing later. That gets you pretty far, and is relatively safe.
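The caller-allocates pattern looks like this in C (hypothetical function; the same shape as snprintf and many OS APIs). Ownership never crosses the boundary, so neither side has to guess who calls free():

```c
#include <string.h>

/* Callee fills a caller-owned buffer and returns the bytes needed.
   If the buffer is too small (or NULL), the caller can retry
   with a bigger one. */
size_t get_greeting(char *buf, size_t cap) {
    const char *msg = "hello from the callee";
    size_t need = strlen(msg) + 1;  /* include the NUL terminator */
    if (buf != NULL && cap >= need)
        memcpy(buf, msg, need);
    return need;
}
```

Calling it with a NULL buffer first lets the caller size the buffer before the real call.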

The alternative is to be callee managed, and for the callee to return things by handle and not necessarily by pointer, but that is also fraught.

stevefan1999 · 2 years ago
That's exactly why you should use reference counting
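A minimal sketch of that idea (non-atomic counter, so single-threaded only; all names are made up): the last holder to release the object is the one that frees it, so no single caller needs to know who the owner is.

```c
#include <stdlib.h>

/* A reference-counted box: ownership is shared among holders. */
struct rc {
    int count;
    int value;
};

struct rc *rc_new(int value) {
    struct rc *r = malloc(sizeof *r);
    r->count = 1;               /* the creator holds the first reference */
    r->value = value;
    return r;
}

struct rc *rc_retain(struct rc *r) { r->count++; return r; }

void rc_release(struct rc *r) {
    if (--r->count == 0)
        free(r);                /* last reference gone: safe to free */
}
```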
Damogran6 · 2 years ago
Not really related, but the feeling of elation when I alloc'd 1M of RAM in OS/2 and it _didn't_ swap changed me.

This was on a 386sx with 8M RAM and it was pretty much all the available memory after the OS was loaded and settled down.

A MILLION BYTES!!

Didn't do anything with it, but still, after DOS and EMS/XMS memory and all the other falderol of managing memory. (At the time, it was also the only x86 OS that would format a floppy drive without bringing all the other threads to a crawl. The UI was still available-ish, BiModem was still BiModeming...)