I’ve been trying out various LLMs for working on assembly code in my toy OS kernel for a few months now. It’s mostly low-level device setup and bootstrap code, and I’ve found they’re pretty terrible at it generally. They’ll often generate code that won’t quite assemble, they’ll hallucinate details like hardware registers etc, and very often they’ll come up with inefficient code. The LLM attempt at an AP bootstrap (real-mode to long) was almost comical.
All that said, I’ve recently started a RISC-V port, and I’ve found that porting bits of low-level init code from x86 (NASM) to RISC-V (GAS) is actually quite good - I guess because it’s largely a simple translation job and it already has the logic to work from.
I have, and to be fair that has solved the “basically incorrect code” issue with reasonable regularity. Occasionally the error messages don’t seem helpful enough for it, which is understandable, and I’ve had a few occurrences of it getting “stuck” in a loop trying to e.g. use an invalid addressing mode (it may have gotten itself out of those situations if I were more patient) but generally, with one of the Claude 4 models in agent mode in cursor or Claude code, I’ve found it’s possible to get reasonably good results in terms of “does it assemble”.
I’m still working on a good way to integrate more feedback for this kind of workflow, e.g. for the attempt it made at AP bootstrap - debugging that is just hard, and giving an agent enough control over the running code and the ability to extract the information it would need to debug the resulting triple fault is an interesting challenge (even if probably not all that generally useful).
I have a bunch of pretty ad-hoc test harnesses and the like that I use for general hosted testing, but that can only get you so far in this kind of low-level code.
Similar experience - they seem to generally have a lot more problems with ASM than structured languages. I don't know if this reflects less training data, or difficulty.
As far as i can tell they have trouble with sustained satisfaction of multiple constraints, and asm has more of that than higher level languages. (An old Boss once said his record for bug density was in asm: he'd written 3 bugs in a single opcode)
The few times I've messed with it I've noticed they're pretty bad at keeping track of registers as they move between subroutines. They're just not great at coming up with a consistent "sub language" the way human assembly programmers tend to.
A bit tangential, but I've found 4 Sonnet to be much, much better at SIMD intrinsics (in my case, in Rust) than Sonnet 3.5 and 3.7, which were kind of atrocious. For example, 3.7 would write a scalar for loop and tell you "I've vectorized...", when I explicitly asked to do the operations with x86 intrinsics and gave it the capabilities of the hardware. Also, telling it to use AVX2 as supported would not make it use SSE or it would make conditionals to use them, which makes no sense. Seems Claude 4 solves most of that.
This fits my experience. I’m definitely getting considerably better results with 4 than previous Claudes. I’d essentially dropped sonnet from my rotation before 4 became available, but now it’s a go-to for this sort of thing.
Tangent: godbolt.org greeted me with a popup but boy, I have never seen a clearer privacy notice, minimal possible data retention, including a diff with the last version. Great job, Matt!
It does.... I was just surprised that it turned up as terminal output - for some reason I was expecting something in some form of GUI window for some OS or other but I guess that's orders of magnitude more complex and more likely to not work. But he did actually ask for ASCII output, so that does make sense - unlike my assumption!
I think that opening a window and rendering something inside it using the native Win32 API from assembly code on Windows would not be so terrifyingly complex. It's just more code as it needs to call the appropriate GUI APIs (not just syscalls), and it's OS-specific... but such code is anyway always OS-specific (the one mentioned here seems to be for Linux, given the used syscalls). No idea how complex it would be with X or on Mac, as I don't know their low-level GUI APIs.
Mmmmyeah, well, one thing LLM are very decent at is translating, it being from human language to human language or from code to code, so not sure your point stands.
Llm is useless in real world codebase. Tons of hallucination and nonsense. Garbagd everywhere. The danger thing is they messed things up rdomly, o consistence at all.
It is fine to treat it as a better autocompletion tool.
All that said, I’ve recently started a RISC-V port, and I’ve found that porting bits of low-level init code from x86 (NASM) to RISC-V (GAS) is actually quite good - I guess because it’s largely a simple translation job and it already has the logic to work from.
Have you tried using a coding agent that can run the compiler itself and fix any errors in a loop?
The first version I got here didn't compile. Firing up Claude Code and letting it debug in a loop fixed that.
I’m still working on a good way to integrate more feedback for this kind of workflow, e.g. for the attempt it made at AP bootstrap - debugging that is just hard, and giving an agent enough control over the running code and the ability to extract the information it would need to debug the resulting triple fault is an interesting challenge (even if probably not all that generally useful).
I have a bunch of pretty ad-hoc test harnesses and the like that I use for general hosted testing, but that can only get you so far in this kind of low-level code.
Edit: that -> than
First try worked but didn't use correct terminal size.
[0]: https://mathr.co.uk/mandelbrot/book-draft-2017-11-10.pdf
[1]: https://mathr.co.uk/web/mandelbrot.html
If you'd like to join our great little fractal community, here's a Discord invite link: https://discord.gg/beKyJ8HSk5
Dead Comment
[0] https://code.golf/mandelbrot#assembly
It is fine to treat it as a better autocompletion tool.