That's an incredible find and once I saw the assembly I was right along with them on the debug path. Interestingly it doesn't need to be assembly for this to work, it's just that that's where the split was. The IR could've done it, it just doesn't for very good reasons. So another win for being able to read arm assembly.
Unsure if this would be another way to do it but to save an instruction at the cost of a memory access you could push then pop the stack size maybe? Since presumably you're doing that pair of moves on function entry and exit. I'm not really sure what the garbage collector is looking for so maybe that doesn't work, but I'd be interested to hear some takes on it
You would normally use the “LDR Rd, =expr” pseudo-instruction form [1]. For immediates not directly constructible, it puts a copy of the immediate value in a PC-relative memory location, then does a PC-relative load into register.
So that would turn the whole sequence of “add constant to SP” into 2 executable instructions, 1 for constructing immediate and 1 for adding for a total of 8 bytes, and a 4 byte data area for the 17-bit immediate for a total of 12 bytes of binary which is 3 executable instructions worth.
I've usually seen compilers handle large constants with MOV/MOVK sequences (encoding 16 bits of data per 32-bit instruction) instead of loading them from memory. Loading from memory was more common on 32-bit ARM.
I'm a little surprised that this bug wasn't fixed in the assembler as a special case for immediate adds to RSP. If the patch was to the compiler only, other instances of the bug could be lurking out there in aarch64 assembly code.
Would that be wise? The implemented solution uses a temporary register to hold the full value being added to rsp.
I don't know enough about how people use the go assembler, but I imagine it would be very surprising if `add $imm, rsp, rsp` clobbered an unrelated register when `$imm` is large enough. Especially since what's clobbered is the designated "temporary register", which I imagine is used all the time in handwritten go assembly.
Is that possible? I think you would have [1] to use a register to build up the immediate value. The assembler cannot/should not default to one, so I think the best one could do is having another macro for ADD that takes that helper register as an argument. That wouldn’t fix other instances in the AArch64 assembly code.
[1] I’m not familiar with AMD64, but maybe, you could use a thread local (edit: wouldn’t work with M:N threads. You’d need a coroutine-local. That would tie the assembler to golang, and thus would, even on that alone, be a very bad idea) or reserve space in the stack frame for it, too, but I don’t see those as realistic options
Still, I find great that Go got back the 1990's tradition that compiled languages have an assembler as part of their tooling, regardless of the syntax.
I've never heard of that rule (though tbh I'm not allocating > 64KB of stack when I'm in assembly) and it seems Google hasn't either. While I'm sure it makes sense, I don't think I've ever seen that be enforced. At least in C/C++. Maybe it makes more sense for these stack inspecting garbage collectors but I've also heard of ones that just scan the stack without unwinding anything. I did a test asking Google's AI to generate a complicated C function, put it in godbolt, and there's plenty of push push push push ..... Pop Pop Pop Pop going on
Yeah but we have codegen bugs in .NET as well. The biggest difference that stood out to me in this write up, is we would have gone straight for “coredump” instead of other investigation tools. Our default mode of investigating memory corruption issues is dumps.
I think the right fix is that the compiler should, e.g. load the constant into a register using two moves and then emit a single add. It's one more instruction, but then the adjustment is atomic (i.e. a single instruction). Another option is to do the arithmetic in a temp register and then move it back.
One thing I worry about, probably unnecessarily, is anything with a sense of urgency.
HEY GUYS WE JUST FOUND A GOLANG COMPILER BUG AND FATAL PANICS!
Everyone is like “Hmm. I need to fix this now.”
So, 99% probability it’s what it is. 1% it’s some secret defensive thing because there was a bad stupid zero day someone would get fired over or that could leave the world in shambles if uncovered, or maybe something else needed to be swept under the rug, or maybe someone wants to distract while they introduce a new vulnerability.
I don’t think this with CVEs, but when someone’s like “install this patch everybody!” the dim red light flickers on.
It's an open source project — and quite a popular one, at that — and you are literally replying to a comment that specifies the changes made to fix this particular issue — you can see for yourself what is occurring here. Anyone can.
This issue, and the fix, has perfectly good visibility. Even if you personally can't understand the code, plenty of others can and do.
All of which makes your claims seem like quite unnecessary paranoia — to a lot of folk... and I suspect that is probably why your comment is getting heavily downvoted.
One thing that often gets missed is how hard it is to even suspect the compiler as the root cause. Most engineers waste hours chasing bugs in their own code because we’re trained to trust our tools. This mindset alone can make these rare compiler bugs much trickier to find.
In the early PC days we suspected them a lot given how manually writting Assembly was still much better, in many cases.
I found out a bug on Turbo Pascal 6, where if you declare a variable with the same name as the function name, then the result was random garbage.
For those that don't know Pascal, the function name has to be assigned for the result value, so if a local variable with the same name is possible, then you cannot set the return value.
(* In Turbo Pascal 6 this would compile *)
function Square(num: Integer): Integer;
var
Square: Integer;
begin
Square := num * num; (* Here the local variable gets used instead *)
end;
succ(seg(x)) and pred(seg(x)) turned out to be equivalent of just seg(x) in TP6.
Earlier versions of Turbo Pascal (and Poly Pascal) generated poor code for "... + 1" but better code with succ(...) and doing memory access via memw[s:o] was common for speed for certain kinds of code. Allocating whatever size you needed + 16 guaranteed you had allocated enough to have paragraph aligned allocation (16 instead of 15 so you could just use the segment + 1).
I think it took a day or two to find this bug in some text-mode windowing code I'd written.
In the past it was more common to suspect the compiler, as others mention here.
On a minicomputer I worked with in the late eighties, early nineties, I occasionally found errors in the compiler output. This was a Pascal compiler and because of that it didn't take too long to figure out that the code was actually correct and something else must be going on. Then firing up the debugger/tracer and scrutinizing and analyzing what happens in the disassembly.. when the problem was found, send a fax (yes!) to the head designer of the compiler, get a fixed test compiler back on a set of floppies.. went through this several times. I still have a printout somewhere with my pen marks pointing out a bug in the generated code.
Great technical blog. Good pathway for narrative, tight examples, description so clear it makes me feel smarter than I am because so easy to follow though the last time I even read assembly seriously was x86 years ago.
Also, fulfills the marketing objective because I cannot help but think that this team is a bunch of hotshots who have the skill to do this on demand and the quality discipline to chase down rare issues.
I assume these are Ampere Altra? I was considering some of those for web servers to fill out my rack (more space than power) but ended up just going higher on power and using Epyc.
This problem strikes me more as a debuginfo generation bug than a "compiler" bug.
> After this change, stacks larger than 1<<12 will build the offset in a temporary register and then add that to rsp in a single, indivisible opcode. A goroutine can be preempted before or after the stack pointer modification, but never during. This means that the stack pointer is always valid and there is no race condition.
Seems silly to pessimize the runtime, even slightly, to account for the partial register construction. DWARF bytecode ought to be powerful enough to express the calculations needed for restoring the true stack pointer if we're between immediate adjustments.
> This problem strikes me more as a debuginfo generation bug than a "compiler" bug.
But isn't that the same thing here? The bug occurred in their production workflows, not in some specific debug builds, so with that seems pretty reasonable to call it a compiler bug?
Thanks. I think of unwinder information as debuginfo even though, as you point out, it's used outside of debugging contexts all the time. :-)
As for the actual bug:
Unless you're unwinding the stack by walking the linked list of frames threaded through the frame pointer, then each time you unwind a level of the stack, you need to consult a table keyed on instruction pointer to look up how to compute the register contents of the previous frame based on register content of the current frame. One of the registers you can compute this way is the previous frame's stack pointer.
I haven't looked in depth at what the Go runtime is doing exactly, but at a glance, I don't see mention of frame pointers in the linked article, so I'm guessing Go uses the SP-and-unwind-table approach? If so, the real bug here is that the table didn't have separate entries for the two ADDs and so gave incorrect reconstruction instructions for one of them.
If, however, frame pointers are a load-bearing part of the Go runtime, and that runtime failed to update frame pointer (not just the stack pointer) in the contractually mandatory manner, well, that's a codegen bug and needs a codegen fix.
I guess I just don't like, as a matter of philosophy if not practical engineering, having frame pointers at all. Without the frame pointer, the program already contains all the information you need to unwind, at no runtime cost --- you pay for table lookups only when you unwind, not all the time, on straight-line code.
The purist in me doesn't like burning a register for debugging, but you have to use the right tool for the job I guess.
What ARM64 machines are you using and what are they used for? Last year you were announcing Gen 12 servers on AMD EPYC (https://blog.cloudflare.com/gen-12-servers/), but IIRC there weren’t any mentions of ARM64. But now it seems you’re running ARM64 in full production.
Yeah but those are pretty dated. I was under the impression those old Ampere servers are not efficient compared to modern EPYC anymore. So I’m wondering what their current generation of arm64 servers look like :p
I wonder if Go had a mode where you make it single step every instruction and trigger a GC interrupt on every opcode. That would make it easier to find these kinds of bugs.
As an aside, this is the type of a problem that I think model checkers can't help with. You can write perfect and complicated TLA+/Lean/FizzBee models and even if somehow these models can generate code for you from your correct models you can still run into bugs like these due to platform/compiler/language issues. But, thankfully, such bugs are rare.
Unsure if this would be another way to do it but to save an instruction at the cost of a memory access you could push then pop the stack size maybe? Since presumably you're doing that pair of moves on function entry and exit. I'm not really sure what the garbage collector is looking for so maybe that doesn't work, but I'd be interested to hear some takes on it
So that would turn the whole sequence of “add constant to SP” into 2 executable instructions, 1 for constructing immediate and 1 for adding for a total of 8 bytes, and a 4 byte data area for the 17-bit immediate for a total of 12 bytes of binary which is 3 executable instructions worth.
[1] https://developer.arm.com/documentation/dui0801/l/A64-Data-T...
I don't know enough about how people use the go assembler, but I imagine it would be very surprising if `add $imm, rsp, rsp` clobbered an unrelated register when `$imm` is large enough. Especially since what's clobbered is the designated "temporary register", which I imagine is used all the time in handwritten go assembly.
[1] I’m not familiar with AMD64, but maybe, you could use a thread local (edit: wouldn’t work with M:N threads. You’d need a coroutine-local. That would tie the assembler to golang, and thus would, even on that alone, be a very bad idea) or reserve space in the stack frame for it, too, but I don’t see those as realistic options
Yes, though that weird stuff with dollars in it is not normal AArch64 assembly!
The article could have mentioned the "stack moves once" rule.
https://go.dev/doc/asm
Still, I find great that Go got back the 1990's tradition that compiled languages have an assembler as part of their tooling, regardless of the syntax.
See the AT&T vs Intel syntax since you aren't familiar with assembly:
https://en.wikipedia.org/wiki/X86_assembly_language#Syntax
Dead Comment
Does the Go team have a natural language bot or is this just comment.contains(“backport”) type stuff?
(found via https://go.dev/wiki/gopherbot)
HEY GUYS WE JUST FOUND A GOLANG COMPILER BUG AND FATAL PANICS!
Everyone is like “Hmm. I need to fix this now.”
So, 99% probability it’s what it is. 1% it’s some secret defensive thing because there was a bad stupid zero day someone would get fired over or that could leave the world in shambles if uncovered, or maybe something else needed to be swept under the rug, or maybe someone wants to distract while they introduce a new vulnerability.
I don’t think this with CVEs, but when someone’s like “install this patch everybody!” the dim red light flickers on.
This issue, and the fix, has perfectly good visibility. Even if you personally can't understand the code, plenty of others can and do.
All of which makes your claims seem like quite unnecessary paranoia — to a lot of folk... and I suspect that is probably why your comment is getting heavily downvoted.
I found out a bug on Turbo Pascal 6, where if you declare a variable with the same name as the function name, then the result was random garbage.
For those that don't know Pascal, the function name has to be assigned for the result value, so if a local variable with the same name is possible, then you cannot set the return value.
Something like this https://godbolt.org/z/s6srhTW66
Earlier versions of Turbo Pascal (and Poly Pascal) generated poor code for "... + 1" but better code with succ(...) and doing memory access via memw[s:o] was common for speed for certain kinds of code. Allocating whatever size you needed + 16 guaranteed you had allocated enough to have paragraph aligned allocation (16 instead of 15 so you could just use the segment + 1).
I think it took a day or two to find this bug in some text-mode windowing code I'd written.
The reporter actually spent the effort to track it down, turns out it _was_ a Go compiler bug. (https://github.com/golang/go/issues/20427)
Deleted Comment
In the HFT sphere i haven't talked to a company that hasn't reported (bragged about finding) a super weird gcc/clang bug.
Well, also, at my last job we used a snapshot version of the compiler, bc... Any nanoseconds matters.
Also, fulfills the marketing objective because I cannot help but think that this team is a bunch of hotshots who have the skill to do this on demand and the quality discipline to chase down rare issues.
I assume these are Ampere Altra? I was considering some of those for web servers to fill out my rack (more space than power) but ended up just going higher on power and using Epyc.
> After this change, stacks larger than 1<<12 will build the offset in a temporary register and then add that to rsp in a single, indivisible opcode. A goroutine can be preempted before or after the stack pointer modification, but never during. This means that the stack pointer is always valid and there is no race condition.
Seems silly to pessimize the runtime, even slightly, to account for the partial register construction. DWARF bytecode ought to be powerful enough to express the calculations needed for restoring the true stack pointer if we're between immediate adjustments.
But isn't that the same thing here? The bug occurred in their production workflows, not in some specific debug builds, so with that seems pretty reasonable to call it a compiler bug?
As for the actual bug:
Unless you're unwinding the stack by walking the linked list of frames threaded through the frame pointer, then each time you unwind a level of the stack, you need to consult a table keyed on instruction pointer to look up how to compute the register contents of the previous frame based on register content of the current frame. One of the registers you can compute this way is the previous frame's stack pointer.
I haven't looked in depth at what the Go runtime is doing exactly, but at a glance, I don't see mention of frame pointers in the linked article, so I'm guessing Go uses the SP-and-unwind-table approach? If so, the real bug here is that the table didn't have separate entries for the two ADDs and so gave incorrect reconstruction instructions for one of them.
If, however, frame pointers are a load-bearing part of the Go runtime, and that runtime failed to update frame pointer (not just the stack pointer) in the contractually mandatory manner, well, that's a codegen bug and needs a codegen fix.
I guess I just don't like, as a matter of philosophy if not practical engineering, having frame pointers at all. Without the frame pointer, the program already contains all the information you need to unwind, at no runtime cost --- you pay for table lookups only when you unwind, not all the time, on straight-line code.
The purist in me doesn't like burning a register for debugging, but you have to use the right tool for the job I guess.
Dead Comment
As an aside, this is the type of a problem that I think model checkers can't help with. You can write perfect and complicated TLA+/Lean/FizzBee models and even if somehow these models can generate code for you from your correct models you can still run into bugs like these due to platform/compiler/language issues. But, thankfully, such bugs are rare.
For the implementation, you can use certified compilers like CompCert [1], but:
- you still have to show your code is correct
- there are still parts of CompCert that are not certified
[1] https://compcert.org/