aetimmes · 5 months ago
(disclaimer: I know OP IRL.)

I'm seeing a lot of comments saying "only 2 days? must not have been that bad of a bug". Some thoughts here:

At my current day job, our postmortem template asks "Where did we get lucky?" In this instance, the author definitely got lucky that they were working at Google where 1) there were enough users to generate this Heisenbug consistently and 2) that they had direct access to Chrome devs.

Additionally - the author (and his team) triaged, root caused and remediated a JS compiler bug in 2 days. The sheer amount of complexity involved in trying to narrow down where in the browser code this could all be going wrong is staggering. Consider that the reason it took him "only" two days is because he is very, _very_ good at what he does.

marginalia_nu · 5 months ago
Days-taken-to-fix is kind of a weird measure of how difficult a bug is. It's clearly a function of a large number of things besides the bug itself, including experience and whether you have to go it alone or can talk to the right people.

The bug ticks most of the boxes for a tricky bug:

* Non-deterministic

* Enormous haystack

* Unexpected "1+1=3"-type error with a cause outside of the code itself

Like sure, it would have been slower to debug if it took 30 hours to reproduce, and harder if he had to be going down Niagara Falls in a barrel while debugging it, but I'm not sure those things quite count.

I had a similar category of bug I was struggling with the other year[1], related to a faulty optimization in the GraalVM JVM leading to bizarre behavior in very rare circumstances. If I'd been sitting next to the right JVM engineers over at Oracle, I'm sure we'd have figured it out in days and not the weeks it took me.

[1] https://www.marginalia.nu/log/a_104_dep_bug/

seeingnature · 5 months ago
I'd love to see the rest of your postmortem template! I never thought about adding a "Where did we get lucky?" question.

I recently realized that one question for me should be, "Did you panic? What was the result of that panic? What caused the panic?"

I had taken down a network, and the device led me down a pathway that required multiple apps and multiple logins I didn't have in order to regain access. I panicked and, because the network was small, roamed and moved all devices to my backup network.

The following day, under no stress, I realized that my mistake was that I was scanning a QR code 90 degrees off from its proper orientation. I didn't realize that QR codes had a proper orientation and figured that their corner identifiers handled any orientation. Then it was simple to gain access to that device. I couldn't even replicate the other odd path.

somat · 5 months ago
One of my favorite man pages is scan_ffs https://man.openbsd.org/scan_ffs

    The basic operation of this program is as follows:

    1. Panic. You usually do so anyways, so you might as well get it over with. Just don't do anything stupid. Panic away from your machine. Then relax, and see if the steps below won't help you out.

    2. ...

srejk · 5 months ago
The standard SRE one recommended by Google has a lucky section. We tend to use it to talk about getting unlucky too.
nathan_douglas · 5 months ago
A good section to have is one on concept/process issues you encountered, which I think is a generalization of your question about panic.

For instance, you might be mistaken about the operation of a system in some way that prolongs an outage or complicates recovery. Or perhaps there are complicated commands that someone pasted in a comment in a Slack channel once upon a time and you have to engage in gymnastics with Sloogle™ to find them, while the PM and PO are requesting updates. Or you end up saving the day because of a random confluence of rabbit holes you'd traversed that week, but you couldn't expect anyone else on the team to have had the same flash of insight that you did.

That might be information that is valuable to document or add to training materials before it is forgotten. A lot of postmortems focus on the root cause, which is great and necessary, but don't look closely at the process of trying to stop the bleeding.

parliament32 · 5 months ago
No, QR codes are auto-orienting[1]. If you're getting a different reading at different orientations, there is a bug in your scanner.

[1] https://en.wikipedia.org/wiki/QR_code#Design

Suppafly · 5 months ago
> I didn't realize that QR codes had a proper orientation and figured that their corner identifiers handled any orientation.

Same, I assumed they were designed to always work. I suspect it was whatever app or library you were using that wasn't designed to handle them correctly.


ivraatiems · 5 months ago
Imagine if you weren't working at Google and were trying to convince the Chromium team you found a bug in V8. That'd probably be nigh-impossible.

One thing I notice is that Google has no way whatsoever to actually just ask users "hey, are you having problems?", a definite downside of their approach to software development where there is absolutely no communication between users and developers.

saagarjha · 5 months ago
I think you could, but you'd need a very convincing bug report.
jbs789 · 5 months ago
I suspect that minimising someone else’s work allows the commenters to feel better about themselves. As a general rule/perspective.
lesuorac · 5 months ago
> In this instance, the author definitely got lucky that they were working at Google where 1) there were enough users to generate this Heisenbug consistently and 2) that they had direct access to Chrome devs.

I'm not sure this is really luck.

The fix is to just not use Math.abs. If they didn't work at Google they still would've done the same debugging and used the same fix. Working at Google probably harmed them as once they discovered Math.abs didn't work correctly they could've just immediately used `> 0` instead of asking the chrome team about it.

There's nothing lucky about slowly adding printf statements until you understand what the computer is actually doing; that's just good work.

jdwithit · 5 months ago
I wish I could recall the details better but this was 20+ years ago now. In college I had an internship working at Bose, doing QA on the firmware in a new multi-CD changer add-on to their flagship stereo. We were provided discs of music tracks with various characteristics. And had to listen to them over and over and over and over and over and over, running through test cases provided by QA management as we did. But also doing random ad-hoc testing once we finished the required tests on a given build.

At one point I found a bug where if you hit a sequence of buttons on the remote at a very specific time--I want to say it was "next track" twice right as a new track started--the whole device would crash and reboot. This was a show stopper; people would hit the roof if their $500 stereo crashed from hitting "next". Similar to the article, the engineering lead on the product cleared his schedule to reproduce, find, and fix the issue. He did explain what was going on at the time, but the specifics are lost to me.

Overall the work was incredibly boring. I heard the same few tracks so many times I literally started to hear them in my dreams. So it was cool to find a novel, highest severity bug by coloring outside the lines of the testcases. I felt great for finding the problem! I think the lead lost 20% of his hair in the course of fixing it, lol.

I haven't had QA as a job title in a long time but that job did teach me some important lessons about how to test outside the happy path, and how to write a reproducible and helpful bug report for the dev team. Shoutout to all the extremely underpaid and unappreciated QA folks out there. It sucks that the discipline doesn't get more respect.

steveBK123 · 5 months ago
That is great QAing. It also speaks to why QA should be a real role in more orgs, rather than a shrinking discipline. Engineers LOVE LOVE LOVE to test the happy path.

It's not even malice/laziness, it's that their entire interpretation of the problem/requirements drives their implementation, which then drives their testing. It's like asking restaurants to self-certify they are up to food safety codes.

z3t4 · 5 months ago
If you do not follow the happy path, something will break 100% of the time. That's why engineers always follow the happy path. Some engineers even think that anything outside the happy path is an exception and not even worth investigating. These engineers only thrive if the users are unable to switch to another product. Only competition will lead to better products.
fatnoah · 5 months ago
> That is great QAing. It also speaks to why QA should be a real role in more orgs, rather than a shrinking discipline.

As a software engineer, I've always been very proud of my thoroughness and attention to detail in testing my code. However, good QA people always leave me wondering "how did they even think to do that?" when reviewing bug reports.

QA is both a skillset AND a mindset.

philk10 · 5 months ago
Pedantically pointing out the difference between doing some exploratory testing ("testing outside the test cases") and QA, which is setting up processes/procedures, part of which should be "do exploratory testing as well as running the test cases". But the "testing is not QA" distinction has been fought over for decades...

But, love the story and I collect tales like this all the time so thanks for sharing

HdS84 · 5 months ago
A friend of mine has near-PTSD from watching some movie over and over and over at an optician where she worked. It was on rotation so that their customers could gauge their eyesight.
kridsdale1 · 5 months ago
I imagine flight attendants are pretty tired of the Delta Broadway Show video.
BobbyTables2 · 5 months ago
Interesting writeup, but 2 days to debug “the hardest bug ever”, while accurate, seems a bit overdone.

Though abs() returning negative numbers is hilarious... “You had one job…”

To me, the hardest bugs are nearly irreproducible “Heisenbugs” that vanish when instrumentation is added.

I’m not just talking about concurrency issues either…

The kind of bug where a reproduction attempt takes a week, not parallelizable due to HW constraints, and logging instrumentation makes it go away or fail differently.

2 days is cute though.

userbinator · 5 months ago
> The kind of bug where a reproduction attempt takes a week, not parallelizable due to HW constraints, and logging instrumentation makes it go away or fail differently.

The hardest one I've debugged took a few months to reproduce, and would only show up on hardware that only one person on the team had.

One of the interesting things about working on a very mature product is that bugs tend to be very rare, but those rare ones which do appear are also extremely difficult to debug. The 2-hour, 2-day, and 2-week bugs have long been debugged out already.

gmueckl · 5 months ago
That reminded me of a former colleague at the desk next to me randomly exclaiming one day that he had just fixed a bug he had created 20 years ago.

The bug was actually quite funny in a way: it was in the code displaying the internal temperature of the electronics box of some industrial equipment. The string conversion was treating the temperature variable as an unsigned int when it was in fact signed. It took a brave field technician in Finland, inspecting a unit in an unheated space in winter, to even discover this particular bug, because the units' internal temperatures were usually about 20°C above ambient.
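
For anyone curious what that kind of signed-to-unsigned misread looks like, here's a minimal sketch in TypeScript (not the device's actual firmware code; `formatTempWrong` is made up for illustration, using a 32-bit reinterpretation to stand in for the wrong string conversion):

    // Reinterpreting a signed 32-bit temperature as unsigned: -15 becomes a huge number.
    function formatTempWrong(celsius: number): string {
        const asUnsigned = celsius >>> 0;   // unsigned 32-bit reinterpretation
        return `${asUnsigned} C`;
    }

    console.log(formatTempWrong(23));   // "23 C" - fine for the usual "ambient + 20" case
    console.log(formatTempWrong(-15));  // "4294967281 C" - only a freezing install reveals it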

devsda · 5 months ago
During the time I was working on a mature hardware product in maintenance, if I think about the number of customer bugs we had to close as not reproducible, or that were only present for a brief time in a specific setup, it was really embarrassing and we felt like a bunch of noobs.
dharmab · 5 months ago
Bryan Cantrill did a talk about this phenomenon called "Zebras all the way down" some years back.
jakevoytko · 5 months ago
Author here! I debugged a fair number of those when I was a systems engineer in soft real time robotics systems, but none of them felt as bad in retrospect because you're just reading up on the system and mulling over it and eventually you get the answer in a shower thought. Maybe I just find the puzzle of them fun, I don't know why they don't feel quite so bad. This was just an exhausting 2-day brute-force grind where it turned out the damn compiler was broken.
gertlex · 5 months ago
I also came to the comments to weigh in on my perception of how rough this was, but instead will ask:

Regarding "exhausting 2-day brute-force grind": is/was this just how you like to get things done, or was there external pressure of the "don't work on anything else" sort? I've never worked at a large company, and lots of descriptions of the way things get done are pretty foreign to me :). I am also used to being able to say "this isn't getting figured out today; probably going to be best if I work on something else for a bit, and sleep on it, too".

jffhn · 5 months ago
> Though abs() returning negative numbers is hilarious.

Math.abs(Integer.MIN_VALUE) in Java very seriously returns -2147483648, as there is no int for 2147483648.
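
The same two's-complement wraparound can be sketched in TypeScript by forcing the result back into 32 bits (illustrative only; plain JS numbers are doubles, so the `| 0` is what forces the wrap here):

    const min32 = -2147483648;    // Integer.MIN_VALUE
    const abs = Math.abs(min32);  // 2147483648, fine as a JS double
    console.log(abs | 0);         // -2147483648: +2^31 doesn't fit in a signed 32-bit int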

eterm · 5 months ago
You inspired me to check what .NET does in that situation.

It throws an OverflowException: ("Negating the minimum value of a twos complement number is invalid.")

rhaps0dy · 5 months ago
Oh no, PyTorch does the same thing:

    import torch

    a = torch.tensor(-2**31, dtype=torch.int32)
    assert a == a.abs()

adrian_b · 5 months ago
Unchecked integer overflow strikes again.
bobbylarrybobby · 5 months ago
Rust does the same in release, although it panics in debug.


efortis · 5 months ago
Same here, we had an IE8 bug that prevented the initial voice over of the screen reader (JAWS). No dev could reproduce it because we all had DevTools open.
gsck · 5 months ago
I had a similar issue, worked fine when I was testing it on my machine, but I had dev tools open to see any potential issues.

Turns out IE8 doesn't define console until the devtools are open. That caused me to pull a few hairs out.

smrq · 5 months ago
I can't remember the actual bug now, but one of my early career memories was hunting down an IE7 issue by using bookmarklets to alert() values. (Did IE7 even have dev tools?)
lukan · 5 months ago
"To me, the hardest bugs are nearly irreproducible “Heisenbugs” that vanish when instrumentation is added."

My favourites are bugs that not only don't appear in the debugger, but also no longer reproduce under normal settings after I've taken a closer look in the debugger (only to come back later at a random time). Feels like chasing ghosts.

btschaegg · 5 months ago
Terminology proposal: "Gremlins" :)
Adverblessly · 5 months ago
> To me, the hardest bugs are nearly irreproducible “Heisenbugs” that vanish when instrumentation is added.

A favourite of mine was a bug (specifically, a stack corruption) that I only managed to see under instrumentation. After a lot of debugging turns out that the bug was in the instrumentation software itself, which generated invalid assembly under certain conditions (calling one of its own functions with 5 parameters even though it takes only 4). Resolved by upgrading to their latest version.

Terr_ · 5 months ago
This one reproduced a few times per day, but try fixing a Linux kernel panic when you don't even have C/C++ on your resume, and everyone who originally set stuff up has left...

https://news.ycombinator.com/item?id=37859771

Point being that the difficulty of a fix can come from many possible places.

rowanG077 · 5 months ago
I don't think the number of days something took to debug is at all interesting. Trivial bugs can take weeks to debug for a noob. Insanely hard bugs take hours to debug for genius devs, maybe even without any reproducer, just by thinking about them.
mystified5016 · 5 months ago
In hardware, you regularly see behavior change when you probe the system. Your oscilloscope or LA probes affect the system just enough to make a marginal circuit work. It's absolutely maddening.
fuzzfactor · 5 months ago
The closer you get to natural science, the more reliance on logical troubleshooting can itself become "illogical".

The more abundant the undefined (mis)behavior, the more you're going to be tearing your hair out.

It's almost the kind of frustration where you're supposed to have a logic-based system, and it rears its ugly head and defies logic anyway :\

steveBK123 · 5 months ago
Yes! I've dealt with complex issues that turned out to be a vendor-swapped-hardware whoopsie, which we spent over a month trying to solve in software before finally figuring it out.

Part of it was the difficulty of pinpointing the actual issue: fullness of the drive vs. throughput of the writes.

A lot of it was unfortunately organizational politics such that the system spanned two teams with different reporting lines that didn't cooperate well / had poor testing practices.

voidifremoved · 5 months ago
> A lot of it was unfortunately organizational politics

The hardest bugs in my experience are those where your only source of vital information is a third party who is straight-up lying to you.

sesm · 5 months ago
For stuff like this we used an in-memory ring buffer logger that printed the logs on request. It didn't save the strings, just the necessary data bits and a pointer to the formatting function. Writing to this logger didn't affect any timings.
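
Roughly the idea, sketched in TypeScript (the original was presumably C or C++, and all names here are invented): the hot path stores only a formatter reference plus the raw values into a fixed-size ring, and strings are rendered only when a dump is requested.

    type Formatter = (args: number[]) => string;

    interface Entry { fmt: Formatter; args: number[]; }

    class RingLogger {
        private buf: (Entry | undefined)[];
        private next = 0;

        constructor(capacity: number) {
            this.buf = new Array(capacity);
        }

        // Hot path: no string formatting, just store the formatter reference and the data.
        log(fmt: Formatter, ...args: number[]): void {
            this.buf[this.next] = { fmt, args };
            this.next = (this.next + 1) % this.buf.length;
        }

        // Cold path: render everything only when someone asks for a dump.
        dump(): string[] {
            const out: string[] = [];
            for (let i = 0; i < this.buf.length; i++) {
                const e = this.buf[(this.next + i) % this.buf.length];
                if (e) out.push(e.fmt(e.args));
            }
            return out;
        }
    }

    // Usage
    const tempFmt: Formatter = ([raw]) => `temp register = ${raw}`;
    const logger = new RingLogger(1024);
    logger.log(tempFmt, 0x1f);
    console.log(logger.dump());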
dismalpedigree · 5 months ago
I always refer to them as “quantum bugs” because the act of observing the bug changes the bug. Absolutely infuriating. I like “heisenbug” better. Has a better ring to it.


nneonneo · 5 months ago
FWIW: this type of bug in Chrome is exploitable to create out-of-bounds array accesses in JIT-compiled JavaScript code.

The JIT compiler contains passes that will eliminate unnecessary bounds checks. For example, if you write “var x = Math.abs(y); if(x >= 0) arr[x] = 0xdeadbeef;”, the JIT compiler will probably delete the if statement and the internal nonnegative array index check inside the [] operator, as it can assume that x is nonnegative.

However, if Math.abs is then “optimized” such that it can produce a negative number, then the lack of bounds checks means that the code will immediately access a negative array index - which can be abused to rewrite the array’s length and enable further shenanigans.
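
A rough sketch of the pattern being described (ordinary TypeScript, illustrative only; `store` and the constant are made up, and the elimination itself happens inside the JIT rather than in anything you write at source level):

    function store(arr: number[], y: number): void {
        const x = Math.abs(y);    // the optimizer assumes this is always >= 0
        if (x >= 0) {             // ...so this guard can be proven redundant and dropped,
            arr[x] = 0xdeadbeef;  // along with the engine's own negative-index check
        }
    }
    // If a miscompiled Math.abs ever returns a negative x here, the JIT-compiled
    // version of this function writes before the start of the array's elements.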

Further reading about a Chrome CVE pretty much exactly in this mold: https://shxdow.me/cve-2020-9802/

saghm · 5 months ago
> which can be abused to rewrite the array’s length and enable further shenanigans.

I followed all of this up until here. JavaScript lets you modify the length of an array by assigning to indexes that are negative? I'm familiar with the paradigm of negative indexing being used to access things from the end of the array (like -1 being the last element), but I don't understand what operation someone could do that would somehow modify the length of the array rather than modifying a specific element in place. Does JIT-compiled JavaScript not follow the usual JavaScript semantics for negative indexes, or are you describing something that would be used in combination with some other compiler bug (which honestly sounds a lot more severe even in the absence of an unusual Math.abs implementation)?

nneonneo · 5 months ago
Normally, there would be a bounds check to ensure that the index was actually non-negative; negative indices get treated as property accesses instead of array accesses (unlike e.g. Python where they would wrap around).

However, if the JIT compiler has "proven" that the index is never negative (because it came from Math.abs), it may omit such checks. In that case, the resulting access to e.g. arr[-1] may directly access the memory that sits one position before the array elements, which could, for example, be part of the array metadata, such as the length of the array.

You can read the comments on the sample CVE's proof-of-concept to see what the JS engine "thinks" is happening, vs. what actually happens when the code is executed: https://github.com/shxdow/exploits/blob/master/CVE-2020-9802.... This exploit is a bit more complicated than my description, but uses a similar core idea.

bryanrasmussen · 5 months ago
>I followed all of this up until here. JavaScript lets you modify the length of an array by assigning to indexes that are negative?

This is my no doubt dumb understanding of what you can do, based on some funky stuff I did one time to mess with people's heads

do the following:

    const arr = [];
    arr[-1] = "hi";
    console.log(arr);

this gives you `"-1": "hi"` with `length: 0`,

which I figured is because really an array is just a special type of object. (my interpretation, probably wrong)

now we can see that the JavaScript Array length is 0, but since the value is findable in there I would expect there is some length representation in the lower level language that JavaScript is implemented in, in the browser, and I would then think that there could even be exploits available by somehow taking advantage of the difference between this lower level representation of length and the JS array length. (again all this is silly stuff I thought and have never investigated, and is probably laughably wrong in some ways)

I remember seeing some additions to array a few years back that made it so you could protect against the possibility of negative indexes storing data in arrays - but that memory may be faulty as I have not had any reason to worry about it.

ongy · 5 months ago
This is after the JIT.

I.e. don't think fancy language shenanigans that do negative indexing, but a negative offset from the beginning of the array's memory when it's accessed.

When there's some inlining, there will be no function call into some index operator function.

PhilipRoman · 5 months ago
For example, if arrays were implemented like this (they're not):

    struct js_array {
        uint64_t length;       /* the metadata sits directly before the elements */
        js_value *values[];
    };
then, after the bounds checks have been taken care of, loading an element of a JS array probably compiles to a simple assembly-level load like mov, so arr[-1] would land right on the length field. And if you bypass the bounds checks entirely, that mov can read or write any mapped address.

perihelions · 5 months ago
My own story: I spent >10 hours debugging an Emacs project that would occasionally cause a kernel crash on my machine. The proximate cause was a nonlocal interaction between two debug-print statements (it wasn't my first guess). The Elisp debug-print function #'message has two effects: it appends to a log, and also does a small update notification in the corner of the editor window. If that corner-of-the-window GUI object is thrashed several hundred times in a millisecond, it would cause the GPU driver on my specific machine to lock up, for a reason I've never root-caused.

Emacs' #'message implementation has debounce logic: if you repeatedly debug-print the same string, it gets deduplicated. (If you call (message "foo") 50 times fast, the string printed is "foo [50 times]".) So if you debug-print-inspect a variable that infrequently changes (as was the case), no GUI thrashing occurs. The bug manifested when there were *two* debug-print statements active, which circumvented the debouncer, since the thing being printed was toggling between two different strings. Commenting out one debug-print statement, or the other, would hide the bug.
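
Roughly the deduplication behavior described, sketched in TypeScript rather than Elisp (`render` is a hypothetical stand-in for the echo-area update): only consecutive repeats are collapsed, so two alternating debug-prints defeat it and force a redraw on every call.

    let last: string | null = null;
    let repeats = 0;

    // Only a *new* string forces the expensive GUI redraw.
    function message(s: string): void {
        if (s === last) {
            repeats++;                                   // deduplicated: no fresh redraw
            return;
        }
        if (repeats > 0) {
            render(`${last} [${repeats + 1} times]`);    // flush the collapsed repeats
        }
        last = s;
        repeats = 0;
        render(s);                                       // new string: the GUI object is touched
    }

    function render(s: string): void {
        console.log(s);   // stand-in for the corner-of-window update that locked up the driver
    }

    // One debug-print of a rarely-changing value: almost every call hits the deduplicated
    // branch. Two debug-prints toggling between strings: every call is "new" and the
    // redraw fires hundreds of times per millisecond.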

chrismorgan · 5 months ago
> If that corner-of-the-window GUI object is thrashed several hundred times in a millisecond, it would cause the GPU driver on my specific machine to lock up, for a reason I've never root-caused.

Until comparatively recently, it was absurdly easy to crash machines via their graphics drivers, even by accident. And I bet a lot of them were security concerns, not just DoS vectors. WebGL has been marvellous at encouraging the makers to finally fix their drivers properly, because browsers declared that kind of thing unacceptable (you shouldn’t be able to bring the computer down from an unprivileged web page¹), and developed long blacklists of cards and drivers, and brought the methodical approach browsers had finally settled on to the graphics space.

Things aren’t perfect, but they are much better than ten years ago.

—⁂—

¹ Ah, fond memories of easy IE6 crashes, some of which would even BSOD Windows 98. My favourite was, if my memory serves me correctly, <script>document.createElement("table").appendChild(document.createElement("div"))</script>. This stuff was not robust.

friendzis · 5 months ago
My hardest bug story, almost circling back to the origin of the word.

An intern gets a devboard with a new mcu to play with. A new generation, but mostly backwards compatible or something like that. Intern gets the board up and running with the embedded equivalent of "hello world". They port basic product code - ${thing} does not work. After enough hairs are pulled, I give them some guidance - ${thing} does not work. Okay, I instruct intern to take mcu vendor libraries/examples and get ${thing} running in isolation. Intern fails.

Okay, we are missing something huge that should be obvious. We start pair programming and strip the code down layer by layer. Eventually we are at a stage where we are accessing hand-coded memory addresses directly. ${thing} does not work. Okay, set up a peripheral and read state register back. Assertion fails. Okay, set up peripheral, nop some time for values to settle, read state register back. Assertion fails. Check generated assembly - nopsled is there.

We look at the manual: the bit switching the peripheral into the state we care about is not set. However we poke the mcu, whatever we write to the control register, the bit is just not set and the peripheral never switches into the mode we need. We get a new devboard (or resolder the mcu on the old one, don't remember) and it works first try.

"New device - must be new behavior" thinking with lack of easy access to the new hardware led us down a rabbit hole. Yes, nothing too fancy. However, I shudder thinking what if reading the state register gave back the value written?

GianFabien · 5 months ago
> what if reading the state register gave back the value written

I've had that experience. Turned out some boards in the wild didn't have the bodge wire that connected the shift register output to the gate that changed the behavior.

jason_tko · 5 months ago
Reminds me of the classic bug story where users couldn’t send emails more than 500 miles.

https://web.mit.edu/jemorris/humor/500-miles

BoorishBears · 5 months ago
I experienced "crashes after 16 hours if you didn't copy the mostly empty demo Android project from the manufacturer and paste the entire existing project into it"

Turned out there was an undocumented MDM feature that would reboot the device if a package with a specific name wasn't running.

Upon decompilation it wasn't supposed to be active (they had screwed up and shipped a debug build of the MDM), and it was supposed to be 60 seconds according to the variable name, but they had mixed up milliseconds and seconds.

sgarland · 5 months ago
This deserves more upvotes. Absolute classic.
latexr · 5 months ago
It’s amusing how so many of the comments here are like “You think two days is hard? Well, I debugged a problem which was passed down to me by my father, and his father before him”. It reminds me of the Four Yorkshiremen sketch.

https://youtube.com/watch?v=sGTDhaV0bcw

The author’s “error”, of course, was calling it “the hardest bug I ever debugged”. It drives clicks, but it invites comparisons too.

markrages · 5 months ago
Of course the comments section is going to be full of war stories about everyone's hardest bug.

This is how humans work, and this is why I am reading the comments.

latexr · 5 months ago
Yes, of course, I greatly enjoy the stories and it’s why I opened this thread. But that’s not what my comment is about; I was specifically referencing the parts of the comments which dismiss the difficulty and length of time the author spent tracking down this particular bug. I found that funny, and my comment was essentially one big joke.