Ask HN: Strange bug workarounds?

I worked on a social news product and part of our look was to have an icon for every story - either an image pulled from the page, a user-uploaded image, or, in the case of Flash content (say, a video player), a screen capture.

We had it all up and running - loading the content, waiting for the player to initialize, taking the snapshot, generated sizes - on a windows machine when, one day, the request came in to migrate that machine to a VM. After the migration, things were fine - until we disconnected RDP. Snapshots were coming back at the right size, but totally white.

The eventual "solution" was a laptop in the engineering area RDP'ed into this VM to keep the snapshots from going white. It got unplugged one holiday weekend, earning it a red hand-sharpied sign - "PRODUCTION LAPTOP: DO NOT UNPLUG". It was unplugged again one fateful weekend, this time prompting a healthcheck to be written that looked for all-white images in its output.

That rig ran that way, I believe, until someone had the insight to make a second VM, this one RDP'ed into the first.

Turtles, all the way down!

mdip · 9 years ago

That's awesome and the solution is not as uncommon as you'd imagine.

At "a large telecom" I used to work at, we had a specific process that handled billing that relied on a DOS application which was written targeting a specific modem's hardware. They'd tried to migrate it to something else for quite some time but the guy who wrote it lived in a different state and was let go from the company when we closed that site down and moved all of its equipment to Detroit. It ran on an old Compaq (not HP Compaq, Compaq) desktop PC and in 2014 or our VP received a frantic call that the drive had failed and the computer wouldn't boot (from a younger tech who was used to working on server class hardware). The code for this application had been lost forever and nobody had any idea how it actually worked but my understanding was that with it not functional, we were losing enough money to make it a "drop everything priority".

They brought the machine over to my building and the VP of my department called me to assist[0]. Sure enough, the system wouldn't even see the drive. It was at this point that I noticed three numbers with the letters "C", "H", "S" next to each. This had happened before, apparently, and someone discovered the BIOS battery had died. Thankfully, they were kind enough to put the drive parameters on a label for me. I popped into the BIOS, put 'em in and it booted. The computer remained powered on in the cubicle I repaired it in (just outside said VP's office) for a year until the dev team got around to modernizing the code.

[0] I was not a support person at this time but was in the past and it wasn't unusual for them to call me in on strange problems. I was also known for having recovered a hard drive with important data on it using the break-room fridge (though I'm not sure this VP was aware of that).

dlinder · 9 years ago

You sound like a kindred spirit. I have put hard drives in freezers to release stiction; I have baked motherboards in the oven to re-flow questionable solder. I wonder if anything in our kitchen is sacred! Sometimes I wish I had "MacGyvering goofy tech junk" as a full time job!

Deleted Comment

The Motorola iDEN [1] series of phones were pretty sweet back in their day and had a JVM you could actually write and deploy apps on.

I worked on Loopt, an early mobile location sharing app, and we talked to our server over HTTPS. Things were working great on a few LG and Sanyo phones, and worked fine in the iDEN emulator, but POSTs would fail consistently on the device itself. GETs worked fine.

After watching traffic on the server for a bit, I noticed the POST requests all advertised HTTP/1.1 and sent the Expect: 100-Continue header. On a whim I configured the server to treat all incoming connections as HTTP/1.0 so it would never send the 100 (Continue) response [2].

It worked!

Or did it? Turns out the iDEN phones were now happy, but the other phones were not and would refuse to send POST bodies if they didn't receive the 100 (Continue).

This well and truly sucked, and we thought for a bit we'd need to have two different endpoints with different configurations to support the differently incompatible phones. Lame.

But then I remembered the format of an HTTP request:

    POST /path HTTP/1.1\r\n
    Expect: 100-Continue\r\n
    [Header: Value]\r\n
    \r\n
    [Body]

What if I supplied a malformed URL? Something like "/path HTTP/1.0\r\nX-iDEN-Ignore:"? Then, if there's no validation or encoding, the request will look like this:

    POST /path HTTP/1.0\r\n
    X-iDen-Ignore: HTTP/1.1\r\n
    Expect: 100-Continue\r\n
    [Header: Value]\r\n
    \r\n
    [Body]

Turns out that worked. The JVM was never updated or fixed, the hack shipped, and it worked consistently for the lifetime of those phones.

[1] https://en.wikipedia.org/wiki/IDEN

[2] "An origin server ... MUST NOT send a 100 (Continue) response if such a request comes from an HTTP/1.0 (or earlier) client" https://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8....

nhf · 9 years ago

I remember iDEN phones! I had a Motorola i355 for a while. Software sucked but the thing was an absolute tank. Plus, it was one of the rare "dumbphones" of the time to have integrated GPS, so I remember using it as a GPS tracker for a while afterwards.

mbrock · 9 years ago

I worked on health record software. An elusive bug in the custom SQL Server crypto plugin led to very occasional corrupted entries, which was very bad.

The guy who wrote the crypto plugin had of course quit and nobody knew how it worked.

Fine-combing the C++, I found an off-by-one error that would cause the predicted failures: after rebooting SQL Server, the first entry would get encrypted with a zero key. (Hooray, we could now also fix all the corrupted data.)

For various reasons it would have been difficult to ship new DLLs to the affected customers. Only a handful used this particular crypto and it would be much easier to patch the existing binary DLLs on their servers.

Well... looking at the machine code, I found that the troublesome off-by-one operations were actually in the printable ASCII range... so I just taught my friend in tech support to do a particular obscure search and replace in Notepad++, something like changing ",}" into ",~" in the binary DLL... and then hot-reload it with an SQL Server command... worked perfectly.

obituary_latte · 9 years ago

Nice! Must have made that tech feel and look like a hero :)

tobiaswk · 9 years ago

What a great little story. Thanks for sharing!

throwaway_yy2Di · 9 years ago

Not my workaround:

http://spectrum.ieee.org/aerospace/space-flight/titan-callin...

http://descanso.jpl.nasa.gov/seminars/abstracts/viewgraphs/H...

This was an extremely serious bug in NASA/ESA's Cassini-Huygens probe, in the S-band link between Huygens (landing on Saturn's moon Titan) and Cassini (acting as radio relay).

It was a timing bug. There'd be a very high relative velocity between Cassini and Huygens, creating a significant (~2e-5) Doppler shift in the link. This shifted the frequency of the 2 GHz carrier (by 38 kHz). Likewise, it shifted the symbol rate of the 16 kbps bit stream (by 0.3 bps). The second effect was overlooked. On the demodulating end (Cassini), the bit-synchronizer expected the nominal bit rate, not the Doppler-shifted bit rate. Since its bandwidth was narrower than the 0.3 bps Doppler shift, it was unable to recognize frame syncs; this was proven in experiments post-launch. The parameter that set the bitrate was stored in non-modifiable firmware.

As it was when launched, Huygens would be unable to return any instrument data. For some context, this was the only probe that's ever visited Titan, at a cost of about $400 million.

The workaround

[spoiler]

The workaround was a major change in the orbit trajectory of Cassini (a $3 billion probe). Details aside, it set up an orbit geometry with this feature: at the time Huygens was descending in Titan's atmosphere, Cassini would be flying at a ~90° angle to their separation. The relative velocity was still 20,000 kph, but tangential velocity doesn't contribute to Doppler shift.

bbcbasic · 9 years ago

That's a truly epic workaround!

iamcreasy · 9 years ago

Do they always use star tracker when making these kind of trajectory changes?

frereubu · 9 years ago

Not so much a software bug, but back in my early days (late 1990s) supporting an office network in London there was a computer where the mouse was making the cursor behave erratically during roughly the same period every afternoon. We swapped out the mouse, the controller card, even the computer - effectively replacing all the physical equipment - and nothing seemed to stop it. We went through all sorts of ideas - too near the microwave, heavy fax machine usage, someone's mobile phone - until we realised that it was optical mouse, and the sun would shine through that window each afternoon at the same time and screw up the sensor in the mouse. We stuck a bit of cardboard to the side of the desk and it never happened again.

nom · 9 years ago

Haha awesome. I once was fooled by the sun, too. I noticed an unusual high power consumption of several KWh in my logs. They always appeared at the same time, almost up to the same minute.

So it turns out there is a very small time slot where the sun can reach through a window into the hallway. That was enough to offset the light sensor that I attached to the power meter inside the closet. The threshold was set too tight.

Think about the possible sources that influence this 'bug': - the month - the time of day - the weather / state of the clouds - open/close state of the bathroom door - reflectivity of the hallway (objects, doors open/closed)

heywire · 9 years ago

Towards the end of summer, one of my Raspberry-pi security cameras starts detecting "motion" in the form of the sunlight dancing on the wall when the fluffy clouds float by :)

prawn · 9 years ago

We had an office alarm system that would occasionally trigger incorrectly. It was movement and heat sensitive. The alarm would trigger on weekend afternoons. We were puzzled for weeks until it turned out to be a combination of the second hand on a clock and the sun shining through a window and warming it up.

oaktowner · 9 years ago

Sun shining through a window in London? Everything else in the story is believable, but... ;-)

Too funny, though, I'd be willing to bet "the same time every afternoon" was more that it happened at the same time in the afternoon when it failed which probably made it even more painful to isolate since it was reliant on the sun appearing in an area not known for sunlight.

mjg59 · 9 years ago

Samsung laptops would fail to boot if the UEFI variable store was 100% full. The original solution to this in Linux was to leave at least 5K of free space. However, on several systems, removing UEFI variables didn't actually free up space - it was marked as free internally, but the reported amount of free space didn't increase, and so Linux would refuse to allow you to create new variables. The "solution" was to attempt to create a variable larger than the available free space, which forced the firmware to trigger a garbage collection run and re-synchronise the internal and external views of the amount of available free space. Doing something that we knew would fail was a requirement for avoiding killing laptops.

yuhong · 9 years ago

Interestingly, there recently has been https://github.com/Microsoft/BashOnWindows/issues/976

romanhn · 9 years ago

Many years back, I was working on a web application that, among other things, could generate PDF user reports. These reports were generated from HTML web pages using a third-party library. Normally this worked well (as well as such a tool could be expected to work anyways), however once a month or so the fonts on the reports would come out super tiny. This would then happen in random reports until we rebooted all of the app servers. The bug occurred in production only, never in our dev, staging or QA environments.

Many hours of investigations were committed, many emails to the vendor were written, much hair was torn out. No luck whatsoever. Months passed, and the bug reoccurred at random intervals and did not consistently affect all reports. One day I logged in remotely to one of the Windows app boxes as an admin/console user and was annoyed to once again discover that it forced my screen resolution to change. That's when I had an epiphany and 10 minutes later was able to reproduce the bug in my local environment.

Turns out the third-party library had some funky rasterization logic that took into account both the resolution of the machine when the library/service was started as well as the current resolution, pretty much expecting both to be the same. Logging in remotely as a console user has the behavior of taking on the resolution of my local machine, which was always higher than what the remote box ran at. Another thing to note is that the console user logged into the same running instance of Windows that was generating the PDFs. BAM! The cached value used by the library no longer matched the runtime resolution and the reports now generated screwy tiny fonts. This happened rarely because logging in as admin/console was not the recommended approach, and it was inconsistent because we had multiple app boxes and the other ones continued to work OK.

Solution - disallow admin/console remote logins. This was one of the most obscure bugs I have had the pleasure of solving.

kogir · 9 years ago

josu · 9 years ago

My favorite:

>Wing Commander was originally titled Squadron and later renamed Wingleader. As development for Wing Commander came to a close, the EMM386 memory manager the game used would give an exception when the user exited the game. It would print out a message similar to "EMM386 Memory manager error..." with additional information. The team could not isolate and fix the error and they needed to ship it as soon as possible.

>As a work-around, one of the game's programmers, Ken Demarest III, hex-edited the memory manager so it displayed a different message. Instead of the error message, it printed "Thank you for playing Wing Commander."However, due to a different bug the game went through another revision and the bug was fixed, meaning this hack did not ship with the final release.

https://en.wikipedia.org/wiki/Wing_Commander_(video_game)#De...