Software bugs are a fact of life and, sadly, many never see a (timely) fix. This can lead to some unusual workarounds in order to continue using the software.
What are some unusual/quirky/bizarre workarounds to software bugs that have been encountered by the HN crowd?
A recent one I struck was with the Google Earth desktop app on Linux. It has a tendency to crash on startup unless your mouse is contained within a small rectangle in the middle of the screen [1].
[1] http://askubuntu.com/questions/642027/google-earth-crashes-when-opened#comment1071599_677717
The guy who wrote the crypto plugin had of course quit and nobody knew how it worked.
Fine-combing the C++, I found an off-by-one error that would cause the predicted failures: after rebooting SQL Server, the first entry would get encrypted with a zero key. (Hooray, we could now also fix all the corrupted data.)
For various reasons it would have been difficult to ship new DLLs to the affected customers. Only a handful used this particular crypto and it would be much easier to patch the existing binary DLLs on their servers.
Well... looking at the machine code, I found that the troublesome off-by-one operations were actually in the printable ASCII range... so I just taught my friend in tech support to do a particular obscure search and replace in Notepad++, something like changing ",}" into ",~" in the binary DLL... and then hot-reload it with an SQL Server command... worked perfectly.
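For anyone who would rather script that kind of patch than trust Notepad++ with a binary, here is a minimal sketch. The ",}" to ",~" pair is the "something like" example from the comment above; the sample bytes and the function name are made up for illustration, since the real DLL contents are not given. The check for exactly one match is an added safety assumption, not something the original fix is known to have done.

```python
def patch_single_byte_pattern(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace exactly one occurrence of `old` with `new`, keeping the length."""
    if len(old) != len(new):
        # A binary patch must not shift any offsets in the file.
        raise ValueError("patch must not change the file size")
    hits = data.count(old)
    if hits != 1:
        # Refuse to patch if the pattern is missing or ambiguous.
        raise ValueError(f"expected exactly 1 match, found {hits}")
    return data.replace(old, new)


# Hypothetical stand-in for the DLL contents; the real bytes are not in the post.
dll = b"\x8b\x45\x08,}\xc3"
patched = patch_single_byte_pattern(dll, b",}", b",~")
```

Note that "}" is 0x7D and "~" is 0x7E, so the swap really is an off-by-one change to a single byte.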
http://spectrum.ieee.org/aerospace/space-flight/titan-callin...
http://descanso.jpl.nasa.gov/seminars/abstracts/viewgraphs/H...
This was an extremely serious bug in NASA/ESA's Cassini-Huygens probe, in the S-band link between Huygens (landing on Saturn's moon Titan) and Cassini (acting as radio relay).
It was a timing bug. There'd be a very high relative velocity between Cassini and Huygens, creating a significant (~2e-5) Doppler shift in the link. This shifted the frequency of the 2 GHz carrier (by 38 kHz). Likewise, it shifted the symbol rate of the 16 kbps bit stream (by 0.3 bps). The second effect was overlooked. On the demodulating end (Cassini), the bit-synchronizer expected the nominal bit rate, not the Doppler-shifted bit rate. Since its bandwidth was narrower than the 0.3 bps Doppler shift, it was unable to recognize frame syncs; this was proven in experiments post-launch. The parameter that set the bitrate was stored in non-modifiable firmware.
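The arithmetic behind those two shifts is the same multiplication applied to two different nominal values. A quick sketch, using the rounded ~2e-5 ratio from the comment (the constant and function names are just illustrative):

```python
DOPPLER_RATIO = 2e-5    # approximate fractional shift quoted above
CARRIER_HZ = 2e9        # nominal S-band carrier, 2 GHz
BIT_RATE_BPS = 16_000   # nominal 16 kbps bit stream


def doppler_offset(nominal: float, ratio: float = DOPPLER_RATIO) -> float:
    """Absolute shift of any rate (carrier or symbol clock) for a given ratio."""
    return nominal * ratio


carrier_shift_hz = doppler_offset(CARRIER_HZ)        # about 40 kHz
bit_rate_shift_bps = doppler_offset(BIT_RATE_BPS)    # about 0.32 bps
```

With the rounded ratio the carrier shift comes out near 40 kHz rather than the quoted 38 kHz, which suggests the actual ratio was a bit under 2e-5. Either way, the bit clock shifts by the same fraction as the carrier, and that is the part that was overlooked.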
As it was when launched, Huygens would be unable to return any instrument data. For some context, this was the only probe that's ever visited Titan, at a cost of about $400 million.
The workaround
[spoiler]
The workaround was a major change in the orbit trajectory of Cassini (a $3 billion probe). Details aside, it set up an orbit geometry with this feature: at the time Huygens was descending in Titan's atmosphere, Cassini would be flying at a ~90° angle to their separation. The relative velocity was still 20,000 kph, but tangential velocity doesn't contribute to Doppler shift.
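To see why the geometry change helps, here is a rough sketch of the first-order (non-relativistic) Doppler formula, where only the velocity component along the line between the two craft matters. The 20,000 kph and 2 GHz figures come from the comments above; the function name and the exact angles are illustrative assumptions, and the much smaller relativistic transverse effect is ignored.

```python
import math

C_M_PER_S = 299_792_458.0  # speed of light


def radial_doppler_hz(carrier_hz: float, speed_kph: float, angle_deg: float) -> float:
    """First-order Doppler shift; angle is between velocity and line of sight."""
    speed_m_s = speed_kph * 1000.0 / 3600.0
    # Only the component of velocity along the line of sight shifts the signal.
    radial_m_s = speed_m_s * math.cos(math.radians(angle_deg))
    return carrier_hz * radial_m_s / C_M_PER_S


head_on = radial_doppler_hz(2e9, 20_000, 0.0)    # roughly 37 kHz
side_on = radial_doppler_hz(2e9, 20_000, 90.0)   # effectively zero
```

Same speed, wildly different shift: flying across the line of separation instead of along it takes the radial component, and with it the Doppler problem, down to almost nothing.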
We had it all up and running - loading the content, waiting for the player to initialize, taking the snapshot, generated sizes - on a windows machine when, one day, the request came in to migrate that machine to a VM. After the migration, things were fine - until we disconnected RDP. Snapshots were coming back at the right size, but totally white.
The eventual "solution" was a laptop in the engineering area RDP'ed into this VM to keep the snapshots from going white. It got unplugged one holiday weekend, earning it a red hand-sharpied sign - "PRODUCTION LAPTOP: DO NOT UNPLUG". It was unplugged again one fateful weekend, this time prompting a healthcheck to be written that looked for all-white images in its output.
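A healthcheck like that can be very small. This is only a sketch of the idea, assuming the snapshot has already been decoded into a raw RGB byte buffer; the function name and the tolerance parameter are hypothetical, and the original check is not described in any more detail than "looked for all-white images".

```python
def is_all_white(rgb: bytes, tolerance: int = 0) -> bool:
    """True if every channel byte in a raw RGB buffer is within `tolerance` of 255."""
    if not rgb:
        raise ValueError("empty image buffer")
    # If the darkest channel value is still (near) 255, the whole frame is white.
    return min(rgb) >= 255 - tolerance
```

Usage would be something like calling `is_all_white(pixels)` on each generated snapshot and alerting when it returns True, ideally before anyone has to find the laptop.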
That rig ran that way, I believe, until someone had the insight to make a second VM, this one RDP'ed into the first.
Turtles, all the way down!
At "a large telecom" I used to work at, we had a specific process that handled billing that relied on a DOS application which was written targeting a specific modem's hardware. They'd tried to migrate it to something else for quite some time but the guy who wrote it lived in a different state and was let go from the company when we closed that site down and moved all of its equipment to Detroit. It ran on an old Compaq (not HP Compaq, Compaq) desktop PC, and in 2014 or so our VP received a frantic call from a younger tech, who was used to working on server-class hardware, that the drive had failed and the computer wouldn't boot. The code for this application had been lost forever and nobody had any idea how it actually worked, but my understanding was that with it not functional, we were losing enough money to make it a "drop everything priority".
They brought the machine over to my building and the VP of my department called me to assist[0]. Sure enough, the system wouldn't even see the drive. It was at this point that I noticed three numbers with the letters "C", "H", "S" next to each. This had happened before, apparently, and someone discovered the BIOS battery had died. Thankfully, they were kind enough to put the drive parameters on a label for me. I popped into the BIOS, put 'em in and it booted. The computer remained powered on in the cubicle I repaired it in (just outside said VP's office) for a year until the dev team got around to modernizing the code.
[0] I was not a support person at this time but was in the past and it wasn't unusual for them to call me in on strange problems. I was also known for having recovered a hard drive with important data on it using the break-room fridge (though I'm not sure this VP was aware of that).
So it turns out there is a very small time slot where the sun can reach through a window into the hallway. That was enough to offset the light sensor that I attached to the power meter inside the closet. The threshold was set too tight.
Think about the possible sources that influence this 'bug':
- the month
- the time of day
- the weather / state of the clouds
- open/close state of the bathroom door
- reflectivity of the hallway (objects, doors open/closed)
Many hours of investigations were committed, many emails to the vendor were written, much hair was torn out. No luck whatsoever. Months passed, and the bug reoccurred at random intervals and did not consistently affect all reports. One day I logged in remotely to one of the Windows app boxes as an admin/console user and was annoyed to once again discover that it forced my screen resolution to change. That's when I had an epiphany and 10 minutes later was able to reproduce the bug in my local environment.
Turns out the third-party library had some funky rasterization logic that took into account both the resolution of the machine when the library/service was started as well as the current resolution, pretty much expecting both to be the same. Logging in remotely as a console user has the behavior of taking on the resolution of my local machine, which was always higher than what the remote box ran at. Another thing to note is that the console user logged into the same running instance of Windows that was generating the PDFs. BAM! The cached value used by the library no longer matched the runtime resolution and the reports now generated screwy tiny fonts. This happened rarely because logging in as admin/console was not the recommended approach, and it was inconsistent because we had multiple app boxes and the other ones continued to work OK.
Solution - disallow admin/console remote logins. This was one of the most obscure bugs I have had the pleasure of solving.
I worked on Loopt, an early mobile location sharing app, and we talked to our server over HTTPS. Things were working great on a few LG and Sanyo phones, and worked fine in the iDEN emulator, but POSTs would fail consistently on the device itself. GETs worked fine.
After watching traffic on the server for a bit, I noticed the POST requests all advertised HTTP/1.1 and sent the Expect: 100-Continue header. On a whim I configured the server to treat all incoming connections as HTTP/1.0 so it would never send the 100 (Continue) response [2].
It worked!
Or did it? Turns out the iDEN phones were now happy, but the other phones were not and would refuse to send POST bodies if they didn't receive the 100 (Continue).
This well and truly sucked, and we thought for a bit we'd need to have two different endpoints with different configurations to support the differently incompatible phones. Lame.
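But then I remembered the format of an HTTP request: a request line of the form "METHOD URL HTTP-VERSION", followed by one "Name: value" header per line.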
What if I supplied a malformed URL? Something like "/path HTTP/1.0\r\nX-iDEN-Ignore:"? Then, if there's no validation or encoding, the request line ends at the injected "HTTP/1.0", and the "HTTP/1.1" that the client appends lands in a throwaway X-iDEN-Ignore header.
Turns out that worked. The JVM was never updated or fixed, the hack shipped, and it worked consistently for the lifetime of those phones.
[1] https://en.wikipedia.org/wiki/IDEN
[2] "An origin server ... MUST NOT send a 100 (Continue) response if such a request comes from an HTTP/1.0 (or earlier) client" https://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8....
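To make the malformed-URL trick concrete, here is a small sketch of a naive client that pastes the URL straight into the request line, the way the phone's HTTP stack apparently did. The builder function and the Host value are hypothetical; only the injected URL string comes from the comment above.

```python
def build_request_head(method: str, url: str, headers: dict[str, str]) -> str:
    """Naive client: pastes the URL into the request line with no validation."""
    lines = [f"{method} {url} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    return "\r\n".join(lines) + "\r\n\r\n"


# The CRLF inside the "URL" ends the request line early.
injected_url = "/path HTTP/1.0\r\nX-iDEN-Ignore:"
head = build_request_head("POST", injected_url, {"Host": "example.invalid"})
request_line, first_header = head.split("\r\n")[:2]
# request_line  -> "POST /path HTTP/1.0"
# first_header  -> "X-iDEN-Ignore: HTTP/1.1"
```

The server sees an HTTP/1.0 request and so never sends the 100 (Continue), while the client's own "HTTP/1.1" is swallowed as the value of a header nobody reads. Phones that need the 100 response simply use the normal URL.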
>Wing Commander was originally titled Squadron and later renamed Wingleader. As development for Wing Commander came to a close, the EMM386 memory manager the game used would give an exception when the user exited the game. It would print out a message similar to "EMM386 Memory manager error..." with additional information. The team could not isolate and fix the error and they needed to ship it as soon as possible.
>As a work-around, one of the game's programmers, Ken Demarest III, hex-edited the memory manager so it displayed a different message. Instead of the error message, it printed "Thank you for playing Wing Commander." However, due to a different bug the game went through another revision and the bug was fixed, meaning this hack did not ship with the final release.
https://en.wikipedia.org/wiki/Wing_Commander_(video_game)#De...