Studying the relationship between exception handling and post-release defects

zaptheimpaler · 4 years ago

At my first job I wrote a new Java application to ingest/transform/store some data and after about 6 months it was done. It chugged along for years after without ever really requiring any maintenance or monitoring, it just kept working. One day the VP eng just showed up at my desk and started chatting, I didn't really know why. Turns out he was interested in knowing why it was so stable but it was an obvious part of software to me - every line I wrote where an exception was possible, I would look up what exceptions could be thrown and why, and then structure try catches to do the right thing in each case - sometimes that meant logging a warning, sometimes propagating to a higher level, sometimes shutting down the app etc. He then sent out some paper about the relationship between good error handling and software defects to the org.

Like I don't get why this is so difficult for so many devs. Your job is to think about both the happy case and the error cases, and usually there are many error cases and 1 happy case so the bulk of time and code is dedicated to the errors. Since then I've seen this is not at all the way most devs write things. They are interested in writing the happy case and treat error handling like a chore to be done as quickly as possible. The tools and language constructs don't matter much, it's just do you dedicate time and care to thinking through errors or not.

Yeah it sounds boastful but fuck it it's true. Same is true regardless of this study it's not about the structure and syntax of error handling,it's how much thought was given to understanding the failure modes and how to handle them.

metb · 4 years ago

>One day the VP eng just showed up at my desk and started chatting, I didn't really know why. Turns out he was interested in knowing why it was so stable

Kudos to him, he paid attention, and noticed something that was stable. Usually something has to be broken to get attention at that level. Can't imagine leadership at most places thinking "hmm, this project we did several years ago had been running without bugs, let me go bring best practices from this developer."

MattPalmer1086 · 4 years ago

Similarly, I've written systems that ran for years without needing much maintenance. I believe good error handling (and negative testing as well as positive testing) are the primary reasons for that. Unfortunately, it takes me longer to write code than my peers as a result. I'm not a professional developer anymore...

zaptheimpaler · 4 years ago

ouch i feel that. I do have the same problem all the time. if you don't mind me asking, what do you do now?

journey_16162 · 4 years ago

I'm refactoring a (not so big) TypeScript codebase from exceptions to returning error results for this reason.

I don't know about Java, but TypeScript does not have a way to annotate that a function throws, not to mention what it throws - so carefully handling every error case would be difficult. With the new approach I know exactly what kind of error codes each function can return and how to handle them.

Thanks for your comment, now I have more reassurance that I'm not wasting time with the refactor.

TheCoelacanth · 4 years ago

I think the hard part of handling all error cases like that is not the actual handling of the error cases, but making your code well structured enough that at any given place, you can actually enumerate all possible errors in a reasonably small list and so that you actually have a reasonable way of handling each error.

In poorly structured code, all possible errors could often include just about anything and for many of the errors you will have no reasonable way of handling them.

hk-im-ad · 4 years ago

You assume that the only errors a program can encounter are logic errors. Logic errors are really the easiest class of error to fix. Here are examples of other errors that can bring down stable systems:

1) An API returns a list of items. In a new version of the API, the data structure used to generate the response list is changed from a list to a set, invalidating any implicit assumptions about response ordering. This happened to a service I worked on where there was an implicit assumption by callers that the results were ordered by date. This was true prior to using a set.

2) Resource leaks exposed by failure conditions. A network outage might cause infrequently tested code paths to leaks resources. In Go, I’ve seen this happen with network requests not tied to a context. This is an interesting case because the fact that an error occurs causes the leak even though the success path works correctly.

3) Missing back offs. A dependency may return errors in the case where a dependency becomes overloaded. This can cause a backup in something like an ingestion job. Without adequate back offs, the dependency may never be able to recover.

There are plenty more examples where simply handling every error case is not sufficient for stability.

As a question to you: why is this system more stable than other systems you write? Can’t you apply your error handling philosophy to everything?

zaptheimpaler · 4 years ago

1. Implicitly assuming something not specifically guaranteed is a logic error. Its like assuming hashcode(x) = x just because it happens to be true for small numbers in python. Its not part of the spec, and just observing it to be true a few times doesn’t change anything.

You could a. document the implicit assumption that the list is sorted or b. sort the list before continuing further.

2. A resource not being freed in all cases is also a logic error.

3. In the context of basically calling any service/API repeatedly, not using backoffs and setting a hard limit on the number of tries is also a mistake. I dont know if its a “logic error” but its certainly an error.

I don’t know why you think these are not error cases. Your program fails, and can be written in a way that does not fail or fails slightly more gracefully - thats an error case. There is some judgement involved - i might not implement backoff in a context where i know its not yet needed, i might just log a particularly obscure exception instead of trying to recover if i can’t imagine why it would happen. I’m not saying your program should be perfectly bug free from the moment its written - i’ve written plenty of buggy code too and we're all human. There are even a few errors that are truly insane like literally hitting a compiler bug that silently corrupts your program. Just that calling these things not error cases (and what, random acts of misfortune instead?) is a weirdly defeatist attitude that impedes further progress. They're all error cases, the only question is how much time and knowledge do you have to dedicate to handling them.

blub · 4 years ago

Those errors can also be considered and mitigated though, if one thinks about what could go wrong instead of only thinking about what exceptions can be thrown:

1. One must either encode the assumption into a precondition or transfer the incoming data into a sorted data structure. But the gist is to always validate assumptions.

2. RAII’s pretty good at handling resources. Then one can inject failures to test the error handling paths and combine with code coverage measurements.

3. That sounds like an issue in the wider system which may be handled in the subsystem under development assuming that it can throttle back its requests, drop them, etc. But it may just as well be handled in another part of the larger system. It belongs more to the architecture realm, but it’s absolutely possible to foresee such issues.

ravenstine · 4 years ago

I'm not exactly sure why that is either, but I do think most developers are appealed by the idea that a computer program is a closed system and thus any error that happens is a fluke, unexpected, and should be treated as catastrophic, when that is simply not true. Programs take input from the parent reality, asynchronously, and therefore aren't just subject to cosmic rays flipping bits in the hardware. Personally, I like to handle the possible errors intelligently, but many developers, especially web developers, love to just have any error result in an error page. Even worse, they sprinkle `try { ... } catch (err) { // do nothing }` all over the place, and that results in problems that are hard to track down or detect early.

kinow · 4 years ago

Interesting article and great blog post too.

I was lucky to start my career with a mentor who considered well written exception handling core to software development.

And he was always happy to stop his work (maintaining a core middleware in a telco with millions of transactions every minute) to review this intern's poorly written try/catch, show how he would write it, and why.

To this day I am thankful and believe his concerns over exception handling, good and simple deployment techniques, knowing networking to OS/kernel level, helped me becoming a better developer (or at least knowing where I need to improve).

jimbob45 · 4 years ago

To be clear, did he consider exceptions something to be avoided when possible or did he consider them to be a natural element of programming? I know exceptions do cause slow-downs on systems and most programming does its best to minimize occurrences within reason.

gwbas1c · 4 years ago

The downvotes are unfair for an extremely good question.

(I used to ask questions along this line when interviewing candidates.)

Exceptions are used to communicate an unexpected error. Specifically, an error that the caller doesn't expect to handle during normal operations of the program. These errors can range from unusual situations like a network failure, to even more perverse situations like true bugs in the program.

The example I discussed with job candidates was implementing a database access function called GetUserById. (In C#, pre compiler enforced null checking.) I would ask if the function should return null or throw an exception when there was no user with that ID.

What followed, (with the candidates who passed,) was a discussion about the trade-offs of returning null versus throwing an exception. Null allows the caller to know that there was no user with that ID without the overhead of the exception. But, returning null increases the risk of a NullReferenceException. This is risky, because it's harder to debug then a strongly typed exception with a useful error message. Thus, the "right" approach depended on if it was anticipated that someone calling GetUserById expected the user ID to always be for a valid user.

When there was time, we'd even get into the TryGet pattern that the .Net dictionaries use.

(By the way, now it's a good time to check out how Rust's enum type is used with error handling. It's really slick with no overhead.)

kinow · 4 years ago

I think he never considered the exceptions to the something to be avoided since he wrote Java most of his professional career - which ended when he decided to quit programming and go sell ornamental flowers in his family business in his hometown.

The middleware we worked had libraries from other companies (e.g. online prepaid transactions were handled by code from a jar/lib from Ericsson). But in the case the platform threw an exception, we could still charge the customer using a slower process (i.e. there's a window of time where the customer could use data/talk in excess until we realize there's no more credit when processing the slower transaction).

For him it was important to write these exceptions well. If something went wrong in production, we couldn't turn off the system, and if we were not charging users when we should to, the company would be losing millions (it was in Brazil, ~200 mi at the time, and the telco had ~60mi users I think, with mother's day and big brother being the craziest days with millions of messages per second.)

kaetemi · 4 years ago

Throwing exceptions is slow. Adding exception handling code itself has little to no overhead.

Only throw exceptions in exceptional cases, not for frequently expected outcomes.

lmilcin · 4 years ago

> The longer the exception handling blocks in a file, the more likely the file is to contain bugs.

It is difficult to design a sophisticated error handling system. For various reasons. For example it is rarely talked about -- everybody focuses on how to get stuff working and errors aren't really hot topic people find interesting.

I think, given the above, you have most chance of success with really simple error handling systems. The simplest is to only handle things you can really handle at the current frame and let everything else filter to the top and interrupt entire process. It is not perfect but it has the virtue of being simple and easy to make foolproof.

> The Ignoring Interrupted Exception and Log and Throw patterns corelated with post-release defects in one of the projects they studied, but not all.

Interrupted exception does not happen in real usage in a lot of applications, especially backends, where the environment is very well controlled.

Log and throw can be a legitimate pattern. It makes sense when the exception travels outside some kind of boundary like module/library boundary. A REST API is an example of boundary and it is normal to log the error but still throw it (to the client). It also may possibly make sense to log additional information available in local scope and throw the error up the stack. It doesn't always make sense to convert the type of the exception to be able to add more information to it.

gwbas1c · 4 years ago

> Log and throw can be a legitimate pattern. It makes sense when the exception travels outside some kind of boundary like module/library boundary. A REST API is an example of boundary and it is normal to log the error but still throw it (to the client). It also may possibly make sense to log additional information available in local scope and throw the error up the stack. It doesn't always make sense to convert the type of the exception to be able to add more information to it.

Of course it's going to correlate with a bug. The whole point of log and throw is to make bugs easier to find!

TeMPOraL · 4 years ago

> It also may possibly make sense to log additional information available in local scope and throw the error up the stack.

Like, time. This is why I often do "log and throw" or "log unexpected value and return it", depending on whether the code is exception-oriented or uses sum types instead. It's because time is an important piece of information to encode in an error log. When your error is caused by an external resource (e.g. RPC failure), you want to log it immediately to later be able to correlate it with external resource's logs or some other monitoring. If you don't the error may take some time to reach its final logging place - or, it may never reach it at all, if the application crashes in the meantime.

cies · 4 years ago

>The simplest is to only handle things you can really handle at the current frame and let everything else filter to the top and interrupt entire process.

I find this approach creates exactly what you try to avoid: error handling complexity.

The problem is with the Exceptions I believe. They mess with normal program flow. They create code that's hard to reason about, that's less explicit.

Sometimes Exceptions can clean up some code, but usually it just sweeps dirt under the carpet to blow up in your face later.

I rather use sum types and actual return statements (or implicit returns when all code is expressions) than Exceptions.

Arech · 4 years ago

The statistician in me is severely troubled by a conclusion like "statistical relationship" when the conclusion is drawn from studying just 3 (three) projects. At worst case this means studying programming habits of just 3 (three) developers (appended: actually, 1 dev is the worst case). You just can't make any significant conclusions on such basis that generalize on whole population.

Even if it was huge projects with hundreds of devs, you still can't make generalizable conclusions, because habits of all this devs are clearly not independent.

mindwok · 4 years ago

Personally I really dislike exceptions and try/catch in languages. I don't like having to worry about whether some function call is going to surprise me with an exception, and handling them with try/catch really breaks the flow of the program.

I'm sure there's probably a lot I can learn to make this better for myself because I don't work with exception heavy code very often, but I find working with simple error value returns like in Go or error types in Haskell/Rust to be so much more ergonomic and comfortable to work with.

hackinthebochs · 4 years ago

The benefit of exceptions and try/catch is that it lets you separate your exception handling logic from the mainline logic of the function. Having error logic weaved in and out of mainline logic just obscures what is going on and increases cognitive load. I much prefer writing and reading code with exceptions than explicit error handling control flow.

marcosdumay · 4 years ago

> it lets you separate your exception handling logic from the mainline logic of the function

So do the error monads (on Rust or Haskell). In fact, they offer a lot more flexibility on how to separate them, and can put even more distance between the happy path and error handling (if you need it, often people use the extra flexibility to place them closer).

cies · 4 years ago

> I much prefer writing and reading code with exceptions than explicit error handling control flow.

Can you let us know what languages that use "explicit error handling control flow" you have used?

I've extensive experience with both an much prefer the "explicit error handling control flow" in Rust/Haskell/Elm/Kotlin/ReScript, than the exceptions in Java/C++/C#/JS/Ruby/Python.

Interesting that the use of implicit nulls (another of my annoyances in langs) is also split along these lines!

adrian_b · 4 years ago

The separation between the mainline logic and the exception handling logic does not require an exception mechanism like in Java or C++.

A restricted form of GOTO, like in the language Mesa (from Xerox) is good enough.

Mesa also had exceptions for things where exceptions are appropriate, e.g. numeric overflow, out-of-bounds access or memory allocation errors, i.e. errors that are normally caused by program bugs and which can be solved only by someone who knows the internals of the program.

For errors whose cause can be determined and removed by the user, e.g. files that cannot be found or mistyped user input, the appropriate place of handling the error is the place where the function that failed had been invoked.

Neither inside the function that failed nor several levels above the place where the error has been detected it is possible to provide really informative error messages that can enable corrective action, because only at the place of invocation the precise reason is known why the failed function had been called.

The restricted GOTO from Mesa, whose purpose was error handling, could not jump backwards and it could not enter a block, which eliminated the possible abuses of the feature.

Moreover the labelled targets of the GOTO could exist only inside a delimited section at the end of a block.

The keyword GOTO is not needed, because it is redundant. At the place of the jump it is enough to write the target label, as no other action is possible with it.

So in a language like Mesa, the mainline logic would be something like (in languages with "then", no parentheses are needed, unlike in C and its followers):

if err_code := function_that_can_fail(...) then Error_Name

if err_code := function2_that_can_fail(...) then Error2_Name

and the error handlers will be grouped in a section similar with the error handlers used with the exception mechanism of Java or C++.

The difference is that the section with the error handlers must be in the same file with the function invocations that can return errors and the error handlers will be invoked only from there, not from random places inside who knows what 3rd party library might have been used somewhere.

Because for such handlers there is no COME-FROM problem, you know exactly what has happened and you can easily determine what must be done.

initplus · 4 years ago

The challenge with exceptions is there is zero indication at the call site that a function can throw. Does myFunc() throw? Only way to know is to dig down through the entire call stack. Meanwhile with (value, error)/Result etc. it's obvious right at the call site whether a function can potentially error or not.

mindwok · 4 years ago

Interesting, thanks for sharing. I find there to be more cognitive load with the separate control flow. Different strokes I guess!

throw_m239339 · 4 years ago

Why do people praise Go when it comes to error handling? It's the worst of both worlds since its std libs returns errors sure, but anything can panic as well, which is basically a poor's man exception system.

Rust on the other hand, like many other things got it mostly right. If you're going to use errors as a values then you need some constructs to deal with that at the language level.

Go's solution isn't more sophisticated than C error codes.

mindwok · 4 years ago

That's not exactly true. There's very few cases where the standard library will panic, and the handful of cases where it does is for programming errors (like passing a non-nil, zero length buffer into io.CopyBuffer).

comex · 4 years ago

Rust also has panics as a separate feature from errors-as-values, same as Go…

rgharris · 4 years ago

Error handling in Go is simple and verbose at the same time. Working in it alongside Python and JavaScript makes me rethink a lot of patterns I’m accustomed to - it’s a nice exercise. Though of course the use cases often vary significantly between the 3 languages so it is tough to even compare.

planet-and-halo · 4 years ago

I think the biggest place where it's really necessary is integration points. Within your system it's reasonable to try to define exceptions out of existence. Once you start accepting user input or depending upon some system outside of your control, though, you better have some kind of mechanism to handle whatever unknown asteroids come flying in from deep space ready to annihilate your entire planet and civilization.

gizmo686 · 4 years ago

That is still provided by error values or error types. The defining feature of exceptions is that they provide a secondary control flow path, and that control flow path automatically flows up the stack until it reaches an explicit catch.

With error values/types, control flow happens normally, and the programmer is expected to explicitly branch on the the value using normal control flow constructs.

jessikat · 4 years ago

I see a lot of exceptions versus error types/values, as if there's no in-between of exceptions and error types/values. There exist languages that support both. OCaml has both optional values and exceptions. C++ with std::expected. C# with the work on nullability/nullable reference types, and already with nullable value types.

branko_d · 4 years ago

On the exception receiving side...

The key, IMO, is to assume that every line of code can throw until proven otherwise, and then make sure you clean-up after yourself, using RAII in C++, `using` in C#, savepoints in SQL or whatever else mechanism is available in your language.

After that, in 99% cases, you just let the exception propagate to the higher level. Eventually, it will result in an error dialog, or be logged or whatever, but the program is still in consistant state. It's a kind of a "soft reset" that doesn't destroy the state and lets you keep working.

---

On the exception producing side...

Don't throw a new exception unles you expect it can be reasonably treated as described above. If your immediate caller needs a `try...catch` wrapped directly around the call, this is probably a sign you should have returned an error result instead of throwing an exception.

Too · 4 years ago

  > there exist anti-patterns that can provide significant explanatory power to the probability 
  > of post-release defects. 
  > Therefore, development teams should consider allocating
  > more resources to improving their exception handling practices

While an interesting correlation. The conclusion that plugging this particular hole might not be the best goal to focus on improving.

It's two variables that are not statistically independent of each other. A novice developer is more likely to write bugs as they are to use anti-patterns of exception handling, or anti-patterns for anything else for that matter. Basically correlation is not causation.