By far the largest issue with errno is that we don't record where inside the kernel the error gets set (or "raised" if this was a non-C language). We had a real customer case recently where a write call was returning ENOSPC, even though the filesystem did not seem to have run out of space, and searching for the place where that error got raised was a multi-week journey.
In Linux it'd be tough to implement this because errors are usually raised as a side effect of returning some negative value, but also because you have code like:
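Something along these lines, say - a made-up sketch of the usual pattern, not a quote from any real kernel code:

    #include <errno.h>

    struct ctx { long free_blocks, needed; };
    static int do_reserve(struct ctx *c) { (void)c; return 0; }

    /* the error value is created deep down in one helper and then only
     * forwarded by every caller above it, so nothing ever records where
     * it was first raised */
    static int check_space(struct ctx *c)
    {
        if (c->free_blocks < c->needed)
            return -ENOSPC;         /* "raised" here... */
        return 0;
    }

    static int reserve(struct ctx *c)
    {
        int ret = check_space(c);
        if (ret)
            return ret;             /* ...and merely passed along here,
                                       and again by every caller above */
        return do_reserve(c);
    }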
But instrumenting every function that returns a negative int would be impossible (and wrong). And there are also cases where the error is saved in (eg) a bottom half and returned in the next available system call.
This is such a debugging timesink that I've added two things to bcachefs to address it.
- private error codes: error codes are unique, each corresponding to a specific source location. They're mapped to standard error codes when we exit from the bcachefs module; in bcachefs they make for much more useful error messages (see the sketch after this list).
- error_throw tracepoint: any time we throw an error, we invoke a standard tracepoint. Extremely useful for debugging wonky-but-not-broken behavior.
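Roughly the shape of the idea, as a hedged sketch with invented names (ERR_journal_full, err_throw, to_standard_errno and so on are not the real bcachefs identifiers):

    #include <errno.h>
    #include <stdio.h>

    enum {
        ERR_START = 2048,              /* private codes sit above real errnos */
        ERR_journal_full,              /* one code per distinct throw site */
        ERR_btree_node_read,
        ERR_MAX,
    };

    /* map each private code back to the classic errno it stands for,
     * which is what gets returned once we leave the module */
    static const int err_to_errno[] = {
        [ERR_journal_full - ERR_START]    = ENOSPC,
        [ERR_btree_node_read - ERR_START] = EIO,
    };

    /* "throw" an error: record exactly where it came from (a plain
     * fprintf stands in for a tracepoint here) and return the code */
    #define err_throw(err) \
        (fprintf(stderr, "error %s at %s:%d\n", #err, __FILE__, __LINE__), -(err))

    static int to_standard_errno(int ret)
    {
        if (ret < 0 && -ret > ERR_START && -ret < ERR_MAX)
            return -err_to_errno[-ret - ERR_START];
        return ret;
    }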
Same. I usually end up adding a ton of pr_warn statements to kernel code and progressively dig deeper and deeper as I figure out which path every function ends up taking, with a kernel recompile + re-flashing + rebooting in between every step. I sorely wish the kernel had better instrumentation for figuring out where error returns originate from.
There are debug prints in some places which can be enabled (if you're lucky, your kernel is even compiled to let you enable those at runtime, without recompiling!). However, most of the kernel clearly does not have a culture of debug logging whenever it returns an error.
ext4 reserved blocks?
I don't think this behaviour is "peculiar" as the author says it is; why does the error number matter if you know the call succeeded? GetLastError() on Windows works similarly, although with the additional weird caveat that (undocumentedly) some functions may set it to 0 on success.
The system call wrappers could all have explicitly set errno to 0 on success, but they didn't.
Because it's plainly unnecessary. It'd be a waste today, and even more so on a PDP-11 in the 1970s.
I agree with you, it's not so peculiar as one might think. (Disclaimer: been writing software for Unix since before POSIX...)
This design choice reflects the POSIX philosophy of minimizing overhead and maximizing flexibility for system programming. Frequent calls to write(), for example, would be hindered by having to reset errno on every call or every check of write()'s return value - especially in cases where a lot of write()s are queued.
Or .. a library function like fopen() might internally call multiple system calls (e.g., open(), stat(), etc.). If one of these calls sets errno but the overall fopen() operation succeeds, the library doesn’t need to clear errno. For instance, fopen() might try to open a file in one mode, fail with an errno of EACCES (permission denied), then retry in another mode and succeed. The final errno value is irrelevant since the call succeeded.
This mechanism minimizes overhead by not requiring errno to be reset on success.
It allows flexible, efficient implementations of library and system calls and encourages robust programming practices by mandating return-value checks before inspecting errno.
It supports complex, retry-based logic in libraries without unnecessary state management - and it preserves potentially useful debugging information.
You only care about errno when you know an actual error occurred. Until then, ignore it.
This is similar to other systems-level things that can occur in such environments, for example when setting a hard Reset-Reason or Fail-Reason register in static/non-volatile memory somewhere, for later assessment.
IMHO, the thing that's most peculiar about this is that folks these days think of it as weird/quaint - when in fact, it makes a lot of sense if you think about it.
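To make that concrete, a minimal sketch (checked_write is an invented name) of checking the return value before ever looking at errno:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* errno is only consulted once the return value says the call
     * failed; on success it is simply ignored, whatever stale value
     * it happens to hold */
    static ssize_t checked_write(int fd, const void *buf, size_t len)
    {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            fprintf(stderr, "write: %s\n", strerror(errno));
        return n;
    }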
I once worked on a project that had openssl as a dependency. We found that about 90 seconds in, the connection would drop a high percentage of the time. I laboriously worked through both the openssl and dependency code until finding that the bug was due to some code in the dependency not manually clearing out errno for openssl which caused a later read to hard fail when it should have soft failed with ewouldblock (or whatever).
Global errno is a bad design that is a consequence of bad design in c. If c supported tagged unions and pattern matching, there would be no way to access the result or error value without first checking what type of value it is. But since c doesn't support this, all these apis require programmer discipline that cannot be relied upon. As a consequence, it becomes almost inevitable that you encounter some inscrutable bug that someone else wrote due to their lack of skill and/or discipline. That doesn't make sense to me though I do understand why it is the way it is.
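For what it's worth, a C sketch (invented names) of the kind of tagged result being described - the union can be written down, but nothing forces the caller to check the tag before reading a member, which is exactly the gap:

    #include <sys/types.h>

    struct io_result {
        enum { IO_OK, IO_ERR } tag;
        union {
            ssize_t nread;    /* meaningful only when tag == IO_OK  */
            int     error;    /* meaningful only when tag == IO_ERR */
        } u;
    };

With real pattern matching, reading nread outside the IO_OK arm would not compile; in C it does.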
I think it would have been better if they had designed it so that the error code from the kernel came in a separate register. That would mean you didn't have to use a signed int for the return value. The issue is that one register now is sort of ambiguous: it either returns the thing you want or the error, but these are separate types. If you had them in separate registers you would have the natural type of the thing you are interested in without having to convert it. This would however force you to first check the value in the error register before using the value in the return register, but that makes more sense to me than the opposite.
That is quite expensive. Obviously you need to physically add the register to the chip.
After that the real work comes. You need to change your ISA to make the register addressable by machine code. The PDP-11 had 8 general-purpose registers, so it used 3 bits everywhere to address the registers. Now we need 4 sometimes. Many opcodes can work on 2 registers, so we need to use 8 out of 16 bits to address both where before we only needed 6. Also, the PDP-11 had a fixed 16 bits for instruction encoding, so either we change it to 18-bit instructions or make more radical changes to the ISA.
This quickly spirals into significant amounts of work versus encoding results and error values into the same register.
Classic worse is better example.
There are quite a few registers (in all the ISAs I'm familiar with) that are defined as not preserved across calls; kernels already have to wipe them in order to avoid leaking kernel-specific data to userland, one of them could easily hold additional information.
EDIT: additionally, it's been a long time since the register names we're familiar with in an ISA actually matched the physical registers in a chip.
Yeah, I am not advocating creating a new separate register, even though that would be nice. Like the poster below said, there are usually some unpreserved registers to choose from, but if you for some reason can't spare a register you could instead write the error code to some virtual address, or send a signal, a message, or anything else you could come up with. Just some way that does away with this intermixing of return types and error types.
Imho, this is an area where the limitations of C shine through.
Some kernels return error status as a CPU flag or otherwise separately from the returned value. But that's very hard to use in C, so the typical convention for a syscall wrapper is to return a non-negative number for success and -error for failure, but if negative numbers are valid as the return, you've got to do something else.
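As a sketch of that convention (simplified; the 4095 cutoff is the Linux one, and this is not actual libc code):

    #include <errno.h>

    /* the raw system call returns -errno in the same register as the
     * result; the wrapper turns that into errno plus a -1 return */
    static long syscall_wrapper(long raw_ret)
    {
        if (raw_ret < 0 && raw_ret >= -4095) {   /* -1..-4095 are errors */
            errno = (int)-raw_ret;
            return -1;
        }
        return raw_ret;                          /* anything else is the result */
    }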
Hard‽ Small POD structures returned in register pairs have been a feature of C compiler calling conventions for over 3 decades, and were around in C compilers in the 16-bit era.
* https://jdebp.uk/FGA/function-calling-conventions.html#Watca...
Something worth mentioning would have been those libc calls where the only way to tell if a return value of 0 is an error is to check errno. And of course, as the article says, errno is only set on error, so you need to set it to 0 before making that libc call.
I think strtol was one such function, but there were others.
strtol() isn't actually such a case - it sets errno on range errors but returns 0 for valid input "0"; the classic examples are actually getpriority() and ptrace(), which can legitimately return -1 on success.
It is such a case, it's just not about zero. If the input is too large to be represented in a signed long, strtol(3) returns LONG_MAX and sets errno to ERANGE. However, if the input was the string form of LONG_MAX, it returns LONG_MAX and doesn't set errno to anything. In fact my strtol(3) manpage explicitly states "This function does not modify errno on success".
Thus, to distinguish between an overflow and a legitimate maximum value, you need to set errno to 0 before calling it, because something else you called previously may have already set it to ERANGE.
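A minimal sketch of that pattern (parse_long is an invented name):

    #include <errno.h>
    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* clear errno first, then use the return value *and* errno together
     * to tell a genuine LONG_MAX apart from an overflow */
    static long parse_long(const char *s)
    {
        char *end;

        errno = 0;
        long val = strtol(s, &end, 10);
        if (val == LONG_MAX && errno == ERANGE)
            fprintf(stderr, "overflow parsing %s\n", s);
        return val;
    }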
kqueue() can apparently return the error right in the data of the kevent, but I'm still using poll() so cannot confirm; whilst I can confirm that kqueue/kevent is alas not as truly consistent as one might expect. (Someone recently tried to move FreeBSD devd to kqueue, and hit various problems of FreeBSD devices that are still, even in version 14, not yet kqueue-ready.)
https://cr.yp.to/docs/connect.html

* https://github.com/jdebp/nosh/blob/trunk/source/socket_conne...
I had hopes of finding an elegant solution to this issue in modern async IO libs. Those I have found simply ignore the error and forward it to the user.
For thread-safety, POSIX needs a mulligan: migration to an API with an opaque context structure. Globals are bad and were never going to work with threads. The gotcha would be the need for mutexes on process-shared state like env vars.
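A hedged sketch of what such an opaque-context API could look like - every name here is invented for illustration, not a real or proposed interface:

    /* one context per thread (or per caller); the last error lives in
     * the context rather than in a global */
    typedef struct posix_ctx posix_ctx_t;

    posix_ctx_t *pctx_new(void);
    int          pctx_open(posix_ctx_t *ctx, const char *path, int flags);
    int          pctx_last_error(const posix_ctx_t *ctx);
    void         pctx_free(posix_ctx_t *ctx);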
If going through all this effort, might as well adopt a fine-grained, capability-based permissions system too so that resources (i.e., fs, process, network) can be sandboxed.