Yes, the code is bad, but it does not need to be completely rewritten. It should interpret the data as an array of uint8_t (no casting to structure/integer pointers!), use simple helper functions to read out (u)int(8/16/32) values (there was a post on HN about this recently, https://commandcenter.blogspot.com/2012/04/byte-order-fallac...), and be careful to check things like sizes.
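A minimal sketch (hypothetical names, not taken from the linked post) of the kind of helpers meant here: read fixed-width values straight out of a uint8_t buffer, with the caller checking the remaining length first.

    #include <stddef.h>
    #include <stdint.h>

    /* Read a big-endian 16-bit value from a byte buffer. */
    static uint16_t read_be16(const uint8_t *p)
    {
        return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
    }

    /* Read a big-endian 32-bit value from a byte buffer. */
    static uint32_t read_be32(const uint8_t *p)
    {
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    }

    /* Usage: check the size, then read.
     *     if (len - off < 4) return -EINVAL;
     *     uint32_t xid = read_be32(buf + off);
     */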
The code is also wrong because of strict aliasing. This is a real problem, your program can in fact exhibit undefined behavior because of this (it happened to me).
Some time ago I wrote some code to make these things simpler in C++. It allows you to define "structures" (which however are not real structs, it's an abstraction) and then directly access an arbitrary buffer through setters/getters. The code is here with documentation: https://github.com/ambrop72/aipstack/blob/master/src/aipstac... . In fact this is part of my own TCP/IP stack and it includes DHCP code (here are my DHCP structure definitions: https://github.com/ambrop72/aipstack/blob/master/src/aipstac...).
Interestingly (and pedantically), the code from the otherwise excellent "Byte order fallacy" article technically contains undefined behavior in C on machines where int is 32 bits wide (which is almost everything nowadays).
Consider this example of extracting a 32-bit int encoded as little endian:
    #include <stdio.h>

    int main(void)
    {
        unsigned char data[4] = { 0, 0, 0, 128 };
        unsigned int i;
        // the line of code from the article:
        i = (data[0]<<0) | (data[1]<<8) | (data[2]<<16) | (data[3]<<24);
        printf("%x\n", i);
    }
Compiling this with either gcc or clang using "-fsanitize=undefined" triggers this runtime error:
    test.c:9:61: runtime error: left shift of 128 by 24 places cannot be represented in type 'int'
That happens because, even though the bytes are unsigned as the article requires (using uint8_t instead of unsigned char has the same problem), they are promoted to (signed) int before the left shift as part of the usual integer promotions. When the left shift by 24 happens, it puts a 1 into the sign bit of the int, which is undefined behavior. Later the int is converted to unsigned (relying on implementation-defined behavior, which luckily is sane on all popular compilers), but by then it's already too late.
The solution is to change the code slightly:
    i = (data[0]<<0) | (data[1]<<8) | (data[2]<<16) | ((unsigned)data[3]<<24);
This prevents the promotion from unsigned char (or uint8_t) to int by explicitly casting it to an unsigned int, which can then be safely shifted left by 24.
Yeah casting to struct pointers is a bright red flag. It also rarely saves you anything since packed structs are slow, and you still have to validate the contents.
You're right, these streams should be interpreted as uint8_t to avoid undefined behavior when shifting, masking, and ORing, and anyone doing this should tuck it behind some abstraction layer. It should never bleed into parsing.
My second year at Google, I fixed undefined behavior that appeared in essentially every Google binary, because a core library did this without care for the alignment of the memory in question. Thank god that with PODs you can just memcpy.
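A minimal sketch (hypothetical struct and names, not the library in question) of the memcpy approach: copy the bytes into a properly aligned object instead of dereferencing a cast pointer, which sidesteps both alignment and strict-aliasing problems.

    #include <stdint.h>
    #include <string.h>

    struct header {            /* hypothetical POD wire header */
        uint32_t magic;
        uint16_t version;
        uint16_t flags;
    };

    /* Caller guarantees at least sizeof(struct header) bytes are available. */
    static struct header read_header(const uint8_t *buf)
    {
        struct header h;
        memcpy(&h, buf, sizeof h);  /* alignment- and aliasing-safe copy */
        return h;                   /* fields are still in wire byte order */
    }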
As an array of uint8_t? I'm pretty sure that's wrong, and that it should be char instead. Usual OS network APIs are char based, and the fact that the network is technically about "octets" shouldn't matter behind those APIs. Or is my thinking wrong?
There's a different consideration: as far as I know, it's an open question whether uint8_t* and int8_t* get the same special aliasing treatment that applies to char* and unsigned char*. A strict reading suggests that they do not. So there may be cases where using uint8_t is UB but using unsigned char* is not.
I think that all compilers consider this to be a bit of defect in the relevant standards, and they'll compile your code the way you expect.
On virtually all modern computers chars are 8-bit and casting the address to uint8_t* is perfectly fine (due to specific provisions in the standards that I am too lazy to find). And you need unsigned so that expressions like x + (y << 16) do the right thing. With that in mind, I use uint8_t as opposed to unsigned char to emphasize that it is 8-bit and also for less typing :)
uint8_t is guaranteed to always be an octet, and the stdint types (uint8_t, int32_t, uint32_t, etc.) are also the recommended types to use when you want to make the size and signedness of the data clear. An array of uint8_t is definitely correct, and it's also the right thing to do in 2018. An array of char is something you should only use to store human-readable characters.
Many OS network APIs are "char based" because the C standard requires that sizeof(char) be exactly one byte, so char used to be a convenient shorthand for a byte, but that convention is no longer necessary now that the fixed-width types exist.
Downvoters, my question was whether int8_t* and char* are interchangeable. I can't see why that's a bad question to ask. Given "char* buf", there might well be a difference between "(uint8_t) buf[1]" and "((uint8_t*)buf)[1]".
Now, upon further researching it looks like uint8_t always has size 1 (as does char), but I'm still not positive that the second expression is technically valid.
I don't think I disagree, but could you articulate how strict aliasing makes this code wrong? Maybe I'm not looking at the right code, but it's not immediately obvious from the post to me.
> The byte order of the computer doesn't matter much at all except to compiler writers and the like
This statement is too generic to mean anything helpful.
I think the point our friend Rob Pike is making is that you could write general purpose code which accomplishes the task, and on a sufficiently advanced compiler, the swap may be eliminated where applicable; but when the point is made this way, it surely just pisses off people who know better.
C programmers are not (quite) children, you don't need to eradicate nuance to keep people reading.
The article isn't really about the bug but about the (lack of) quality in the code. Spoiler alert: He doesn't like it.
"This code has no redeeming features. It must be thrown away and rewritten yet again. This time by an experienced programmer who know what error codes mean, how to use asserts properly, and most of all, who has experience at network programming."
If the rest of the code looks anywhere near similar to the examples he shows, he's right: this is terrible code and it should never have been allowed into a serious project in this day and age.
It seems like this day and age is precisely when this sort of code is normal. We had a brief window of hopeful optimism in the mid to late 90s and even into the early 2000s, but since then it seems like code quality in general has declined tremendously.
I agree with most of the critique, but I don't quite get the part about the use of assert.
The code in question quite clearly doesn't use C's standard assert macro, but a custom macro called assert_return (which I guess checks for a condition, and returns from the function instead of aborting the program).
So basically:
    if (!condition) {
        return err;
    }
How else would one check for invalid input to the function?
I don't know the details of DHCP, but there is a difference between values supplied by the programmer and those supplied by the user. According to the author, checking if a pointer is not-null is checking a programmer error (EINVAL) so it makes sense to use asserts. But the optval string in the example is supplied by the user so it is insane to use asserts to check for null-termination.
Asserts should only be used to signal "this code is broken" but that is not how they are used here.
A long while ago the GTK+ library used to suffer from a similar affliction. Mainstay GTK+ programs such as GEdit and Totem would print lots of warnings on the console. Since the GTK+ developers used the same facility (g_warning) for reporting both coding errors and user errors like "file not found," debugging code was hard. Was the warning "supposed to be there" or was it a bug? GTK+ programs are much cleaner these days, even if some programs still fail to make the distinction between user and programmer errors.
I may have a different understanding than the author, but for me the difference in the name means a lot. The fragment you wrote is clear and simple. But if I see assert_..., then I immediately have questions like: Where's the message printed and does it have enough context for debugging at that point? Is this affected by NDEBUG? Will it run in release builds? Does assert_se() kill the whole application?
Assert already has a meaning and using it in this context is not great.
I can see the argument to be made about overloading the word "assert" in the language though.
i.e. "assert" meaning "check something in this code is algorithmically correct"
Whereas when you're checking user input, use a very different term - i.e. validate or something.
So a fix would be to rename the macro to be obvious about its usage - "validate_return" or "valid_or_return" or something.
To be fair I can see the problem because I do tend to write code comments saying "assert everything makes sense" when checking user input sometimes and "assert" itself doesn't properly convey the outcome if it fails - i.e. "abort_if" or "crash_if" would be a much more descriptive term.
In my opinion diluting existing, well-established terms is extremely dangerous and harmful for a project. If you're diluting established terms - like assert - you're kinda forcing readers to stop assuming anything about your names.
If assert isn't assert, and if return values are also abused, why can I assume get is get, sort is sort and so on? Sort might check inputs, get might delete the customer account, who knows at this point. And that makes anything but a tiny code base pretty much unmanageable, or at least very very hard to learn.
I used assert_* for code bugs and verify_* for the user-input checks; my projects would have a verify.h/.c pair that was pretty short/sweet/easy for the next dev to grok (sketched below).
At one point, an older coder whom I looked up to even complimented and adopted this pattern. 1998 was a good year.
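A minimal sketch (hypothetical macro names, not the original verify.h) of that split: assert_* is for programmer errors and aborts, verify_* is for untrusted input and returns an error code.

    #include <assert.h>
    #include <errno.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Programmer error: abort (in debug builds). */
    #define assert_nonnull(p)  assert((p) != NULL)

    /* User/input error: bail out with an error code, never abort. */
    #define verify_return(cond, err) \
        do { if (!(cond)) return (err); } while (0)

    int parse_option(const uint8_t *optval, size_t len)
    {
        assert_nonnull(optval);            /* caller contract, a bug if violated */
        verify_return(len >= 2, -EINVAL);  /* untrusted wire data */
        /* ... parse ... */
        return 0;
    }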
I sometimes use a bunch of Ensure*() methods in C# if I need to validate input data. They throw exceptions if a condition fails. If I only do this once, I don't create those methods.
One drawback is the stacktrace has these Ensure names in it.
Probably returns a value if the assert works, and aborts otherwise.
I would go so far as to say that using asserts in input validation is a good thing, in very limited quantities. I use them in places where the input should have been validated so long ago, so many functions back, that if something isn't as I expect here, I seriously fucked up somewhere else and the whole system has a good likelihood of being compromised by something unrelated to the condition I'm asserting. In that scenario I'd rather log and abort.
In this case, though, you're not using asserts to validate. You're using asserts to capture the post-validation guarantees in your code. Basically, you assert that your validation code did the right thing and rejected invalid input through appropriate mechanisms - not that the input was valid.
In D, we call the two functions "assert" and "enforce". The point is that assert is the wrong term; to say "assert(condition)" says "I assert that condition should hold here". If there's an input event that causes condition to be false then your assert is not just false but wrong, since you assert something that is not necessarily true.
So this is a whole rant, and it's pretty much spot on, probably, but the reader is left with a huge elephant of a question in the room: what is the proper way to do it, then? It has samples all over the place of what's bad and says those should be rewritten. Why not show some C/C++ code samples of the proper way to do it, then?
I wonder why the author chooses to write buf[offset]*256 + buf[offset+1] instead of the more common method of loading bytes using bitwise arithmetic, e.g. b[k]<<8 | b[k+1]. Shifts are far clearer than potentially magic numbers. Does JavaScript not have bitwise arithmetic? Why would you write C code with JavaScript in mind?
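For what it's worth, the two forms are equivalent; a minimal sketch (hypothetical helper names) decoding a big-endian 16-bit value either way, which any modern compiler should turn into the same instructions:

    #include <stddef.h>
    #include <stdint.h>

    static uint16_t load_be16_mul(const uint8_t *b, size_t k)
    {
        return (uint16_t)(b[k] * 256 + b[k + 1]);    /* the article's style */
    }

    static uint16_t load_be16_shift(const uint8_t *b, size_t k)
    {
        return (uint16_t)(b[k] << 8 | b[k + 1]);     /* bitwise style */
    }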
The main issue to me is that conversion, parsing, and validation are scattered throughout the code when they should be in separate, local, sequential steps (a sketch follows this comment). You should never be in a situation where you're halfway through allocating memory for an internal structure when you suddenly realize your input is malformed, for three reasons:
- In order to make a good decision about what to do next, you need a lot of context: what you need to clean up, what any calling function expects you to do, any internal state you need to update, etc. You either end up passing a lot of extraneous context info around, or you make everything global.
- Alternatively you bail with a status code and mutter "yolo" to yourself, or you throw an assert -- but crashing a system daemon with input is the essence of a DoS, so only the truly ignorant or lazy resort to it.
- Building a reasonable framework is hard when all your concerns are entwined like this. If at some point you decide asserts on input are bad, it's very hard to change that convention in this code. Essentially all your function signatures and your entire call graph change. There's an extraordinary amount of technical debt here; just look at all the places Lennart had to change the size/offset stuff. The logic for this is scattered all over the place.
People have focused on some of the tactical issues -- asserts on input, byte order conversion, etc. -- but the real issue is that when you write code in this style those things don't jump out at you because you have to do them everywhere. If your code had a better separation of concerns, as soon as you saw 'offsetof' in a parsing function you'd immediately see a red flag. It's so much easier to think in these specific contexts, where you're assured previous steps are complete, rather than in every single function having to get the entire conversion/validation/parsing flow correct every single time, keeping the entire thing in your head always.
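A minimal sketch (hypothetical names, nothing to do with systemd's actual code) of what those separate, sequential steps might look like: validate framing first, with no allocation and no interpretation, and only then parse out of an already-checked view.

    #include <errno.h>
    #include <stddef.h>
    #include <stdint.h>

    struct option_view {          /* borrowed view into the raw buffer */
        uint8_t        code;
        uint8_t        len;
        const uint8_t *data;
    };

    /* Step 1: validate framing only. */
    static int validate_option(const uint8_t *buf, size_t buflen,
                               struct option_view *out)
    {
        if (buflen < 2)
            return -EINVAL;
        if ((size_t)buf[1] + 2 > buflen)
            return -EINVAL;       /* length field overruns the buffer */
        out->code = buf[0];
        out->len  = buf[1];
        out->data = buf + 2;
        return 0;
    }

    /* Step 2: interpret an already-validated view (caller checked len >= 4). */
    static uint32_t parse_be32(const struct option_view *opt)
    {
        return ((uint32_t)opt->data[0] << 24) | ((uint32_t)opt->data[1] << 16) |
               ((uint32_t)opt->data[2] << 8)  |  (uint32_t)opt->data[3];
    }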
As an aside, I find it a little ironic that I found it hard to parse the headline "Systemd is bad parsing". No, Systemd is not bad parsing, Systemd is parsing badly.
If this were not buried in systemd, but was part of a standalone dhcpd package, would it have been done better or caught earlier?
One of the major criticisms of systemd is that it wants to take over all system management functions. In order to do that it needs to offer at least minimally functional versions of all the services it pushes out. It should not be news to anyone that trying to do everything instead of integrating with existing projects leads to sloppiness.
This incident is evidence suggesting that other serious issues are lurking in systemd as a result of the actions naturally arising from their policy.
One of the more popular DHCP clients is 'dhcpcd', which was created and is maintained by Roy Marples, the creator of OpenRC. It's fine when someone wants to work on DHCP clients and init systems, unless that init system is systemd?
(And for the record, systemd integrates wonderfully with dhcpcd.)
Rewriting everything is fine, as long as it's not in C or a similarly insecure-by-design programming language, it doesn't invent DSLs on top of INI/YAML/JSON files, and it uses a proper architecture with clearly defined boundaries and protocols, allowing custom application-specific management instead of forcing everyone into the author's narrow worldview full of incorrect assumptions. So, as long as you are not doing what systemd does.
Centralized management/monitoring in general sucks. People don't seem to understand the problem and keep reinventing new centralized architectures suited to their own special cases, instead of addressing it at the architecture level.
>> One question: any smell of intentional obfuscation for this vulnerability, or is it impossible to tell?
I always think that anytime there's an interesting bug in ANYTHING. I'm not an expert, but I would think it would be nearly impossible to tell the difference between "someone made a mistake" and "someone made this look like a mistake". I would think any decent programmer with malicious intent could do that anytime they wanted.
I suspect in advance of the 1990s it was a problem to people who specifically cared about such things, and would read or write papers on the topic. In the 1990s it started to become a problem for a lot of people who really only cared about putting their programs on a widely-accessible network. ;)
But I think it is better to cast all the bytes to the final type (e.g. uint32_t) before shifting, rather than to a type of unspecified width like unsigned int.
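For example, a minimal sketch of that (each byte widened to the destination type before the shift):

    #include <stdint.h>

    static uint32_t load_le32(const uint8_t *data)
    {
        return ((uint32_t)data[0] << 0)  | ((uint32_t)data[1] << 8) |
               ((uint32_t)data[2] << 16) | ((uint32_t)data[3] << 24);
    }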
Packed structs were slow on 1990s RISC processors. Packed structs are not slow on 2010-era superscalar processors.
cf. http://www.cplusplus.com/reference/cassert/assert/
i.e. "assert" meaning "check something in this code is algorithmically correct"
Whereas when you're checking user input, use a very different term - i.e. validate or something.
So a fix would be to rename the macro to be obvious about it's usage - "validate_return" or "valid_or_return" or something.
To be fair I can see the problem because I do tend to write code comments saying "assert everything makes sense" when checking user input sometimes and "assert" itself doesn't properly convey the outcome if it fails - i.e. "abort_if" or "crash_if" would be a much more descriptive term.
If assert isn't assert, and if return values are also abused, why can I assume get is get, sort is sort and so on? Sort might check inputs, get might delete the customer account, who knows at this point. And that makes anything but a tiny code base pretty much unmanageable, or at least very very hard to learn.
At one point and older coder who I looked up to even complemented and adopted this pattern. 1998 was a good year.
I sometimes use a bunch of Ensure*() methods in C# if I need to validate input data. They throw exceptions if condition fails. If I only do this once I don't create those methods.
One drawback is the stacktrace has these Ensure names in it.
I would go so far as to say that using asserts in input validation is a good thing, in very limited quantities. I use them in places where the input should have been validated so long ago, so many functions back, that if something isn't as I expect here, I seriously fucked up somewhere else and the whole system has a good likelihood of being compromised by something unrelated to the condition I'm asserting. In that scenario I'd rather log and abort.
He provides constructive examples for all of the items that he criticizes.
- Instead of pointer arithmetic, base_ptr, offset, len.
- Instead of casting to internal structs, parsing into internal structs.
- Instead of network byte order, buf[offset] + buf[offset+1] * 256 .
- Instead of the wrong error codes, the right ones.
- Instead of random asserts over internal data structures, validation at parse time.
And you got it wrong. Shows that there is no silver bullet.
Possibly a bad idea on systemd's part to reuse the name assert, but the code isn't doing what the blog assumes it does.
I guess someone felt like it shouldn’t be a meme headline on HN, but the result is nonsense.
;-D
If one is going to rewrite all of these things, I would expect one to learn from the history and not make the same mistakes over again.
What? It was known well in advance.