Just reading the title it sounded very much like "Single Point of Failure: The (Fictional) Day Google Forgot To Check Passwords": https://www.youtube.com/watch?v=y4GB_NDU43Q
Fortunately (or not, depending if you have a Chromebook and store files locally there), the inverse has happened and it locked up your book (fail-deadly versus fail-safe).
> The line should read "if (key_data_.has_value() && !key_data_->label().empty()) {" but instead of "&&"—the C++ version of the "AND" operator—the bad update used a single ampersand, breaking the second half of the conditional statement.
Someone correct me if I'm wrong, but this seems like it wouldn't be possible in a language that didn't conflate boolean values with bitvectors.
You can use `and` and its siblings instead of && and similar in C++11.
Most people here, and in other boards, will try to convince you that it hurts readability because 'that‘s how we always did it' (read I‘m used to it and don‘t like change).
While it is true that you can use `and` in C++11, that's a bit of an understatement: these keywords have in fact been present since the very first ISO C++ standard C++98!
I am also of the opinion that `and` is more readable than `&&` (and isn't as easy to typo in a catastrophic way) - although my main point was about the weaker type system.
Dereferencing via key_label_-> has undefined behavior when !key_label_.has_value(), so the compiler treats that case the same as if key_label_.has_value(). Or to put it another way, the compiler reasons that key_label_ must have had a value (since you dereferenced it), and thus optimizes out the check for `has_value()` (since clearly it was unnecessary).
If I had to guess, the second expression depends on the first being true. So if `has_value()` is false, perhaps `key_data_->label()` is undefined behavior.
This is one of those times when, for all its verbosity, Ada got it right. For boolean conjunction you must choose between writing "A and B", or "A and then B". The "and then" version is short circuiting, while the bare "and" version is not.
Other than this particular edge case it's not obvious to me why bitwise operators shouldn't be applicable to boolean values. Aren't they just single-bit vectors?
Of course it makes sense that logical operators wouldn't be applicable to numeric types, but why the reverse?
If you stop integer values from being implicitly convertible to booleans, you break the language and its compatibility with C.
The inverse is debatable, though. I honestly don't see why someone would do bitwise operations on boolean, unless they really wish to die a slow death from thousand cuts. Integer promotion rules make even the simplest operations a wild mess, and there's nothing in the standard that prevents bool from having a sizeof equal to int (thus further complicating everything).
It might be sensible for newer language revisions to consider deprecating certain operators from being applied to bool, given that there is a strong change that 99% of the time their use is not on purpose, but I guess it depends on how much disruptive such change would turn out to be.
>> Other than this particular edge case it's not obvious to me why bitwise operators shouldn't be applicable to boolean values.
That is a really good question. At some point they decided that true=1 and false=0 but I'm not sure how deep that goes. In particular what does ! actually do in this context?
if (key_data_.has_value() && !key_data_->label().empty())
return key_data_->label();
to be less readable than:
if let Some(label) = key_data_ {
label
}
It's notable that people complain Rust adds too many odd punctuation marks, and yet in this example it's C++ with an exclamation mark, question marks, and two different structure member operators (dot and arrow) while Rust just does the pattern match.
It's not, because that would involve re-engineering and re-writing millions of lines of code.
The simpler solution is to fix this gap in their automated testing. Another one is to configure a linter to error on bitwise operations on boolean values.
In fact you can apply the single ampersand & operator to booleans in Java, and just like in the example here it has the same behaviour as && but with no short-circuiting.
I bet the post-mortem for this is going to be fun. How this wasn't covered by multiple unit tests, a code review and a roll out strategy is.... impressive.
Bitwise & versus logical && is a classic, right up there with an assignment in a comparison (when an equality check is intended), = for ==.
That this was missed is pretty surprising, given that it's Google
and the stakes involved in the encryption/key management code in a secure platform device.
I wonder if we'll even get a postmortem, as this simply cannot happen unless several someones all
Seriously Fucked Up simultaneously.
- code error (expected, humans are fallible)
- review error (less expected, this is the kind of thing code review exists for)
- static analysis / linting check failure
- integration test failure (interactive login no longer works on this build)
This is a massive, overlapping fuckup. Heads should probably roll here, especially given that when you opt in to ChromeOS as a user, Google now has "one job" (DFIU).
0) We want to fix the situation. By the time the post-mortem happens, that should already be done.
1) We want to make sure this never happens again. In order to do that we need to understand as much as we can about what happened.
2) We want everyone to share everything they know about the event. If the participants worry about fallout, they will be compelled to cover up.
3) Most failures happen at a number of points, so you need to dig in deeply if you want to get full value out of the "event".
4) Good post-mortems lead to process fixes to reduce (or eliminate?) the chance of similar events, or to make fixes throughout the code base.
The other thing is that after the post-mortem, all of the folks involved have learned from it. Usually they're better engineers because of that expensive lesson you've just paid for.
Disclaimer: Googler, participant in several post-mortems, opinions my own.
I disagree. It's obviously a major balls up by multiple people, but at the end of the day this is an organizational failure. You don't fire people for this, you learn from it and fix the organizational holes that caused it.
Especially since it's so easy to lint for. This isn't some cross-file memory mishandling bug, it's a single token that's being used where it shouldn't be
RedHat once committed wrong value of MTWO variable, it was 2, there was another variable called TWO which also had value of 2. It was fixed a few months later with a commit like "change value of MTWO to -2".Sorry, but can't find it find now
I'm quite surprised there isn't a compiler warning for a bitwise operator at the root of a condition. I've been writing `if ((flags & FLAG) != 0)` for decades years for nothing?!?
Reading stuff like "heads gonna roll" in the comments is kind of shocking to me. Have you never screwed up in your job?
How long have they shipped a secure Linux to a couple of hundred different devices simultaneously, while improving security, fixing critical bugs soon and delivering a pretty great user experience since they first launched Chrome OS? Sure, were soft bricks before, critical fixes introduced other bugs, duh. The same happens on Windows, macOS, Linux distributions - we're all human, we make mistakes, we try to learn from each mistake.
Don't give them a hard time. They already were in panic mode already.
Yup. My second or third month at my first job, I accidentally knocked out internet access for thousands of people for a couple of hours because I deleted something in production rather than in dev. It was terrifying and I was sure I would be fired.
Rather than being canned, of course, the team managing the tool that allowed me to delete the resource learned the following:
1) people should not be able to delete things without confirmation in production
2) it should be very obvious which environment you're in
3) normal humans shouldn't have access to production except for operations folks.
I'm baffled that there's no test for this, manual or otherwise, before this rolls out to production. That you can change ChromeOS code in the login path, and the very first time it's exercised is by an end customer.
bool good(bool a, bool b) { return a && b; }
good(bool, bool):
mov eax, edi
and eax, esi
ret
bool bad(bool a, bool b) { return a & b; }
bad(bool, bool):
mov eax, edi
and eax, esi
ret
key_data is probably something like an std::optional, so short-circuit evaluation guards the dereferencing of the optional. If you write this with a binary and, no short-circuit evaluation can happen, so undefined behavior ensues.
I wanna point out that std::optional, in the typical C++ stance, has a "I know what I'm doing" API and a "Humans make mistakes API": optional->... is UB if the optional is empty, optional.value()... raises an exception. Arguably, in code like this you probably only want to use the latter kind of API. (The pattern of "operator can UB, equivalent method asserts pre-condition" is very common in the C++ standard library)
Except that Google uses C++ with exceptions disabled - IIUC, the throwing code would effectively behave like a call to std::terminate(), I haven't looked at the code to know if it would actually be a preferred behavior in this case.
Where is the undefined behavior? AFAU, UB is a compile time concept. As other commenter has asked, how can the compiler infer that has_value() call has any relation to key_data_ non-NULL-ness?
EDIT:
> optional->... is UB if the optional is empty
I understand what you mean: dereferencing optional is undefined if has_value() is not checked for true in the same thread before. But that undefined behavior happens at runtime, i.e. compile time UB is not invoked, the statements are kept without any optimizations. Right? The terminology is confusing.
Interestingly I couldn't get it to fail on GCC (even with -fnoexceptions and -Ofast), only on Clang. Maybe someone else has some info as to why this is?
There is a clear difference in the assembly output for the correct and incorrect implementations, with a testb instruction run much earlier in the correct version. You can see this at line 22 of the output, if you highlight it the && in the source is highlighted in bold.
The difference is your example uses bool values, the code in question uses expressions that (should) evaluate to bool. && is short-circuited if a is false, such that b is not evaluated.
This is clearly a typo that happened to stomp some logic that is not covered by a test. The review 3002282 includes the bug, which deletes one of the & from a && operator. It says it was cherry-picked from review 2994464, which does not change that line. So, slip of the backspace key plus poor testing.
Yeah, the biggest thing to me here is that there's no "login succeeded" unit test, integration test, manual test, etc. Whatever customer logged in first was the first person to actually run the code.
I heard elsewhere that its market share is closer to 10% (and that it's now the second most popular desktop OS). 2% would be considering mobile OS too, I think
Speaking of... https://www.youtube.com/watch?v=TUKQfMFKfjk
> The line should read "if (key_data_.has_value() && !key_data_->label().empty()) {" but instead of "&&"—the C++ version of the "AND" operator—the bad update used a single ampersand, breaking the second half of the conditional statement.
Someone correct me if I'm wrong, but this seems like it wouldn't be possible in a language that didn't conflate boolean values with bitvectors.
Edit: I was wrong[1].
[1] https://news.ycombinator.com/item?id=27923785
Never seen them used, but I use them in my C++ code when I have to write it to accomplish something else:
https://en.cppreference.com/w/cpp/language/operator_alternat...
... which, frankly, is a very good reason. Don't needlessly change something people are used to.
Why is the blinker control on the left side and the wiper control on the right side? Because that's what people expect.
There's also linters and other quality control analyzers that will warn on a bitwise operator on booleans.
http://www.ada-auth.org/standards/12rm/html/RM-A-1.html#I538...
Of course it makes sense that logical operators wouldn't be applicable to numeric types, but why the reverse?
The inverse is debatable, though. I honestly don't see why someone would do bitwise operations on boolean, unless they really wish to die a slow death from thousand cuts. Integer promotion rules make even the simplest operations a wild mess, and there's nothing in the standard that prevents bool from having a sizeof equal to int (thus further complicating everything).
It might be sensible for newer language revisions to consider deprecating certain operators from being applied to bool, given that there is a strong change that 99% of the time their use is not on purpose, but I guess it depends on how much disruptive such change would turn out to be.
That is a really good question. At some point they decided that true=1 and false=0 but I'm not sure how deep that goes. In particular what does ! actually do in this context?
The simpler solution is to fix this gap in their automated testing. Another one is to configure a linter to error on bitwise operations on boolean values.
Deleted Comment
Dead Comment
That this was missed is pretty surprising, given that it's Google and the stakes involved in the encryption/key management code in a secure platform device.
I wonder if we'll even get a postmortem, as this simply cannot happen unless several someones all Seriously Fucked Up simultaneously.
- code error (expected, humans are fallible)
- review error (less expected, this is the kind of thing code review exists for)
- static analysis / linting check failure
- integration test failure (interactive login no longer works on this build)
This is a massive, overlapping fuckup. Heads should probably roll here, especially given that when you opt in to ChromeOS as a user, Google now has "one job" (DFIU).
Google has a blameless post-mortem process (see https://sre.google/sre-book/postmortem-culture/) which essentially boils down to:
0) We want to fix the situation. By the time the post-mortem happens, that should already be done.
1) We want to make sure this never happens again. In order to do that we need to understand as much as we can about what happened.
2) We want everyone to share everything they know about the event. If the participants worry about fallout, they will be compelled to cover up.
3) Most failures happen at a number of points, so you need to dig in deeply if you want to get full value out of the "event".
4) Good post-mortems lead to process fixes to reduce (or eliminate?) the chance of similar events, or to make fixes throughout the code base.
The other thing is that after the post-mortem, all of the folks involved have learned from it. Usually they're better engineers because of that expensive lesson you've just paid for.
Disclaimer: Googler, participant in several post-mortems, opinions my own.
I disagree. It's obviously a major balls up by multiple people, but at the end of the day this is an organizational failure. You don't fire people for this, you learn from it and fix the organizational holes that caused it.
This is exactly how any and every bug enters a mature software product: it slips past multiple layers.
> This is a massive, overlapping fuckup. Heads should probably roll here
Some process should change, to fix the problem. But firing people is not necessarily the best solution.
How long have they shipped a secure Linux to a couple of hundred different devices simultaneously, while improving security, fixing critical bugs soon and delivering a pretty great user experience since they first launched Chrome OS? Sure, were soft bricks before, critical fixes introduced other bugs, duh. The same happens on Windows, macOS, Linux distributions - we're all human, we make mistakes, we try to learn from each mistake.
Don't give them a hard time. They already were in panic mode already.
Rather than being canned, of course, the team managing the tool that allowed me to delete the resource learned the following:
1) people should not be able to delete things without confirmation in production
2) it should be very obvious which environment you're in
3) normal humans shouldn't have access to production except for operations folks.
That's the value of blameless post mortems.
I don't understand how that broke anything?
The C++ codegen for && and & is the same:
[0] https://godbolt.org/z/84MMqTehsI wanna point out that std::optional, in the typical C++ stance, has a "I know what I'm doing" API and a "Humans make mistakes API": optional->... is UB if the optional is empty, optional.value()... raises an exception. Arguably, in code like this you probably only want to use the latter kind of API. (The pattern of "operator can UB, equivalent method asserts pre-condition" is very common in the C++ standard library)
Source: I'm a Chrome contributor.
EDIT:
> optional->... is UB if the optional is empty
I understand what you mean: dereferencing optional is undefined if has_value() is not checked for true in the same thread before. But that undefined behavior happens at runtime, i.e. compile time UB is not invoked, the statements are kept without any optimizations. Right? The terminology is confusing.
That is, yes, it is the same with booleans. Now try with expressions.
Interestingly I couldn't get it to fail on GCC (even with -fnoexceptions and -Ofast), only on Clang. Maybe someone else has some info as to why this is?
There is a clear difference in the assembly output for the correct and incorrect implementations, with a testb instruction run much earlier in the correct version. You can see this at line 22 of the output, if you highlight it the && in the source is highlighted in bold.
Deleted Comment
https://trends.google.com/trends/explore?q=chromebook
Chrome OS has less than 2% market share.
https://www.maketecheasier.com/chrome-os-tops-macos-2nd-most...