Python supports arithmetic on mixed numeric types, so it makes sense that floats and ints should have a hash function that behaves somewhat consistently. I don't write a lot of python, but having used other scripting languages it wouldn't surprise me if numeric types get mixed up by accident often enough. You probably want int(2) and float(2) to be considered the same key in a dictionary to avoid surprises.
Yeah IEEE 754 floating point numbers should probably not be hashable, and the weird (but standard-defined) behaviour with respect to NaN equality is one good reason for this.
Nor is that inequality an oddity at all. If you were to think NaN should equal NaN, that thought would probably stem from the belief that NaN is a singular entity which is a misunderstanding of its purpose. NaN rather signifies a specific number that is not representable as a floating point. Two specific numbers that cannot be represented are not necessarily equal because they may have resulted from different calculations!
I'll add that, if I recall correctly, in R, the statement NaN == NaN evaluates to NA which basicall means "it is not known whether these numbers equal each other" which is a more reasonable result than False.
It's the only "primitive type" that does that. If I deserialize data from wire, I'll be very surprised when the same bits deserialize as unequal variables. If it cannot be represented, then throwing makes more sense than trying to represent it.
> "it is not known whether these numbers equal each other"
Equality, among other operations, are not defined for these inputs. NaN's really are a separate type of object embedded inside another objects value space. So you get the rare programmers gift of being able to construct a statement that is not always realizable based solely on the values of your inputs.
It is not an IEEE 754 oddity. It is the correct mathematical behavior.
When you define an order relation on a set, the order may be either a total order or a partial order.
In a totally ordered set, there are 3 values for a comparison operation: equal, less and greater. In a partially ordered set, there are 4 values for a comparison operation: equal, less, greater and unordered.
For a totally ordered set you can define 6 relational operators (6 = 2^3 - 2, where you subtract 2 for the always false and always true predicates), while for a partially ordered set you can define 14 relational operators (14 = 2^4 - 2).
For some weird reason, many programmers have not been taught properly about partially-ordered sets and also most programming languages do not define the 14 relational operators needed for partially ordered sets, but only the 6 relational operators that are sufficient for a totally ordered set.
It is easy to write all 14 relational operators by combinations of the symbols for not, less, greater and equal, so parsing this in a programming language would be easy.
This lack of awareness about partial order relations and the lack of support in most programming languages is very bad, because practical applications need very frequently partial orders instead of total orders.
For the floating-point numbers, the IEEE standard specifies 2 choices. You can either use them as a totally-ordered set, or as a partially-ordered set.
When you encounter NaNs as a programmer, that is because you have made the choice to have partially-ordered FP numbers, so you are not allowed to complain that this is an odd behavior, when you have chosen it. Most programmers do not make this choice consciously, because they just use the default configuration of the standard library, but it is still their fault if the default does not do what they like, but nonetheless they have not changed the default settings.
If you do not want NaNs, you must not mask the invalid operation exception. This is actually what the IEEE standard recommends as the default behavior, but lazy programmers do not want to handle exceptions, so most libraries choose to mask all exceptions in their default configurations.
When invalid operations generate exceptions, there are no NaNs and the FP numbers are totally ordered, so the 6 relational operators behave as naive programmers expect them to behave.
If you do not want to handle the invalid operation exception and you mask it, there is no other option for the CPU than to use a special value that reports an invalid operation, and which is indeed not-a-number. With not-numbers added to the set of FP numbers, the set becomes a partially-ordered set and all relational operators must be interpreted accordingly.
If you use something like C/C++, with only 6 relational operators, then you must do before any comparison tests to detect any NaN operand, because otherwise the relational operators do not do what you expect them to do.
In a language with 14 relational operators, you do not need to check for NaNs, but you must choose carefully the relational operator, because for a partially-ordered set, for example not-less is not the same with greater-or-equal (because not-less is the same with greater-or-equal-or-unordered).
If you do not expect to do invalid operations frequently, it may be simpler to unmask the exception, so that you will never have to do any test for NaN detection.
>With not-numbers added to the set of FP numbers, the set becomes a partially-ordered set and all relational operators must be interpreted accordingly.
The same not-number, produced by the same computation, occupying the same memory, is still not equal to itself. It is true that I haven't been able to brush up my knowledge on partial ordering, but isn't being identical is the same as being equal in math?
This particular defect is a Quality of Implementation problem that's cheerfully allowed by the standard.
Rust's standard sorts not only lack the API footgun which causes so many C++ programmers to blow their feet off this way, but even if you go out of your way to provide a nonsensical comparison both sorts are safe anyway because it's a safe language and yet they're also faster than any popular implementation of std::sort / std::stable_sort as appropriate.
It's kinda silly, as late as the Biden administration one of the popular C++ stdlib implementations wasn't even O(n log n) because apparently twenty years was not enough for them to have introduced an introsort yet... That one isn't the standard's fault, the C++ 11 standard does say you should provide an O(n log n) sort at least.
This got me curious, and yeah it turns out Elm's dictionary implementation uses values, not pointers when retrieving values.
elm repl
---- Elm 0.19.1 ----------------------------------------------------------------
Say :help for help and :exit to exit! More at <https://elm-lang.org/0.19.1/repl>
--------------------------------------------------------------------------------
> import Dict exposing (Dict)
> nan = 0/0
NaN : Float
> nan
NaN : Float
> nan == nan
False : Bool
> naan = 0/0
NaN : Float
> d = Dict.fromList [(nan, "a"), (naan, "b")]
Dict.fromList [(NaN,"a"),(NaN,"b")]
: Dict Float String
> Dict.toList d
[(NaN,"a"),(NaN,"b")]
: List ( Float, String )
> Dict.keys d
[NaN,NaN] : List Float
> Dict.get nan d
Nothing : Maybe String
> Dict.get naan d
Nothing : Maybe String
I guess because the hash of an instance stays consistent (which is used to retrieve the value from the dict). The `__eq__` method must disregard the hash and return False for all nans.
But the hash alone shouldn't be enough to match the key. Isn't an equality check also needed to avoid a false positive? That's the idea behind a hash table, as I understand it. (I'm not a Python programmer.)
> Last week in the Python Discord we had an unusual discussion about a Python oddity.
Oh, I missed it. But yes, this is more to do with NaN than Python.
> But, of course, you can't actually get to those values by their keys: ... That is, unless you stored the specific instance of nan as a variable:
Worth noting that sets work the same way here, although this was glossed over: you can store multiple NaNs in a set, but not the same NaN instance multiple times. Even though it isn't equal to itself, the insertion process (for both dicts and sets) will consider object identity before object equality:
>>> x = float('nan')
>>> {x for _ in range(10)}
{nan}
And, yes, the same is of course true of `collections.Counter`.
If I could change one thing in computing, it'd be how SQL handles NULL. But if I got a second thing, it'd be how IEEE handles NaN. I probably wouldn't even allow NaN as a representation. If some mathematical operation results in what would be NaN, I'd rather force the programming language to throw some sort of interrupt or exception. Much like what happens when you divide an integer by 0. Heck, I'd probably even stop infinity from being represented with floats. If someone did 1/0 or 0/0, I'd interrupt rather than generating an INF or NaN.
In my experience, INF and NaN are almost always an indicator of programming error.
If someone want's to programmatically represent those concepts, they could do it on top of and to the side of the floating point specification, not inside it.
People who criticize the IEEE standard typically have never read it.
Nobody forces you to use NaNs or to have partially-ordered floating-point numbers.
The default settings that have been recommended since the beginning by the standard were to not use NaNs and to have totally-ordered FP numbers.
For this choice, the invalid operation exception must not be masked.
However, it turns out that most programmers are lazy and do not want to bother to write an exception handler, so the standard libraries of C and of most other languages have chosen to mask by default this exception, which causes the generation of NaNs.
This bad default configuration in combination with the bad feature that most programming languages have only the 6 relational operators sufficient for a total order, instead of the 14 operators required for a partial order, makes necessary frequent tests for detecting whether a value is a NaN, to avoid surprising behavior in relational expressions.
I think that in most cases the extra checks for NaNs add up to much more work than writing at exception handler, so the better solution is to unmask the invalid operation exception, which makes the NaN problem go away.
IEEE 754 prescripes, for better or worse, that any mathematical comparison operator (==, <, > ….) involving at least one NaN must always return false, including comparison against itself. This is annoying for something like dictionaries or hashtables. C# has a solution: if you call a.Equals(b) on two floats a and b, it will return true also if both are NaN. I think this is a cool solution: it keeps the meaning of math operators the same identical with other languages, but you still have sensible behavior for containers. I believe this behavior is copied from Java.
I consider this as a very bad solution, because it can lead to very subtle bugs.
The correct solution for any programming language is to define all the 14 relational operators that are required by any partially-ordered set, instead of defining only the 6 of them that are sufficient for a totally-ordered set.
If the programming language fails to define all 14 operators, then you must always test the operands for NaNs, before using any of the 6 ALGOL relational operators. If you consider this tedious, then you must unmask the invalid operation exception and take care to handle this exception.
If invalid operations generate exceptions, then the floating-point numbers become a totally-ordered set and NaN cannot exist (if a NaN comes from an external source, it will also generate an exception, while internally no NaN will ever be generated).
This is also why Rust has separate PartialEq and Eq traits - the latter is only available for types that don't have weird not-self-equal values like floating point NaNs or SQL NULLs. If you lie to Rust and create a wrapper type over f32 or f64 that has Eq, then you'd get unindexable NaN keys that just sit in your hashmap forever.
The real surprise to me is that Python can index NaN keys sometimes, at least by reference to the original NaN. I knew CPython does some Weird Shit with primitive values, so I assume it's because the hashmap is comparing by reference first and then by value.
> unindexable NaN keys that just sit in your hashmap forever.
Although you shouldn't put nonsense in a HashMap, HashMap::drain will give you (an iterator to get) back everything you put into the HashMap even if it's nonsense and HashMap::retain will let you provide a predicate to throw away the nonsense if you want to keep everything else, while HashMap::extract_if even allows you to get the nonsense back out again to put that somewhere more sensible.
These work because they don't need to find your nonsense in particular using hashing, which may now be impossible, they're just looking at everything.
There are so many discussions about "X language is so weird about it handles numbers!" and it's just IEEE 754 floats.
See: https://docs.python.org/3/library/stdtypes.html#hashing-of-n...
I'll add that, if I recall correctly, in R, the statement NaN == NaN evaluates to NA which basicall means "it is not known whether these numbers equal each other" which is a more reasonable result than False.
Equality, among other operations, are not defined for these inputs. NaN's really are a separate type of object embedded inside another objects value space. So you get the rare programmers gift of being able to construct a statement that is not always realizable based solely on the values of your inputs.
IEEE-754 does remainder(5, 3) = -1, whereas Python does 5 % 3 = 2.
There's no reason to expect exact equivalence between operators.
When you define an order relation on a set, the order may be either a total order or a partial order.
In a totally ordered set, there are 3 values for a comparison operation: equal, less and greater. In a partially ordered set, there are 4 values for a comparison operation: equal, less, greater and unordered.
For a totally ordered set you can define 6 relational operators (6 = 2^3 - 2, where you subtract 2 for the always false and always true predicates), while for a partially ordered set you can define 14 relational operators (14 = 2^4 - 2).
For some weird reason, many programmers have not been taught properly about partially-ordered sets and also most programming languages do not define the 14 relational operators needed for partially ordered sets, but only the 6 relational operators that are sufficient for a totally ordered set.
It is easy to write all 14 relational operators by combinations of the symbols for not, less, greater and equal, so parsing this in a programming language would be easy.
This lack of awareness about partial order relations and the lack of support in most programming languages is very bad, because practical applications need very frequently partial orders instead of total orders.
For the floating-point numbers, the IEEE standard specifies 2 choices. You can either use them as a totally-ordered set, or as a partially-ordered set.
When you encounter NaNs as a programmer, that is because you have made the choice to have partially-ordered FP numbers, so you are not allowed to complain that this is an odd behavior, when you have chosen it. Most programmers do not make this choice consciously, because they just use the default configuration of the standard library, but it is still their fault if the default does not do what they like, but nonetheless they have not changed the default settings.
If you do not want NaNs, you must not mask the invalid operation exception. This is actually what the IEEE standard recommends as the default behavior, but lazy programmers do not want to handle exceptions, so most libraries choose to mask all exceptions in their default configurations.
When invalid operations generate exceptions, there are no NaNs and the FP numbers are totally ordered, so the 6 relational operators behave as naive programmers expect them to behave.
If you do not want to handle the invalid operation exception and you mask it, there is no other option for the CPU than to use a special value that reports an invalid operation, and which is indeed not-a-number. With not-numbers added to the set of FP numbers, the set becomes a partially-ordered set and all relational operators must be interpreted accordingly.
If you use something like C/C++, with only 6 relational operators, then you must do before any comparison tests to detect any NaN operand, because otherwise the relational operators do not do what you expect them to do.
In a language with 14 relational operators, you do not need to check for NaNs, but you must choose carefully the relational operator, because for a partially-ordered set, for example not-less is not the same with greater-or-equal (because not-less is the same with greater-or-equal-or-unordered).
If you do not expect to do invalid operations frequently, it may be simpler to unmask the exception, so that you will never have to do any test for NaN detection.
The same not-number, produced by the same computation, occupying the same memory, is still not equal to itself. It is true that I haven't been able to brush up my knowledge on partial ordering, but isn't being identical is the same as being equal in math?
[1] https://stackoverflow.com/questions/18291620/why-will-stdsor...
Rust's standard sorts not only lack the API footgun which causes so many C++ programmers to blow their feet off this way, but even if you go out of your way to provide a nonsensical comparison both sorts are safe anyway because it's a safe language and yet they're also faster than any popular implementation of std::sort / std::stable_sort as appropriate.
It's kinda silly, as late as the Biden administration one of the popular C++ stdlib implementations wasn't even O(n log n) because apparently twenty years was not enough for them to have introduced an introsort yet... That one isn't the standard's fault, the C++ 11 standard does say you should provide an O(n log n) sort at least.
Oh, I missed it. But yes, this is more to do with NaN than Python.
> But, of course, you can't actually get to those values by their keys: ... That is, unless you stored the specific instance of nan as a variable:
Worth noting that sets work the same way here, although this was glossed over: you can store multiple NaNs in a set, but not the same NaN instance multiple times. Even though it isn't equal to itself, the insertion process (for both dicts and sets) will consider object identity before object equality:
And, yes, the same is of course true of `collections.Counter`.In my experience, INF and NaN are almost always an indicator of programming error.
If someone want's to programmatically represent those concepts, they could do it on top of and to the side of the floating point specification, not inside it.
Nobody forces you to use NaNs or to have partially-ordered floating-point numbers.
The default settings that have been recommended since the beginning by the standard were to not use NaNs and to have totally-ordered FP numbers.
For this choice, the invalid operation exception must not be masked.
However, it turns out that most programmers are lazy and do not want to bother to write an exception handler, so the standard libraries of C and of most other languages have chosen to mask by default this exception, which causes the generation of NaNs.
This bad default configuration in combination with the bad feature that most programming languages have only the 6 relational operators sufficient for a total order, instead of the 14 operators required for a partial order, makes necessary frequent tests for detecting whether a value is a NaN, to avoid surprising behavior in relational expressions.
I think that in most cases the extra checks for NaNs add up to much more work than writing at exception handler, so the better solution is to unmask the invalid operation exception, which makes the NaN problem go away.
The correct solution for any programming language is to define all the 14 relational operators that are required by any partially-ordered set, instead of defining only the 6 of them that are sufficient for a totally-ordered set.
If the programming language fails to define all 14 operators, then you must always test the operands for NaNs, before using any of the 6 ALGOL relational operators. If you consider this tedious, then you must unmask the invalid operation exception and take care to handle this exception.
If invalid operations generate exceptions, then the floating-point numbers become a totally-ordered set and NaN cannot exist (if a NaN comes from an external source, it will also generate an exception, while internally no NaN will ever be generated).
The real surprise to me is that Python can index NaN keys sometimes, at least by reference to the original NaN. I knew CPython does some Weird Shit with primitive values, so I assume it's because the hashmap is comparing by reference first and then by value.
Although you shouldn't put nonsense in a HashMap, HashMap::drain will give you (an iterator to get) back everything you put into the HashMap even if it's nonsense and HashMap::retain will let you provide a predicate to throw away the nonsense if you want to keep everything else, while HashMap::extract_if even allows you to get the nonsense back out again to put that somewhere more sensible.
These work because they don't need to find your nonsense in particular using hashing, which may now be impossible, they're just looking at everything.