In a way this is an argument for languages where it's normal to have result-types [0] or exceptions, instead of -1 etc. It just feels wrong to have any potential collision or confusion between normal results and error results.
1. But, Python has exceptions! The underlying C language doesn't, but Python's run-time has them and can use them in the C code.
2. It may be an argument for ensuring that absolutely everything that is an object can hash: the object hasher must not have error states. Nobody wants the overhead of a simple hash code being wrapped in a result type.
Not really, I don’t think. Python code doesn’t ever really use hash() for anything where specific outputs are expected. Even if hash(a)==hash(b), it’s not implied that a==b or a is b.
But -2 seems like just a bad choice here. -1 and -2 are very "common" integers. Seems they could have just picked a quasi-random large integer to reduce the likelihood of collision. I expect hash functions to have collisions, but I expect it to be "unlikely" in a way that scales with the size of the hash.
Somewhat related, hash() returns a result % sys.hash_info.modulus, so if you build a set or a dict with keys multiple of sys.hash_info.modulus, you run into quadratic behavior.
>>> {i * sys.hash_info.modulus for i in range(5000)}
0.40s
>>> {i * sys.hash_info.modulus for i in range(50000)}
29.34s
>>> {i for i in range(50000)}
0.06s
The fact that a C function really only has its return value for communicating success or error status is why most fallible C functions use mutable parameters, also known as output arguments, for their result value or values. This is like, elementary C program design.
In this case however it can be partly justified as a space saving mechanism, as that hash has to be also cached in the object. It's like Rust's `Option<NonZeroU64>` done manually.
What I’m saying is either the value or the error can be discarded. If it was an error, the exception mechanism can be used since at that point you’re in the Python wrapper. It would probably be a little slower though, but Python is already slow. That way the error doesn’t need to be in the API at all.
There could be an issue whereby some attacker controls the hash inputs which happen to be integers, and chooses a sequence which collides to the same hash value. It would be easy to identify such a sequence.
It just doesn't seem to have been identified as a real isssue, probably because the most commonly hashed external data is character strings. (For those, there are countermeasures like a properly scrambled hashing function, modified by a seed that can be randomized.)
> So, in C, when you design a function, and you want your function to be able to indicate “there was an error”, it has to return that error as its return value.
Well, you can also use an errno-like system. It has its own set of drawbacks as well, but it removes the "I need to reserve a sentinel value" problem.
Functions using errno typically still use a return value of -1 to indicate an error has occurred, then errno just stores the specific error code. Otherwise, how do you even know that you need to check errno?
There are examples of such an ambiguity in the ISO C library itself.
In those situations, the application must begin by clearing errno to zero.
Then, it checks for a certain error value or values from the function which are ambiguous: they could be legit or indicate an error.
If one of those values occurs, and if errno is nonzero, then the error occurred.
This is how you deal with, for instance the strtol (string to long int) function. If there is a range error, strtol returns LONG_MIN or LONG_MAX. Those are also valid values in the range of long, but when no error has occurred, they are produced without errno being touched.
strtol can also return 0 in another error case, when the input is such that no conversion can be performed. ISO C doesn't require errno to be set to anything in this case, unfortunately. The case is distinguished from a legitimate zero by the original pointer to the string being stored in *endptr (if the caller specifies endptr that is not null).
ISO C and POSIX library functions do not reset errno to zero. They either leave it alone or set it to a nonzero value.
If you need to use the above trick and are working inside a C library function, you have to save the original errno value before storing a zero in it, and then put that value back if no error has happened.
[0] https://en.wikipedia.org/wiki/Result_type
https://en.wikipedia.org/wiki/Semipredicate_problem
(I’m only mostly joking)
2. It may be an argument for ensuring that absolutely everything that is an object can hash: the object hasher must not have error states. Nobody wants the overhead of a simple hash code being wrapped in a result type.
It just doesn't seem to have been identified as a real isssue, probably because the most commonly hashed external data is character strings. (For those, there are countermeasures like a properly scrambled hashing function, modified by a seed that can be randomized.)
Deleted Comment
Well, you can also use an errno-like system. It has its own set of drawbacks as well, but it removes the "I need to reserve a sentinel value" problem.
In those situations, the application must begin by clearing errno to zero.
Then, it checks for a certain error value or values from the function which are ambiguous: they could be legit or indicate an error.
If one of those values occurs, and if errno is nonzero, then the error occurred.
This is how you deal with, for instance the strtol (string to long int) function. If there is a range error, strtol returns LONG_MIN or LONG_MAX. Those are also valid values in the range of long, but when no error has occurred, they are produced without errno being touched.
strtol can also return 0 in another error case, when the input is such that no conversion can be performed. ISO C doesn't require errno to be set to anything in this case, unfortunately. The case is distinguished from a legitimate zero by the original pointer to the string being stored in *endptr (if the caller specifies endptr that is not null).
ISO C and POSIX library functions do not reset errno to zero. They either leave it alone or set it to a nonzero value.
If you need to use the above trick and are working inside a C library function, you have to save the original errno value before storing a zero in it, and then put that value back if no error has happened.
This is the kind of crap ignorant C developers do and then pretend is the language's fault.
You can jump, you can return a reference to a complex type, a tuple, whatever.