Why is hash(-1) == hash(-2) in Python?

Terr_ · a year ago

In a way this is an argument for languages where it's normal to have result-types [0] or exceptions, instead of -1 etc. It just feels wrong to have any potential collision or confusion between normal results and error results.

[0] https://en.wikipedia.org/wiki/Result_type

nayuki · a year ago

In other words, this is a case of https://en.wikipedia.org/wiki/In-band_signaling versus out-of-band signaling https://en.wikipedia.org/wiki/Signaling_(telecommunications)... .

EricRiese · a year ago

Clears throat the precise wiki page you're looking for is the Semipredicate Problem

https://en.wikipedia.org/wiki/Semipredicate_problem

kazinator · a year ago

-1 is in-band signaling; but it steals a value.

kevinventullo · a year ago

We all know that types are not Pythonic.

(I’m only mostly joking)

folkrav · a year ago

Ackchyually, Python has pretty strong typing, as far as dynamic languages go.

kazinator · a year ago

1. But, Python has exceptions! The underlying C language doesn't, but Python's run-time has them and can use them in the C code.

2. It may be an argument for ensuring that absolutely everything that is an object can hash: the object hasher must not have error states. Nobody wants the overhead of a simple hash code being wrapped in a result type.

kstrauser · a year ago

Not really, I don’t think. Python code doesn’t ever really use hash() for anything where specific outputs are expected. Even if hash(a)==hash(b), it’s not implied that a==b or a is b.

tmvphil · a year ago

But -2 seems like just a bad choice here. -1 and -2 are very "common" integers. Seems they could have just picked a quasi-random large integer to reduce the likelihood of collision. I expect hash functions to have collisions, but I expect it to be "unlikely" in a way that scales with the size of the hash.

TRiG_Ireland · a year ago

Yes, but having result types would avoid the need for this special casing.

vhcr · a year ago

Somewhat related, hash() returns a result % sys.hash_info.modulus, so if you build a set or a dict with keys multiple of sys.hash_info.modulus, you run into quadratic behavior.

    >>> {i * sys.hash_info.modulus for i in range(5000)}
    0.40s
    >>> {i * sys.hash_info.modulus for i in range(50000)}
    29.34s
    >>> {i for i in range(50000)}
    0.06s

nneonneo · a year ago

Naturally, this extends to custom types:

  >>> class X:
  ...     def __hash__(self): return -1
  ... 
  >>> hash(X())
  -2

ks2048 · a year ago

The next k for which hash(k) != k: 2**61-1

    print(hash(2**61 - 2))
    2305843009213693950

    print(hash(2**61 - 1))
    0

binary132 · a year ago

The fact that a C function really only has its return value for communicating success or error status is why most fallible C functions use mutable parameters, also known as output arguments, for their result value or values. This is like, elementary C program design.

lifthrasiir · a year ago

In this case however it can be partly justified as a space saving mechanism, as that hash has to be also cached in the object. It's like Rust's `Option<NonZeroU64>` done manually.

binary132 · a year ago

What I’m saying is either the value or the error can be discarded. If it was an error, the exception mechanism can be used since at that point you’re in the Python wrapper. It would probably be a little slower though, but Python is already slow. That way the error doesn’t need to be in the API at all.

daxfohl · a year ago

Another question is why do ints largely hash to themselves? Most hashing guides recommend hashes be far more distributed IIUC.

kazinator · a year ago

There could be an issue whereby some attacker controls the hash inputs which happen to be integers, and chooses a sequence which collides to the same hash value. It would be easy to identify such a sequence.

It just doesn't seem to have been identified as a real isssue, probably because the most commonly hashed external data is character strings. (For those, there are countermeasures like a properly scrambled hashing function, modified by a seed that can be randomized.)

woodruffw · a year ago

Python's `__hash__` used primarily for hash map bucketing. In that context, the identity of an integer is a perfectly cromulent bucketing value.

Deleted Comment

zarzavat · a year ago

If your keys are multiples of the table size then all your keys will hash to the same bucket, in a naive hash table at least.

Galanwe · a year ago

> So, in C, when you design a function, and you want your function to be able to indicate “there was an error”, it has to return that error as its return value.

Well, you can also use an errno-like system. It has its own set of drawbacks as well, but it removes the "I need to reserve a sentinel value" problem.

secondcoming · a year ago

Out params are thing too.

sgerenser · a year ago

Functions using errno typically still use a return value of -1 to indicate an error has occurred, then errno just stores the specific error code. Otherwise, how do you even know that you need to check errno?

kazinator · a year ago

There are examples of such an ambiguity in the ISO C library itself.

In those situations, the application must begin by clearing errno to zero.

Then, it checks for a certain error value or values from the function which are ambiguous: they could be legit or indicate an error.

If one of those values occurs, and if errno is nonzero, then the error occurred.

This is how you deal with, for instance the strtol (string to long int) function. If there is a range error, strtol returns LONG_MIN or LONG_MAX. Those are also valid values in the range of long, but when no error has occurred, they are produced without errno being touched.

strtol can also return 0 in another error case, when the input is such that no conversion can be performed. ISO C doesn't require errno to be set to anything in this case, unfortunately. The case is distinguished from a legitimate zero by the original pointer to the string being stored in *endptr (if the caller specifies endptr that is not null).

ISO C and POSIX library functions do not reset errno to zero. They either leave it alone or set it to a nonzero value.

If you need to use the above trick and are working inside a C library function, you have to save the original errno value before storing a zero in it, and then put that value back if no error has happened.

BenjiWiebe · a year ago

You could always check errno and reset it to zero before any function call.

moralestapia · a year ago

>it has to return that error as its return value.

This is the kind of crap ignorant C developers do and then pretend is the language's fault.

You can jump, you can return a reference to a complex type, a tuple, whatever.

kragen · a year ago

Also you can longjmp().

TRiG_Ireland · a year ago

I enjoyed this, because it seems to be written by an interested and interesting human, at a level that I could actually follow.