Readit News logoReadit News
fasterik · 2 years ago
The subset relation is defined as "X is a subset of Y if for all x in X, x is in Y". The negation of this is "X is not a subset of Y if there exists some x in X, such that x is not in Y". We can see from this that the empty set is NOT NOT a subset of every set. Therefore the empty set is a subset of every set.

A similar argument should show that the empty string is a substring of every string. Therefore indexOf(substring, string) should return a non-negative index if substring is the empty string.

This reminds me of a discussion on HN a while ago about predicates over the empty set. There was debate over whether "all(x == "foo" for x in my_array)" should evaluate to True or False if my_array is empty. I tried to point out that if it evaluated to False, it would break first order logic in a similar way as I explained above, but people were still trying to argue the opposite case.

adrian_b · 2 years ago
While an empty substring is a substring of every string, attempting to find its index in any string must signal an error, because there is no univocal value of such an index (all possible index values are equally valid answers).

On the other hand, a predicate that tests whether an empty substring is found in a string must indeed return "true".

In a function that must return a single value, an error must be signaled both when there is no value that can be returned and when there are multiple values that could be returned.

If the index of the substring is defined as the smallest such index, which makes it unique even for multiple substring occurrences, then the search function must return 0 or 1 for an empty substring, whichever is defined to be the first index.

Attempting to search an empty string for non-empty or empty substrings should always signal an error, because there are no valid index values that could be returned.

fasterik · 2 years ago
Yeah, I think the convention should be to return the smallest valid index such that the predicate is satisfied, or an error value if the predicate isn't satisfied. As you point out, it's the same as when a non-empty string appears multiple times in a given string. So I think 0 is the only reasonable result in this case, assuming 0-based indexing.
hermitcrab · 2 years ago
>While an empty substring is a substring of every string, attempting to find its index in any string must signal an error, because there is no univocal value of such an index (all possible index values are equally valid answers).

The operations in question are to find the first (IndexOf) and last (LastIndexOf) indexes. So if the empty exists at every index, you just return the first or last one.

>Attempting to search an empty string for non-empty or empty substrings should always signal an error

I believe that an empty string is valid input for each parameter and should return 'not found' in some form, rather than an error. It is splitting hairs a bit though.

Deleted Comment

tzs · 2 years ago
> The subset relation is defined as "X is a subset of Y if for all x in X, x is in Y".

This might be a bit nit picky, but shouldn't that be "if and only if" or "iff" instead of just "if"?

> The negation of this is "X is not a subset of Y if there exists some x in X, such that x is not in Y". We can see from this that the empty set is NOT NOT a subset of every set.

In what follows I'm replacing lower case x with lower case e.

I'm kind of simple minded, so find that a bit confusing because of using the negated definition and a double not, and also "for all e in X" when X is the empty set because it makes you have to keep for the rest of the definition that e might be nonexistent.

This is clearer to me:

1. Rewrite the definition so it is in terms of all e, not all e in X: "X is a subset of Y if and only if for all e, e is in X implies e is in Y". This form of the definition only talks about elements that exist.

2. "e is in X implies e is in Y" is equivalent to "e is not in Y implies e is not in X", which gives us this equivalent definition: "X is a subset of Y if and only if for all e, e is not in Y implies e is not in X".

3. When X is the empty set it is always the case that e is not in X, and so "e is not in Y implies e is not in X" is always true (because both false implies true and true implies true are true), and so X is a subset of Y by the definition of subset.

bryancoxwell · 2 years ago
What reason could there possibly be for looking for an empty string?
tharkun__ · 2 years ago
Because the string you are trying to find is in some variable which at runtime in some cases for some reason happens to be an empty string and you don't (want to) special case this and just "let it work by itself"?
fasterik · 2 years ago
There are many cases where a user-provided string is passed to the string API. Generally the programmer should handle empty strings as a special case, but when they don't the API needs to return something.
hermitcrab · 2 years ago
Is it a sensible thing to do: No.

Will users do it anyway: Yes.

We have to handle all possible values the user can provide.

ajuc · 2 years ago
Avoiding the need to special-case empty input.
ajuc · 2 years ago
The invariant is:

     haystack.substring(haystack.indexOf(needle), needle.length()) == needle
And indexOf should return lowest such number.

So it should return 0. LastIndexOf too, because the haystack is empty.

For nonempty haystack it should be

    "Abc".indexOf("") == 0
    "Abc".lastIndexOf("") == 3

l0b0 · 2 years ago
I'd rather have this invariant:

  haystack[haystack.indexOf(needle)] == needle
That is, there is no index of the empty string within a string. A nice thing about this is that it also works for non-character arrays:

  "abc".indexOf("") raises "No such entry exception"
  ["", "abc", ""].indexOf("") == 0

ajuc · 2 years ago
This invariant only works for 1-character needles. So no

    "Lorem Ipsum".indexOf("Ipsum")
Which is the main purpose of this function.

Also with your invariant types are different. Not "abc".indexOf("b") but "abc".indexOf('b').

layer8 · 2 years ago
> there isn’t a first position in the string

This isn’t necessarily true if you think of positions as where a vertical-bar cursor would be, that is, between characters. In

"|"

where the double quotes delimit an empty string, the vertical bar (intended to represent the cursor here) has a position between those delimiters.

hermitcrab · 2 years ago
That is certainly an approach. But wouldn't that make indexOf() and lastIndexof() different in cases you would intuitively expect them to be the same, e.g:

0-based indexOf "a" in "a" would be 0

0-based lastIndexOf "a" in "a" would be 1

?

layer8 · 2 years ago
No, because “the index of a string within a string” would be consistently defined to always be the lower index (or, alternatively, always the upper index). There is no intrinsic relation to the search direction, since all characters of the search string have to be compared regardless of the direction. For example, when searching from the end, the search algorithm could start comparing at index `totalLength - searchStringLength`.

To further illustrate, it would be the index where, if the search string is removed from the found position, it would have to be inserted in order to revert to the original string.

hurril · 2 years ago
It's 0 and this fact is sometimes referred to as epsilon when we talk about parsing.
o11c · 2 years ago
0, obviously. It's the only consistent choice.

This is only confusing if you're thinking of taking single elements from an array, rather than thinking in terms of slices of arrays. Slicing also makes it obvious why we must use half-open ranges and 0-based indexing.

The only minor gotcha is that, when doing a request for "find all non-overlapping" or "find next strictly after", you have to increment the search index by 1 even though the size of the match is 0. ("find all overlapping" or "find next starting after" always increment by 1 and ignore the size of the match).

CoastalCoder · 2 years ago
To me, this seems similar to asking what should

  int divide(int numerator, int denominator)
do when denominator == 0, or what should a modern version of

  int atoi(const char* s)
do when s == "foo"?

I.e., the answer is that the parameter lies outside the domain of well-supported values, so ideally the callee should somehow help the programmer realize they have a programming error.

fwip · 2 years ago
If I were implementing this in a new language, I'd lean toward making a needle of the empty string throw an error. When the feature request inevitably comes in to make it return a specific value, then I'd have usage information to guide the implementation. :)
arithma · 2 years ago
The most appropriate behavior is implemented by regex.

Find the empty string and replace with say

  `($&)` 
where

  `$&`
refers to the captured variable, and with input

  `xyz`,
the result is

  `()x()y()z()`
With an empty input, the result is

  `()`.
The result of indexOf can consistently return the first index of the first half open subrange in a string. It just so happens that the first subrange of an empty string in an empty string is [0, 0).

The limiting case is also an interesting way to look at it:

  ()a()b()c()
  ()a()b()
  ()a()
  ()