The subset relation is defined as "X is a subset of Y if for all x in X, x is in Y". The negation of this is "X is not a subset of Y if there exists some x in X, such that x is not in Y". We can see from this that the empty set is NOT NOT a subset of every set. Therefore the empty set is a subset of every set.
A similar argument should show that the empty string is a substring of every string. Therefore indexOf(substring, string) should return a non-negative index if substring is the empty string.
This reminds me of a discussion on HN a while ago about predicates over the empty set. There was debate over whether "all(x == "foo" for x in my_array)" should evaluate to True or False if my_array is empty. I tried to point out that if it evaluated to False, it would break first order logic in a similar way as I explained above, but people were still trying to argue the opposite case.
While an empty substring is a substring of every string, attempting to find its index in any string must signal an error, because there is no univocal value of such an index (all possible index values are equally valid answers).
On the other hand, a predicate that tests whether an empty substring is found in a string must indeed return "true".
In a function that must return a single value, an error must be signaled both when there is no value that can be returned and when there are multiple values that could be returned.
If the index of the substring is defined as the smallest such index, which makes it unique even for multiple substring occurrences, then the search function must return 0 or 1 for an empty substring, whichever is defined to be the first index.
Attempting to search an empty string for non-empty or empty substrings should always signal an error, because there are no valid index values that could be returned.
Yeah, I think the convention should be to return the smallest valid index such that the predicate is satisfied, or an error value if the predicate isn't satisfied. As you point out, it's the same as when a non-empty string appears multiple times in a given string. So I think 0 is the only reasonable result in this case, assuming 0-based indexing.
>While an empty substring is a substring of every string, attempting to find its index in any string must signal an error, because there is no univocal value of such an index (all possible index values are equally valid answers).
The operations in question are to find the first (IndexOf) and last (LastIndexOf) indexes. So if the empty exists at every index, you just return the first or last one.
>Attempting to search an empty string for non-empty or empty substrings should always signal an error
I believe that an empty string is valid input for each parameter and should return 'not found' in some form, rather than an error. It is splitting hairs a bit though.
> The subset relation is defined as "X is a subset of Y if for all x in X, x is in Y".
This might be a bit nit picky, but shouldn't that be "if and only if" or "iff" instead of just "if"?
> The negation of this is "X is not a subset of Y if there exists some x in X, such that x is not in Y". We can see from this that the empty set is NOT NOT a subset of every set.
In what follows I'm replacing lower case x with lower case e.
I'm kind of simple minded, so find that a bit confusing because of using the negated definition and a double not, and also "for all e in X" when X is the empty set because it makes you have to keep for the rest of the definition that e might be nonexistent.
This is clearer to me:
1. Rewrite the definition so it is in terms of all e, not all e in X: "X is a subset of Y if and only if for all e, e is in X implies e is in Y". This form of the definition only talks about elements that exist.
2. "e is in X implies e is in Y" is equivalent to "e is not in Y implies e is not in X", which gives us this equivalent definition: "X is a subset of Y if and only if for all e, e is not in Y implies e is not in X".
3. When X is the empty set it is always the case that e is not in X, and so "e is not in Y implies e is not in X" is always true (because both false implies true and true implies true are true), and so X is a subset of Y by the definition of subset.
Because the string you are trying to find is in some variable which at runtime in some cases for some reason happens to be an empty string and you don't (want to) special case this and just "let it work by itself"?
There are many cases where a user-provided string is passed to the string API. Generally the programmer should handle empty strings as a special case, but when they don't the API needs to return something.
That is certainly an approach. But wouldn't that make indexOf() and lastIndexof() different in cases you would intuitively expect them to be the same, e.g:
No, because “the index of a string within a string” would be consistently defined to always be the lower index (or, alternatively, always the upper index). There is no intrinsic relation to the search direction, since all characters of the search string have to be compared regardless of the direction. For example, when searching from the end, the search algorithm could start comparing at index `totalLength - searchStringLength`.
To further illustrate, it would be the index where, if the search string is removed from the found position, it would have to be inserted in order to revert to the original string.
This is only confusing if you're thinking of taking single elements from an array, rather than thinking in terms of slices of arrays. Slicing also makes it obvious why we must use half-open ranges and 0-based indexing.
The only minor gotcha is that, when doing a request for "find all non-overlapping" or "find next strictly after", you have to increment the search index by 1 even though the size of the match is 0. ("find all overlapping" or "find next starting after" always increment by 1 and ignore the size of the match).
do when denominator == 0, or what should a modern version of
int atoi(const char* s)
do when s == "foo"?
I.e., the answer is that the parameter lies outside the domain of well-supported values, so ideally the callee should somehow help the programmer realize they have a programming error.
If I were implementing this in a new language, I'd lean toward making a needle of the empty string throw an error. When the feature request inevitably comes in to make it return a specific value, then I'd have usage information to guide the implementation. :)
The most appropriate behavior is implemented by regex.
Find the empty string and replace with say
`($&)`
where
`$&`
refers to the captured variable, and with input
`xyz`,
the result is
`()x()y()z()`
With an empty input, the result is
`()`.
The result of indexOf can consistently return the first index of the first half open subrange in a string. It just so happens that the first subrange of an empty string in an empty string is [0, 0).
The limiting case is also an interesting way to look at it:
A similar argument should show that the empty string is a substring of every string. Therefore indexOf(substring, string) should return a non-negative index if substring is the empty string.
This reminds me of a discussion on HN a while ago about predicates over the empty set. There was debate over whether "all(x == "foo" for x in my_array)" should evaluate to True or False if my_array is empty. I tried to point out that if it evaluated to False, it would break first order logic in a similar way as I explained above, but people were still trying to argue the opposite case.
On the other hand, a predicate that tests whether an empty substring is found in a string must indeed return "true".
In a function that must return a single value, an error must be signaled both when there is no value that can be returned and when there are multiple values that could be returned.
If the index of the substring is defined as the smallest such index, which makes it unique even for multiple substring occurrences, then the search function must return 0 or 1 for an empty substring, whichever is defined to be the first index.
Attempting to search an empty string for non-empty or empty substrings should always signal an error, because there are no valid index values that could be returned.
The operations in question are to find the first (IndexOf) and last (LastIndexOf) indexes. So if the empty exists at every index, you just return the first or last one.
>Attempting to search an empty string for non-empty or empty substrings should always signal an error
I believe that an empty string is valid input for each parameter and should return 'not found' in some form, rather than an error. It is splitting hairs a bit though.
Deleted Comment
This might be a bit nit picky, but shouldn't that be "if and only if" or "iff" instead of just "if"?
> The negation of this is "X is not a subset of Y if there exists some x in X, such that x is not in Y". We can see from this that the empty set is NOT NOT a subset of every set.
In what follows I'm replacing lower case x with lower case e.
I'm kind of simple minded, so find that a bit confusing because of using the negated definition and a double not, and also "for all e in X" when X is the empty set because it makes you have to keep for the rest of the definition that e might be nonexistent.
This is clearer to me:
1. Rewrite the definition so it is in terms of all e, not all e in X: "X is a subset of Y if and only if for all e, e is in X implies e is in Y". This form of the definition only talks about elements that exist.
2. "e is in X implies e is in Y" is equivalent to "e is not in Y implies e is not in X", which gives us this equivalent definition: "X is a subset of Y if and only if for all e, e is not in Y implies e is not in X".
3. When X is the empty set it is always the case that e is not in X, and so "e is not in Y implies e is not in X" is always true (because both false implies true and true implies true are true), and so X is a subset of Y by the definition of subset.
Will users do it anyway: Yes.
We have to handle all possible values the user can provide.
So it should return 0. LastIndexOf too, because the haystack is empty.
For nonempty haystack it should be
Also with your invariant types are different. Not "abc".indexOf("b") but "abc".indexOf('b').
This isn’t necessarily true if you think of positions as where a vertical-bar cursor would be, that is, between characters. In
"|"
where the double quotes delimit an empty string, the vertical bar (intended to represent the cursor here) has a position between those delimiters.
0-based indexOf "a" in "a" would be 0
0-based lastIndexOf "a" in "a" would be 1
?
To further illustrate, it would be the index where, if the search string is removed from the found position, it would have to be inserted in order to revert to the original string.
This is only confusing if you're thinking of taking single elements from an array, rather than thinking in terms of slices of arrays. Slicing also makes it obvious why we must use half-open ranges and 0-based indexing.
The only minor gotcha is that, when doing a request for "find all non-overlapping" or "find next strictly after", you have to increment the search index by 1 even though the size of the match is 0. ("find all overlapping" or "find next starting after" always increment by 1 and ignore the size of the match).
I.e., the answer is that the parameter lies outside the domain of well-supported values, so ideally the callee should somehow help the programmer realize they have a programming error.
Find the empty string and replace with say
where refers to the captured variable, and with input the result is With an empty input, the result is The result of indexOf can consistently return the first index of the first half open subrange in a string. It just so happens that the first subrange of an empty string in an empty string is [0, 0).The limiting case is also an interesting way to look at it: