If at all possible, use a union all rather than a plain union to avoid an extra sort / unique node.
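As a minimal sketch of that rewrite (table, columns, and data invented for the demo; SQLite here, but the shape is the same in Postgres): the second branch must exclude rows already matched by the first, since UNION ALL, unlike UNION, does not de-duplicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, channel TEXT);
    CREATE INDEX idx_author ON posts(author);
    CREATE INDEX idx_channel ON posts(channel);
    INSERT INTO posts VALUES
        (1, 'alice', 'general'), (2, 'bob', 'general'),
        (3, 'alice', 'random'),  (4, 'carol', 'dev');
""")

# OR form: on Postgres this is at best a BitmapOr over both indexes.
or_form = conn.execute("""
    SELECT id FROM posts WHERE author = 'alice' OR channel = 'general'
    ORDER BY id
""").fetchall()

# UNION ALL form: each branch can use its own index; the second branch
# excludes rows the first already returned, so no duplicates are produced.
union_form = conn.execute("""
    SELECT id FROM posts WHERE author = 'alice'
    UNION ALL
    SELECT id FROM posts WHERE channel = 'general' AND author <> 'alice'
    ORDER BY id
""").fetchall()

assert or_form == union_form  # both return ids 1, 2, 3
```

Note the `author <> 'alice'` guard assumes `author` is NOT NULL; with nullable columns the exclusion needs `IS DISTINCT FROM` on Postgres.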
I've used that OR > UNION ALL trick a number of times to vastly improve performance on specific queries. It's crazy how much of an effect it can have.
I wish Postgres would implement a planner optimization to automatically run queries with an OR more efficiently (e.g. use the same plan as the equivalent UNION [ALL] where possible).
I have a vague memory of this exact optimisation being one of the things Hibernate did automagically, about 10 years ago. I remember doing a fair bit of reading to figure out how & why it was so much faster.
I was able to implement this in the ORM layer of Django through a function that would inspect the query and split it if it had an OR under certain conditions.
While OR can be a performance killer, in other cases it works perfectly fine, or sometimes even better than the equivalent UNION ALL query. In this case the problem is that they are applying a LIMIT with ORDER BY, so it can’t do the usual BitmapOr optimization, and thus the index ends up being used purely for sorting purposes.
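When LIMIT with ORDER BY is what defeats BitmapOr, one common workaround (sketched here with an invented minimal schema; this is a generic pattern, not necessarily the article's exact fix) is to order and limit each UNION ALL branch separately, then merge and re-limit, so each branch can be a short ordered index scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (createat INTEGER, id TEXT, PRIMARY KEY (createat, id));
    INSERT INTO posts VALUES
        (100, 'a'), (100, 'b'), (100, 'c'), (200, 'd'), (300, 'e');
""")

cursor_createat, cursor_id, page = 100, 'a', 2
rows = conn.execute("""
    SELECT * FROM (
        SELECT createat, id FROM posts      -- ties on the cursor timestamp
        WHERE createat = ?1 AND id > ?2
        ORDER BY createat, id LIMIT ?3
    )
    UNION ALL
    SELECT * FROM (
        SELECT createat, id FROM posts      -- everything strictly newer
        WHERE createat > ?1
        ORDER BY createat, id LIMIT ?3
    )
    ORDER BY createat, id LIMIT ?3          -- merge the two small result sets
""", (cursor_createat, cursor_id, page)).fetchall()

assert rows == [(100, 'b'), (100, 'c')]
```

Each branch produces at most one page of rows, so the final sort is over at most two pages rather than the whole table.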
+1. If you really understand your queries (which you should, if they are complicated enough that you don't already use an ORM), then my favorite set of "tricks" is CTEs to collect info and convert/translate as necessary, then unions to merge/filter said data from the CTEs. Add aggregate/window functions to this data, and you are good to go.
You can really do some complicated pipelining of data cleaning and filtering this way, much faster than resorting to a whole bunch of complicated inline SQL work in the clauses.
[edit] I should add, learn how to read an explain plan, this is vital for understanding and improving your queries.
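As a toy sketch of that pipeline style (all table and column names invented, and SQLite 3.25+ standing in for Postgres): collect and normalize in CTEs, merge with a union, then layer a window function on top:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_eu (amount_eur REAL);
    CREATE TABLE orders_us (amount_usd REAL);
    INSERT INTO orders_eu VALUES (10.0), (20.0);
    INSERT INTO orders_us VALUES (30.0);
""")

rows = conn.execute("""
    WITH eu AS (                      -- collect + convert one source
        SELECT amount_eur * 1.1 AS amount_usd, 'eu' AS region FROM orders_eu
    ),
    us AS (                           -- second source, already in USD
        SELECT amount_usd, 'us' AS region FROM orders_us
    ),
    merged AS (                       -- merge the cleaned data
        SELECT * FROM eu UNION ALL SELECT * FROM us
    )
    SELECT region, amount_usd,
           SUM(amount_usd) OVER () AS total   -- window fn over the pipeline
    FROM merged
    ORDER BY amount_usd
""").fetchall()

assert len(rows) == 3
assert abs(rows[0][2] - 63.0) < 1e-6  # 10*1.1 + 20*1.1 + 30 = 63
```

Each stage stays inspectable on its own (you can SELECT from any CTE while debugging), which is a big part of why this reads better than the same logic inlined into one WHERE clause.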
This seems like the type of thing that a sophisticated query planner in 2024 would be able to figure out on its own? Maybe as a non database expert I'm expecting too much?
There were attempts to do this automatically ~7 years ago (https://www.postgresql.org/message-id/flat/7f70bd5a-5d16-e05...). It didn't actually make it in due to worries about the approach and properly de-duplicating the final resultset, as well as that approach to the optimization not playing well with duplicate inner-join removal optimizations (which were also being discussed at the time).
FTA: “Why the query planner doesn’t automatically convert a condition like x > a OR (x == a AND y > b) to (x, y) > (a, b) is something I still don’t understand to this day, though.”
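That equivalence holds on non-null values and can be spot-checked exhaustively over a small domain; a sketch using row values in SQLite (3.15+), which follow the same left-to-right comparison rule as Postgres:

```python
import sqlite3
from itertools import product

conn = sqlite3.connect(":memory:")
# Compare the expanded OR form with the lexicographic row comparison
# for every combination of small non-null integers.
for x, y, a, b in product(range(3), repeat=4):
    or_form = conn.execute(
        "SELECT ? > ? OR (? = ? AND ? > ?)", (x, a, x, a, y, b)
    ).fetchone()[0]
    row_form = conn.execute(
        "SELECT (?, ?) > (?, ?)", (x, y, a, b)
    ).fetchone()[0]
    assert or_form == row_form
```

With NULLs in play the two forms are no longer trivially interchangeable (see the evaluation-order discussion below), which may be part of why the planner doesn't do this rewrite on its own.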
Agree it's well-known, in a slightly snobbish sort of way, but it could do with being more widely known for sure; it is a bit of a gotcha. IMO TFA is quite a nice tutorial on how they went about it: investigating the issue, determining the problem, etc., perhaps especially if the answer as you wrote it isn't already somewhat familiar. The explain-analyze, err, explainer is quite nice. Not reference material, but as an intro: hey, this is how you might be able to debug a similar problem you encounter.
The article felt like it was fumbling around when the initial explain pointed right at the issue. I didn’t know this specific trick, so I did learn something I guess.
That looks like a trivial optimization / transformation. I am still quite puzzled: so much work has gone into compilers, static code analysis, and profile-guided compilation (which comes naturally with JITs, too), yet the query optimizers of pretty much any database look rudimentary and unreliable. Sometimes I think they got stuck in the '90s and that's it.
That is a bit wrong, and I'm pointing it out because your summary didn't make logical sense to me, so I had to read the blog post after reading your comment.
The query should be CreateAt > $1 OR (CreateAt = $1 AND Id > $2), i.e. with $1 in the equality (not $2 like in your sample).
So the idea is here about paginating through posts that might be constantly changing so you can't use a simple offset, as that would give you duplications along the way. So you try to use CreateAt, but it could be possible that CreateAt is equal to another one so you fallback to ID.
But here I stopped reading the blog post, because I now wondered: why not use Id in the first place? It also seems to be auto-increment, since otherwise you couldn't really rely on it as a fallback like this. I don't have time to investigate it further, but tl;dr: it still left me confused why they don't use Id in the first place.
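For context, the cursor scheme being described is keyset ("seek") pagination; a sketch with an invented minimal schema, using the tuple form of the filter (SQLite row values stand in for Postgres's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (createat INTEGER, id TEXT, PRIMARY KEY (createat, id));
    INSERT INTO posts VALUES
        (100, 'b'), (100, 'a'), (100, 'c'), (200, 'a'), (300, 'z');
""")

def fetch_page(cursor, page_size=2):
    # (createat, id) > (?, ?) is the tuple form of
    # createat > ? OR (createat = ? AND id > ?)
    return conn.execute("""
        SELECT createat, id FROM posts
        WHERE (createat, id) > (?, ?)
        ORDER BY createat, id
        LIMIT ?
    """, (*cursor, page_size)).fetchall()

seen, cursor = [], (0, '')
while page := fetch_page(cursor):
    seen.extend(page)
    cursor = page[-1]   # resume exactly after the last row of the page

assert seen == [(100, 'a'), (100, 'b'), (100, 'c'), (200, 'a'), (300, 'z')]
```

Each page resumes strictly after the last (createat, id) seen, so concurrent inserts can't shift the window the way they would with an OFFSET cursor.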
It seems like the code is used both for indexing from scratch and for maintaining an existing index. In the case of maintaining an existing index you need to know where to start back up and Postgres doesn't guarantee that SERIAL primary keys are always in chronological order when there are multiple simultaneous sessions.
Ha, that's a good point. Devil's advocate though (and perhaps they changed it slightly for the example in the blog post): it could be that 'CreateAt' is more like 'PublishedAt', i.e. it doesn't include a possible draft period, but then Id (which does correspond to the actual initial record creation time) is as good as anything to disambiguate them, because it's actually arbitrary and only needs to be consistent at that point.
[edit] The CreateAt timestamp could be something from the client when the post is submitted, or tagged by the ingest server, and not when the posts actually are processed and hit the database.
Not at all surprised that Laurenz Albe played a key role in figuring this out -- I've learned so much about postgres from his Stack Overflow answers over the years.
I generate queries like this in a paging library I wrote at work. I did not use the tuple comparison because the sort order sometimes differs between fields, so each field in the tuple comparison would need a different operator, such as:
WHERE Posts.CreateAt > ?1 OR (Posts.CreateAt = ?1 AND Posts.Id < ?2) ORDER BY Posts.CreateAt ASC, Posts.Id DESC
I now wonder if it might be possible to use a negation operator to invert the sorting order, like:
(Posts.CreateAt, -Posts.Id) > (?1, -?2) ORDER BY Posts.CreateAt ASC, Posts.Id DESC
and keep the performance improvement.
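The negation idea does work when the descending column is numeric (it wouldn't apply to a text Id like the article's); a quick exhaustive check in SQLite over small integers:

```python
import sqlite3
from itertools import product

conn = sqlite3.connect(":memory:")
# For ORDER BY c ASC, n DESC with cursor (c0, n0), the mixed-direction
# filter  c > c0 OR (c = c0 AND n < n0)  should equal the single row
# comparison  (c, -n) > (c0, -n0)  once the DESC column is sign-flipped.
for c, n, c0, n0 in product(range(3), repeat=4):
    or_form = conn.execute(
        "SELECT ? > ? OR (? = ? AND ? < ?)", (c, c0, c, c0, n, n0)
    ).fetchone()[0]
    neg_form = conn.execute(
        "SELECT (?, -?) > (?, -?)", (c, n, c0, n0)
    ).fetchone()[0]
    assert or_form == neg_form
```

The sign flip is needed at all because a row comparison can't mix ASC and DESC directions across its elements.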
A UNION allows you to use two separate indexes and can speed up queries.
There have been more recent attempts to improve OR clause performance in another thread: https://www.postgresql.org/message-id/flat/567ED6CA.2040504%...
Of course UNION [ALL] does often yield better performance, yes.
There’s a difference between the two. https://www.postgresql.org/docs/current/sql-expressions.html...:
“The order of evaluation of subexpressions is not defined. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order.
Furthermore, if the result of an expression can be determined by evaluating only some parts of it, then other subexpressions might not be evaluated at all. For instance, if one wrote:
SELECT true OR somefunc();
then somefunc() would (probably) not be called at all. The same would be the case if one wrote:
SELECT somefunc() OR true;
Note that this is not the same as the left-to-right “short-circuiting” of Boolean operators that is found in some programming languages.”
https://www.postgresql.org/docs/current/functions-comparison...:
“For the <, <=, > and >= cases, the row elements are compared left-to-right, stopping as soon as an unequal or null pair of elements is found.”
So, this is using something akin to short-circuiting.
I think that can mean the two give different results in the presence of null values.
⇒ I would guess the optimizer isn’t smart enough to detect when the second, stricter query will be equivalent and faster.
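The stopping rule quoted above is easy to probe; a sketch using SQLite row values, which follow the same left-to-right convention (Postgres should behave the same here, though that's worth verifying against a real server):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# First pair already unequal: the NULL in position two is never reached,
# so the comparison is TRUE rather than NULL.
assert conn.execute("SELECT (1, NULL) < (2, 0)").fetchone()[0] == 1

# First pair equal: the comparison moves on, hits the NULL pair, and the
# whole row comparison becomes NULL (returned as None by sqlite3).
assert conn.execute("SELECT (1, NULL) < (1, 0)").fetchone()[0] is None
```

That is the sense in which it is "akin to short-circuiting": later elements only matter, and only get the chance to inject a NULL, when all earlier pairs compare equal.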
The author filtered with:
That can't use one index for both conditions. Instead, make it faster and simpler:
1. Always use BUFFERS when running an EXPLAIN. It gives some data that may be crucial for the investigation.
2. Always, always try to get an Index Cond (called Index range scan in MySQL) instead of a Filter.
3. Always, always, always assume PostgreSQL and MySQL will behave differently. Because they do.
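Tip 2 has a rough SQLite analogue that's easy to experiment with: EXPLAIN QUERY PLAN reports SEARCH (an index range scan) vs SCAN (read everything, filter each row). The table here is invented; per tip 3, the details on Postgres and MySQL differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (createat INTEGER, id TEXT)")
conn.execute("CREATE INDEX idx_createat_id ON posts(createat, id)")

def plan(sql):
    # Last column of each EXPLAIN QUERY PLAN row is the detail text.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Sargable predicate: the planner can seek into the index
# (an "Index Cond" in Postgres terms).
assert "SEARCH" in plan("SELECT * FROM posts WHERE createat > 5")

# Wrapping the column in an expression defeats the index: every row is
# visited and filtered (a "Filter" in Postgres terms).
assert "SCAN" in plan("SELECT * FROM posts WHERE createat + 0 > 5")
```

The Postgres equivalent of tip 1 is `EXPLAIN (ANALYZE, BUFFERS) <query>`, which adds shared/read/dirtied block counts to each plan node.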
https://dba.stackexchange.com/questions/266405/does-ordering...
https://uuid7.com/
From the article, one of them is: 'tpomh9yu1tffmdp6dopobwuc9h'
But the query is trying to page through the most recent posts first.
So a newer post doesn't necessarily have a higher ID or lower ID.
The ID is just there to have a stable tie break between the same timestamps.
Did not know you could do "tuple" filtering like this
WHERE Posts.CreateAt >= ?1 AND Posts.Id > ?2 (where ?1 and ?2 are the current cursor values)