If at all possible, use a union all rather than a plain union to avoid an extra sort / unique node.
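As a minimal sketch of that rewrite (table, columns, and data invented for the demo; SQLite here, but the shape is the same in Postgres): the second branch must exclude rows already matched by the first, since UNION ALL, unlike UNION, does not de-duplicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, channel TEXT);
    CREATE INDEX idx_author ON posts(author);
    CREATE INDEX idx_channel ON posts(channel);
    INSERT INTO posts VALUES
        (1, 'alice', 'general'), (2, 'bob', 'general'),
        (3, 'alice', 'random'),  (4, 'carol', 'dev');
""")

# OR form: on Postgres this is at best a BitmapOr over both indexes.
or_form = conn.execute("""
    SELECT id FROM posts WHERE author = 'alice' OR channel = 'general'
    ORDER BY id
""").fetchall()

# UNION ALL form: each branch can use its own index; the second branch
# excludes rows the first already returned, so no duplicates are produced.
union_form = conn.execute("""
    SELECT id FROM posts WHERE author = 'alice'
    UNION ALL
    SELECT id FROM posts WHERE channel = 'general' AND author <> 'alice'
    ORDER BY id
""").fetchall()

assert or_form == union_form  # both return ids 1, 2, 3
```

Note the `author <> 'alice'` guard assumes `author` is NOT NULL; with nullable columns the exclusion needs `IS DISTINCT FROM` on Postgres.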
I've used that OR > UNION ALL trick a number of times to vastly improve performance on specific queries. It's crazy how much of an effect it can have.
I wish Postgres would implement a planner optimization to automatically run queries with an OR more efficiently (e.g. use the same plan as the equivalent UNION [ALL] where possible).
I have a vague memory of this exact optimisation being one of the things Hibernate did automagically, about 10 years ago. I remember doing a fair bit of reading to figure out how & why it was so much faster.
I was able to implement this in the ORM layer of Django through a function that would inspect the query and split it if it had an OR under certain conditions.
While OR can be a performance killer, in other cases it works perfectly fine, or sometimes even better than the equivalent UNION ALL query. In this case the problem is that they are applying a LIMIT with ORDER BY, so it can’t do the usual BitmapOr optimization, and thus the index ends up being used purely for sorting purposes.
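When LIMIT with ORDER BY is what defeats BitmapOr, one common workaround (sketched here with an invented minimal schema; this is a generic pattern, not necessarily the article's exact fix) is to order and limit each UNION ALL branch separately, then merge and re-limit, so each branch can be a short ordered index scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (createat INTEGER, id TEXT, PRIMARY KEY (createat, id));
    INSERT INTO posts VALUES
        (100, 'a'), (100, 'b'), (100, 'c'), (200, 'd'), (300, 'e');
""")

cursor_createat, cursor_id, page = 100, 'a', 2
rows = conn.execute("""
    SELECT * FROM (
        SELECT createat, id FROM posts      -- ties on the cursor timestamp
        WHERE createat = ?1 AND id > ?2
        ORDER BY createat, id LIMIT ?3
    )
    UNION ALL
    SELECT * FROM (
        SELECT createat, id FROM posts      -- everything strictly newer
        WHERE createat > ?1
        ORDER BY createat, id LIMIT ?3
    )
    ORDER BY createat, id LIMIT ?3          -- merge the two small result sets
""", (cursor_createat, cursor_id, page)).fetchall()

assert rows == [(100, 'b'), (100, 'c')]
```

Each branch produces at most one page of rows, so the final sort is over at most two pages rather than the whole table.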
+1. If you really understand your queries (which you should, if they are complicated enough that you don't already use an ORM), then my favorite set of "tricks" is CTEs to collect info and convert/translate as necessary, then unions to merge/filter said data from the CTEs. Add aggregate/window functions to this data, and you are good to go.
You can really do some complicated pipelining of data cleaning and filtering this way, much faster than resorting to a whole bunch of complicated inline SQL work in the clauses.
[edit] I should add, learn how to read an explain plan, this is vital for understanding and improving your queries.
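As a toy sketch of that pipeline style (all table and column names invented, and SQLite 3.25+ standing in for Postgres): collect and normalize in CTEs, merge with a union, then layer a window function on top:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_eu (amount_eur REAL);
    CREATE TABLE orders_us (amount_usd REAL);
    INSERT INTO orders_eu VALUES (10.0), (20.0);
    INSERT INTO orders_us VALUES (30.0);
""")

rows = conn.execute("""
    WITH eu AS (                      -- collect + convert one source
        SELECT amount_eur * 1.1 AS amount_usd, 'eu' AS region FROM orders_eu
    ),
    us AS (                           -- second source, already in USD
        SELECT amount_usd, 'us' AS region FROM orders_us
    ),
    merged AS (                       -- merge the cleaned data
        SELECT * FROM eu UNION ALL SELECT * FROM us
    )
    SELECT region, amount_usd,
           SUM(amount_usd) OVER () AS total   -- window fn over the pipeline
    FROM merged
    ORDER BY amount_usd
""").fetchall()

assert len(rows) == 3
assert abs(rows[0][2] - 63.0) < 1e-6  # 10*1.1 + 20*1.1 + 30 = 63
```

Each stage stays inspectable on its own (you can SELECT from any CTE while debugging), which is a big part of why this reads better than the same logic inlined into one WHERE clause.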
This seems like the type of thing that a sophisticated query planner in 2024 would be able to figure out on its own? Maybe as a non database expert I'm expecting too much?
There were attempts to do this automatically ~7 years ago (https://www.postgresql.org/message-id/flat/7f70bd5a-5d16-e05...). It didn't actually make it in due to worries about the approach and properly de-duplicating the final resultset, as well as that approach to the optimization not playing well with duplicate inner-join removal optimizations (which were also being discussed at the time).
FTA: “Why the query planner doesn’t automatically convert a condition like x > a OR (x == a AND y > b) to (x, y) > (a, b) is something I still don’t understand to this day, though.”
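That equivalence holds on non-null values and can be spot-checked exhaustively over a small domain; a sketch using row values in SQLite (3.15+), which follow the same left-to-right comparison rule as Postgres:

```python
import sqlite3
from itertools import product

conn = sqlite3.connect(":memory:")
# Compare the expanded OR form with the lexicographic row comparison
# for every combination of small non-null integers.
for x, y, a, b in product(range(3), repeat=4):
    or_form = conn.execute(
        "SELECT ? > ? OR (? = ? AND ? > ?)", (x, a, x, a, y, b)
    ).fetchone()[0]
    row_form = conn.execute(
        "SELECT (?, ?) > (?, ?)", (x, y, a, b)
    ).fetchone()[0]
    assert or_form == row_form
```

With NULLs in play the two forms are no longer trivially interchangeable (see the evaluation-order discussion below), which may be part of why the planner doesn't do this rewrite on its own.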
Agree it's well-known, in a slightly snobbish sort of way, but it could do with being more widely known for sure; it is a bit of a gotcha. IMO TFA is quite a nice tutorial on how they went about it: investigating the issue, determining the problem, etc., perhaps especially if the answer as you wrote it isn't already somewhat familiar. The explain-analyze, err, explainer is quite nice. Not reference material, but as an intro: hey, this is how you might be able to debug a similar problem you encounter.
The article felt like it was fumbling around when the initial explain pointed right at the issue. I didn’t know this specific trick, so I did learn something I guess.
That looks like a trivial optimization / transformation. I am still quite puzzled: so much work has gone into compilers, static code analysis, and profile-guided compilation (which comes naturally with JITs, too), yet the query optimizers of pretty much any database look rudimentary and unreliable. Sometimes I think they got stuck in the '90s and that's it.
That is a bit wrong, and I'm pointing it out because your summary didn't make logical sense to me, so I had to read the blog post after reading your comment.
The query should be CreateAt > $1 OR (CreateAt = $1 AND Id > $2), i.e. with $1 in the equality (not $2 like in your sample).
So the idea is here about paginating through posts that might be constantly changing so you can't use a simple offset, as that would give you duplications along the way. So you try to use CreateAt, but it could be possible that CreateAt is equal to another one so you fallback to ID.
But here I stopped reading the blog post, because I now wondered: why not use Id in the first place? It also seems to be auto-increment, since otherwise you couldn't really rely on it as a fallback like this. I don't have time to investigate it further, but tl;dr: it still left me confused why they don't use Id in the first place.
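For context, the cursor scheme being described is keyset ("seek") pagination; a sketch with an invented minimal schema, using the tuple form of the filter (SQLite row values stand in for Postgres's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (createat INTEGER, id TEXT, PRIMARY KEY (createat, id));
    INSERT INTO posts VALUES
        (100, 'b'), (100, 'a'), (100, 'c'), (200, 'a'), (300, 'z');
""")

def fetch_page(cursor, page_size=2):
    # (createat, id) > (?, ?) is the tuple form of
    # createat > ? OR (createat = ? AND id > ?)
    return conn.execute("""
        SELECT createat, id FROM posts
        WHERE (createat, id) > (?, ?)
        ORDER BY createat, id
        LIMIT ?
    """, (*cursor, page_size)).fetchall()

seen, cursor = [], (0, '')
while page := fetch_page(cursor):
    seen.extend(page)
    cursor = page[-1]   # resume exactly after the last row of the page

assert seen == [(100, 'a'), (100, 'b'), (100, 'c'), (200, 'a'), (300, 'z')]
```

Each page resumes strictly after the last (createat, id) seen, so concurrent inserts can't shift the window the way they would with an OFFSET cursor.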
It seems like the code is used both for indexing from scratch and for maintaining an existing index. In the case of maintaining an existing index you need to know where to start back up and Postgres doesn't guarantee that SERIAL primary keys are always in chronological order when there are multiple simultaneous sessions.
Ha, that's a good point. Devil's advocate though (and perhaps they changed it slightly for the example in the blog post): it could be that 'CreateAt' is more like 'PublishedAt', i.e. it doesn't include a possible draft period, but then Id (which does correspond to the actual initial record creation time) is as good as anything to disambiguate them, because it's actually arbitrary and only needs to be consistent at that point.
[edit] The CreateAt timestamp could be something from the client when the post is submitted, or tagged by the ingest server, and not when the posts actually are processed and hit the database.
Not at all surprised that Laurenz Albe played a key role in figuring this out -- I've learned so much about postgres from his Stack Overflow answers over the years.
I generate queries like this in a paging library I wrote at work. I did not use the tuple comparison because the sort order sometimes differs between fields, so each field in the tuple comparison would need a different operator, such as:
WHERE Posts.CreateAt > ?1 OR (Posts.CreateAt = ?1 AND Posts.Id < ?2) ORDER BY Posts.CreateAt ASC, Posts.Id DESC
I now wonder if it might be possible to use a negation operator to invert the sorting order, like:
(Posts.CreateAt, -Posts.Id) > (?1, -?2) ORDER BY Posts.CreateAt ASC, Posts.Id DESC
and keep the performance improvement.
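The negation idea does work when the descending column is numeric (it wouldn't apply to a text Id like the article's); a quick exhaustive check in SQLite over small integers:

```python
import sqlite3
from itertools import product

conn = sqlite3.connect(":memory:")
# For ORDER BY c ASC, n DESC with cursor (c0, n0), the mixed-direction
# filter  c > c0 OR (c = c0 AND n < n0)  should equal the single row
# comparison  (c, -n) > (c0, -n0)  once the DESC column is sign-flipped.
for c, n, c0, n0 in product(range(3), repeat=4):
    or_form = conn.execute(
        "SELECT ? > ? OR (? = ? AND ? < ?)", (c, c0, c, c0, n, n0)
    ).fetchone()[0]
    neg_form = conn.execute(
        "SELECT (?, -?) > (?, -?)", (c, n, c0, n0)
    ).fetchone()[0]
    assert or_form == neg_form
```

The sign flip is needed at all because a row comparison can't mix ASC and DESC directions across its elements.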
A UNION allows you to use two separate indexes and can speed up queries.
There have been more recent attempts to improve OR clause performance in another thread: https://www.postgresql.org/message-id/flat/567ED6CA.2040504%...
Of course UNION [ALL] does often yield better performance, yes.
There’s a difference between the two. https://www.postgresql.org/docs/current/sql-expressions.html...:
“The order of evaluation of subexpressions is not defined. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order.
Furthermore, if the result of an expression can be determined by evaluating only some parts of it, then other subexpressions might not be evaluated at all. For instance, if one wrote:
SELECT true OR somefunc();
then somefunc() would (probably) not be called at all. The same would be the case if one wrote:
SELECT somefunc() OR true;
Note that this is not the same as the left-to-right “short-circuiting” of Boolean operators that is found in some programming languages.”
https://www.postgresql.org/docs/current/functions-comparison...:
“For the <, <=, > and >= cases, the row elements are compared left-to-right, stopping as soon as an unequal or null pair of elements is found.”
So, this is using something akin to short-circuiting.
I think that can mean the two give different results in the presence of null values.
⇒ I would guess the optimizer isn’t smart enough to detect when the second, stricter query will be equivalent and faster.
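The stopping rule quoted above is easy to probe; a sketch using SQLite row values, which follow the same left-to-right convention (Postgres should behave the same here, though that's worth verifying against a real server):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# First pair already unequal: the NULL in position two is never reached,
# so the comparison is TRUE rather than NULL.
assert conn.execute("SELECT (1, NULL) < (2, 0)").fetchone()[0] == 1

# First pair equal: the comparison moves on, hits the NULL pair, and the
# whole row comparison becomes NULL (returned as None by sqlite3).
assert conn.execute("SELECT (1, NULL) < (1, 0)").fetchone()[0] is None
```

That is the sense in which it is "akin to short-circuiting": later elements only matter, and only get the chance to inject a NULL, when all earlier pairs compare equal.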
The author filtered with:
That can't use one index for both conditions. Instead, make it faster and simpler:
1. Always use BUFFERS when running an EXPLAIN. It gives some data that may be crucial for the investigation.
2. Always, always try to get an Index Cond (called Index range scan in MySQL) instead of a Filter.
3. Always, always, always assume PostgreSQL and MySQL will behave differently. Because they do.
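Tip 2 has a rough SQLite analogue that's easy to experiment with: EXPLAIN QUERY PLAN reports SEARCH (an index range scan) vs SCAN (read everything, filter each row). The table here is invented; per tip 3, the details on Postgres and MySQL differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (createat INTEGER, id TEXT)")
conn.execute("CREATE INDEX idx_createat_id ON posts(createat, id)")

def plan(sql):
    # Last column of each EXPLAIN QUERY PLAN row is the detail text.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Sargable predicate: the planner can seek into the index
# (an "Index Cond" in Postgres terms).
assert "SEARCH" in plan("SELECT * FROM posts WHERE createat > 5")

# Wrapping the column in an expression defeats the index: every row is
# visited and filtered (a "Filter" in Postgres terms).
assert "SCAN" in plan("SELECT * FROM posts WHERE createat + 0 > 5")
```

The Postgres equivalent of tip 1 is `EXPLAIN (ANALYZE, BUFFERS) <query>`, which adds shared/read/dirtied block counts to each plan node.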
https://dba.stackexchange.com/questions/266405/does-ordering...
https://uuid7.com/
From the article, one of them is: 'tpomh9yu1tffmdp6dopobwuc9h'
But the query is trying to page through the most recent posts first.
So a newer post doesn't necessarily have a higher ID or lower ID.
The ID is just there to have a stable tie break between the same timestamps.
Did not know you could do "tuple" filtering like this
WHERE Posts.CreateAt >= ?1 AND Posts.Id > ?2 (where ?1 and ?2 are the current cursor values)