It doesn't seem procedural to me, as there is no rebinding. Such a sequence of assignments would look just as at home in a functional language like Lisp, ML, or Haskell as in Python. In procedural languages, idiomatically, you have mutation, in which variables are re-bound to new values, and side effects, in which the external environment and the program interact with each other in unconstrained ways.
With CTEs, the optimizer doesn't execute each query separately but folds them into one optimized plan. If you materialize them, execution times can be orders of magnitude higher in cases where you only end up using small parts of the queries.
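For a concrete sketch of the difference (Postgres 12+ syntax; the orders table and its columns are made up):

WITH recent_orders AS NOT MATERIALIZED (
  SELECT * FROM orders WHERE created_at > now() - interval '30 days'
)
SELECT * FROM recent_orders WHERE customer_id = 42;
-- inlined: the planner can push customer_id = 42 into the CTE and use an index

WITH recent_orders AS MATERIALIZED (
  SELECT * FROM orders WHERE created_at > now() - interval '30 days'
)
SELECT * FROM recent_orders WHERE customer_id = 42;
-- materialized: the whole 30-day result is computed first, then scanned without indexes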
CTEs (common table expressions) are wonderful; they make SQL grokkable and maintainable. But before some dev goes crazy refactoring all of the SQL in their codebase into maintainable CTEs: always benchmark. CTEs can cause your query planner to optimize incorrectly. In some cases, CTEs can force the query optimizer onto a plan it otherwise would not choose and end up more performant - perhaps only up to a certain point. (That point being the size of the tables in question, indexes, etc.)
I had some code from a vendor that was hanging in SQL Server. Looking at it, they had composed everything from smaller CTE queries and it took forever to run. They had taken various complex criteria, executed one query for each, and then combined the results -- I ended up spending a day refactoring the whole thing into a single SELECT with all the criteria, and it ran instantly.
This was for a nightly data importer and it was filled with queries like that -- I only fixed the one giving us issues -- but I bet I could have reduced the total runtime by 90%.
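To give a rough idea of that kind of refactor (invented tables and criteria, not the vendor's actual code):

-- Before: one query per criterion, combined afterwards
WITH overdue AS (SELECT id FROM invoices WHERE due_date < current_date),
     large   AS (SELECT id FROM invoices WHERE amount > 10000),
     flagged AS (SELECT id FROM invoices WHERE review_flag = 1)
SELECT id FROM overdue
UNION SELECT id FROM large
UNION SELECT id FROM flagged;

-- After: one pass over the table with all the criteria in a single SELECT
SELECT id
FROM invoices
WHERE due_date < current_date
   OR amount > 10000
   OR review_flag = 1;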
Indeed, SQL is tricky like that because there are plenty of performance consequences for a language that is meant to be declarative and optimized by a query planner.
Unless you have realistic data in the test database the performance behavior of a query can be unpredictable.
I have strongly encouraged reckless use of CTEs throughout our product. We use in-memory instances of SQLite to evaluate any queries which would leverage CTEs. These datasets are usually small enough to reside within L1 (and certainly within L2).
“By the time it doesn’t fit in [L2/RAM] you will have a different order of magnitude of engineering team” is one of my favorite adages. Arguably a bit less true with the venture model of growth-at-all-costs showing some cracks, but still very much a good mental model!
In my experience, a huge portion of the user base that loves CTEs are analysts or devs running queries on decently large datasets; they mostly like them for readability and don't understand the performance implications.
I appreciate that tools like dbt allow materialization options as both CTEs and views/tables/etc because being able to pivot between them is super nice.
I wrote some very elegant, readable SQL to perform a complex query. It was dog slow.
I handed it to a DBA, they ripped out the CTEs and replaced them all with temp tables. The query was an unreadable mess at the end, but boy was it orders of magnitude faster.
Did they do something else other than just popping CTE queries into temp tables? In my experience, _just_ doing that doesn't affect readability much at all.
This is a great idea if you disable materialization (which the author of this post does not mention).
Yes, materialization is fine for small hardcoded values in the post, but for most other lookup tables, it's definitely not fine.
For example, you read this article and think: great, I'll make a lookup table mapping the various ids so they can be linked across tables.
with ids as (
  select u.user_id, u.token, s.customer_id
  from users u
  join stripe_customer s on s.user_id = u.user_id
),
user_impressions as (
  select i.*, ids.user_id
  from impressions i join ids using (token)
),
payments as (
  select p.*, ids.user_id from payments p
  join ids using (customer_id)
)
select sum(i.views),
       sum(i.clicks),
       sum(p.amount)
from payments p
join user_impressions i using (user_id)
where user_id = $1
group by date;
You think: great, this gets some stats for one user since it's filtered at the end, and with indexes on stripe_customer(user_id), impressions(token), and payments(customer_id), this will be really fast and efficient!
Nope. Since ids is referenced more than once in this query, Postgres materializes the ids CTE, with no indexes, for every user in your system. So not only does it take up a lot of extra space (spilling to disk once it outgrows work_mem), but to join user_impressions and payments you have to do an O(N) search across the ids dataset. There are no indexes on a materialized CTE.
It would be a lot faster to join to users or stripe_customer or both in a loop using their indexes than to do an O(N) search through all the users in your system.
This is particularly dangerous because if you remove the payments part, ids only has one reference, so Postgres doesn't materialize the table, and everything is fast.
This can be fixed by writing WITH foobar AS NOT MATERIALIZED (...) in the CTE syntax. IMO it should be a syntax error not to specify AS MATERIALIZED or AS NOT MATERIALIZED: the default has too many potential performance problems, and folks should have to think about whether they want materialization or not.
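Applied to the ids example above, it would look something like this (Postgres 12+; only the WITH header changes):

with ids as not materialized (
  select u.user_id, u.token, s.customer_id
  from users u
  join stripe_customer s on s.user_id = u.user_id
),
user_impressions as (
  select i.*, ids.user_id
  from impressions i join ids using (token)
),
payments as (
  select p.*, ids.user_id from payments p
  join ids using (customer_id)
)
select sum(i.views),
       sum(i.clicks),
       sum(p.amount)
from payments p
join user_impressions i using (user_id)
where user_id = $1
group by date;
-- with not materialized, the planner can inline ids at both references
-- and use the indexes on users and stripe_customer instead of scanning
-- an unindexed materialized result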
side note: I asked chatgpt to rewrite this query to use subqueries and it did a reasonable job.
Next step: include a cte-to-subquery translation step as part of your build pipeline. I've never needed tools like that, but I guess they must exist since it's been such a common issue. Using gpt for this is like nuking a mosquito from orbit.
Seems like an interesting idea, but could use a better example, at least for those who aren't yet intermediate level in SQL.
In what world is this
WITH countries (code, name) AS (
  SELECT * FROM (VALUES
    ('us', 'United States'), ('fr', 'France'), ('in', 'India')
  ) AS codes
)
SELECT data.code, name FROM data LEFT JOIN countries ON countries.code = data.code;
easier to read than this
SELECT code,
  CASE code
    WHEN 'us' THEN 'United States'
    WHEN 'fr' THEN 'France'
    WHEN 'in' THEN 'India'
  END AS country
FROM data;
Yeah, this is a common problem with contrived code samples like these. In my experience the author is right that real-world uses of CASE can often get out of hand - especially if you need the same mapping in more than one place. But the contrived, simple example is clearly easier than the CTE.
It is perhaps not surprising that we end up with this sort of stuff in production code, because the original author only needed one simple CASE statement and then it organically grew from there.
Occasionally, SQL surprises with bits of composability, such as the fact that using VALUES to specify literal tuples can be used with both INSERT and the FROM clause of a SELECT.
Is there any reason syntactically the SELECT needs to be required? If you use a VALUES table literal as a subselect, you have to give it column names with an AS clause (see https://www.postgresql.org/docs/current/sql-values.html). I can imagine a simpler syntax where you'd do WITH VALUES ... AS table_name (column_1_name, ...).
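A small illustration of both uses, plus the required alias in the FROM case (hypothetical countries table):

-- as the source of an INSERT
INSERT INTO countries (code, name)
VALUES ('us', 'United States'), ('fr', 'France'), ('in', 'India');

-- as a table literal in FROM, where it needs an alias and column names
SELECT v.code, v.name
FROM (VALUES ('us', 'United States'), ('fr', 'France'), ('in', 'India')) AS v (code, name);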
Is there any reason to alias the lookup table as `codes`?
sqlite doesn't have linear regression functions, and doing the math manually is a bit awkward because "b" relies on "m". Instead of duplicating the math to calculate "b", here's how to do it with CTEs:
CREATE TABLE vals (x, y);
INSERT INTO vals VALUES (1, 1), (2, 0.5), (3, 0.4), (4, 0.1), (5, 0);
WITH
m(v) AS (
SELECT ((COUNT(*) * (SUM(x * y))) - (SUM(x) * SUM(y))) / ((COUNT(*) * SUM(POW(x, 2))) - (POW(SUM(x), 2)))
FROM vals
),
b(v) AS (
SELECT (SUM(y) - (m.v * SUM(x))) / COUNT(*)
FROM vals JOIN m
)
SELECT
x AS real_x,
y AS real_y,
m.v AS m,
b.v AS b,
x * m.v + b.v AS interp_y
FROM vals JOIN m JOIN b;
real_x real_y m b interp_y
------ ------ ----- ---- -------------------
1 1 -0.24 1.12 0.88
2 0.5 -0.24 1.12 0.64
3 0.4 -0.24 1.12 0.4
4 0.1 -0.24 1.12 0.16
5 0 -0.24 1.12 -0.0800000000000001
If all you want is "m" and "b" the final query can just be "FROM m JOIN b" and the result will be 1 row.
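That is, keep the same CTEs and finish with:

SELECT m.v AS m, b.v AS b
FROM m JOIN b;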
I've tried to get myself writing a few times but never succeeded, so I don't have a blog. But - I am actually doing parameters like so in the view where I copied the above from:
CREATE TABLE view_confs (key, value);
INSERT INTO view_confs VALUES ('start', 2), ('end', 4);
Now prefix it with another one:
vals_range(x, y) AS (
SELECT x, y
FROM vals
WHERE x >= (SELECT value FROM view_confs WHERE key = 'start')
AND x < (SELECT value FROM view_confs WHERE key = 'end')
),
And swap the rest of the original query to use "vals_range" instead of "vals". Just gotta remember to update view_confs with the new range whenever you want to change the view. I think it should work inside a transaction to avoid multiple threads interfering with each other, but also to be clear: This is a workaround for sqlite not having stored procedures, and not wanting to implement it in code (so I can JOIN to the view in other queries). Better to use stored procedures in other databases than this workaround.
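Usage ends up looking something like this (sketch; interp_view stands in for whatever view wraps the query above):

BEGIN;
UPDATE view_confs SET value = 3 WHERE key = 'start';
UPDATE view_confs SET value = 5 WHERE key = 'end';
SELECT * FROM interp_view;  -- hypothetical view over the vals_range version of the query
COMMIT;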
Removing CTEs from your codebase by replacing them with the creation of a temp table, then using separate queries with as few joins as possible to populate it, will give you low-locking performance that is hundreds of times better nearly every time. No problem with readability.
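Roughly this shape, in T-SQL terms (invented names; the point is that the temp table can be filtered early and indexed):

-- populate a small temp table first, with as few joins as possible
SELECT u.user_id, u.token, s.customer_id
INTO #ids
FROM users u
JOIN stripe_customer s ON s.user_id = u.user_id
WHERE u.user_id = @user_id;

CREATE INDEX ix_ids_token ON #ids (token);

-- then run the separate queries against it
SELECT SUM(i.views) AS views, SUM(i.clicks) AS clicks
FROM impressions i
JOIN #ids ids ON ids.token = i.token;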
I see this pattern all the time in mssql code, but hardly ever elsewhere. I've always wondered if there is a reason for that. Are you coming from that background? I'm curious if anyone knows why it is so favored there?
I'm not sure I agree re performance btw. I've fixed a lot of slow mssql queries by changing them in the other direction (EDIT: or just adding a join). I do see how temp tables might shorten the time you hold locks---but then you might be subverting your transaction isolation level.
I have also seen the performance boost by using separate temp tables rather than CTEs (in my case in Redshift).
My hypothesis is that while memory use may be the same either way, you don't have the same transactional/locking requirements, overhead, and consistency with multiple statements as you do with a single big set of CTEs. And for analytics you rarely need the transactional isolation that a single big CTE query gives you.
It's not necessarily about locking, it's a way to control what order joins get done in, which, at least in MS SQL, is up to the query engine to decide. By writing to a temp table you can override the query engine's plan, often with huge performance gains.
Less known is that you can get the same effect using a TOP N statement where N is a number larger than the rows you might get back. This presumably avoids the additional i/o penalty of the temp table creation, but I've never validated that. This trick likely is portable across db engines, too.
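If I follow, the shape is something like this (SQL Server; 1000000000 just has to exceed the real row count):

SELECT SUM(i.views) AS views
FROM (
    SELECT TOP (1000000000) u.user_id, u.token
    FROM users u
    JOIN stripe_customer s ON s.user_id = u.user_id
) AS ids
JOIN impressions i ON i.token = ids.token;
-- the TOP nudges the optimizer to treat the derived table as its own step,
-- similar to dumping it into a temp table but without the extra i/o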
It depends on the use of the database. Where there is intense transactional writes (e.g. case management system, etc.) the CTE approach can easily get locked up, especially if the write queries do a lot of their own lookups. I have come from an MSSQL background, so maybe that is an artifact either of the DB or the way it is used in my own history :) Good question though, and I'd like to hear from someone who has experience in other DBs.
WRT subverting the transaction isolation level, using temp tables is a tacit decision to do so and best for non-vital read queries in my own experience. That is, I used it for lookups of tables that did not _need_ to be completely up to date, or when I knew the data _would_ be up to date.
I find that anyone who praises chaining CTEs has not dealt with how terrible they can become; each sub-table being able to take on any dependency from the parent tables is not a feature that helps you go fast.
I've used CTEs fairly extensively so I'm generally familiar with their pitfalls, but I don't quite grok your point here. Can you go into a little more detail about what you mean by "take on any dependency in the parent tables"?
In cases like in the example given, I would also consider using a generated column (based on the CASE expression, in the example). That way the CTE doesn’t need to be repeated on each affected query — any query on the table can directly reference the derived value. Of course, this approach is only applicable for row-based values of a single table, and requires you to be able to modify the schema to add the generated column.
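For the country-code example that would be something like this (Postgres 12+ generated column; assumes data has a code column):

ALTER TABLE data
  ADD COLUMN country text GENERATED ALWAYS AS (
    CASE code
      WHEN 'us' THEN 'United States'
      WHEN 'fr' THEN 'France'
      WHEN 'in' THEN 'India'
    END
  ) STORED;

-- any query can now reference the derived value directly
SELECT code, country FROM data;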
source = ...
transformed = ...
joined = ...
final = ...
You generally can't just swap transformed into final, because final depends on the output of transformed?
This works as long as your CTE’s are correct SQL and only the logic is wrong or suspect.
SQL is a very leaky abstraction.
Always benchmark on real or at least representative data.
If you were to run the first query on a fresh db, it’d return data. Running the second would fail.
There are apparently lots of other clever uses of the WITH clause, such as https://www.postgresql.org/docs/current/queries-with.html#QU...
Basically, I could put the WITH ... VALUES after the join.
This seems like a case where it would be nice to be able to pass in a parameter `vals (x, y)`.
See https://bertwagner.com/posts/does-the-join-order-of-my-table... - Adam Machanic has some good videos diving deeper into this.