Materialized views are obviously useful

What a great post. Humble and honest and simple and focused on an issue most developers think is so simple (“why not just vibe code SQL?”, “whatever, just scale up the RDS instance”).

Compliments aside, where this article stops is where things get exciting. Postgres shines here, as do Vitess, Cassandra, and ScyllaDB; even MongoDB has materialized views now. Vitess and Scylla are so good, it’s a shame they’re not more popular among smaller startups!

What I haven’t seen yet is a really good library for managing materialized views.

malthejorgensen · 3 days ago

Don’t you have to manually “refresh” Postgres materialized views, essentially making them an easier-to-implement cache (like the Redis example in the blog post) rather than the kind of always-auto-updating materialized view the blog post author is actually touting?

striking · 3 days ago

The real bummer is not that you have to manually refresh them, it's that refreshing them involves refreshing the entire view. If you could pick and choose what gets refreshed, you might just sometimes have a stale cache here and there while parts of it get updated. But refreshing any materialized view that is not small, or that is even slightly interesting, runs the risk of blowing up your write instance.

For this reason I would strongly advise, in the spirit of https://wiki.postgresql.org/wiki/Don't_Do_This, that you Don't Do Materialized Views.

Sure, Differential/Timely Dataflow exist and they're very interesting; I have not gotten to build a database system with them and the systems that provide them in a usable format to end users (e.g. Materialize) are too non-boring for me to want to deploy in a production app.

lbreakjai · 3 days ago

Out of the box, you're right, but there are extensions that do just that:

https://github.com/sraoss/pg_ivm

It's however not available on RDS, so I've never had the chance to try it myself.

nine_k · 2 days ago

I think it's impossible to do an incremental update in an arbitrary case. Imagine an m-view based on a query that selects top 100 largest purchases during last 30 days on an e-commerce site. Or, worse, a query that selects the largest subtree of followers on a social network site.

Only certain kinds of conditions, such as a rolling window over a timestamp field, seem amenable to efficient incremental updates. What am I missing?
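To make the top-N difficulty concrete, here is a toy sketch in plain Python (no database involved; the `TopK` class and its fields are hypothetical names invented for illustration). Inserts can be folded into the view by looking only at the new row and the current view, but deleting or expiring a row that is in the view forces a look back at the base data to find its replacement.

```python
import heapq


class TopK:
    """Toy stand-in for a 'top k largest purchases' materialized view."""

    def __init__(self, k):
        self.k = k
        self.base = []    # every live amount: plays the role of the base table
        self.view = []    # the materialized top-k, sorted descending
        self.rescans = 0  # how often we had to go back to the base data

    def insert(self, amount):
        self.base.append(amount)
        # Incremental: only the new row and the current view are consulted.
        self.view = sorted(self.view + [amount], reverse=True)[: self.k]

    def delete(self, amount):
        self.base.remove(amount)
        if amount in self.view:
            # The view alone cannot tell which row should move up,
            # so the base data has to be scanned again.
            self.rescans += 1
            self.view = heapq.nlargest(self.k, self.base)
```

A real engine could avoid the full rescan with an ordered index on the amount, which is roughly why such queries are maintainable in some systems but rarely for free.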

ropable · 3 days ago

Yes, you need to refresh the materialized views periodically. Which means that, just like any other caching mechanism, you're solving one problem (query performance) but introducing another (cache invalidation). I've personally used Postgres MVs to great success, but there are tradeoffs.

erulabs · 2 days ago

Oh interesting, I didn’t know that - I’ve been so far in MySQL/Vitess land for so long, I haven’t used Postgres in several years. That’s disappointing!

recroad · 2 days ago

So the author is wrong that they’re automatically kept in sync?

bdcravens · 3 days ago

In lieu of good MV support, you can always just run a scheduled query to store results in a persisted table that you identify as a materialized view. For example, when doing this in SQL Server, I give the table name a "cache" prefix.

enedil · 3 days ago

Materialized views in ScyllaDB are (were?) known to be a buggy implementation. In particular, they often depended on the cluster being healthy at the time of propagating the changes.

> And then by magic the results of this query will just always exist and be up-to-date.

With PostgreSQL the materialized view won't be automatically updated though, you need to do `REFRESH MATERIALIZED VIEW` manually.

jelder · 3 days ago

Did I miss in the article where OP reveals the magic database that actually does this?

3rd party solutions like https://readyset.io/ and https://materialize.com/ exist specifically because databases don’t actually have what we all want materialized views to be.

sophiebits · 3 days ago

These startups (and a handful of others) are what I meant!

anon84873628 · 3 days ago

In the analytics world, BigQuery MVs seem pretty cool. You can tune the freshness parameters, it will maintain user-specific row-level security, and even rewrite regular queries to use the pre-computed aggregates if possible.

But I don't think there is anything similar in the GCP transactional db options like Spanner or CloudSQL.

rapind · 3 days ago

You can do targeted materialized view updates via triggers. It's definitely verbose but does give you a lot of control.
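A minimal sketch of the trigger approach, using Python's built-in `sqlite3` so it runs anywhere; SQLite stands in for Postgres here, and the `tasks` / `task_counts` tables and trigger names are assumptions made up for the example. Each write touches only the summary row for the affected project instead of recomputing the whole view.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tasks (
    id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL,
    title TEXT
);
-- Hand-rolled "materialized view": one row per project.
CREATE TABLE task_counts (
    project_id INTEGER PRIMARY KEY,
    n INTEGER NOT NULL
);

CREATE TRIGGER tasks_after_insert AFTER INSERT ON tasks BEGIN
    INSERT OR IGNORE INTO task_counts (project_id, n) VALUES (NEW.project_id, 0);
    UPDATE task_counts SET n = n + 1 WHERE project_id = NEW.project_id;
END;

CREATE TRIGGER tasks_after_delete AFTER DELETE ON tasks BEGIN
    UPDATE task_counts SET n = n - 1 WHERE project_id = OLD.project_id;
END;
"""


def open_db():
    """Create an in-memory database with the tables and triggers above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def count_for(conn, project_id):
    """Read the precomputed count; projects with no rows count as 0."""
    row = conn.execute(
        "SELECT n FROM task_counts WHERE project_id = ?", (project_id,)
    ).fetchone()
    return row[0] if row else 0
```

This sketch does not handle an `UPDATE` that moves a task to another project; that would need a third trigger, which is part of the verbosity mentioned above.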

I'm currently parking PostgREST behind Fastly (varnish) for pretty much the same benefits plus edge CDNs for my read APIs. I really just use materialized views for report generation now.

4ndrewl · 4 days ago

Just landed here to write this. Materialized Views are _very_ implementation specific and are definitely _not_ magic.

It's important to understand how your implementation works before committing to it.

shivasaxena · 3 days ago

Curious if anyone knows of any implementation where they would be automatically updated?

Now that would be awesome!

EDIT: come to think of it, it would require going through a CDC stream and figuring out if any of the tables affected are a dependency of a given materialized view. Maybe with some AST parsing as well to handle tenants/partitions. Sounds like it can work?
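A rough sketch of just the dependency-matching step, in plain Python. The view names, table names, and the event shape (a dict with a `table` key) are all hypothetical; a real CDC feed would look different, and this only decides which views are stale, not how to update them incrementally.

```python
# Hypothetical catalog: which base tables each materialized view reads from.
VIEW_DEPENDENCIES = {
    "project_task_counts": {"tasks"},
    "project_summaries": {"tasks", "projects"},
}


def views_to_refresh(change_events, dependencies=VIEW_DEPENDENCIES):
    """Return the views whose base tables appear in a batch of change events."""
    touched = {event["table"] for event in change_events}
    return sorted(
        view for view, tables in dependencies.items() if tables & touched
    )
```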

oftenwrong · 3 days ago

There is a PostgreSQL extension that adds support for incremental updates to materialised views: https://github.com/sraoss/pg_ivm

dalyons · 3 days ago

Postgres materialized views are pretty terrible / useless compared to other RDBMSs. I’ve never found a use case for the very limited pg version.

globular-toast · 3 days ago

In Postgres a materialized view is basically a table that remembers the query used to generate it. Useful if you want to trigger the refreshes without knowing the queries.
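That description can be sketched in a few lines with Python's built-in `sqlite3` (SQLite has no native materialized views, so this emulates one; the `MaterializedView` class is a hypothetical helper, not any library's API). The object stores the query, and callers refresh it without ever seeing the SQL.

```python
import sqlite3


class MaterializedView:
    """A plain table plus the query that fills it."""

    def __init__(self, conn, name, query):
        self.conn = conn
        self.name = name    # trusted identifier, interpolated into SQL below
        self.query = query  # remembered so refresh() needs no arguments
        self.refresh()

    def refresh(self):
        # Full rebuild, like a plain refresh: the stored query is rerun
        # from scratch and the old contents are thrown away.
        self.conn.execute(f'DROP TABLE IF EXISTS "{self.name}"')
        self.conn.execute(f'CREATE TABLE "{self.name}" AS {self.query}')
        self.conn.commit()

    def rows(self):
        return self.conn.execute(f'SELECT * FROM "{self.name}"').fetchall()
```

Reads between refreshes return whatever the last rebuild produced, which is the staleness the rest of the thread is discussing; a timer or job scheduler could call `refresh()` periodically.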

thrown-0825 · 3 days ago

having a dataset refresh on a timer and cache the result for future queries is pretty useful

JohnBooty · 3 days ago

I've only used Postgres' and (ages ago) MSSQL's materialized views. What is pg missing compared to the others?

I've found them VERY useful for a narrow range of use cases, but I probably don't realize what I'm missing.

erulabs · 4 days ago

thom · 3 days ago

Materialize.com and Snowflake have pretty reliable incremental materialised views now, with caveats that aren’t back breaking. If you can transform in SQL rather than having to build a whole new pipeline or microservice to do the work that’s a pure operational win. I consider this alongside hybrid transactional/analytical databases to be the holy grail of data infrastructure. Finally we can stop just shuffling data around, support almost all workloads in one place, and get some work done.

viccis · 3 days ago

Yep Databricks does this pretty well too. I think their sales jargon for it is the "Enzyme engine"

MangoToupe · 3 days ago

> Finally we can stop just shuffling data around

Bro that's your entire job description. What is left?

thom · 2 days ago

Deciding what colour it should be.

bob1029 · 3 days ago

> I don’t know yet if the implementations of this yet are good enough to use at scale. Maybe they’re slow or maybe the bugs aren’t ironed out yet.

This technique is very well supported in the big commercial engines. In MSSQL's Indexed View case, the views are synchronously updated when the underlying tables are modified. This has implications at insert/update/delete time, so if you are going to be doing a lot of these you might want to do it on a read replica to avoid impact to production writes.

https://learn.microsoft.com/en-us/sql/relational-databases/v...

https://learn.microsoft.com/en-us/sql/t-sql/statements/creat...

mike_hearn · 2 days ago

Yes the idea it's newfangled is odd. It's only newfangled if you ignore the existence of database engines that are better than the open source ones.

Oracle's implementation scales horizontally and can do incremental view maintenance using commit logs:

https://oracle-base.com/articles/misc/materialized-views

But it may not even be necessary because Oracle also supports query caching both server and client side. By default you have to opt-in on the query level:

    SELECT /*+ RESULT_CACHE */ .... FROM .....

but if you do then queries might not even hit the database at all, the results are cached by the drivers. The main limitation of that feature is that it's meant for read-mostly tables and cache keys are invalidated on any write to the table. So for aggregating counts in a large multi-tenant table the hit rate would be low and a materialized view is more appropriate.

It's a bit unclear why the author of the article ignores triggers as not "in vogue" though. That's supported by every DB and will work fine for this use case. You do have to learn how to use these features but it can save a lot of work, especially when introducing non-consistent data sources like a Redis cache. Consistency is so valuable.

Disclosure: work for Oracle, opinions are my own (albeit in this case, presumably highly aligned).

TIL, thanks! I know Postgres and MySQL don’t include an equivalent.

Can we inspect MSSQL's source or is it shipped as a blob? I can't find any serious information about how this works. I can't imagine who would want to spend money on this.

jamesblonde · 3 days ago

This triggered me in the article

'There are a few startups these days peddling a newfangled technology called “incremental view maintenance” or “differential dataflow”. '

Incremental view maintenance can change recomputation cost of view updates from O(N) to O(1). DBSP is based on z-sets, a generalization of relational algebra. The paper won best paper at SIGMOD. There is a startup, Feldera, commercializing it.

It is just ignorance to dismiss this as 'newfangled'.

PerryStyle · 3 days ago

+1. Learned about this in DB research course during grad school. Feldera is really cool.

Also I love their website design.

lsuresh · 3 days ago

Thanks for the kind words (Feldera co-founder here). I'll pass it on to the design team. :)

tylerhou · a day ago

IVM is not new, but the DBSP (2022) perspective is relatively new for databases (where the classic literature is from the 70s and 80s).

It is misleading to say that IVM reduces the cost of view updates from O(n) to O(1). While that is not technically incorrect, for any nontrivial query (e.g. anything with an index join) the cost of a view update will be smaller than the original query but not constant time.
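A toy illustration of the join case in plain Python. It treats a Z-set as a mapping from rows to integer weights (+1 for an insert, -1 for a delete); the function name and data layout are invented for this sketch and are not the DBSP or Feldera API. The work done for one changed row is proportional to the number of matching rows on the other side of the join: less than recomputing the join, but not constant.

```python
from collections import defaultdict


def join_delta(delta_left, right_index):
    """Change in (left JOIN right) caused by a change to the left input only.

    delta_left:  {(key, left_payload): weight}, weight +1 insert / -1 delete
    right_index: {key: {right_payload: weight}}, current right side by key
    """
    out = defaultdict(int)
    for (key, left_payload), weight in delta_left.items():
        # One index lookup, then one output row per matching right row.
        for right_payload, right_weight in right_index.get(key, {}).items():
            out[(key, left_payload, right_payload)] += weight * right_weight
    # Rows whose weights cancel to zero are not part of the delta.
    return {row: w for row, w in out.items() if w != 0}
```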

Also, the tone of “newfangled” was not dismissive in the context of an article praising IVM. At worst, it was sarcastic; I interpreted it as teasing.

I do research related to IVM / DBSP.

I mean, everything you said sounds exactly like the definition of "new fangled" to me. I don't think the term is meant to be so pejorative or dismissive, just that the tech is currently intimidating to people not on the cutting edge. (Edit: e.g. taking graduate level database courses, as mentioned by a sibling comment :-)

There is constantly so much new stuff in software, you have to be a bit willfully ignorant of some things some of the time.

Does going from O(N) to O(1) sound like "new fangled"? That is the smell of progress

quectophoton · 4 days ago

Materialized views are great and (obviously) useful, but they have the usual tradeoffs of any caching mechanism (e.g. now you have to worry about cache data age and invalidation).

IMO a slept-on database feature is table partitioning to improve query performance. If you have a frequently-used filter field that you can partition on (e.g. creation timestamp), then you can radically improve query performance of large databases by having the DB only need to full-scan the given partitions. The database itself manages where records are placed, so there is no additional overhead complexity beyond initial setup. I've only used this for PostgreSQL, but I assume that other databases have similar partition mechanisms.

This is a weird article. The author doesn't even mention what database they are talking about then just drops in some SQL that looks like Postgres. If you think Postgres will magically have the right values in it for a materialized view you will be very disappointed...

quasarj · 3 days ago

Yeah, those were my thoughts as well. What database is this? In Postgres you definitely have to update materialized views manually....

jmull · 3 days ago

I'm curious why an index can't handle that first query well.

crazygringo · 3 days ago

There is no reason. An index is the proper solution for dealing with <1K tasks per project, conservatively. (On modern SSDs you'd probably still be plenty fast for <100K tasks.)

In fact, the query would return the result straight from counting the project_id index entries, never even needing to scan the table itself (as the author acknowledges).

If you're really dealing with many, many thousands of tasks per project, then materialized views are going to be just as slow to update as to view. They're not "magic". The standard performant solution would be to keep a num_tasks field that was always incremented or decremented in a transaction together with inserting or deleting a task row. That will actually be lightning fast.
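A minimal sketch of that counter pattern with Python's built-in `sqlite3` (SQLite standing in for whatever database is actually used; the `projects.num_tasks` column follows the comment, while the helper functions are hypothetical). The row change and the counter change commit together or not at all.

```python
import sqlite3


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE projects (
            id INTEGER PRIMARY KEY,
            num_tasks INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE tasks (
            id INTEGER PRIMARY KEY,
            project_id INTEGER NOT NULL REFERENCES projects(id),
            title TEXT
        );
    """)
    return conn


def add_task(conn, project_id, title):
    with conn:  # commits both statements together, rolls both back on error
        cur = conn.execute(
            "INSERT INTO tasks (project_id, title) VALUES (?, ?)",
            (project_id, title),
        )
        conn.execute(
            "UPDATE projects SET num_tasks = num_tasks + 1 WHERE id = ?",
            (project_id,),
        )
        return cur.lastrowid


def delete_task(conn, task_id):
    with conn:
        row = conn.execute(
            "SELECT project_id FROM tasks WHERE id = ?", (task_id,)
        ).fetchone()
        if row is None:
            return False  # nothing deleted, counter left alone
        conn.execute("DELETE FROM tasks WHERE id = ?", (task_id,))
        conn.execute(
            "UPDATE projects SET num_tasks = num_tasks - 1 WHERE id = ?",
            (row[0],),
        )
        return True
```

Reading the count is then a single-row primary-key lookup on `projects`, regardless of how many tasks the project has.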

Materialized views aren't even supported in many common relational databases. They're a very particular solution that has very particular tradeoffs. Unfortunately, this article doesn't go into the tradeoffs at all, and picks a bad example where they're not even an obviously good solution in the first place.

mb7733 · 3 days ago

Indexes can only help narrow down to the issues for the project (more generally: matching rows for the query). Once the index narrows down the rows, Postgres still has to count them all, and Postgres isn't particularly fast at that, especially in an active table[0]. That's what the author meant by 'complete index scan of the tasks for the project'.

Of course this isn't really relevant until there are a very large number of rows to count for a given query, much larger than what is likely for "tasks in a project". I've run into this only with queries that end up counting on the order of 10^7 to 10^9 rows, i.e. more like OLAP workloads.

[0] https://wiki.postgresql.org/wiki/Slow_Counting

th0ma5 · 3 days ago

This is my thing, I often thought of these views as a way to bridge organizational divides rather than technical ones. Still cool! But if you own everything you can do all kinds of other stuff just as easily.