mkurz · 3 years ago
Be aware when using RLS with views: By default the RLS policy will be executed with the permissions of the owner of the view instead of with the permissions of the user executing the current query. This way it can easily happen that the RLS policy is bypassed because the owner of the view is an admin account or the same account that owns the underlying table (see the gotchas section of the original post).

However, upcoming PostgreSQL 15 adds support for security invoker views: https://github.com/postgres/postgres/commit/7faa5fc84bf46ea6... That means you can then define the security_invoker attribute when creating a view and this "... causes the underlying base relations to be checked against the privileges of the user of the view rather than the view owner" (see https://www.postgresql.org/docs/15/sql-createview.html) PG15 beta 1 release notes: https://www.postgresql.org/about/news/postgresql-15-beta-1-r...
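A minimal sketch of what this looks like in PG15 (the table and view names here are illustrative, not from the docs):

    -- With security_invoker, RLS on the base table is checked against
    -- the querying user rather than the view owner
    CREATE VIEW tenant_entries
      WITH (security_invoker = true)
      AS SELECT * FROM entries;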

markhalonen · 3 years ago
This was a shock to us after using RLS for a while. The solution outlined here worked great for us: https://www.benburwell.com/posts/row-level-security-postgres...
infogulch · 3 years ago
So they create table-valued functions which support the "SECURITY INVOKER" security context, and then select from that function to form the view. I suppose there's a feature request somewhere to support the "SECURITY INVOKER" feature for views directly?
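A rough sketch of that workaround as I understand it (function and table names are illustrative):

    -- Functions default to SECURITY INVOKER, so RLS on the underlying
    -- table is evaluated against the calling user, not the view owner
    CREATE FUNCTION visible_entries() RETURNS SETOF entries
      LANGUAGE sql STABLE SECURITY INVOKER
      AS $$ SELECT * FROM entries $$;

    CREATE VIEW my_entries AS SELECT * FROM visible_entries();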
alberth · 3 years ago
That seems like a bug to me, and a significant one as well.

The underlying promise of RLS (sometimes even referred to as “virtual private database”) in an RDBMS is that data should never leak, because it’s handled transparently by the db.

This seems like a significant leakage point that the user has to personally manage.

Spivak · 3 years ago
This pretty much mirrors stored procedures, though, which have the option of running as the definer or the invoker. Breaking this with “when RLS is enabled, stored procedures and views set to run with the permissions of the definer intersect with the RLS policy of the invoker” is crazy weird.

Maybe it would actually be good behavior, but it would be super unintuitive.

bearjaws · 3 years ago
This is such a killer feature in PG, my new job uses it and it makes audits of our tenancy model dead simple.

Coming from a SaaS company that used MySQL, we would get asked by some customers how we guarantee we segmented their data, and it always ended at the app layer. One customer (a Fortune 10 company) asked if we could switch to SQL Server to get this feature...

Our largest customers ask how we do database multi-tenancy and we point to our SDLC + PG docs and they go 'K'.

e1g · 3 years ago
Every B2B client who asked us how we handle multi-tenancy also asked how we ensure their data is erased at the end of the contract. Using a shared database with RLS means you have to go through all DB backups, delete individual rows for that tenant, then re-generate the backup. That’s a non-starter, so we opted for having one DB per tenant which also makes sharding, scaling, balancing, and handling data-residency challenges easier.
brightball · 3 years ago
I filled out a ton of enterprise questionnaires on this stuff before and we just told people that it would be deleted when the backups expired after X days because we didn't have the capability to delete specific rows from our backups. Nobody ever argued.

There's not a single customer I've ever run across who's going to halt a contract because you can't purge their data from your backups fast enough. They're signing up because of what you offer, not the termination clause.

bearjaws · 3 years ago
We usually write a "reasonable best effort" clause into our deletion, that it will 100% be deleted from production within 30 days and automatically fall out of backups 60 days from there. This also helps since we can't control our downstream vendors such as Twilio, AWS SES, etc, who all have their own legal obligations and time frames.

Even for large health systems they have been okay with it.

Dave3of5 · 3 years ago
> we opted for having one DB per tenant which also makes sharding, scaling, balancing, and handling data-residency challenges easier

When you say 1 DB I suspect you mean you have a single DB server and multiple DBs on that server. Then I don't think this really solves the data-residency problem, as the client's data is just in a different DB but still on the same instance. It creates other problems for you as well: for example, you now have multiple DBs to run maintenance, upgrades, and data migrations on. My current company uses a similar model for multiple types of systems and it makes upgrading the software very difficult.

It also makes scaling more difficult, as instead of having a single DB cluster that you can tweak for everyone, you'll need to tweak each cluster individually depending on the tenants that are on those clusters. You also have a practical limit to how many DBs you can have on any physical instance, so your load balancing will become very tricky.

There are other problems it causes like federation which Enterprise Customers often want.

hobs · 3 years ago
Agree - performance on row level security (at least on SQL Server) is terrible, sharing by database is fairly easy.
axlee · 3 years ago
It makes BI work an absolute hellscape as well. Tradeoffs.
spiffytech · 3 years ago
> we opted for having one DB per tenant which also makes sharding, scaling, balancing, and handling data-residency challenges easier

I'm surprised this simplified balancing for you. When I worked somewhere with per-customer DBs, we had constant problems with rebalancing load. Some customers grew too big for where we put them, some nodes usually performed fine until the wrong subset of customers ran batch jobs simultaneously, etc.

jquery_dev · 3 years ago
How do you manage your backend in this case? Do you have an instance of the backend for each customer, or do you allow the backend to make connections to all the DBs?

I'm interested in doing similar and wondering about the best way to handle the routing between the databases from a single backend.

gervwyk · 3 years ago
Also second this, we even split our AWS org into an AWS account per tenant. Although this will maybe be a problem if we get into the 100s of clients. But it makes onboarding and off-boarding simple.
BeefWellington · 3 years ago
One DB (or perhaps schema, depending on DBMS) per tenant is the way to go IMO. It simplifies so much of your work, from DR to deletion to scaling.
SahAssar · 3 years ago
I think the way to handle this (based on how many companies handle GDPR compliance) is to not keep backups older than X months (usually 3 months) and have a clause that all data past that time is deleted.
eastbound · 3 years ago
I honestly don’t understand how Oracle is still alive. Postgres has so many of these killer features.

Also, I wonder how others do tenant separation, what other solutions there are.

mritchie712 · 3 years ago
Legacy.

If you have thousands of lines of code relying on Oracle the cost to migrate would be enormous.

abraae · 3 years ago
Oracle has had the ability to do this for decades ("virtual private database"), so whatever is keeping them alive, it's nothing to do with this particular nifty Postgres feature.
ibejoeb · 3 years ago
PostgreSQL is a great product, but Oracle has so many more features, even just in the query language. There's so much power embedded in the MODEL clause alone that would all be custom software in Postgres.

If you buy Oracle, you should use Oracle. Like really lean into it. If you really need it, it will be worth the money. I don't like dealing with Oracle sales, but the product is killer.

throwaway787544 · 3 years ago
They're like a tick. Very good at burrowing in and hard to remove. They have a lot of clients for whom a dozen million dollars is a drop in the bucket, and moving away is a decade-long millions-of-dollars project.
revskill · 3 years ago
Sharding is the only scalable way in my experience. The point about scalability here is, I can control the load as the data gets bigger.
Scarbutt · 3 years ago
oracle has flashback
gz5 · 3 years ago
And PG supports layer 3 shut-down of link listeners and inbound fw ports. So you can combine the L7 tenancy with a secure networking architecture which eliminates the problems of managing firewalls and ACLs. One of the open source examples: https://youtu.be/s-skpw7bUfI
ksec · 3 years ago
Fortune Top 10

1 Walmart

2 Amazon

3 Apple

4 CVS Health

5 UnitedHealth Group

6 Exxon Mobil

7 Berkshire Hathaway

8 Alphabet

9 McKesson

10 AmerisourceBergen

We can rule out 2,3,7,8 …

simonw · 3 years ago
I don't fully understand the performance implications here.

Say I was using this for a blog engine, and I wanted to run this SQL query:

    select * from entries;
But I actually only want to get back entries that my current user is allowed to view - where author_id = 57 for example.

Would PostgreSQL automatically turn the above query into the equivalent of this:

    select * from entries where author_id = 57;
And hence run quickly (assuming there's an index on that author_id column)?

Or would it need to run an additional SQL query check for every single row returned by my query to check row permissions, adding up to a lot of extra overhead?

ossopite · 3 years ago
yes, postgres will add such a condition to the query and in simple cases like this is able to use a corresponding index

unfortunately this can break down in more complex cases. roughly postgres trusts a limited set of functions and operators not to leak information about rows (e.g. via error messages) that the RLS policy says a query should not be able to see. that set includes basic comparisons but not more esoteric operations like JSON lookups. at some point postgres will insist on checking the RLS policy result for a row before doing any further work, which can preclude the use of indexes
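for illustration, a minimal policy along those lines (the setting name app.current_author is an assumption, not from the thread); since the qual is a plain equality, the planner can fold it into an index scan on author_id:

    ALTER TABLE entries ENABLE ROW LEVEL SECURITY;

    -- rows are visible only where author_id matches the session setting
    CREATE POLICY entries_by_author ON entries
      USING (author_id = current_setting('app.current_author')::int);

    CREATE INDEX ON entries (author_id);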

lmeyerov · 3 years ago
We were looking at RLS, various ABAC integrated frameworks (casbin, ..), and zanzibar clones late last year --

* RLS is super appealing. Long-term, the architecture just makes so much more sense than bringing in additional maintenance/security/perf/etc. burdens. So over time, I expect it to hollow out how much the others need to do, reducing them just to developer experience & tools (policy analysis, db log auditing, ...). Short-term, I'd only use it for simple internal projects because cross-tenant sharing is so useful in so many domains (esp. if growing a business), and for now, RLS seems full of perf/expressivity/etc. footguns. So I wouldn't use it for a SaaS unless the tenants are severely distinct, like payroll, and even then, I'd have a lot of operational questions before jumping in.

* For the needed flexibility and app-layer controls, we took the middle road of casbin, though other tools are emerging too. Unlike the zanzibar-style tools that bring another DB + runtime + ..., casbin's system of record is our existing system of record. Using it is more like a regular library call than growing the dumpster fire that is most distributed systems. Database backups, maintenance, migrations, etc. are business as usual, no need to introduce more PITAs here, and especially not a vendor-in-the-middle with proprietary API protocols that we're stuck with ~forever as a dependency.

* A separate managed service might make zanzibar-style OK in some cases. One aspect is ensuring the use case won't suffer the view problem. From there, it just comes down to governance & risk. Auth0 being bought by Okta means we kind of know what it'll look like for a while, and big cloud providers have growing identity services, which may be fine for folks. Startup-of-the-month owning parts of your control plane is scarier to me: if they get hacked, go out of business, get acquired by EvilCorp or raise $100M in VC and jack up prices, etc.

There's a lot of innovation to do here. A super-RLS postgres startup is on my list of easily growable ideas :)

On a related note: We're doing a bunch of analytics work on how to look at internal+customer auth logs -- viz, anomaly detection, and supervised behavioral AI -- so if folks are into things like looking into account takeovers & privilege escalations / access abuse / fraud in their own logs, would love to chat!

jzelinskie · 3 years ago
As the developer of an external authorization system (full disclosure)[0], I feel obligated to chime in on the critiques of external authorization systems in this article. I don't think they're far off base, as we do recommend RLS for use cases like what the article covers, but anyway, here's my two cents:

1+2: Cost + Unnecessary complexity: this argument can be used against anything that doesn't fit the given use case. There's no silver bullet for any choice of solution. You should only adopt the solution that makes the most sense for you and vendors should be candid about when they wouldn't recommend adopting their solution -- it'd be bad for both the users and reputation of the solution.

3: External dependencies: That depends on the toolchain. Integration testing against SpiceDB is easier than Postgres, IMO [1]. SpiceDB integration tests can run fully parallelized and can also model check your schema so that you're certain there are no flaws in your design. In practice, I haven't seen folks write tests to assert that their assumptions about RLS are maintained over time. The last place you want invariants to drift is authorization code.

4: Multi-tenancy is core to our product: I'm not sure I'm steel-manning this point, but I'll do my best. Most companies do not employ authorization experts and solutions worth their salt should support modeling multi-tenant use cases in a safe way. SpiceDB has a schema language with idioms and recommendations to implement functionality like multi-tenancy, but still leaves it in the hands of developers to construct the abstraction that matches their domain[2].

[0]: https://github.com/authzed/spicedb

[1]: https://github.com/authzed/examples/tree/main/integration-te...

[2]: https://docs.authzed.com/guides/schema

gwen-shapira · 3 years ago
The blog explicitly said that if the requirements involve actual authorization models (beyond simple tenancy) then RLS is not the best fit (see: https://thenile.dev/blog/multi-tenant-rls#if-you-have-sophis...).

I think this covers both the complexity aspect and the difference between what you get from RLS and what external authz brings to the table (schema, for example).

I do think that RLS is a great way for a company without authz experts to build a multi-tenant MVP safely. I've yet to see a single pre-PMF company that worries about authorization beyond that; this is a series-B concern in my experience.

semitones · 3 years ago
Btw that's a localhost link
shaicoleman · 3 years ago
We're currently using the schema-per-tenant, and it's working very well for us:

* No extra operational overhead, it's just one database

* Allows to delete a single schema, useful for GDPR compliance

* Allows to easily backup/restore a single schema

* Easier to view and reason about the data from an admin point of view

* An issue in a single tenant doesn't affect other tenants

* Downtime for maintenance is shorter (e.g. database migration, non-concurrent REINDEX, VACUUM FULL, etc.)

* Less chance of deadlocks, locking for updates, etc.

* Allows easier testing and development by subsetting tenants data

* Smaller indexes, more efficient joins, faster table scans, more optimal query plans, etc. With row level security, every index needs to be a compound index

* Easy path to sharding per tenant if needed. Just move some schemas to a different DB

* Allows to have shared data and per-tenant data on the same database. That doesn't work with the tenant-per-database approach

There are a few cons, but they are pretty minor compared to the alternative approaches:

* A bit more code to deal with tenancy, migrations, etc. We opted to write our own code rather than use an existing solution

* A bit more hassle when dealing with PostgreSQL extensions. It's best to install extensions into a separate extensions schema

* Possible caching bugs so you need to namespace the cache, and clear the query cache when switching tenant

* The security guarantees of per tenant solution aren't perfect, so you need to ensure you have no SQL injection vulnerabilities
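Roughly what the per-request tenant switch looks like, sketched with illustrative schema names (our actual code differs):

    -- at the start of each request, scope the connection to one tenant
    SET search_path TO tenant_42, extensions;

    -- unqualified table names now resolve inside that tenant's schema
    SELECT * FROM invoices;

    -- before returning the connection to the pool, reset session state
    DISCARD ALL;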

bvirb · 3 years ago
We ran a multi-tenant SaaS product for years w/ a schema-per-tenant approach. For the most part it all worked pretty great.

We ran into issues here and there but always found a way to work around them:

* Incremental backups were a pain because of needing to lock so many objects (# of schemas X # of tables per schema).

* The extra code to deal w/ migrations was kinda messy (as you mentioned).

* Globally unique IDs become the combination of the row ID + the tenant ID, etc...

For us though the real deal-breaker turned out to be that we wanted to have real foreign keys pointing to individual rows in tenant schemas from outside of the tenant schema and we couldn't. No way to fix that one since with multi-schema the "tenant" relies on DB metadata (the schema name).

We ended up migrating the whole app to RLS (which itself was a pretty interesting journey). We were afraid of performance issues since the multi-schema approach kinda gives you partitioning for free, but with the index usage on the RLS constraints we've had great performance (at least for our use case!).

After quite a bit of time working with both multi-schema & RLS I probably wouldn't go back to multi-schema unless I had a real compelling reason to do so due to the added complexity. I really liked the multi-schema approach, and I think most of the critiques of it I found were relatively easy to work around, but RLS has been a lot simpler for us.

POPOSYS · 3 years ago
Would you like to explain the deal-breaker a little bit more? I do not understand the limitation you hit. It seems like it is no problem to reference foreign keys in a different schema as long as permissions allow that. Thanks!

Deleted Comment

uhoh-itsmaciek · 3 years ago
There are some other cons:

Memory usage and I/O can be less efficient. Postgres handles table data in 8kb pages, so even if you're just reading a single row, that reads 8kb from disk and puts 8kb in the Postgres buffer cache, with that row and whatever happens to be next to it in the physical layout of the underlying table. Postgres does this because of locality of reference: it's cheaper to bulk-load data from disk, and, statistically speaking, you may need the adjacent data soon. If each user is touching separate tables, you're loading a page per row for each user, and you're missing out on some of the locality benefits.

Another problem is monitoring (disclosure: I work for pganalyze, which offers a Postgres monitoring service). The pg_stat_statements extension can track execution stats of all normalized queries in your database, and that's a very useful tool to find and address performance problems. But whereas queries like "SELECT * FROM posts WHERE user_id = 123" and "SELECT * FROM posts WHERE user_id = 345" normalize to the same thing, schema-qualified queries like "SELECT * FROM user_123.posts" and "SELECT * FROM user_345.posts" normalize to different things, so you cannot easily consider their performance in aggregate (not to mention bloating pg_stat_statements by tracking so many distinct query stats). This is the case even when you're using search_path so that your schema is not explicitly in your query text.

Also, performance of tools like pg_dump is not great with a ton of database objects (tables and schemas) and, e.g., you can run into max_locks_per_transaction [1] limits, and changing that requires a server restart.

I wouldn't say you should never do schema-based multi-tenancy (you point out some good advantages above), but I'd be extremely skeptical of using it in situations where you expect to have a lot of users.

[1]: https://www.postgresql.org/docs/current/runtime-config-locks...

andy_ppp · 3 years ago
I find adding loads of stuff to Postgres exciting and fun, but I want all of my logic in the code in GitHub, rather than floating around in my global data store. Has anyone thought about a data layer that allows you to define this stuff programmatically rather than in SQL, but then configures your data layer to work like this? Not necessarily an ORM, but more a business logic layer that compiles everything down to use features like this. Or maybe even a data layer that is a set of programmatic building blocks that works as described?
hamandcheese · 3 years ago
Flyway is a plain-sql migration tool with support for “repeatable migrations” which somewhat do what you want:

https://flywaydb.org/documentation/tutorials/repeatable

macNchz · 3 years ago
I’ve set something like that up a handful of times in a kind of ad-hoc manner, by subclassing/extending the autogeneration tools from existing db migration frameworks to just detect changes in a directory of .sql files. Has worked pretty well to keep stored procedures/triggers/materialized views up to date with the repo.
lmeyerov · 3 years ago
Yep -- postgres enforces, while config-as-code can inject as usual because the whole point is that it's just SQL, so policy changes are just migrations (and subject to SDLC)
mixmastamyk · 3 years ago
You're describing an ORM, or perhaps SQLAlchemy, which has a lower-level interface. .sql files work fine in version control as well. The "create or replace …" pattern can make them idempotent.
simlevesque · 3 years ago
Supabase kinda does this.
LAC-Tech · 3 years ago
Does Supabase basically give you a pgAdmin-accessible instance to play with, then generate a JSON API using PostgREST? It's not 100% clear to me from skimming the docs.
uhoh-itsmaciek · 3 years ago
>Another issue we caught during testing was that some requests were being authorized with a previous request’s user id.

This is the terrifying part about RLS to me: having to rely on managing the user id as part of the database connection session seems like an easy way to shoot yourself in the foot (especially when combined with connection pooling). Adding WHERE clauses everywhere isn't great, but at least it's explicit.

That said, I've never used RLS, and I am pretty curious: it does seem like a great solution other than that one gotcha.

koolba · 3 years ago
If you do this right, it ends up in just one place and the code that handles checking in / out connections from your pool will handle wrapping your connection usage with the appropriate session context. And of course subsequently clearing it with a DISCARD ALL.
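A minimal sketch of that pattern (the setting name app.user_id is illustrative):

    -- on checkout, bind the request's user inside a transaction;
    -- SET LOCAL reverts automatically at COMMIT/ROLLBACK
    BEGIN;
    SET LOCAL app.user_id = '57';
    -- ... run the request's queries; RLS policies read app.user_id ...
    COMMIT;

    -- on check-in, wipe any session state left behind
    DISCARD ALL;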