whateveracct · a year ago
Just don't use a single Postgres DB for everything as you scale up to 100+ engineers. You'll inevitably get database-as-the-API.

Now if you have the actual technical leadership [1] to scale your systems by drawing logical and physical boundaries so that each unit has its own Postgres? Yeah Postgres for everything is solid.

[1] Surprisingly rare I've found. Lots of "successful" CTOs who don't do this hard part.

anonzzzies · a year ago
You don't have to plan this very early; most companies won't get to 100+ engineers. Ship first and worry about this much, much later. Overarchitecting makes life hard: I've come into companies that have 40 servers running, with everything architected for a million engineers and billions of visitors, while in reality there aren't even 2 users and there's 1 overworked engineer. Stop doing that, and stop telling people to do that.
Cthulhu_ · a year ago
This (don't over-architect stuff) is actually an argument in favor of using e.g. Postgres for everything, as adding more and more tools adds complexity and (architectural) overhead.
szundi · a year ago
If you die before that, it was the market, not engineers being slow. Of course. Or sales.
whateveracct · a year ago
I agree totally. But I've seen companies go to 100+ and still insist that the level of monolithic coupling I describe hasn't been outgrown.
KronisLV · a year ago
> Let's ship first and worry about this, much much much much later. Overarchitecting stuff makes life hard; coming into companies that have 40 servers running with everything architected for 1000000 engineers and billions of visitors while in reality there aren't even 2 users and there is 1 overworked engineer.

Even if you try to draw boundaries between different bits of the system, you are unlikely to end up with 40 servers, not even close. The average system wouldn't even have 40 separate use cases for PostgreSQL or even the need for 40 different dependencies.

However, if you do split it up early, you more realistically would have something along the lines of:

  * your main database - the main business domain stuff goes here
  * key-value storage/cache - decoupled from the main instance; because of the decoupling, nobody will be tempted to put business-domain-specific columns here, so it stays generic
  * message queue - for when processing some data takes a bunch of resources, but during peak load you need to accept a bunch of work quickly and process it once you have the capacity
  * blob storage - to keep binary data in a separate instance so the main database doesn't bloat, provided you don't need S3 compatibility
  * auth - the actual user data, assuming you use Keycloak or something like it
  * metrics - all of the APM stuff, e.g. from Apache SkyWalking, kept in its own PostgreSQL instance
Give or take 2-3 services, all of which can run off of a Docker Compose stack locally, or on the container management platform of your choice (not necessarily Kubernetes; simpler ones like HashiCorp Nomad or even Docker Swarm, using the Compose format, work too). All of them can have their backups treated similarly, take similar approaches to clustering, share similar tooling and client libraries for integration with your apps, and each can be inspected granularly for how it performs and scaled as needed.

It's arguably better than a single large instance that eventually ends up with 300 tables and 100 GB of data in the shared test environment, where you wouldn't know where to start on a working local environment if you join a legacy org that isn't using OCI containers and giving each dev a local environment. The single large deployment will rot faster than multiple ones. How many do you actually need? Depends on what you do; it wasn't that long ago that GitLab decided to split their singular DB into multiple parts: https://about.gitlab.com/blog/2022/06/02/splitting-database-... (which also shows that you can get pretty far with a single schema, for anyone who wants a counter-argument, though they did split in the end)

Realistically, if you're doing a personal or small project, everything can live in the same instance, because it'll probably never go that far. But I've seen both monolithic codebases (deployed as singleton apps, i.e. not even horizontally scaled) and what shoving all of the data into a single database leads to (regardless of whether it's PostgreSQL or Oracle), and it's never pleasant. I'd at the very least look at splitting out bits of a larger system based on the use case - not even DDD, just "here's the API and DB instance that process file uploads; they're somewhat decoupled from our business logic around products and shopping carts and all that".

Now, would I personally always use PostgreSQL? Not necessarily, since there are also benefits to Redis/Valkey, MinIO, RabbitMQ and others, alongside good integrations with lots of frameworks/libraries that just work out of the box, as opposed to having to write a bunch of arguably awkward SQL for PostgreSQL. But the idea of using fewer tools, like PostgreSQL, across different use cases (that they're still good for) seems sound to me as well.

underdeserver · a year ago
Drop the scare quotes from "successful". These guys shipped products. The migration to multiple databases, syncing user information etc. is a milestone, not a necessity at every step.
devjab · a year ago
Unfortunately a lot of places teach software engineers to build in complexity before they need it. I like to remind people that Stack Overflow ran on a couple of IIS servers and an MS SQL DB for a good while before scaling up to a little more than a couple of IIS servers. Hell, they didn't even do the CDN thing until 2017 or something similarly "crazy". Meanwhile you have hordes of developers who write interfaces that will only ever be consumed by a single class, because they've been taught to do so.

As I see it, the only way to look at scaling issues is that you've made it. Almost no user-facing software will ever reach a point where something like Django couldn't run it perfectly fine. When you do, you build solutions, which sometimes means forking and beating the hell out of Django (Instagram), sometimes going into Java or whatever, spreading out to multiple databases and so on. Of course, once you actually hit the scaling issues you'll also have enough engineers to actually do it.

whateveracct · a year ago
nah at the scale I'm talking about, if you've shipped a product but your engineers are increasingly dissatisfied with the effort it takes to ship (and are becoming gun-shy because you haven't decoupled things, so their outage blast radius is huge), then you are currently failing as a technical leader. And if those things are true and you stick to "one monolithic Postgres DB for everything", you are cargo-culting too hard.

You can't just keep "doing things that don't scale" forever. If you have 100+ engineers [1], you aren't a startup anymore no matter your roots.

[1] and remember, my comment that is the ultimate parent of this conversation is about not continuing to do this at 100+. I said nothing about scaling up to that point.

szundi · a year ago
Thank you
RedShift1 · a year ago
> You'll inevitably get database-as-the-API

I think that's actually the point of "postgres for everything"?

whateveracct · a year ago
If you don't have any discipline it becomes hell.

Not to mention that a random team writing a migration that locks a key shared table (or otherwise chokes resources) now causes outages for everyone.
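
The classic footgun looks something like this (a hedged sketch, table name invented). ALTER TABLE needs an ACCESS EXCLUSIVE lock, so if it queues behind one long-running query, every other query on that table queues behind it:

    -- needs an ACCESS EXCLUSIVE lock; while it waits for one long
    -- transaction to finish, all new queries on "users" queue behind it
    ALTER TABLE users ADD COLUMN last_seen timestamptz;

    -- damage control: fail fast instead of stalling everyone
    SET lock_timeout = '2s';
    ALTER TABLE users ADD COLUMN last_seen timestamptz;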

solatic · a year ago
Database-as-the-API can scale surprisingly far, particularly if you sell a single-tenant shard to each customer and therefore a separate database to each customer. Drawing logical software boundaries before Product even knows what the domain looks like (i.e. which features will sell) is quite risky.
qaq · a year ago
database as API works fine if you have properly abstracted things with sprocs and views. It will also be far less brittle than 100 services exposed as GraphQL.
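
As a minimal sketch (schema and names invented): consumers read a view and call a function to write, and the tables behind them stay private:

    -- read path: a view hides the physical schema
    CREATE VIEW api.active_accounts AS
      SELECT id, display_name
      FROM internal.accounts
      WHERE deleted_at IS NULL;

    -- write path: a function is the only supported mutation
    CREATE FUNCTION api.rename_account(p_id bigint, p_name text)
    RETURNS void LANGUAGE sql AS $$
      UPDATE internal.accounts SET display_name = p_name WHERE id = p_id;
    $$;
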
Kwpolska · a year ago
Stored procedures just add overhead and make everyone's lives harder. Forget about any ORMs, you're writing raw SQL with all the quirks of PL/pgSQL biting you all the time.
ForHackernews · a year ago
Postgres is inherently multi-tenant. Have separate logic DBs in one physical instance, connect using roles with minimal permissions, expose views (and materialized views!) for querying so you can mutate the underlying tables without requiring applications to change.
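
A rough sketch of that setup (role and view names invented):

    -- one locked-down role per application
    CREATE ROLE orders_app LOGIN;
    REVOKE ALL ON ALL TABLES IN SCHEMA internal FROM orders_app;

    -- the app only ever sees the view, so the underlying
    -- tables can be refactored without touching the app
    CREATE VIEW public.order_summary AS
      SELECT o.id, o.placed_at, c.display_name
      FROM internal.orders o
      JOIN internal.customers c ON c.id = o.customer_id;
    GRANT SELECT ON public.order_summary TO orders_app;
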
whateveracct · a year ago
Great point - wish the people in charge at my job knew that lol
benrutter · a year ago
I'm so glad to see somebody made this comment!!

Postgres for everything is pretty neat in that you can take expertise from one place, and use it somewhere else (or just not have to learn a zillion tools)

The same database for everything is a really good way to end up with a tangled mess (are people still using the word complected?) where nobody knows which parts depend on what.

CalRobert · a year ago
You could go whole hog with postgrest I suppose
jjallen · a year ago
I always wonder if postgrest works well after growing quite a bit. I really should google the largest companies using it.
jensneuse · a year ago
I absolutely love Postgres, but please allow me to say that you absolutely don't want to expose an API generated from a database to people outside of your team. This limits you a lot in changing the way you store your data.

I wrote about this topic before and haven't changed my opinion much. You don't want to have that tight coupling: https://wundergraph.com/blog/six-year-graphql-recap#generate...

RedShift1 · a year ago
What exactly is the problem with tight coupling? You're going to insert an entire layer that basically translates format A to format B, just so you can later change a column name in the database and not have to change it in the API or something?
grey-area · a year ago
1. You don’t want or need to expose lots of implementation details. Many of your data structures should be private, and many should be partly private.

2. Your data structures should not dictate the shape of your api, usage patterns should (e.g. the user always needs records a,b,c together or they have a and want c but don’t care about b)

3. It stops you changing any implementation details later and/or means any change is definitely a breaking change for someone.

Cthulhu_ · a year ago
There are a few issues; one is that if you have the DB do everything, all of your business logic lives there too, instead of just the data. This is still fine if you have a single use case, but what if, in addition to your main application, you also need to use it for things like BI, customer service, analytics / predictions, etc.? It then quickly becomes better to use it as a datastore and have another layer decide what to do with it.

And in 30 odd years, everything will be different again, but your company's main data store will not have moved as fast.

haileys · a year ago
Yes. A lot of the work I've done through my career is essentially this, once you boil away the rest of the details.


hmottestad · a year ago
Normalization is one of those typical issues where you might be fine with having everything normalized when you start off, but once performance gets bad you end up denormalizing tables that are typically joined.
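
For example (a sketch with invented names): copy the hot field onto the joined-to table and let a trigger keep it in sync, trading write cost for join-free reads:

    -- denormalize customers.name onto orders to skip the join on reads
    ALTER TABLE orders ADD COLUMN customer_name text;

    CREATE FUNCTION sync_customer_name() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
      SELECT name INTO NEW.customer_name
      FROM customers WHERE id = NEW.customer_id;
      RETURN NEW;
    END;
    $$;

    CREATE TRIGGER orders_customer_name
      BEFORE INSERT OR UPDATE OF customer_id ON orders
      FOR EACH ROW EXECUTE FUNCTION sync_customer_name();
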
pdimitar · a year ago
The extremely obvious problem is that how you store data is an implementation detail, and those change when requirements (or the market) evolve. I'll give you an API, make triple sure it's as fast as a machine can even serve it, and you let me worry about how it's all stored.

To additionally answer you with an analogy: when you have a problem with a company, you call the call center, not Jenny from accounting in particular. Jenny might have helped you twice or thrice, but she might leave the company next year, and then you have no idea how to solve your problem. Have call centers dispatch your requests wherever applicable on any given day, and leave Jenny alone.

halfcat · a year ago
> *What exactly is the problem with tight coupling?*

As Joel Spolsky put it: ”the cost of software is the cost of its coupling”.

More specifically, the cost of making changes when "if I change this thing I have to change that thing". But if there's no attention paid to coupling, then it's not just the two things you have to change, but "if I change this thing I have to change those 40 things".

dunwaldo · a year ago
you could use views as a layer in between?
boomskats · a year ago
You should use views as the layer in between. They'll let you version your API and make changes without breaking things.
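
Something like this (schema names invented): each API version is just a schema of views over the same tables, so a rename in the base table breaks nobody:

    -- v1 preserves the old column name after a rename in the base table
    CREATE SCHEMA api_v1;
    CREATE VIEW api_v1.users AS
      SELECT id, full_name AS name FROM internal.users;

    -- v2 exposes the new shape; old clients keep working against v1
    CREATE SCHEMA api_v2;
    CREATE VIEW api_v2.users AS
      SELECT id, full_name FROM internal.users;
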
tlarkworthy · a year ago
I was recently annoyed to find Postgres indexes don't support skip scans [1], and that you can't have the NUL character in a string (\u0000) [2]. It's great, but it has some strange WTF gaps in places.

[1] https://wiki.postgresql.org/wiki/Loose_indexscan

[2] https://stackoverflow.com/questions/28813409/are-null-bytes-...
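
For [1], the workaround on that wiki page is to emulate the skip scan yourself with a recursive CTE (assuming an indexed column col on table tbl):

    -- walks the index one distinct value at a time instead of
    -- scanning every row; each step is a single index probe
    WITH RECURSIVE t AS (
      (SELECT col FROM tbl ORDER BY col LIMIT 1)
      UNION ALL
      SELECT (SELECT col FROM tbl WHERE col > t.col ORDER BY col LIMIT 1)
      FROM t
      WHERE t.col IS NOT NULL
    )
    SELECT col FROM t WHERE col IS NOT NULL;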

RedShift1 · a year ago
What is a reasonable use for a null character in a string? My first instinct is that strings with nulls in them should absolutely be rejected.
larsnystrom · a year ago
There are two kinds of programmers: Those who think of strings as text, and those who think of strings as a sequence of bytes. The second group doesn’t care about the special case where a byte is all zeroes.
lelanthran · a year ago
Yup. It's a huge red flag when a datatype intended to be used for representation of written human language is abused to store something that has no glyph recognisable in any human language.

There's a lot to complain about with nul-terminated strings, but not being able to store arbitrary bytes ain't one of them.

jagged-chisel · a year ago
Not everything needs to be a C-string (null-terminated array/sequence of characters.) We are advanced enough with our understanding of Things that we can include metadata along with a chunk of bytes to indicate “this is a ‘string’ and it’s q bytes long and can have any value you want in there.”

That said, I’m with you. And if someone wants nulls inside their “strings” then they probably want blobs.

tlarkworthy · a year ago
That your JSON deserializer accepted them.
jagged-chisel · a year ago
> you also can't have the nul character in a string …

Let me introduce you to blob…

hamilyon2 · a year ago
Yes, skip-index scans require custom SQL for now.

I am also a bit annoyed by cache-like uses not being first-class. Unlogged tables get you far and temporary tables are nice, but all of this still feels like a hurdle: awkward, and not quite what you actually need.
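
What you can do today looks roughly like this (a sketch) - unlogged tables skip the WAL, but you have to sweep expired rows yourself, since there's no built-in TTL:

    -- unlogged: fast writes, but contents are lost on a crash
    CREATE UNLOGGED TABLE cache (
      key        text PRIMARY KEY,
      value      jsonb NOT NULL,
      expires_at timestamptz NOT NULL
    );

    -- upsert with a 5 minute TTL
    INSERT INTO cache
    VALUES ('session:42', '{"uid": 42}', now() + interval '5 minutes')
    ON CONFLICT (key) DO UPDATE
      SET value = EXCLUDED.value, expires_at = EXCLUDED.expires_at;

    -- expiry has to be a periodic job (cron, pg_cron, ...)
    DELETE FROM cache WHERE expires_at < now();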

drtgh · a year ago
> I am also a bit annoyed by cache-like uses not being first-class.

Since what happened recently with Redis [1], the first thing I thought about was Postgres, but the performance difference [2] is too noticeable, so one has to look for other alternatives - and without much confidence, since such alternatives may follow the same "Redis attitude" (Valkey, DragonflyDB, KeyDB, Kvrocks, MinIO, RabbitMQ, etc etc^2).

It would be nice if these cache-like uses within Postgres got a tiny push.

[1] https://news.ycombinator.com/item?id=42239607

[2] https://medium.com/redis-with-raphael-de-lio/can-postgres-re...

    XXXXX achieves a latency of 0.095 ms, which is approximately 85% faster than the 0.679 ms latency observed for Postgres' unlogged table.

    It also handles a much higher request rate, with 892,857.12 requests per second compared to Postgres' 15,946.02 transactions per second.

jeeybee · a year ago
PGQueuer is a lightweight job queue for Python, built entirely on PostgreSQL. It uses SKIP LOCKED for efficient and safe job processing, with a minimalist design that keeps things simple and performant.

If you’re already using Postgres and want a Python-native way to manage background jobs without adding extra infrastructure, PGQueuer might be worth a look: GitHub - https://github.com/janbjorge/pgqueuer
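
For anyone unfamiliar with the pattern, the core of it (a generic sketch, not PGQueuer's actual schema) is that competing workers claim rows without blocking on each other:

    -- atomically claim one pending job; locked rows are skipped
    -- rather than waited on, so workers never block each other
    UPDATE jobs
    SET status = 'running', started_at = now()
    WHERE id = (
      SELECT id FROM jobs
      WHERE status = 'pending'
      ORDER BY created_at
      FOR UPDATE SKIP LOCKED
      LIMIT 1
    )
    RETURNING id, payload;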

TkTech · a year ago
Also https://github.com/TkTech/chancy for another (early) Python option that goes the other way and aims to have bells and whistles included like a dashboard, workflows, mixed-mode workers, etc...

Check out the Similar Projects section in the docs for a whole bunch of Postgres-backed task queues. Haven't heard of pgqueuer before, another one to add!

WJW · a year ago
I always wondered about the claim that SKIP LOCKED is all that efficient. Surely there are lots of cases where this is a really suboptimal pattern.

Simple example: if you have a mixture of very short jobs and longer duration jobs, then there might be hundreds or thousands of short jobs executed for each longer job. In such a case the rows in the jobs table for the longer jobs will be skipped over hundreds of times. The more long-running jobs running concurrently, the more wasted work as locked rows get skipped again and again. It wouldn't be a huge issue if load is low, but surely a case where rows get moved to a separate "running" table would be more efficient. I can think of several other scenarios where SKIP LOCKED would lead to lots of wasted work.

jeeybee · a year ago
Good point about SKIP LOCKED inefficiencies with mixed-duration jobs. In PGQueuer's benchmarks, throughput reached up to 18k jobs/sec, showing it can handle high concurrency well. For mixed workloads, strategies like batching or partitioning by job type can help.

While a separate "running" table reduces skips, it adds complexity. SKIP LOCKED strikes a good balance between simplicity and performance in many use cases.

One known issue is that vacuuming becomes a problem if the load stays high for longer periods, leading to table bloat.

cercaapps · a year ago
Job rows could have an indexed column state so you just query for the rows with state "not-started".

This way you won't need to skip over the long jobs that are in state "processing".
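
A partial index keeps that lookup cheap (a sketch, column names assumed):

    -- only waiting jobs live in the index; rows leave it as soon as
    -- their state changes, so long 'processing' jobs are never scanned
    CREATE INDEX jobs_pending_idx ON jobs (created_at)
      WHERE state = 'not-started';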

arunix · a year ago
What are its advantages compared to a more dedicated job queue system?
jeeybee · a year ago
I think PGQueuer's main advantage is simplicity: no extra infrastructure is needed, as it runs entirely on PostgreSQL. That makes it ideal for projects already using Postgres, where the operational familiarity is already there. While it may lack the advanced features or scalability of dedicated systems like Kafka or RabbitMQ, it's a great choice for lightweight job processing without the overhead of additional services.
j12a · a year ago
Being stuck with MariaDB/MySQL in some projects, I recently compared it to PostgreSQL and found many of these extended capabilities exist there too, including JSON, temporal tables (with SYSTEM VERSIONING), columnar and vector storage, etc.

LISTEN/NOTIFY-type functionality was sort of missing, but otherwise it was surprising how well it's keeping up, even if little of that is probably being used by many legacy apps.
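
The temporal tables in particular are pleasantly low-ceremony in MariaDB (a sketch with an invented table):

    -- MariaDB keeps every row version, queryable by time
    CREATE TABLE prices (
      item_id INT PRIMARY KEY,
      price   DECIMAL(10, 2)
    ) WITH SYSTEM VERSIONING;

    -- read the table as it looked at a point in time
    SELECT * FROM prices
    FOR SYSTEM_TIME AS OF TIMESTAMP '2024-01-01 00:00:00';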

jankovicsandras · a year ago
Can you please add this shameless plug?

https://github.com/jankovicsandras/plpgsql_bm25

Opensource BM25 search in PL/pgSQL (for example where you can't use Rust extensions), and hybrid search with pgvector and Reciprocal Rank Fusion.

samanator · a year ago
Cool to see you here!

We discussed this very thing on supabase

https://github.com/orgs/supabase/discussions/18061#discussio...

feketegy · a year ago
Try opening a pull request maybe
KingOfCoders · a year ago
"It's inspired by this article from the Amazing CTO [0]"

Getting up in the morning, seeing an article that references you, bliss!

[0]: https://www.amazingcto.com/postgres-for-everything/

tbarbugli · a year ago
> Use Postgres for Fulltext Search instead of Elastic

pg full-text search is very limited and user-unfriendly; not a great suggestion here
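
For context, the built-in flavor looks like this (a minimal sketch, articles table assumed) - workable, but the ranking is crude next to Elastic's BM25, and the language configs trip people up:

    -- a stored tsvector column plus a GIN index is the usual setup
    ALTER TABLE articles ADD COLUMN search tsvector
      GENERATED ALWAYS AS (to_tsvector('english', body)) STORED;
    CREATE INDEX articles_search_idx ON articles USING GIN (search);

    SELECT title
    FROM articles
    WHERE search @@ websearch_to_tsquery('english', 'postgres queue')
    ORDER BY ts_rank(search,
                     websearch_to_tsquery('english', 'postgres queue')) DESC;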

ianpurton · a year ago
I can see a lot of benefits of having 1 API to access a lot of functionality.

For example, instead of integrating with a message queue, I can just do an INSERT - this is great. It lowers the friction.
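
Roughly (a sketch with an invented table) - a worker that LISTENs gets woken up, with no broker in between:

    -- producer: enqueueing is a plain transactional INSERT
    INSERT INTO outbound_emails (payload)
    VALUES ('{"to": "someone@example.com"}');
    NOTIFY outbound_emails;

    -- consumer: wakes on the notification instead of polling
    LISTEN outbound_emails;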

Vector search is a no-brainer too. Why would I have 2 databases when 1 can do it all?

Using Postgres to generate HTML is questionable, though. I haven't tried it, but I can't imagine it's a viable way to create user interfaces.

qpiox · a year ago
> Using Postgres to generate HTML is questionable, though. I haven't tried it, but I can't imagine it's a viable way to create user interfaces.

Check https://apex.oracle.com/ which does that.