I often go down rabbit holes like this, trying to collapse and simplify the application stack.
But inevitably, as an application grows in complexity, you start to realize _why_ there's a stack, rather than just a single technology to rule them all. Trying to cram everything into Postgres (or lambdas, or S3, or firebase, or whatever other tech you're trying to consolidate on) starts to get really uncomfortable.
That said, sometimes stretching your existing tech is better than adding another layer to the stack. E.g. using postgres as a message queue has worked very well for me, and is much easier to maintain than having a totally separate message queue.
I think the main takeaway here is that postgres is wildly extensible as databases go, which makes it a really fun technology to build on.
I have hands-on experience with some technologies, e.g., SQS and Postgres.
Say I'm on your team, and you're an application developer, and you need a queue. If you're taking the "we're small, this queue is small, just do it in PG for now and see if we ever grow out of that" — that's fine. "Let's use SQS, it's a well-established thing for this and we're already in AWS" — that's fine, I know SQS too. I've seen both of these decisions get made. (And both worked: the PG queue was never grown out of, and generally SQS was easy to work with & reliable.)
But what I've also seen is "Let's introduce bespoke tech that nobody on the team, including the person introducing it, has experience in, for a queue that isn't even the main focus of what we're building" — this I'm less fine with. There needs to be a solid reason why we're doing that, and that we're going to get some real benefit, vs. something that the team does have experience in, like SQS or PG. Instead, this … thing … crashes on the regular, uses its own bespoke terminology, and you find out the documentation is … very empty. This does not make for a happy SRE.
This desire can sometimes be so strong that people insist on truly wacky decisions. I have demonstrated before that Postgres performs perfectly well compared with a niche graph database (and in fact exceeds it), and heard some very strange reasons for why this approach should be avoided. A lot of the time you hear that it's engineers who chase shiny technology, but I've seen first hand what can happen when it's leadership.
I worked in a team that did this. It was mostly staffed by juniors, and the team leader wasn't very interested in the technical aspects, they just went to a page, checked that the new feature worked alright and gave the green light.
So over the years these juniors have repeatedly chosen different tech for their applications. Now the team maintains like 15-20 different apps, and among them there's React, Vue, Angular, Svelte, jQuery, Next.js and more for frontends alone. Most use Episerver/Optimizely for the backend, but of course some genius wanted to try Sanity, so now that's in the mix as well.
And it all reads like juniors built it. One app integrates with a public API; they built a fairly large integration app with its own integration DB. This app is like 20k lines of code, much of it dead, and it fetches data from the public API twice a day, whereas the actual app using the data updates once a day and saves the result in its own Episerver DB. So the entire thing results in more API traffic rather than less; the app could have just queried the API directly.
But they don't want me to do that, they just want me to fix the redundant integration thing when it breaks instead. Glad I'm not on that team any more.
Ok, I get that. But to play devil's advocate: with that mentality we'd never learn a new technology and would still be stuck on punch cards. And I don't have the time anymore for hobby projects. I'd say it's OK to introduce something new as long as it's one thing at a time, and not an entire new stack in one of those "a rewrite will solve all problems" projects.
I think a lot of the industry struggles with the idea that maybe there is no "one size fits all", and what makes sense when you're a one-person company with 100 customers probably doesn't make sense when you're a 1000-person company with millions of customers.
If you use a stack meant for a huge userbase (with all the tradeoffs that come with it) while you're still trying to find market fit, you're in for a disappointment.
Similarly, if you use a stack meant for smaller projects while having thousands of users relying on you, you're also in for a disappointment.
It's OK to make a choice in the beginning based on the current context and environment, and then change when it no longer makes sense. Doesn't even have to be "technical debt", just "the right choice at that moment".
> It's OK to make a choice in the beginning based on the current context and environment, and then change when it no longer makes sense.
Yep. And Postgres is a really good choice to start with. Plenty of people won't outgrow it. Those who do find it's not meeting some need will, by the time they need to replace it, have a really good understanding of what that replacement looks like in detail, rather than just some hand-wavy "web scale".
That's a nicely balanced view. I've been working at the intersection of dev, sec and ops for many, many years, and one of the most important lessons has been that every dependency is a liability, whether in complexity, availability, security, wasted resources, or projects and key people disappearing. Do anything to avoid adding more service, library or technology dependencies; if necessary, let people have their side projects and technological playgrounds to distil future stacks out of.
There are good reasons to go OLAP or graph for certain kinds of problems, but think carefully before adding more services and technologies because stuff has a tendency to go in easily but nothing ever leaves a project and you will inevitably end up with a bloated juggernaut that nobody can tame. And it's usually those people pushing the hardest for new technologies that are jumping into new projects when shit starts hitting the fan.
If a company survives long enough (or cough government), a substantial and ever increasing amount of time, money and sec/ops effort will go into those dependencies and complexity cruft.
PG works really well as a message queue, and there are several excellent implementations on top of it.
Most systems are still going to need Redis involved just as a coordinator for other pub/sub related work unless you're using a stack that can handle it some other way (looking at BEAM here).
But there are always going to be scenarios as an application grows where you'll find a need to scale specific pieces. Otherwise though, PostgreSQL by itself can get you very, very far.
It's also worth noting that by using PG as a message queue, you can do something that's nearly impossible with other queues - transactionally enqueue tasks with your database operations. This can dramatically simplify failure logic.
On the other hand, it also means replacing your message queue with something more scalable is no longer a simple drop-in solution. But that's work you might never have to do.
The more I do fullstack work, the more I see an obesity crisis. I understand the need to modularize (I dearly think I do), but god: you have the relational model, reimplemented in your framework, re-encoded as middleware to handle URL parsing, then one more layer to help integrate things client-side. I find that insane. And PostgREST was a refreshing idea.
Seriously. There's like 7000 duplicates of the very same data layer in a single stack: database, back-end ORM/data mapper, front end, and various caching things in between. Things like PostgREST and Hasura are great paired with fluent clients.
See, now that we're profitable, we're gonna become a _scale up_, go international, hire 20 developers, turn everything into microservices, rewrite the UI our customers love, _not_ hire more customer service, get more investors, get pressured by investors, hire extra c-levels, lay-off 25 developers and the remaining customer service, write a wonderful journey post.
Handling business logic in the database is often going to be an order of magnitude faster than the application layer of some of the popular language stacks (looking at you, Rails, Node, etc). It will also outlive whatever web stack of the day (and the acquisition, which often requires a rewrite of the application layer but keeps the general database structure; been there, done that).
> you start to realize _why_ there's a stack, rather than just a single technology to rule them all
Architecturally, there are other cases besides message queues where there's no reason for introducing another layer in the stack, once you have a database, other than just because SQL isn't anybody's favorite programming language. And that's the real reason there's a stack.
The problem is rarely cost, it's operational overhead.
Using SQS for a queue rather than my already-existing Postgres means that I have to:
- Write a whole bunch of IaC, figuring out the correct access policies
- Set up monitoring: figure out how to monitor, write some more IaC
- Worry about access control: I just increased the attack surface of my application
- Wire it up in my application so that I can connect to SQS
- Understand how SQS works, how to use its API
It's often worth it, but adding an additional moving piece into your infra is always a lot of added cognitive load.
You get exactly-once processing when you consume with pgmq and run your queue operations inside transactions in your Postgres database. I can't think of an easy way to get an equivalent on SQS without building something like an outbox.
This is very much the way I'm pushing in our internal development platform: I want to offer as little middlewares as possible, but as many as necessary. And ideally these systems are boring, established tech covering a lot of use cases.
From there, Postgres ended up being our relational storage for the platform. It is a wonderful combination of supporting teams by being somewhat strict (in a flexible way) as well as supporting a large variety of use cases. And after some grumbling (because some teams had to migrate off of SQL Server, or off of MariaDB, and data migrations were a bit spicy), agreement is growing that it's a good decision to commit to a DB like this.
We as the DB-Operators are accumulating a lot of experience running this lady and supporting the more demanding teams. And a lot of other teams can benefit from this, because many of the smaller applications either don't cause enough load on the Postgres Clusters to be even noticeable or we and the trailblazer teams have seen many of their problems already and can offer internally proven and understood solutions.
And like this, we offer relational storage, file storage, object storage and queues, and that seems to be enough for a lot of applications. We're only now, after a few years, adding OpenSearch as a service for search, vector storage and similar use cases.
On top of that, a lot of discourse seems to happen with an assumption that you only make the tech/stack choice once.
For the majority of apps, just doing basic CRUD with a handful of data types, is it that hard to just move to another DB? Especially if you're in framework land with an ORM that abstracts some of the differences, since your app code will largely stay the same.
The same argument of UNIX design patterns (single responsibility, well-defined interfaces and communication protocols) vs. monolithic design patterns comes up a lot. I think that's mainly because both are effective at producing products; it's just that they both have downsides.
I read a meme yesterday about how you can just interject "it's all about finding that balance" into any meeting and people will just agree with you. I'm gonna say it here.
Sometimes a flexible tool fits the bill well. Sometimes a specialized tool does. It's all about finding that balance.
Just noting that sometimes one can do both: separate Postgres DBs/clusters for different use cases, separate instances of a web server for TLS termination, caching, routing/rewriting (edit: and static asset serving). The benefit is an orderly architecture, and fewer different dependencies.
I'm one of the makers of ParadeDB, a modern alternative to Elasticsearch. We build Postgres extensions to do fast search (pg_bm25) and analytics (pg_analytics). I love Postgres. If you have a small workload, like a startup, it certainly makes sense to stay within Postgres as long as you can.
The problem is, at scale, Postgres isn't the answer to everything. Each of the workloads one can put in Postgres starts to grow very specific requirements, you need to isolate systems to get independent scaling and resilience, etc. At this point, you need a stack of specialized solutions for each requirement, and that's where Postgres starts to no longer be enough.
There is a movement to build a Postgres version of most components on the stack (we are a part of it), and that might be a world where you can use Postgres at scale for everything. But really, each solution becomes quite a bit more than Postgres, and I doubt there will be a Postgres-based solution for every component of the stack.
what is "at scale"? Is there a specific metric or range of metrics that raises a flag to begin considering something else? For example, in the olden days when it was my problem, page load times were the metric. Once it got high enough you looked for the bottleneck, solved it, and waited. When the threshold was broken again you re-ran the same process.
This bugs me every time performance comes up. No one is ever concrete, so they can never be wrong.
If Michael Jackson rose from the dead to host the Olympics opening ceremony and there were 2B tweets/second about it, then postgres on a single server isn't going to scale.
A crud app with 5-digit requests/second? It can do that. I'm sure it can do a lot more, but I've only ever played with performance tuning on weak hardware.
Visa is apparently capable of a 5-digit transaction throughput ("more than 65,000")[0], for a sense of what kind of system reaches even that scale. Their average throughput is more like 9k transactions/second[1].
The truth is that it really depends on your application work load. Is it read-heavy, or write-heavy? Are the reads more lookup-heavy (i.e. give me this one user's content), or OLAP heavy (i.e. `group by`'s aggregating millions of rows)? Is read-after-write an important problem in your application? Do you need to support real-time/"online" updates? Does your OLAP data need to be mutable, or can it be immutable (and therefore compressed, columnarized, etc.)? Is your schema static or dynamic, to what degree?
I agree with others that a good simplification is "how far can you get with the biggest single AWS instance"? And the answer is really far, for many common values of the above variables.
That being said, if your workload is more OLAP than OLTP, and especially if your workload needs to be real-time, Postgres will begin to give you suboptimal performance without maxing out I/O and memory. Hence "it really depends on your workload", and hence why it's common to "pair" Postgres with technologies like Clickhouse (OLAP, immutable, real-time) or RabbitMQ/Kafka/Redis (real-time, write-heavy, persistence secondary to throughput).
For me with any kind of data persistence backend, it's when you go from scaling vertically to horizontally. In other words, when it's no longer feasible to scale by just buying a bigger box.
I don't know that there is a canonical solution for scaling Postgres data for a single database across an arbitrary number of servers.
I know there is CockroachDB which scales almost limitlessly, and supports Postgres client protocol, so you can call it from any language that has a Postgres client library.
For scaling, has anyone here used hash based partitioning to scale horizontally?
In principle, seems like it should work to allow large scale distribution across many servers. But the actual management of replicas and deciding which servers to place partitions, redistributing when new servers are added, etc. could lead to a massive amount of operational overhead.
As other sibling comments noted, Citus does this pretty well. Recommend reading through their docs and use-cases. There's some migration/setup costs, but once you have a good configuration, it mostly just-works.
The main downside is that you either have to self-manage the deployment in AWS EC2 or use Azure's AWS-RDS-equivalent (CitusData was acquired by MS years ago).
FWIW, I've heard that people using Azure's solution are pretty satisfied with it, but if you're 100% on AWS going outside that fold at all might be a con for you.
Citus is indeed an example of "distributed Postgres". There are also serverless Postgres offerings (Neon, Nile, AWS Aurora) which do this.
If you are interested in partitioning in an OLAP scenario, this will soon be coming to pg_analytics, and some other Postgres OLAP providers like Timescale offer it already
Hash-based partitioning makes repartitioning very expensive; most distributed DBs now use key-range-based partitioning. IIRC, Dynamo, which introduced this concept, has also made the switch.
This makes a strong case, but I've decided to start every new project with sqlite and not switch until absolutely necessary. If Postgres is the 90% case, then sqlite is the 80% case and is also dead simple to get going and genuinely performant. So when vertical scaling finally fails me, I know I'll be at a wonderful place with what I'm building.
> [...] sqlite is the 80% case and is also dead simple to get going and genuinely performant.
I don't understand this. PostgreSQL is ALSO dead simple to get going, either locally or in production. Why not just start off at 90%?
I mean, I get there are a lot of use cases where sqlite is the better choice (and I've used sqlite multiple times over the years, including in my most recent gig), but why in general?
I think "dead simple" is not doing anyone any favors when it is being used to try to equate the simplicity of things.
It's obviously a lot simpler to just have a file, than to have a server that needs to be connected to, as long as we're still talking about running things on regular computers.
I'm not saying it's hard to set up Postgres locally, but sqlite is a single binary with almost no dependencies and no config, easily buildable from source for every platform you can think of. You can grab a single file from sqlite.org, and you're all set. Setting up Postgres is much more complicated in comparison (while still pretty simple in absolute terms - but starting with a relatively simpler tool doesn't seem like a bad strategy.)
Practically, because sqlite is good enough for one machine and compatible-enough with postgresql that you can use either pretty easily. One thing I wrote was an exactly-once stream processor that fetched events from a lot of remote systems for processing. Transaction-based queue in the DB to achieve exactly-once with recovery (remote systems accepted time-stamp resyncing of the stream of events). It works fine at small scale on a single machine for design and testing (local integration tests with short startup time are very valuable) but trivially scales to hundreds of workers if pointed at a postgres instance. The work to allow sqlite vs postgres was a single factory that returned a DB connection in Go based on runtime configuration.
It's also good practice for designing reasonably cross-database compatible schemas.
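The factory approach described above can be sketched in a few lines. Here's a hypothetical Python version (the `psycopg2` branch is an assumed dependency, only imported when selected; the parent's original was in Go):

```python
import sqlite3

def connect(cfg: dict):
    """Return a DB-API connection chosen by runtime configuration.

    Application code only touches the shared DB-API surface, so the
    backend can be swapped by configuration alone.
    """
    if cfg["driver"] == "sqlite":
        return sqlite3.connect(cfg.get("dsn", ":memory:"))
    if cfg["driver"] == "postgres":
        import psycopg2  # hypothetical dependency, only imported when selected
        return psycopg2.connect(cfg["dsn"])
    raise ValueError(f"unknown driver: {cfg['driver']}")

# Local tests run against in-memory SQLite; production config points at Postgres.
conn = connect({"driver": "sqlite"})
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("INSERT INTO jobs (payload) VALUES (?)", ("hello",))
row = conn.execute("SELECT payload FROM jobs").fetchone()
```

One real wrinkle the sketch glosses over: placeholder styles differ between drivers (`?` vs `%s`), which is where a thin query layer or ORM earns its keep.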
One use case where SQLite is a good option is for embedding as a local database in an app. Starting local-only with SQLite allows you to defer a lot of the backend effort while testing an MVP.
If "dead simple" involves me babysitting a service process, then it is not dead simple. SQLite has an embedded version that requires no out-of-process service. That's my definition of dead simple.
Postgres complicates the application in several ways. In particular, Postgres suffers from the n+1 problem, while SQLite does not. That requires a significant amount of added complexity in the application to hack around. Why over engineer the application before it has proven itself as something anyone even wants to use? Let's face it, the large majority of software written gets thrown away soon after it is created.
I already hear you saying that you know of a library that provides a perfect abstraction to hide all those details and complexities, making the choice between Postgres and SQLite just a flip of a switch away. Great! But then what does Postgres bring to the table for you to choose it over SQLite? If you truly prove a need for it in the future for whatever reason, all you need to do is update the configuration.
Even with a WAL or some kind of homegrown spooling, you're going to be limited by the rate at which one thread can ingest that data into the database.
One could always shard across multiple SQLite databases, but are you going to scale the number of shards with the number of concurrent write requests? If not, SQLite won't work. And if you do plan on this, you're in for a world of headaches instead of using a database that does concurrency on its own.
Don't get me wrong; SQLite is great for a lot of things. And I know it's nice to not have to deal with the "state" of an actual database application that needs to be running, especially if you're not an "infrastructure" team, but there's good reasons they're ubiquitous and so highly regarded.
I’m with you in general, but what about vector search?
It really feels like the DB industry has taken a huge step backward from the promise of SQL. Switching from Postgres to SQLite is easy because the underlying queries are at least similar. But as soon as you introduce embeddings, every system is totally different (and often changing rapidly).
Specialized vector indexes become important when you have a large number of vectors, but the reality of software is that it is unlikely that your application will ever be used at all, let alone reach a scale where you start to hurt. Computers are really fast. You can go a long way with not-perfectly-optimized solutions.
Once you have proven that users actually want to use your product and see growth on the horizon to where optimization becomes necessary, then you can swap in a dedicated vector solution as needed, which may include using a vector plugin for SQLite. The vector databases you want to use may or may not use SQL, but the APIs are never that much different. Instead of one line of SQL to support a different implementation you might have to update 5 lines of code to use their API, but we're not exactly climbing mountains here.
Know your problem inside and out before making any technical choices, of course.
There are vector search solutions for sqlite that basically work, so if my project doesn't rely on that 100% then I'm willing to use those as stop-gaps until it does.
Of course, if there's a shortcoming of sqlite that I know I need right out of the gate, that would be a situation where I start with postgres.
Vector similarity searches can be done with simple SQL statements, and while the performance will not match a vector db, it's often good enough. For a db like SQLite that can run in-memory, I suspect they would be reasonably fast.
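As a rough illustration of that point, SQLite lets you register an application-defined SQL function and brute-force a similarity search over a small table. Everything here (schema, function name, data) is made up for the sketch:

```python
import json
import math
import sqlite3

def cosine(a_json: str, b_json: str) -> float:
    """Cosine similarity between two JSON-encoded vectors."""
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

conn = sqlite3.connect(":memory:")
conn.create_function("cosine", 2, cosine)  # expose the function to SQL

conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, embedding TEXT)")
conn.executemany(
    "INSERT INTO docs (title, embedding) VALUES (?, ?)",
    [
        ("about cats", json.dumps([1.0, 0.0, 0.1])),
        ("about dogs", json.dumps([0.0, 1.0, 0.1])),
    ],
)

query = json.dumps([0.9, 0.1, 0.0])
best = conn.execute(
    "SELECT title FROM docs ORDER BY cosine(embedding, ?) DESC LIMIT 1",
    (query,),
).fetchone()
# best is the row whose embedding is most similar to the query vector
```

This is a full scan, so it's O(rows) per query, but for thousands of vectors in an in-memory database that's often fast enough until a dedicated index is justified.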
SQL was introduced in 1970s. Considering vector search was only adopted in the last 5 years, I’m not surprised by the lack of standards on vector API. At Google embedding as retrieval became popular in 2019-2020.
This is the new kid in town, so you'll soon see all major SQL DBs support vectors. However, any serious user, O(10M) vectors or above, would still require a dedicated vector DB for performance reasons.
As a hardcore C++ guy, I recently switched to a company heavily into databases. I'd never had contact with databases before. And I'd like to go one step further: why databases?
I come from an industry that heavily uses custom binary file formats. And I'm still bewildered by the world of databases. They seem to solve many issues on the surface, but not really in practice. The heavy limitations on data types, the update disasters, the incompatibility between different SQL engines, etc. all make it seem like an awful idea. I get the interop benefits, and maybe with extreme data volumes. But for anything else, what's the point? Genuinely asking.
> They seem to solve many issues on the surface, but not really in practice
I believe you haven't had a chance to work on problems that require an actual database. Multi-user access, ACID support, unified APIs (ODBC/JDBC), a common query language... all of these would take many man-years to set up properly with a custom solution.
> the update disasters, the incompatibility between different SQL engines etc all make it seem like an awful idea
What update disasters? If you mean updating database versions, these aren't things you do frequently, because the database is expected to be running 24/7. But Postgres and MySQL are already rock solid here. Wrt SQL engine incompatibilities, you usually settle on a single database vendor in practice. If you suddenly start to switch databases in the middle of the project, something needs to be fixed in the process design, not the database.
All these comments seem to fuel my suspicion that we in fact shouldn't use databases, because we don't use any of these features. We just use them as external data storage for a single application. And not even that much data, like <10 gigs.
But the updating I would have expected to go more smoothly. If you make a point of using software dedicated to managing data, I sure as hell would expect an update to go so smoothly that I don't have to worry about it or even notice it. In reality, updates more often than not seem to come with undocumented errors. That is a constant source of frustration for me.
It gives you an easy, high-level way to use high performance data-structures and algorithms. You don't need to explicitly write or rewrite code to maintain hash maps or b-trees or whatever and to use the right structures for fast lookups from one set of data to another. You just say "CREATE INDEX name ON table USING HASH(column)", and from then on, your hash map will maintain itself, and any lookups that would benefit from that hash map will automatically use it. No need to rewrite any of your code that needs to work with that column. In some cases, it will also automatically do things like make temporary hashmaps or bitmaps for just the life of a query to speed it up.
You mention further down a "mystery box of performance", but if you understand what data structures it's using and how it uses them, then it's generally pretty straightforward. Mostly you can reason about what indices are available and how trees work (and e.g. whether it can walk two trees side-by-side to join data) to know what query plan it should make, and you can ask it to tell you what plan it makes and which indices it used. Likewise, if you have a query plan you want it to run (loop over this table, then use this column to look up in this table, etc.), you'll know what indices are needed to support that plan.
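That "ask it what plan it makes" workflow exists even in embedded SQLite (Postgres's `EXPLAIN` is the richer equivalent, and SQLite indexes are b-trees rather than hash maps). A small sketch with made-up table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE INDEX users_email ON users (email)")

# Ask the planner what it would do, without executing the query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?", ("a@example.com",)
).fetchall()
detail = plan[0][-1]
# detail reads along the lines of: SEARCH users USING INDEX users_email (email=?)
```

Drop the index and the same query plans as a full scan; nothing in the application code had to change, which is exactly the point the comment above makes.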
If people struggle with using a database correctly, they're really going to struggle with using something like a b-tree in a way where you don't corrupt your data in the event of a power loss or crash, or in a way where multiple threads don't clobber each other's updates or create weird in-between states (or you just use a global lock, but then you lose performance).
It's pretty opaque to me. If something in my cpp code is dodgy or runs slow, I can use a number of debugging and profiling tools. But I don't really even know how databases work on the inside, let alone profile or diagnose them. To this day my colleagues rewrite equivalent SQL statements because some run better than others. And we regularly run into unexpected latency spikes where most of the time a statement runs fine, but every nth time it's several orders of magnitude slower, and no one knows why. So we cobble code and caches around things. It all seems pretty mind-boggling to me.
Read the original relational database paper by Codd. The problem is that trees of data are not flexible for querying; you really need a graph. If you draw it out on paper, you will find a relation is a very efficient way to store general graphs. I'm actually surprised to hear this from a C++ engineer, because sorting and searching columns of data is the name of the game in C++.
Another key problem they solve is separating the logical representation from the on-disk representation. You can evolve and optimize storage without breaking anything.
The other problem with files is that you have a disconnect between in memory and on disk. You have to constantly serialize and deserialize. Sqlite has quite a bit of info on this: https://www.sqlite.org/appfileformat.html
But that's the thing: all that is nice on paper. Of course the relational nature is nice. Of course the disk representation is nice. But I never feel like it's worth the trouble day in, day out. The costs, the mystery box of performance, the insane statements, the extra hardware, the updates, the strange data types, the box of tricks everyone needs to make them behave. But I've been made aware that our use cases suck, so I'll attribute most of it to that.
OK, I get that, but at least our applications usually only have one thing (one thread of one process on one machine) operating on the database. So maybe it's just our silly use case.
Over the years, I've seen a lot of poorly made binary formats used by various programs. Databases like PostgreSQL are where such binary formats have evolved through benchmarking and testing against edge cases, and they are generally better than whatever you will find in the bespoke formats developed for a particular program.
You have to be really good at understanding how storage works, have a lot of time and resources to develop your own program to beat something like PostgreSQL. They have zillions of man-hours on you when you start, experience and knowledge. It's not impossible that you could find a case where a bespoke format would beat an established storage product, but over a range of cases, you most likely won't.
And, of course, there's a convenience aspect. Outside of niche technologies, SQL offers the richest language for data description and operations on data.
As for the inconsistencies between SQL implementations: in practice, it matters very little: most programs will never migrate between different SQL implementations anyways.
As for upgrades: it's a double-edged sword. You get a very expressive data format, and it's not surprising that it's hard to upgrade. But try to match the abilities of SQL in your own format, and you'll probably find that it's hard to have generic tools for upgrading it too.
None of this means that you cannot do better than SQL databases. It'd be ridiculous to think there could be a way to prove that SQL is somehow the best we can get. It's just that it's very hard to do better, especially if you want a universal tool.
It's a decent question. I believe there's a simple answer that explains a large part of the reason people choose databases over flat files. People want ACID compliance _and_ data distribution, and databases are traditionally the only things that provide both.
Instead of writing your own custom binary file format, you could and should use SQLite; it has many benefits. You get ACID without any cost. It's a single-file database you can access through a lib statically linked into your application, and that's it.
Besides what others are saying here, data integrity is a lot easier with a relational database if your data is heavily relational. I have managed systems with <1gb of data that would have been much harder to build without primary and foreign keys
About data types, what limitations are you referring to? jsonb is well supported in many dbs and throwing random large binary blobs in the middle with your normal data is a bad idea with or without a relational database.
I can see the appeal to replace noSQL solutions though, a lot of people are using S3 (and other storage solutions) as a makeshift database lately
I think my WTF moment was that you can't have real variable-length strings. And I'm pretty pampered by C++'s highly nuanced numeric types with varying sizes and signed/unsigned variants.
Having written a small, bespoke query engine (why? long story not for today, but mainly it was super-optimized for a very particular type of query, and very fast, with 10-50us average query times), I don't ever want to do it again except if -and only if- the performance we can wring out of it is absolutely essential and otherwise unobtainable. I'd rather use an off-the-shelf query engine any day and every day.
I would happily work on query engines, it's just I don't recommend using bespoke ones unless absolutely necessary.
> > Wouldn't it be nice to be able to write a trigger that called a web service?
> Rather you than me.
Yes, don't do that. Instead consume notification streams or logical replication streams and act on those -- sure, you'll now have an eventually consistent system, but you won't have a web service keeping your transactions from making progress. You don't want to engage in blocking I/O in your triggers.
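One way to get the effect people want from such a trigger without blocking I/O in the transaction: have the trigger emit a cheap pg_notify, and do the slow web call in a separate listener process. This is only a sketch, not the extension's API; the table and channel names are invented, and the listener assumes a psycopg2 connection in autocommit mode.

```python
import select

# Sketch: keep the trigger cheap (pg_notify only); do blocking I/O elsewhere.
# "orders" and "order_events" are hypothetical names.
TRIGGER_SQL = """
CREATE FUNCTION notify_order() RETURNS trigger AS $$
BEGIN
    -- queues a notification; no network I/O inside the transaction
    PERFORM pg_notify('order_events', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_notify
AFTER INSERT ON orders
FOR EACH ROW EXECUTE FUNCTION notify_order();
"""

def listen_forever(conn, handle):
    """Worker loop. `conn` is assumed to be a psycopg2 connection in
    autocommit mode; `handle` is your callback (e.g. the web-service call).
    Failures can be retried here without stalling any transaction."""
    with conn.cursor() as cur:
        cur.execute("LISTEN order_events;")
    while True:
        # wait until the socket is readable, then drain notifications
        if select.select([conn], [], [], 5.0) == ([], [], []):
            continue
        conn.poll()
        while conn.notifies:
            note = conn.notifies.pop(0)
            handle(note.payload)  # eventually consistent by design
```

The trade-off is as stated above: the worker sees the event after commit, so the system is eventually consistent, but a slow or down web service can never hold a transaction open.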
Love postgres and use it extensively. However, there's always an issue when I start doing the more advanced stuff: how do I combine that with all my years of experience with version control, code reviews, types, tests, static analysis and all the niceties of coding in general?
It's not a great DX for sure. Once you start stuffing a ton of application logic in Postgres triggers (or worse split logic between your application layer and triggers) your system will explode in complexity really fast. Developers will insert a row in an inconspicuous table and things will break in ways that seem mysterious because it's not obvious, looking at the application code, why the system is misbehaving.
There are commands in most postgres clients, even psql, to view the _currently_ defined functions... but when you go to debug those you will have to look through migrations to see how the function came into its current state... bisecting through history here is not very useful since each change to the function is a new file. I think this can be fixed though and made much easier, it's just not there yet.
In general I don't think the developer tooling is up to par to push very much of your application logic into postgres itself. I recommend using triggers for consistency and validation or table-local updates (ie: timestamps, audit logs) but keep process-oriented behaviour (iow: when this happens, then that, else this, wait for call and insert here, etc) in the application layer (chasing a cascade of triggers is not fun and quite annoying).
... all that being said, you can do unit testing in Postgres. And there is decent support for languages other than pgSQL (ie: javascript, ocaml, haskell, python, etc). It's possible to build dev tooling that would be suitable to make Postgres itself an application development platform. I'm not aware of anyone who has done it yet.
However, I do struggle with one big issue: changing database logic (views, functions, etc) that has other logic dependent on it. This seems like a solvable problem.
There is a bit of tooling needed, but it is already around. For Java, for example, I had a very good experience with a combination of flyway [1] for migrations, testcontainers [2] for making integration tests as easy as unit tests, and querydsl [3] for a query and mapping layer.
You might be curious to look up how Alembic and Django accomplish migrations. Even if Python's not your jam, the concepts/semantics are language/framework agnostic.
Basically, you store each migration in a file, and you "squish" the migration history down to a table definition once you've decided you're happy with the change and it's been applied across all the different deployment environments.
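A toy version of that bookkeeping, assuming invented names throughout (this is the general idea behind Alembic/Django migrations, not their actual implementation), might look like:

```python
import sqlite3

# Hypothetical minimal migration runner: an ordered list of scripts plus a
# table recording which ones have been applied.
MIGRATIONS = [
    ("0001_create_users", "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    ("0002_add_email", "ALTER TABLE users ADD COLUMN email TEXT"),
]

def migrate(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)")
    applied = {r[0] for r in conn.execute("SELECT name FROM schema_migrations")}
    for name, sql in MIGRATIONS:
        if name not in applied:
            with conn:  # each migration applies atomically, then is recorded
                conn.execute(sql)
                conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (name,))

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # idempotent: already-applied migrations are skipped
cols = [r[1] for r in conn.execute("PRAGMA table_info(users)")]
print(cols)  # ['id', 'name', 'email']
```

"Squishing" then just means replacing the list of scripts with a single consolidated schema script once every environment's `schema_migrations` table is up to date.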
RDBMS's are still very much an 80s thing. Great at dynamic queries, horrible at managing changes in an immutable, atomic, versioned way. The structure and relationship of data is so brittle it's kind of crazy it's still used. Probably it's because so many people have grown dependent on its particular flaws; the incumbent just lumbers on.
There’s always a trade off though. Append-only storage models are great for use cases where they are truly necessary, but they also store substantially more data than mutable databases do, and designing for something like Datomic carries a heavier cognitive cost than a simple RDBMS.
Was talking to a coworker yesterday about a spectrum of where code lives, and the difference between where I started and where I am now in understanding it.
Right after college, backend web dev was fully in a scripting language, Python or Ruby, with ORMs that completely fogged over where any of the data was stored. Rails and ActiveRecord are so good at shrouding the database that you type commands that create databases and you never see them. Classes are written to describe what we want the data to look like and poof! SQL commands are created to build the schema that we never need to see. On this end of the spectrum, the scripting language will stay the same, but we want to be agnostic to where the data is stored.
On the other end of the spectrum, Postgres is enough. More than enough. Like in the link, it can do all the tasks you ever care about. The code you're writing for the backend / data is about the data, not about the script. We care where it's stored, that its structure is clear, and that reads and updates are efficient. We can write all the statements in SQL to create tables, functions, triggers, queues, and efficient read queries with indexes to make the data come back to the scripting language in the exact form that's wanted. On this end, we know and optimize how the data is stored, agnostic to the scripting language that uses the data.
I went from the first end of the spectrum to the second. Everything can be done in Postgres. Auditability, clarity, efficiency are much better there than in Python, is my position. The only thing holding it back is that people don't see development from the data side yet, and if you're deciding on tech, it's not easy to pick a tech that people aren't yet as good at developing in. There are no Postgres bootcamps right now.
But there's more and more adoption of this that I'm seeing, and the money and development going into Postgres lead me to trust that it'll be around a very long time, only getting better. Posts about the power of databases, Postgres and some SQLite for example, are becoming more and more common. It's a cool change to follow and watch grow.
This was similar to my journey. I'm a self-taught developer and was so green when I learned Rails that I didn't even understand there was such a thing as SQL behind the ORM. It took some grey-beard .NET folks to share the power of the database.
Hasn't steered me wrong yet. Every once in a while I'm tempted to try to use Postgres for Pub/Sub, but then I realize that I need Redis for caching and sidekiq anyways, and Redis is amazing too, so why bother.
If you're open to Elixir (you'll like it coming from Ruby) then you don't even need Redis. Oban + Postgres for jobs, WalEx for database events, Nebulex for distributed caching. It simplifies things so much (and is cheaper to run).
I keep hearing about it, just never got around to actually doing anything with it...but I might have that itch now. I'll look into these things thanks.
Our company has spend a lot of resources optimizing these kinds of queries in our application. It's been a bit of a struggle since we also use GraphQL, but we've been doing alright. Partial indexes and liberal use of jsonb have been doing a lot of heavy lifting for us. Our application is extremely read heavy, so the performance hit from updating lots of indexes hasn't been an issue for us (yet).
So over the years these juniors have repeatedly chosen different tech for their applications. Now the team maintains like 15-20 different apps and among them there's react, Vue, angular, svelte, jQuery, nextjs and more for frontends alone. Most use Episerver/Optimizely for backend but of course some genius wanted to try Sanity so now that's in the mix as well.
And it all reads like juniors built it. One app has an integration with a public api, they built a fairly large integration app with an integration db. This app is like 20k lines of code, much of which is dead code, and it gets data from the public api twice a day whereas the actual app using the data updates once a day and saves the result in its own Episerver db. So the entire thing results in more api traffic rather than less, the app itself could have just queried the api directly.
But they don't want me to do that, they just want me to fix the redundant integration thing when it breaks instead. Glad I'm not on that team any more.
If you use a stack meant for a huge userbase (with all the tradeoffs that comes with it) but you're still trying to find market fit, you're in for a disappointment
Similarly, if you use a stack meant for smaller projects while having thousands of users relying on you, you're also in for a disappointment.
It's OK to make a choice in the beginning based on the current context and environment, and then change when it no longer makes sense. Doesn't even have to be "technical debt", just "the right choice at that moment".
Yep. And Postgres is a really good choice to start with. Plenty of people won't outgrow it. Those who do find it's not meeting some need will, by the time they need to replace it, have a really good understanding of what that replacement looks like in detail, rather than just some hand-wavy "web scale".
There are good reasons to go OLAP or graph for certain kinds of problems, but think carefully before adding more services and technologies because stuff has a tendency to go in easily but nothing ever leaves a project and you will inevitably end up with a bloated juggernaut that nobody can tame. And it's usually those people pushing the hardest for new technologies that are jumping into new projects when shit starts hitting the fan.
If a company survives long enough (or cough government), a substantial and ever increasing amount of time, money and sec/ops effort will go into those dependencies and complexity cruft.
Most systems are still going to need Redis involved just as a coordinator for other pub/sub related work unless you're using a stack that can handle it some other way (looking at BEAM here).
But there are always going to be scenarios as an application grows where you'll find a need to scale specific pieces. Otherwise though, PostgreSQL by itself can get you very, very far.
It's also worth noting that by using PG as a message queue, you can do something that's nearly impossible with other queues - transactionally enqueue tasks with your database operations. This can dramatically simplify failure logic.
On the other hand, it also means replacing your message queue with something more scalable is no longer a simple drop-in solution. But that's work you might never have to do.
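The transactional-enqueue pattern is easy to show. Below is a minimal sketch with an invented schema; SQLite stands in for Postgres purely to demonstrate the transaction semantics (in Postgres a worker would consume the jobs table with `SELECT ... FOR UPDATE SKIP LOCKED`).

```python
import sqlite3

# Sketch: the job row commits (or rolls back) together with the business
# write, so you can't get an order without its job, or an orphaned job.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, kind TEXT, order_id INTEGER);
""")

def place_order(total):
    with conn:  # one transaction covers both writes
        cur = conn.execute("INSERT INTO orders (total) VALUES (?)", (total,))
        conn.execute("INSERT INTO jobs (kind, order_id) VALUES (?, ?)",
                     ("send_confirmation_email", cur.lastrowid))

place_order(99.50)

try:
    with conn:
        conn.execute("INSERT INTO orders (total) VALUES (?)", (10.0,))
        raise RuntimeError("payment failed")  # whole transaction rolls back
except RuntimeError:
    pass

counts = (conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0],
          conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0])
print(counts)  # (1, 1)
```

With a separate queue like SQS you'd instead need an outbox table or idempotent consumers to paper over the gap between "row committed" and "message sent".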
https://www.postgresql.org/docs/current/sql-notify.html
Some applications never grow that much.
See, now that we're profitable, we're gonna become a _scale up_, go international, hire 20 developers, turn everything into microservices, rewrite the UI our customers love, _not_ hire more customer service, get more investors, get pressured by investors, hire extra c-levels, lay-off 25 developers and the remaining customer service, write a wonderful journey post.
The future is so bright!
Don't use it any more than you have to for your application. Other than network I/O, it's the slowest part of your stack.
Architecturally, there are other cases besides message queues where there's no reason for introducing another layer in the stack, once you have a database, other than just because SQL isn't anybody's favorite programming language. And that's the real reason there's a stack.
Among the most widely underrated AWS services are SNS and SES, and they are not a bad choice even if you're not using AWS for compute and storage.
Using SQS for a queue rather than my already-existing Postgres means that I have to:
- Write a whole bunch of IaC, figuring out the correct access policies
- Set up monitoring: figure out how to monitor, write some more IaC
- Worry about access control: I just increased the attack surface of my application
- Wire it up in my application so that I can connect to SQS
- Understand how SQS works, how to use its API
It's often worth it, but adding an additional moving piece into your infra is always a lot of added cognitive load.
From there, Postgres ended up being our relational storage for the platform. It is a wonderful combination of supporting teams by being somewhat strict (in a flexible way) as well as supporting a large variety of use cases. And after some grumbling (because some teams had to migrate off of SQL Server, or off of MariaDB, and data migrations were a bit spicy), agreement is growing that it's a good decision to commit to a DB like this.
We as the DB-Operators are accumulating a lot of experience running this lady and supporting the more demanding teams. And a lot of other teams can benefit from this, because many of the smaller applications either don't cause enough load on the Postgres Clusters to be even noticeable or we and the trailblazer teams have seen many of their problems already and can offer internally proven and understood solutions.
And like this, we offer a relational storage, file storage, object storage and queues and that seems to be enough for a lot of applications. We're only now adding in Opensearch after a few years as a service now for search, vector storage and similar use cases.
For the majority of apps, just doing basic CRUD with a handful of data types, is it that hard to just move to another DB? Especially if you're in framework land with an ORM that abstracts some of the differences, since your app code will largely stay the same.
Sometimes a flexible tool fits the bill well. Sometimes a specialized tool does. It's all about finding that balance.
Thank you for coming to my TED talk.
The problem is, at scale, Postgres isn't the answer to everything. Each of the workloads one can put in Postgres start to grow into very specific requirements, you need to isolate systems to get independent scaling and resilience, etc. At this point, you need a stack of specialized solutions for each requirement, and that's where Postgres starts to no longer be enough.
There is a movement to build a Postgres version of most components on the stack (we are a part of it), and that might be a world where you can use Postgres at scale for everything. But really, each solution becomes quite a bit more than Postgres, and I doubt there will be a Postgres-based solution for every component of the stack.
Is there an equivalent for postgres?
If Michael Jackson rose from the dead to host the Olympics opening ceremony and there were 2B tweets/second about it, then postgres on a single server isn't going to scale.
A crud app with 5-digit requests/second? It can do that. I'm sure it can do a lot more, but I've only ever played with performance tuning on weak hardware.
Visa is apparently capable of a 5-digit transaction throughput ("more than 65,000")[0], for a sense of what kind of system reaches even that scale. Their average throughput is more like 9k transactions/second[1].
[0] https://usa.visa.com/solutions/crypto/deep-dive-on-solana.ht...
[1] PDF. 276.3B/year ~ 8.8k/s: https://usa.visa.com/dam/VCOM/global/about-visa/documents/ab...
I agree with others that a good simplification is "how far can you get with the biggest single AWS instance"? And the answer is really far, for many common values of the above variables.
That being said, if your workload is more OLAP than OLTP, and especially if your workload needs to be real-time, Postgres will begin to give you suboptimal performance without maxing out I/O and memory usage. Hence, "it really depends on your workload", and hence why you commonly see Postgres "paired" with technologies like Clickhouse (OLAP, immutable, real-time) or RabbitMQ/Kafka/Redis (real-time, write-heavy, persistence secondary to throughput).
I don't know that there is a canonical solution for scaling Postgres data for a single database across an arbitrary number of servers.
I know there is CockroachDB which scales almost limitlessly, and supports Postgres client protocol, so you can call it from any language that has a Postgres client library.
In principle, seems like it should work to allow large scale distribution across many servers. But the actual management of replicas and deciding which servers to place partitions, redistributing when new servers are added, etc. could lead to a massive amount of operational overhead.
Main downside is that you have to either self-manage the deployment in AWS EC2 or use Azure's AWS-RDS-equivalent (CitusData was acquired by MS years ago).
FWIW, I've heard that people using Azure's solution are pretty satisfied with it, but if you're 100% on AWS going outside that fold at all might be a con for you.
If you are interested in partitioning in an OLAP scenario, this will soon be coming to pg_analytics, and some other Postgres OLAP providers like Timescale offer it already
I've gone far out of my way not to use Elasticsearch and to push Postgres as far as I can in my SaaS, because I don't want the operational overhead.
I don't understand this. PostgreSQL is ALSO dead simple to get going, either locally or in production. Why not just start off at 90%?
I mean, I get there are a lot of use cases where sqlite is the better choice (and I've used sqlite multiple times over the years, including in my most recent gig), but why in general?
It's obviously a lot simpler to just have a file, than to have a server that needs to be connected to, as long as we're still talking about running things on regular computers.
I'm not saying it's hard to set up Postgres locally, but sqlite is a single binary with almost no dependencies and no config, easily buildable from source for every platform you can think of. You can grab a single file from sqlite.org, and you're all set. Setting up Postgres is much more complicated in comparison (while still pretty simple in absolute terms - but starting with a relatively simpler tool doesn't seem like a bad strategy.)
It's also good practice for designing reasonably cross-database compatible schemas.
I already hear you saying that you know of a library that provides a perfect abstraction to hide all those details and complexities, making the choice between Postgres and SQLite just a flip of a switch away. Great! But then what does Postgres bring to the table for you to choose it over SQLite? If you truly prove a need for it in the future for whatever reason, all you need to do is update the configuration.
Even with a WAL or some kind of homegrown spooling, you're going to be limited by the rate at which one thread can ingest that data into the database.
One could always shard across multiple SQLite databases, but are you going to scale the number of shards with the number of concurrent write requests? If not, SQLite won't work. And if you do plan on this, you're in for a world of headaches instead of using a database that does concurrency on its own.
Don't get me wrong; SQLite is great for a lot of things. And I know it's nice to not have to deal with the "state" of an actual database application that needs to be running, especially if you're not an "infrastructure" team, but there's good reasons they're ubiquitous and so highly regarded.
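For the curious, the sharding approach described above can be sketched in a few lines. Names and shard count are invented, and in-memory databases stand in for one file per shard; note that the shard count is fixed up front, which is exactly where the headaches start.

```python
import sqlite3
import zlib

# Sketch: route each key to one of N SQLite databases by hashing the key.
N_SHARDS = 4
shards = [sqlite3.connect(":memory:") for _ in range(N_SHARDS)]
for s in shards:
    s.execute("CREATE TABLE events (key TEXT, value TEXT)")

def shard_for(key: str) -> sqlite3.Connection:
    # crc32 is stable across processes, unlike Python's built-in hash()
    return shards[zlib.crc32(key.encode()) % N_SHARDS]

def write(key, value):
    with shard_for(key) as s:  # commits on success
        s.execute("INSERT INTO events VALUES (?, ?)", (key, value))

def read(key):
    return [v for (v,) in shard_for(key).execute(
        "SELECT value FROM events WHERE key = ?", (key,))]

write("user:42", "logged_in")
print(read("user:42"))  # ['logged_in']
```

Everything that makes this hard in production (cross-shard queries, rebalancing when N changes, transactions spanning shards) is absent from the sketch, which is the commenter's point: a server database gives you that concurrency story for free.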
That will carry most early stage applications really far.
It really feels like the DB industry has taken a huge step backward from the promise of SQL. Switching from Postgres to SQLite is easy because the underlying queries are at least similar. But as soon as you introduce embeddings, every system is totally different (and often changing rapidly).
Specialized vector indexes become important when you have a large number of vectors, but the reality of software is that it is unlikely that your application will ever be used at all, let alone reach a scale where you start to hurt. Computers are really fast. You can go a long way with not-perfectly-optimized solutions.
Once you have proven that users actually want to use your product and see growth on the horizon to where optimization becomes necessary, then you can swap in a dedicated vector solution as needed, which may include using a vector plugin for SQLite. The vector databases you want to use may or may not use SQL, but the APIs are never that much different. Instead of one line of SQL to support a different implementation you might have to update 5 lines of code to use their API, but we're not exactly climbing mountains here.
Know your problem inside and out before making any technical choices, of course.
Of course, if there's a shortcoming of sqlite that I know I need right out of the gate, that would be a situation where I start with postgres.
https://supabase.com/docs/guides/database/extensions/pgvecto...
This is the new kid in town, so you'll soon see all major SQL DBs support vectors. However, any serious user, O(10M) vectors or above, would still require a dedicated vector DB for performance reasons.
I come from an industry that heavily uses custom binary file formats. And I'm still bewildered by the world of databases. They seem to solve many issues on the surface, but not really in practice. The heavy limitations on data types, the update disasters, the incompatibility between different SQL engines etc all make it seem like an awful idea. I get the interop benefits, and maybe with extreme data volumes. But for anything else, what's the point? Genuinely asking
I believe you haven't had a chance to work on problems that require an actual database. Multi-user access, ACID support, a unified API (ODBC/JDBC), a common query language... all of these would take many man-years to get right with a custom solution.
> the update disasters, the incompatibility between different SQL engines etc all make it seem like an awful idea
What update disasters? If you mean updating database versions, these aren't things you do frequently, because the database is expected to be running 24/7. But Postgres and MySQL are already rock solid here. Wrt SQL engine incompatibilities, you usually settle on a single database vendor in practice. If you suddenly start to switch databases in the middle of a project, something needs to be fixed in the process design, not the database.
But the updating I would have expected to go more smoothly. If you make a point of using software dedicated to managing data, I sure as hell would expect an update to go so smoothly that I don't have to worry about it or even notice it. In reality, updates more often than not seem to come with undocumented errors. That is a constant source of frustration for me.
You mention further down a "mystery box of performance", but if you understand what data structures it's using and how it uses them, then it's generally pretty straightforward. Mostly you can reason about what indices are available and how trees work (and e.g. whether it can walk two trees side-by-side to join data) to know what query plan it should make, and you can ask it to tell you what plan it makes and which indices it used. Likewise, if you have a query plan you want it to run (loop over this table, then use this column to look up in this table, etc.), you'll know what indices are needed to support that plan.
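That reasoning loop (predict the plan, then ask the engine what it actually chose) is easy to demonstrate. Here it is with SQLite's EXPLAIN QUERY PLAN; Postgres has EXPLAIN / EXPLAIN ANALYZE for the same purpose. Table and index names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

def plan(sql):
    # The human-readable plan detail is the last column of each row
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# No index on email yet: the planner has no choice but a full table scan
print(plan("SELECT * FROM users WHERE email = 'a@b.c'"))

conn.execute("CREATE INDEX users_email ON users (email)")

# Now the planner can search via the index, as we'd predict from the b-tree
print(plan("SELECT * FROM users WHERE email = 'a@b.c'"))
```

If the second plan doesn't mention the index you expected, the mystery box becomes a concrete question: why did the planner reject it?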
If people struggle with using a database correctly, they're really going to struggle with using something like a b-tree in a way where you don't corrupt your data in the event of a power loss or crash, or in a way where multiple threads don't clobber each other's updates or create weird in-between states (or you just use a global lock, but then you lose performance).
Another key problem they solve is separating the logical representation from the on-disk representation. You can evolve and optimize storage without breaking anything.
The other problem with files is that you have a disconnect between in memory and on disk. You have to constantly serialize and deserialize. Sqlite has quite a bit of info on this: https://www.sqlite.org/appfileformat.html
Because having hundreds or thousands of concurrent file handles across a data center is kind of hard.
You can check these links:
https://sqlite.org/appfileformat.html
https://sqlite.org/aff_short.html
I think you'll come around eventually.
Excerpt from the motivation of "PostgreSQL HTTP Client":
> Wouldn't it be nice to be able to write a trigger that called a web service?
Rather you than me.
Migrations?
https://www.jetbrains.com/help/idea/database-tool-window.htm...
Popular tooling like Phoenix, Hasura, etc have good built in migration stories.
https://www.bytebase.com looks really promising.
[1] https://github.com/flyway/flyway
[2] https://java.testcontainers.org/modules/databases/postgres/
[3] https://github.com/querydsl/querydsl
It's not perfect, but it works reasonably well.
Not in my bubble; it has been fully .NET and Java since 2001, with the exception of a couple of services written in C++.
Postgres, Redis, S3