Streaming data to clients at the edge (apps or services) is a hard problem. We built an approach that keeps Postgres at the center and allows clients to consume a stream of data that is already in Postgres without any additional moving parts.
This works with read-replica-style scaling or with Postgres flavours that support scaling out easily (e.g. Cosmos Postgres, Yugabyte & Cockroach coming soon).
I think there’s a huge unacknowledged gap in the database industry for good streaming products.
If we are building a report or dashboard that we pull up a few times a day then a pull based model where we query the database on page load is fine.
For almost anything else such as an app, a microservice, an alerting system, a web page, a dashboard, we want to be able to update it in near real time for the user experience. Receiving a stream of query results is by far the easiest way to do this.
Polling is obviously a poor interim solution.
I think streaming will be a huge story in data over the next decade. The products are coming through now which is a start.
Materialize gives a vaguely PostgreSQLish user land on top of this idea. Obviously Postgres has LISTEN/NOTIFY which are fine for most places you need a queue and push updates but the key is incremental view updates so you get instant results on complex queries.
I'm generally pretty happy to pay for open source software, but licensing like this is just too risky. I need to be able to experiment with something at scale, in production, before I start paying someone.
I wouldn't quite say that this need is unacknowledged. Most DBs have some kind of streaming interface for actual data contents (not queries, though), which for instance is leveraged by Debezium to expose change event streams in one uniform event format (disclaimer: I work on Debezium). Going forward, I expect that these capabilities will be substantially built out, also providing streaming query support (incrementally updated materialized views), not locked into separate licensing layers or bolted on as a more-or-less complex afterthought, but as a primary part of database offerings, just as SELECT is today.
Streaming has been the story for the past couple of decades already. But once you accept the value of an event-first architecture, the benefits of traditional RDBMSes become much weaker.
It’s been discussed for a long time, but the ecosystem is still spotty. Many databases have some kind of “live query” feature for instance but with limitations or not intended for production use.
A lot of work has taken place in the Kafka, Flink, DataFlow ecosystem but that still leaves a lot of work for the developer over a simple subscribe to query results.
I do think a lot of work has been done, but it all needs to move up a few levels of abstraction.
Maybe if you're a really big company with petabytes of data, but I wouldn't say it's worth the operational complexity for 90% of tech companies out there. Seriously, design your schema so you can have tables optimized for aggregate functions. You can use materialized views or have them populated via triggers. Either is still going to be an order of magnitude less work than dealing with all the edge cases that are going to come up when you try to build a distributed data pipeline. That's death by a thousand cuts for any small to medium startup that doesn't already have a team of engineers experienced in making such systems work.
Plus you can stream db changes from postgres to kafka for those edge cases where you really need it
TLDR: if you're in a startup and thinking of building a distributed system.. DONT. stick with a monolith and spin out services as NEEDED.
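To make the "populate via triggers" option above concrete, here is a minimal sketch. It uses SQLite through Python's standard library purely so it runs anywhere; that is an assumption for illustration, and in Postgres the trigger would be a PL/pgSQL function rather than an inline body. The table and trigger names (`orders`, `order_totals`, `orders_after_insert`) are hypothetical.

```python
import sqlite3

# SQLite stands in for Postgres here (assumption for the sake of a runnable sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    id       INTEGER PRIMARY KEY,
    customer TEXT    NOT NULL,
    amount   INTEGER NOT NULL
);

-- One row per customer, kept current by the trigger below,
-- so reads never have to scan and aggregate the orders table.
CREATE TABLE order_totals (
    customer    TEXT    PRIMARY KEY,
    order_count INTEGER NOT NULL,
    total       INTEGER NOT NULL
);

CREATE TRIGGER orders_after_insert AFTER INSERT ON orders
BEGIN
    -- Make sure a totals row exists, then bump it.
    INSERT OR IGNORE INTO order_totals (customer, order_count, total)
    VALUES (NEW.customer, 0, 0);
    UPDATE order_totals
       SET order_count = order_count + 1,
           total       = total + NEW.amount
     WHERE customer = NEW.customer;
END;
""")

conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 10), ("bob", 5), ("alice", 7)],
)
conn.commit()

# The aggregate is a plain primary-key lookup, already up to date.
totals = conn.execute(
    "SELECT customer, order_count, total FROM order_totals ORDER BY customer"
).fetchall()
print(totals)
```

This only handles inserts; updates and deletes on `orders` would need their own triggers, which is where the real maintenance cost of this approach shows up.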
So the question is: Why have two copies of your data, two products to learn and monitor and operate, write boilerplate to move data between the DBs, etc.?
A "message queue" comes down to being another index on your table/set of tables ordered by a post-commit sequence number. These are things all SQL DBs have already, it just lacks a bit of exposing/packaging to be as convenient to use as a messaging queue.
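A rough sketch of the "queue is just an ordered index" idea, again using SQLite from Python's standard library as a stand-in (an assumption, not how any particular product does it). The `events` table and the `publish`/`poll` helpers are hypothetical names. One caveat worth stating loudly: a single SQLite connection hands out `seq` in commit order, but in Postgres a plain sequence is assigned at insert time, so concurrent transactions can commit out of order. That gap is exactly the "post-commit sequence number" packaging the comment says is missing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The primary-key index on seq is the "queue": reading everything
-- after a given position is an ordered index range scan.
CREATE TABLE events (
    seq     INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL
);
""")


def publish(payload):
    """Append one message; the transaction commits when the block exits."""
    with conn:
        conn.execute("INSERT INTO events (payload) VALUES (?)", (payload,))


def poll(after_seq, limit=100):
    """Return (rows, new_cursor) for messages with seq > after_seq."""
    rows = conn.execute(
        "SELECT seq, payload FROM events WHERE seq > ? ORDER BY seq LIMIT ?",
        (after_seq, limit),
    ).fetchall()
    # If nothing new arrived, the consumer keeps its old position.
    new_cursor = rows[-1][0] if rows else after_seq
    return rows, new_cursor


publish("a")
publish("b")
first_batch, cursor = poll(0)

publish("c")
second_batch, cursor = poll(cursor)

print(first_batch, second_batch, cursor)
```

Each consumer only has to remember one integer, and replaying from an older position is just a matter of passing a smaller `after_seq`.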
While I love the ingenuity of the developers at Hasura (because I've personally been through these scaling challenges), I always get a gag reaction with GraphQL. I've honestly tried hard to digest it, and I can tell you that at large scale, where a single DB won't cut it, you would either need to develop a large federation layer like [Netflix](https://netflixtechblog.com/how-netflix-scales-its-api-with-...), or just rip it out. Streaming might alleviate the problem a little, but the real problem still lurks under the hood. The fact that the front-end community wants to fit everything under GraphQL bothers me, because every backend developer knows that a single tool/technology is usually not the best tool for solving every problem in your life. Remember the golden words, THERE IS NO SILVER BULLET!
If database capacity is your main concern and not independent schema deployability, then federation is overkill. You can just connect to whichever databases contain your data in your resolvers within a single service.
You have to be at pretty massive scale before federation becomes necessary and by then (if ever) your frontend teams have experienced benefits that are pretty much miraculous. The reason frontend wants to fit as much as possible into it is because it's vastly better than what came before it unless you have a 95th percentile org that is really doing an outstanding job managing the API via other means.
I think graphql with a single db is an odd thing to gain popularity - as the raison d'être for graphql was to federate and proxy different, heterogeneous data sources (databases, json/rest services, soap, other graphql services...).
Speaking of Netflix - I think they had an alternative API federation service that used some clever tricks with JSON string vs number keys to allow for alternating HTTP PUT/GET - and through that leverage HTTP-level caching. But I can't find the link...
GraphQL’s raison d’être was not to federate and proxy different, heterogeneous data sources. It was initially developed as a better API for the Facebook monolith. I asked one of GraphQL’s coauthors directly: https://twitter.com/ngrilly/status/1317415232717860866?s=46&....
The main thing I see it being used for is field selection and model nesting. From an end user perspective, it's pretty nice for this I think. Certainly nice that if I know it's a graphql endpoint I've already got a fairly solid idea about how to query it.
FD: I work at hasura and work with users/customers.
Fair point about the DB scaling but not sure if everyone is going to run into this issue. Also, lots of solutions are emerging for this specific problem (with different trade-offs of course) like distributed databases (crunchy, YugaByte, Spanner, etc.). Most folks I work with get by with a reasonably sized DB and some read replicas.
In my experience, if you have more than a small complexity to your data, and you're a medium scale business (or are going in that direction), you're going to end up with this issue. And then it becomes really painful to get out of the situation as migrating data and splitting things up is a real PITA.
While you are correct, not everyone is going to end up with this issue, those that are thinking of getting to a medium sized business should be working to avoid it, which unfortunately means solutions such as hasura lose value. It would be good to see more ability to collect data from multiple sources (please reply and correct me if you already do this, I'm not super familiar with the service).
With Hasura you can create a federation of databases and expose them as graphql or rest endpoints. You can also wrap your existing REST/Graphql services. Almost like API gateway, place where you unify your services, manage access / row-level permissions etc.
I've been thinking about the problem, and I've tried many different alternatives. The more I think about it, the more I believe it's best to have plain SQL being sent for queries.
The problem there is who has the right to run which queries?
It was never the solution to begin with, IMHO.
It's lacking a permissions or let's say authz framework.
It's only half the solution. And it's too complicated and there's no real standard, as in real world standard. Everyone has their own little soup cooking, because the manifest is incomplete and always evolving.
I've been using Hasura in production for small commercial projects deployed on AWS and I've been positively impressed by the stability and the speed up - it makes spinning up a graphQL backend with row-level security straightforward.
Just curious: my understanding is Hasura kind of discourages using RLS in lieu of their own access control layer[1]. Did you consider pros/cons of either approach?
Seeing some feedback on GraphQL - Hasura has had support for converting templated GraphQL into RESTish endpoints (with OpenAPI Spec docs if needed). We are planning to do the same for this streaming API as well - does anyone have good examples of existing REST/RESTish endpoints that do something similar?
We’ve been using Hasura at work, but we’ve stopped using it for everything other than subscriptions in favour of hand-written REST APIs. The problem for us wasn’t really GraphQL itself, but the fact that the client app was determining the query. If the client could request a “named query” that was then determined by the backend (perhaps via a webhook?), then we’d have been able to use Hasura more.
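For anyone wondering what the "named query" idea above might look like in the abstract, here is a small sketch. It is not Hasura's API; every name in it (`NamedQuery`, `REGISTRY`, `resolve`, the example queries and roles) is made up for illustration. The client sends only a query name, its role and some arguments, and the backend decides which SQL actually runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedQuery:
    sql: str            # the only SQL that will ever run for this name
    params: tuple       # argument names the client must supply
    roles: frozenset    # roles allowed to run this query


# Hypothetical registry owned by the backend, never by the client.
REGISTRY = {
    "messages_for_room": NamedQuery(
        sql="SELECT id, body FROM messages WHERE room_id = :room_id ORDER BY id",
        params=("room_id",),
        roles=frozenset({"user", "admin"}),
    ),
    "all_users": NamedQuery(
        sql="SELECT id, name FROM users",
        params=(),
        roles=frozenset({"admin"}),
    ),
}


def resolve(name, role, args):
    """Map a client request to (sql, bound_args), or raise if not allowed."""
    query = REGISTRY.get(name)
    if query is None:
        raise KeyError(f"unknown query: {name}")
    if role not in query.roles:
        raise PermissionError(f"role {role!r} may not run {name!r}")
    if set(args) != set(query.params):
        raise ValueError(f"{name!r} expects exactly {query.params}")
    # Values stay separate from the SQL text so the driver can bind them.
    return query.sql, dict(args)


sql, bound = resolve("messages_for_room", "user", {"room_id": 7})
print(sql, bound)
```

The point of the design is that authorization becomes a lookup on a short, reviewable list instead of an attempt to police arbitrary client-written queries.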
I’ve been a big Hasura user for a while. Give their RESTified endpoints a go; that solves this issue for you and still gives you all the access control goodness and subscriptions under one roof.
We have written a post[1] on building a real-time chat app with Streaming Subscriptions on Postgres. It gives a quick overview of the architecture used and how you can leverage the API on the client side with AuthZ. There’s a live demo that you all can try out.[2]
https://github.com/MaterializeInc/materialize/blob/main/LICE...
Curious what products you’ve seen as well.
Not a GraphQL problem though IMO.
What about the concept doesn't work? It's just a syntax for queries, I'm confused why it wouldn't scale.
So the real problem is authorization.
As for GraphQL - it's great for clients. As a backend engineer, you still have to do the work and a LOT of it.
This is just like microservices. No due diligence on whether the added complexity and destroyed productivity is worth it. "Everyone else is doing it".
Maybe you’re too out of touch. This isn’t true, at least not anymore. GraphQL is losing appeal.
[1] https://hasura.io/docs/latest/auth/authorization/basics/
There's also a few NPM packages for auto-generating that allow list from your project (https://www.npmjs.com/search?q=hasura%20allow%20list -- the one I've used before was from `tallerdevs`).
[1] https://hasura.io/blog/building-real-time-chat-apps-with-gra... [2] https://eclectic-dragon-25a38c.netlify.app/
Would love to see more use cases coming out of this :)