Zvez · 2 years ago
If you give access to your DB directly, your schema effectively becomes your API, with all the contract obligations of an API. Suddenly you don't completely control your schema: you can't freely change it, and you need to add things there for your clients only. I've seen it done multiple times and it always ends up poorly. You save some time now by removing the need to build an API, but later you end up spending much more time trying to decouple your internal representation from the schema you made public.
bennyelv · 2 years ago
Absolutely correct, listen to this article's ideas with great scepticism!

The system that I'm currently responsible for made this exact decision. The database is the API, and all the consuming services dip directly into each other's data. This is all within one system with one organisation in charge, and it's an unmanageable mess. The pattern suggested here is exactly the same, but with each of the consuming services owned by different organisations, so it will only be worse.

Change in a software system is inevitable, and to manage change safely you need a level of abstraction between the inside of a domain and the outside, plus a strictly defined API contract with the outside that you can version.

Could you create this with a layer of stored procedures on top of database replicas as described here? Theoretically yes, but in practice no. In exactly the same way that you can theoretically service any car with only a set of mole-grips.

camgunz · 2 years ago
This is just an interface, and you have the same problems with versioning and compatibility as you do with any interface. There's no difference here between the schema/semantics of a table and the types/semantics of an API.

IME data pipelines implement versioning with namespaces/schemas/versioned tables. Clients are then free to use whatever version they like. You then have the same support/maintenance policy as you would for any software package or API.

oreilles · 2 years ago
Versioned views, materialized views or procedures are the solution to this. Frequently, even internally, companies don't give access to their raw data but rather to a restricted schema containing a formatted subset of it.
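
For illustration, a minimal Postgres-flavored sketch of that pattern (the schema, table and role names are hypothetical): the internal table stays private, and clients only see a versioned view in a dedicated schema.

    -- Internal data: free to change at any time.
    CREATE SCHEMA internal;
    CREATE TABLE internal.orders (
        id          bigint PRIMARY KEY,
        customer_id bigint NOT NULL,
        total_cents bigint NOT NULL,
        created_at  timestamptz NOT NULL DEFAULT now()
    );

    -- Public contract: a versioned, formatted subset of the raw data.
    CREATE SCHEMA api;
    CREATE VIEW api.orders_v1 AS
    SELECT id, customer_id, total_cents / 100.0 AS total, created_at
    FROM internal.orders;

    -- Clients only get read access to the api schema.
    CREATE ROLE reporting_client LOGIN;
    GRANT USAGE ON SCHEMA api TO reporting_client;
    GRANT SELECT ON api.orders_v1 TO reporting_client;
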
vbezhenar · 2 years ago
Views will severely restrict the kinds of changes you might want to make in the future. For example, you can no longer just move some data from your database into S3 or a REST service.

Stored procedures technically can do anything, I guess, but at that point you would be better off with traditional services, which give you more flexibility.

tacker2000 · 2 years ago
Of course it’s possible, but now you need more people with DB and SQL knowledge.

Also, using views and stored procedures with source control is a pain.

Deploying these into prod is also much more cumbersome than just normal backend code.

Accessing a view can also be slower than accessing an “original” table, since the view's underlying query has to be evaluated on every read (unless it's materialized).

vmfunction · 2 years ago
In addition, if you are using Postgres, there is PostgREST to make an API really quickly and nicely.
madsbuch · 2 years ago
Why would you want to develop your API in SQL rather than a traditional language?

Versioned views and materialized views are essentially API endpoints in this context, just developed in SQL instead of some sane language.

rewmie · 2 years ago
> Versioned views, materialized views or procedures are the solution to this.

Wouldn't it be far simpler to just create a service providing access to those views with something like OData?

DrScientist · 2 years ago
Whether it's method calls or a database schema - isn't what really matters control over what's accessible, and the tools you have to support evolution?

So when you provide an API - you don't make all functions in your code available - just carefully selected ones.

If you use the DB schema as a contract you simply do the same - you don't let people access everything, just the views/tables they need and that you can support.

Just like APIs, databases have tools to allow you to evolve - for example, maintaining views that keep a contract while changing the underlying schema.

In the end - if your schema dramatically changes - in particular changes like 1:1 relation moving to a 1:many - it's pretty hard to stop that rippling throughout your entire stack - however many layers you have.
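
As a rough sketch of that (hypothetical names, Postgres-flavored): when a 1:1 address relation becomes 1:many, a view can preserve the old 1:1 contract for existing consumers by exposing only the primary address.

    -- New 1:many table.
    CREATE TABLE user_addresses (
        user_id    bigint  NOT NULL,
        address    text    NOT NULL,
        is_primary boolean NOT NULL DEFAULT false
    );

    -- Old 1:1 contract, preserved as a view.
    CREATE VIEW user_address AS
    SELECT user_id, address
    FROM user_addresses
    WHERE is_primary;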

DasIch · 2 years ago
> Just like APIs, databases have tools to allow you to evolve - for example, maintaining views that keep a contract while changing the underlying schema.

What are the database tools for access logs, metrics on throughput, latency, tracing etc.? Not to mention other topics like A/B tests, shadow traffic, authorization, input validation, maintaining invariants across multiple rows or even tables...

Databases often either have no tools for this or they are not quite as good.

robertlagrant · 2 years ago
You can do impedance-matching code in a database, e.g. in stored procedures, but I think the experience is strictly worse than all the application-level tooling that's available.
moomin · 2 years ago
I have spent my entire, long career fighting against someone who thought this was a good idea, unpicking systems where they implemented it, or bypassing systems where it was implemented. It's a many-headed hydra that keeps recurring, but rarely have I seen it laid out as explicitly as this headline.
ignoramous · 2 years ago
I guess that's what one gets for reading just the headline? TFA talks about the downsides called out in this thread explicitly.

tbf, the idea isn't that novel. Data warehouses, for instance, provide SQL as a direct API atop the data.

fatnoah · 2 years ago
> If you give access to your DB directly, your schema effectively becomes your API, with all the contract obligations of an API. Suddenly you don't completely control your schema: you can't freely change it, and you need to add things there for your clients only. I've seen it done multiple times and it always ends up poorly.

In a past life, I worked for a large (non-Amazon) online retailer, and "shipping the DB" was a massive boat anchor the company had to drag around for a long time. It still might be, for all I know. So much tech and infra sprang up to work around this, but at some point everything came back to the same database with countless tables and columns whose purpose no one knew, but which couldn't be changed because it might break some random team's work.

qwerty456127 · 2 years ago
That's [another reason] why you use stored procedures and only call them (rather than hardcoded or ORM-generated SQL queries) in your client app code.
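
For example, a minimal Postgres-flavored sketch (table, column and function names are hypothetical): client code never touches the tables, only a function that serves as the stable entry point.

    -- Hypothetical internal table; free to be reshaped later.
    CREATE TABLE orders (
        id          bigint PRIMARY KEY,
        customer_id bigint NOT NULL,
        total_cents bigint NOT NULL,
        created_at  timestamptz NOT NULL DEFAULT now()
    );

    -- The callable contract the client is allowed to depend on.
    CREATE FUNCTION get_customer_orders(p_customer_id bigint)
    RETURNS TABLE (order_id bigint, order_total_cents bigint, ordered_at timestamptz)
    LANGUAGE sql STABLE AS $$
        SELECT id, total_cents, created_at
        FROM orders
        WHERE customer_id = p_customer_id;
    $$;

    -- Client app code only ever runs:
    SELECT * FROM get_customer_orders(42);
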
MoSattler · 2 years ago
I think this point is addressed in the article.
herodoturtle · 2 years ago
Came here to say this too.

From the article:

> A less obvious downside is that the contract for a database can be less strict than an API. One benefit to an API layer is that you can change the underlying database structure but still massage data to look the same to clients. When you’re shipping the raw database, that becomes more difficult. Fortunately, many database changes, such as adding columns to a table, are backwards compatible so clients don’t need to change their code. Database views are also a great way to reshape data so it stays consistent—even when the underlying tables change.

Neither solution is perfect (raw read replica vs API). Pros and Cons to both. Knowing when to use which comes down to one's needs.

DarkNova6 · 2 years ago
This 100%.

My last customer used an ETL tool to orchestrate their data loads between applications, but the only out-of-the-box solution was a DB reader.

Eventually, no system could be changed without breaking another system, and the central GIS system had to be gradually phased out. This also meant that everybody had to use Oracle databases, since this was the "best supported platform".

Dudester230602 · 2 years ago
On the next iteration some consultancy will replace that with a bunch of microservices using a dynamic language.

When that thing fails again they will hopefully settle on a sane monolithic API.

chedabob · 2 years ago
Yeah this is my gripe with things like Firebase Realtime Database.

Don't get me wrong, the amount of time it saves is massive compared to rolling your own equivalent, but it doesn't take long before you've dug yourself a big hole that would conventionally be solved with a thin API layer.

oldnet · 2 years ago
Also, you shouldn't give out direct access to your DB, for security reasons.

That's why APIs exist in the first place.

camgunz · 2 years ago
PostgreSQL 9.5 (7.5 years old) shipped row-level security [0] which solves this.

[0]: https://www.postgresql.org/docs/15/ddl-rowsecurity.html
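
For reference, a minimal sketch of that feature (table, policy and setting names are hypothetical): each connection only ever sees its own tenant's rows, even with raw SQL access.

    -- Hypothetical multi-tenant table.
    CREATE TABLE orders (
        id        bigint  PRIMARY KEY,
        tenant_id bigint  NOT NULL,
        total     numeric NOT NULL
    );

    ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

    -- Non-owner roles only see rows matching their tenant setting.
    CREATE POLICY tenant_isolation ON orders
        USING (tenant_id = current_setting('app.tenant_id')::bigint);

    -- Set per connection, e.g.: SET app.tenant_id = '42';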

vbezhenar · 2 years ago
Technically you can create different users with very precise access permissions. It might not be a good idea to provide that kind of API to the general public, but if your clients are trustworthy, it might work.
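
A rough sketch of what that can look like in Postgres (role, database and table names are hypothetical), down to column-level grants:

    -- A read-only role for one trusted client.
    CREATE ROLE partner_acme LOGIN PASSWORD 'change-me';

    -- No blanket access: grant specific tables, or even specific columns.
    GRANT CONNECT ON DATABASE mydb TO partner_acme;
    GRANT USAGE ON SCHEMA public TO partner_acme;
    GRANT SELECT (id, status, created_at) ON orders TO partner_acme;
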
londons_explore · 2 years ago
You could ship the database together with python/JS/whatever 'client library' - and you tell your clients that they need to use your code if they want to be supported.
knallfrosch · 2 years ago
You just know they're going to run custom code, fck up their database and then still complain.

I'm not too familiar with DBs, but I know customers. They're going to present custom views to your client SDK. They're going to mirror your read-only DB into their own and implement stuff there. They're going to depend on every kind of implementation detail of your DB's specific version ("It worked with the last version and YOU broke it!"). They're going to run the slowest joins you've ever seen just to get data that belongs together anyway and that you would have written a performant resolver for.

Oh, and of course, you will need 30 client libraries. Python, Java, Swift, C++, JavaScript and 6+ versions each. Compare that to "hit our CRUD REST API with a JSON object, simply send the Authorization Bearer ey token and you're fine."

wmal · 2 years ago
This is the worst of both worlds. Not only are you back to square one, as you spent the time to build an API (client libraries), but now, if the API is limiting, the users will find ways of accessing the SQLite db directly.
marcinzm · 2 years ago
Are you assuming clients will actually upgrade the library on a regular basis?
Puts · 2 years ago
You can use stored procedures if you want to add another abstraction layer.
Epa095 · 2 years ago
They had stored procedures in the "old days" when they figured out that direct access to the database was a bad idea, so what has changed? (I agree that a DB view often is good enough though, but they ALSO had that in the "old days", IDK what has changed about that :-p)
archibaldJ · 2 years ago
Yeah, reminds me of Meteor JS.
matt_s · 2 years ago
The title reads like it came from an MBA biz-bro that doesn't want to do anything properly because it wastes time and costs money. FWIW, I skimmed the article.

Building an API for a new application is a pretty simple undertaking and gives you an abstraction layer between your data model and API consumers. Building a suite of tests against that API that run continuously with merges to a develop/test environment will help ensure quality. Why would anyone advise to just blatantly skip out on solid application design principles? (clicks probably)

camgunz · 2 years ago
The guy knows what he's talking about [0].

> Building an API for a new application is a pretty simple undertaking

This is super untrue, backend engineering is pretty hard and complicated, and there aren't enough people to do it. And this is coming from someone who thinks it should be replaced with SaaS stuff like Hasura and not a manual process anymore.

> Building a suite of tests against that API that run continuously with merges to a develop/test environment will help ensure quality.

You can test your data pipelines too; we do at my job and it's a lot easier than managing thousands of lines of PyTest (or whatever) tests.

> Why would anyone advise to just blatantly skip out on solid application design principles?

Because building an API takes a lot of time and money, and maintaining it takes even more. It would be cool if we didn't have to do it.

[0]: https://github.com/benbjohnson

EMM_386 · 2 years ago
I've been doing this for a long time and all I can say after reading this multiple times is ... "I don't get it".

I mean, I get it, from a technical standpoint. Ok, so you're going to send read-only Sqlite databases to everybody.

Is it ignoring what happens when the API (which you still need) inserts or updates something and all the client DBs are now stale? Is there a central database? How often are you pushing out read-only database replicas across the wire to all clients? Is that really less "chatty"? If so, how much bandwidth is that saving to push an entire database multiplied by the number of clients?

None of this seems logical. Maybe I'm missing the real-world use-case. Are we discussing tiny Sqlite databases that are essentially static? Because in the last 30 years I've not run into a situation where I needed to have clients execute SQL queries on tiny, static databases let alone still need to potentially update them also.

benbjohnson · 2 years ago
Author here. We're using LiteFS to replicate SQLite databases in real time so the changes sent are only incremental. I think there are several ideal use cases for sharing databases across applications:

1. Internal tooling: you're able to manage the contract better since you have control of the source & destination applications.

2. Reporting & analytics: these tend to need a lot of query flexibility & they tend to use a lot of resources. Offloading the query computation to the client makes it easier on the source application.

As for database size, the Corrosion program mentioned in the post is about 8GB and has a continuous write load so this doesn't have to just be for tiny databases.

marcinzm · 2 years ago
Looks interesting, although personally I don't see those as compelling use cases; I may very well be missing something.

> 1. Internal tooling: you're able to manage the contract better since you have control of the source & destination applications.

This has not been my experience in anything but tiny companies or companies with very strict mono-repo processes. One big point of separate teams is to minimize the communication overhead as your organization grows (otherwise it scales roughly quadratically with headcount). That means you do not want lots of inter-department dependencies due to the internal tooling and APIs they leverage.

> 2. Reporting & analytics: these tend to need a lot of query flexibility & they tend to use a lot of resources. Offloading the query computation to the client makes it easier on the source application.

Depends on the resources the client has versus what the server can devote to a single query. The resources are also high because of how much data is analyzed, and 8GB seems tiny to me (i.e. the whole thing can be kept in memory on many DB servers).

ryanrussell · 2 years ago
Ben, fan of your work. You guys have really moved the flag on sqlite.

Are there any plans for Corrosion to be published as OSS?

LispSporks22 · 2 years ago
Curious if you've tested it with Jepsen. Anytime I come across some distributed system, my stomach ulcers start playing up while I wonder about all the weird failure modes. I kinda looked for a caveats page on the LiteFS web site, didn't really see one.
pininja · 2 years ago
I could imagine this technique being useful for kepler.gl or other data visualization tools
rgavuliak · 2 years ago
Do you realize most reporting & analytics use cases don't use SQLite databases?
lawik · 2 years ago
I have more interest in this: https://electric-sql.com/

They are adding a sync-mechanism for Sqlite so local writes can be synced to a remote location and on into Postgres and so on. CRDT-based eventual consistency.

So local write latency, eventually consistent sync and the tools for partial sync and user access are the primary areas of development.

Early days but an interesting approach to shipping the db.

anty · 2 years ago
Oh, I implemented something like that for my Android app. It seems to work quite well. I don't have many users yet, though.

I replicate the client's SQLite DB to a central server (over a REST API) where it is synced with existing data. I use a CRDT, so changes don't get lost, and as long as the clocks of all the users' devices are accurate, the merges are in the correct order.

This enables offline access, but also the use of my app without an account. You can always merge the data later. Multi-device merge is also possible, if you create an account. Especially the multi-device merge was a big headache until I settled for this approach.

Since it is a home-grown solution I still have to do some manual stuff that could be abstracted away, like converting datatypes to and from JSON, Kotlin, PHP, MySQL. There's not always an equivalent datatype in those technologies.

This approach probably won't work well for big databases that share data between multiple users, though.

zmmmmm · 2 years ago
It probably makes most sense if you have a middle-tier-less application where the UI IS the application, and effectively it just dumps the whole application state out through APIs anyway. We have ended up with this scenario playing out and you eventually just throw up your hands and make the UI hit the server with a giant "send me a huge blob of JSON that is most of the tables in the database on first page load" query anyway.

So the assumption that "if anybody changes anything everybody needs to know it" is close to true. In that scenario, putting a bunch of APIs in the way just makes the data less coherent rather than more. In most other scenarios, yeah, it's hard to see that it really makes sense.

EMM_386 · 2 years ago
Never in my career spanning decades have I had to ask a server to send me a dataset so large that it amounts to "most tables in the database on first page load".

In what use case do you run into that?

I'm a lead and an architect on an enterprise application at the moment that drives the whole company. It's your standard configuration: front-end, APIs, SQL. The system requests what it needs to fulfill only the functionality the user is dealing with.

Earlier in my career I was dealing with large enterprise desktop applications that talked directly to the database, with no centralized API. Some of them had thousands of individual desktop clients hitting a single multi-tenant SQL server. No problem, SQL Server would handle it without breaking a sweat. The bandwidth to an individual client was fine. It was fast. And that was 20 years ago.

jokethrowaway · 2 years ago
Imagine you're building a SaaS which allows your users to create a website, hotel booking platform and channel manager.

Users can open the application, get the database, and the frontend does all the offline updates the user wants to perform. The user can update their website, add bookings they received on the phone, check what prices they set. This is all blazingly fast with no loading time, no matter your internet connection, because no further communication with the server is needed.

At some point they press a magic Publish button and the database gets synced with upstream and uploaded. The upstream service can take the data and publish a website for its users, update availabilities, update prices, etc.

It would be a better user experience than 99% of the SaaS out there.

q7xvh97o2pDhNrh · 2 years ago
Most of the Internet spent the last 10-20 years moving away from this because business metrics generally benefit from realtime updates and background autosaves. By and large, users expect systems to autosave their work-in-progress nowadays.

You might remember horror stories from the late 1900s when people would lose critical data and documents because they "forgot to hit Save and the computer crashed." Or, maybe you've never experienced this — because a couple decades of distributed-systems engineers and UX researchers made that problem obsolete.

So now we've... reinvented the Save button, but somehow needed a Paxos implementation to do it?

Everything old is new again, I guess.

EMM_386 · 2 years ago
> User can open the application, get the database, the frontend does all the offline updates the user want to perform.

"get the database"

How small do you think these databases are?!

You're going to download the entire hotel booking platform's database?

For how many hotels? One at a time, and then get another database? Or are you getting a Sqlite booking database for every hotel in the world? And you're going to send them to each user? For what date range?

And even if that were possible, you then have to commit your offline updates. What if someone else booked the date range prior to you? Now your Sqlite copy is stale. Download the entire thing again? There could have been countless changes from other users in the time since you last got your copy.

This explanation just leaves me even more confused. It's illogical.

vosper · 2 years ago
Thanks for the example, it helped me understand the idea.

The thing that I'm unclear on is how do I figure out what data to ship to the user? Like if I don't already have a database-per-user then I have to extract only the data the user has permission to see, and ship that as a database?

That would be the case even if I had database-per-customer - not every user is necessarily able to see all the data owned by the organization they're a part of.

It seems like a lot of extra work, and error-prone, too (what could be worse than accidentally shipping the wrong data _as an operational database_ to a client?)

Edit: the article covers this at the bottom, but IMO it's a show-stopper. How many applications of any real complexity can actually implement database-per-user (database-per-tenant is probably not enough, as I mentioned above). As soon as you need any kind of management or permissions functionality in your app then you can't ship the database anymore, so you may as well start off not shipping it.

hx8 · 2 years ago
Once upon a time, this is how applications worked and this was a pretty good experience.

Now, you'll introduce a huge amount of user frustration around why their changes never made it to the central database. If you have users closing a form, especially if it's on a webpage, they are going to expect the write to happen at the same time the form closes. Making a bunch of changes and batching them in a single sync is going to be confusing to a ton of users.

RhysU · 2 years ago
Wasn't this Lotus Notes?
Takennickname · 2 years ago
What is this? A scifi novel?

By the time your dude downloads the database half the hotels on your imaginary website have changed status.

whalesalad · 2 years ago
A lot of SPA’s already operate this by leveraging things like localstorage in the users browser.
srhtftw · 2 years ago
> Because in the last 30 years I've not run into a situation where I needed to have clients execute SQL queries on tiny, static databases let alone still need to potentially update them also.

It was not uncommon in the early 2000's for applications targeting mobile professionals and sales types to work this way, except that instead of SQLite they used things like Notes¹, Jet² and MSDE³. By modern standards these would be considered "tiny mostly static databases", often not exceeding a gigabyte. Instead of connecting over the internet they used serial dial-up file transfer protocols to synchronize with a central database or a mainframe. People would typically download information in the morning, make updates offline and synchronize at the end of the day. A slow trickle of periodic bidirectional updates ensured everyone had a reasonably fresh copy of the information they needed.

1- https://en.wikipedia.org/wiki/HCL_Domino

2- https://en.wikipedia.org/wiki/Access_Database_Engine

3- https://en.wikipedia.org/wiki/MSDE

uoaei · 2 years ago
I think the point is basically "unless there's a good reason for your API to look different from your DB schema, ontologically speaking, the schema has already effectively defined your API, so just let people interact with the DB", or, alternatively, "bake in as much of the data constraints as possible into your DB schema, and only when you can't enforce them with column constraints or DB extensions should you add APIs on top".
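
A small sketch of that second reading (table and column names are hypothetical): constraints live in the schema itself, so even clients with direct access can't write invalid data.

    CREATE TABLE customers (
        id    bigint PRIMARY KEY,
        email text NOT NULL UNIQUE CHECK (email LIKE '%@%')
    );

    CREATE TABLE orders (
        id          bigint PRIMARY KEY,
        customer_id bigint NOT NULL REFERENCES customers (id),
        quantity    int    NOT NULL CHECK (quantity > 0),
        status      text   NOT NULL CHECK (status IN ('pending', 'shipped', 'cancelled'))
    );
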
emodendroket · 2 years ago
I have seen some stuff where a streaming tool (Kafka or whatever) is used to just ship all updates to a database to certain clients. But I think this is a dubious architecture, since it comes with basically all the downsides of a database shared by many applications, aside from the write contention one.
cpursley · 2 years ago
Supabase realtime handles this really elegantly
KRAKRISMOTT · 2 years ago
Is this a true cloud-managed SQLite, or is this like their PostgreSQL documentation where it's just a bunch of preconfigured Docker containers and the developer is expected to manage everything themselves? If the DB goes down for an extended period of time at 3am, does fly.io have an engineer on call?

I find that fly.io has been a very disingenuous startup. They position themselves as some sort of Heroku successor and hide behind their developer content marketing while providing very low quality infrastructure. At the end of the day, fancy blog posts may win hearts and minds, but downtime cost businesses actual money.

impulser_ · 2 years ago
I think you are assuming that Fly is a serverless platform. They aren't. They are the opposite: a server platform. They provide servers for you and you have to manage your server yourself.

Nothing they provide is managed by them. You have to do that.

LiteFS is just a replication service for your sqlite database so you can keep your database synced across multiple nodes.

https://github.com/superfly/litefs

LiteFS Cloud, which is a service they provide, just helps you back up and recover SQLite databases. You can do this yourself.

https://fly.io/docs/litefs/backup/

pests · 2 years ago
I go into a project folder and can run "fly launch" and then the application is alive in multiple regions around the world. Where exactly am I managing this server?
TylerE · 2 years ago
There is no such thing as "serverless", that's just a higher than normal level of vendor lying.

Deleted Comment

robertlagrant · 2 years ago
> you have to manage your server yourself

Can you give an example? You mean backups, OS patches, firewall config, etc etc? Are they like Hetzner?

benbjohnson · 2 years ago
Author here. I think we could have set better expectations with our Postgres docs. It wasn't meant to be a managed service but rather some tooling to help streamline setting up a database and replicas. I'm sorry about the troubles you've had and that it's come off as us being disingenuous. We blog about things that we're working on and find interesting. It's not meant to say that we've figured everything out but rather that this is what we've tried.

As for this post, it's not managed SQLite but rather an open source project called LiteFS [1]. You can run it anywhere that runs Linux. We use it in a few places in our infrastructure and found that sharing the underlying database for internal tooling was really helpful for that use case.

[1]: https://github.com/superfly/litefs

nirushiv · 2 years ago
Can second this. Tried them out at our startup, came away unimpressed. Their engineering blog is great marketing, though this particular post is lacklustre. What about permissions, indirection, versioning, or otherwise encapsulating breaking changes?
pests · 2 years ago
I was just setting up a new Fly account and launching a project yesterday. I will say their developer content marketing is very successful and had me convinced they at least know what they are talking about.

Any more stories or knowledge of them being disingenuous?

jokethrowaway · 2 years ago
I wouldn't say disingenuous but uptime has been disastrous.

I'm definitely not buying anything there.

I run a free service on fly.io.

It's been ok lately (or maybe my users just stopped complaining), so maybe they got better.

aledalgrande · 2 years ago
Their DX is good. They just have a lot of reliability/availability issues.
freedomben · 2 years ago
A very interesting idea to be sure, but IME the biggest downside (which tbf is mentioned in the article) is the contract. If you have clients with knowledge of and dependency on the schema, you can't change it in a breaking way unless you update all the clients' code.

I've tried various patterns in the past, like one that just exposes database columns as an API, and this pain point always comes calling and it hurts. Keeping your data model as simple and lean as possible is an important part of limiting complexity, which directly correlates with maintainability.

The only pattern/approach that I consistently return to is the Rails pattern (I use Elixir/Phoenix now but same pattern). It is certainly not the most sexy, and having a client be able to graphql exactly what they need can be really helpful, but at least for me it has rarely turned out to be worth the tradeoffs.

danielvaughn · 2 years ago
This is basically how Firebase is designed and it drives me nuts. I've freelanced with several teams, primarily comprised of front-end devs, who decided to use Firebase for their product. When they first told me they query the database directly from both their website and their mobile app, I immediately was like "so...what happens if you need to change the structure of the database"?

Crickets.

endisneigh · 2 years ago
Crickets, really? It's very obvious what to do, as recommended by Google - you use a Cloud Function (aka Cloud Functions for Firebase) to do a "migration", while simultaneously handling both cases in client code until this is done.

You must have been working with a bunch of amateurs if such an obvious solution (even without reading the documentation) didn't come to mind apparently by you or your clients (since you're the freelancer). This particular problem isn't even Firebase specific - you have to do this for any store that doesn't strictly have a schema.

This issue you're describing isn't even in the top 5 biggest issues about using Firebase.

10000truths · 2 years ago
The solution to this is the same as with APIs: versioning. Instead of naming your table "my_foo", you name it "my_foo_v1". Then, when you want to make a breaking change to the schema, you do the following (a rough SQL sketch follows the list):

1. Create a new table "my_foo_v2" with your desired schema

2. Modify write queries for "my_foo_v1" so that they also write to "my_foo_v2"

3. Copy over existing data in "my_foo_v1" to "my_foo_v2" with a migration script

4. Modify read queries for "my_foo_v1" so that they read from "my_foo_v2" instead

5. Remove all write queries to "my_foo_v1"

6. Drop the "my_foo_v1" table
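
A rough Postgres (11+) sketch of steps 1-3 (the columns and the name split are hypothetical, purely to illustrate a breaking change):

    -- Existing table, for context.
    CREATE TABLE my_foo_v1 (id bigint PRIMARY KEY, name text NOT NULL);

    -- Step 1: new table with the breaking schema change.
    CREATE TABLE my_foo_v2 (
        id         bigint PRIMARY KEY,
        first_name text,
        last_name  text
    );

    -- Step 2: dual-write via a trigger so v2 stays in sync with v1.
    CREATE FUNCTION copy_to_v2() RETURNS trigger LANGUAGE plpgsql AS $$
    BEGIN
        INSERT INTO my_foo_v2 (id, first_name, last_name)
        VALUES (NEW.id, split_part(NEW.name, ' ', 1), split_part(NEW.name, ' ', 2))
        ON CONFLICT (id) DO UPDATE
            SET first_name = EXCLUDED.first_name,
                last_name  = EXCLUDED.last_name;
        RETURN NEW;
    END;
    $$;

    CREATE TRIGGER my_foo_dual_write
        AFTER INSERT OR UPDATE ON my_foo_v1
        FOR EACH ROW EXECUTE FUNCTION copy_to_v2();

    -- Step 3: backfill rows that existed before the trigger.
    INSERT INTO my_foo_v2 (id, first_name, last_name)
    SELECT id, split_part(name, ' ', 1), split_part(name, ' ', 2)
    FROM my_foo_v1
    ON CONFLICT (id) DO NOTHING;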

JustLurking2022 · 2 years ago
Holy hell does that sound awful. In practice, it doesn't even solve the core problem, as you'd have to deploy new versions of all binaries precisely in sync to avoid incorrect data because some clients know about the v2 table and others don't. It becomes indistinguishable whether a row is missing because it was intentionally deleted or because it was written by a client who didn't know about the new table version. There are ways to account for this but it's layers of crap on top of crap.
singron · 2 years ago
With Postgres, simple views are updatable, so you can often use this pattern without copying tables or dual writes. It's particularly useful for renaming columns, but you can also do some other things. You also don't have to use version numbers on all your tables, since you only need the view temporarily.

    -- Phase 1: rename the table, change it, and put a compatibility view
    -- in place under the old name so existing readers keep working.
    BEGIN;
    ALTER TABLE my_foo RENAME TO my_foo_tmp;
    ALTER TABLE my_foo_tmp <your desired change here>;
    CREATE VIEW my_foo AS SELECT <how you make your new table seem like your old table>;
    COMMIT;

    <update your apps to use my_foo_tmp>

    -- Phase 2: swap the real name back, leaving an alias view for stragglers.
    BEGIN;
    DROP VIEW my_foo;
    ALTER TABLE my_foo_tmp RENAME TO my_foo;
    CREATE VIEW my_foo_tmp AS SELECT * FROM my_foo;
    COMMIT;

    <update your apps to use my_foo>

    -- Phase 3: clean up the alias view.
    DROP VIEW my_foo_tmp;

danielvaughn · 2 years ago
You know, I've been toying with this very idea in my head for a while, as a way to make safe schema migrations. But I also know very little about databases so had no idea if this was a stupid idea or not. Glad to know I'm not alone.

One issue I've considered is fk relationships - that could get complicated depending on the approach.

Deleted Comment

manmal · 2 years ago
Imagine you have Android & iOS apps which some of your users won't ever update. You'll never be able to drop any old version's tables, and you'll need to keep all table versions in sync more or less in real-time. Only when the last user has updated from a certain client version will you be able to remove that version's tables.
mvdtnz · 2 years ago
I'm sorry that is insane. That approach wouldn't pass the first sniff test of any sensible technical design review.
SOLAR_FIELDS · 2 years ago
I think about this with PostgREST/Supabase, which has a similar problem (autogenerated code based on the DB schema is inherently going to conflict with API stability guarantees). I think that this approach is just fundamentally at odds with making a stable, version-controlled API.

However: I think the best thing to do if you want a setup like this and want to have your cake and eat it too is something like this:

Use your autogenerated client for your "internal" API. This is only for clients you directly control, so that you can ship changes and not have to worry about versioning/backwards compatibility.

Then for external users that have their own clients, you have a slimmed-down, more traditional API that offers less functionality but is properly versioned, tested, etc.

I think this kind of hybrid setup can work well for SaaS setups where you have a cloud product that does internal stuff plus external things that end users need to operate on. You get the benefit of being able to iterate quickly without breaking your clients, and since your external API is smaller, it's less maintenance overhead to keep it updated and versioned.

steve-chavez · 2 years ago
PostgREST docs have always recommended using only views + functions in a dedicated schema for the API: https://postgrest.org/en/v10.2/schema_structure.html
rmbyrro · 2 years ago
They mention one partial "solution" to this: have your clients query a view, instead of tables directly.

When you need to change the schema, also update the view accordingly.

Of course, there are still limitations. For one: you can't "massage" data with a view as you could in the backend with a full featured programming language.

PS: I don't think this is a good solution, just mentioning it.

Deleted Comment

floodle · 2 years ago
If you are shipping a web app where one dev team controls both the backend and the one and only client, and ships them together, is this a big issue?
simonw · 2 years ago
I think there are some reasonable ways to avoid the schema brittleness that people are concerned about with this.

Firstly, acknowledge that in this model your schema IS your API. Don't include anything in the database that is being replicated that isn't considered to be documented, tested and stable.

With that in place, API changes become the same as changes to a JSON API or similar: you can only make backwards-compatible changes, like adding new columns or adding new tables.

If you do need to make a backwards incompatible change there are ways to patch over it. Ship a brand new table with the new design, and include a SQL view that matches the name of the old table and exposes a compatible subset of it.
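
For instance, a minimal sketch (hypothetical names): the redesigned table ships under a new name, and a view with the old name keeps existing readers working.

    -- Redesigned table replaces the old "users" table.
    CREATE TABLE accounts (
        id           bigint PRIMARY KEY,
        display_name text NOT NULL,
        email        text NOT NULL
    );

    -- Compatibility shim: same name and columns as the old table.
    CREATE VIEW users AS
    SELECT id, display_name AS name, email
    FROM accounts;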

maxbond · 2 years ago
Maybe it's better to have views in the first place, and to require (though it can only be enforced at an administrative level) external services to query against those views.

Then we won't have to do this switch to a view later; we've paid that cost upfront.

We can then include a version number in the view name, similar to how we do with APIs. So if we introduce changes we can keep services which didn't need that change outside the blast radius.

And we have a limited ability to enforce contracts and declare some columns internal.

thestepafter · 2 years ago
Even better, use stored procedures / routines on top of views and only query the stored procedures from the frontend.
jimmytucson · 2 years ago
That is an appealing idea but on second thought, haven’t you just shifted the problem to: how do I reliably replicate logical updates from my operational tables (internal, constantly changing) to my clean, public-facing tables?

Basically, your public tables become materialized views over the “dirty” tables, and I’ve never met a mature, reliable implementation of materialized views, or even an immature one that is feature complete. (I would love to be wrong on the internet with this one!)

remon · 2 years ago
That's just working around an artificially and unnecessarily introduced constraint. There's little to zero upside to doing it this way. It's not even more convenient, since it incurs pretty heavy refactor costs for even small changes.
javajosh · 2 years ago
This is correct, but there are very few people with experience with safe database modelling like this, and they may not even realize the technique exists.
simonw · 2 years ago
The same is true for APIs, too. Making changes to a JSON API that won't break existing clients isn't necessarily an obvious skill.

But it can be learned, as can backwards-compatible database schemas.

et1337 · 2 years ago
I have been thinking about this a lot lately. We have customers who run automation against us. They’re doing 10,000 API calls. Each one has to do a permission check, request signing, serialization, etc., etc. All just to mutate a blob of data less than say, 100MB. If they just downloaded the whole state, mutated it, and reuploaded it, the whole thing would be done in 2 seconds.

We already lock the entirety of each customer’s data when processing a request, in order to avoid data inconsistency from concurrent mutations.

One SQLite database per customer is a really appealing idea.

rmbyrro · 2 years ago
I'd think 100x before deploying this pattern to external customers.

The use case shared by fly.io is for internal tooling. If you have a small team that most likely won't grow to more than a few dozen devs, maybe you can get away with it.

Shipping it to several external customers will be a horrible nightmare for maintenance and dev productivity. I'd switch jobs if I were working on a project that started applying this. Paying for the HTTP API upfront is a cheap price to avoid the costs of not having it down the road.

knallfrosch · 2 years ago
> If they just downloaded the whole state, mutated it, and reuploaded it, the whole thing would be done in 2 seconds.

Do you think your customers would accept this as a test? They would run against an in-memory database if they wanted to.

maxbond · 2 years ago
A downside I didn't see mentioned (it was gestured at with contracts and the mention of backwards compatible schema changes, but not addressed directly) was tight coupling. When you link services with APIs, the downstream changes of a schema migration end at the API boundary. If you are connecting services directly at the database level, the changes will propagate into different services.

That would make me very nervous to ship something like this. You can probably cover a lot of this with views as a backwards compatibility shim, though.

benbjohnson · 2 years ago
Author here. I don't see this as a general practice to be used for most applications. It was a side effect that we came across, and it allowed us to share data between internal applications in some interesting ways. Internal apps are ideal since you have control of both sides of the contract. It definitely requires more consideration if you're publishing out to end users.
maxbond · 2 years ago
Hi Ben, thanks for taking the time to respond. It's an interesting approach, and I'm sure it has its place & you've made the right call for your particular situation.

I didn't think you meant pushing it out to end users, my point was more that this technique increases the blast radius. If three services are touching this database, and our changes now necessitate three deploys, that's much higher risk.

But it was an early comment, I hadn't gotten to see the suggestions for managing this coupling yet. Each one of those does take us a step closer to designing an API, but perhaps part of the value here is that it allows you to move along a spectrum and to adopt the level of constraint you can afford, given the situation.

I'm curious if this has spread to any other services at fly? Or is it just this Corrosion service?

jimmytucson · 2 years ago
I love this kind of reasoning that forces you to climb out of the comfortable nook in a tree of thought you’ve been occupying for years to see if there’s a better vantage point!

So would this work equally well? Use Postgres on some cloud and “spin up” a read-replica that your clients have access to. Let them read from the replica.

If yes, I have tried that and I can tell you the problem I had. They don’t want to get to know my database schema. I know, frustrating! “If you would just get to know me better, you would have all the secrets you desire!” Sigh… they just want this overly simplistic Interface that doesn’t change, even when we rewrite things to make them better on the backend.

Another thought: if sharing compute is the main problem, you can just egress the data to distributed file storage like S3 and let them write SQL on the files using some Presto-like engine, such as Athena. Or egress to BigQuery and let them bring their own compute. But the problem there again is getting to know my internal data structure, and also by "egress" I really mean "dump the whole damn thing, because once the data is on S3 I can't just update a single row like I can on an OLTP database", and you can only do that every so often.

benbjohnson · 2 years ago
Yes, Postgres replication would work as well. I agree that the "getting to know the schema" part is an issue. I think there are use cases out there where you have power users who would gladly invest extra time in exchange for query flexibility.

Querying S3 is a good approach too. Those query tools can be somewhat involved to set up but are definitely a good option.