KronisLV · 2 years ago
I quite like the article and the advice it presents. Building what I'd call modular monoliths (where modules can be enabled or disabled via feature flags) is indeed a good approach for both increasing resiliency and decreasing the blast radius of various issues.

However, this bit stuck out to me:

> When a bad Pub/Sub message was pulled into the binary, an unhandled panic would crash the entire app, meaning web, workers and crons all died.

I've never seen a production-ready web framework or library that would let your entire process crash because of a bad request. There's always some error handling logic wrapping around the request handling, so that even if you let something slip by, the fallout will still be limited to the threads handling the requests and their context.
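For illustration, a minimal sketch of that kind of wrapping in plain Go net/http (the route and middleware names are made up, not taken from any particular framework):

```go
package main

import (
	"log"
	"net/http"
)

// recoverMiddleware: a panic in a handler fails only that request,
// instead of taking down the whole process.
func recoverMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if err := recover(); err != nil {
				log.Printf("panic serving %s: %v", r.URL.Path, err)
				http.Error(w, "internal server error", http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/boom", func(w http.ResponseWriter, r *http.Request) {
		panic("bad request payload") // only this request fails
	})
	log.Fatal(http.ListenAndServe(":8080", recoverMiddleware(mux)))
}
```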

Now, of course there can still be greater failures, such as issues with DB access stalling for whatever reason, leading to the HTTP requests not being processed, leading to the corresponding thread pool for those filling up, as well as any request queues maxing out and thus all new requests getting rejected. But for the most part, requests with bad data and such, short of CVEs, always have limited impact.

I can imagine very few situations where things working differently wouldn't be jarring.

fabian2k · 2 years ago
The most dangerous kinds of mistakes in that area are bugs or oversights that cause massive resource usage or endless loops. Those can easily bring down parts of the application that are otherwise robustly separated in the way you mentioned. Consuming too much memory will kill the app at some point, just using lots of CPU or IO will make everything else grind to a halt. Holding onto resources that are pooled like DB connections is a good way to break everything as well.
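As a sketch of one mitigation for the pooled-resource case, using Go's database/sql (the limits and the `users` table are purely illustrative):

```go
package storage

import (
	"context"
	"database/sql"
	"time"

	_ "github.com/lib/pq" // Postgres driver, as an example
)

// openDB caps the shared connection pool so one misbehaving workload
// can't hold every connection hostage. The numbers are illustrative.
func openDB(dsn string) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(20)                  // hard upper bound on concurrent connections
	db.SetMaxIdleConns(10)                  // don't hoard idle connections
	db.SetConnMaxLifetime(30 * time.Minute) // recycle connections periodically
	return db, nil
}

// countUsers adds a per-query timeout so a stalled statement releases
// its connection instead of starving other request handlers.
func countUsers(ctx context.Context, db *sql.DB) (int, error) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	var n int
	err := db.QueryRowContext(ctx, "SELECT count(*) FROM users").Scan(&n)
	return n, err
}
```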

There are some languages and frameworks designed to handle and isolate even issues like these to some extent, Erlang/Elixir for example.

But in general I would expect a web application backend to isolate services the way you mentioned. That said, I don't think this is really a preconfigured default in every case, and it's not safe to catch everything and keep running. Isolating web requests and only failing the individual request is of course reasonable and should be the default. But for background services your framework can't know which ones are required, which ones are optional, and when to crash the process. Especially as most languages allow shared state there, so the framework can't know how independent they are.

hyperpape · 2 years ago
Indeed, I remember a significant degradation of service that happened when some data files were corrupted.

Each request would need to work with one or more of them. There was a cache, and the code was written to avoid a thundering herd. But when the data was corrupt, an exception was thrown and nothing was put in the cache. So the application sat there dedicating multiple cores to a cycle of loading and parsing these data files, then erroring out.
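One way to avoid that loop is to cache the failure as well as the success; a rough Go sketch with made-up types:

```go
package datacache

import (
	"sync"
	"time"
)

// entry remembers the outcome of a load, success *or* failure, so a
// corrupt file isn't re-parsed on every request.
type entry struct {
	value   []byte
	err     error
	expires time.Time
}

type FileCache struct {
	mu      sync.Mutex
	entries map[string]entry
	load    func(path string) ([]byte, error) // the expensive load+parse step
}

func New(load func(string) ([]byte, error)) *FileCache {
	return &FileCache{entries: map[string]entry{}, load: load}
}

func (c *FileCache) Get(path string) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock() // held across load to keep the sketch short; use singleflight in real code
	if e, ok := c.entries[path]; ok && time.Now().Before(e.expires) {
		return e.value, e.err // failures are served from cache too
	}
	v, err := c.load(path)
	ttl := time.Hour
	if err != nil {
		ttl = 30 * time.Second // retry corrupt files eventually, just not in a hot loop
	}
	c.entries[path] = entry{value: v, err: err, expires: time.Now().Add(ttl)}
	return v, err
}
```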

lawrjone · 2 years ago
Have you worked with Go codebases before?

Standard practice is to wrap all your entrypoints - the start of a web request, the moment you begin to run a job - in a defer recover() which will catch panics.

Sadly, recover won't apply to any subsequently goroutine'd work. That means even if your entrypoints recover, if anything go func()s down the call stack and that function panics, it will bring down the entire process.
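Roughly, as a sketch (not our actual code):

```go
package main

import (
	"log"
	"time"
)

// runJob wraps an entrypoint: panics inside the job itself are recovered.
func runJob(name string, fn func()) {
	defer func() {
		if err := recover(); err != nil {
			log.Printf("job %s panicked: %v", name, err)
		}
	}()
	fn()
}

func main() {
	runJob("handle-message", func() {
		// A panic here would be caught by runJob's recover...

		// ...but a panic on a freshly spawned goroutine escapes every
		// recover on the original call stack and kills the process.
		go func() {
			panic("bad payload in background goroutine")
		}()
	})

	time.Sleep(time.Second) // give the goroutine time to panic
	log.Println("never reached")
}
```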

We were aware of this but ended up including one by accident anyway. It’s very sad to me that you can’t apply a global process handler to ensure this type of thing doesn’t happen, but to my knowledge that isn’t possible.

Worth mentioning Go doesn’t really encourage ‘frameworks’, and most Go apps compose together various libraries to make their app, rather than using something that packages everything together. Failures like this are an obvious downside to not having the well reviewed nuts-and-bolts-included framework to handle this for you.

zimpenfish · 2 years ago
> Have you worked with Go codebases before?

Several but ...

> Standard practice is to wrap all your entrypoints [...] in a defer recover()

... I've never seen that. Is there some literature pointing to this as best practice?

cle · 2 years ago
To me, it's unclear what the best solution is here. Other languages solve this differently with tradeoffs, e.g. Java defaults to threads silently dying when an exception isn't caught. Your program will continue to run, but it's probably in some undefined state at that point. There are mechanisms for propagating exceptions elsewhere, but they have to be explicitly set up (like in Go). You can set a default uncaught exception handler, but that's effectively a global variable with all the subsequent "fun", and the uncaught exception handler has to know how to clean up and restore state if the exception was thrown from anywhere, which seems generally difficult to do correctly.
Gys · 2 years ago
I probably misunderstand what you wrote. Because I think a (wrapped) panic will only result in crashing the one request that caused it?

For example Gin provides a middleware wrapper for handling panics: CustomRecoveryWithWriter returns a middleware for a given writer that recovers from any panics and calls the provided handle func to handle it.

(https://github.com/gin-gonic/gin/blob/master/recovery.go)
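A minimal usage sketch (the route and handler are made up):

```go
package main

import (
	"log"
	"net/http"

	"github.com/gin-gonic/gin"
)

func main() {
	r := gin.New()
	// A recovered panic fails only the offending request, not the process.
	r.Use(gin.CustomRecovery(func(c *gin.Context, err any) {
		log.Printf("recovered: %v", err)
		c.AbortWithStatus(http.StatusInternalServerError)
	}))
	r.GET("/boom", func(c *gin.Context) {
		panic("bad request payload")
	})
	r.Run(":8080")
}
```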

pmontra · 2 years ago
We had a single supervisor for both an Elixir/Phoenix web app and all the tasks it spawned for long-running jobs and scheduled work. That was very naive. A job crashing too quickly, too many times, would bring down the supervisor and the web app. We moved the web app to its own supervision tree and isolated tasks under their own supervisor too. I don't remember the details, but in the end anything could crash and everything else would keep running.
nesarkvechnep · 2 years ago
That's why you use a separate `Task.Supervisor`.
thraxil · 2 years ago
> I've never seen a production ready web framework or library that would let your entire process crash because of a bad request.

It's rare, but I've seen linked C libraries trigger a segfault that takes the whole thing down; no global try/catch strategy can help you when that happens. There should generally be something that supervises and restarts the whole server process, but it's definitely painful and can affect threads other than the one handling the bad request.

tkahnoski · 2 years ago
Depends on how the underlying framework isolates processes... I could see this happening in a monolithic JVM application where it pretends there are separate containers, but a fatal error in the JVM will crash the world on that server.

A related example I lived through with a Perl application, someone decided to use this library 'Storable' that would serialize the memory in a binary format. We upgraded the library and started seeing "slow" performance across the server farm.

We recognized the processes were intermittently crashing and after decoding core dumps... figured out it was this upgrade to the Storable library. Apache httpd server chugged along just fine restarting processes. So different run-time, different type of crash resiliency.

Long-term lesson... be extra cautious with memory-serialized objects. Newer libraries protect against this better by parsing a header to detect compatibility issues before loading the raw object into memory, but the potential is there, especially with distributed systems today.

marcosdumay · 2 years ago
> There's always some error handling logic wrapping around the request handling

Yeah, some languages make this easier than others.

The way you described it made me think it was Rust, a language where those handlers are not trivial, but not incredibly hard either. But it seems that Go developers have it worse (no big surprise here).

Of course, the champions of unhandled failures are always C and C++, where almost any issue is impossible to recover from. It's no coincidence that those are low-level languages (and Go).

When you move from the more controlled languages into those, you tend to also move from in-language error recovery to system-wide error recovery.

curioussavage · 2 years ago
I actually found that in practice my go applications were much more resilient. First one in production was handling well over 100k req/s and it was extremely reliable. As a beginner it was very satisfying.
Arbortheus · 2 years ago
My team has experienced _exactly_ the same cyclical panicking in production before because of the same library.

The Google Pub/Sub Go library does not handle panics by default, so if any message payload plays badly with your code and panics, you cyclically panic the service.

That's because the message keeps getting retried when you don't `Ack` it: non-acknowledged messages are redelivered automatically, so you get the picture.
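A sketch of the kind of wrapping that stops the crash loop (project and subscription IDs are placeholders; whether you Nack or Ack a poison message depends on your dead-letter setup):

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

// handle is the application-level handler; assume it can panic on a bad payload.
func handle(ctx context.Context, m *pubsub.Message) error {
	// ... decode and process m.Data ...
	return nil
}

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	sub := client.Subscription("my-subscription")

	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		// Recover inside the callback so one poison message doesn't crash
		// the process; Nack it so retry / dead-letter policy takes over.
		defer func() {
			if r := recover(); r != nil {
				log.Printf("panic handling message %s: %v", m.ID, r)
				m.Nack()
			}
		}()
		if err := handle(ctx, m); err != nil {
			m.Nack()
			return
		}
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```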

Aeolun · 2 years ago
> I've never seen a production ready web framework or library that would let your entire process crash because of a bad request.

Node does this (regardless of framework/server) if you have some unhandled promise rejection/error in an asynchronous function.

gbuk2013 · 2 years ago
You can catch these with Node but I wouldn’t do it - if it’s not explicitly handled then it’s either a bug or something really bad happened and the app needs to get alarm bells ringing.
fendy3002 · 2 years ago
which can be handled easily with an `unhandledRejection` handler.

Though ideally, web frameworks such as NestJS already handle rejections at the request level, and that handling can be overridden easily with Filters. Yes, I know Express doesn't handle it; you'll need to handle it yourself.

EGreg · 2 years ago
Erlang and any actor-model system, I think.
maeln · 2 years ago
Yes and no. Erlang does apply the "fail fast" idea and recommends letting the actor crash instead of trying to handle the error (with some pragmatism, of course). But we are not talking about crashing the whole app's process. In most cases in Erlang (and you find the same idea in other massively parallel systems, or with microservices), your actor is managed by a supervisor whose job is to keep it alive, restarting it when it crashes. Also, most of the time in Erlang, having an actor crash and spawning a new one is really fast, so you can afford to crash even for something as basic as a malformed JSON request. That's not something you can really afford when spawning a new system is more on the order of several hundred ms.
dumpster_fire · 2 years ago
One critique I have is that this presents a binary choice between a full monolith and microservices.

The truth is somewhere in between, where dependent services large enough to warrant being their own monolith get split off. Breaking up a monolith is almost never (an anecdotal observation after 15 years of seeing this argument surface in every company I've been in) a technical need, but a combination of organizational and business requirements. This is not to say it's never technical, just that it's rare for it to be the reason.

Organizational: The teams are so huge that they prefer their own development cycle, deployment, and rollout safeguards.

Business: Partitioning of products for different SLAs, which is difficult to guarantee when you have a different team introducing some bug every other week that takes down the servers.

devoutsalsa · 2 years ago
A monolith doesn't have to be one giant ball of mud. It can be discrete, well-factored services all by itself. I recently worked on decomposing a monolith into microservices, but it felt like we were just spreading one big problem over multiple services. All of the services ended up being tightly coupled, even though the goal was to avoid that. We created a macrolith.
stepbeek · 2 years ago
And it’s incredibly difficult to work with a distributed monolith. I find that the joy I get from development just fades completely when even trivial changes involve too much faff.
taneq · 2 years ago
Exactly! All this debate seems to stem from people being forced to work with a big ball of mud and then concluding that no one process should ever be allowed to become that big, because big processes are balls of mud. And so, having missed the real lesson (which isn’t ‘don’t make it big’ but rather ‘don’t make it out of mud’!) they build a big ball of little balls of mud.
mjr00 · 2 years ago
It's not about monoliths always being a ball of mud. Even the most well-composed monolith still has problems with teams wanting to do conflicting release cycles, needing clearer ownership over who has responsibility for what part of the codebase, and knowing who should be responsible for on-call for which services. And there is, of course, dependency hell, since everything in your monolith probably should depend on the same version of third-party libraries.
intelVISA · 2 years ago
Agreed, transforming a monolith into a 'distributed monolith' just creates a different set of even worse problems; the root issues remain unsolved.

Still, thanks to AWS it's in vogue, expensive, and probably keeps half of us employed when we have to untangle it all!

lawrjone · 2 years ago
You’re right, that’s a very fair point. I should’ve made it clear that the middle ground of moderately-sized but logically separate services is very much a good world to be in.

I’d actively encourage splitting a monolith into services when the candidate split would be:

1. A service that has a clear purpose that can be explained either separately from the monolithic whole or as a complementary piece

2. It will receive enough active development that the cost of upgrading/on-going investment is an acceptable percent of total work time

3. It can improve the organisation's state of 'ownership' and better align the codebase with the team structure that owns it

There is always a middle ground, and I should’ve done better to explain that.

fendy3002 · 2 years ago
And one technical/business reason that's very valid: error-proofing the service and database. To prevent the database from being populated incorrectly, for example with financial data, you put it behind a set of well-defined APIs, which also prevents leakage of sensitive data (PII or credit-card info) and makes it easy to audit.
Olphs · 2 years ago
Do you think it's a good idea to migrate from a monolith to microservices before scaling the team?

We are essentially doing this right now. I personally don't think the teams are yet large enough to warrant this migration, but leadership says that we can't scale further before we have microservices. IMO most of the problems could be solved with increased testing and knowledge about the system. The tech could also definitely still scale just by boosting the hardware and doing occasional performance passes on slow endpoints/queries.

yakshaving_jgt · 2 years ago
> but leadership says that we can't scale further before we have microservices

These are depressingly common alarm bells.

There are companies with valuations over 12 figures which don't do microservices, so the scaling justification is dubious at best.

mikeryan · 2 years ago
> can't scale further before we have microservices.

Scale what? Load? (Seems not.)

Scale the team? This is an often-overlooked aspect of microservices: you're able to isolate the domain knowledge into smaller services and hire more specialized engineers who can focus on a specific silo of the app and only have to work against discrete APIs.

Scale features? There's a legit argument to be made that if you see yourself going to a microservice architecture in the future based on your current codebase or product roadmap, then doing it sooner rather than later is always going to be preferable. The bigger the monolith, the harder it is to unwind. It could also be that some tech debt is a limiting factor, and changing the architecture could be an opportunity to address some of that.

iamflimflam1 · 2 years ago
Are you doing "extreme" microservices? A separate database for each microservice, one endpoint per microservice, etc.? Or the slightly more sensible approach: splitting the monolith up into obviously independent systems (very much like the article talks about)?
npn · 2 years ago
> This is not to say it's never technical, just that it's rare for it to be a reason.

Technical reasons will become a lot more common from now on, as we begin to utilize more "AI" stuff like language models.

All of those are slow to start because they need time to load a huge model, so it is better to make a service dedicated to the AI stuff, so that development on the business side doesn't get held back by the slow restart cycle.

And AI is just an example; there are lots of cases where it's more reasonable to split something into another service, for example CPU-intensive tasks. Even Erlang/BEAM can't do anything about those if the code is written in C/Zig/Rust and called via a NIF.

ahofmann · 2 years ago
One often finds derogatory remarks about PHP, or that the popularity of the language is declining. Interestingly, the concept of PHP prevents exactly such problems, as a monolith written in PHP only ever executes the code paths that are necessary for the respective workload. The failure that the Rails app experienced in the post simply wouldn't have happened with PHP. Especially for web applications, PHP's concept of "One Request = One freshly started process" is absolutely brilliant.

So, OP is right, write more monoliths, and I would add: write them in languages that fit the stateless nature of http requests.

InformalAnybody · 2 years ago
> "One request = one freshly started process"

This isn't true for all (most?) PHP applications today. PHP installations include the FastCGI Process Manager (php-fpm). According to Wikipedia (https://en.wikipedia.org/wiki/FastCGI),

> Instead of creating a new process for each request, FastCGI uses persistent processes to handle a series of requests.

According to the PHP Internals book (https://www.phpinternalsbook.com/php7/memory_management/zend...), PHP comes close to a "share-nothing architecture" thanks to custom implementations of `malloc()` and `free()` that know about the lifecycle of a request.

hu3 · 2 years ago
It depends on the configuration.

One can set pm.max_requests=1 and have the process respawn per request.

https://www.php.net/manual/en/install.fpm.configuration.php#...

And the main point still stands: If a request manages to crash a php-fpm child process, other requests are unaffected and another process is spawned to replace the crashed one.

anditherobot · 2 years ago
PHP is amazing. Yes, the syntax is a little awkward, but I love working with it.

Especially with libraries/frameworks that are 10+ years old: their documentation is extremely detailed.

So you have more confidence that you'll deliver a working product, instead of reading GitHub issues to fix an obscure error message.

Old does not always mean outdated.

pphysch · 2 years ago
> Interestingly, the concept of PHP prevents exactly such problems, as a monolith written in PHP only ever executes the code paths that are necessary for the respective workload.

This is just lovely when you have hundreds of PHP endpoints written by your predecessors and each endpoint has rewritten an arbitrary slice of the stack (usually the data-model layer) because there is no common code path required. Refactoring anything below the HTML layer becomes impossible.

In fact, calling it a software monolith is misleading, because each PHP script is its own little microservice with poorly-defined API boundaries.

withinboredom · 2 years ago
Hmmm. When was the last time you wrote php? One file per route has been outdated for over 10 years.
lawrjone · 2 years ago
Small note that the app in the post was not Rails, it’s a Golang app.

You're right that PHP couldn't fail like this though, as the isolation boundary is a Unix process, where the OS should contain all your errors.

stefano_c · 2 years ago
The irony is that in a normal Rails app workloads are already split by default (requests and background jobs are handled by different processes that share the same codebase).
quickthrower2 · 2 years ago
It comes down to how you want to do concurrency I guess? Processes? Threads? Event loop? Is there a clear winner? I often see debates or even flamewars about what is the best approach.
martin_drapeau · 2 years ago
This is a good point. Having a single process to handle all requests is asking for trouble.
lawrjone · 2 years ago
Author here, thanks for sharing!

This is a strategy I’ve used across many projects in several languages, from big Rails apps to the Go monolith we deploy at incident.io today.

It’s just one way you can make a monolith much more robust without moving into separate services, which helps you keep the benefits of a monolithic application for longer.

Hope people find it interesting.

Jupe · 2 years ago
While I completely agree that the effort required to build and maintain a set of microservices is often better spent on a single monolith (even one that has several "run modes" as you've done here), a few questions inevitably come up:

How do you coordinate the efforts of 200 engineers on a single repo/code base?

Do engineers frequently get into long/drawn-out merge sessions as common code may be modified by a number of engineers who are all trying to merge around the same time? This is actually one of the reasons I really like GOLANG: "A little copying is better than a little dependency."

krab · 2 years ago
I have worked in a company with a monolith and about ten teams working on it. This is what helped:

- Merging was automated (a robot tried to run tests against fresh master and merge only if green).

- Deploy was fully automated and limited to working hours.

- We added tests for problematic parts. For example, static analysis for database migrations to allow only safe actions in an automated fashion.

However, if something goes wrong in some component, you have to revert and stop the deploys for everyone, which sucks. I'd say at around 8-10 deploys per day it makes sense to start splitting the components, or at least to stop adding new teams to the same monolith.

herodoturtle · 2 years ago
Thank you for sharing this article!

> But having spent half a decade stewarding a Ruby monolith from 20 to 200 engineers and watched its modest 10GB Postgres database grow beyond 5TB, there’s definitely a point where the pain outweighs the benefits.

Woah!

I for one would also love to hear your insights on scaling personnel from 20 to 200.

We’re in a similar boat / anticipated growth phase (at 20 odd engineers), and whilst there’s a lot of content on this topic, I’d appreciate your practical “from the coal face” take (much like what you’ve done with this monolith article).

Perhaps a follow up article? ^_^

lawrjone · 2 years ago
Glad you enjoyed the article!

The experience I reference was from my time at GoCardless, where I was a Principal SRE leading the team who were on-call for the big Rails monolith.

I’ll put the topic of “what does 10x eng team scaling feel like” on my todo list for posts, but if you’re interested there’s a bunch of articles on my personal blog from this time.

One that might be interesting is “Growing into Platform Engineering” about how the SRE function evolved:

https://blog.lawrencejones.dev/growing-into-platform-enginee...

rfoo · 2 years ago
Question: Have you considered one separate binary (but still sharing the entire codebase) for each "workload" instead of using a flag to switch? If so, can you highlight the advantage of the flag way?
lawrjone · 2 years ago
Yep: this is definitely a valid alternative.

The reasons we went with one binary are:

- We can ensure common runtime setup by only having one codepath that runs the initialisation. It's nice that every deployment definitely instantiates the Prometheus server and initialises secrets from Google Secret Manager at the same point and in the same way.

- Compiling three targets adds time to our build process. Even though Go will cache much of the compilation, linking separate binaries still adds another ~30s and ~300MB of binaries to our resulting Docker image.

- It's very convenient in development to run all deployments inside of a single binary for hot-reloading purposes. This is how we develop the app locally.

I don't think it makes much difference which you choose in terms of the operational elements. But for the reasons above, there are clear advantages to the single binary for our use cases.
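Not our actual code, but a sketch of what the flag-switched single binary might look like (runWeb/runWorkers/runCron are hypothetical stand-ins):

```go
package main

import (
	"flag"
	"log"
	"sync"
)

// Hypothetical workload entrypoints; the real app would wire these up to its
// HTTP server, Pub/Sub subscribers and cron scheduler.
func runWeb()     { log.Println("web: serving requests") }
func runWorkers() { log.Println("workers: consuming subscriptions") }
func runCron()    { log.Println("cron: scheduling jobs") }

func main() {
	web := flag.Bool("web", false, "run the web server")
	workers := flag.Bool("workers", false, "run the async workers")
	cron := flag.Bool("cron", false, "run the cron scheduler")
	flag.Parse()

	// Shared runtime setup happens exactly once, whichever deployment this
	// is: metrics server, secrets, database pool, etc.

	var wg sync.WaitGroup
	start := func(enabled bool, run func()) {
		if !enabled {
			return
		}
		wg.Add(1)
		go func() {
			defer wg.Done()
			run()
		}()
	}
	start(*web, runWeb)
	start(*workers, runWorkers)
	start(*cron, runCron)
	wg.Wait()
}
```

In development you can run everything in one process (app -web -workers -cron); in production each deployment passes only its own flag.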

jameshart · 2 years ago
* edit: I am an idiot who can’t recognize golang code (facepalm)

So it’s a Ruby* app, which means code you put in the container but never use doesn’t do you any harm - it doesn’t bloat a binary or take up space in memory.

but…

… whenever you make a change in one of your pub-sub handlers, you still have to redeploy your web workers.

… code and dependencies for your pub-sub handlers are sitting idle in your web workers, increasing your security risk and attack surface

Why not just turn this into a monorepo that publishes three artifacts? Having just one container image seems like a weird thing to optimize for

twosdai · 2 years ago
I think that publishing 3 artifacts causes a number of other changes which make the backend of the app harder to manage, even though they do, hopefully, allow independent scaling and separate recovery processes. Separate artifacts also make local development a bit more difficult, and deployments become more complex, so it's not a completely clean tradeoff.

I know systems like Kubernetes / Docker Swarm handle the management of multiple artifacts, but that is now another tool needed in order to do development.

So I guess what I'm saying is that "just" publishing multiple artifacts has more knock-on effects which need to be considered, even though I do agree with you that it's probably for the best in the long term if these processes are in separate containers.

jameshart · 2 years ago
In this model there are already three different command lines to run the thing in three different modes. Moving that decision one notch left, to build time instead of startup time, doesn't seem like it would increase the accidental complexity at all.

In fact now you can eliminate variation in how your three different processes are started up and health checked and monitored, so instead of code that knows how to operate three different things, you have code that knows how to start and operate an arbitrary thing, and you parameterize it with your three different containers.

Oh hey - that’s what kubernetes is for.

Maybe all this ‘complexity’ of tooling isn’t dumb after all.

krab · 2 years ago
> … whenever you make a change in one of your pub-sub handlers, you still have to redeploy your web workers.

We have a similar model with Django. We deploy at least four times a day, so it's much easier to just redeploy everything than reason about what needs and what doesn't need to be redeployed.

lawrjone · 2 years ago
It’s a Go app, not a Ruby one. Hence the Go code examples.

And the rationale behind one binary rather than several is to reduce build time and deployment artefact size, along with benefits in development from one binary. All of which mean deploying is much quicker/easier, which means it’s less of a big deal that we deploy everything whenever any part changes (we deploy anywhere up to 30 times a day).

Hope that helps it make more sense.

jameshart · 2 years ago
Apologies - read the ‘Ruby monolith’ up top and then glossed over the code without noticing the language mismatch. Mondays, am I right?

I don’t understand how bundling three sets of functionality into one binary reduces build time or artifact size.

Surely the binary containing all the web, pubsub and scheduled-task code is strictly bigger than the binary for just the web would be, and takes longer to build and test than the binary for just the web?

electroly · 2 years ago
> reduce build time and deployment artefact size

I'm confused... doesn't this plan _maximize_ your build time and deployment size? You can't do worse than building and deploying every line of code you own on every build and deploy; it's the worst case scenario. Go is fast enough in compilation that it doesn't matter, right? But then what's the point?

rezonant · 2 years ago
This is standard practice on monolithic apps. It is effectively an N-tier architecture. On Rails it is very expected that you would deploy the same codebase to two tiers (web and worker) and simply run rails s in one and sidekiq in the other.
heipei · 2 years ago
We're doing this with our monolithic Node.js / Express application as well, but I wasn't aware that it's a common pattern. For us having a monolithic app greatly simplifies development, deployment, updating dependencies, shared code, etc. Just having a single set of third-party dependencies to update makes things so much easier.
lawrjone · 2 years ago
It’s ‘standard practice’ for many frameworks like Rails which encourage the pattern via a separate deployment of the async worker tier.

But outside of these frameworks, you won’t often see it. And especially in languages like Go which tend to avoid opinionated frameworks it’s common not to see a split like this, even when it might be really beneficial.

As with most programming concepts, standard practice varies a lot depending on the tools you use. It’s why I thought this post was worth writing, in that there are many people who haven’t run Rails/etc frameworks before and may not have considered this approach!

Rapzid · 2 years ago
It's so common Heroku has worker dynos and cron support via the scheduler add-on.
KyeRussell · 2 years ago
Ditto in Django + Celery
slig · 2 years ago
Anyone using a more modern/lightweight alternative to Celery?
fideloper · 2 years ago
ditto Laravel (PHP)
jabradoodle · 2 years ago
I appreciate the keep your monolith argument. It seems like a lot of people have the attitude of let's start with a monolith, and switch to microservices when the monolith becomes painful.

Having seen how much effort goes into getting microservices running well for any non trivial setup, I wonder what could have been achieved if 25% of that resource had gone into improving the monolith.

afhammad · 2 years ago
> Bad code in a specific part of the codebase bringing down the whole app, as in our November incident.

This is a non-issue if you're using an Elixir/Erlang monolith, given its fault-tolerant nature.

The noisy neighbour issue (resource hogging) is still something you need to manage though. If you use something like Oban[1] (for background job queues and cron jobs), you can set both local and global limits. Local being a single node, and global across the cluster.

Operating in a shared cluster (vs. split workload deployments) gives you the benefit of being much more efficient with your hardware. I've heard many stories of massive infra savings from moving to an Elixir/Erlang system.

1. https://github.com/sorentwo/oban