Avoiding cyclic dependencies is good, sure. And they do name specific problems that can happen in counterexample #1.
However, the reasoning as to why it can't be a general DAG and has to be restricted to a polytree is really tenuous. They basically just say counterexample #2 has the same issues, with no real explanation. I don't think it does; it seems fine to me.
There's no particular reason an Auth system must be designed like counterexample #2. There are many ways to design that system and avoid cycles. You can leverage caching of role information (propagated via messages/bus), JWTs with roles baked in and IdPs you trust, etc. Hitting an Auth service for every request is chaotic and likely a source of issues.
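For illustration, a minimal sketch of the "roles baked into a JWT" idea using the PyJWT library; the claim name, role string, and signing key here are all made up:

```python
# Sketch: verify a JWT locally instead of calling the auth service on
# every request. Assumes the IdP signs tokens with a key we trust; the
# "roles" claim and the secret below are invented for illustration.
import jwt  # PyJWT

TRUSTED_KEY = "shared-secret-from-the-idp"  # in practice, an RSA/EC public key

def roles_from_token(token: str) -> list[str]:
    # Raises jwt.InvalidTokenError on a bad signature or expired token,
    # so callers never need a network round trip to the auth service.
    claims = jwt.decode(token, TRUSTED_KEY, algorithms=["HS256"])
    return claims.get("roles", [])

def can_delete_customer(token: str) -> bool:
    return "crm:admin" in roles_from_token(token)
```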
There are a million reasonable situations where this pattern could arise because you want to encapsulate a domain behind a microservice.
Take the simplest case of a CRM system: a service provides search/segmentation and CRUD on top of customer lists. I can think of a million ways other services could use that data.
Suppose we were critiquing an article advocating the health benefits of black coffee consumption, say. We might raise eyebrows, or immediately close the tab without further comment, if a claim was not backed up by any supporting evidence (e.g. some peer-reviewed article with clinical trials or a longitudinal study and statistical analysis).
Ideally, for this kind of theorising we could devise testable falsifiable hypotheses, run experiments controlling for confounding factors (challenging, given microservices are _attempting_ to solve joint technical-orgchart problems), and learn from experiments to see if the data supports or rejects our various hypotheses. I.e. something resembling the scientific method.
Alas, it is clearly cost-prohibitive to run such experiments to test the impacts of proposed rules for constraining enterprise-scale microservice (or macroservice) topologies.
The last enterprise project I worked on was roughly adding one new orchestration macroservice atop the existing mass of production macroservices. The budget to get that one service into production might have been around $25m. Maybe double that to account for supporting changes that also needed to be made across various existing services. Maybe double it again for coordination overhead, reqs work, integrated testing.
In a similar environment, maybe it'd cost $1b-$10b to run an experiment comparing different strategies for microservice topologies (i.e. actually designing and building two different variants of the overall system and operating them both for 5 years, measuring enough organisational and technical metrics, then trying to see if we could learn anything...).
Anyone know of any results or data from something resembling a scientific method applied to this topic?
I think the article is just nonsense.
Came here to say the same thing. A general-purpose microservice that handles authentication or sends user notifications would be prohibited by this restriction.
I might have a different take. I think microservices should each be independent such that it really doesn't matter how they end up being connected.
Think more actors/processes in a distributed actor/csp concurrent setup.
Their interface should therefore be hardened and not break constantly, and they shouldn't each need deep knowledge of the intricate details of each other.
Also for many system designs, you would explicitly want a different topology, so you really shouldn't restrict yourself mentally with this advice.
> I might have a different take. I think microservices should each be independent such that it really doesn't matter how they end up being connected.
The connections you allow or disallow are basically the main interesting thing about microservices. Arbitrarily connected services become mudpits, in my experience.
> Think more actors/processes in a distributed actor/csp concurrent setup.
A lot of actor systems are explicitly designed as trees, especially with regard to lifecycle management and who can call who. E.g. A1 is not considered started until its children A2 and A3 (which are independent of each other and have no knowledge of each other) are also started.
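A toy sketch of that lifecycle rule, in plain Python rather than any real actor framework:

```python
# Toy sketch of tree-shaped lifecycle: a node only reports "started"
# once all of its children have started, and children know nothing
# about their siblings. Just the shape, not a real actor system.
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.started = False

    def start(self):
        for child in self.children:   # depth-first: leaves come up first
            child.start()
        self.started = True
        print(f"{self.name} started")

a2, a3 = Node("A2"), Node("A3")       # independent, unaware of each other
a1 = Node("A1", children=[a2, a3])    # A1 is up only after A2 and A3
a1.start()                            # prints A2, A3, then A1
```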
> Also for many system designs, you would explicitly want a different topology, so you really shouldn't restrict yourself mentally with this advice.
Sometimes restrictions like these are useful, as they lead to shared common understanding.
I'd bet an architecture designed with a restricted topology like this has a better chance of composing with newly introduced functionality over time than an architecture that allows any service to call any other[1]. Especially so if this tree-shaped architecture has some notion of "interface" services that hide all of the subservices in that branch of the tree, only exposing the public interface through one service. Reusing my previous example, this would mean that some hypothetical B branch of the tree has no knowledge of A2 and A3, and would have to access their functionality through A1.
This allows you to swap out A2 and A3, or add A4 and A5, or A2-2, or whatever, and callers won't have to know or care as long as A1's interface is stable. These tree-shaped topologies can be very useful.
1 - https://www.youtube.com/watch?v=GqmsQeSzMdw
Well, in practice you're likely to have hard dependencies between services in some respect, in that the service won't be able to do useful work without some other service. But I agree that in general it's a good idea to have a graceful degradation of functionality as other services become unavailable.
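As a rough sketch of what graceful degradation can look like, assuming a hypothetical recommendations service and using the `requests` library; the URL and response shape are invented:

```python
# Sketch of graceful degradation: if the (hypothetical) recommendations
# service is down, fall back to a static list instead of failing the
# whole page. The service URL and JSON shape are made up.
import requests

FALLBACK = ["bestsellers"]  # degraded but still useful

def recommendations(user_id: str) -> list[str]:
    try:
        resp = requests.get(
            f"http://recs.internal/users/{user_id}", timeout=0.2
        )
        resp.raise_for_status()
        return resp.json()["items"]
    except requests.RequestException:
        return FALLBACK  # degrade rather than propagate the failure
```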
As we are talking about microservices, K8s has two patterns that are useful.
The first is a global namespace root with sub-namespaces holding just desired config and current config, with the complexity hidden in the controller.
The second is closer to your issue above, but it is just dependency inversion: the kubelet has zero info on how to launch a container, make a network, or provision storage, and hands that off to the CRI, CNI, or CSI.
Those are hard dependencies that can follow a simple wants/provides model, which depending on context is often simpler when failures happen and allows for replacement.
E.g. you probably wouldn't notice if crun or runc is being used, nor would you notice that it is often systemd that is actually launching the container.
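A rough sketch of that wants/provides shape, using a Python `Protocol` as a stand-in for something like the CRI; the class and function names are invented for illustration:

```python
# Sketch of the wants/provides idea: the orchestrator "wants" a
# container runtime but has zero knowledge of crun vs. runc, much like
# the kubelet hands launching off to the CRI.
from typing import Protocol

class ContainerRuntime(Protocol):      # the "wants" side: an interface only
    def run(self, image: str) -> str: ...

class Runc:                            # one "provides" implementation
    def run(self, image: str) -> str:
        return f"runc launched {image}"

class Crun:                            # a drop-in replacement
    def run(self, image: str) -> str:
        return f"crun launched {image}"

def launch_workload(runtime: ContainerRuntime, image: str) -> str:
    # The caller can't tell (and doesn't care) which runtime it got.
    return runtime.run(image)

print(launch_workload(Crun(), "nginx"))
```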
But finding those separations of concerns can be challenging. And K8s only moved to that model after suffering from the pain of having them in-tree.
I think a DAG is a better aspirational default though.
It's a nearly universal rule you'll want on every kind of infrastructure and data organization.
You can get away for some time with making things linked by offline or pre-stored resources, but it's a recipe for an eventual disaster.
I agree with this, and also I’m confused by the article’s argument—wouldn’t this apply equally to components within a monolith? Or is the idea that—within a monolith—all failures in any component can bring down the entire system anyway?
> it really doesn't matter how they end up being connected.
I think you just mean that it should be robust to the many ways things end up being connected, but it always does matter. There will always be a cost to being inefficient, even if it's OK to be.
> Even without a directed cycle this kind of structure can still cause trouble. Although the architecture may appear clean when examined only through the direction of service calls the deeper dependency network reveals a loop that reduces fault tolerance increases brittleness and makes both debugging and scaling significantly more difficult.
While I understand the first counterexample, this one seems a bit blurry. Can anybody clarify why a directed acyclic graph whose underlying undirected graph is cyclic is bad in the context of microservice design?
Without necessarily endorsing the article's ideas... I took this to be like the diamond-inheritance problem.
If service A feeds both B and C, and they both feed service D, then D can receive an incoherent view of what A did, because nothing forces B and C to keep their stories straight. But B and C can still both be following their own spec perfectly, so there's no bug in any single service. Now it's not clear whose job it is to fix things.
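A toy illustration of how that plays out, with B's and C's specs invented for the example:

```python
# Toy illustration of the diamond problem: B and C each follow their
# own (reasonable) spec for relaying A's data, and D ends up with two
# conflicting views of the same fact. No single service has a bug.
event_from_a = {"customer": "ACME Corp", "status": "closed"}

def b_view(evt):   # B's spec: normalize names to lowercase
    return {"customer": evt["customer"].lower(), "status": evt["status"]}

def c_view(evt):   # C's spec: pass names through, map status to a code
    return {"customer": evt["customer"], "status_code": 2}

d_inputs = [b_view(event_from_a), c_view(event_from_a)]
# D now holds both "acme corp" and "ACME Corp" for the same customer,
# and must somehow decide whether "closed" and status_code 2 agree.
print(d_inputs)
```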
At first this sounds cool but I feel like it falls apart with a basic example.
Let's say you're running a simple e-commerce site. You have some microservices, like, a payments microservice, a push notifications microservice, and a logging microservice.
So what are the dependencies? You might want to send a push notification to a seller when they get a new payment, or if there's a dispute or something. You might want to log that too. And you might want to log whenever any chargeback occurs.
Okay, but now it is no longer a "polytree". You have a "triangle" of dependencies. Payment -> Push, Push -> Logs, Payment -> Logs.
These all just seem really basic, natural examples though. I don't even like microservices, but they make sense when you're essentially just wrapping an external API like push notifications or payments, or a single-purpose datastore like you often have for logging. Is it really a problem if a whole bunch of things depend on your logging microservice? That seems fine to me.
Is your example really a "triangle" though? If you have a broker/queue, and your services just push messages into the ether, there's no actual dependency going on between these services.
Nothing should really depend on your logging service. They should push messages onto a bus and forget about them... i.e., they aren't even aware of the logging service's existence.
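A minimal sketch of that fire-and-forget style, with an in-process queue standing in for a real bus like Kafka or RabbitMQ:

```python
# Sketch of fire-and-forget logging: services publish onto a bus and
# never learn whether (or which) logging service consumes the events.
# `bus` is an in-process stand-in for a real message broker.
import json, queue, time

bus: "queue.Queue[str]" = queue.Queue()

def emit_log(service: str, message: str) -> None:
    # No return value, no ack: the producer has no dependency on any
    # consumer, so no edge appears in the service graph.
    bus.put(json.dumps({"ts": time.time(), "service": service, "msg": message}))

emit_log("payments", "chargeback recorded for order 123")
# A logging service (or two, or none) drains the queue independently:
print(json.loads(bus.get()))
```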
That example still contains an undirected cycle, so it is not a polytree and thus, by the reasoning of the author of TFA, not kosher, for reasons they don't really explain.
Honestly I think the author learned a bit of graph theory, thought polytrees are interesting and then here we are debating the resulting shower thought that has been turned into a blog post.
The issue is that one of the services acts as the event hub that lets the rest remain loosely coupled (observer pattern).
Kafka, or any event queue/stream, is critical in the sense that everything depends on it, like fish depend on the ocean being there. But among the fish themselves, dependencies can stay acyclic.
The only good reason would be bulk log searching, but a lot of cloud providers will already capture and aggregate logs and let you query them, or there are good third-party services that do this.
Pretty handy to search a debug_request_id or something and be able to see every log across all services related to a request.
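A sketch of one way to get that, stamping a correlation id onto every log record via a contextvar; the `debug_request_id` field name mirrors the one above, the rest of the scaffolding is our own:

```python
# Sketch: carry a correlation id in a contextvar and inject it into
# every log record, so one query on debug_request_id shows a request's
# path across services.
import contextvars, logging

request_id = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.debug_request_id = request_id.get()
        return True  # never drops records, only annotates them

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(debug_request_id)s %(name)s %(message)s"))
handler.addFilter(RequestIdFilter())
log = logging.getLogger("payments")
log.addHandler(handler)
log.setLevel(logging.INFO)

request_id.set("req-42")     # set once at the edge of the request
log.info("charge accepted")  # -> "req-42 payments charge accepted"
```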
Logs need to go somewhere to be collected, viewed, etc. You might outsource that, but if you don't, it's a service of its own (probably actually a collection of microservices: ingestion, a web server to view them, etc.).
This is a fair enough point, but you should also try to keep that tree as small as possible. You should have a damn good reason to make a new service, or break an existing one in two.
People treat the edges on the graph like they're free, as if managing all those external interfaces between services were trivial. It absolutely is not. Each one of those connections represents a contract between services that has to be maintained, and that's orders of magnitude more effort than passing data internally.
You have to pull in some kind of new dependency to pass messages between them. Each service's interface has to be documented somewhere. If the interface starts to get complicated you'll probably want a way to generate code to handle serialization/deserialization (which also adds overhead).
And to share code, instead of just having a local module (or whatever your language uses), you now have to manage a new package. It either has to be built and published to some repo somewhere, or it has to be a git submodule, or you just end up copying and pasting the code everywhere.
Even if it's well architected, each new service adds a significant amount of development overhead.
A contract that needs to be maintained at some level of quality even when you're deploying or overloaded.
Load shedding is a pretty advanced topic and it's the one I can think of off the top of my head when considering how Chesterton's Fence can sneak into these designs and paint you into a corner that some people in the argument know is coming and the rest don't believe will ever arrive.
But it's not alone in that regard. The biggest one for me is we discover how we want to write the system as we are writing it. And now we discover we have 45 independent services that are doing it the old way and we have to fix every single one of them to get what we want.
The problem with "microservices" is the "micro". Why we thought we needed so many tiny services is beyond me. How about just a few regular-sized services?
At the time “microservices” was coined, “service oriented architecture” had drifted from being an architectural style to being associated with implementation of the WS-* technical standards, and was frequently used to describe what were essentially monoliths with web services interfaces.
“Microservices” was, IIRC, more about rejecting that and returning to the foundations of SOA than anything else. The original description was each would support a single business domain (sometimes described “business function”, and this may be part of the problem, because in some later descriptions, perhaps through a version of the telephone game, this got shortened to “function” and without understanding the original context...)
Micro is a relative term. And was coined by these massive conglomerates, where micro to them is "normal sized" to us.
They work better if you ignore what "micro" normally means.
But "not too too large services" doesn't quite roll off the tongue.
I always took it to be a minimum and that "micro" meant "we don't need to wait for a service to have enough features to exist. They can be small." Instead, people see it as a maximum and services should be as small as possible, which ends up being a mess.
"Micro" refers to the economy, not the technology. A service in the macro economy is provided by another company. Think of a SaaS you use. Microservices takes the same model and moves it under the umbrella of a micro economy (i.e. a single company). Like traditional SaaS, each team is responsible for their own product, with communication between teams limited to sharing of documentation. You don't get to call up a developer when you need help.
It's a (human) scaling technique for large organizations. When you have thousands of developers they can't possibly keep in communication with each other. You have to draw a line between them. So, we draw the line the same way we do at the global scale.
Conway's Law, as usual.
The name was probably chosen poorly and led to much confusion.
Kind of - AFAIK "micro" was never actually thoroughly defined. In my mind I think of it as mapping to one table (i.e., users = user service, balances = balances service), but that might still be a "full service" worth of code if you need anything more than basic CRUD.
The original sense was one business domain or business function (which often would include more than one table in a normalized relational DB). The broader context was that, given the observation that software architecture tends to reflect the team structure of the development organization, software development organizations should parallel business organizations, and software serving different business functions should be loosely coupled. That way, business needs in any area could be addressed with software change subject only to the unavoidable friction from software serving other business functions, friction directly tied to the business impact of the change on those connected functions, rather than having change driven by business needs in a particular area inhibited by constraints from coupling between software components that are unrelated in business function.
This seems cool if all you need is: call service -> get response from service -> do something with response.
How do you structure this for long-running tasks when you need to alert multiple services upon their completion?
Like, what does your polytree look like if you add a messaging pub/sub type system into it? Does that just obliterate all semblance of the graph now that any service can subscribe to events? I am not sure how you can keep it clean and also have multiple long-running services that need to be able to queue tasks and alert every concerned service when work is completed.
> Like, what does your polytree look like if you add a messaging pub/sub type system into it?
A message bus is often considered a clean way to deal with a cycle, and would exist outside the tree. I hear your point about the graph disappearing entirely if you use a message bus for everything, but this would probably either be for an exceptionally rare problem-space, or because of accidental complexity.
Message busses (implemented correctly) work because:
* If the recipient of the message is down, the message will still get delivered when it comes back up. If we use REST calls for completion callbacks, then the sender might have to do retries and whatnot over protracted periods.
* We can deal with poison messages. If a message is causing a crash or generally exceptional behavior (because of unintentional incompatible changes), we can mark it as poisoned and have a human look at it - instead of the whole system grinding to a halt as one service keeps trashing another.
REST/RPC should be for something that can provide an answer very quickly, or for starting work that will be signaled as complete in another way. Using a message bus for RPC is just as much of a smell as using RPC for eventing.
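A toy sketch of the redelivery and poison-message handling described in the bullets above, with an in-memory deque standing in for a real broker (SQS, RabbitMQ, etc. offer these knobs natively):

```python
# Sketch: redeliver a message until a consumer handles it, and park it
# as "poisoned" after repeated failures instead of letting one bad
# message wedge the whole system.
import collections

MAX_ATTEMPTS = 3
main_queue = collections.deque([{"body": "ok"}, {"body": "boom"}])
dead_letter: list[dict] = []

def handle(msg: dict) -> None:
    if msg["body"] == "boom":           # e.g. an incompatible schema change
        raise ValueError("cannot process")

while main_queue:
    msg = main_queue.popleft()
    try:
        handle(msg)
    except Exception:
        msg["attempts"] = msg.get("attempts", 0) + 1
        if msg["attempts"] >= MAX_ATTEMPTS:
            dead_letter.append(msg)     # a human looks at it later
        else:
            main_queue.append(msg)      # redeliver instead of crashing

print(dead_letter)  # -> [{'body': 'boom', 'attempts': 3}]
```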
And, as always, it depends. The line may be somewhere completely different for you. But, and I have seen this multiple times, a directed cycle in a distributed system's architecture turns it into a distributed monolith: eventually you will reach a situation where everything needs to deploy at the same time. Many, many engineers can talk about their lessons in this - and you are, as always, free to ignore people talking about the consequences of their mistakes.
I think for a lot of teams, part of the microservices pitch is also that at least some of the services are off the shelf things managed by your cloud provider or a third party.
The explanation given makes sense. If they're operating on the same data, especially if the result goes to the same consumer, are they really different services? On the other hand, if the shared service provides different data to each, is it really one microservice or has it started to become a tad monolithic in that it's one service performing multiple functions?
I like that the author provides both solutions: join (my preferred) or split the shared service.
I don't understand this. Can you help explain it with a more practical example? Say that N1 (the root service) is a GraphQL API layer or something. And then N2 and N3 are different services feeding different parts of that API—using Linear as my example, say we have a different service for ticket management and one for AI agent management (e.g. Copilot integration). These are clearly different services with different responsibilities / scaling needs / etc.
And then N4 is a shared utility service that's responsible for e.g. performance tracing or logging or something similar. To make the dependency "harder", we could consider that it's a shared service responsible for authentication and authorization. So it's clear why many root services are dependent on it—they need to make individual authorization decisions.
How would you refactor this to remove an undirected dependency loop?
Most components need to depend on an auth service, right? I don’t think that means it’s all necessarily one service (does all of Google Cloud Platform or AWS need to be a single service)?
I think it does indeed make a lot of sense in the particular example given.
But what if we add 2 extra nodes: n5 dependent on n2 alone, and n6 dependent on n3 alone? Should we keep n2 and n3 separate and split n4, or should we merge n2 and n3 and keep n4, or should we keep the topology as it is?
The same sort of problem arises in a class inheritance graph: it would make sense to merge classes n2 and n3 if n4 is the only class inheriting from it, but if you add more nodes, then the simplification might not be possible anymore.