System design and the cost of architectural complexity (2013)

In the late 90s I travelled the world with a couple of CD-ROMs installing NT 4.0 on individual physical servers. I understood the entire stack. I was the only engineer and I was often very remote.

In the mid 2000s we installed Server 2003/8 on a whole bunch of physical servers in a data centre. This was for one of Australia’s ‘big four’ banks. Me and a team of less than ten, most of whom I still know, managed the entire thing. We were the ‘3rd level infrastructure’ team. Most of us understood most of the stack end-to-end, although there were some specialities.

In 2023 I work for an IT integrator. These days we use Azure, because Microsoft has fooled us in to thinking that it’s cheaper.

My current project is, essentially, a website migration. Not even a big or complex website. The project has been running for 6 months. I only joined recently so I can’t be sure but we’ve probably burned AU$2m. The schedule I’m managing has us doing the cutover in June. That is optimistic.

The architects are currently trying to work out how Azure [details] connect to Azure [details] while still [security] and being able to [complex integration].

Every day some new issue appears that the architects have to figure out how to work around.

No single person has a goddamned clue how the end-to-end thing fits together. Not one, not a clue. That scares me.

‘Architectural complexity’ has crept down to the level of infrastructure. It’s painful to see. I hope it stops soon but I am not hopeful.

jiggawatts · 2 years ago

We might be working on the same project... or the "same" project, if you know what I mean.

The level of abstraction that is commonplace these days is insane.

At $dayjob, there is a reverse proxy in front of a API gateway that is itself a load-balanced service. There are load-balancers behind it, pointing at a reverse proxy on a service fabric. Within that, some "architect" decided to add additional "mid tier" servers because that was the cool thing to do in the 1990s. Behind all of this is an Azure App Service, which is in turn a load-balanced server that includes a reverse proxy.

They want to move this to Kubernetes, so they can now have proxies seven layers deep, nested hypervisors running containers, and services ping-ponging across six zones (data centre locations) in two different clouds. That connectivity will go through a software virtual router platform, so it's not even direct point-to-point connectivity.

I pity the poor fool who will have to diagnose an operational issue in this madness.

No mere mortal will be able to make it go fast.

userbinator · 2 years ago

Someone is paying for all of that, and someone is getting paid.

Look no further beyond those getting paid to find the reason such monstrosities exist.

cutemonster · 2 years ago

How high is the load? (From what I'm seeing between the lines, it's not much?)

> I pity the poor fool who will have to diagnose an operational issue

Why? They're getting paid, have a job, it's great

Pity the business owners instead maybe. But they're clueless and happy too?

Hmm. I start thinking these things are the natural course of events. Techies adding more techies and complexity, maybe just like middle managers can hire more managers and bureaucrats and make not the tech, but the organization, more complex (and thus better, from their own personal perspective?)

redandblack · 2 years ago

well LLMs have proved themselves by stacking layers upon layers - maybe your architects have figured out the magic also (without interpretation/comprehension obviously).

Good luck as we all live in interesting times

throwawaaarrgh · 2 years ago

Automation moves complexity from one place to another. In this case the complexity moved into silos.

Before, all the pieces of the architecture were in a couple hundred *.cpp files in one directory hierarchy. Now it's millions of files across thousands of directories each turned into a service run by a dedicated team, and none of them can see each other.

You can become an expert in all those pieces. But it requires actually using them all together, to discover the parts that don't fit and how to work around them. This is why the modern ideal of completely independent APIs is a terrible design. There is literally no way to know if anything works with anything else until you try to run it in production. Monoliths are terrible at scaling, but easy to understand. SOA apps are great at scaling, but impossible to understand.

Good luck on your migration. Changing the wheels on a moving tractor trailer always sucks.

lifeisstillgood · 2 years ago

Isn't the point that there is a line somewhere that one app stops being one single app and becomes "eco system" - and you stop having deterministic understanding and start having "town planning" and "social expectations".

I mean my use of bad metaphors kind of underlines the point we don't really understand his problem - but the large organisation that builds these is itself an example of such impossible to understand interactions - maybe we will learn from Azure etc and take those learnings into taking our own orgs.

lifeisstillgood · 2 years ago

This is the essence - that a system needs to fit inside one persons head. All of it. There maybe a few people who understand all of a Boeing airplanes systems - there certainly were when they first flew 747s. Maybe a couple now.

but once it stops fitting inside one persons head then the thing is literally only possible to design by committee - it cannot fit together and perhaps should stop being called a single system

bob1029 · 2 years ago

> it cannot fit together and perhaps should stop being called a single system

The magic is the enterprise message bus. It is meme crap until you actually need one. When it works, it really works. This is the only thing I've ever seen realistically tie together certain industrial environments.

cortesoft · 2 years ago

There is no one who understands FULLY how a single computer works (at least not to the specific levels of detail on how to make every part from scratch... how to mine the heavy metals, how to chemically process them, etc) That bas been impossible for hundreds of years.

The important part is where you draw up the system boundaries

userbinator · 2 years ago

These days we use Azure, because Microsoft has fooled us in to thinking that it’s cheaper.

All the cloud providers do that. You could say that peoples' judgment has been clouded...

More seriously, this is how they operate; complexity is a good thing to them, because it means more lock-in and ways to charge you in non-obvious ways, and the whole "consulting" aspect of the industry feeds off that.

vsareto · 2 years ago

I wouldn't be surprised if most people don't learn it in-depth because:

A) you can google or ChatGPT things more easily these days

B) there's a truck load of services and ways to connect and do things. It's hard to learn all of these or pick a best option for everyone

C) SSO + identity models + API permissions generally made everything more complex

D) Many things in Azure are changing often, so you don't want to get too invested in a particular "version"

So things got more in-depth while expanding your available options (which are usually just as complex)

sgarland · 2 years ago

> you can google or ChatGPT things more easily these days

This should not be a valid excuse to not understand your job.

> there's a truck load of services and ways to connect and do things. It's hard to learn all of these or pick a best option for everyone

Agreed, and I think this is where people go wrong. Giving devs the ability/permission to pick whatever random tooling they want is not a good path for maintainability. "It's in AWS, so we can use it." I'd go so far as to argue that if you want to use Managed Service X, you need to first successfully launch it on an EC2 with no other help than the official docs; otherwise how can you possibly hope to understand it when it goes wrong?

> SSO + identity models + API permissions generally made everything more complex

Fully agree, IAM is a nightmare. But again, if it's your job, that's not really an excuse for anything other than higher pay.

jen729w · 2 years ago

I think people don’t learn it [all] in-depth because to do so is essentially impossible.

Back in the bank days all of understood Windows, multi-tier AD, DNS, DHCP, SMS* (now MECM via SCCM), that distributed file system whose name I forget, file & print, Exchange, and all the other things that had a plugin to mmc.exe — we could do ‘em all. Oh and a handful of networking because that was also much simpler. Oh and all of the client configuration.

You could give me a server and a day and I could stand you up a basic infrastructure. I could know it all.

(*Though the maniacal way that SMS 2.0 did its magic with text-based log files shuttled in from the endpoint always drove me bonkers.)

I'm always thinking about "Can I (or anyone) get back into this easily 6 months from now?"

In my situation, I probably will have to do that so there's a selfish reason there for sure.

I recently had a whole series of frustrating situations where I dug through rediscovering how old code / systems work to make small changes or to find out the small change was enormous. Really deflating stuff. It's not my fault but it can be so demoralizing. Feels like a weight on you... I was done for the day after both of those horror shows.

Then yesterday I had a 3 day project start and in 2 hours I ... did the thing. It was super flexible / powerful, handled errors gracefully, and easy to change / test. All because a year ago someone (well myself and anther person) took the time to simplify the original spaghetti code that originally existed and break it into more digestible functional-esque chunks. Dropping something "in between the chunks" (fancy technical terms here) was easy to do, test and read. Completely the opposite experience, it was energizing and fun.

r3trohack3r · 2 years ago

For my consulting, I primarily practice "reference first architectures."

The idea is we identify the rough shape of what we are going to build and the components needed to deliver it (Linux? Terraform? K8S? HTML/CSS/JS? etc.).

Next we measure up what we can "take for granted" for the engineering skillset the organization hires for. Then we pick books, official project documentation, etc. that will act as our "reference." We spend our upfront time pouring ourselves into this documentation and come away with a general "philosophy" of an approach to the architecture.

Then we build the architecture, updating our philosophy with our learnings along the way.

At the end of the project, we commit the philosophy to paper. We deliver the system we built, the philosophy behind it, and the stack of references used to build the system.

What this means is I can take any engineer at the target level they hire for, hand them the deliverable and say "go spend a week reading these, you'll come back with sufficient expertise to own this system."

It also acts as documentation for myself for future contracts if I get brought back in. Prior to starting the contract I can go back in and review all of those deliverables myself to hit the ground running once I'm back on the project.

amw-zero · 2 years ago

Sounds like an architecture decision record. Here's an example ADR template: https://github.com/joelparkerhenderson/architecture-decision....

jrvarela56 · 2 years ago

This sounds like the right way to do it. For me it has been tough to come up with principles that don't sound like they apply to any system. You start off with a generic CRUD app but as it grows the default/usual web framework constructs tend to leave you with a ball of mud. You can couple anything in there together and since you're pressed for time, you tend to do it. Abstractions feel premature and when they start emerging there's lack of conviction to push through with them and clean up the whole thing.

Do you have any starter resources to come up with principles for a system? Maybe something showing how certain principles lead you to implementation choices that would've been different under another philosophy.

Deleted Comment

bob1029 · 2 years ago

> All because a year ago someone (well myself and anther person) took the time

I've been saying for half a decade or longer:

"Going slower today means we can go faster tomorrow".

It took a long time for some of my team members to process this, but I believe they've all taken it to heart by now. The aggressive, rapid nature of a startup can make it very difficult to slow down enough to consider more boring, pedantic paths. Thinking carefully about this stuff can really suck today, but when its 3am on Saturday and production is down, it will all begin to make a lot of sense.

Having empathy for your future self/team is a superpower.

RangerScience · 2 years ago

“Slow is smooth, smooth is fast.”

cogman10 · 2 years ago

Yuup. Unfortunately, there's profit disincentives to this. Time to market for new features is a thing. Getting out features fast gets you kuddos from the suits. So you get a class of dev that spins out code like wickedly fast while at the same time leaving a mess for others to clean up.

It's hard to correct that sort of behavior (without being an actual manager that knows code and can spot bad architecture).

amw-zero · 2 years ago

Have you ever measured this alleged speedup when "tomorrow" comes?

Deleted Comment

latchkey · 2 years ago

> I'm always thinking about "Can I (or anyone) get back into this easily 6 months from now?"

As I age, my memory is getting worse and worse and I realize that quite clearly. Therefore, I always try to write documentation as I'm writing code, so that I can remember why I did something. It helps a lot so that 6 months later, I can do exactly that... but I also know that anyone else looking at my stuff will also realize why things are the way they are.

duxup · 2 years ago

I’m the same way, notes, good documentation, etc.

Sometimes I think I get some tasks done faster than when I was younger…

temporallobe · 2 years ago

> I'm always thinking about "Can I (or anyone) get back into this easily 6 months from now?"

People I work with get very annoyed with me because of this, but I am obsessive about documentation for this reason. Sure, it requires a lot of tedious writing and screenshots, etc., but it has saved me countless times. I still can easily get back into things years later thanks to documentation.

The caveat is that when people who are not as passionate as you maintain a product and seemingly forget about documentation.

In the old days, documentation was a very strict requirement on many of projects I was involved with. Now, in modern agile projects, it’s an afterthought at best, despite having amazing documentation tools that we’ve never had before.

amw-zero · 2 years ago

What ways have you found for keeping the documentation in sync across frequent changes?

bosie · 2 years ago

Do you mind expanding on which tools you are using for documentation (creating, maintaining etc) please?

quickthrower2 · 2 years ago

Like real clouds, the cloud wont look the same in 6 months. I got a stream of daily emails from Azure, end of life this, upgrade that, secure this etc. Those bits will rot unfortunately.

There might be sense in renting that machine from Herzner and sticking Ubuntu on it after all.

iancmceachern · 2 years ago

"we found that differences in architectural complexity could account for 50% drops in productivity, three-fold increases in defect density, and order-of-magnitude increases in staff turnover."

I think I can speak for many of us technical professionals when I say, been there, done that.

jkubicek · 2 years ago

This is exactly why I think the "myth of the 10x engineer" is so obviously false.

An average software engineer will create an overly complex system.

If a skilled engineer can come in and create something that doubles team productivy, decreases bugs by 60% and improves retention across the org by 10x, that's huge! That's not a 10x engineer, that's a 100x engineer.

nerpderp82 · 2 years ago

Engineers that make simple systems are not rewarded is most organizations. In a ten person company, sure thing! In a large corporate org, no way.

niels_bom · 2 years ago

Reminds me of “I apologize for such a long letter - I didn't have time to write a short one.”― Mark Twain

Exactly this. I've done well for myself by focusing on building things that work, and shipping them on time. Anything else is extra and not to be done at the expense of the first two things.

contemplatter · 2 years ago

That's not a 10x engineer, that's a 100x engineer.

You're speaking of ChatGPT!

pphysch · 2 years ago

Yep, working with well-architected software systems is the difference between a low-stress well-paying job and rapid burnout. No surprise that the latter results in massive turnover.

It's remarkable how "add feature X" can be a 1 hour task or a 1 month task depending on whether the system was designed to evolve and scale in that direction. But the non-technical management or customer just sees it as a simple request, and the (Nth) software developer is left to pick up the pieces.

kps · 2 years ago

> order-of-magnitude increases in staff turnover.

I suspect this goes both ways. I have worked at a place characterized by frequent internal shuffles, and more than once spent ages reverse-engineering code when the original team could have handled the problem in a fraction of the time.

kerneltime · 2 years ago

MIT has an amazing program for System Design and Management. Dan and I are both graduates of this program. Some of the courses I would recommend include System Architecture, System Safety, and System Dynamics. Most of the content is available on OCW.

What is taught is not software-specific, but is entirely applicable to software, outside of the world of 'throw everything on the wall and see what sticks' as long as the venture capitalist can be shown growth at all costs. I wish more software developers were mindful of complexity and architecture.

What the author is up to now https://www.silverthreadinc.com

waitingforset · 2 years ago

Sounds super interesting. Can you link to the specific OCW courses? I see a few from different departments.

ESD.34 ESD.342 16.863J 15.871 15.783J ESD.33 to name a few Also, 15.965 my favorite and offered by my thesis advisor Michael A. M. Davies: Based on which it is likely that OpenAI won't be walking away with the cake.

agomez314 · 2 years ago

Could you elaborate a bit more on what you got out of the program? I never heard about it before but i'm intrigued. Did you find a particular course memorable?

The tag line within SDM was that it is a program for those who want to lead engineering and not leave engineering (MBA) I think the meta framework for thinking and being able to step away from the madness of releasing a v1 product and having tools for thinking about the bigger picture. Also, MIT.. it is a very rewarding ecosystem to be in.

neals · 2 years ago

I've gotta say, this study really hits home for me. As a developer who's worked on a few projects with some gnarly architectural complexity, I can totally see how that would lead to these kinds of costs. Just last year, I was part of a team that had to deal with a super convoluted codebase that felt like a patchwork of different styles and approaches, with no clear hierarchy or modularity. It was an absolute nightmare to navigate and make changes to, and our productivity took a nosedive. Not to mention the bugs that kept creeping in and the insane amount of time we spent debugging. I even saw a few of my colleagues jump ship because of the frustration. I wish management would've realized the potential benefits of investing in some proper refactoring efforts to improve the architecture. It might've saved them a lot of money (and headaches) in the long run!

tunesmith · 2 years ago

Two things that get in the way of "long run" thinking.

What's the timeframe that teams should adopt? If it will pay for itself in three years, is that too long? What's a good argument to make here that will appeal to the bean counters that are used to thinking in terms of quarterlies? I personally am comfortable with "eventually/infinite" but that is a tough sell.

Capitalization vs Expense. Maintenance/refactoring is not capitalizable, and thus gets discouraged in businesses that care about P&L. New (capitalizable) projects/features are encouraged instead. What are some good ways to encourage maintenance/refactoring in this kind of environment?

treis · 2 years ago

Stuff like this contributes massively to a culture of low expectations too. When it becomes accepted that simple things take a long time to do the normal reward system goes flying out the window. Both internally and organizationally.

Internally because the sense of accomplishment from seeing the new bit of functionality is tiny compared to the effort it took to get done. Organizationally it becomes harder to reward productivity because with no sense of how long something should take there's no way of knowing if an engineer is fast or not.

It's a productivity death spiral. Engineers slack off because there is no internal or external reward for doing good work. That slacking off slows development and then that velocity gets accepted as normal. Engineers continue to slack off against the new "normal" and establish an even slower normal to slack off against. That's how hours long tasks turn to days and then weeks and then months until the project dies.

holoduke · 2 years ago

I still dont know what is better. Create quick throw away code you replace in a few years. Or build a well structured system that lasts longer. I have worked all my carreer in only the first 0-8 years of company startups. Most of the joy and success I got from 'quick hacking together working systems'. A lot of people arround me dont share that opinion and are better suited at structured companies. Maybe there is no better.

smith7018 · 2 years ago

I've been a software engineer now for 12 years and I think there's an obvious answer: architecture. If you build software with a solid architecture from it's inception, then "hackily building features" will happen much faster and cleaner than a mess of a codebase. I wish more people realized that "build things fast now without caring about quality" translates to years of "God, I wish we could just throw all this out" and "no one knows how this part of the code base works but it does so we don't touch it." Plus, realistically speaking, adding that solid foundation shouldn't take _that_ much longer if you know what you're doing.

dimal · 2 years ago

But you don't know the architecture you really need when you first start.

I think the key is that you accept that this is true, and that having a good architecture is a continuous process that never ends. You build the best architecture you can imagine, given the current level of knowledge. Then you have to be willing to refactor every single day after that, as you gain new knowledge. Too many people think that refactoring is only for special occasions, when things have gotten really bad. Every single PR can contain a small refactor. Small changes can accumulate and eventually lead to major architectural shifts.

majgr · 2 years ago

Every messy project starts from ‘architecture’. At the beginning it is clean, and super well organized. Then, instead of, these well thought out before extensions, some breaking feature needs to be implemented. And it is implemented in the most aligned to starting ‘architecture’ fashion. After couple of years of these implementations a mess is created.

It is better to have simple solution at start, then, after each breaking feature, whole thing needs to be refactored, because initial assumptions about that software might have been changed.

dasil003 · 2 years ago

This assumes you know where you are going. All code is a liability and all architecture decisions are tradeoffs.

jasondigitized · 2 years ago

When you have a savvy competitor that is taking the opposite tack and killing you in feature bake-off, you don't have this luxury. Your Ferrari is being built in the garage while paying customers are driving around in your competitor's Yugo.

hbrn · 2 years ago

No plan survives first contact with reality: "well structured systems" usually get thrown away in the same few years. The only difference is that they often get thrown away along with the company that built them.

Quick throwaway code is orders of magnitude better (unless you're landing airplanes or building x-ray devices). Especially if you're consciously treating it as a throwaway.

The only way to build well structured system that lasts long is to have vast domain expertise, meaning having done literally the same thing multiple times in the past. This rarely happens in general, and pretty much never happens if you're innovating.

I've never seen an innovation that wasn't the same as some other piece of code with different variable names. It's like the old joke about how every piece of software, given enough time, gains the ability to send email.

> Or build a well structured system that lasts longer.

What if you had the ability to build a system that was initially structured well enough such that it could be made to last indefinitely? From a cost/benefit standpoint, is the Ship of Theseus approach not the most ideal for a business owner?

Even for a developer, the notion that you have to constantly "trojan horse" your new product iterations into production via the existing paths means you will achieve mastery over clever workarounds and other temporary shims. Once you gain competence in this area, it is possible that you will never want for a new boat again. You start to get attached to the old one and all of the things it takes care of that you absolutely forgot about years ago.

WJW · 2 years ago

> if you had the ability to build a system that was initially structured well enough such that it could be made to last indefinitely

If you could have that, it would obviously be amazing. Do you (or anyone) have that ability though? So far it seems the answer is no. People are just not smart enough to predict the future.

As a more nuanced answer, a system that is scalable enough to grow to 1000x the current load is usually way too expensive to build with the resources you have "right now". The best you can do is build a system for 10-100x the current load and hope you haven't forgotten any edge cases, but usually you encounter them way sooner than 10x the load. Building so that you can easily refactor your current system is the way to go, but even then you will sooner or later run into problems your original design did not consider.

nhoughto · 2 years ago

In startup/mvp land it is a genuine tension between shipping it and over engineering at different extremes. It is quite possible to correctly think “this is bad engineering” and still ship it and all of those decisions to be correct. Bootstrapping and early stage code almost inevitably gets replaced so isn’t worth polishing too much. It feels totally wrong and requires some real soul searching for some engineering personalities but in the end it’s optimizing for the most important outcome, the actual business success.. speaking from exp of not doing a few times and then the whole thing failing..

codr7 · 2 years ago

Speed is the main reason I do architecture.

In the end it mostly comes down to reducing complexity, but the goal is always allowing new features to be added as fast/easy as possible.

Because I'm lazy.

preseinger · 2 years ago

it is basically never the case that the time you spend typing code into your editor is the bottleneck that impacts how quickly you deliver a system or feature

quick hacking together of prototype systems delivers a dopamine high that is quickly reduced to zero when those systems need to be maintained and extended into an indefinite future

building systems that are well structured and not terrible requires knowledge and experience, but absolutely 100% does not take more time than building shitty hackathon prototypes

discreteevent · 2 years ago

It's a balancing act:

"There is no theoretical reason that anything is hard to change about software. If you pick any one aspect of software then you can make it easy to change, but we don’t know how to make everything easy to change. Making something easy to change makes the overall system a little more complex, and making everything easy to change makes the entire system very complex. Complexity is what makes software hard to change."

https://martinfowler.com/ieeeSoftware/whoNeedsArchitect.pdf

eternalban · 2 years ago

Worth pointing out that this study does -not- equate "architectural complexity" with abstraction. Many consider use of "hierarchies, modules, abstraction layers" to be 'unnecessary complexity' where as the thesis clearly states they "play an important role in controlling complexity." OP is not a call to get rid of software architects, arguably it says ~'hire competent architects as bad architecture negatively impacts faults, staff turnover, and product success.'

"Architecture controls complexity", and under-/poorly- designed architecture while superficially "simple" will give rise to un-intended complexity. Microservices are the du jour poster child here.

ParetoOptimal · 2 years ago

Not a direct response, but some thoughts that come to mind without any direct conclusion:

Good abstractions make simpler software. Leaky abstractions multiply the complexity of software by a lot.

Some respond to this by making "simple" software that dispenses entirely with abstraction. This ends up in a lack of architecture where complexity still multiplies, though perhaps less than the typical mix of mostly leaky abstractions and a few sound ones.

However, it's kind of the nihlism of software and throws away the opportunity for us to actually improve our craft... so I'm not all too interested in it.

People regularly forget Rule Zero of patterns: Don't implement a pattern if it doesn't solve a problem. That's the difference between unnecessary complexity and controlling complexity.

farrugp · 2 years ago

Can't wait to read this, really resonates with me as I'm dealing with this right now at a big FAANG company.

In my experience the problem is that as with all things it's all about balance. We shouldn't throw away architecture entirely and write stupidly simple, quick solutions because they will be messy. But we also shouldn't over-abstract things so much that only the person/people who built the system can understand it and work with it. Both could have dire consequences for an organization, making building new features and delivering value to users slow, difficult and costly.

Once I finish reading this MIT paper, I want to dig further into exactly what makes software 'complex'. In my experience:

- too many layers of indirection - overly theoretical concepts not grounded in real world thinking - lack of documentation - bad naming - lack of testability - tight de-coupling - following 'best practices and patterns' without clear justification - trying to solve for problems before they exist

We should be building systems that are grounded in concepts that are easily understandable - which is exactly why Object Oriented Programming has been so successful. We write programming languages as a means of communicating with each other about program logic, why not do it in terms that we as users already understand in the real world?