“Who Should Write the Terraform?”

cube2222 · 3 years ago

Software Engineer at Spacelift[0] here - a CI/CD specialized for Infra as Code (including Terraform).

A pattern we're seeing more and more often is Platform Engineering teams doing the bulk of the work, including all the fundamentals, guidelines, safety railing, and conventions, while Software Engineers only use those, or write their own simple service-specific Terraform Stacks, which nonetheless make extensive use of modules developed by the former.

This does also seem like the sweet spot to me, where most of the Terraform code (and especially the advanced Terraform bits) is handled by a team that's specialized for it. If you don't have a Platform Engineering team, or one that is playing its role (even if it's called DevOps or Ops or SRE), in even a medium-sized company, you'll probably start having as many approaches to your infrastructure as there are teams, complexity will explode, and implementation/verification of compliance requirements will be a chore. Just a few people responsible for handling this will yield huge benefits.

And yes, I can wholeheartedly recommend Spacelift if you're trying to scale Terraform usage across people and teams - and not just because I work there.

Disclaimer: Opinions are my own.

[0]: https://spacelift.io

steveBK123 · 3 years ago

I think a platform team taking ownership is the correct model, but the early product teams need to have "embeds".

The platform team owning base terraform functionality works well for the product teams that are the 3rd or 4th user of said functionality.

For the early days of the platform, and the early users.. your product is constantly in dependency & priority battles with said platform team. This is where "embeds" help continually unblock while making sure the work is done in a platform centric manner that will be reusable for other product teams.

Simply saying the product teams need to go down into the weeds at this level just puts too much disparate responsibility on product teams who exist to deliver a single product. Similarly it encourages vastly different approaches to similar problems, with all the wasted duplicate & re-work.

lhorie · 3 years ago

I tend to think of embeds as being very similar to the open source contribution model: you want some sort of BDFL entity that drives the overall direction of the platform, but also some sense of community/collaboration where individuals can feel empowered to contribute features to scratch their own itches, or bring up discussions, etc.

Having a team owning the platform doesn't necessarily need to mean shutting yourself in a cave. Granted, promoting cross-functional collaboration is a challenge in and of itself, but similar to OSS, projects that invest in the community aspect are the ones that eventually gain critical mass and set themselves apart from the rest.

danwee · 3 years ago

Having a single "platform" team per company is a bottleneck as soon as the number of product teams is greater than N.

> ...you'll probably start having as many approaches to your infrastructure as there are teams, complexity will explode, and implementation/verification of compliance requirements will be a chore. Just a few people responsible for handling this will yield huge benefits.

Agree with the centralization of "how infrastructure should be managed/defined". A "platform" team composed of M platform engineers (where each platform engineer works 80% of their time for a given product team) can handle such centralization.

PragmaticPulp · 3 years ago

> Having a single "platform" team per company is a bottleneck as soon as the number of product teams is greater than N.

This is my experience as well. Having a single platform team has been a great experience for laying foundations, establishing shared architectures, and centralizing documentation.

As soon as two or more teams need something from the platform team, it becomes a battle of priorities. A good platform team will recognize this and work on a division of labor and coordination strategy that can start to scale. A bad platform team will treat this as an opportunity to claim the company’s wins for themselves and leverage their bottleneck position for political gain.

The company’s management of the platform team is key. I’ve also seen a single platform team abused as the engineers who are expected to own all the hard work while other teams get to walk all over them with demands. This results in a lot of employee turnover, which is the opposite of what you want on a team tasked with holding the core knowledge of the company’s infrastructure.

james_s_tayler · 3 years ago

You can have more than one platform team.

I think reality is more complicated than a one size fits all approach. It's going to be specific to your org, your project, the stage it's at etc. To add to that, the right thing to do is often in flux.

Dedicated capacity is necessary, as is embedding. Not always at the same time or in that order. That's where only the information found inside the walls of your organisation can help you decide what is necessary to solve your problem.

codeduck · 3 years ago

It also creates the unrealistic expectation that one size fits all. An architecture that works well for stateless microservices fails spectacularly when faced with monolithic session-bound legacy telecoms services.

Yet so many people insist that the one is the same as the other, when one is a duck and the other is an elephant wearing two swimming fins on its face.

bigiain · 3 years ago

> Platform Engineering teams doing the bulk of the work, including all the fundamentals, guidelines, safety railing, and conventions, while Software Engineers only use those

So, sysadmins and programmers - but with new 2020s vintage titles? (and remuneration...)

NateEag · 3 years ago

Basically, yeah, but with the difference that these sysadmins are generalizing and abstracting the patterns they've learned over years.

I personally think of "devops" not quite so much as being about "dev" and "ops" collaborating (though that is a noble and worthwhile goal) as about having "developer-operators", people who know how to do operations effectively and who can turn that knowledge into automated, generalized software systems.

The abstract modules and tools can live in their own repositories (or folders, in a monorepo), and your devoperators can work closely with the product teams to use them (and abstract specific changes to meet projects' individual needs to be more generally applicable).

mountainriver · 3 years ago

Seen this at a couple companies and it doesn’t work well. The platform team becomes a bottleneck and the devs don’t want to have to deal with or learn the mess that is terraform.

It’s time for the ecosystem to move beyond the half baked config language known as HCL

james_s_tayler · 3 years ago

Pulumi

Too · 3 years ago

What do you suggest instead?

Spivak · 3 years ago

I think this would be easier to adopt if it could be plopped into an existing agnostic CI system. We built something like this in-house on top of Gitlab CI and it works really well for us. Locking isn't as much of an issue as you make it seem in the pitch, we just have our infra containers wait to acquire and renew a distributed lease while they're running. Some kinds of failures just release the lock and others panic and stop the world for human intervention.

Presumably your core competency isn't building CI systems or job runners so why bother? I'm sure at the core of your own infra it's job agnostic. The value-add is the management plane on top of it.
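
To make the lease idea above concrete, here is a minimal Python sketch, not the in-house system described in this comment. Everything in it is an assumption for illustration: `LeaseStore` is an in-memory stand-in for whatever shared store would hold the lease, and `RecoverableError`, `FatalError`, and `run_with_lease` are hypothetical names. A real distributed lease would need atomic compare-and-set in the backing store.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Lease:
    holder: str
    expires_at: float


class LeaseStore:
    """In-memory stand-in for a shared lease store (hypothetical)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.lease: Optional[Lease] = None
        self.halted = False  # "stop the world" flag, cleared only by a human

    def acquire(self, holder: str) -> bool:
        """Take the lease if it is free, expired, or already ours."""
        if self.halted:
            return False
        now = self.clock()
        free = self.lease is None or self.lease.expires_at <= now
        if free or self.lease.holder == holder:
            self.lease = Lease(holder, now + self.ttl)
            return True
        return False

    def renew(self, holder: str) -> bool:
        """Extend the lease, but only while we still hold an unexpired one."""
        now = self.clock()
        if self.lease and self.lease.holder == holder and self.lease.expires_at > now:
            self.lease.expires_at = now + self.ttl
            return True
        return False

    def release(self, holder: str) -> None:
        if self.lease and self.lease.holder == holder:
            self.lease = None

    def halt(self) -> None:
        self.halted = True


class RecoverableError(Exception):
    """Failure after which it is safe to let the next run proceed."""


class FatalError(Exception):
    """Failure that needs a human before anything else runs."""


def run_with_lease(store: LeaseStore, holder: str, job: Callable) -> str:
    """Run job(renew) under the lease; job should call renew() periodically."""
    if not store.acquire(holder):
        return "busy"
    try:
        job(lambda: store.renew(holder))
    except RecoverableError:
        store.release(holder)
        return "failed-released"
    except FatalError:
        store.halt()  # block every later acquire until someone intervenes
        return "halted"
    store.release(holder)
    return "ok"
```

The two exception types mirror the two failure modes mentioned above: one releases the lock so the next run can proceed, the other blocks all further runs.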

cube2222 · 3 years ago

The semantics of standard CI/CD providers are in practice very ill-suited to more advanced Infrastructure-as-Code use cases (triggering other stacks based on changes, multi-repo workflows, etc.), so building on top of them would add a lot of complexity. I don't want to go too much into it.

Overall, if your setup works for you and you're happy with it - keep using it!

We've seen a lot of companies (many now our customers :) ) try to build their own on top of existing systems (GitHub, Gitlab, Jenkins, etc.) and waste a ton of time and engineering resources, while ultimately not achieving anything that works well.

What Spacelift does is it gives you a bunch of much better-suited building blocks which let you build your required workflow very quickly.

And it obviously does integrate very deeply with your VCS provider - Commits, Pull Requests, Comments, etc. - everything is supported and customizable using - amongst others - Push Policies[0].

[0]: https://docs.spacelift.io/concepts/policy/git-push-policy

zwischenzug · 3 years ago

Author here: yup

Pendulum back to the center

cube2222 · 3 years ago

Partly yes, but not fully.

The idea is not to go back to the Software Engineer asking the Ops team "Hey, can you provision a Postgres database for me please?" and then waiting a week for it.

It's that the Software Engineer takes a module that was prepared by the Platform team - i.e. "terraform-postgres-mycompany" - which already includes all the requirements the company has for handling databases (think backups, monitoring, encryption, etc.). They can then proceed to use it in the small service-specific Terraform configuration, which really is just putting such ready-made modules together.

The important bit being - the Platform team isn't a bottleneck here.
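
A rough sketch of that module pattern, written in Python rather than Terraform purely for illustration. The names (`PostgresRequest`, `postgres_module`, `COMPANY_REQUIREMENTS`) and the specific values are hypothetical, not what any real "terraform-postgres-mycompany" module contains. The point is only that the service team supplies a few service-specific inputs while the company requirements are baked in and cannot be switched off by the caller.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PostgresRequest:
    """The only knobs a service team is expected to turn (hypothetical)."""
    service: str
    storage_gb: int = 20
    instance_size: str = "small"


# Requirements owned by the platform team; values are made up for the sketch.
COMPANY_REQUIREMENTS = {
    "backups_enabled": True,
    "backup_retention_days": 14,
    "monitoring_enabled": True,
    "encrypted_at_rest": True,
}


def postgres_module(request: PostgresRequest) -> dict:
    """Combine service-specific inputs with the non-negotiable defaults."""
    if request.storage_gb <= 0:
        raise ValueError("storage_gb must be positive")
    config = asdict(request)
    # Applied last, so company requirements always win over caller input.
    config.update(COMPANY_REQUIREMENTS)
    config["name"] = f"{request.service}-postgres"
    return config


# A service-specific "stack" is then just a handful of such calls.
billing_db = postgres_module(PostgresRequest(service="billing", storage_gb=50))
```

The service team never asks anyone for a database here; it calls the module, and the backups, monitoring, and encryption settings come along for free.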

historynops · 3 years ago

Thank you for the plug. Surprised such an open commercial promotion got top comment.

solatic · 3 years ago

ITT people arguing for embedding infrastructure engineers into product teams.

Ayyyy, dios mio.

a) If you need to embed, then actually, you need to embed InfoSec, UX, IT, Customer Success, Product, Compliance, etc. etc. for exactly the same reasons. In today's labor-constrained economy, good luck finding qualified people for every role on every team! And if one of them leaves, who ensured that they documented everything for the next guy? Or that you'll find someone to fill the role quickly? If you have a 30 person company, fine, no big deal. 150+ and it starts to become a serious problem.

b) Particularly for infrastructure, you will shoot yourself in the foot on your production cloud bill. If you share no infrastructure with other teams, then you will find no shared efficiency in sharing the same infrastructure. Conway's Law will burn your runway. If you're 100% serverless then this doesn't really apply, but if you're spinning up eight different Kubernetes clusters for eight different teams then you probably need to collaborate a bit better.

Product teams need to own their product top to bottom. Platform teams need to make that easy for them, because modern stacks are huge, it's not possible to staff a single team with all the necessary experts, and all that expertise is a genuine necessity. The lines are drawn in different places in different companies depending on available labor and technical requirements.

lmm · 3 years ago

> If you need to embed, then actually, you need to embed InfoSec, UX, IT, Customer Success, Product, Compliance, etc. etc. for exactly the same reasons. In today's labor-constrained economy, good luck finding qualified people for every role on every team!

If those people are part of your core value proposition, the thing that's supposed to give you your competitive advantage, then yes (though if you need all of them, you probably don't have a very good value proposition). If not, if they're just a cost center doing commodity-level work, then they don't need to be part of the product team - but in that case you should be looking to minimize or outsource them.

> Product teams need to own their product top to bottom. Platform teams need to make that easy for them, because modern stacks are huge, it's not possible to staff a single team with all the necessary experts, and all that expertise is a genuine necessity. The lines are drawn in different places in different companies depending on available labor and technical requirements.

If the "platform team" are doing something so independent from the products that they don't need to be part of the same team, why are they in-house at all? If you're offering a generic platform, either you're doing it better than Amazon and should be in the business of competing with them, or (more likely) you're doing it worse than Amazon and should just use Amazon.

solatic · 3 years ago

> why are they in-house at all?

Someone needs to answer to Compliance, to InfoSec, to Finance. Someone needs to make sure that they all understand exactly what production looks like in their language. Compliance wants to know whether we keep EU data in the EU. InfoSec wants to know whether all our code in production passed security review. Finance wants to prevent costs from spiraling out of control and to judge which projects to fund.

Good luck trying to get AWS's "platform" to do any of that as a managed service, out of the box and without any in-house engineering time!

roguas · 3 years ago

> but if you're spinning up eight different Kubernetes clusters for eight different teams then you probably need to collaborate a bit better

Why? There can be a reasonable scenario for that - say 8 reasonably separated projects run by 100 people?

Also I do not see how being serverless "doesn't apply". It does apply because a lot of your infra is security, especially company-wide security configuration.

I understand the deeper meaning of the message, but at the same time devops is a thing because it likely hurt more than other cases mentioned. But I think the whole thing is often a balance between integrated / standalone.

Every team and project requires breathing room but also requires a certain level of integration. Devops was needed and is proceeding - find an engineer who has no docker experience today, compared to the past where often engineers had 0 idea of delivery. Other groups may raise their own requests if they feel the need, but they will lose some flexibility from being standalone.

readlikeasloth · 3 years ago

> If you're 100% serverless then this doesn't really apply, but if you're spinning up eight different Kubernetes clusters for eight different teams then you probably need to collaborate a bit better.

This is exactly the situation I'm currently in. Company decided to migrate from big on-prem kubernetes to AWS. Now every team got their own account and well... good luck, you're on your own now. We're a small team of three developers. Although we have three certifications under our belts (AWS Dev, CKA, CKAD) it took us almost three months to configure AWS and set up the Terraform pipeline and define processes like "upgrading cluster". The "enabling" part was basically missing in the whole cloud strategy of the company. It was more like: good luck, you're on your own now.

In fact we made contact with a neighboring team. Only to find out that their use case was so different from ours that collaboration didn't make any sense. For them Kubernetes was not a good fit, for us it was the way to go.

Speaking of sharing a cluster or AWS resources: we figured out that it is not allowed due to billing reasons. Company policy is: One product per AWS account.

If you ask me I see a shift of paradigms happening here. Now you hear a lot about "enabling teams" instead of a dedicated team for infrastructure providing services (e.g. the Kubernetes podcast from Google). I'm not convinced yet. I think this is more like kicking responsibility down the chain. And then it feels more like: Someone needs to do the dirty work but nobody wants to do it.

It might work if you don't have to provide Service-level agreements (in our case: we don't). For us it is just more work to do. And our work shifts from dev to ops. Instead of writing software we're mostly busy with configuring cloud resources. This will ease a bit once everything is running. However: I see this whole change more as ... uh, strong word... ideologically motivated. Cui bono? Neither our team, nor our users nor our infrastructure bill.

solatic · 3 years ago

> it is not allowed due to billing reasons. Company policy is: One product per AWS account.

That's kinda funny because half the reason why AWS has tags in the first place is to get finer granularity into understanding billing. Not to mention products like Kubecost. Sounds like whoever wrote the policy doesn't understand how AWS works.
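
As a toy illustration of tag-based cost attribution, here is a small Python sketch. The record shape (`cost` plus a `tags` dict) and the `product` tag key are assumptions made up for the example, not the format of any actual AWS billing export. It only shows that several products can share one account and still be billed separately if resources are tagged.

```python
from collections import defaultdict


def cost_by_tag(line_items, tag_key, untagged="(untagged)"):
    """Sum the cost of each line item under the value of the given tag."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, untagged)
        totals[owner] += item["cost"]
    return dict(totals)


# Hypothetical line items from one shared account.
shared_account = [
    {"resource": "eks-cluster", "cost": 10.0, "tags": {"product": "checkout"}},
    {"resource": "rds-main", "cost": 4.0, "tags": {"product": "search"}},
    {"resource": "s3-assets", "cost": 2.5, "tags": {"product": "checkout"}},
    {"resource": "nat-gateway", "cost": 1.5},  # nobody tagged this one
]

per_product = cost_by_tag(shared_account, "product")
```

The untagged bucket is the part that needs discipline: shared costs only get attributed if someone enforces tagging.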

> Someone needs to do the dirty work but nobody wants to do it.

There are plenty of people willing to do the dirty work, they're just already working for other companies and their salaries are quite high. The labor market is tight.

> Cui bono? Neither our team, nor our users nor our infrastructure bill.

HR benefits. Having open positions that HR is failing to fill is a bad look for HR.

hinkley · 3 years ago

It’s tough getting lectured by people who aren’t following the same level of discipline standards that you are. Infrastructure code usually looks more brittle than production code, because they aren’t specialists in high quality general purpose code. Not as bad as QA code, but not great.

I think the instinct is that if you’re going to take the moral high ground, you’d better walk up the hill and join us first. And the simplest way to do that seems to be the obvious one, which is to combine them under the same org chart and governance.

chii · 3 years ago

> you need to embed InfoSec, UX, IT, Customer Success, Product, Compliance, etc. etc. for exactly the same reasons

aka, a full-stack engineer! The idea that you have some specialist take care of each role in a team is just fantasy.

Get a smart person, and train them full-stack. That includes customer success (aka, sales and after-sales support) and compliance (I mean, GDPR is required understanding now, so it might as well be the engineer who knows it).

theamk · 3 years ago

Re this segment:

> "There were endless complaints about the time taken to get ‘central IT’ to do their bidding, and frequent demands for more autonomy and freedom. A cloud project was initiated by the centralised DBA team to enable that autonomy. [...] Cue howls of despair from the development teams that they need a centralised DBA service"

Author makes it sound like users didn't know what they wanted. This is not true -- I have seen this play out in practice, and what author omits is _it was a different set of people_ who were complaining before and after.

If a dev team has at least 2 engineers who are happy working with infrastructure, then the team will benefit from autonomy. If there is no one like that on the dev team, they will cry in despair.

aranelsurion · 3 years ago

> _it was a different set of people_ who were complaining before and after.

That's something I've observed as well. Seems to me there are (at least) two developer personas, one kind only wants to deliver their task, specialize in what they do well, and generally can't care less if their DB is oversized or has no maintenance windows set or no recovery plans or who has access to it etc. They usually lack the cloud/platform skills as well, and won't develop much in that regard because they don't care to. Even if they did, they're unlikely to get rewarded much for that effort. They are easy to make happy, and you rarely hear from them other than the occasional "thanks" in some Slack channel.

The other kind is either internally very curious about the subject, or already has the experience, or at least they think they have it. They want to have full access, invent things anew in the "right way" they believe. For them there is nothing worse than relying on another team while they could geek out on the subject themselves and they believe they could do it better. Sometimes they're right about it, and other times they're either oversimplifying the work needed, or optimizing locally around themselves/their team/their task. There seems to be no way to make them really happy other than giving them complete freedom to do whatever.

Seems to me most developers (I've worked with) are of the first kind, and they can be made happy after some level of maturity is reached within the company, but the second kind is way more vocal and they won't ever be happy with whatever a central team builds.

More and more I'm getting convinced that the only way to really win both personas is to build two products instead of one. So you build the golden path, the Helm chart or the portal or whatever for the first kind, and give ownership and loosely govern with compliance/policy tooling with the second kind.

This optimizes for the short/mid-term satisfaction, but of course it can also go wrong since team compositions are not set in stone and what one builds may not be maintained properly by the other, and there'll be some duplication of efforts and quality of solutions built might vary between the teams. I guess for some companies this is acceptable, and for others it won't ever be.

tl;dr ¯\_(ツ)_/¯

theamk · 3 years ago

For the second group, I don't think it is about inventing things anew.. most of the time, it's just an efficient way to get things done.

A lot of time the centralized ops team is very slow or just not very good. Your tickets may take weeks to be processed, or critical requirements are ignored, or maybe central ops team only cares about closing tickets and does minimal possible work to satisfy the letter of the requirement.

If there is no one on your team who can do better.. well you suffer and work with central team. But if your team has someone who can do this and the autonomy to proceed, then you can work much faster -- no need to wait weeks to allow the other team to access your data, you can grant the permission yourself in under a day.

physicsguy · 3 years ago

I’d probably fall into the latter category of “complainers”, but I actually care very little about doing infrastructure work even though I’m interested in it, I just want a quick turn around on simple requests.

My current place is just awful. Over complicated architecture born from a platform team that couldn’t be less helpful, so people have worked around it with all sorts of hacks.

Aeolun · 3 years ago

> For them there is nothing worse than relying on another team while they could geek out on the subject themselves and they believe they could do it better.

It may not be optimal, but it’s almost invariably faster.

Personally I like the way my company does it, where people have (more or less) full access to the AWS account, but there’s a lot of automated guardrails/scanning that alerts you when you’ve done something stupid (public S3 buckets etc.)
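
A minimal sketch of what such a guardrail scan could look like, in Python. It works on plain dicts rather than calling any cloud API, and the field names (`public_read`, `public_write`, `encrypted`) and rule names are hypothetical. It is not a description of the tooling this comment refers to.

```python
def public_bucket_rule(resource):
    """Flag storage buckets that anyone can read or write."""
    if resource.get("type") != "bucket":
        return None
    if resource.get("public_read") or resource.get("public_write"):
        return f"{resource['name']}: bucket is publicly accessible"
    return None


def unencrypted_rule(resource):
    """Flag resources that explicitly report encryption as disabled."""
    if resource.get("encrypted") is False:
        return f"{resource['name']}: encryption is disabled"
    return None


RULES = [public_bucket_rule, unencrypted_rule]


def scan(resources, rules=RULES):
    """Run every rule against every resource and collect the alerts."""
    alerts = []
    for resource in resources:
        for rule in rules:
            message = rule(resource)
            if message is not None:
                alerts.append(message)
    return alerts


# Hypothetical inventory of one team's account.
inventory = [
    {"type": "bucket", "name": "team-logs", "public_read": False, "encrypted": True},
    {"type": "bucket", "name": "tmp-share", "public_read": True, "encrypted": False},
    {"type": "database", "name": "orders-db", "encrypted": True},
]

alerts = scan(inventory)
```

The scan alerts rather than blocks, which matches the approach above: people keep their access and get told when something looks wrong.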

ragona · 3 years ago

I've started to believe that product engineers should manage their own infrastructure. I think the key ingredient is _isolation_, so that it's not that they have to figure out how to fit their service into the unholy single production account with 15,000 running instances, it's that they get to start fresh from a basic template and then move from there. Most services, when isolated from the other microservices, are just not _that_ complicated.

For what it's worth this is how AWS operates, and I think it's the mindset with which they build products. You certainly _can_ go your own way and run something like k8s on top of it and build a mini-cloud in the cloud, but it's incredibly expensive.

It's a mistake I've made repeatedly -- "Oh, I'll just add this little abstraction to make it easier for developers!" But now the poor developer has to understand both the tools I built on top of _and_ whatever I was thinking at the time, and inevitably it's an under-resourced area.

Now, at a certain scale for core services, sure, you'll end up with infrastructure specialized folks. But I'm unconvinced that the place you want to start is, "Okay, I need a new service, better go talk to the beleaguered central team that never has quite enough time for anyone."

pas · 3 years ago

It's a constant cost-benefit struggle. AWS can do it because they are printing money printers.

Sure, this does not excuse most traditional big corps that have huge internal engineering budget yet force a top-down rigid inefficient structure. (Though again, it takes a very principled way of doing things to be able to scale out and keep things sort of consistent and coordinated.)

ragona · 3 years ago

I get it -- it's such a trap to say, "well, <tech giant> does it!" But in this particular case, I actually think they're walking the walk of having small teams that act like startups. You end up with small engineering teams owning a tiny little bubble, and it does a LOT to keep complexity down in terms of what one team has to manage.

(This is of course much more true for greenfield projects, early stage stuff. Of course the giant services are large and complicated.)

yarosh · 3 years ago

A bit of ranting...

As for me

1. HashiCorp is forcing enterprise upsells whenever possible, even if it'll hurt Adoption Rates and overall Development Experience

2. Existing TF design issues are ignored, which is causing people some state management trouble irrelevant for TFE. So, yet again, why fix something that will end up in upsells?

3. MPL requires for the PR's to be available in case someone will really fix something, but it's near impossible to contribute into Terraform with any major design improvements.

4. Existing Providers issues are neglected, and Accepting Working PR's takes around 3-4 weeks...

5. Some Providers (helm) are neglected in favour of the New Product Release (Waypoint provider) and there's a Forced Obsolescence Factor along with Forced Adoption.

Deficient Relationship Marketing is the Key Factor in deciding who Will actually write Terraform (maybe not even HashiCorp), Who will Wrap Terraform and Into What (terragrunt, terraspace, pulumi, crossplane etc or some custom gitops SaaS), and Who will Support the target providers when HashiCorp solutions magically turn into abandonware due to upsells.

crb · 3 years ago

If you're interested in more of the author's thoughts on DevOps, Kubernetes or writing, check out an interview I did with him recently: https://kubernetespodcast.com/episode/185-writing/

Dave3of5 · 3 years ago

> The development team didn’t want to – or couldn’t – do the Ops work

Most devs I've spoken to are in this camp: they don't want to do any Ops work at all. They want a 9-5 job without evenings or weekends wasted by services failing. No on-call rota and all that jazz, just writing code, that's all.

dilyevsky · 3 years ago

People used to have the same attitude toward testing. In reality, you just get way better quality, and often velocity, if folks don't just throw it over the fence.

goodoldneon · 3 years ago

That isn't a good comparison since testing can still be part of your 9-5. There are good engineers who adamantly never want to work outside 9-5.

raffraffraff · 3 years ago

I could be wrong, but 20 years of experience tells me that company size has a lot to do with this.

Tiny organisms like amoeba can be simple. But as organism size increases, so too does complexity. They eventually need a nervous system, circulatory system, extra sensors, a more powerful brain to process sensory information and handle movement, motion tracking for hunting. Suddenly, packs of these animals will hunt together, so they'll evolve communication: signals, sounds, language...

Well, if you're a 4-person start-up sitting in the same room, decisions can be made quickly, you don't need departments, managers. But as you grow you need to be extremely careful that you build a nervous system, circulatory system, sensors ... "management brain".

The biggest failures in ops aren't about "who does X?". They're about failing to create right-sized teams that own functions that are important enough to have specific owners. With further growth, certain functions get more complex, and suddenly you might need dedicated network, database & security teams. And if it gets huge, then you probably need multiple copies of those specific functions embedded inside large subsections of the organisation. And they all need to communicate effectively with each other. It's a constant dance. You can't make a single rule and just stick rigidly to it. You need to keep tabs on complexity, workload, morale, lead times. You need to be ready to refactor your teams.

When I hear stories like "it was taking 8 weeks to get a DB provisioned" I think "if that company makes it to IPO and the CTO gets a few $100M, there's absolutely no justice in the world".