Readit News
xyzzy123 · 2 years ago
Here's my #1 tip, most important:

Try to keep your stateful resources / services in different "stacks" than your stateful things.

Absolutely 100% completely obvious, maybe too obvious? Because none of these guides ever mention it.

If you have state in a stack it becomes 10x more expensive and difficult to "replace your way out of trouble" (aka destroy and recreate as a last resort). You want as much as possible in stateless, disposable stacks. DON'T put that customer-exposed bucket or DB in the same state/stack as app server resources.

I don't care about your folder structure, I care about what % of the infra I can reliably burn to the ground and replace using pipelines and no manual actions.

coryrc · 2 years ago
You mean keep stateless separate from stateful?

Everyone else seems to be reading over the typo or I'm more confused than I thought.

wahnfrieden · 2 years ago
Yes
sverhagen · 2 years ago
Is a "stack" here a (root) folder on which you'd do a "terraform apply"? I've never know what to call those, surely they aren't "modules".

And, so, you're saying: try to have a separate deployment (stack then?) that contains the state, so you can wipe away everything else if you want to, without having to manage the state?

xyzzy123 · 2 years ago
It's not exactly about the folder, the IaC from a single folder / project can be instantiated in multiple places. Each time you do that, it has a unique state file, so I usually hear it referred to as a "state". In cfn you can similarly deploy the same thing lots of times and each instantiation is called a "stack", so stack/state tend to get used interchangeably.

And yes, that's a succinct rephrasing.

When you first use IaC it maybe seems logical to put your db and app server in the same "thing" (stack or state file), but now that thing is "pet-like" and you have to take care of it forever. You can't safely have a "destroy" action in your pipeline as a last resort.

If you put the stateful stuff in a separate stack you can freely modify the things in the stateless one with much less worry.
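A minimal sketch of that split (all names hypothetical): the stateful stack is the only one without a destroy action in its pipeline, and it publishes identifiers as outputs for the disposable stacks to consume.

```hcl
# stateful/main.tf -- the long-lived "pet" stack; no destroy in its pipeline
resource "aws_db_instance" "main" {
  identifier          = "app-db"
  engine              = "postgres"
  instance_class      = "db.t3.medium"
  allocated_storage   = 20
  deletion_protection = true # extra guard for the one stack you can't replace
  # (credentials and other required arguments omitted for brevity)
}

output "db_address" {
  value = aws_db_instance.main.address
}
```

The stateless app-server stack then reads `db_address` (via `terraform_remote_state` or a data source) and can be burned to the ground and recreated at will.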

raffraffraff · 2 years ago
I have the same issue. They're all modules, but the ones at the tip of the directory tree (right at the end of your env/region/stack) are called root modules. Which makes no sense because the term "root" always implies that they are at the beginning, not the tippy-toe end. So I call mine "stacks". But as another answer suggested, "states" is also fine. Even though the actual state isn't inside that directory, it's probably in an object store.

At the end of the day I don't care what other people call them.

GauntletWizard · 2 years ago
I have adopted the term "Root Module" vs "Submodule" because those line up with terraform's own definitions, but I agree that they're terribly, terribly named.
thwway23432 · 2 years ago
The "stack" nomenclature used here is jarring since it is unrepresented in Terraform HCL literature.

A CDK stack (assuming that's what's being used here) would be loosely equivalent to a Terraform HCL module.

robertlagrant · 2 years ago
Makes sense, but how do you connect the two so e.g. credentials from one are surfaced in the other?
dharmab · 2 years ago
Use Data Sources to reference resources in a different state: https://developer.hashicorp.com/terraform/language/data-sour...
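A sketch of that pattern, with hypothetical bucket/key names — the stateless stack reads outputs published by the stateful one:

```hcl
# In the stateless stack: read outputs published by the stateful stack.
data "terraform_remote_state" "stateful" {
  backend = "s3"
  config = {
    bucket = "my-tf-states"
    key    = "stateful/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0"
  instance_type = "t3.micro"
  # consume the stateful stack's output without managing its resources
  user_data = "DB_HOST=${data.terraform_remote_state.stateful.outputs.db_address}"
}
```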
no_circuit · 2 years ago
IMO you shouldn't be storing credentials in shared state, as suggested by the other comments, since that means that the principals able to read the state to deploy their service can also read the credentials for other services bundled in that state file. This could be the case if one had broken down the root modules into scopes/services like the linked page suggests.

It is reasonable to assume that if you are using Terraform to manage your infra, then your infra likely has access to a secrets manager from your infra vendor, e.g., AWS. Instead I'd recommend using a Terraform data resource to pull a credential from the secrets manager by name -- and the name doesn't even necessarily have to be communicated through Terraform state. Then the credentials can be fed directly into where they're needed, e.g., a resource like a Kubernetes Secret. One can even skip this whole thing if the service can use the secrets manager API itself. Finally, access to the credentials itself would be locked down with IAM/RBAC.
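A sketch of that approach (secret name hypothetical). One caveat: the fetched value still lands in the *consuming* stack's state file, so that state needs locking down too.

```hcl
# Pull the credential at plan/apply time; only the secret *name* needs
# to be known, not the value passed through state outputs.
data "aws_secretsmanager_secret_version" "db" {
  secret_id = "prod/app/db-password" # hypothetical name
}

resource "kubernetes_secret" "db" {
  metadata {
    name = "db-credentials"
  }
  data = {
    # the provider base64-encodes plain values for you
    password = data.aws_secretsmanager_secret_version.db.secret_string
  }
}
```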

paulddraper · 2 years ago
terraform_remote_state

The root module can have outputs just like any other module. These outputs can be accessed from other stacks from the backend.

And if you use CDKTF the references are handled transparently.

harha_ · 2 years ago
I've never used terraform, but I have used CloudFormation and AWS CDK. It's been a while though, is there a clear indication on the major cloud provider docs which resources are stateful? Or is it always obvious?
tyingq · 2 years ago
Difficult question, as people mean different things when they say state. One example might be a relatively simple AWS Lambda. Most people would say that's easily stateless.

But, what if that Lambda depends on a VPC with some specific networking config to allow it to connect to some partner company private network? And, it's difficult to recreate that VPC without service disruption for a variety of reasons that are out of your control. Well, now you have state because you need to track which existing VPC the Lambda needs if you tear the Lambda down and recreate it.
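One way to handle that situation is to treat the VPC as an input rather than a managed resource — look it up by tag with data sources (names hypothetical; the IAM role and security group are assumed to be defined elsewhere in the stack):

```hcl
# Look up the long-lived VPC instead of managing it here,
# so the Lambda stack itself stays disposable.
data "aws_vpc" "partner" {
  tags = {
    Name = "partner-interconnect" # hypothetical tag
  }
}

data "aws_subnets" "partner" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.partner.id]
  }
}

resource "aws_lambda_function" "sync" {
  function_name = "partner-sync"
  role          = aws_iam_role.lambda.arn # defined elsewhere
  runtime       = "python3.12"
  handler       = "main.handler"
  filename      = "lambda.zip"

  vpc_config {
    subnet_ids         = data.aws_subnets.partner.ids
    security_group_ids = [aws_security_group.lambda.id] # defined elsewhere
  }
}
```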

hansoolo · 2 years ago
Just stopped here because I had said XYZZY way too often in the last three hours xD
time0ut · 2 years ago
I’ve been using Terragrunt [0] for the past three years to manage loosely coupled stacks of Terraform configurations. It allows you to compose separate configurations almost as easily as you compose modules within a configuration. It's got its own learning curve, but it's a solid tool to have in the toolbox.

Gruntwork is a really cool company that makes other tools in this space like Terratest [1]. Every module I write comes with Terratest-powered integration tests. Nothing more satisfying than pushing a change, watching the pipeline run the tests, and then automatically releasing a new version that I know works (or at least that what I tested works).

[0] https://terragrunt.gruntwork.io/

[1] https://terratest.gruntwork.io/

mike_d · 2 years ago
They seem very insistent on keeping things DRY but not explaining why. Does Terraform tend to cause water leaks?
raffraffraff · 2 years ago
Terraform is supposed to let you write modular, reusable code. But because it's a limited DSL that lacks many "proper language" features (and occasionally breaks the rule of least surprise), there are several major impediments to fully data-driven Terraform. These ultimately result in copy/paste code, or tools like Terragrunt which essentially wrap Terraform and perform the copy/pasta behind your back by generating that code for you.

Some minor examples:

- calling a module multiple times using `for_each` to iterate over data works, except if the module contains a "provider" block

- if you are deploying two sets of resources by iterating over data, terraform can detect dependency cycles where there are not any
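The first limitation looks like this in practice (module path and provider alias hypothetical):

```hcl
# Works: iterating a module over data...
module "buckets" {
  source   = "./modules/bucket"
  for_each = toset(["logs", "assets", "backups"])
  name     = each.key
}

# ...but only if ./modules/bucket contains no provider block.
# A module with its own `provider "aws" { ... }` inside cannot be used
# with for_each or count; providers must be passed in from the caller:
module "replica" {
  source    = "./modules/bucket"
  providers = { aws = aws.us_west_2 }
  name      = "replica"
}
```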

SgtBastard · 2 years ago
DRY = Don’t Repeat Yourself.
waffletower · 2 years ago
While combining the word "best" with "Terraform" in a sentence is more than likely to result in an oxymoron, it is counter-productive not to attempt to organize and use Terraform as elegantly and DRY as possible. We interact with stacks (which we typically call projects) via Terragrunt and have a very large surface of modules, as we have a fair number of infrastructure pieces. We also try to expose Terraform infrastructure changes via Atlantis; though bulky, GitHub does provide a reasonable means to discuss and manage changes made by multiple teams. The use of modules also helps us encapsulate infrastructure, and state problems are rare with these approaches, but the data sprawl inherent to Terraform is very unwieldy regardless of so-called "best" practices. The language features are weak, awkward, and directly encourage repetition and specification bloat. We have had some success using Data Sources to move logic outside of Terraform and provide much-needed sanity when interacting with very verbose infrastructure such as Lake Formation.
thunfisch · 2 years ago
We're using Terragrunt with hundreds of AWS accounts and thousands of Terraform deployments/states.

I'll never want to do this without Terragrunt again. The suggested method of referencing remote states, and writing out the backends will fall apart instantly at that scale. It's just way too brittle and unwieldy.

Terragrunt with some good defaults that will be included, and separated states for modules (which makes partial applies a breeze) as well as autogenerated backend configs (let Terragrunt inject it for you, with templated values) is the way to go.

rvdginste · 2 years ago
We use a setup where we have multiple repos with Terraform configuration and thus multiple Terraform states. We then use Terraform remote state to link everything together. I am talking about 10-20 repos and states. Orthogonal to that, we use multiple workspaces to describe the infra in different environments.

The problems I have personally experienced with this approach are:

- if you update one of the root Terraform states, you need to execute a Terraform apply for every repo that depends on that Terraform state; developers do not do that because either they forget or they do know but are too lazy and subsequently are surprised that things are broken

- if you use workspaces for maintaining the infra in different environments, and certain components are only needed in specific environments, then the Terraform code becomes pretty ugly (using count makes a single thing suddenly a list of things, which you then have to account for in the outputs, which becomes very verbose)
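That count pattern, sketched with a hypothetical resource:

```hcl
# The component only exists in prod, so everything becomes a 0-or-1 list.
resource "aws_elasticache_cluster" "cache" {
  count           = terraform.workspace == "prod" ? 1 : 0
  cluster_id      = "app-cache"
  engine          = "redis"
  node_type       = "cache.t3.micro"
  num_cache_nodes = 1
}

output "cache_endpoint" {
  # One-element indexing plus a null fallback, repeated for every output.
  value = length(aws_elasticache_cluster.cache) > 0 ? aws_elasticache_cluster.cache[0].cache_nodes[0].address : null
}
```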

Is Terragrunt something that would help us? I do not know Terragrunt, and a quick look at the website did not make that clear for me.

ckdarby · 2 years ago
Have you spent any time with Pulumi?

I've kind of found Terraform is dying and encourages a lot of bad practices, but everyone goes along with them because of HCL, and the skills are transferable since most companies are just using TF.

RulerOf · 2 years ago
> I've kind of found terraform is dying

I don't think it's dying. The hype has worn off. Everybody uses it. It's very mature. There's a module for everything.

It's just not new and sexy anymore IMO.

DelightOne · 2 years ago
Do you need to chain multiple Terragrunt executions to first bring the Kubernetes cluster up and then the containers, or does Terragrunt fix that?
miduil · 2 years ago
Yes, with terragrunt you can do a `terragrunt run-all apply` and based on `output` to `variable` in each module data can be passed from one state/module to the next one, terragrunt knows how to run them in the right order so you can bootstrap your EKS cluster by having one module which bootstraps the account, then another one which bootstraps EKS, then one that configures the cluster, installs your "base pods" and then later everything else.
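A sketch of that dependency wiring (paths, module URL, and output names hypothetical):

```hcl
# eks/terragrunt.hcl -- this unit depends on the account unit's outputs.
terraform {
  source = "git::https://example.com/modules/eks.git?ref=v1.4.0"
}

dependency "account" {
  config_path = "../account"
}

inputs = {
  vpc_id     = dependency.account.outputs.vpc_id
  subnet_ids = dependency.account.outputs.private_subnet_ids
}
```

`terragrunt run-all apply` then walks the dependency graph and applies each unit in order.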
rcarr · 2 years ago
Genuine question for DevOps people:

Other than the fact it seems to be an industry standard so it's good for your job prospects, what are the benefits to Terraform over CloudFormation/CDK or whatever the equivalent is for your particular cloud provider?

Most companies/people pick a provider and then stick with it, and it doesn't seem like there's much portability between configurations if you do decide to switch providers later down the line, so I'm not sure what the benefits are. I haven't delved into Terraform yet, but I tried doing a project in Pulumi once and felt by the end of it that I might as well have just written it in AWS CDK directly.

abrookewood · 2 years ago
After you have waited 20 minutes for CloudFormation to fail and tell you that it can't delete a resource (but won't tell you why), and this is the third time it has happened in a week, you start looking at alternatives.
androidbishop · 2 years ago
This 1000%. Also recently discovered Google Deployment Manager is shit for the exact same reasons. I honestly don't get it.
solatic · 2 years ago
> your particular cloud provider

Just this week, I wrote a Terraform module that uses the GCP, Kubernetes, and Cloudflare providers to allow us to bring up a single business-need (that will be needed, hopefully, many times in the future) that spans those three layers of the stack. 200 lines of Terraform written in an afternoon replaced a janky 2,000 line over-engineered Python microservice (including much retry and failure-handling logic that Terraform gives you for free) whose original author (not DevOps) moved on to better pastures.

CDK is fine if you're all-in on AWS. It has its tradeoffs compared to Terraform. Pick the right tool for the job.

Centigonal · 2 years ago
I worked closely with the folks that wrote our platform's IaC, first in CDK, then in Terraform. I wrote a bit of CDK and zero TF myself, but here are some of the reasons we switched:

A big plus is that Terraform works outside of AWS land.

CDK is a nightmare to work with. You're writing with programming-language syntax, which tempts you to write dynamic stuff - but everything still compiles down to declarative CFN, which just makes the ergonomics feel limited. The L2 and L3 constructs have a lot of implicit defaults that came back to bite us later.

With CDK you get synth and deploy, which felt like a black box. Minor changes would do the same 8 minute long deploy process as large infrastructure refactors. Switching to TF significantly sped up our builds for minor commits. There might be a better way to do this with CDK (maybe deploying separate apps for each part of our infrastructure) and we may have just missed it.

androidbishop · 2 years ago
Terraform, and by extension HCL, is more powerful and flexible. It can be used across clouds. It has providers for all kinds of things, like kubernetes. It can be abstracted and modularized. It supports cool features like workspaces and junk, depending on how you want to use it.

Also recently I was forced to use Google Cloud Deployment Manager scripts for some legacy project we were migrating to Terraform, and I was shocked at how buggy and useless it was. Failed to create resources for no discernible reason, couldn't update existing resources with dependencies, couldn't delete resources, was just unfathomably shit all around. Finished the Terraform migration earlier this morning and everything went off without a hitch, plus we got more coverage for stuff Deployment Manager doesn't support. It's also organized much nicer now, with versioned modules and what-have-you.

Cloudformation is ugly and again, surprisingly isn't well supported by AWS. I don't understand how it's possible, but terraform providers seem to be more up to date with products and APIs. Maybe that's just me but I've seen others complain about the same thing.

wodenokoto · 2 years ago
Isn’t google cloud deployment just bash calls to the google cloud cli disguised as declarations by way of yaml?
yellowapple · 2 years ago
- In the event that you are working with different cloud providers, Terraform is one thing to learn that then applies to all of them, as opposed to learning each provider's bespoke infra-as-code offering. Most companies stick to one PaaS/IaaS, but individual personnel ain't necessarily as limited over the courses of their careers.

- Not all cloud providers have an infra-as-code offering of their own in the first place (especially true with traditional server hosts), whereas pretty much every provider with some sort of API most likely has a Terraform provider implemented for it.

- Terraform providers include more than just PaaS/IaaS providers / server hosts; for example, my current job includes provisioning Datadog metrics and PagerDuty alerts alongside applications' AWS infra in the same per-app Terraform codebase, and a previous job entailed configuring Keycloak instances/realms/roles/etc. via Terraform.

androidbishop · 2 years ago
Also pretty neat that there's a Terraform provider for Kubernetes native resources.
RulerOf · 2 years ago
I've got a lot of opinions here, but the only one I'll share is that HCL knocks the socks off of JSON and YAML. JSON is too rigid. YAML is too nested. HCL gets this just right.

Venturing away from opinions, the provider ecosystem with terraform enables some wonderful design options. For example, I have a module template that takes some basic container configs (e.g. ports, healthchecks) and a GitHub URL, then stands the service up on ECS and configures CI in the linked repo. CF can't do that.
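A hypothetical reconstruction of what calling such a module might look like — the point being that a single apply touches both ECS and GitHub, which CloudFormation can't do:

```hcl
module "service" {
  source = "./modules/ecs-service-with-ci" # hypothetical module

  name           = "billing-api"
  container_port = 8080
  health_check   = "/healthz"
  github_repo    = "acme/billing-api" # CI gets configured in this repo
}
```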

nuker · 2 years ago
I've been working with AWS for 10 years. I strongly prefer CloudFormation; just separate things smartly between stacks. It has export/import for stack outputs too. Just look at the "root module" mess in this discussion and you'll get why.
raffraffraff · 2 years ago
For me personally, I chose terraform because it can work with AWS and a heap of other 3rd party services and software (Cloudflare, PostgreSQL, Keycloak, Kubernetes/Helm, Github, Azure)
x3n0ph3n3 · 2 years ago
I have used both terraform and cloudformation substantially and they each have pros and cons. One thing terraform has over cloudformation is its rapid support for new services and features. AWS has done an awful job ensuring that cloudformation support is part of each team's definition of "done" for each release. It just doesn't get the support it really needs from AWS.
koolba · 2 years ago
CloudFormation is the ugly step child of AWS. It has bugs that have languished for years
dgrin91 · 2 years ago
Companies choose providers and tend to stick with them, but people don't always stick with companies. If I know TF there is a decent chance my skills will be applicable when I change companies.

Also some big corps run their own internal datacenters and have cloud-like interfaces with them. You can write TF providers for that (it's not going to be as nice as the public cloud ones, but still nice). Then you can utilize Terraform's multi-provider functionality to have 1 project manage deployments on multiple clouds that include on-prem.

Terraform's multi-provider functionality is also useful for non-AWS/Azure/GCP providers such as Cloudflare. As far as I know CDK does not support that.

Illotus · 2 years ago
> Other than the fact it seems to be an industry standard so it's good for your job prospects, what are the benefits to Terraform over CloudFormation/CDK or whatever the equivalent is for your particular cloud provider?

For me the killer feature is that both plan and apply show the actual diff of changes vs running infrastructure. It makes understanding effects of changes much easier.

Bellyache5 · 2 years ago
Agreed, Terraform does a good job of this. But CloudFormation & CDK can also do this via Change Sets and CDK diff.

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGui...

https://blog.mikaeels.com/what-does-the-aws-cdk-diff-command...

thedougd · 2 years ago
Providers I regularly use, even mixed in a single project. There are others I could use if they were available.

AWS, GitHub, Opsgenie, Okta, Scalr, TLS, DNS

koolba · 2 years ago
You forgot the greatest escape hatch of all: null
jahsome · 2 years ago
Third-party integrations and the universality/reusability across multiple products and familiarity of HCL are big for me.
333throwaway342 · 2 years ago
> Most companies/people pick a provider and then stick with it and it doesn't seem like there's much portability between configurations if you do decide to switch providers later down the line so I'm not sure what the benefits are.

This smells like kubernetes

> Terraform over CloudFormation/CDK

They both work. It's more about which providers you need.

maccard · 2 years ago
We don't just have AWS resources. Our CI pipelines are managed by terraform [0], they communicate with GitHub [1]. I like that it's declarative and limited, it stops people trying to do "clever shit" with our infra, which is complicated enough as it is.

[0] https://buildkite.com/blog/manage-your-ci-cd-resources-as-co...

[1] https://registry.terraform.io/providers/integrations/github/...

cwp · 2 years ago
It's subtle and so difficult to see the differences at smaller scales. If you're going to provision a handful of EC2 instances, all the tools work fine.

I think HCL is an underappreciated aspect of Terraform. It was kinda awful for a while, but it's gotten a lot better and much easier to work with. It hits a sweet spot between data languages like JSON and YAML and fully-general programming languages like Python.

Take CloudFormation. The "native" language is JSON, and they've added YAML support for better ergonomics. But JSON is just not expressive enough. You end up with "pseudoparameters" and "function calls" layered on top. Attribute names doubling as type declarations, deeply nested layers of structure, and incredible amounts of repetitious complexity just to be able to express all the details needed to handle even moderate amounts of infrastructure.

So, ok, AWS recognizes this and they provide CDK so you can wring out all the repetition using a real programming language - pick your favourite one, a bunch are supported. That helps some, but now you've got the worst of both worlds. It's not "just JSON" anymore. You need a full programming environment. The CDK, let's say the Python version, has to run on the right interpreter. It has a lot of package dependencies, and you'll probably want to run it in a virtualenv, or maybe a container. And it's got the full power of Python, so you might have sources of non-determinism that give you subtle errors and bugs. Maybe it's daylight saving gotchas or hidden dependencies on data that it pulls in from the net. This can sound paranoid, but these things do start to bite if you have enough scale and enough time.

And then, all that Python code is just a front end to the JSON, so you get some insulation from it, but sometimes you're going to have to reason about the JSON it's producing.

HCL, despite its warts, avoids the problems with these extremes. It's enough of a programming language that you can just use named, typed variables to deal with configuration, instead of all the { "Fn::GetAtt" : ["ObjectName", "AttName"] } nonsense that CloudFormation will put you through. And the ability to create modules that can call each other is sooo important for wringing out all the repetition that these configurations seem to generate.
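For contrast, a sketch of that attribute reference in HCL (resource names hypothetical) — what CloudFormation spells as a nested Fn::GetAtt object is just a dotted expression:

```hcl
resource "aws_s3_bucket" "app" {
  bucket = "my-app-bucket"
}

resource "aws_iam_policy" "read_app" {
  name = "read-app-bucket"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = ["s3:GetObject"]
      # CloudFormation: { "Fn::GetAtt": ["AppBucket", "Arn"] }
      Resource = ["${aws_s3_bucket.app.arn}/*"]
    }]
  })
}
```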

On the other hand, it's not fully general, so you don't have to deal with things like loops, recursion, and so on. This lack of power in the language enables more power in the tools. Things like the plan/apply distinction, automatically tracking dependencies between resources, targeting specific resources, move blocks etc. would be difficult or impossible with a language as powerful as Python.

HCL isn't the only language in this space - see CUE and Dhall, for example - but it's undoubtedly the most widely used. And it makes a real difference in practice.

swozey · 2 years ago
This was a good read but really if you already follow the common best practices of IAC/terraform/aws multi-account I don't think you're going to learn much.

The comments in here kind of made me think I was going to hop in and take away some huge wins I hadn't considered. But I have been working with Terraform and AWS for a very long time.

If you're unfamiliar with AWS multi-account best practices this is a good read.

https://aws.amazon.com/organizations/getting-started/best-pr...

bcjordan · 2 years ago
I remember periodically coming across services/platforms that purport to make setting up secure AWS accounts / infra configuration easier and default secure — anyone know what I may be thinking of?
swozey · 2 years ago
Actually the article here is one of those options - https://substrate.tools/

I don't know how integrating this into an environment where you already have tons of AWS accounts would go but it's interesting. Thankfully I only have to make new accounts when we greenfield a service and that's maybe a yearly thing.

spicyusername · 2 years ago
Everybody in here is recommending Terragrunt, but I'm not sure what value it provides over regular Terraform.

After using it for a few months, it seems all of the features found in Terragrunt are now in Terraform.

jbjohns · 2 years ago
This is my impression as well. As far as I've understood, terragrunt was made back when terraform was missing a lot of key features (I think it maybe didn't even have modules yet) but when I was asked to evaluate it recently for a client I couldn't find a single reason to justify adding another tool.
linuxdude314 · 2 years ago
The primary thing terragrunt was designed to do was let you dynamically render providers.

Terraform still does not let you do this.

It becomes very problematic when using providers that are region specific, amongst other scenarios.

That being said I don’t like the extra complexity terragrunt adds and instead choose to adopt a hierarchical structure that solves most of the problems being able to dynamically render providers would solve.

Each module is stored in its own git repo.

Top layer or root module contains one tf file that is ONLY imports with no parameters.

The modules being imported are called “tenant modules”. A tenant module contains instantiations of providers and modules with parameters.

The modules imported by the tenant modules are the ones that actually stand up the infrastructure.

Variables are used, but no external parameters files are used at any level (except for testing).

All of the modules are versioned with git tagged releases so the correct version can easily be imported.

Couple this with a single remote state provider in the root module and throw it in a CI/CD pipeline and you have a gitops driven infrastructure as code pipeline.
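A hypothetical sketch of that layering:

```hcl
# root/main.tf -- the root module is ONLY versioned imports, no parameters.
module "tenant_networking" {
  source = "git::https://example.com/tenant-networking.git?ref=v2.1.0"
}

module "tenant_platform" {
  source = "git::https://example.com/tenant-platform.git?ref=v1.7.3"
}

# A tenant module (its own repo) then instantiates providers and the
# lower-level modules, with parameters, e.g. tenant-networking/main.tf:
#
#   provider "aws" { region = "eu-west-1" }
#   module "vpc" {
#     source = "git::https://example.com/vpc.git?ref=v3.0.0"
#     cidr   = "10.0.0.0/16"
#   }
```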

maccard · 2 years ago
We migrated from terragrunt to terraform as we thought the same thing. I'm in half a mind to go back.

Managing multiple environments is much easier in TG. State management in TF is kneecapped by the lack of variable support in backend blocks. I can only assume it's to encourage people into using terraform cloud for state management.
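The usual workaround (a sketch): keep the backend block partial and inject the varying values at init time.

```hcl
# backend.tf -- variables aren't allowed here, so leave it partial
terraform {
  backend "s3" {}
}
```

Each environment then supplies its own values, e.g. `terraform init -backend-config="bucket=my-tf-states" -backend-config="key=envs/dev/terraform.tfstate" -backend-config="region=us-east-1"`.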

yellowapple · 2 years ago
Terragrunt shines in cases where you have independent sets of Terraform state, especially if they are dependencies/dependents of one another.

For example, say you're using Terraform to manage AWS resources, and you've provisioned an Active Directory forest that you in turn want to manage with Terraform via the AD provider. Terraform providers can't dynamically pull things like needed credentials from existing state, so you end up needing two separate Terraform states: one for AWS (which outputs the management credentials for the AD servers you've provisioned) and one for AD (which accepts those credentials as config during 'terraform init').

Terragrunt can do this in an automated way within a single codebase, redefining providers and handling dependency/dependent relationships. I don't know of a way to do it in pure Terraform that doesn't entail manual intervention.

nwmcsween · 2 years ago
Ideally you decouple this and store the creds in a key vault or whatever; this way you have to explicitly grant the service principal access to the KV secret. Decoupling usually fixes other issues as well, such as expiring creds: the refresh from service A to service B can then get coded into Terraform.
badblock · 2 years ago
Some of this seems like old advice; instead of having directories per environment, you should be using workspaces to keep your environments consistent so you don't forget to add your new service to prod.
rcrowley · 2 years ago
(Hi, I’m one of the authors of the article at the root of this thread.)

I’ve gone back and forth on workspaces versus more root modules. On balance, I like having more root modules because I can orient myself just by my working directory instead of both my working directory and workspace. Plus, I feel better about stuffing more dimensions of separation into a directory tree than into workspace names. YMMV.

36chamber · 2 years ago
Do you always store modules in the same repo as the terraform itself?

Why not put them in separate repos that can be tagged and versioned and then referenced like below?

source = "git::https://bitbucket.org/foocompany/module_name.git?ref=v1.2"

dharmab · 2 years ago
What do you think about multiple backends? It seems to be working well for me to have a single root module but with a separate backend configuration per environment.