solatic · 6 months ago
> If you really do think that Terraform is code, then go try and make multiple DNS records for each random instance ID based on a dynamic number of instances. Correct me if I'm wrong, but I don't think you can do that in Terraform.

It depends on where the source of dynamism is coming from, but yes you can do this in Terraform. You get the instances with data.aws_instances, feed it into aws_route53_record with a for_each, and you're done. Maybe you need to play around with putting them into different modules because of issues with dynamic state identifiers, but it's not remotely the most complicated Terraform I've come across.
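
A rough sketch (untested; the tag, zone variable, and record names are placeholders):

    data "aws_instances" "web" {
      instance_tags = {
        Role = "web"
      }
    }

    resource "aws_route53_record" "per_instance" {
      # one A record per instance ID, keyed by the ID itself
      for_each = zipmap(data.aws_instances.web.ids, data.aws_instances.web.private_ips)

      zone_id = var.zone_id
      name    = "${each.key}.internal.example.com"
      type    = "A"
      ttl     = 300
      records = [each.value]
    }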

That's a separate question from whether or not it's a good idea. Terraform is a one-shot CLI tool, not a daemon, and it doesn't provide auto-reconciliation on its own (although there are daemons like Terraform Enterprise / TerraKube that will run Terraform on a schedule for you and thus provide auto-reconciliation). Stuff like DNS records for Kubernetes ingress is much better handled by external-dns, which itself is statically present in a Kubernetes cluster and therefore might be more properly installed with Terraform.

ljm · 6 months ago
K8S is at a point now where I'd probably try to configure whatever I can inside the cluster as an operator or controller.

There are going to be situations where that isn't practical, but the ability to describe all the pieces of your infra as a CRD is quite nice and it takes some pain out of having things split between terraform/pulumi/cdk and yaml.

At that point, you're just running your own little cloud instead of piggybacking on someone else's. Just need a dry-run pipeline so you can review changes before applying them to the cluster.

solatic · 6 months ago
Sure, but the Kubernetes cluster itself, plus its foundational extra controllers (e.g. FluxCD) are basically static and therefore should be configured in Terraform.
url00 · 6 months ago
Can you expand a bit on the kinds of things you are doing in operators and controllers? I've been wary of putting too much in the cluster... but maybe I should be doing more.
klooney · 6 months ago
https://registry.terraform.io/providers/hashicorp/random/lat... is also very useful for this sort of thing, in case you want a persistent random value per resource- shuffle, id, pet, and password are all super handy.
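
For example (a sketch; the bucket name is made up):

    resource "random_pet" "suffix" {
      length = 2
    }

    resource "aws_s3_bucket" "artifacts" {
      # ends up as something like "artifacts-striking-mongoose"
      bucket = "artifacts-${random_pet.suffix.id}"
    }
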
akdor1154 · 6 months ago
Hill I will die on: Terraform being less expressive than a real language is a feature, not a drawback.

CDK/Pulumi/Yoke is optimised for being easy to write, but code should be optimised to be easy to READ.

Sure, cdk/pulumi/yoke lets you write the most clever and succinct construction you can compose in your favourite language... however, whoever comes across your clever code next will probably want to hit you, especially if it's not a dev from your immediate team, and especially if you have succumbed to blurring the lines between your IaC code and your app code.

If they instead come across some bog-standard terraform that maybe has a bunch of copy-paste and is a bit more verbose... Who cares? Its function will be obvious, there is no mental overhead needed.

On the flip side, Helm templating is an absolute abomination and I would probably take anything over needing to immerse myself in that filth; maybe Yoke is worth a look after all. But the REAL answer is a real config language, still.

Aeolun · 6 months ago
> code should be optimised to be easy to READ

You say that as if it’s impossible to write clear code. As soon as you have any form of multiple resources (e.g. create x of y) I’ll take the real programming language over terraform.

dijksterhuis · 6 months ago
> As soon as you have any form of multiple resources

terraform handles this with for_each. need 10 EBS volumes on 10 EC2 instances? for_each and link the instance id from each value. done. there's a bunch of stuff i now don't have to worry about (does the instance actually exist yet? other validation edge cases?)

https://developer.hashicorp.com/terraform/language/meta-argu...
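
something like this (a sketch; the variable names are made up):

    resource "aws_instance" "app" {
      for_each      = toset(var.instance_names)
      ami           = var.ami_id
      instance_type = "t3.micro"
    }

    resource "aws_ebs_volume" "data" {
      # chained for_each: one volume per instance, same keys
      for_each          = aws_instance.app
      availability_zone = each.value.availability_zone
      size              = 100
    }

    resource "aws_volume_attachment" "data" {
      for_each    = aws_instance.app
      device_name = "/dev/sdf"
      volume_id   = aws_ebs_volume.data[each.key].id
      instance_id = each.value.id
    }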

> You say that as if it’s impossible to write clear code.

not the parent, but i feel their usage of the word “code” was in error. i don’t care about how, i care about what.

the HCL is purely a definition/description of what the infrastructure looks like. what resources will be created? that is all it is. i want that. to define the infra and move on. i don't want low-level control of every minutia to do with infrastructure. i want to read a config file and just know what resources will exist in the account. wanna know every resource that exists? `terraform state list` … job done. no reading code required.

HCL/terraform is to define my cloud resources, not to control them or their creation. if i want control, then i need to whip out some go/python.

that’s my vibe on CDK libraries/platform APIs versus terraform.

Spivak · 6 months ago
You can understand every single terraform codebase using nothing other than the terraform documentation itself. All abstractions are provided by the language itself.

Clear isn't really the word I would call it, more that the real work being done is exposed and always visible.

paulddraper · 6 months ago
Fortunately, Terraform has CDKTF [1] which allows you to use common languages such as Python, Java, and TypeScript to author Terraform infra.

I use it daily and find it greatly liberating.

[1] https://developer.hashicorp.com/terraform/cdktf

patrick451 · 6 months ago
>whoever comes across your clever code next will probably want to hit you, especially if it's not a dev from your immediate team, and especially if you have succumbed to blurring the lines between your IaC code and your app code.

If you want to maximize the number of people who have a chance of understanding what is happening, python is your huckleberry. They are going to want to hit the guy who wrote everything in a bizarre language called HCL that nobody outside of infra has ever seen or heard of.

> If they instead come across some bog-standard terraform that maybe has a bunch of copy-paste and is a bit more verbose... Who cares? Its function will be obvious, there is no mental overhead needed.

"bog standard" is doing a lot of heavy lifting here. You can write simple python or esoteric python and you can write simple terraform or esoteric terraform.

liampulles · 6 months ago
As the Go proverb goes: "clear is better than clever". https://go-proverbs.github.io/
danw1979 · 6 months ago
I think a majority of the rants about Terraform I read are written from the perspective of someone managing inherently ephemeral infrastructure - things that are easily disposed of and reprovisioned quickly. The author of such a critique is likely managing an application stack on top of an account that someone else has provided them, a platform team maybe. CDK probably works for you in this case.

Now, if you belong to that platform team and have to manage the state of tens of thousands of "pet" resources that you can't just nuke and recreate using the CDK (because some other team depends on their availability), then Terraform is the best thing since sliced bread; it manages state and drift, and the declarative nature of the DSL is desirable.

Horses for courses.

bayindirh · 6 months ago
> Horses for courses.

I think that, along with YMMV, these are the two most important things we need to keep in mind. With the plethora of technologies and similar tools, we generally read the tin superficially but not the manual, and then declare "This is bollocks!".

Every tool is targeted towards a specific use and thrives in specific scenarios. Calling a tool bad for something it wasn't designed for is akin to getting angry at your mug because it doesn't work as well when upside down [0].

[0]: https://i.redd.it/mcfym6oqx5p11.jpg

robertlagrant · 6 months ago
For me Terraform's biggest strength is also its biggest source of pain: it can integrate all sorts of technologies under one relatively vendor-agnostic umbrella and enforce a standard workflow across a huge amount of change. However, that means any bug in any provider is sort of Terraform's fault, if only in the developer's mind.
gregmac · 6 months ago
Having debugged this sort of thing before, it's actually really hard to figure that out.

The entire stack is kind of bad at both logging and having understandable error messages.

You get things like this:

    ╷
    │ Error: googleapi: Error 400: The request has errors, badRequest
    │
    │   with google_cloudfunctions_function.function,
    │   on main.tf line 46, in resource "google_cloudfunctions_function" "function":
    │   46: resource "google_cloudfunctions_function" "function" {
Is this a problem with the actual terraform or passing a variable in or something? Is it a problem with the googleapi provider? Is it a problem with the API? Or did I, as the writer of this, simply forget a field?

In complex setups, this will be deep inside a module inside a module, and as the developer who did not use any google_cloudfunctions_function directly, you're left wondering what the heck is going on.

stego-tech · 6 months ago
These sorts of posts are fascinating "nerd snipes" to cryptids like me. On the surface, they look incredibly interesting and I want to learn more! Terraform isn't code? Please explain to me why not, you have my attention.

Then I get to the real meat of the issue, which is often along the lines of, "I'm a software developer who has to handle my own infrastructure and I hate it, because infrastructure doesn't behave like software." Which, fair! That is a fair critique! Infrastructure does not behave like software, and that's intentional!

It's almost certainly because I come from the Enterprise Tech world rather than Software Dev world, where the default state of infrastructure is permanent and mutable, forever. Modern devs, who (rightly!) like immutable containers and block storage and build tools to support these deployments by default, just don't get why the Enterprise tech stack is so much more different, and weird, and...crufty compared to their nifty and efficient CI/CD pipeline, just like I cannot fully appreciate the point of such a pipeline when I'm basically deploying bespoke machines for internal teams on the regular because politics dictates customer service over enterprise efficiency. It's the difference between building an assembly line for Corollas and Camrys (DevOps), and building a Rolls-Royce Phantom to spec for a VIP client (BizTech). That's not to say there hasn't been immense pressure to transform the latter into more like the former, and I've been part of some of those buildouts and transitions in my career (with some admittedly excellent benefits - Showback! Tenancy! Lifecycles!), but these gripes about Terraform are admittedly lost on me, because I'll never really encounter them.

And even if I did, I wouldn't necessarily need to pick up programming to fix it. I just need to improve my existing system integrations so Ansible runbooks can handle the necessary automation for me.

JohnMakin · 6 months ago
Thanks for posting this, I favorited it - having carved out a weird niche in my career as an "infra" guy, inevitably I deal with a lot of IAC. I run into this attitude a lot from devs - they are indeed annoyed by managing infrastructure, because it innately is not like software! I know I'm reiterating what you said but it is so important to understand this.

Here is a thing I run into a lot:

"Our infra is brittle and becoming a chore to manage, and is becoming a huge risk. We need IAC!" (At this point, I don't think it's a bad idea to reach for this)

But then -

"We need to manage all our IAC practices like dev ones, because this is code, so we will use software engineering practices!"

Now I don't entirely disagree with the above statement, but I have caveats. I try to treat my IAC like "software" as much as I can, but as you pointed out, this can break down. Example: managing large terraform repositories that touch tons of things across an organization can become a real pain with managing state + automation + normal CI/CD practices. I can push a terraform PR, get it approved, but I won't actually know whether what I did was valid until I try to push it live. As opposed to software, where you can be reasonably confident that the code is going to mostly work how you intend before you deploy it. Often in infra, the only way to know is to try/apply it. Rollback procedures are entirely different, etc.

It also breaks down, as others have noted, when trying to use terraform to manage dynamic resources that aren't supposed to be immutable (like Kubernetes). I still do it, but it's loaded with footguns and I wouldn't recommend it to someone that hasn't spent years doing this kind of thing.

mdaniel · 6 months ago
> I can push a terraform PR, get it approved, but I won't actually know whether what I did was valid until I try to push it live

Our concession to this risk was that once a merge request was approved, the automation was free to run the apply pipeline step, leaving open the very likely possibility that TF shit itself. However, since it wasn't actually merged yet, you could push fixes until TF stopped shitting itself.

I'm cognizant that solution doesn't "scale," in that if you have a high-throughput repo those merge requests will almost certainly clash, but it worked for us because it meant less merge request overhead (context switching). It also, obviously, leveraged the "new pushes revoke merge request approval" behavior, which I feel is good hygiene, but some places are "once approved, always approved".

anonfordays · 6 months ago
>It's almost certainly because I come from the Enterprise Tech world rather than Software Dev world, where the default state of infrastructure is permanent and mutable, forever. Modern devs, who (rightly!) like immutable containers and block storage and build tools to support these deployments by default, just don't get why the Enterprise tech stack is so much more different

This is generally true, but the interesting thing about Terraform is it was created specifically to work in the world of "immutable by default." This is why Terraform automatically creates and destroys instead of mutating in many (most?) cases, shies away from using provisioners to mutate resources after creation, etc.
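
A small illustration (sketch; the resource and variable names are made up): change something like the AMI and the plan shows a replacement rather than an in-place update, with the lifecycle block controlling the ordering:

    resource "aws_instance" "web" {
      ami           = var.ami_id   # changing this forces a new instance, not a mutation
      instance_type = "t3.micro"

      lifecycle {
        create_before_destroy = true   # stand up the replacement before destroying the old one
      }
    }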

stego-tech · 6 months ago
Yep, and that's why I only very recently picked it up in Enterprise world, where the AWS team used it to deploy resources. What used to take them ~45min by hand using prebuilt AMIs, now takes ~500 lines of Terraform "code" and several hours of troubleshooting every time Terraform (or whatever fork they're now using post-Hashicorp) updates/changes, because Enterprise architecture is mutable by default and cannot simply be torn down and replaced.
ctrlp · 6 months ago
what sort of cryptid are you?
stego-tech · 6 months ago
As the username implies, the "dinosaur on the internet" kind. The classic trope of the IT person who live(d) in their windowless cave, surrounded by a cacophony of whirling fans and grinding hard drives, retired kit repurposed into a lab since the budget never allowed for a proper one. Graphic tees and blue jeans, an enigmatic mystery to the masses who complain stuff is broken but also that they don't know why I'm here since everything always works.

So just your average IT person, really. What we lack in social graces, we make up for with good humor, excellent media recommendations, and a love for what we create, because we like seeing our users smile at their own lives being made easier. I guess the "cryptid" part comes in because I'm actively trying to improve said sociability and round out my flaws, unlike the stereotypical portrayals of the BOFH or IT Crowd.

voidfunc · 6 months ago
I ditched Terraform years ago and just interact with the raw cloud provider SDKs now. It's much easier to evolve actual code over the long term and deal with weird edge cases that come up when you're not beholden to the straitjacket that is configuration masquerading as code.

Oh yea, and we can write tests for all that provisioning logic too.

plmpsu · 6 months ago
How are you handling creating multiple resources in parallel? or rolling back changes after an unsuccessful run?
gorgoiler · 6 months ago
Not OP, but for rolling back we just… revert the change to the setup_k8s_stuff.py script !

In practice it’s a module that integrates with quite a large number of things in the monolith because that’s one of the advantages of Infrastructure as Actual Code: symbols and enums and functions that have meaningful semantics in your business logic are frequently useful in your infrastructure logic too. The Apples API runs on the Apples tier, the Oranges API runs on the Oranges tier, etc. etc.

People call me old fashioned (“it’s not the 1990s any more”) but when I deploy something it’s a brand new set of instances to which traffic gets migrated. We don’t modify in place with anything clever and I imagine reverting changes in a mutable environment is indeed quite hard to get right (and what you are hinting at?)

inopinatus · 6 months ago
A very small shell script.
kikimora · 6 months ago
I’ve been thinking about this for a long time. But doesn’t it bring a host of other issues? For example, I need to update instance RAM from 4 to 8 Gb but how do I know if the instance exists or should be created? I need to make a small change, how do I know what parts of my scripts to run?
klooney · 6 months ago
Here are the things that TF does that you are probably not going to get around to in a comprehensive way-

- State tracking, especially all of the tedious per cloud resource details

- Parallelism- TF defaults to 10 threads at a time. You won't notice this when you write a demo to deploy one thing, but it really matters as you accrete more things.

- Dependency tracking- hand in hand with the parallelism, but this is what makes it possible. It is tedious, resource by resource blood sweat and tears stuff, and enabled by the inexpressive nature of HCL

Plus, you know, all of the work that has already been done by other people to wrap a million quirky APIs in a uniform way.
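
On the dependency-tracking point, a small illustration (resource names are made up): Terraform builds the graph from references, so ordering and parallelism fall out of the configuration itself:

    resource "aws_vpc" "main" {
      cidr_block = "10.0.0.0/16"
    }

    resource "aws_subnet" "a" {
      vpc_id     = aws_vpc.main.id   # this reference makes the subnet wait for the VPC
      cidr_block = "10.0.1.0/24"
    }

    resource "aws_subnet" "b" {
      vpc_id     = aws_vpc.main.id   # no reference between the subnets, so they can be created in parallel
      cidr_block = "10.0.2.0/24"
    }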

voidfunc · 6 months ago
You write code to do these things? If there's a requirement for you to be able to do such a thing, make it a feature, implement it with tests, and voila: no different than any other feature or bug you work on, is it?
diggan · 6 months ago
> For example, I need to update instance RAM from 4 to 8 Gb but how do I know if the instance exists or should be created?

    let front_id = if instance_exists("front_balancer") {
      fetch_instance("front_balancer").id
    } else {
      create_new_instance("front_balancer", front_balancer_opts).id
    }
Or however else you would manage that sort of thing in your favorite programming language.

> I need to make a small change, how do I know what parts of my scripts to run?

Either just re-run the parts you know you've changed (manually or based on git diffs), or even better, make the entire thing idempotent and you won't have to care: re-run the entire program after each change and it'll automagically work.

evantbyrne · 6 months ago
I went through the same evolution, even built a PaaS for AWS, but I kept going and now just deploy my own stuff to VMs with Swarm via one command in Rove. It's great. And yes, I know Kubernetes; I use it at work. It's an unnecessary waste of time.
dijksterhuis · 6 months ago
> Swarm

docker swarm is so simple and easy compared to the utter behemoth that is k8s, and basically is all you need for CRUD webapps 80-90% of the time. add an RDS instance and you’re set.

i will always pick swarm in a small company* whenever possible until k8s or ECS makes sense because something has changed and it’s needed.

dont start with complexity.

* - bigger companies have different needs.

solatic · 6 months ago
Terraform added tests somewhat recently: https://developer.hashicorp.com/terraform/language/tests
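
A minimal sketch of what one looks like (the bucket resource and prefix here are made up), in a *.tftest.hcl file:

    run "bucket_name_has_org_prefix" {
      command = plan

      assert {
        condition     = startswith(aws_s3_bucket.logs.bucket, "myorg-")
        error_message = "Bucket names must start with the org prefix"
      }
    }
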
imp0cat · 6 months ago
And eventually, you end up with your own in-house Terraform.
beacon294 · 6 months ago
I agree that the SDK is better for many use cases. I do like terraform for static resources like aws vpc, networking, s3 buckets, etc.
abound · 6 months ago
I think I've commented this elsewhere, but using Cue [1] is also great for this purpose, with no extra infrastructure. E.g. you define a Cue Template [2], which seems analogous to Yoke/ATC's CRDs, and then your definitions just include the data.

Here's an example of Vaultwarden running on my K8s cluster:

    deployment: bitwarden: {
     spec: {
      template: {
       spec: {
        containers: [{
         image: "vaultwarden/server:1.32.7"
         env: [{
          name:  "ROCKET_PORT"
          value: "8080"
         }, {
          name: "ADMIN_TOKEN"
          valueFrom: secretKeyRef: {
           name: "bitwarden-secrets"
           key:  "ADMIN_TOKEN"
          }
         }]
         volumeMounts: [{
          name:      "data"
          mountPath: "/data"
          subPath:   "bitwarden"
         }]
         ports: [{
          containerPort: 8080
          name:          "web"
         }]
        }]
        volumes: [{
         name: "data"
         persistentVolumeClaim: claimName: "local-pvc"
        }]
       }
      }
     }
    }
And simpler services are, well, even simpler:

    deployment: myapp: spec: template: spec: containers: [{
     ports: [{
      containerPort: 8080
      name:          "web"
     }]
    }]
And with Cue, you get strongly typed values for everything, and can add tighter constraints as well. This expands to the relevant YAML resources (Services, Deployments, etc), which then get applied to the cluster. The nice thing about this approach is that the cluster doesn't need to know anything about how you manage your resources.

[1] https://cuelang.org/

[2] https://cuelang.org/docs/tour/types/templates/

Cyphus · 6 months ago
I really want to dive in with Cue, but one thing that I got burned on when using jsonnet to generate CloudFormation templates years ago was lack of discoverability for newcomers to the repo.

Taking your sample code as an example, someone might look at the myapp deployment definition and ask: “does this deployment get created in the default namespace or does it automatically create a myapp namespace? What’s the default number of replicas? Are there any labels or annotations that get automatically added?” Etc.

On the flip side, there's potential lack of "greppability." The user may have found a problem with a deployed resource in, say, the development cluster, and gone to grep for some resource-specific string in the repo, only to come up empty because that string is not in the source but rather generated by the templating system.

To be clear, both of these problems can affect any method of generating config, be it yoke, helm, ksonnet, kustomize, or cue. It's like a curse of abstraction. The more you make things into nice reusable components, the easier it is for you to build upon, and the harder it is for others to jump in and modify.

At least with Cue you get properly typed values and parameter validation built in, which puts it miles ahead of “everything is a string” templating systems like the helm templates the article complains about.

strangelove026 · 6 months ago
I was kind of interested in cue earlier last year as IIRC it can be served by helm and is much much better than templating yaml. Never really got started with it. Wish they had an LSP too.

https://github.com/cue-lang/cue/issues/142

mdaniel · 6 months ago
What the hell is going on with their bot copy-pasting every comment on that issue? What a mess

Anyway, I wanted to ask what you meant by "served by helm?" I knew about https://github.com/stefanprodan/timoni and https://github.com/holos-run/holos but I believe they are merely "inspired by helm" and not "cue for helm"

nosefrog · 6 months ago
Reminds me of gcl (yikes).
bbu · 6 months ago
Looks promising but it starts with a (justified) rant about terraform and then goes into how to replace Helm.

I am confused. Can yoke be used to create and manage infrastructure or just k8s resources?

thayne · 6 months ago
Indeed. This isn't really a replacement for terraform, unless you are only using terraform to manage k8s resources. Which probably isn't most people who are currently using Terraform.
xena · 6 months ago
Author here. It's mainly for k8s resources; but if you install operators like external-dns or something like crossplane into your cluster, you can manage infra too.
groestl · 6 months ago
> into your cluster

I guess the point is: what if you don't have a cluster.

sureglymop · 6 months ago
What alternative to terraform would one use to set up the whole cluster before provisioning any resources?

I currently have a custom script that is a mix between terraform and ansible that sets up a proxmox cluster, then a k3s cluster and a few haproxys with keepalived on top. Granted, maybe not the most standard setup.

e12e · 6 months ago
I've considered dropping terraform (openTofu) for our k8s resources since k8s is stateful anyway.

But that would complicate synchronization with resources outside of k8s, like tailscale, DNS, managed databases, cloud storage (S3 compatible) - and even mapping k8s ingress to load_balancer and external DNS.

So far I feel that everything in terraform is the most simple and reasonable solution - mostly because everything can be handled by a single tool and language.

bbu · 6 months ago
ok, that makes sense. A better Helm would be nice. timoni.sh is getting better and better, but Cue is a big hurdle.

Unfortunately, I'm not a big fan of the yaml-hell that crossplane is either.

But as a Terraform replacement systeminit.com is still the strongest looking contender.

danw1979 · 6 months ago
It’s just a dunk on terraform to promote yet another K8s provisioning thing.
WatchDog · 6 months ago
I'm quite happy with CDK[0].

My experience is only with the main AWS cloudformation-based version of CDK, although there is also CDK for terraform, which supports any resource that terraform supports, though some of what I'm about to say is not applicable to that version.

What I like about CDK is that you can write real code, and it supports a wide range of languages, although typescript is the best experience.

Provided that you don't use any of the `fromLookup` type functions, you can run and test the code without needing any actual credentials to your cloud provider.

CDK essentially compiles your code into a cloudformation template; you can run the build without credentials, then deploy the built cloudformation template separately.

You don't need to worry about your terraform server crashing halfway through a deployment, because cloudformation runs the actual deployment.

[0]: https://github.com/aws/aws-cdk

chuckadams · 6 months ago
My main problem with CDK is that it only outputs a CloudFormation stack. I can sign up for a new cloud account, spin up a k8s cluster, deploy everything to it, and restore the database snapshot faster than CF will finish a job that's stuck on UPDATE_CLEANUP_IN_PROGRESS.

Of course there's also cdk8s, but I'll probably go with Pulumi instead if I need that. Right now I'm happy with helmfile, though not so much with helm itself. So I'll definitely be giving Yoke a look.

cedws · 6 months ago
In your experience how often have you had template builds succeed but then fail at apply time? This kind of issue is what I find most frustrating about IaC today, your 'code' 'compiling' means nothing because all of the validations are serverside, and sometimes you won't find out something's wrong until Terraform is already half done applying. I want to be able to declare my infrastructure, be able to fully validate it offline, and have it work first try when I apply it.
Aeolun · 6 months ago
I find Pulumi very nice here because it persists state after every successful resource creation. If it breaks somewhere in the middle, the next run will just pick up where it left off last time.

CDK… well, CDK doesn’t get in an invalid state often either, but that’s because it spends 30m rolling back every time something goes wrong.

WatchDog · 6 months ago
I've had fewer such issues with CDK versus raw cloudformation or terraform, but it can still happen.