julienmarie · a year ago
I personally love k8s. I run multiple small but complex custom e-commerce shops and handle all the tech on top of marketing, finance and customer service.

I was running on dedicated servers before. My stack is quite complicated and deploys were a nightmare. In the end the dread of deploying was slowing down the little company.

Learning and moving to k8s took me a month. I run around 25 different services (front ends, product admins, logistics dashboards, delivery route optimizers, OSRM, ERP, recommendation engine, search, etc.).

It forced me to clean up my act and structure things in a repeatable way. Having all your cluster config in one place lets you know exactly the state of every service and which version is running.

It allowed me to do rolling deploys with no downtime.

Yes it's complex. As programmers we are used to complex. An Nginx config file is complex as well.

But the more you dive into it, the more you understand the architecture of k8s and how it makes sense. It forces you to respect the twelve factors to the letter.

And yes, HA is more than nice, especially when your income is directly linked to the availability and stability of your stack.

And it's not that expensive. I pay around 400 USD a month in hosting.

maccard · a year ago
Figma were running on ECS before, so they weren't just running dedicated servers.

I'm a K8S believer, but it _is_ complicated. It solves hard problems. If you're multi-cloud, it's a no brainer. If you're doing complex infra that you want a 1:1 mapping of locally, it works great.

But if you're less than 100 developers and are deploying containers to just AWS, I think you'd be insane to use EKS over ECS + Fargate in 2024.

epgui · a year ago
I don’t know if it’s just me, but I really don’t see how kubernetes is more complex than ECS. Even for a one-man show.
mountainriver · a year ago
This just feels like a myth to me at this point. Kubernetes isn't hard, the clouds have made it so simple now that it's in no way more difficult than ECS and is way more flexible
belter · a year ago
> I run multiple small but complex custom e-commerce shops

How do you handle the lack of multi tenancy in Kubernetes?

wrs · a year ago
A migration with the goal of improving the infrastructure foundation is great. However, I was surprised to see that one of the motivations was to allow teams to use Helm charts rather than converting to Terraform. In practice I haven't seen teams consistently able to use random Helm charts unmodified, so by encouraging Helm you end up with teams forking and modifying the charts. And Helm is such a horrendous tool that you don't really want to be maintaining your own bespoke Helm charts. IMO you're actually better off rewriting in Terraform so at least your local version is maintainable.

Happy to hear counterexamples, though — maybe the “indent 4” insanity and multi-level string templating in Helm is gone nowadays?

cwiggs · a year ago
Helm Charts and Terraform are different things IMO. Terraform is better suited to deploying cloud resources (S3 bucket, EKS cluster, EKS workers, RDS, etc). Sure you can manage your k8s workloads with Terraform, but I wouldn't recommend it. Terraform keeping its own state when your state already lives in k8s makes working with Terraform + k8s a pain. Helm is purpose built for k8s, Terraform is not.

I'm not a fan of Helm either though, templated YAML sucks, and you still have the "indent 4" insanity. Kustomize is nice when things are simple, but once your app is complex Kustomize is worse than Helm IMO. Try to deploy an app that has an Ingress, with a TLS cert and external-dns, with Kustomize for multiple environments; you have to patch the resources 3 times instead of just having 1 variable you can use in 3 places (sketched below).

Helm is popular and Terraform is popular, so they both get talked about a lot, but IMO there is a tool that is yet to become popular that will replace both of them.
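For what the "1 variable in 3 places" point looks like in practice, here's a minimal Helm template sketch; the `hostname` value and the cert-manager/external-dns annotations are illustrative assumptions, not anything from this thread:

```yaml
# templates/ingress.yaml -- sketch of a chart where one value drives all three spots
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ .Release.Name }}
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod                    # assumes cert-manager
    external-dns.alpha.kubernetes.io/hostname: {{ .Values.hostname }}   # assumes external-dns
spec:
  rules:
    - host: {{ .Values.hostname }}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: {{ .Release.Name }}
                port:
                  number: 80
  tls:
    - hosts:
        - {{ .Values.hostname }}
      secretName: {{ .Release.Name }}-tls
```

Each environment then only overrides `hostname` in its own values file (values-dev.yaml, values-prod.yaml, and so on).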

wrs · a year ago
I agree, I wouldn’t generate k8s from Terraform either, that’s just the alternative I thought the OP was presenting. But I’d still rather convert charts from Helm to pretty much anything else than maintain them.
stackskipton · a year ago
The lack of variable substitution in Kustomize is downright frustrating. We use Flux so we have the feature anyway, but I wish it were built into Kustomize.
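For reference, the Flux feature mentioned here is post-build variable substitution on a Flux `Kustomization`; a rough sketch with illustrative names:

```yaml
# Flux Kustomization with post-build variable substitution (names are illustrative)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: app
  namespace: flux-system
spec:
  interval: 10m
  path: ./deploy/overlays/prod
  prune: true
  sourceRef:
    kind: GitRepository
    name: app-repo
  postBuild:
    substitute:
      cluster_env: prod          # referenced as ${cluster_env} inside the manifests
    substituteFrom:
      - kind: ConfigMap
        name: cluster-vars       # per-cluster values without touching the overlay itself
```

Plain Kustomize has no equivalent of that `${cluster_env}` style substitution, which is the gap being described.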
3np · a year ago
You can deploy your Helm charts through Terraform, even. It's been several years since, so the situation might have improved, but last I worked this way it was OK except for state drift due to gaps in the Helm TF provider. Still found it better than either by itself.
tionate · a year ago
Re your Kustomize complaint, just create a complete env-specific ingress for each env instead of patching.

- it is not really any more lines
- doesn't break if dev upgrades to a different version of the resource (has happened before)
- allows you to experiment in dev with other setups (e.g. additional ingresses, different paths, etc.) instead of changing a base config which will impact other envs

TLDR patch things that are more or less the same in each env; create complete resources for things that change more.

There is a bit of duplication but it is a lot more simple (see 'Simple Made Easy' by Rich Hickey) than tracing through patches/templates.
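A sketch of that layout, assuming a standard Kustomize base/overlay structure (paths are illustrative):

```yaml
# overlays/dev/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base       # deployment, service: shared bits that stay patch-free or lightly patched
  - ingress.yaml     # a complete, env-specific Ingress checked in per overlay, no patching
```

overlays/prod then carries its own complete ingress.yaml rather than a patch over a shared base one.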

solatic · a year ago
My current employer (BigCo) has Terraform managing both infra and deployments, at (ludicrous) scale. It's a nightmare. The problem with Terraform is that you must plan your workspaces so that you don't exceed the best-practice number of resources per workspace (~100-200), or else plans will drastically slow down your time-to-deploy, checking stuff like databases and networking that you haven't touched and have no desire to touch. In practice this means creating a latticework of Terraform workspaces that trigger each other, and there are currently no good open-source tools that support it.

Best practice as I can currently see it is to have Terraform set up what you need for continuous delivery (e.g. ArgoCD) as part of the infrastructure, then use the CD tool to handle day-to-day deployments. Most CD tooling then asks you to package your deployment in something like Helm.
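The hand-off point between the two usually ends up being something like an Argo CD `Application` that points at a Helm chart; a sketch with hypothetical repo and chart paths:

```yaml
# Argo CD Application deploying a Helm chart; repo and paths are made up
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-configs
    targetRevision: main
    path: charts/my-service
    helm:
      valueFiles:
        - values-prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service
  syncPolicy:
    automated:
      prune: true       # day-to-day deploys become git commits, not terraform applies
      selfHeal: true
```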

chucky_z · a year ago
You can set up dependent stacks in CDKTF. It's far from as clean as a standard TF DAG plan/apply, but I'm having a lot of success with it right now. If I were actively using k8s at the moment I would probably set up dependent cluster resources using this method, e.g. ensure a clean, finished CSI daemon deployment before deploying a Deployment that uses that CSI provider :)
gouggoug · a year ago
Talking about helm - I personally have come to profoundly loathe it. It was amazing when it came out and filled a much needed gap.

However it is loaded with so many footguns that I spend my time redoing and debugging other engineers' work.

I'm hoping this new tool called « timoni » picks up steam. It fixes pretty much every qualm I have with helm.

So if like me you’re looking for a better solution, go check timoni.

roryokane · a year ago
smellybigbelly · a year ago
Our team also suffered from the problems you described with public helm charts. There is always something you need to customise to make things work in your own environment. Our approach has been to use the public helm chart as-is and do any customisation with `kustomize --enable-helm`.
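For anyone who hasn't used that flow: it's driven by the `helmCharts` field in a kustomization.yaml and built with `kustomize build --enable-helm`. A rough sketch, where the chart, version, and patch file are illustrative:

```yaml
# kustomization.yaml -- inflate a public chart as-is, then layer local patches on top
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ingress-nginx
helmCharts:
  - name: ingress-nginx
    repo: https://kubernetes.github.io/ingress-nginx
    version: 4.11.2              # illustrative version
    releaseName: ingress-nginx
    valuesInline:
      controller:
        replicaCount: 2
patches:
  - path: controller-tweaks.yaml # hypothetical local customisation
```

The upstream chart never has to be forked; local changes live in the overlay.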
mnahkies · a year ago
Whilst I'll agree that writing helm charts isn't particularly delightful, consuming them can be.

In our case we have a single application/service base helm chart that provides sane defaults and all our deployments extend from. The amount of helm values config required by the consumers is minimal, and there has been very little occasion for a consumer to include their own templates - the base chart exposes enough knobs to avoid this.

When it comes to third-party charts, many we've been able to deploy as is (sometimes with some PRs upstream to add extra functionality), and occasionally we've needed to wrap/fork them. We've deployed far more third-party charts as-is than not though.

One thing probably worth mentioning w.r.t. maintaining our custom charts is the use of helm unittest (https://github.com/helm-unittest/helm-unittest) - it's been a big help to avoid regressions.

We do manage a few kubernetes resources through terraform, including Argocd (via the helm provider, which is rather slow when you have a lot of CRDs), but generally we've found helm charts deployed through Argocd to be much more manageable and productive.
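For context on the helm unittest mention above, a test file looks roughly like this; the paths and values assume a hypothetical chart and are only illustrative:

```yaml
# tests/deployment_test.yaml -- helm-unittest suite for a hypothetical chart
suite: deployment defaults
templates:
  - deployment.yaml
tests:
  - it: uses the image tag from values
    set:
      image.tag: 1.2.3
    asserts:
      - equal:
          path: spec.template.spec.containers[0].image
          value: registry.example.com/my-service:1.2.3   # assumes repo:tag templating
  - it: defaults to two replicas
    asserts:
      - equal:
          path: spec.replicas
          value: 2
```

It runs with `helm unittest <chart-dir>`, so it slots easily into CI.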

brainzap · a year ago
Helm charts are a bit painful but they have a few critical features. Atomic deploys that roll back on failure. The ability to generate the full kubernetes definition with helm template. The ability to print out all configuration values with descriptions.

At our company we have all deployments wrapped into a flat helm chart with as little variables as possible. (I always have to fight for that because devs like to abstract helm 100 levels and end up with nothing)

BobbyJo · a year ago
Helm is quite often the default supported way of launching containerized third-party products. I have worked at two separate startups whose 'on prem' product was offered this way.
freedomben · a year ago
Indeed. I try hard to minimize the amount of Helm we use, but a significant amount of tools are only shipped as Helm charts. Fortunately I'm increasingly seeing people provide "raw k8s" yaml, but it's far from universal.
JohnMakin · a year ago
It's completely cursed, but I've started deploying helm via terraform lately. Many people, ironically me included, find that managing deployments via terraform is an anti-pattern.

I'm giving it a try and I don't despise it yet, but it feels gross - application configs are typically far more mutable and dynamic than cloud infrastructure configs, and IME, terraform does not like super dynamic configs.

xiwenc · a year ago
I'm baffled to see so many anti-k8s sentiments on HN. Is it because most commenters are developers used to services like heroku, fly.io, render.com, etc., or run their apps on VMs?
elktown · a year ago
I think some are just pretty sick and tired of the explosion of needless complexity we've seen in the last decade or so in software, and rightly so. This is an industry-wide problem of deeply misaligned incentives (& some amount of ZIRP gold rush), not specific to this particular case - if this one is even a good example of this to begin with.

Honestly, as it stands, I think we'd be seen as pretty useless craftsmen in any other field due to an unhealthy obsession with our tooling and meta-work - consistently throwing any kind of sensible resource usage out of the window in favor of just getting to work with certain tooling. It's some kind of a "Temporarily embarrassed FAANG engineer" situation.

methodical · a year ago
Fair point, but I think the key distinction here is between unnecessary complexity and necessary complexity. Are zero-downtime deployments and load balancing unnecessary? Perhaps for a personal project, but for any company with a consistent userbase I'd argue they are non-negotiable, or should be anyway. In a situation where this is the expectation, k8s seems like the simplest answer, or near enough to it.
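Concretely, the zero-downtime part is mostly two stanzas on a Deployment, a rolling-update strategy plus a readiness probe, with a Service or Ingress in front covering the load-balancing half. A sketch, where the image and endpoint are made up:

```yaml
# Deployment sketch: rolling updates + readiness gating is what buys "zero downtime"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0          # never dip below the desired replica count
      maxSurge: 1                # roll one new pod in at a time
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.0.0   # made up
          ports:
            - containerPort: 8080
          readinessProbe:        # traffic only shifts once the new pod reports ready
            httpGet:
              path: /healthz     # made-up endpoint
              port: 8080
            periodSeconds: 10
```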
cwiggs · a year ago
I agree with this somewhat. The other day I was driving home and I saw a sprinkler head that had broken on the side of the road and was spraying water everywhere. It made me think: why aren't sprinkler systems designed with HA in mind? Why aren't there dual water lines with dual sprinkler heads everywhere, with an electronic component that detects a break in a line and automatically switches to the backup water line? It's because the downside of having the water spray everywhere and the grass become unhealthy or die is less than what it would cost to deploy it HA.

In the software/tech industry it's commonplace to just accept that your app can't be down for any amount of time no matter what. No one checked how much more it would cost (engineering time & infra costs) to deploy the app so it would be HA, so no one checked whether it would be worth it.

I blame this logic on the low interest rates for a decade. I could be wrong.

darby_nine · a year ago
> It's some kind of a "Temporarily embarrassed FAANG engineer" situation.

FAANG engineers made the same mistake, too, even though the analogy implies comparative competency or value.

bobobob420 · a year ago
Any software engineer who thinks K8s is complex shouldn't be a software engineer. It's really not that hard to manage.
moduspol · a year ago
For me personally, I get a little bit salty about it due to imagined, theoretical business needs of being multi-cloud, or being able to deploy on-prem someday if needed. It's tough to explain just how much longer it'll take, how much more expertise is required, how much more fragile it'll be, and how much more money it'll take to build out on Kubernetes instead of your AWS deployment model of choice (VM images on EC2, or Elastic Beanstalk, or ECS / Fargate, or Lambda).

I don't want to set up or maintain my own ELK stack, or Prometheus. Or wrestle with CNI plugins. Or Kafka. Or high availability Postgres. Or Argo. Or Helm. Or control plane upgrades. I can get up and running with the AWS equivalent almost immediately, with almost no maintenance, and usually with linear costs starting near zero. I can solve business problems so, so much faster and more efficiently. It's the difference between me being able to blow away expectations and my whole team being quarters behind.

That said, when there is a genuine multi-cloud or on-prem requirement, I wouldn't want to do it with anything other than k8s. And it's probably not as bad if you do actually work at a company big enough to have a lot of skilled engineers that understand k8s--that just hasn't been the case anywhere I've worked.

drawnwren · a year ago
Genuine question: how are you handling load balancing, log aggregation, failure restart + readiness checks, deployment pipelines, and machine maintenance schedules with these “simple” setups?

Because as annoying as getting the prometheus + loki + tempo + promtail stack going on k8s is, I don't really believe that writing it from scratch is easier.

angio · a year ago
I think you're just used to AWS services and don't see the complexity there. I tried running some stateful services on ECS once and it took me hours to have something _not_ working. In Kubernetes it takes me literally minutes to achieve the same task (+ automatic chart updates with renovatebot).
mountainriver · a year ago
I hear a lot of comments that sound like people who used K8s years ago and not since. The clouds have made K8s management stupid simple at this point, you can absolutely get up and running immediately with no worry of upgrades on a modern provider like GKE
caniszczyk · a year ago
Hating is a sign of success in some ways :)

In some ways, it's nice to see companies move to use mostly open source infrastructure, a lot of it coming from CNCF (https://landscape.cncf.io), ASF and other organizations out there (on top of the random things on github).

maayank · a year ago
It’s one of those technologies where there’s merit to use them in some situations but are too often cargo culted.
tryauuum · a year ago
For me it is about VMs. I feel uneasy knowing that any kernel vulnerability will allow malicious code to escape the container and explore the kubernetes host

There are kata-containers I think, they might solve my angst and make me enjoy k8s

Overall... There's just nothing cool in kubernetes to me. Containers, load balancers, megabytes of yaml -- I've seen it all. Nothing feels interesting enough to try

stackskipton · a year ago
vs the application getting hacked and running loose on the VM?

If you have never dealt with "I have to run these 50 containers plus Nginx/Certbot while figuring out which node is best to run each one on", then yeah, I can see you not being thrilled about Kubernetes. For the rest of us though, Kubernetes helps out with that easily.

archenemybuntu · a year ago
Kubernetes itself is built around mostly solid distributed system principles.

It's the ecosystem around it which turns things needlessly complex.

Just because you have kubernetes, you don't necessarily need istio, helm, Argo cd, cilium, and whatever half baked stuff is pushed by CNCF yesterday.

For example take a look at helm. Its templating is atrocious, and if I am still correct, it doesn't have a way to order resources properly except hooks. Sometimes resource A (deployment) depends on resource B (some CRD).
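For reference, the hook mechanism is annotation-driven and is about all the ordering control Helm gives you; a sketch of a pre-install Job, where the name and image are made up:

```yaml
# A Job that Helm runs before the rest of the release is installed/upgraded
apiVersion: batch/v1
kind: Job
metadata:
  name: install-prereqs            # made-up name
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "-5"                  # lower weights run earlier
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: prereqs
          image: registry.example.com/prereqs:1.0   # made up
```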

The culture around kubernetes dictates you bring in everything pushed by CNCF. And most of this stuff is half-baked MVPs.

---

The word devops has created the expectation that backend developers should be the ones fighting kubernetes when something goes wrong.

---

Containerization is done poorly by many orgs, with no care for security or image size. That's a rant for another day. I suspect this isn't a big reason for the kubernetes hate here.

vouwfietsman · a year ago
Maybe it's normal for a company this size, but I have a hard time following much of the decision making around these gigantic migrations or technology efforts, because the decisions don't seem to come from any user or company need. There was a similar post from Figma earlier, I think around databases, that left me feeling the same.

For instance: they want to go to k8s because they want to use etcd/helm, which they can't on ECS? Why do you want to use etcd/helm? Is it really this important? Is there really no other way to achieve the goals of the company than exactly like that?

When a decision is founded on a desire of the user, it's easy to validate that downstream decisions make sense. When a decision is founded on a technological desire, downstream decisions may make sense in the context of that technical desire, but do they still make sense in the context of the user?

Either I don't understand organizations of this scale, or it is fundamentally difficult for organizations of this scale to identify and reason about valuable work.

ianvonseggern · a year ago
Hey, author here. I think you ask a good question and I think you frame it well. I agree that, at least for some major decisions - including this one - "it is fundamentally difficult for organizations of this scale to identify and reason about valuable work."

At its core we are a platform team building tools, often for other platform teams, that are building tools that support the developers at Figma creating the actual product experience. It is often harder to reason about what the right decisions are when you are further removed from the end user, although it also gives you great leverage. If we do our jobs right, the multiplier effect of getting this platform right impacts the ability of every other engineer to do their job efficiently and effectively (many indirectly!).

You bring up good examples of why this is hard. It was certainly an alternative to say sorry, we can't support etcd and helm, and you will need to find other ways to work around this limitation. These were simply two more data points helping push us toward the conclusion that we were running our Compute platform on the wrong base building blocks.

While difficult to reason about, I do think it's still very worth trying to do this reasoning well. It's how, as a platform team, we ensure we are tackling the right work to get to the best platform we can. That's why we spent so much time making the decision to go ahead with this, and part of why I thought it was an interesting topic to write about.

felixgallo · a year ago
I have a constructive recommendation for you and your engineering management for future cases such as this.

First, when some team says "we want to use helm and etcd for some reason and we haven't been able to figure out how to get that working on our existing platform," start by asking them what their actual goal is. It is obscenely unlikely that helm (of all things) is a fundamental requirement to their work. Installing temporal, for example, doesn't require helm and is actually simple, if it turns out that temporal is the best workflow orchestrator for the job and that none of the probably 590 other options will do.

Second, once you have figured out what the actual goal is, and have a buffet of options available, price them out. Doing some napkin math on how many people were involved and how much work had to go into it, it looks to me that what you have spent to completely rearchitect your stack and operations and retrain everyone -- completely discounting opportunity cost -- is likely not to break even in even my most generous estimate of increased productivity for about five years. More likely, the increased cost of the platform switch, the lack of likely actual velocity accrual, and the opportunity cost make this a net-net bad move except for the resumes of all of those involved.

Spivak · a year ago
> we can't support etcd and helm and you will need to find other ways to work around this limitation

So am I reading this right that either downstream platform teams or devs wanted to leverage existing helm templates to provision infrastructure, being on ECS locked you out of those, and the water eventually boiled over? If so, that's a pretty strong statement about the platform effect of k8s.

vouwfietsman · a year ago
Hi! Thanks for the thoughtful reply.

I understand what you're saying, the thing that worries me though is that the input you get from other technical teams is very hard to verify. Do you intend to measure the development velocity of the teams before and after the platform change takes effect?

In my experience it is extremely hard to measure the real development velocity (in terms of value-add, not arbitrary story points) of a single team, not to mention a group of teams over time, not to mention as a result of a change.

This is not necessarily criticism of Figma, as much as it is criticism of the entire industry maybe.

Do you have an approach for measuring these things?

WaxProlix · a year ago
People move to K8s (specifically from ECS) so that they can use cloud provider agnostic tooling and products. I suspect a lot of larger company K8s migrations are fueled by a desire to be multicloud or hybrid on-prem, mitigate cost, availability, and lock-in risk.
zug_zug · a year ago
I've heard all of these lip-service justifications before, but I've yet to see anybody actually publish data showing how they saved any money. Would love to be proven wrong by some hard data, but something tells me I won't be.
OptionOfT · a year ago
Flexibility was a big thing for us. Many different jurisdictions required us to be conscious of where exactly data was stored & processed.

K8s makes this really easy. You don't need to worry whether country X has a local cloud data center from Vendor Y.

Plus it makes hiring so much easier as you only need to understand the abstraction layer.

We don't hire people for ARM64 or x86. We have abstraction layers. Multiple even.

We'd be fooling us not to use them.

teyc · a year ago
People move to K8s so that their resumes and job ads are cloud provider agnostic. People's careers stagnate when their employers build on home-baked tech, or on specific offerings from cloud providers. Employers find moving to a common platform makes recruiting easier.
fazkan · a year ago
This. Most of it, I think, is to support on-prem and cloud flexibility. Also, from the customer's point of view, they can now sell the entire Figma "box" to regulated industries for a premium.


timbotron · a year ago
there's a pretty direct translation from ECS task definition to docker-compose file
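Roughly, assuming a single-container task definition; every value here is made up and the mapping is approximate (awslogs, IAM roles and service discovery have no direct Compose equivalent):

```yaml
# docker-compose.yml roughly mirroring a single-container ECS task definition
services:
  web:
    image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/web:1.0.0
    ports:
      - "8080:8080"               # ~ portMappings
    environment:                  # ~ containerDefinitions[].environment
      LOG_LEVEL: info
    deploy:
      resources:
        limits:
          cpus: "0.5"             # ~ cpu: 512
          memory: 1024M           # ~ memory: 1024
    healthcheck:                  # ~ healthCheck
      test: ["CMD", "curl", "-f", "http://localhost:8080/healthz"]
      interval: 30s
```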
Flokoso · a year ago
Managing 500 or more VMs is a lot of work.

The VM upgrades alone, plus auth, backups, log rotation, etc.

With k8s I can give everyone a namespace, policies, volumes, and have automatic log aggregation thanks to DaemonSets and k8s/cloud-native stacks.

Self healing and more.

It's hard to describe how much better it is.

lmm · a year ago
> For instance: they want to go to k8s because they want to use etcd/helm, which they can't on ECS? Why do you want to use etcd/helm? Is it really this important? Is there really no other way to achieve the goals of the company than exactly like that?

I'm no fan of Helm, but there are surprisingly few good alternatives to etcd (i.e. highly available but consistent datastores, suitable for e.g. the distributed equivalent of a .pid file) - Zookeeper is the only one that comes to mind, and it's a real pain on the ops side of things, requiring ancient JVM versions and being generally flaky even then.

friendly_deer · a year ago
Here's a theory about why at least some of these come about:

https://lethain.com/grand-migration/

wg0 · a year ago
If you haven't broken down your software into 50+ different separate applications written in 15 different languages using 5 different database technologies - you'll find very little use for k8s.

All you need is a way to roll out your artifact to production in a rolling or blue-green fashion, after preparations such as any required database alterations, be it data or schema wise.

imiric · a year ago
> All you need is a way to roll out your artifact to production in a rolling or blue-green fashion, after preparations such as any required database alterations, be it data or schema wise.

Easier said than done.

You can start by implementing this yourself and thinking how simple it is. But then you find that you also need to decide how to handle different environments, configuration and secret management, rollbacks, failover, load balancing, HA, scaling, and a million other details. And suddenly you find yourself maintaining a hodgepodge of bespoke infrastructure tooling instead of your core product.

K8s isn't for everyone. But it sure helps when someone else has thought about common infrastructure problems and solved them for you.

javaunsafe2019 · a year ago
But you do know which problems the k8s abstraction solves, right? Because it has nothing to do with how many languages or services you have, but with things like discovery, scaling, failover and automation …
mplewis · a year ago
Yeah, all you need is a rollout system that supports blue-green! Very easy to homeroll ;)
samcat116 · a year ago
> I have a hard time following much of the decision making around these gigantic migrations or technology efforts because the decisions don't seem to come from any user or company need

I mean the blog post is written by the team deciding the company needs. They explained exactly why they can't easily use etcd on ECS due to technical limitations. They also talked about many other technical limitations that were causing them issues and increasing cost. What else are you expecting?


dijksterhuis · a year ago
> When applied, Terraform code would spin up a template of what the service should look like by creating an ECS task set with zero instances. Then, the developer would need to deploy the service and clone this template task set [and do a bunch of manual things]

> This meant that something as simple as adding an environment variable required writing and applying Terraform, then running a deploy

This sounds less like a problem with ECS and more like an overcomplication in how they were using terraform + ECS to manage their deployments.

I get the generating templates part for verification prior to live deploys. But this seems... dunno.

ianvonseggern · a year ago
Hey, author here, I totally agree that this is not a fundamental limitation of ECS and we could have iterated on this setup and made something better. I intentionally listed this under work we decided to scope into the migration process, and not under the fundamental reasons we undertook the migration because of that distinction.
roshbhatia · a year ago
I'm with you here -- ECS deploys are pretty painless and uncomplicated, but I can picture a few scenarios where this ends up being necessary. For example, if they have a lot of services deployed on ECS, it ends up bloating the size of the Terraform state. That'd slow down plans and applies significantly, which makes sharding the Terraform state by literally cloning the configuration based on a template a lot safer.
freedomben · a year ago
> ECS deploys are pretty painless and uncomplicated

Unfortunately in my experience, this is true until it isn't. Once it isn't true, it can quickly become a painful blackbox debugging exercise. If your org is big enough to have dedicated AWS support then they can often get help from engineers, but if you aren't then life can get really complicated.

Still not a bad choice for most apps though, especially if it's just a run-of-the-mill HTTP-based app

wfleming · a year ago
Very much agree. I have built infra on ECS with terraform at two companies now, and we have zero manual steps for actions like this, beyond “add the env var to a terraform file, merge it and let CI deploy”. The majority of config changes we would make are that process.
dijksterhuis · a year ago
Yeah.... thinking about it a bit more, I just don't see why they didn't set up their CI to deploy a short-lived environment on a push to a feature branch.

To me that seems like the simpler solution.

datadeft · a year ago
> Migrating onto Kubernetes can take years

What the heck am I reading? For whom? I am not sure why companies even bother with such migrations. Where is the business value? Where is the gain for the customer? Is this one of those "L'art pour l'art" projects that Figma does just because they can?

kevstev · a year ago
FWIW... I was pretty taken aback by this statement as well - and also the "brag" that they moved onto K8s in less than a year. At a very well established firm, ~30 years old and with the baggage that comes with it, we moved to K8s in far less time - though we made zero attempt to move everything to k8s, just stuff that could benefit from it. Our pitch was more or less: move to k8s, and when we do the planned datacenter move at the end of the year, you don't have to do anything aside from a checkout. Otherwise you will have to redeploy your apps to new machines or VMs and deal with all the headache around that. Or you could just containerize now if you aren't already and we take care of the rest. Most migrated and were very happy with the results.

There were plenty of services that were latency sensitive or in the HPC realm where it made no sense to force a migration, though, and there was no attempt to force them to shoehorn in.

xorcist · a year ago
It solves the "we have recently been acquired and have a lot of resources that we must put to use" problem.
tedunangst · a year ago
How long will it take to migrate off?
codetrotter · a year ago
It’s irreversible.
hujun · a year ago
Depends on how much "k8s native" code you have. There are applications designed to run on k8s which use a lot of the k8s API, and if your app is already micro-serviced, it is also not straightforward to change it back.
breakingcups · a year ago
I feel so out of touch when I read a blog post which casually mentions 6 CNCF projects with kool names that I've never heard of, for gaining seemingly simple functionality.

I'm really wondering if I'm aging out of professional software development.

renewiltord · a year ago
Nah, there's lots of IC work. It just means that you're unfamiliar with one approach to org scaling: abstracting over hardware, logging, and retries, handled by a platform team.

It’s not the only approach so you may well be familiar with others.