I went through sweat and tears with this on different projects. People wanting to be cool because they use hype-train tech end up doing things of unbelievably bad quality because "hey, we are not that many in the team" but "hey, we need infinite scalability". Teams immature to the point of not understanding what LTS means have decided that they needed Kubernetes just because. I could go on.
I currently have distilled, compact Puppet code to create a hardened VM of any size on any provider that can run one or more Docker services, run a Python backend directly, or serve static files. With this I create a service on a Hetzner VM in 5 minutes whether the VM has 2 cores or 48 cores, and control the configuration in source-controlled manifests while monitoring configuration compliance with a custom Naemon plugin. A perfectly reproducible process. The startup kids are meanwhile building snowflakes in the cloud, spending many KEUR per month to have something that is worse than what devops pioneers were able to do in 2017. And the stakeholders are paying for this ship.
I wrote a more structured opinion piece about this, called "The Emperor's New Clouds":
I started my career in a world where we did everything using shell scripts running directly on bare metal servers, usually running Solaris, and later SuSe or RedHat. I never understood the "how would you reproduce your setup without Docker (or X, where X is some other technology)". The scripts were deterministic. The dependency versions were locked. The configurations were identical. The input arguments were identical. The order of execution was identical. It all ran on a deterministic computational device. How could it not be reproducible?
Well that's exactly the point! Creating complex cloud resources with, for instance, Terraform, is less reproducible than a shell script on an LTS system like Ubuntu or RHEL - that's because the cloud provider interfaces drift and from time to time stop accepting the Terraform manifests that previously worked. And to fix it, you have to interrupt your normal work for yet another unplanned intervention in the Terraform code - this happened to my teams several times.
This does not happen with Puppet + Linux, because LTS distributions have a long release cycle where compatibility is not broken.
I tried to explain this topic in the article linked above. Not sure how far I succeeded.
You said it: your versions were locked. Therefore they were not constantly up to date.
I was pinched by this myself: security.
- With the cloud threats, everything needs to be constantly up-to-date. Docker images make it easier than permanent servers that need to be upgraded. We used to upgrade every week; now we're upgraded by default. So yes, sometimes our images don't start with the latest version of xyz. But this is rare, downgrading is easy with Docker, and reproduction on a dev machine is easier.
- With the cloud threats, everything needs to be isolated. Docker makes it easy to have an Alpine with no other executable than strictly necessary, and only open ports to the required services.
I hate the cloud because 4GB/2CPU should be way enough to run extremely large workloads, but I had to admit that convenience made me switch.
To be fair there's real issues with this approach, too. For example, shell scripts aren't actually very portable. GNU awk vs nawk vs... multiply that by all your tools, and yeah those scripts don't run deterministically (they rely too much on the environment). This alone was a big reason why systemd exists today.
But there's a middle ground here too. To me there's a HUGE gap between Kubernetes distributed systems and shell script free for all.
reproducibility isn't just for your deployments, it's for development too. got old REAL fast when your fancy build doesn't work the same on every dev's device or some one-off issue with how your dev has set up their environment steals hours from everyone.
it was a big reason why we moved to containers at the bare minimum, because it's quick and easy to spin up and destroy and you are guaranteed what runs locally runs on prod. no more "well it worked on my system!".
Wouldn't there be slight differences in different Unix flavors, so that the script couldn't run in all of them? If it only worked on Solaris, what would happen if Solaris retired? (Like what happened to CentOS)
I feel like Kubernetes is always randomly mentioned in rants like this. Instead of saying your hardened VM has Docker you could have just said it has kubelet on it. Then instead of a bunch of ad hoc "docker services" you could pay pennies for a k8s control plane that gives you control over everything on those VMs. I fail to see how your way is anything but worse.
The bad cloud infrastructure is when people try to use every single thing AWS sells and their whole infrastructure is at super high levels of abstraction that they could never migrate to another platform. K8s isn't that at all.
Unfortunately in air-gapped systems you cannot simply pay pennies for a managed k8s platform. In these cases you have to bootstrap and manage k8s on your own in your data centers. While I do not think bootstrapping and managing a cluster is difficult at all (especially if you only handle stateless workloads), it may still not fit or integrate well with a company's overall management infrastructure.
While I am a happy cloud infrastructure user in private, I have to go through some extra hoops to deploy applications at work, regardless of whether k8s is used or not.
I think in either case, if you already have code that's done, using that is going to be less effort than switching.
However, I ran kubeadm on a Hetzner server and it's just sat chugging along forever, basically. I use the cluster to run ephemeral apps where I build and deploy 1 Golang service and a couple of Node services in about 60 seconds (with cache, obviously).
As someone old enough and skilled enough to do the same with Puppet, why bother, when k8s is so much simpler and easier that even the kids who don't understand TLS can do it?
With k8s you get a way of saying 'WHAT YOU WANT' without 'HOW TO DO IT', and this applies not only to the actual infra aspect, but to the people maintaining it too. Any cloud platform and devops worth their salt can maintain a k8s system. Good luck finding someone who understands what that 'custom Naemon' plugin is doing.
The questions are short but the answers would be long. Puppet manages all fine grained OS resources (files, dirs, repos, cronjobs, sudo declarations, firewall rules, etc) and you aggregate those resources into classes which are then pushed to different machines. The classes are parametrizable for the differences between systems.
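As a rough illustration of what aggregating fine-grained resources into a parametrizable class can look like, here is a minimal Puppet sketch; the class name, parameters, and resource values are all made up for illustration, not taken from the author's code:

```puppet
# Hypothetical sketch: fine-grained OS resources aggregated into a
# parametrizable class that can be pushed to different machines.
class hardened_base (
  Integer $ssh_port      = 22,
  Boolean $enable_docker = false,
) {
  file { '/etc/motd':
    ensure  => file,
    content => "Managed by Puppet - do not edit by hand\n",
  }

  cron { 'nightly-backup':
    command => '/usr/local/bin/backup.sh',
    hour    => 2,
    minute  => 0,
  }

  if $enable_docker {
    package { 'docker.io':
      ensure => installed,
    }
    service { 'docker':
      ensure  => running,
      enable  => true,
      require => Package['docker.io'],
    }
  }
}
```

Differences between systems are then expressed by instantiating the class with different parameter values per node.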
If I was to write an idempotent script for each native resource I would finish in some years :-)
You choose whatever monitoring system you like the most.
For offline nodes you use whatever the level of criticality of your node justifies. This is something people struggle to understand: not every business needs 99.99% uptime. That said, I never had a downtime in Hetzner. On DigitalOcean I had one short forced reboot in 4 years. YMMV, so protect yourself as much as necessary.
Deploying on a different provider than Hetzner is the same as deploying on Hetzner, except the part of launching the machine, which is trivial to script - the added value is in making the machine work, and Ubuntu/Debian/RHEL are the same everywhere. You don't have vendor lock-in with this.
If K8s works for you, enjoy it. Nobody is telling you to stop :-)
Serious question for you, why use Docker at all? You can just get rid of the clunky overhead.
You mentioned a Python backend, so literally just replicate the build script directly on the VPS: "pip install -r requirements.txt" > "python main.py" > "nano /etc/systemd/system/myservice.service" > "systemctl start myservice" > Tada.
You can scale instances by just throwing those commands in a bash script (build_my_app.sh) = your new Dockerfile... install on any server in xx-xxx seconds.
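A sketch of what such a script could look like, assuming a venv-based Python app; the service name, paths, and unit contents are placeholders, and the actual deploy steps are left commented out since they touch the system:

```shell
#!/usr/bin/env bash
# build_my_app.sh - hypothetical pip + systemd deploy, no Docker.
set -euo pipefail

APP_NAME="${1:-myservice}"
APP_DIR="${2:-/opt/myservice}"

# Render a minimal systemd unit for the app.
render_unit() {
  local name="$1" dir="$2"
  cat <<EOF
[Unit]
Description=${name}
After=network.target

[Service]
WorkingDirectory=${dir}
ExecStart=${dir}/venv/bin/python ${dir}/main.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

deploy() {
  python3 -m venv "${APP_DIR}/venv"
  "${APP_DIR}/venv/bin/pip" install -r "${APP_DIR}/requirements.txt"
  render_unit "${APP_NAME}" "${APP_DIR}" > "/etc/systemd/system/${APP_NAME}.service"
  systemctl daemon-reload
  systemctl enable --now "${APP_NAME}"
}

# Uncomment to actually deploy:
# deploy
```

Splitting the unit rendering into a function keeps it reusable when you run several services on one box.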
I mentioned Docker because it interests many developers but on VMs that I control I do not need Docker at all. Deploying with Docker provides host OS independence which is nice if you are distributing but unnecessary if the host is yours, running a fixed OS.
For Python backends I often deploy the code directly with a Puppet resource called vcsrepo, which basically places a certain tag of a certain repo at a certain filesystem location. And I also package the systemd scripts for easy start/stop/restart. You can do this with other config management tools, via bash, or by hand, depending on how many systems you manage.
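The resource comes from the puppetlabs/vcsrepo module; a minimal sketch of pinning a tag, where the path, repo URL, and tag are placeholders:

```puppet
# Hypothetical: deploy exactly one tagged release straight from git.
vcsrepo { '/opt/myapp':
  ensure   => present,
  provider => git,
  source   => 'https://git.example.com/myapp.git',
  revision => 'v1.2.3',  # the tag to check out
}
```

Bumping `revision` and re-running the agent is then the whole deploy.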
What bothers me with your question is Pip :-) But perhaps that is off topic...?
We have a simple cloud infrastructure. Last year, we moved all our legacy apps to a Docker-based deployment (we were already using Docker for newer stuff). Nothing fancy—just basic Dockerfile and docker-compose.yml.
Advantages:
- Easy to manage: we keep a repo of docker-compose.yml files for each environment.
- Simple commands: most of the time, it’s just "docker-compose pull" and "docker-compose up."
- Our CI pipeline builds images after each commit, runs automated tests, and deploys to staging for QA to run manual tests.
- Very stable: we deploy the same images that were tested in staging. Our deployment success rate and production uptime improved significantly after the switch—even though stability wasn’t a big issue before!
- Common knowledge: everyone on our team is familiar with Docker, and it speeds up onboarding for new hires.
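For anyone curious what "nothing fancy" means in practice, a hypothetical per-environment compose file along those lines might look like this (service names, registry, and ports are made up):

```yaml
# Hypothetical docker-compose.yml kept per environment;
# image tags are the ones that passed tests in staging.
services:
  web:
    image: registry.example.com/myapp:1.4.2
    ports:
      - "8080:8080"
    env_file: .env.production
    restart: unless-stopped
  worker:
    image: registry.example.com/myapp-worker:1.4.2
    depends_on:
      - web
    restart: unless-stopped
```

Deploying then really is just pulling the new tags and re-upping the stack.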
I think a lot of (justifiable) Docker use comes out of being forced to use other tools & ecosystems that are fundamentally messy and not really intended for galactic-scale enterprise development.
I have found that going all-in with certain language/framework features, such as self-contained deployments, can allow for really powerful sidestepping of this kind of operational complexity.
If I was still in a situation where I had to ensure the right combination of runtimes & frameworks are installed every time, I might be reaching for Docker too.
Python, Ruby, and to a much larger extent PHP are the Docker showcase!
For example, if you have a program that uses WSGI and runs on Python 2.7, and another WSGI program that runs on Python 3.x, you will absolutely need 2 different web servers to run them.
You can give different ports to both, and install an nginx on port 80 with a reverse proxy. But software tends to come with a lot of assumptions that make ops hard, and they will often not like your custom setup... but they will almost certainly like a normal docker setup.
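The custom setup described above could be sketched like this; the domain, paths, and backend ports are placeholders, assuming each WSGI app serves on its own localhost port:

```nginx
# Hypothetical nginx front on port 80, fanning out to two app servers
# that run under different Python versions.
server {
    listen 80;
    server_name example.com;

    location /legacy/ {
        # the Python 2.7 WSGI app, e.g. behind gunicorn on :8001
        proxy_pass http://127.0.0.1:8001/;
    }

    location / {
        # the Python 3 WSGI app on :8002
        proxy_pass http://127.0.0.1:8002/;
    }
}
```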
It seems unobvious, but Docker always saves you. It's actually quicker than running "pip install -r requirements.txt" once you get a year in. (Trust me, I used to take your approach.)
Forget about "clunky overhead" - the running costs are < 10%. The Dockerfile? You don't even need one. You can just pull the Python version you want, e.g. python:3.11, and git pull your files inside the container to get up and running. You don't need container image saving systems, you don't need to save images or tag anything, you don't need to write setup scripts in the Dockerfile, and you can pass the database credentials through the environment option when launching the container.
The problem is that after a year or two you get clashes or weird stuff breaking. And modules drop support for your Python version, preventing you from installing new ones. Case in point: Google's AI module (needed for Gemini and lots of their AI API services) only works on 3.10+. What if you started in 2021? Your Python - then cutting edge - would not work anymore, and it's only 3.5 years since that release. Yeah, you can use loads of curl. Good luck maintaining that for years though.
- Numpy 1.19 is calling np.warnings but some other dependency is using Numpy 1.20, which removed .warnings and made it .notices or something.
- Your cached model paths for transformers changed their default directory.
- You update the dependencies and it seems fine, then on a new machine you try to update them, and bam, wrong Python version: you are on 3.9 and remote is 3.10, so it's all breaking.
It's also not simple in the following respect: your requirements.txt file will potentially have dependency clashes (despite the code running), and might take ages to install on a 4GB VM (especially if you need pytorch because some AI module that makes life 10x easier rather needlessly requires it).
life with docker is worth it. i was scared of it too, but there are a few key benefits for the everyman / solodev:
- Literally docker export the running container as a .tar to install it on a new VM. That's one line, and you're guaranteed the exact same environment, no changes. That's what you want, no risks.
- Backup is equally simple: shell script to download regular backups. Update is simple: shell script to update the git repo within the container. You can docker export it to investigate bugs without affecting the production running container, giving you an instant local dev environment as needed.
- When you inevitably need to update Python, you can just spin up a new container with the same port mapping on Python 3.14 or whatever and create an internal API to communicate; the two containers can share resources but run different Python versions. How do you handle this with your solution in 4 years' time?
- If you need to rapidly scale, your shell script could work fine, I'll give you that. But probably it takes 2 minutes to start on each VM. Do you want a 2 minute wait for your autoscaling? No you want a docker image / AMI that takes 5 seconds for AWS to scale up if you "hit it big".
Sorry, but you've got no idea what you're talking about.
You can also run OCI images, often called Docker images, directly via systemd's nspawn, because Docker doesn't create overhead by itself - at its heart it's a wrapper around kernel features and iptables.
You didn't need docker for deployments, but let's not use completely made up bullshit as arguments, okay?
I'm with you, but for me Cloud does have one major benefit:
If you use it as IaaS, it's a lot quicker to get prototypes working than with anything else, including VPSes from other providers.
Google Cloud in particular has very few vectors for lock-in, and follows the principle of least surprise more closely.
But once you have prototyped, you should ask the question about rebuilding it somewhere that is cheaper.
Near infinite scalability of disk drives is nice, and snapshotting, and cloud in general can allow you to extend your prototype into taking production load and allowing you to measure what you will need; but leaning in to "cloud magick" (cloud run, lambdas, etc) will consume almost as much time to learn and debug as just doing it the old school way anyway. In my lived experience.
I am not against the cloud. VMs are also cloud, unless you run them on your own servers. For instance, the Hetzner Cloud (mostly VMs, plus load balancers and disks) is so cheap and has such a nice CLI API that it competes aggressively with dedicated servers - I would definitely start any new project with VMs, not with iron.
The biggest problem is the so-called cloud native stuff, which is both more expensive and more complex. There are contexts where it makes sense, but for startups it does more harm than good.
Apart from the operation side, there is a development side parallel too.
Two examples that I came across:
- "Tested" means: if it passes on CI, it is good. Tests failing locally? Who does development locally anyway?
- Teams so reliant on "AI" because this is the future of coding. "how to sort a list in python" became a prompt, rather than a lookup in the official documentation.
I’ve just recently gotten into ansible and find myself building the same thing. I wrote a script to interact with virsh and build vms locally so I can spin up my infra at home to test and deploy to the cloud if and when I want to spend actual money.
I’m still very much an ansible noob, but if you have a repo with playbooks I’d love to poke around and learn some things! If not, no worries, I appreciate your time reading this comment!
> while monitoring configuration compliance with a custom Naemon plugin.
While I absolutely agree with you and your approach, would you mind elaborating what kind of configuration compliance you are referring to in this statement? I suppose you do not mean any kind of configuration that your Puppet code produces as that configuration is "monitored", or rather managed, by Puppet.
I don't mind elaborating - the fact that people are asking me questions reminds me that I need to invest a bit more effort on some articles.
This case is actually pretty simple.
Puppet applies the configuration you declare idempotently when you run the Puppet agent: whatever is not configured gets configured, whatever is already configured remains the same.
If there is an error, the return code of the Puppet agent is different from that of the situations above.
Knowing this, you can choose to trigger the Puppet agent runs remotely from a monitoring system (instead of periodic local runs), collect the exit code, and monitor the status of that exit code inside the monitoring system.
Therefore, instead of having an agent that runs silently, leaving you logs to parse, you have a green light / red light system with regard to the compliance of a machine with its manifest. If somebody broke the machine, leaving it in an unconfigurable state, or if someone broke its manifest during configuration maintenance, you will soon get a red light and the corresponding notifications.
This is active configuration management rather than what people usually call provisioning.
Of course you need an SSH connection for this execution, and with that you need hardened SSH config, whitelisting, a dedicated unprivileged user for monitoring, exceptional fine-grained sudo cases, etc. Not rocket science.
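I don't know what the author's Naemon plugin actually looks like, but a minimal check along these lines could map the results of `puppet agent --test --detailed-exitcodes` (0 = no changes, 2 = changes applied, 4 = failures, 6 = changes plus failures, 1 = run failed) to standard plugin states; the mapping below is one reasonable choice, not the author's:

```shell
#!/usr/bin/env bash
# Hypothetical Naemon/Nagios-style plugin: translate the puppet agent's
# detailed exit code into plugin states (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).
check_puppet_compliance() {
  local puppet_rc="$1"
  case "$puppet_rc" in
    0) echo "OK - catalog applied, no changes needed";    return 0 ;;
    2) echo "OK - catalog applied, drift was corrected";  return 0 ;;
    4) echo "CRITICAL - resource failures during apply";  return 2 ;;
    6) echo "CRITICAL - changes applied but some failed"; return 2 ;;
    1) echo "CRITICAL - puppet run failed entirely";      return 2 ;;
    *) echo "UNKNOWN - unexpected exit code $puppet_rc";  return 3 ;;
  esac
}

# In the real check you would do something like:
#   ssh monitor@host sudo puppet agent --test --detailed-exitcodes
#   check_puppet_compliance "$?"
```

The monitoring system then shows green when the node converges cleanly and red when the run fails or resources error out.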
I can't remember the last time I've seen a position description for a software developer (or anything tech related for that matter) that didn't include a requirement for skills in some cloud related tech.
Sometimes the job descriptions are boastful in their reference to those technologies, and other times you can detect some level of despair.
Basically doing this for a small startup - there are some complexities around autoscaling task queues with gpus and whatnot, but the heart of it is on a single VM (nginx, webapp, postgres, redis). We're b2b, so there's very little traffic anyway.
The additional benefit is devs can run all the same stuff on a Linux laptop (or Linux VM on some other platform) - and everyone can have their own VM in the cloud if they like to demo or test stuff using all the same setup. Bootstrapping a new system is checking in their ssh key and running a shell script.
Easy to debug, not complex or expensive, and we could vertically scale it all quite a ways before needing to scale horizontally. It's not for everyone, but seed stage and earlier - totally appropriate imo.
It's one level of indirection away from "check in a public key" in that the user can rotate their own keys without needing git churn
Also, and I recognize this is departing quite a bit from what you were describing, ssh key leases are absolutely awesome because it addresses the offboarding scenario much better than having to reconcile evicting those same keys: https://github.com/hashicorp/vault/blob/v1.12.11/website/con... and while digging up that link I also discovered that Vault will allegedly do single-use passwords, too <https://github.com/hashicorp/vault/blob/v1.12.11/website/con...>, but since I am firmly in the "PasswordLogin no" camp, caveat emptor with that one
True, I use it mainly for a few convenience things - holding ephemeral monitoring data, distributed locks, redis streams for some pub/sub stuff, sorted sets can be handy - things I could do in Postgres, but are a bit simpler in Redis.
I like this, but one of the issues with this approach is that if you rely on a traditional configuration management tool instead of Docker images, you are in for a world of pain.
Docker and Docker images have tons of best practices already defined for plenty of use cases. If it's already containerized, then jumping to any orchestrator that supports OCI images is more about adjusting the business to a new set of operations.
I have a custom deployment system which idempotently configures an Ubuntu LTS VM. All the config templates are checked into source control. I don't configure anything by hand - it's either handled in this thing or via a small user-data script run at provisioning time.
Like everything, it's context dependent, but wowzers my life has improved so much since I got on board the Flatcar or Bottlerocket train of immutable OS. Flatcar (née CoreOS) does ship with docker but is still mostly a general purpose OS but Bottlerocket is about as "cloud native" as it comes, shipping with kubelet and even the host processes run in containers. For my purposes (being a k8s fanboy) that's just perfect since it's one less bootstrapping step I need to take on my own
Both are Apache 2 and the Flatcar folks are excellent to work with
I've been running my SaaS first on a single server, then after getting product-market fit on several servers. These are bare-metal servers (Hetzner). I have no microservices, I don't deal with Kubernetes, but I do run a distributed database.
All in all, this approach is ridiculously effective: I don't have to deal with complexity of things like Kubernetes, or with cascading system errors that inevitably happen in complex systems. I save on development time, maintenance, and on my monthly server bills.
The usual mantra is "but how do we scale" — I submit that 1) you don't know yet if you will need to scale, and 2) with those ridiculously powerful computers and reasonable design choices you can get very, very far with just 3-5 servers.
To be clear, I am not advocating that you run your business in your home closet. You still need automation (I use ansible and terraform) to manage your servers.
The scaling thing is a great boogeyman. It preys on this optimism your software is going to be so successful in such a short amount of time which people want to believe.
Scroll down to the bottom, under the section "A few considerations" and try not to laugh.
"A few considerations" turns out to be a pretty significant chunk of security work ESPECIALLY if you are storing/transmitting highly sensitive information.
How do you handle something like HIPAA compliance when you're in this situation?
There are 2 types of programmers: those that think they've seen everything and those that know they've seen next to nothing. And as such, these absolute takes are tiring.
I've written a HIPAA-compliant application that was VPS-hostable. It's been a while, but IIRC, it simply involved a combination of TLS everywhere and encrypting the sensitive fields in the DB. I don't remember if there was any other trick involved, but it wasn't difficult. By far the hardest thing about that project was the complexity of the medical codes-- not HIPAA compliance-- and that is something the cloud wouldn't help with at all.
> How do you handle something like HIPAA compliance when you're in this situation?
I'm a dev who hasn't seen anything related to that. Since you bring it up, can you give some pointers on why something like a MySQL db coupled to a monolithic backend isn't good enough? What shortcomings did you experience?
All of the things raised in the article seem possible to solve without the need for microservices.
There is a core 20% of Kubernetes - deployments, pods, services, the way it handles blue-green deployments, declarative definitions, namespace separation, etc. - that is really good. Just keeping to those simple basics, using a managed cloud Kubernetes service, and running your state (database) out of cluster is a good experience (IMO).
It's when one starts getting sucked down the "cloud native" wormhole of all these niche open source systems and operators and ambassador and sidecar patterns, etc. that things go wrong. Those are for environments with many independent but interconnecting tech teams with diverse programming language use.
For me this is all Kubernetes is. I feel like people are often talking about two different things in discussions like this. For me it's just a uniform way to deploy stuff that is better than docker compose. We pay pennies for the control plane and workers are just generic VMs with kubelet.
But I think for many, "kubernetes" means your second paragraph. It doesn't have to be like that at all! People should try setting up a k3s cluster and just learn about workloads, services and ingresses. That's all you need to replace a bunch of ad hoc VMs and docker stuff.
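For reference, that "workloads plus services" core can be this small; the app name, image, and ports below are placeholders, not from any real cluster:

```yaml
# Hypothetical minimal k8s workload: one Deployment with a Service in front.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels: { app: myapp }
  template:
    metadata:
      labels: { app: myapp }
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.4.2
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector: { app: myapp }
  ports:
    - port: 80
      targetPort: 8080
```

Add an Ingress on top and you have rolling deployments, restarts, and load balancing without touching any of the "cloud native" extras.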
For a lot of companies and projects I worked on, this is the same conclusion I came to. 99% of the time, all we need / want is docker-compose++. Things like 0-downtime deployment out of the box, a simple configuration system for replica sets and other replication / distribution mechanisms, and that is basically it.
I wish there was something that did just that, because kube comes with a lot of baggage, and docker-compose is a bit too basic for some important production needs.
Exactly this. Kubernetes has a million knobs and dials you can tweak for any use case you want, but equally they can be ignored and you can use the core functionality and keep it simple.
I can have something with nice deployments, super easy logs and metrics, and a nice developer experience setup in no time at all.
Yeah, I found out my work was using Kubernetes. Given its reputation - having never used it before - when I asked if I could set up a server for some internal tooling, I was braced for the worst.
What I actually got was a half an hour tutorial from the guy who set it up, in which he explained the whole concept (I had no clue) and gave me enough information to deploy a server, which I did with zero problems. I had automatic deployment from `git push` working very quickly.
To me this seemed like a no brainer. Unless you literally have one service this is waaay easier to use.
Granted I didn't have to set it up - maybe that's where the terrible reputation comes from?
Who is going to get a new job without k8s on their resume. :)
Seriously, I think a lot of people do things the hard way to learn large scale infrastructure. Another common reason is 'things will be much easier when we scale to a massive number of clients', or we can dynamically scale up on demand.
These are all valid to the people building this, just not as much to founders or professional CTOs.
Excuse my harshness but people doing it needlessly is just unprofessional waste and abuse.
Some people seem to have no concern with the needs and timetables of the would be customers but instead burn through cash building fancy nonsense.
It's like going in to a car mechanic for tires and then finding out it took 3 weeks because the guy wanted to put on low rider hydraulics and spinner hubcaps for his personal enrichment.
The worst part is it's inherently ambiguous to the next people. They don't know if the reason something is there is because it's needed or because it's just shiny bling.
I am certainly not saying that what you say is untrue. My comment is dark humour. I really like your last point. Years ago I replaced a huge Hadoop cluster data processing job with a single app on one machine with a few CPUs, which reduced a job that took over 8 hours to 20 minutes. What is even dumber: it was just a Python script and GNU parallel, which used to be perl.
…but if the bosses at competing mechanic shops hire based on quality of low riders a mechanic can install, of course they'll practice on the paying customers.
Just take a look at the level of complexity in home lab subreddits!
I don’t quite get if people do it for interest, for love of the tech, or if they are technocratic and believe in levelling up their skill to get k8s on their CV like you say.
K8s is painful to get started, and painful to learn. But once you have it up you can just keep adding stuff to it.
I run a k8s cluster at home. Part of it yes, is to apply my existing skills and keep them fresh. But part of it is that kubernetes can be easier long term.
I've got magical hard drive storage with Rook Ceph. I can yoink a hard drive out of my servers and nothing happens to my workloads.
I can do maintenance on one of the servers with 0 down time.
All of my config for what I have deployed is in git.
I manage VMs and Kubernetes at work, and I'm not going to pretend that Kubernetes isn't complex, but it's complex up front instead of down the road. VMs run into complexity when things change. I'm sure you can make VMs good, but then why not use something like Kubernetes? You will have to reinvent a lot of the stuff that's already in Kubernetes.
It's a hammer for sure and not everything is a nail, but it can be really powerful and useful even for home labs.
K8s is painful to manage. It's a lot less painful than getting paged in the middle of the night because your server is down - And much much less than realizing that you've been down for an entire day and didn't notice. (K8s isn't even a complete solution to these problems! Just one part of a complete ~balanced breakfast~ production stack)
You don't need k8s for all of that, but there's not a simpler solution than k8s that handles as much.
It's because it is complex. And in the long run, things become simpler. The only difficulty is the initial setup and once you are past that, the overall maintenance workload just becomes easier compared to a single VM setup
> I think a lot of people do things the hard way to learn large scale infrastructure
Having seen some of these half-rolled, first-time-understood k8s deployments, and the multi-year projects to unravel the mess that was created, overflowing with anti-patterns and other incorrect ways of doing things, I think I would prefer a narrower scope of true experienced professionals (or at least some experienced pros that can help guide the ship for their mentees) working on and designing k8s infra.
And for those that don't need it (the vast majority of startups, small businesses, regular-sized businesses, etc), just stick to the easier-to-use paradigms out there.
Nubank, the Brazilian bank unicorn, described their approach as “if this works, it’s because we reached massive scale quickly” (paraphrased) and started with an architecture that would support that from the beginning. They were very happy with their choices and have blogged about them in detail.
This is a case where “things will be much easier when we scale to a massive number of clients” turned out to be true.
This is a retreaded and often tiresome debate. I'll still throw my 2c in...
Should you pick a complex framework from day one? Probably not, unless your team has extensive experience with it.
My objection is towards the idea that managing infrastructure with a bespoke process and custom tooling will always be less effort to maintain than established tooling. It's the idea of stubbornly rejecting the "complexity" bogeyman, even when the process you built yourself is far from simple, and takes a lot of your time from your core product anyway.
Everyone loves the simplicity of copying over a binary to a VPS, and restarting a service. But then you want to solve configuration and secret management, have multiple servers for availability/redundancy so then you want gradual deployments, load balancing, rollbacks, etc. You probably also want some staging environment, so need to easily replicate this workflow. Then your team eventually grows and they find that it's impossible to run a prod-like environment locally. And then, and then...
You're forced to solve each new requirement with your own special approach, instead of relying on standard solutions others have figured out for you. It eventually gets to a question of sunken cost: do you want to abandon all this custom tooling you know and understand, in favor of "complexity" you don't? The difficult thing is that the more you invest in it, the harder it will be to migrate away from it.
My suggestion is: start by following practices that will make your transition to the standard tooling easier later. This means deploying with containers from day 1, adopting the twelve-factor methodology, etc. And when you do start to struggle with some feature you need, switch to established tooling sooner rather than later. You're likely to find that your fear of the unknown was unwarranted, and you'll spend less time working on infra in the long run.
There's no correct answer here. Your choice seems reasonable _if_ you already have some previous familiarity with managing k8s. If not, you might want to consider starting with a managed k8s solution from a cloud provider. The bulk of the work will be containerizing your stack, and getting familiar with all the concepts. You don't want to do all that while also keeping k8s running. After that you would be able to relatively easily migrate to a self-hosted cluster if you need to.
If you do want to self-host, k3s could also be an option, like a sibling comment suggested. It's simpler to start with, though it still has a learning curve since it's a lightweight version of k8s. I reckon that you would still want to run at least 3 nodes for redundancy/failover, and maybe a couple more just for DB workloads. But you can certainly start with one to set up your workflow, and then scale out to more nodes as needed.
k3s single node + ArgoCD/Flux is what I would do if I had to build the infrastructure of a small startup by myself.
Unfortunately it's HN so people are more likely to do everything in bash scripts and say a big "fuck you" to all new hires that would have to learn their custom made mess
The other aspect of this is it's literally impossible to hire someone from industry already familiar with your home grown SDLC systems. But you can find plenty of "cloud engineers" who do understand these "complex" cloud systems who can deploy and maintain them via terraform. It's a turn-key skill set.
These are the only things I have ever been comfortable using in the cloud.
Once you get into FaaS and friends, things get really weird for me. I can't handle not having visibility into the machine running my production environment. Debugging through cloud dashboards is a shit experience. I think Microsoft's approach is closest to actually "working", but it's still really awful and I'd never touch it again.
The ideal architecture for me after 10 years is still a single VM with monolithic codebase talking to local instances of SQLite. The advent of NVMe storage has really put a kick into this one too. Backups handled by snapshotting the block storage device. Transactional durability handled by replicating WAL, if need be.
Dumbass simple. Lets me focus on the business and customer. Because they sure as hell don't care about any of this and wouldn't pay any money for it. All this code & infra is pure downside. You want as little of it as possible.
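For the SQLite side, a minimal sketch of that approach (file names are illustrative): WAL lets readers run alongside the single writer, and SQLite's online backup API gives a consistent copy to complement block-device snapshots.

```python
import sqlite3

# Enable WAL so reads don't block the single writer.
conn = sqlite3.connect("app.db")
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("INSERT INTO orders (total) VALUES (?)", (9.99,))
conn.commit()

# Take a consistent online copy without stopping the service;
# this pairs well with periodic snapshots of the block device.
backup = sqlite3.connect("app-backup.db")
conn.backup(backup)
backup.close()
conn.close()
```

For continuous WAL replication rather than point-in-time copies, a tool in the spirit of Litestream would do the shipping, but the core idea is the same.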
This is the most expensive way to build cloud services. When people talk about the cloud being more expensive than on-prem this is often the reason why. If you're just going to run VMs 24/7 there are better options.
Even the book on Microservices says “First build the Monolith”. You don’t know how to split your system until you have actually got some traction with users, and it’s easier to split a monolith than to reorganize services.
You may never need to split your monolith! Stripe eventually broke some stuff out of their Rails monolith but it gets you surprisingly far.
You are not going to get easier to debug than a Django/Rails/etc monolith.
A bit of foresight on where you want to go with your infra can help you though; I built the first versions of our company as a Django Docker container running on a single VM. Deploy was a manual “docker pull; docker stop; docker start”. This setup got us surprisingly far. Docker is nice here as a way of sidestepping dependency packaging issues, which can be annoying in the early stages (e.g. does my server have the right C header files installed for that new DB driver I installed? Setup will be different than on your Mac!)
We eventually moved to k8s after our seed extension in response to a business need for reliability and scalability; k8s served us well all the way through Series B. The setup of having everything Dockerized made that really easy too - but we aggressively minimized complexity in the early stages.
Yes! Also, use the damn framework, instead of rebuilding shitty versions of features it offers! One good seasoned person will outperform 10 non-seasoned people in this regard. It will add up over time. I think half the real reason people are soured to monoliths is because they are bad, poorly run monoliths.
> Even the book on Microservices says “First build the Monolith”.
And yet, funnily enough, the book on Monoliths says to break things up into smaller services! It says your data should be stored in its own service (possibly multiple services, if you need multi-paradigm access [e.g. relational, full-text search, etc.]). The user experience should use its own service. And, at the very least, you should have another service in between (this is where Django and Rails usually fit). Optionally, it says, you will probably want to have additional services as well (auth, financial transactions, etc.)
I currently have distilled, compact Puppet code to create a hardened VM of any size on any provider that can run one or more Docker services, run a Python backend directly, or serve static files. With this I create a service on a Hetzner VM in 5 minutes whether the VM has 2 cores or 48 cores, and control the configuration in source-controlled manifests while monitoring configuration compliance with a custom Naemon plugin. A perfectly reproducible process. The startup kids are meanwhile doing snowflakes in the cloud, spending many KEUR per month to have something that is worse than what devops pioneers were able to do in 2017. And the stakeholders are paying for this ship.
I wrote a more structured opinion piece about this, called The Emperor's New Clouds:
https://logical.li/blog/emperors-new-clouds/
This does not happen with Puppet + Linux, because LTS distributions have a long release cycle where compatibility is not broken.
I tried to explain this topic in the article linked above. Not sure how far I succeeded.
I got pinched by this myself: security.
- With the cloud threats, everything needs to be constantly up-to-date. Docker images make this easier than long-lived servers that need to be upgraded in place. We used to upgrade every week; now we're upgraded by default. So yes, sometimes our images don't start with the latest version of xyz. But this is rare, downgrading is easy with Docker, and reproduction on a dev machine is easier.
- With the cloud threats, everything needs to be isolated. Docker makes it easy to have an Alpine with no other executable than strictly necessary, and only open ports to the required services.
I hate the cloud because 4GB/2CPU should be way enough to run extremely large workloads, but I had to admit that convenience made me switch.
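The Alpine isolation mentioned above can be sketched as a minimal Dockerfile (names, port, and entrypoint are illustrative):

```dockerfile
# Nothing in the image but the runtime and the app itself.
FROM python:3.12-alpine
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Run unprivileged and expose only the one port the service needs.
RUN adduser -D appuser
USER appuser
EXPOSE 8000
CMD ["python", "main.py"]
```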
But there's a middle ground here too. To me there's a HUGE gap between Kubernetes distributed systems and shell script free for all.
It was a big reason why we moved to containers at the bare minimum: it's quick and easy to spin up and destroy, and you are guaranteed that what runs locally runs in prod. No more "well, it worked on my system!".
The bad cloud infrastructure is when people try to use every single thing AWS sells and their whole infrastructure is at super high levels of abstraction that they could never migrate to another platform. K8s isn't that at all.
While I am a happy cloud infrastructure user in private, I have to go through some extra hoops to deploy applications at work, regardless of if k8s is used or not.
However, I ran kubeadm on a hetzner server and it's just sat chugging along forever basically. I use the cluster to run ephemeral apps where I build and deploy 1 golang service, a couple of node services in about 60 seconds ( with cache, obviously ).
As someone old enough and skilled enough to do the same with Puppet, why bother when it's so much simpler that even the kids who don't understand TLS can do it with k8s?
With k8s you get a way of saying 'WHAT YOU WANT' without 'HOW TO DO IT', and this applies not only to the actual infra aspect, but to the people maintaining it too. Any cloud platform or devops engineer worth their salt can maintain a k8s system. Good luck finding someone who understands what that 'custom Naemon' plugin is doing.
How do you control access to this setup?
How do you deploy on a different provider to Hetzner?
How do you access logs on this setup?
How do others maintain this setup?
How do you run backups?
How do you run cron jobs?
How do you deal with an offline node?
How do you expose a new ingress?
How do you provision extra storage on this setup?
If any of those is answered with 'something homegrown' or 'just write a script' then you have all the reasons k8s is worth it.
If I was to write an idempotent script for each native resource I would finish in some years :-)
You choose whatever monitoring system you like most.
For offline nodes you use whatever the level of criticality of your node justifies. This is something people struggle to understand: not every business needs 99.99% uptime. That said, I never had a downtime in Hetzner. On Digital Ocean I had one short forced reboot in 4 years. YMMV, so protect yourself as much as necessary.
Deploying on a different provider than Hetzner is the same as deploying on Hetzner except the part of launching the machine which is trivial to script - the added value is making the machine work and Ubuntu/Debian/RHEL are the same everywhere. You don't have vendor lock in with this.
If K8s works for you, enjoy it. Nobody is telling you to stop :-)
- https://github.com/kube-hetzner/terraform-hcloud-kube-hetzne...
- https://www.hetzner.com/hetzner-summit --> "Managed Kubernetes Insights and lessons learned from developing our own Kubernetes platform"
You mentioned a Python backend, so literally just replicate the build steps directly on the VPS: "pip install -r requirements.txt" > "python main.py" > "nano /etc/systemd/system/myservice.service" > "systemctl start myservice" > tada.
You can scale instances by just throwing those commands into a bash script (build_my_app.sh) = your new Dockerfile... install on any server in xx-xxx seconds.
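The systemd step above can be sketched as a minimal unit file (service name, user, and paths are illustrative):

```ini
# /etc/systemd/system/myservice.service
[Unit]
Description=My Python backend
After=network.target

[Service]
User=myapp
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/venv/bin/python main.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After writing it, `systemctl daemon-reload` followed by `systemctl enable --now myservice` starts the service and makes it survive reboots.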
For Python backends I often deploy the code directly with a Puppet resource called vcsrepo, which basically places a certain tag of a certain repo at a certain filesystem location. And I also package the systemd units for easy start/stop/restart. You can do this with other config management tools, via bash, or by hand, depending on how many systems you manage.
What bothers me with your question is Pip :-) But perhaps that is off topic...?
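Assuming the standard puppetlabs/vcsrepo module, a sketch of that pattern might look like this (repo URL, paths, and tag are illustrative, and the unit file is assumed to live in the module):

```puppet
# Pin a specific tag of the app repo to a filesystem location.
vcsrepo { '/opt/myapp':
  ensure   => present,
  provider => git,
  source   => 'https://example.com/myapp.git',
  revision => 'v1.4.2',
}

# Ship the systemd unit alongside it and keep the service running.
file { '/etc/systemd/system/myapp.service':
  ensure => file,
  source => 'puppet:///modules/myapp/myapp.service',
  notify => Service['myapp'],
}

service { 'myapp':
  ensure => running,
  enable => true,
}
```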
We have a simple cloud infrastructure. Last year, we moved all our legacy apps to a Docker-based deployment (we were already using Docker for newer stuff). Nothing fancy—just basic Dockerfile and docker-compose.yml.
Advantages:
- Easy to manage: we keep a repo of docker-compose.yml files for each environment.
- Simple commands: most of the time, it’s just "docker-compose pull" and "docker-compose up."
- Our CI pipeline builds images after each commit, runs automated tests, and deploys to staging for QA to run manual tests.
- Very stable: we deploy the same images that were tested in staging. Our deployment success rate and production uptime improved significantly after the switch—even though stability wasn’t a big issue before!
- Common knowledge: everyone on our team is familiar with Docker, and it speeds up onboarding for new hires.
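A sketch of what one of those per-environment compose files can look like (image names, tags, and ports are illustrative):

```yaml
# docker-compose.yml for one environment; the image tag is the
# exact build that already passed QA in staging.
services:
  web:
    image: registry.example.com/myapp:1.42.0
    ports:
      - "8080:8080"
    env_file: .env.production
    restart: unless-stopped
  db:
    image: postgres:16
    volumes:
      - dbdata:/var/lib/postgresql/data
volumes:
  dbdata:
```

Deploying then really is just "docker-compose pull" and "docker-compose up -d" against this file.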
I have found that going all-in with certain language/framework features, such as self-contained deployments, can allow for really powerful sidestepping of this kind of operational complexity.
If I was still in a situation where I had to ensure the right combination of runtimes & frameworks are installed every time, I might be reaching for Docker too.
For example, if you have a program that uses wsgi and runs on python 2.7, and another wsgi program that runs on python 3.16, you will absolutely need 2 different web servers to run them.
You can give different ports to both, and install an nginx on port 80 with a reverse proxy. But software tends to come with a lot of assumptions that make ops hard, and they will often not like your custom setup... but they will almost certainly like a normal docker setup.
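A sketch of that nginx front, assuming the two WSGI apps are served on local ports (ports and URL prefixes are illustrative):

```nginx
server {
    listen 80;

    # The Python 2.7 app, served by its own (older) WSGI server.
    location /legacy/ {
        proxy_pass http://127.0.0.1:8001/;
        proxy_set_header Host $host;
    }

    # The modern Python app on its own WSGI server.
    location / {
        proxy_pass http://127.0.0.1:8002/;
        proxy_set_header Host $host;
    }
}
```

This works, but every app that assumes it owns port 80 or the URL root will fight the setup, which is the "assumptions that make ops hard" point above.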
Forget about "clunky overhead" - the running costs are < 10%. The Dockerfile? You don't even need one. You can just pull the Python version you want, e.g. python:3.11, and git pull your files inside the container to get up and running. You don't need to use container image saving systems, you don't need to save images or tag anything, you don't need to write setup scripts in the Dockerfile, and you can pass the database credentials through the environment option when launching the container.
The problem is that after a year or two you get clashes or weird stuff breaking, and modules stop supporting your Python version, preventing you from installing new ones. Case in point: Google's AI module (needed for Gemini and lots of their AI API services) only works on 3.10+. What if you started in 2021? Your Python - then cutting edge - would not work anymore, only 3.5 years after that release. Yeah, you can use loads of curl. Good luck maintaining that for years though.
NumPy 1.19 is calling np.warnings, but some other dependency is using NumPy 1.20, which removed .warnings and made it .notices or something.
Your cached model routes for transformers changed default directory
You update the dependencies and it seems fine, then on a new machine you try and update them, and bam, wrong python version, you are on 3.9 and remote is 3.10, so it's all breaking.
It's also not simple in the following respect: your requirements.txt file will potentially have dependency clashes (despite running code), might take ages to install on a 4GB VM (especially if you need pytorch because some AI module that makes life 10x easier rather needlessly requires it).
Life with Docker is worth it. I was scared of it too, but there are three key benefits for the everyman / solodev:
- Literally docker export the running container as a .tar to install it on a new VM. That's one line and guaranteed the exact same VM, no changes. That's what you want, no risks.
- Back up is equally simple; shell script to download regular back ups. Update is simple; shell script to update git repo within the container. You can docker export it to investigate bugs without affecting the production running container, giving you an instant local dev environment as needed.
- When you inevitably need to update python you can just spin up a new VM with the same port mapping on Python 3.14 or whatever and just create an API internally to communicate, the two containers can share resources but run different python versions. How do you handle this with your solution in 4 years time?
- If you need to rapidly scale, your shell script could work fine, I'll give you that. But probably it takes 2 minutes to start on each VM. Do you want a 2 minute wait for your autoscaling? No you want a docker image / AMI that takes 5 seconds for AWS to scale up if you "hit it big".
Sorry, but you've got no idea what you're talking about.
You can also run OCI images, often called Docker images, directly via systemd's nspawn. Docker doesn't create overhead by itself; at its heart it's a wrapper around kernel features and iptables.
You don't need Docker for deployments, but let's not use completely made-up bullshit as arguments, okay?
If you use it as IaaS, it's a lot quicker to get prototypes working than if you use anything else, including VPS's from other providers.
Google Cloud in particular has very few vectors for lock-in, and follows more principle of least surprise.
But once you have prototyped, you should ask the question about rebuilding it somewhere that is cheaper.
Near infinite scalability of disk drives is nice, and snapshotting, and cloud in general can allow you to extend your prototype into taking production load and allowing you to measure what you will need; but leaning in to "cloud magick" (cloud run, lambdas, etc) will consume almost as much time to learn and debug as just doing it the old school way anyway. In my lived experience.
The biggest problem is the so called cloud native stuff which is both more expensive and more complex. There are contexts where it makes sense but for startups they are doing more harm than good.
Two examples that I came across
- "Tested" means it passed on CI. Tests fail to run locally? Who does development locally anyway?
- Teams so reliant on "AI" because this is the future of coding. "how to sort a list in python" became a prompt, rather than a lookup on the official documentation.
I’m still very much an ansible noob, but if you have a repo with playbooks I’d love to poke around and learn some things! If not, no worries, I appreciate your time reading this comment!
While I absolutely agree with you and your approach, would you mind elaborating what kind of configuration compliance you are referring to in this statement? I suppose you do not mean any kind of configuration that your Puppet code produces as that configuration is "monitored", or rather managed, by Puppet.
This case is actually pretty simple.
Puppet applies the configuration you declare idempotently when you run the Puppet agent: whatever is not configured gets configured; whatever is already configured stays the same.
If there is an error the return code of the Puppet agent is different from that of the situations above.
Knowing this, you can choose to trigger the Puppet agent runs remotely from a monitoring system (instead of periodic local runs), collecting the exit code and monitoring the status of that exit code inside the monitoring system.
Therefore, instead of having an agent that runs silently, leaving you logs to parse, you have a green light / red light system regarding the compliance of a machine with its manifest. If somebody broke the machine, leaving it in an unconfigurable state, or if someone broke its manifest during configuration maintenance, you will soon get a red light and the corresponding notifications.
This is active configuration management rather than what people usually call provisioning.
Of course you need an SSH connection for this execution, and with that you need a hardened SSH config, whitelisting, a dedicated unprivileged user for monitoring, exceptional fine-grained sudo cases, etc. Not rocket science.
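A sketch of such a check, not the actual plugin described above: Puppet's `--detailed-exitcodes` flag reports 0 for an in-sync run, 2 when changes were applied, and 4 or 6 when resources failed, which maps naturally onto monitoring states (host and user names are illustrative):

```shell
# map_exit translates Puppet's --detailed-exitcodes values
# into Nagios/Naemon-style plugin states.
map_exit() {
    case "$1" in
        0) echo "OK - no changes needed"; return 0 ;;   # already compliant
        2) echo "OK - drift corrected"; return 0 ;;     # changes were applied
        4|6) echo "CRITICAL - resources failed"; return 2 ;;
        *) echo "UNKNOWN - agent run failed"; return 3 ;;
    esac
}

# The check itself: trigger the agent remotely over the hardened
# SSH setup and classify the result for the monitoring system.
check_host() {
    ssh -o BatchMode=yes "monitor@$1" \
        sudo /opt/puppetlabs/bin/puppet agent --onetime --no-daemonize --detailed-exitcodes
    map_exit $?
}
```

The monitoring system then simply alerts on the plugin state, giving the green/red light behavior without parsing agent logs.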
Sometimes the job descriptions are boastful in their reference to those technologies, and other times you can detect some level of despair.
The additional benefit is devs can run all the same stuff on a Linux laptop (or Linux VM on some other platform) - and everyone can have their own VM in the cloud if they like to demo or test stuff using all the same setup. Bootstrapping a new system is checking in their ssh key and running a shell script.
Easy to debug, not complex or expensive, and we could vertically scale it all quite a ways before needing to scale horizontally. It's not for everyone, but seed stage and earlier - totally appropriate imo.
If it interests you, both major git hosts (and possibly all of them) have an endpoint to map a username to their already registered ssh keys: https://github.com/mdaniel.keys https://gitlab.com/mdaniel.keys
It's one level of indirection away from "check in a public key" in that the user can rotate their own keys without needing git churn
Also, and I recognize this is departing quite a bit from what you were describing, ssh key leases are absolutely awesome because it addresses the offboarding scenario much better than having to reconcile evicting those same keys: https://github.com/hashicorp/vault/blob/v1.12.11/website/con... and while digging up that link I also discovered that Vault will allegedly do single-use passwords, too <https://github.com/hashicorp/vault/blob/v1.12.11/website/con...>, but since I am firmly in the "PasswordLogin no" camp, caveat emptor with that one
Both are Apache 2 and the Flatcar folks are excellent to work with
https://github.com/flatcar/Flatcar#readme
https://github.com/bottlerocket-os#bottlerocket
I've been running my SaaS first on a single server, then after getting product-market fit on several servers. These are bare-metal servers (Hetzner). I have no microservices, I don't deal with Kubernetes, but I do run a distributed database.
These bare-metal servers are incredibly powerful compared to virtual machines offered by cloud providers (I actually measured several years back: https://jan.rychter.com/enblog/cloud-server-cpu-performance-...).
All in all, this approach is ridiculously effective: I don't have to deal with complexity of things like Kubernetes, or with cascading system errors that inevitably happen in complex systems. I save on development time, maintenance, and on my monthly server bills.
The usual mantra is "but how do we scale" — I submit that 1) you don't know yet if you will need to scale, and 2) with those ridiculously powerful computers and reasonable design choices you can get very, very far with just 3-5 servers.
To be clear, I am not advocating that you run your business in your home closet. You still need automation (I use ansible and terraform) to manage your servers.
Did you read the article or just the headline?
Scroll down to the bottom, under the section "A few considerations" and try not to laugh.
"A few considerations" turns out to be a pretty significant chunk of security work ESPECIALLY if you are storing/transmitting highly sensitive information.
How do you handle something like HIPAA compliance when you're in this situation?
There are 2 types of programmers: those that think they've seen everything and those that know they've seen next to nothing. And as such, these absolute takes are tiring.
I'm a dev who hasn't seen anything related to that. Since you bring it up, can you give some pointers on why something like a MySQL db coupled to a monolithic backend isn't good enough? What shortcomings did you experience?
All of the things raised in the article seem possible to solve without the need for microservices.
It's when one starts getting sucked down the "cloud native" wormhole of all these niche open source systems and operators and ambassador and sidecar patterns, etc. that things go wrong. Those are for environments with many independent but interconnecting tech teams with diverse programming language use.
But I think for many "kubernetes" means your second paragraph. It doesn't have to be like that at all! People should try setting up a k3s cluster and just learning about workloads, services, and ingresses. That's all you need to replace a bunch of ad hoc VMs and docker stuff.
I wish there were something that did just that, because kube comes with a lot of baggage, and docker-compose is a bit too basic for some important production needs.
https://github.com/hadijaveed/docker-compose-anywhere
I can have something with nice deployments, super easy logs and metrics, and a nice developer experience setup in no time at all.
What I actually got was a half an hour tutorial from the guy who set it up, in which he explained the whole concept (I had no clue) and gave me enough information to deploy a server, which I did with zero problems. I had automatic deployment from `git push` working very quickly.
To me this seemed like a no brainer. Unless you literally have one service this is waaay easier to use.
Granted I didn't have to set it up - maybe that's where the terrible reputation comes from?
Seriously, I think a lot of people do things the hard way to learn large scale infrastructure. Another common reason is 'things will be much easier when we scale to a massive number of clients', or we can dynamically scale up on demand.
These are all valid to the people building this, just not as much to founders or professional CTOs.
Some people seem to have no concern with the needs and timetables of the would be customers but instead burn through cash building fancy nonsense.
It's like going in to a car mechanic for tires and then finding out it took 3 weeks because the guy wanted to put on low rider hydraulics and spinner hubcaps for his personal enrichment.
The worst part is it's inherently ambiguous to the next people. They don't know if the reason something is there is because it's needed or because it's just shiny bling.
I don’t quite get if people do it for interest, for love of the tech, or if they are technocratic and believe in levelling up their skill to get k8s on their CV like you say.
All I think is “this looks painful to manage”!
I run a k8s cluster at home. Part of it yes, is to apply my existing skills and keep them fresh. But part of it is that kubernetes can be easier long term.
I've got magical hard drive storage with Rook Ceph. I can yank a hard drive out of my servers and nothing happens to my workloads.
I can do maintenance on one of the servers with 0 down time.
All of my config for what I have deployed is in git.
I manage VMs and kubernetes at work, and I'm not going to pretend that kubernetes isn't complex, but it's complex up front instead of down the road. VMs run into complexity when things change. I'm sure you can make VMs good, but then why not use something like kubernetes? You will have to reinvent a lot of the stuff that's already in kubernetes.
It's a hammer for sure and not everything is a nail, but it can be really powerful and useful even for home labs.
It's a bit like factorio with the extra dopamine hit of getting to unbox stuff.
You don't need k8s for all of that, but there's not a simpler solution than k8s that handles as much.
Life is full of pain. Deal with it.
Having seen some of these half-rolled, first-time-understood k8s deployments, and the multi-year projects to unravel the mess that was created, overflowing with anti-patterns and other incorrect ways of doing things, I think I would prefer a narrower scope of true experienced professionals (or at least some experienced pros that can help guide the ship for their mentees) working on and designing k8s infra.
One approach that I’ve considered is to start with the standard tooling (k8s + gitops) from day one, but still run it in a single VM. Any thoughts?