I went through sweat and tears with this on different projects. People wanting to be cool because they use hype-train tech end up doing things of unbelievably bad quality because "hey, we are not that many in the team" but "hey, we need infinite scalability". Teams immature to the point of not understanding what LTS means have decided that they needed Kubernetes just because. I could go on.
I currently have distilled, compact Puppet code to create a hardened VM of any size on any provider that can run one or more Docker services, run a Python backend directly, or serve static files. With this I create a service on a Hetzner VM in 5 minutes whether the VM has 2 cores or 48 cores, and control the configuration in source-controlled manifests while monitoring configuration compliance with a custom Naemon plugin. A perfectly reproducible process. The startup kids are meanwhile building snowflakes in the cloud, spending many KEUR per month to have something that is worse than what devops pioneers were able to do in 2017. And the stakeholders are paying for this ship.
I wrote a more structured opinion piece about this, called "The Emperor's New Clouds":
I started my career in a world where we did everything using shell scripts running directly on bare metal servers, usually running Solaris, and later SuSe or RedHat. I never understood the "how would you reproduce your setup without Docker (or X, where X is some other technology)". The scripts were deterministic. The dependency versions were locked. The configurations were identical. The input arguments were identical. The order of execution was identical. It all ran on a deterministic computational device. How could it not be reproducible?
Well that's exactly the point! Creating complex cloud resources with, for instance, Terraform, is less reproducible than a shell script on an LTS system like Ubuntu or RHEL - that's because the cloud provider interfaces drift and from time to time stop accepting the Terraform manifests that previously worked. And to fix it, you have to interrupt your normal work for yet another unplanned intervention in the Terraform code - this happened to my teams several times.
This does not happen with Puppet + Linux, because LTS distributions have a long release cycle where compatibility is not broken.
I tried to explain this topic in the article linked above. Not sure how far I succeeded.
You said it: your versions were locked. Therefore they were not constantly up to date.
I was pinched by this myself: security.
- With the cloud threats, everything needs to be constantly up-to-date. Docker images make it easier than permanent servers that need to be upgraded. We used to upgrade every week; now we're upgraded by default. So yes, sometimes our images don't start with the latest version of xyz. But this is rare, downgrading is easy with Docker, and reproduction on a dev machine is easier.
- With the cloud threats, everything needs to be isolated. Docker makes it easy to have an Alpine with no other executable than strictly necessary, and only open ports to the required services.
I hate the cloud because 4GB/2CPU should be way enough to run extremely large workloads, but I had to admit that convenience made me switch.
To be fair there's real issues with this approach, too. For example, shell scripts aren't actually very portable. GNU awk vs nawk vs... multiply that by all your tools, and yeah those scripts don't run deterministically (they rely too much on the environment). This alone was a big reason why systemd exists today.
But there's a middle ground here too. To me there's a HUGE gap between Kubernetes distributed systems and shell script free for all.
reproducibility isn't just for your deployments, it's for development too. got old REAL fast when your fancy build doesn't work the same on every dev's device or some one-off issue with how your dev has set up their environment steals hours from everyone.
it was a big reason why we moved to containers at the bare minimum, because it's quick and easy to spin up and destroy and you are guaranteed what runs locally runs on prod. no more "well it worked on my system!".
Wouldn't there be slight differences in different Unix flavors, so that the script couldn't run in all of them? If it only worked on Solaris, what would happen if Solaris retired? (Like what happened to CentOS)
I feel like Kubernetes is always randomly mentioned in rants like this. Instead of saying your hardened VM has Docker you could have just said it has kubelet on it. Then instead of a bunch of ad hoc "docker services" you could pay pennies for a k8s control plane that gives you control over everything on those VMs. I fail to see how your way is anything but worse.
The bad cloud infrastructure is when people try to use every single thing AWS sells and their whole infrastructure is at super high levels of abstraction that they could never migrate to another platform. K8s isn't that at all.
Unfortunately in air-gapped systems you cannot simply pay pennies for a managed k8s platform. In these cases you have to bootstrap and manage k8s on your own in your data centers. While I do not think bootstrapping and managing a cluster is difficult at all (especially if you only handle stateless workloads), it may still not fit or integrate well with a company's overall management infrastructure.
While I am a happy cloud infrastructure user in private, I have to go through some extra hoops to deploy applications at work, regardless of whether k8s is used or not.
I think in either case, if you already have code that's done, using that is going to be less effort than switching.
However, I ran kubeadm on a Hetzner server and it's just sat chugging along forever, basically. I use the cluster to run ephemeral apps where I build and deploy 1 Golang service and a couple of Node services in about 60 seconds (with cache, obviously).
As someone old enough and skilled enough to do the same with Puppet, why bother, when k8s is so much simpler and easier that even the kids who don't understand TLS can do it?
With k8s you get a way of saying 'WHAT YOU WANT' without 'HOW TO DO IT', and this applies not only to the actual infra aspect, but to the people maintaining it too. Any cloud platform and devops worth their salt can maintain a k8s system. Good luck finding someone who understands what that 'custom Naemon' plugin is doing.
The questions are short but the answers would be long. Puppet manages all fine grained OS resources (files, dirs, repos, cronjobs, sudo declarations, firewall rules, etc) and you aggregate those resources into classes which are then pushed to different machines. The classes are parametrizable for the differences between systems.
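As a rough illustration of what aggregating fine-grained resources into a parametrizable class can look like, here is a minimal Puppet sketch; the class name, parameters, and resource values are all made up for illustration, not taken from the author's code:

```puppet
# Hypothetical sketch: fine-grained OS resources aggregated into a
# parametrizable class that can be pushed to different machines.
class hardened_base (
  Integer $ssh_port      = 22,
  Boolean $enable_docker = false,
) {
  file { '/etc/motd':
    ensure  => file,
    content => "Managed by Puppet - do not edit by hand\n",
  }

  cron { 'nightly-backup':
    command => '/usr/local/bin/backup.sh',
    hour    => 2,
    minute  => 0,
  }

  if $enable_docker {
    package { 'docker.io':
      ensure => installed,
    }
    service { 'docker':
      ensure  => running,
      enable  => true,
      require => Package['docker.io'],
    }
  }
}
```

Differences between systems are then expressed by instantiating the class with different parameter values per node.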
If I was to write an idempotent script for each native resource I would finish in some years :-)
You choose whatever monitoring system you like the most.
For offline nodes you use whatever the level of criticality of your node justifies. This is something people struggle to understand: not every business needs 99.99% uptime. That said, I never had a downtime in Hetzner. On DigitalOcean I had one short forced reboot in 4 years. YMMV, so protect yourself as much as necessary.
Deploying on a different provider than Hetzner is the same as deploying on Hetzner, except the part of launching the machine, which is trivial to script - the added value is in making the machine work, and Ubuntu/Debian/RHEL are the same everywhere. You don't have vendor lock-in with this.
If K8s works for you, enjoy it. Nobody is telling you to stop :-)
Serious question for you, why use Docker at all? You can just get rid of the clunky overhead.
You mentioned a Python backend, so literally just replicate the build script directly on the VPS: "pip install -r requirements.txt" > "python main.py" > "nano /etc/systemd/system/myservice.service" > "systemctl start myservice" > Tada.
You can scale instances by just throwing those commands in a bash script (build_my_app.sh) = your new Dockerfile... install on any server in xx-xxx seconds.
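A sketch of what such a script could look like, assuming a venv-based Python app; the service name, paths, and unit contents are placeholders, and the actual deploy steps are left commented out since they touch the system:

```shell
#!/usr/bin/env bash
# build_my_app.sh - hypothetical pip + systemd deploy, no Docker.
set -euo pipefail

APP_NAME="${1:-myservice}"
APP_DIR="${2:-/opt/myservice}"

# Render a minimal systemd unit for the app.
render_unit() {
  local name="$1" dir="$2"
  cat <<EOF
[Unit]
Description=${name}
After=network.target

[Service]
WorkingDirectory=${dir}
ExecStart=${dir}/venv/bin/python ${dir}/main.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

deploy() {
  python3 -m venv "${APP_DIR}/venv"
  "${APP_DIR}/venv/bin/pip" install -r "${APP_DIR}/requirements.txt"
  render_unit "${APP_NAME}" "${APP_DIR}" > "/etc/systemd/system/${APP_NAME}.service"
  systemctl daemon-reload
  systemctl enable --now "${APP_NAME}"
}

# Uncomment to actually deploy:
# deploy
```

Splitting the unit rendering into a function keeps it reusable when you run several services on one box.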
I mentioned Docker because it interests many developers but on VMs that I control I do not need Docker at all. Deploying with Docker provides host OS independence which is nice if you are distributing but unnecessary if the host is yours, running a fixed OS.
For Python backends I often deploy the code directly with a Puppet resource called vcsrepo, which basically places a certain tag of a certain repo at a certain filesystem location. And I also package the systemd scripts for easy start/stop/restart. You can do this with other config management tools, via bash, or by hand, depending on how many systems you manage.
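The resource comes from the puppetlabs/vcsrepo module; a minimal sketch of pinning a tag, where the path, repo URL, and tag are placeholders:

```puppet
# Hypothetical: deploy exactly one tagged release straight from git.
vcsrepo { '/opt/myapp':
  ensure   => present,
  provider => git,
  source   => 'https://git.example.com/myapp.git',
  revision => 'v1.2.3',  # the tag to check out
}
```

Bumping `revision` and re-running the agent is then the whole deploy.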
What bothers me with your question is Pip :-) But perhaps that is off topic...?
We have a simple cloud infrastructure. Last year, we moved all our legacy apps to a Docker-based deployment (we were already using Docker for newer stuff). Nothing fancy—just basic Dockerfile and docker-compose.yml.
Advantages:
- Easy to manage: we keep a repo of docker-compose.yml files for each environment.
- Simple commands: most of the time, it’s just "docker-compose pull" and "docker-compose up."
- Our CI pipeline builds images after each commit, runs automated tests, and deploys to staging for QA to run manual tests.
- Very stable: we deploy the same images that were tested in staging. Our deployment success rate and production uptime improved significantly after the switch—even though stability wasn’t a big issue before!
- Common knowledge: everyone on our team is familiar with Docker, and it speeds up onboarding for new hires.
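For anyone curious what "nothing fancy" means in practice, a hypothetical per-environment compose file along those lines might look like this (service names, registry, and ports are made up):

```yaml
# Hypothetical docker-compose.yml kept per environment;
# image tags are the ones that passed tests in staging.
services:
  web:
    image: registry.example.com/myapp:1.4.2
    ports:
      - "8080:8080"
    env_file: .env.production
    restart: unless-stopped
  worker:
    image: registry.example.com/myapp-worker:1.4.2
    depends_on:
      - web
    restart: unless-stopped
```

Deploying then really is just pulling the new tags and re-upping the stack.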
I think a lot of (justifiable) Docker use comes out of being forced to use other tools & ecosystems that are fundamentally messy and not really intended for galactic-scale enterprise development.
I have found that going all-in with certain language/framework features, such as self-contained deployments, can allow for really powerful sidestepping of this kind of operational complexity.
If I was still in a situation where I had to ensure the right combination of runtimes & frameworks are installed every time, I might be reaching for Docker too.
Python, Ruby, and to a much larger extent PHP are the Docker showcase!
For example, if you have a program that uses WSGI and runs on Python 2.7, and another WSGI program that runs on Python 3.x, you will absolutely need 2 different web servers to run them.
You can give different ports to both, and install an nginx on port 80 with a reverse proxy. But software tends to come with a lot of assumptions that make ops hard, and they will often not like your custom setup... but they will almost certainly like a normal docker setup.
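The custom setup described above could be sketched like this; the domain, paths, and backend ports are placeholders, assuming each WSGI app serves on its own localhost port:

```nginx
# Hypothetical nginx front on port 80, fanning out to two app servers
# that run under different Python versions.
server {
    listen 80;
    server_name example.com;

    location /legacy/ {
        # the Python 2.7 WSGI app, e.g. behind gunicorn on :8001
        proxy_pass http://127.0.0.1:8001/;
    }

    location / {
        # the Python 3 WSGI app on :8002
        proxy_pass http://127.0.0.1:8002/;
    }
}
```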
It seems unobvious, but Docker always saves you. It's actually quicker than running "pip install -r requirements.txt" once you get a year in. (Trust me, I used to take your approach.)
Forget about "clunky overhead" - the running costs are < 10%. The Dockerfile? You don't even need one. You can just pull the Python version you want, e.g. python:3.11, and git pull your files inside the container to get up and running. You don't need container image saving systems, you don't need to save images or tag anything, you don't need to write setup scripts in the Dockerfile, and you can pass the database credentials through the environment option when launching the container.
The problem is that after a year or two you get clashes or weird stuff breaking. And modules drop support for your Python version, preventing you from installing new ones. Case in point: Google's AI module (needed for Gemini and lots of their AI API services) only works on 3.10+. What if you started in 2021? Your Python - then cutting edge - would not work anymore, and it's only 3.5 years since that release. Yeah, you can use loads of curl. Good luck maintaining that for years though.
- Numpy 1.19 is calling np.warnings but some other dependency is using Numpy 1.20, which removed .warnings and made it .notices or something.
- Your cached model paths for transformers changed their default directory.
- You update the dependencies and it seems fine, then on a new machine you try to update them, and bam, wrong Python version: you are on 3.9 and remote is 3.10, so it's all breaking.
It's also not simple in the following respect: your requirements.txt file will potentially have dependency clashes (despite the code running), and might take ages to install on a 4GB VM (especially if you need pytorch because some AI module that makes life 10x easier rather needlessly requires it).
life with docker is worth it. i was scared of it too, but there are a few key benefits for the everyman / solodev:
- Literally docker export the running container as a .tar to install it on a new VM. That's one line, and you're guaranteed the exact same environment, no changes. That's what you want, no risks.
- Backup is equally simple: shell script to download regular backups. Update is simple: shell script to update the git repo within the container. You can docker export it to investigate bugs without affecting the production running container, giving you an instant local dev environment as needed.
- When you inevitably need to update Python, you can just spin up a new container with the same port mapping on Python 3.14 or whatever and create an internal API to communicate; the two containers can share resources but run different Python versions. How do you handle this with your solution in 4 years' time?
- If you need to rapidly scale, your shell script could work fine, I'll give you that. But probably it takes 2 minutes to start on each VM. Do you want a 2 minute wait for your autoscaling? No you want a docker image / AMI that takes 5 seconds for AWS to scale up if you "hit it big".
Sorry, but you've got no idea what you're talking about.
You can also run OCI images, often called Docker images, directly via systemd's nspawn, because Docker doesn't create overhead by itself - at its heart it's a wrapper around kernel features and iptables.
You didn't need docker for deployments, but let's not use completely made up bullshit as arguments, okay?
I'm with you, but for me Cloud does have one major benefit:
If you use it as IaaS, it's a lot quicker to get prototypes working than with anything else, including VPSes from other providers.
Google Cloud in particular has very few vectors for lock-in, and follows the principle of least surprise more closely.
But once you have prototyped, you should ask the question about rebuilding it somewhere that is cheaper.
Near infinite scalability of disk drives is nice, and snapshotting, and cloud in general can allow you to extend your prototype into taking production load and allowing you to measure what you will need; but leaning in to "cloud magick" (cloud run, lambdas, etc) will consume almost as much time to learn and debug as just doing it the old school way anyway. In my lived experience.
I am not against the cloud. VMs are also cloud, unless you run them on your own servers. For instance, the Hetzner Cloud (mostly VMs, plus load balancers and disks) is so cheap and has such a nice CLI API that it competes aggressively with dedicated servers - I would definitely start any new project with VMs, not with iron.
The biggest problem is the so-called cloud native stuff, which is both more expensive and more complex. There are contexts where it makes sense, but for startups it does more harm than good.
Apart from the operation side, there is a development side parallel too.
Two examples that I came across:
- "Tested" means: if it passes on CI, it is good. Tests failing locally? Who does development locally anyway?
- Teams so reliant on "AI" because this is the future of coding. "how to sort a list in python" became a prompt, rather than a lookup in the official documentation.
I’ve just recently gotten into ansible and find myself building the same thing. I wrote a script to interact with virsh and build vms locally so I can spin up my infra at home to test and deploy to the cloud if and when I want to spend actual money.
I’m still very much an ansible noob, but if you have a repo with playbooks I’d love to poke around and learn some things! If not, no worries, I appreciate your time reading this comment!
> while monitoring configuration compliance with a custom Naemon plugin.
While I absolutely agree with you and your approach, would you mind elaborating what kind of configuration compliance you are referring to in this statement? I suppose you do not mean any kind of configuration that your Puppet code produces as that configuration is "monitored", or rather managed, by Puppet.
I don't mind elaborating - the fact that people are asking me questions reminds me that I need to invest a bit more effort on some articles.
This case is actually pretty simple.
Puppet applies the configuration you declare idempotently when you run the Puppet agent: whatever is not configured gets configured, whatever is already configured remains the same.
If there is an error, the return code of the Puppet agent is different from that of the situations above.
Knowing this, you can choose to trigger the Puppet agent runs remotely from a monitoring system (instead of periodic local runs), collect the exit code, and monitor the status of that exit code inside the monitoring system.
Therefore, instead of having an agent that runs silently, leaving you logs to parse, you have a green light / red light system with regard to the compliance of a machine with its manifest. If somebody broke the machine, leaving it in an unconfigurable state, or if someone broke its manifest during configuration maintenance, you will soon get a red light and the corresponding notifications.
This is active configuration management rather than what people usually call provisioning.
Of course you need an SSH connection for this execution, and with that you need hardened SSH config, whitelisting, a dedicated unprivileged user for monitoring, exceptional fine-grained sudo cases, etc. Not rocket science.
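I don't know what the author's Naemon plugin actually looks like, but a minimal check along these lines could map the results of `puppet agent --test --detailed-exitcodes` (0 = no changes, 2 = changes applied, 4 = failures, 6 = changes plus failures, 1 = run failed) to standard plugin states; the mapping below is one reasonable choice, not the author's:

```shell
#!/usr/bin/env bash
# Hypothetical Naemon/Nagios-style plugin: translate the puppet agent's
# detailed exit code into plugin states (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).
check_puppet_compliance() {
  local puppet_rc="$1"
  case "$puppet_rc" in
    0) echo "OK - catalog applied, no changes needed";    return 0 ;;
    2) echo "OK - catalog applied, drift was corrected";  return 0 ;;
    4) echo "CRITICAL - resource failures during apply";  return 2 ;;
    6) echo "CRITICAL - changes applied but some failed"; return 2 ;;
    1) echo "CRITICAL - puppet run failed entirely";      return 2 ;;
    *) echo "UNKNOWN - unexpected exit code $puppet_rc";  return 3 ;;
  esac
}

# In the real check you would do something like:
#   ssh monitor@host sudo puppet agent --test --detailed-exitcodes
#   check_puppet_compliance "$?"
```

The monitoring system then shows green when the node converges cleanly and red when the run fails or resources error out.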
I can't remember the last time I've seen a position description for a software developer (or anything tech related for that matter) that didn't include a requirement for skills in some cloud related tech.
Sometimes the job descriptions are boastful in their reference to those technologies, and other times you can detect some level of despair.
Basically doing this for a small startup - there are some complexities around autoscaling task queues with gpus and whatnot, but the heart of it is on a single VM (nginx, webapp, postgres, redis). We're b2b, so there's very little traffic anyway.
The additional benefit is devs can run all the same stuff on a Linux laptop (or Linux VM on some other platform) - and everyone can have their own VM in the cloud if they like to demo or test stuff using all the same setup. Bootstrapping a new system is checking in their ssh key and running a shell script.
Easy to debug, not complex or expensive, and we could vertically scale it all quite a ways before needing to scale horizontally. It's not for everyone, but seed stage and earlier - totally appropriate imo.
It's one level of indirection away from "check in a public key" in that the user can rotate their own keys without needing git churn
Also, and I recognize this is departing quite a bit from what you were describing, ssh key leases are absolutely awesome because it addresses the offboarding scenario much better than having to reconcile evicting those same keys: https://github.com/hashicorp/vault/blob/v1.12.11/website/con... and while digging up that link I also discovered that Vault will allegedly do single-use passwords, too <https://github.com/hashicorp/vault/blob/v1.12.11/website/con...>, but since I am firmly in the "PasswordLogin no" camp, caveat emptor with that one
True, I use it mainly for a few convenience things - holding ephemeral monitoring data, distributed locks, redis streams for some pub/sub stuff, sorted sets can be handy - things I could do in Postgres, but are a bit simpler in Redis.
I like this, but one of the issues with this approach is that if you rely on a traditional configuration management tool instead of Docker images, you are in for a world of pain.
Docker and Docker images have tons of best practices already defined for plenty of use cases. If it's already containerized, then jumping to any orchestrator that supports OCI images is more about adjusting the business to a new set of operations.
I have a custom deployment system which idempotently configures an Ubuntu LTS VM. All the config templates are checked into source control. I don't configure anything by hand - it's either handled in this thing or via a small user-data script run at provisioning time.
Like everything, it's context dependent, but wowzers my life has improved so much since I got on board the Flatcar or Bottlerocket train of immutable OS. Flatcar (née CoreOS) does ship with docker but is still mostly a general purpose OS but Bottlerocket is about as "cloud native" as it comes, shipping with kubelet and even the host processes run in containers. For my purposes (being a k8s fanboy) that's just perfect since it's one less bootstrapping step I need to take on my own
Both are Apache 2 and the Flatcar folks are excellent to work with
I've been running my SaaS first on a single server, then after getting product-market fit on several servers. These are bare-metal servers (Hetzner). I have no microservices, I don't deal with Kubernetes, but I do run a distributed database.
All in all, this approach is ridiculously effective: I don't have to deal with complexity of things like Kubernetes, or with cascading system errors that inevitably happen in complex systems. I save on development time, maintenance, and on my monthly server bills.
The usual mantra is "but how do we scale" — I submit that 1) you don't know yet if you will need to scale, and 2) with those ridiculously powerful computers and reasonable design choices you can get very, very far with just 3-5 servers.
To be clear, I am not advocating that you run your business in your home closet. You still need automation (I use ansible and terraform) to manage your servers.
The scaling thing is a great boogeyman. It preys on this optimism your software is going to be so successful in such a short amount of time which people want to believe.
Scroll down to the bottom, under the section "A few considerations" and try not to laugh.
"A few considerations" turns out to be a pretty significant chunk of security work ESPECIALLY if you are storing/transmitting highly sensitive information.
How do you handle something like HIPAA compliance when you're in this situation?
There are 2 types of programmers: those that think they've seen everything and those that know they've seen next to nothing. And as such, these absolute takes are tiring.
I've written a HIPAA-compliant application that was VPS-hostable. It's been a while, but IIRC, it simply involved a combination of TLS everywhere and encrypting the sensitive fields in the DB. I don't remember if there was any other trick involved, but it wasn't difficult. By far the hardest thing about that project was the complexity of the medical codes-- not HIPAA compliance-- and that is something the cloud wouldn't help with at all.
> How do you handle something like HIPAA compliance when you're in this situation?
I'm a dev who hasn't seen anything related to that. Since you bring it up, can you give some pointers on why something like a MySQL db coupled to a monolithic backend isn't good enough? What shortcomings did you experience?
All of the things raised in the article seem possible to solve without the need for microservices.
There is a core 20% of Kubernetes - deployments, pods, services, the way it handles blue-green deployments, declarative definitions, namespace separation, etc. - that is really good. Just keeping to those simple basics, using a managed cloud Kubernetes service, and running your state (database) out of cluster is a good experience (IMO).
It's when one starts getting sucked down the "cloud native" wormhole of all these niche open source systems and operators and ambassador and sidecar patterns, etc. that things go wrong. Those are for environments with many independent but interconnecting tech teams with diverse programming language use.
For me this is all Kubernetes is. I feel like people are often talking about two different things in discussions like this. For me it's just a uniform way to deploy stuff that is better than docker compose. We pay pennies for the control plane and workers are just generic VMs with kubelet.
But I think for many, "kubernetes" means your second paragraph. It doesn't have to be like that at all! People should try setting up a k3s cluster and just learn about workloads, services and ingresses. That's all you need to replace a bunch of ad hoc VMs and docker stuff.
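For reference, that "workloads plus services" core can be this small; the app name, image, and ports below are placeholders, not from any real cluster:

```yaml
# Hypothetical minimal k8s workload: one Deployment with a Service in front.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels: { app: myapp }
  template:
    metadata:
      labels: { app: myapp }
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.4.2
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector: { app: myapp }
  ports:
    - port: 80
      targetPort: 8080
```

Add an Ingress on top and you have rolling deployments, restarts, and load balancing without touching any of the "cloud native" extras.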
For a lot of companies and projects I worked on, this is the same conclusion I came to. 99% of the time, all we need / want is docker-compose++. Things like 0-downtime deployment out of the box, a simple configuration system for replica sets and other replication / distribution mechanisms, and that is basically it.
I wish there was something that did just that, because kube comes with a lot of baggage, and docker-compose is a bit too basic for some important production needs.
Exactly this. Kubernetes has a million knobs and dials you can tweak for any use case you want, but equally they can be ignored and you can use the core functionality and keep it simple.
I can have something with nice deployments, super easy logs and metrics, and a nice developer experience setup in no time at all.
Yeah, I found out my work was using Kubernetes. Given its reputation - having never used it before - when I asked if I could set up a server for some internal tooling, I was braced for the worst.
What I actually got was a half an hour tutorial from the guy who set it up, in which he explained the whole concept (I had no clue) and gave me enough information to deploy a server, which I did with zero problems. I had automatic deployment from `git push` working very quickly.
To me this seemed like a no brainer. Unless you literally have one service this is waaay easier to use.
Granted I didn't have to set it up - maybe that's where the terrible reputation comes from?
Who is going to get a new job without k8s on their resume. :)
Seriously, I think a lot of people do things the hard way to learn large scale infrastructure. Another common reason is 'things will be much easier when we scale to a massive number of clients', or we can dynamically scale up on demand.
These are all valid to the people building this, just not as much to founders or professional CTOs.
Excuse my harshness but people doing it needlessly is just unprofessional waste and abuse.
Some people seem to have no concern with the needs and timetables of the would be customers but instead burn through cash building fancy nonsense.
It's like going in to a car mechanic for tires and then finding out it took 3 weeks because the guy wanted to put on low rider hydraulics and spinner hubcaps for his personal enrichment.
The worst part is it's inherently ambiguous to the next people. They don't know if the reason something is there is because it's needed or because it's just shiny bling.
I am certainly not saying that what you say is untrue. My comment is dark humour. I really like your last point. Years ago I replaced a huge Hadoop cluster data processing job with a single app on one machine with a few CPUs, which reduced a job that took over 8 hours to 20 minutes. What is even dumber: it was just a Python script and GNU parallel, which used to be perl.
…but if the bosses at competing mechanic shops hire based on quality of low riders a mechanic can install, of course they'll practice on the paying customers.
Just take a look at the level of complexity in home lab subreddits!
I don’t quite get if people do it for interest, for love of the tech, or if they are technocratic and believe in levelling up their skill to get k8s on their CV like you say.
K8s is painful to get started, and painful to learn. But once you have it up you can just keep adding stuff to it.
I run a k8s cluster at home. Part of it yes, is to apply my existing skills and keep them fresh. But part of it is that kubernetes can be easier long term.
I've got magical hard drive storage with Rook Ceph. I can yoink a hard drive out of my servers and nothing happens to my workloads.
I can do maintenance on one of the servers with 0 down time.
All of my config for what I have deployed is in git.
I manage VMs and Kubernetes at work, and I'm not going to pretend that Kubernetes isn't complex, but it's complex up front instead of down the road. VMs run into complexity when things change. I'm sure you can make VMs good, but then why not use something like Kubernetes? You will have to reinvent a lot of the stuff that's already in Kubernetes.
It's a hammer for sure and not everything is a nail, but it can be really powerful and useful even for home labs.
K8s is painful to manage. It's a lot less painful than getting paged in the middle of the night because your server is down - And much much less than realizing that you've been down for an entire day and didn't notice. (K8s isn't even a complete solution to these problems! Just one part of a complete ~balanced breakfast~ production stack)
You don't need k8s for all of that, but there's not a simpler solution than k8s that handles as much.
It's because it is complex. And in the long run, things become simpler. The only difficulty is the initial setup and once you are past that, the overall maintenance workload just becomes easier compared to a single VM setup
> I think a lot of people do things the hard way to learn large scale infrastructure
Having seen some of these half-rolled, first-time-understood k8s deployments, and the multi-year projects to unravel the mess that was created, overflowing with anti-patterns and other incorrect ways of doing things, I think I would prefer a narrower scope of true experienced professionals (or at least some experienced pros that can help guide the ship for their mentees) working on and designing k8s infra.
And for those that don't need it (the vast majority of startups, small businesses, regular-sized businesses, etc), just stick to the easier-to-use paradigms out there.
Nubank, the Brazilian bank unicorn, described their approach as “if this works, it’s because we reached massive scale quickly” (paraphrased) and started with an architecture that would support that from the beginning. They were very happy with their choices and have blogged about them in detail.
This is a case where “things will be much easier when we scale to a massive number of clients” turned out to be true.
This is a retreaded and often tiresome debate. I'll still throw my 2c in...
Should you pick a complex framework from day one? Probably not, unless your team has extensive experience with it.
My objection is towards the idea that managing infrastructure with a bespoke process and custom tooling will always be less effort to maintain than established tooling. It's the idea of stubbornly rejecting the "complexity" bogeyman, even when the process you built yourself is far from simple, and takes a lot of your time from your core product anyway.
Everyone loves the simplicity of copying over a binary to a VPS, and restarting a service. But then you want to solve configuration and secret management, have multiple servers for availability/redundancy so then you want gradual deployments, load balancing, rollbacks, etc. You probably also want some staging environment, so need to easily replicate this workflow. Then your team eventually grows and they find that it's impossible to run a prod-like environment locally. And then, and then...
You're forced to solve each new requirement with your own special approach, instead of relying on standard solutions others have figured out for you. It eventually gets to a question of sunken cost: do you want to abandon all this custom tooling you know and understand, in favor of "complexity" you don't? The difficult thing is that the more you invest in it, the harder it will be to migrate away from it.
My suggestion is: start by following practices that will make your transition to the standard tooling easier later. This means deploying with containers from day 1, adopting the twelve-factor methodology, etc. And when you do start to struggle with some feature you need, switch to established tooling sooner rather than later. You're likely to find that your fear of the unknown was unwarranted, and you'll spend less time working on infra in the long run.
There's no correct answer here. Your choice seems reasonable _if_ you already have some previous familiarity with managing k8s. If not, you might want to consider starting with a managed k8s solution from a cloud provider. The bulk of the work will be containerizing your stack, and getting familiar with all the concepts. You don't want to do all that while also keeping k8s running. After that you would be able to relatively easily migrate to a self-hosted cluster if you need to.
If you do want to self-host, k3s could also be an option, like a sibling comment suggested. It's simpler to start with, though it still has a learning curve since it's a lightweight version of k8s. I reckon that you would still want to run at least 3 nodes for redundancy/failover, and maybe a couple more just for DB workloads. But you can certainly start with one to set up your workflow, and then scale out to more nodes as needed.
k3s single node + ArgoCD/Flux is what I would do if I had to build the infrastructure of a small startup by myself.
Unfortunately it's HN so people are more likely to do everything in bash scripts and say a big "fuck you" to all new hires that would have to learn their custom made mess
The other aspect of this is it's literally impossible to hire someone from industry already familiar with your home grown SDLC systems. But you can find plenty of "cloud engineers" who do understand these "complex" cloud systems who can deploy and maintain them via terraform. It's a turn-key skill set.
These are the only things I have ever been comfortable using in the cloud.
Once you get into FaaS and friends, things get really weird for me. I can't handle not having visibility into the machine running my production environment. Debugging through cloud dashboards is a shit experience. I think Microsoft's approach is closest to actually "working", but it's still really awful and I'd never touch it again.
The ideal architecture for me after 10 years is still a single VM with monolithic codebase talking to local instances of SQLite. The advent of NVMe storage has really put a kick into this one too. Backups handled by snapshotting the block storage device. Transactional durability handled by replicating WAL, if need be.
Dumbass simple. Lets me focus on the business and customer. Because they sure as hell don't care about any of this and wouldn't pay any money for it. All this code & infra is pure downside. You want as little of it as possible.
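For the SQLite side, a minimal sketch of that approach (file names are illustrative): WAL lets readers run alongside the single writer, and SQLite's online backup API gives a consistent copy to complement block-device snapshots.

```python
import sqlite3

# Enable WAL so reads don't block the single writer.
conn = sqlite3.connect("app.db")
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("INSERT INTO orders (total) VALUES (?)", (9.99,))
conn.commit()

# Take a consistent online copy without stopping the service;
# this pairs well with periodic snapshots of the block device.
backup = sqlite3.connect("app-backup.db")
conn.backup(backup)
backup.close()
conn.close()
```

For continuous WAL replication rather than point-in-time copies, a tool in the spirit of Litestream would do the shipping, but the core idea is the same.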
This is the most expensive way to build cloud services. When people talk about the cloud being more expensive than on-prem this is often the reason why. If you're just going to run VMs 24/7 there are better options.
Even the book on Microservices says “First build the Monolith”. You don’t know how to split your system until you have actually got some traction with users, and it’s easier to split a monolith than to reorganize services.
You may never need to split your monolith! Stripe eventually broke some stuff out of their Rails monolith but it gets you surprisingly far.
You are not going to get easier to debug than a Django/Rails/etc monolith.
A bit of foresight on where you want to go with your infra can help you though; I built the first versions of our company as a Django Docker container running on a single VM. Deploy was a manual “docker pull; docker stop; docker start”. This setup got us surprisingly far. Docker is nice here as a way of sidestepping dependency packaging issues, which can be annoying in the early stages (e.g. does my server have the right C header files installed for that new DB driver I installed? Setup will be different than on your Mac!)
We eventually moved to k8s after our seed extension in response to a business need for reliability and scalability; k8s served us well all the way through Series B. The setup of having everything Dockerized made that really easy too - but we aggressively minimized complexity in the early stages.
Yes! Also, use the damn framework, instead of rebuilding shitty versions of features it offers! One good seasoned person will outperform 10 non-seasoned people in this regard. It will add up over time. I think half the real reason people are soured to monoliths is because they are bad, poorly run monoliths.
> Even the book on Microservices says “First build the Monolith”.
And yet, funnily enough, the book on Monoliths says to break things up into smaller services! It says your data should be stored in its own service (possibly multiple services, if you need multi-paradigm access [e.g. relational, full-text search, etc.]). The user experience should use its own service. And, at the very least, you should have another service in between (this is where Django and Rails usually fit). Optionally, it says, you will probably want to have additional services as well (auth, financial transactions, etc.)
I currently have distilled, compact Puppet code to create a hardened VM of any size on any provider that can run one or more Docker services, run a Python backend directly, or serve static files. With this I create a service on a Hetzner VM in 5 minutes whether the VM has 2 cores or 48 cores, and control the configuration in source-controlled manifests while monitoring configuration compliance with a custom Naemon plugin. A perfectly reproducible process. The startup kids are meanwhile doing snowflakes in the cloud, spending many KEUR per month to have something that is worse than what devops pioneers were able to do in 2017. And the stakeholders are paying for this ship.
I wrote a more structured opinion piece about this, called The Emperor's New Clouds:
https://logical.li/blog/emperors-new-clouds/
This does not happen with Puppet + Linux, because LTS distributions have a long release cycle where compatibility is not broken.
I tried to explain this topic in the article linked above. Not sure how far I succeeded.
I got pinched by this myself: security.
- With the cloud threats, everything needs to be constantly up-to-date. Docker images make this easier than long-lived servers that need to be upgraded in place. We used to upgrade every week; now we're upgraded by default. So yes, sometimes our images don't start with the latest version of xyz. But this is rare, downgrading is easy with Docker, and reproduction on a dev machine is easier.
- With the cloud threats, everything needs to be isolated. Docker makes it easy to have an Alpine with no other executable than strictly necessary, and only open ports to the required services.
I hate the cloud because 4GB/2CPU should be way enough to run extremely large workloads, but I had to admit that convenience made me switch.
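The Alpine isolation mentioned above can be sketched as a minimal Dockerfile (names, port, and entrypoint are illustrative):

```dockerfile
# Nothing in the image but the runtime and the app itself.
FROM python:3.12-alpine
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Run unprivileged and expose only the one port the service needs.
RUN adduser -D appuser
USER appuser
EXPOSE 8000
CMD ["python", "main.py"]
```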
But there's a middle ground here too. To me there's a HUGE gap between Kubernetes distributed systems and shell script free for all.
It was a big reason why we moved to containers at the bare minimum: it's quick and easy to spin up and destroy, and you are guaranteed that what runs locally runs in prod. No more "well, it worked on my system!".
The bad cloud infrastructure is when people try to use every single thing AWS sells and their whole infrastructure is at super high levels of abstraction that they could never migrate to another platform. K8s isn't that at all.
While I am a happy cloud infrastructure user in private, I have to go through some extra hoops to deploy applications at work, regardless of if k8s is used or not.
However, I ran kubeadm on a hetzner server and it's just sat chugging along forever basically. I use the cluster to run ephemeral apps where I build and deploy 1 golang service, a couple of node services in about 60 seconds ( with cache, obviously ).
As someone old enough and skilled enough to do the same with Puppet, why bother when it's so much simpler that even the kids who don't understand TLS can do it with k8s?
With k8s you get a way of saying 'WHAT YOU WANT' without 'HOW TO DO IT', and this applies not only to the actual infra aspect, but to the people maintaining it too. Any cloud platform or devops engineer worth their salt can maintain a k8s system. Good luck finding someone who understands what that 'custom Naemon' plugin is doing.
How do you control access to this setup?
How do you deploy on a different provider to Hetzner?
How do you access logs on this setup?
How do others maintain this setup?
How do you run backups?
How do you run cron jobs?
How do you deal with an offline node?
How do you expose a new ingress?
How do you provision extra storage on this setup?
If any of those is answered with 'something homegrown' or 'just write a script' then you have all the reasons k8s is worth it.
If I was to write an idempotent script for each native resource I would finish in some years :-)
You choose whatever monitoring system you like most.
For offline nodes you use whatever the level of criticality of your node justifies. This is something people struggle to understand: not every business needs 99.99% uptime. That said, I never had a downtime in Hetzner. On Digital Ocean I had one short forced reboot in 4 years. YMMV, so protect yourself as much as necessary.
Deploying on a different provider than Hetzner is the same as deploying on Hetzner except the part of launching the machine which is trivial to script - the added value is making the machine work and Ubuntu/Debian/RHEL are the same everywhere. You don't have vendor lock in with this.
If K8s works for you, enjoy it. Nobody is telling you to stop :-)
- https://github.com/kube-hetzner/terraform-hcloud-kube-hetzne...
- https://www.hetzner.com/hetzner-summit --> "Managed Kubernetes Insights and lessons learned from developing our own Kubernetes platform"
You mentioned a Python backend, so literally just replicate the build steps directly on the VPS: "pip install -r requirements.txt" > "python main.py" > "nano /etc/systemd/system/myservice.service" > "systemctl start myservice" > tada.
You can scale instances by just throwing those commands into a bash script (build_my_app.sh) = your new Dockerfile... install on any server in xx-xxx seconds.
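The systemd step above can be sketched as a minimal unit file (service name, user, and paths are illustrative):

```ini
# /etc/systemd/system/myservice.service
[Unit]
Description=My Python backend
After=network.target

[Service]
User=myapp
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/venv/bin/python main.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After writing it, `systemctl daemon-reload` followed by `systemctl enable --now myservice` starts the service and makes it survive reboots.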
For Python backends I often deploy the code directly with a Puppet resource called vcsrepo, which basically places a certain tag of a certain repo at a certain filesystem location. And I also package the systemd units for easy start/stop/restart. You can do this with other config management tools, via bash, or by hand, depending on how many systems you manage.
What bothers me with your question is Pip :-) But perhaps that is off topic...?
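Assuming the standard puppetlabs/vcsrepo module, a sketch of that pattern might look like this (repo URL, paths, and tag are illustrative, and the unit file is assumed to live in the module):

```puppet
# Pin a specific tag of the app repo to a filesystem location.
vcsrepo { '/opt/myapp':
  ensure   => present,
  provider => git,
  source   => 'https://example.com/myapp.git',
  revision => 'v1.4.2',
}

# Ship the systemd unit alongside it and keep the service running.
file { '/etc/systemd/system/myapp.service':
  ensure => file,
  source => 'puppet:///modules/myapp/myapp.service',
  notify => Service['myapp'],
}

service { 'myapp':
  ensure => running,
  enable => true,
}
```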
We have a simple cloud infrastructure. Last year, we moved all our legacy apps to a Docker-based deployment (we were already using Docker for newer stuff). Nothing fancy—just basic Dockerfile and docker-compose.yml.
Advantages:
- Easy to manage: we keep a repo of docker-compose.yml files for each environment.
- Simple commands: most of the time, it’s just "docker-compose pull" and "docker-compose up."
- Our CI pipeline builds images after each commit, runs automated tests, and deploys to staging for QA to run manual tests.
- Very stable: we deploy the same images that were tested in staging. Our deployment success rate and production uptime improved significantly after the switch—even though stability wasn’t a big issue before!
- Common knowledge: everyone on our team is familiar with Docker, and it speeds up onboarding for new hires.
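A sketch of what one of those per-environment compose files can look like (image names, tags, and ports are illustrative):

```yaml
# docker-compose.yml for one environment; the image tag is the
# exact build that already passed QA in staging.
services:
  web:
    image: registry.example.com/myapp:1.42.0
    ports:
      - "8080:8080"
    env_file: .env.production
    restart: unless-stopped
  db:
    image: postgres:16
    volumes:
      - dbdata:/var/lib/postgresql/data
volumes:
  dbdata:
```

Deploying then really is just "docker-compose pull" and "docker-compose up -d" against this file.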
I have found that going all-in with certain language/framework features, such as self-contained deployments, can allow for really powerful sidestepping of this kind of operational complexity.
If I was still in a situation where I had to ensure the right combination of runtimes & frameworks are installed every time, I might be reaching for Docker too.
For example, if you have a program that uses wsgi and runs on python 2.7, and another wsgi program that runs on python 3.16, you will absolutely need 2 different web servers to run them.
You can give different ports to both, and install an nginx on port 80 with a reverse proxy. But software tends to come with a lot of assumptions that make ops hard, and they will often not like your custom setup... but they will almost certainly like a normal docker setup.
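A sketch of that nginx front, assuming the two WSGI apps are served on local ports (ports and URL prefixes are illustrative):

```nginx
server {
    listen 80;

    # The Python 2.7 app, served by its own (older) WSGI server.
    location /legacy/ {
        proxy_pass http://127.0.0.1:8001/;
        proxy_set_header Host $host;
    }

    # The modern Python app on its own WSGI server.
    location / {
        proxy_pass http://127.0.0.1:8002/;
        proxy_set_header Host $host;
    }
}
```

This works, but every app that assumes it owns port 80 or the URL root will fight the setup, which is the "assumptions that make ops hard" point above.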
Forget about "clunky overhead" - the running costs are < 10%. The Dockerfile? You don't even need one. You can just pull the Python version you want, e.g. python:3.11, and git pull your files inside the container to get up and running. You don't need to use container image saving systems, you don't need to save images or tag anything, you don't need to write setup scripts in the Dockerfile, and you can pass the database credentials through the environment option when launching the container.
The problem is that after a year or two you get clashes or weird stuff breaking, and modules stop supporting your Python version, preventing you from installing new ones. Case in point: Google's AI module (needed for Gemini and lots of their AI API services) only works on 3.10+. What if you started in 2021? Your Python - then cutting edge - would not work anymore, only 3.5 years after that release. Yeah, you can use loads of curl. Good luck maintaining that for years though.
NumPy 1.19 is calling np.warnings, but some other dependency is using NumPy 1.20, which removed .warnings and made it .notices or something.
Your cached model routes for transformers changed default directory
You update the dependencies and it seems fine, then on a new machine you try and update them, and bam, wrong python version, you are on 3.9 and remote is 3.10, so it's all breaking.
It's also not simple in the following respect: your requirements.txt file will potentially have dependency clashes (despite running code), might take ages to install on a 4GB VM (especially if you need pytorch because some AI module that makes life 10x easier rather needlessly requires it).
Life with Docker is worth it. I was scared of it too, but there are three key benefits for the everyman / solodev:
- Literally docker export the running container as a .tar to install it on a new VM. That's one line and guaranteed the exact same VM, no changes. That's what you want, no risks.
- Back up is equally simple; shell script to download regular back ups. Update is simple; shell script to update git repo within the container. You can docker export it to investigate bugs without affecting the production running container, giving you an instant local dev environment as needed.
- When you inevitably need to update python you can just spin up a new VM with the same port mapping on Python 3.14 or whatever and just create an API internally to communicate, the two containers can share resources but run different python versions. How do you handle this with your solution in 4 years time?
- If you need to rapidly scale, your shell script could work fine, I'll give you that. But probably it takes 2 minutes to start on each VM. Do you want a 2 minute wait for your autoscaling? No you want a docker image / AMI that takes 5 seconds for AWS to scale up if you "hit it big".
Sorry, but you've got no idea what you're talking about.
You can also run OCI images, often called Docker images, directly via systemd's nspawn. Docker doesn't create overhead by itself; at its heart it's a wrapper around kernel features and iptables.
You don't need Docker for deployments, but let's not use completely made-up bullshit as arguments, okay?
If you use it as IaaS, it's a lot quicker to get prototypes working than if you use anything else, including VPS's from other providers.
Google Cloud in particular has very few vectors for lock-in, and follows more principle of least surprise.
But once you have prototyped, you should ask the question about rebuilding it somewhere that is cheaper.
Near infinite scalability of disk drives is nice, and snapshotting, and cloud in general can allow you to extend your prototype into taking production load and allowing you to measure what you will need; but leaning in to "cloud magick" (cloud run, lambdas, etc) will consume almost as much time to learn and debug as just doing it the old school way anyway. In my lived experience.
The biggest problem is the so called cloud native stuff which is both more expensive and more complex. There are contexts where it makes sense but for startups they are doing more harm than good.
Two examples that I came across
- "Tested" means it passed on CI. Tests fail to run locally? Who does development locally anyway?
- Teams so reliant on "AI" because this is the future of coding. "how to sort a list in python" became a prompt, rather than a lookup on the official documentation.
I’m still very much an ansible noob, but if you have a repo with playbooks I’d love to poke around and learn some things! If not, no worries, I appreciate your time reading this comment!
While I absolutely agree with you and your approach, would you mind elaborating what kind of configuration compliance you are referring to in this statement? I suppose you do not mean any kind of configuration that your Puppet code produces as that configuration is "monitored", or rather managed, by Puppet.
This case is actually pretty simple.
Puppet applies the configuration you declare idempotently when you run the Puppet agent: whatever is not configured gets configured; whatever is already configured stays the same.
If there is an error the return code of the Puppet agent is different from that of the situations above.
Knowing this, you can choose to trigger the Puppet agent runs remotely from a monitoring system (instead of periodic local runs), collecting the exit code and monitoring the status of that exit code inside the monitoring system.
Therefore, instead of having an agent that runs silently, leaving you logs to parse, you have a green light / red light system regarding the compliance of a machine with its manifest. If somebody broke the machine, leaving it in an unconfigurable state, or if someone broke its manifest during configuration maintenance, you will soon get a red light and the corresponding notifications.
This is active configuration management rather than what people usually call provisioning.
Of course you need an SSH connection for this execution, and with that you need a hardened SSH config, whitelisting, a dedicated unprivileged user for monitoring, exceptional fine-grained sudo cases, etc. Not rocket science.
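A sketch of such a check, not the actual plugin described above: Puppet's `--detailed-exitcodes` flag reports 0 for an in-sync run, 2 when changes were applied, and 4 or 6 when resources failed, which maps naturally onto monitoring states (host and user names are illustrative):

```shell
# map_exit translates Puppet's --detailed-exitcodes values
# into Nagios/Naemon-style plugin states.
map_exit() {
    case "$1" in
        0) echo "OK - no changes needed"; return 0 ;;   # already compliant
        2) echo "OK - drift corrected"; return 0 ;;     # changes were applied
        4|6) echo "CRITICAL - resources failed"; return 2 ;;
        *) echo "UNKNOWN - agent run failed"; return 3 ;;
    esac
}

# The check itself: trigger the agent remotely over the hardened
# SSH setup and classify the result for the monitoring system.
check_host() {
    ssh -o BatchMode=yes "monitor@$1" \
        sudo /opt/puppetlabs/bin/puppet agent --onetime --no-daemonize --detailed-exitcodes
    map_exit $?
}
```

The monitoring system then simply alerts on the plugin state, giving the green/red light behavior without parsing agent logs.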
Sometimes the job descriptions are boastful in their reference to those technologies, and other times you can detect some level of despair.
The additional benefit is devs can run all the same stuff on a Linux laptop (or Linux VM on some other platform) - and everyone can have their own VM in the cloud if they like to demo or test stuff using all the same setup. Bootstrapping a new system is checking in their ssh key and running a shell script.
Easy to debug, not complex or expensive, and we could vertically scale it all quite a ways before needing to scale horizontally. It's not for everyone, but seed stage and earlier - totally appropriate imo.
If it interests you, both major git hosts (and possibly all of them) have an endpoint to map a username to their already registered ssh keys: https://github.com/mdaniel.keys https://gitlab.com/mdaniel.keys
It's one level of indirection away from "check in a public key" in that the user can rotate their own keys without needing git churn
Also, and I recognize this is departing quite a bit from what you were describing, ssh key leases are absolutely awesome because it addresses the offboarding scenario much better than having to reconcile evicting those same keys: https://github.com/hashicorp/vault/blob/v1.12.11/website/con... and while digging up that link I also discovered that Vault will allegedly do single-use passwords, too <https://github.com/hashicorp/vault/blob/v1.12.11/website/con...>, but since I am firmly in the "PasswordLogin no" camp, caveat emptor with that one
Both are Apache 2 and the Flatcar folks are excellent to work with
https://github.com/flatcar/Flatcar#readme
https://github.com/bottlerocket-os#bottlerocket
I've been running my SaaS first on a single server, then after getting product-market fit on several servers. These are bare-metal servers (Hetzner). I have no microservices, I don't deal with Kubernetes, but I do run a distributed database.
These bare-metal servers are incredibly powerful compared to virtual machines offered by cloud providers (I actually measured several years back: https://jan.rychter.com/enblog/cloud-server-cpu-performance-...).
All in all, this approach is ridiculously effective: I don't have to deal with complexity of things like Kubernetes, or with cascading system errors that inevitably happen in complex systems. I save on development time, maintenance, and on my monthly server bills.
The usual mantra is "but how do we scale" — I submit that 1) you don't know yet if you will need to scale, and 2) with those ridiculously powerful computers and reasonable design choices you can get very, very far with just 3-5 servers.
To be clear, I am not advocating that you run your business in your home closet. You still need automation (I use ansible and terraform) to manage your servers.
Did you read the article or just the headline?
Scroll down to the bottom, under the section "A few considerations" and try not to laugh.
"A few considerations" turns out to be a pretty significant chunk of security work ESPECIALLY if you are storing/transmitting highly sensitive information.
How do you handle something like HIPAA compliance when you're in this situation?
There are 2 types of programmers: those that think they've seen everything and those that know they've seen next to nothing. And as such, these absolute takes are tiring.
I'm a dev who hasn't seen anything related to that. Since you bring it up, can you give some pointers on why something like a MySQL db coupled to a monolithic backend isn't good enough? What shortcomings did you experience?
All of the things raised in the article seem possible to solve without the need for microservices.
It's when one starts getting sucked down the "cloud native" wormhole of all these niche open source systems and operators and ambassador and sidecar patterns, etc. that things go wrong. Those are for environments with many independent but interconnecting tech teams with diverse programming language use.
But I think for many "kubernetes" means your second paragraph. It doesn't have to be like that at all! People should try setting up a k3s cluster and just learning about workloads, services, and ingresses. That's all you need to replace a bunch of ad hoc VMs and docker stuff.
I wish there were something that did just that, because kube comes with a lot of baggage, and docker-compose is a bit too basic for some important production needs.
https://github.com/hadijaveed/docker-compose-anywhere
I can have something with nice deployments, super easy logs and metrics, and a nice developer experience setup in no time at all.
What I actually got was a half an hour tutorial from the guy who set it up, in which he explained the whole concept (I had no clue) and gave me enough information to deploy a server, which I did with zero problems. I had automatic deployment from `git push` working very quickly.
To me this seemed like a no brainer. Unless you literally have one service this is waaay easier to use.
Granted I didn't have to set it up - maybe that's where the terrible reputation comes from?
Seriously, I think a lot of people do things the hard way to learn large scale infrastructure. Another common reason is 'things will be much easier when we scale to a massive number of clients', or we can dynamically scale up on demand.
These are all valid to the people building this, just not as much to founders or professional CTOs.
Some people seem to have no concern with the needs and timetables of the would be customers but instead burn through cash building fancy nonsense.
It's like going in to a car mechanic for tires and then finding out it took 3 weeks because the guy wanted to put on low rider hydraulics and spinner hubcaps for his personal enrichment.
The worst part is it's inherently ambiguous to the next people. They don't know if the reason something is there is because it's needed or because it's just shiny bling.
I don’t quite get if people do it for interest, for love of the tech, or if they are technocratic and believe in levelling up their skill to get k8s on their CV like you say.
All I think is “this looks painful to manage”!
I run a k8s cluster at home. Part of it yes, is to apply my existing skills and keep them fresh. But part of it is that kubernetes can be easier long term.
I've got magical hard drive storage with Rook Ceph. I can yank a hard drive out of my servers and nothing happens to my workloads.
I can do maintenance on one of the servers with 0 down time.
All of my config for what I have deployed is in git.
I manage VMs and kubernetes at work, and I'm not going to pretend that kubernetes isn't complex, but it's complex up front instead of down the road. VMs run into complexity when things change. I'm sure you can make VMs good, but then why not use something like kubernetes? You will have to reinvent a lot of the stuff that's already in kubernetes.
It's a hammer for sure and not everything is a nail, but it can be really powerful and useful even for home labs.
It's a bit like factorio with the extra dopamine hit of getting to unbox stuff.
You don't need k8s for all of that, but there's not a simpler solution than k8s that handles as much.
Life is full of pain. Deal with it.
Having seen some of these half-rolled, first-time-understood k8s deployments, and the multi-year projects to unravel the mess that was created, overflowing with anti-patterns and other incorrect ways of doing things, I think I would prefer a narrower scope of true experienced professionals (or at least some experienced pros that can help guide the ship for their mentees) working on and designing k8s infra.
One approach that I’ve considered is to start with the standard tooling (k8s + gitops) from day one, but still run it in a single VM. Any thoughts?