HA is overrated; I'd much rather go for a low mean time to repair.
Backups, reinstall, and an Ansible playbook are my way to go if hardware fails, which is quite rare to be honest. HA goes beyond hardware: electricity, internet connection, storage, location, etc. IMO people quite often underestimate what real HA means: a second location with an identical setup to shift the workload to, or even active-active instead of active-passive.
I have an Intel NUC as a plain Debian server with containers on it, two Raspberry Pis (acting as load balancers with Traefik and Authelia on top), and a Hetzner VM, all connected through WireGuard.
Everything is configured via Ansible, and I rely on LE certificates to connect via HTTPS, or directly via the WireGuard VPN to HTTP if I want a service exposed only over the VPN.
Encrypted backups are copied via WireGuard to an offsite storage box.
Full outage to back online, if the hardware is not damaged, is less than 15 minutes.
It is very easy to do and rock solid; unattended-upgrades does the trick.
I tried almost every combination, and even though I have been a k8s user since version 1.2, the complexity of k8s at home (or even vSphere) is too much for me compared to the super simple and stable configuration I now have.
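Roughly, that sub-15-minute recovery path can be sketched like this (restic stands in for whatever backup tool is actually used; hostnames, inventory layout, and repo paths are illustrative):

    # Fresh Debian box back to serving, assuming backups and playbooks survive
    apt-get update && apt-get install -y ansible wireguard restic

    # Reapply the whole machine config from version control
    ansible-playbook -i inventory/home site.yml

    # Pull the encrypted offsite backups back down over WireGuard
    # ("storagebox" is a placeholder SSH alias for the offsite box)
    restic -r sftp:storagebox:/backups restore latest --target /srv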
> HA is overrated; I'd much rather go for a low mean time to repair.
Those are almost entirely independent domains; the only point where they connect is that
> Full outage to back online, if the hardware is not damaged, is less than 15 minutes.
automation makes both easier to deal with.
"Mean time to repair" is all fine if you don't have to drive 30 minutes to datacenter to swap out stuff.
Even if it's cloudy cloud and you have backups, restores can take a lot of time.
Also, you will want HA when it's your internet gateway that dies.
There are also levels of implementation. Making a fully redundant set of load balancers is relatively easy, but any stateful app is much, much harder. For applications, active-passive setups are also much easier than full active-active redundancy, especially if the app wasn't written for it.
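The easy end of that spectrum is a floating IP via VRRP, e.g. with keepalived; a minimal sketch (addresses, interface, and priorities are made up):

    # Active-passive LB pair: one virtual IP floats to whichever box is MASTER
    cat > /etc/keepalived/keepalived.conf <<'EOF'
    vrrp_instance VI_1 {
        state MASTER                 # set to BACKUP on the standby box
        interface eth0
        virtual_router_id 51
        priority 100                 # lower number on the standby
        virtual_ipaddress {
            192.0.2.10/24            # clients only ever talk to this IP
        }
    }
    EOF
    systemctl restart keepalived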
Quick-to-repair is also a lot more versatile. Unfortunately, I have had to work in a lot of environments with proprietary, vendor-locked software and hardware. Usually all you can do is design the system so that you can chuck entire parts of it (possibly for rework, but sometimes not) if something breaks or gets compromised.
Definitely relevant for, say, SCADA controls with terrible security.
I completely agree, but there's an important balance to consider. HA is a good investment for things that are 'stateless', e.g. VIPs/LBs.
Rebuilding existing systems (or adding new ones) is excellent, but sometimes life really gets simpler when you have one well-known/HA endpoint to use in a given context.
tl;dr: if you need it, go for it :) My 2 cents: it's not needed at home, and even for some of my clients, the complexity vs. a simple setup with a few minutes of downtime still speaks for MTTR instead of an HA setup that no one can debug.
It's not about stress, but to have an HA Proxmox cluster you need at least 3 machines, or you fake one with a quorum machine that runs no VMs. Sure, your VMs will keep running if one machine goes down, but do you have HA storage underneath? Ganesha would work, but that's more complexity, or another network storage with yet more machines. Don't get me wrong, it is fun to play with, but I doubt any homelab needs HA and cannot tolerate a few minutes of downtime. And what do you do when your internet provider goes down, or when you have a power outage?

I don't want to provoke; I have fiber switches and 10G at home and two locations to switch between in case one goes down, but I can live with multiple days of downtime if I have to. If not, I take my backups, fire up some VMs at a cloud provider, pay for it, and am back online in a few minutes, because the backups are in both locations.
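For reference, the quorum-machine trick is built into Proxmox's own tooling (the IP is a placeholder; corosync-qdevice has to be installed on the extra box first):

    # On any cluster node: check current votes/quorum
    pvecm status

    # Register a small machine that only provides a quorum vote, no VMs
    pvecm qdevice setup 192.0.2.50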
I would much rather build a simple LB + autoscaling group with a deployment pipeline (Lambda or some other control loop) and containers on it than a k8s cluster the client is not prepared for. If they outgrow this, the next solution will most likely be better than going 100% "cloud" the first time. Most clients go from a Java 8 JBoss monolith to Spring containers in k8s and then wonder why it is such a shitshow. But hey, it pays the bills, so I am not complaining that often ^^
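A rough sketch of that shape with the AWS CLI (every name here, plus the pre-existing launch template and target group, is an assumption):

    # Containers on plain instances behind an ALB, scaled by an ASG
    aws autoscaling create-auto-scaling-group \
      --auto-scaling-group-name app-asg \
      --launch-template LaunchTemplateName=app-containers \
      --min-size 2 --max-size 6 \
      --target-group-arns "$TARGET_GROUP_ARN"

    # "Deployment pipeline": bump the image in the launch template, then roll
    aws autoscaling start-instance-refresh --auto-scaling-group-name app-asg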
It's absolutely fine, and I'd argue essential, to know Linux system administration even for people who mostly work in the cloud. From my recent experience, sysadmin skills seem to be a lost art: people no longer have to care about the underpinnings of a Linux system, and when problems hit, it shows.
It's interesting from a job-market standpoint how little incentive there is to learn the basics, when more and more we are seeing developers being made responsible for ops and admin tasks.
I think nowadays requiring that all application developers also do ops (DevOps?) is a bad idea. Sure, they should have basic shell skills, but when you're on Kubernetes or similar, understanding what's underneath is not vital. Instead, rely on specialized teams that actually want to know this stuff and become the experts you escalate to only when things really go sideways and the abstractions fail (which is rare if you do it well). If your budget is too small for this, there are always support contracts.
As someone who’s been hiring for both sides, I see this reflected in candidates more and more. The good devs rarely know ops, and the good ops rarely code well. For our “platform” teams, we end up just hiring good devs and teaching them ops. I think the people that are really good at both often end up working at the actual cloud providers or DevOps startups.
But then there's the problem that now you have two teams: the ops team doesn't understand the app, and the app team doesn't understand the ops/infra side. I of course agree with your point that there should be two teams, but you need a few people who understand both app dev and ops/infra/OSes/k8s internals, etc. And finding these people has been nearly impossible.
I see this in my firm. Non-IT management loves to resort to the lowest common denominator of 'systems', that is, systems built by people who are neither trained nor experienced as administrators, but who mainly follow online guides /and deliver/ (at p50). When I mention the resilience of the systems we maintain to senior directors, the most common replies are that 1. the activities are not production, and 2. the cost of letting our proper IT departments handle things goes way beyond the willingness to pay. When I reply that our non-production activities still cannot be missed for more than a few days (which defines them as production for me, but not in our IT-risk landscape), I usually get greeted with something along the lines of "in all things cloud, everything is different". For non-techies, I think it's just hard to fathom that most of the effort in maintaining systems goes into resilience, not into just scripting things to work repeatedly.
They most likely resorted to skipping IT involvement because the department is difficult to work with.
"The cloud" is both a technical solution for flexible workload requirements, and a political solution to allow others to continue delivering when IT is quoting them 3 months of "prep work" for setting up a dev environment and 2 to 4 weeks for a single SSL certificate.
As a consultant, I am confronted with hostile IT requirements daily. Oftentimes, the departments hiring us end up setting up and training their own ops teams outside of IT. Despite leading to incredible redundancy, that's often credited as a huge success for the department, because of the gained flexibility.
So many people somehow believe a single VM with no disk snapshots or backups, a floating IP, and ad hoc scripts to bring up the production workload is "production quality", only because it is running in the cloud.
I agree. I'm lucky that in my career I've been exposed to so many different disciplines of network and Linux administration that I know (mostly) what "the cloud" is made of.
So when problems do occur, I can make some pretty good guesses on what's wrong before I actually know for sure.
As a person who manages a big fleet of servers containing both pets and cattle, the upkeep of the pets is nowhere near what the cloud lovers drum it up to be.
A server installed with half-decent care can run uninterrupted for a long, long time, given minimal maintenance and the usual care (update, and reboot if you change the kernel).
Also, not installing an n+3 Kubernetes cluster with an external storage backend drastically reduces overhead and the number of moving cogs.
VMs, containers, K8S and other things are nice, but pulling the trigger blindly and assuming every new technology is a silver bullet for all problems is just not right on many levels.
As for home hardware, I'm running a single OrangePi Zero with DNS and SyncThing. That fits the bill, for now. Fitting into smallest hardware possible is also pretty fun.
> A server installed with half-decent care can run uninterrupted for a long, long time, given minimal maintenance and the usual care (update, and reboot if you change the kernel).
For my earlier home setups, this was actually part of the problem! My servers and apps were so zero-touch that by the time I needed to do anything, I'd forgotten everything about them!
Now, I could have meticulously documented everything, but... I find that pretty boring. The thing with Docker is that, to some extent, Dockerfiles are a kind of documentation. They also mean I can run my workloads on any server - I don't need a special snowflake server that I'm scared to touch.
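A toy example of what I mean (image, package, and config file are invented for illustration):

    # The whole "how is this service set up?" answer lives in one committed file
    cat > Dockerfile <<'EOF'
    FROM debian:bookworm-slim
    RUN apt-get update && \
        apt-get install -y --no-install-recommends syncthing && \
        rm -rf /var/lib/apt/lists/*
    # Every special-snowflake tweak is visible and reproducible here
    COPY config.xml /var/syncthing/config/config.xml
    CMD ["syncthing", "-no-browser", "-home=/var/syncthing/config"]
    EOF
    docker build -t registry.local/syncthing:latest .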
We have found that, while some applications are much easier to install with Docker, operating them becomes much harder in the long run.
NextCloud is a prime example. Adding some extensions (apps) to NextCloud becomes almost impossible when it's installed via Docker.
We have two teams, each with its own NextCloud installation. One is installed on bare metal, the other is a Docker setup. The bare metal one is much easier to update, add apps to, diagnose, and operate in general. The Docker installation needed three days of tinkering and head-banging to get what the other team enabled in 25 seconds flat.
To prevent such problems, JS Wiki runs a special container just for update duties, for example.
I'd rather live-document an installation and have an easier time in the long run than bang my head during some routine update or config change, to be honest.
I document my home installations the same way, too. It creates a great knowledge base in the long run.
Not all applications fit the scenario I described above, but Docker is not a panacea, nor a valid reason not to document something, in my experience and from my perspective.
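For what it's worth, app management on a Dockerized NextCloud usually goes through the occ CLI inside the container; it still tends to be fiddlier than the bare-metal admin UI (container and app names here are illustrative):

    # Run NextCloud's occ tool inside the container, as the web user
    docker exec -u www-data nextcloud php occ app:list
    docker exec -u www-data nextcloud php occ app:install calendar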
If you make a habit of performing all changes via CI (Ansible/Chef/Salt, maybe Terraform if applicable), you get this for free too. Treat your playbooks as "Dockerfiles".
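A minimal sketch of that habit (inventory, playbook, and file names are invented for illustration):

    # The playbook is the documentation; the host is rebuildable from it
    cat > site.yml <<'EOF'
    - hosts: pets
      become: true
      tasks:
        - name: Install base packages
          apt:
            name: [unattended-upgrades, nginx]
            state: present
        - name: Deploy nginx config from version control
          copy:
            src: files/nginx.conf
            dest: /etc/nginx/nginx.conf
    EOF

    # Review the diff in CI before anything touches a machine
    ansible-playbook -i inventory site.yml --check --diff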
Where I work, maintaining servers is a big pain in the ass because of the ever-growing security and regulatory compliance requirements. The rules make patching things an exercise in red-tape frustration.
Last year, when we were asked to redeploy our servers because the OS version was being added to a nope list, we decided to migrate to Kubernetes so we don't have to manage servers anymore (the nodes are a control-plane thing, so not our problem). Now we just build our stuff using whatever curated image is available and ship it without worrying about OS patches.
> Now we just build our stuff using whatever curated image is available and ship it without worrying about OS patches.
So you basically replaced "we regularly have to update the OS" with "we regularly have to pull the newest image". Maybe it's because I have a Linux admin background, but I don't see that big of a difference here. Oh, and just using whatever curated image is there doesn't necessarily provide you with a secure environment. [0]

[0] https://snyk.io/blog/top-ten-most-popular-docker-images-each...
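To make that concrete: a freshly pulled image is just another thing that needs patching, and a scan shows it (Trivy is one scanner option, assuming it's installed):

    # Freshly pulled "curated" image...
    docker pull nginx:latest

    # ...still carries whatever CVEs its base OS layers ship with
    trivy image nginx:latest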
The thing that helped me was realizing that for a homelab setup, running without the extra redundancy is fine. For me, that meant running k8s on a single box, because I was specifically trying to get experience with it, and putting the control plane and the actual workloads on a single machine simplified the whole thing to the point that it was easy to get up and running. I had gotten bogged down in setting up a full production-grade cluster, but that wasn't even remotely needed for what I was doing.
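With kubeadm, the single-box version boils down to letting the control-plane node schedule ordinary pods (the CIDR is illustrative; older versions use the "master" taint key instead):

    # Bring up a one-node cluster
    kubeadm init --pod-network-cidr=10.244.0.0/16

    # Allow regular workloads on the control-plane node
    kubectl taint nodes --all node-role.kubernetes.io/control-plane-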
Our biggest project (with a bunch of different infrastructure stuff) is also the one with the least time spent per instance. We have months where no ops person even logged on to their VMs.
Our highest-maintenance stuff is entirely "the dev fucked up" / "the dev is clueless": things like a server deciding to dump gigabytes of logs per minute, or running out of disk space because of some code error. The only difference compared to k8s is that the dead app would signal the dev first, not the actual monitoring we had.
And an honorable mention for the WordPress "developers" who take first place in "most issues per server" every single year, even though we have very few WP installations compared to everything else. Stuff like: a dev uploaded some plugins and got instantly hacked (partially; we force an outgoing proxy, which stopped a full compromise), we reverted it, and he uploaded the same stuff and got hacked again. So, nothing to do with the actual servers.
Stuff like setting up the automation to deploy a Ceph cluster took some time... once. Now it takes nothing. We even managed to plug it into the k8s cluster.
> A server installed with half-decent care can run uninterrupted for a long, long time, given minimal maintenance and the usual care (update, and reboot if you change the kernel).
It could even go untouched if you set up automated updates and reboots; the red tape is usually the problem, like dumbly written support deals requiring notification for every restart even when the service is HA. Or some dumbo selling a service with an SLA, but running on a single machine...
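On Debian/Ubuntu the unattended route is a few lines; the drop-in file name below is arbitrary, the options are standard unattended-upgrades ones:

    apt-get install -y unattended-upgrades

    # Let the box patch itself and reboot at a quiet hour when needed
    cat > /etc/apt/apt.conf.d/51auto-reboot <<'EOF'
    Unattended-Upgrade::Automatic-Reboot "true";
    Unattended-Upgrade::Automatic-Reboot-Time "04:00";
    EOF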
Even changing the kernel can often be done without rebooting the system, depending on which distro you're using. Quite a few distros now include some sort of kernel live-patching service, including Ubuntu, Amazon Linux, and RHEL.
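For example (both commands exist, but each needs the relevant subscription or setup first):

    # Ubuntu: status of Canonical's livepatch service, once enabled
    canonical-livepatch status

    # RHEL-style kpatch setups: which live patches are currently loaded
    kpatch list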
My home lab sounds pretty similar to the author's - three compute nodes running Debian, and a single storage node (single point of failure, yes!) running TrueNAS Core.
I was initially pretty apprehensive about running Kubernetes on the compute nodes, my workloads all being special snowflakes and all. I looked at off-the-shelf apps like Mailu for hosting my mail system, for instance, but I have some really bizarre Postfix rules that it wouldn't support. So I was worried that I'd have to maintain Dockerfiles, and a registry, and lots of config files in Git, and all that.
And guess what? I do maintain Dockerfiles, and a registry, and lots of config files in Git, but the world didn't end. Once I got over the "this is different" hump, I actually found that the ability to pull an entire node out of service (or have fan failures do it for me) more than makes up for the difference. I no longer have awkward downtime when I need to reboot (or have to worry that the machines will reboot), or little bits of storage spread across lots of machines.
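Pulling a node out really is a two-command affair (node name is illustrative):

    # Evict everything off the node so it can be rebooted or repaired
    kubectl drain node2 --ignore-daemonsets --delete-emptydir-data

    # ...swap the fan, reboot, then let workloads come back
    kubectl uncordon node2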
I fully agree. If you come from "old-school" administration, Docker and Kubernetes seem like massive black boxes that replace all the configuration knobs you know with fancy cloud terms. But once you get to know them, it just makes sense. Backups get a lot simpler, restoring state is easy, and keeping things separated becomes a lot easier.
That being said, I can only encourage the author in this plan. All those abstractions are great, but at least for me, it was massively valuable to know what you are replacing and what an old-school setup is actually capable of.
Once you work in an enterprise IDC, setting up home labs is nothing but a liability: unwanted stuff like a UPS, cooling, and more bills. I have one Intel NUC running FreeBSD 13 (dual SSD, 32 GB RAM, 8th-gen Intel i7) with one jail and a VM. It acts as a backup server for my Ubuntu laptop and MacBook Pro. Nightly, I dump data to cloud providers for offsite reasons. Finally, I set up LXD, Docker, and KVM on my Ubuntu dev laptop for testing. That is all. No more FreeNAS or 3 more servers, including 2 RPis. I made this change during the first lockdown. The fun and excitement went away once I had handled all that expensive, employer-funded IT equipment: servers, firewalls, routers, WAFs ;)
You’re still doing a lot of home serving/IT there.
I take your point and generally agree. Leave the major hardware to the actual data centers. Stuff at home should be either resume-driven or very efficient.
Agreed. I used to run enterprise hardware at home, and it was certainly fun to tinker with (not to mention all of the blinking lights!).
Last year I ran the numbers and realized that it just wasn't worth it, as electricity costs alone brought it fairly close to what Hetzner/OVH would charge for similar hardware. I also had power go down in my area, probably for the first time in the 5 years or so that I'd lived there, which I took as a sign to just migrate away.
I peer it into my internal network using WireGuard, so I barely notice a difference in my usage, and now that electricity costs are skyrocketing in Europe, I'm certainly very happy I went this route.
I love LXC/LXD for my home server. Far easier to use and maintain than VMs, fast, and they use far fewer resources. And understanding containers is great foundational knowledge for working with Docker and K8s. They also work great with NextCloud, Plex, PostgreSQL, Zabbix, and SAMBA. But each is separate: no risk of a library or OS upgrade taking out an app (and my weekend along with it). Snapshots are the ultimate ctrl-z (see the sketch below), and backups are a breeze once you get past the learning curve.
Ansible with 'pet' containers is the way to go. Use it to automate the backups and patching. Ansible playbooks are surprisingly easy to work with. Again, there's a learning curve, but it pays for itself within months.
Running a single machine with all the apps in a single environment is a recipe for tears, as you are always one OS upgrade, library requirement change, or hard drive failure away from disaster and hours of rebuilding.
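The snapshot-as-ctrl-z workflow looks roughly like this (container name is illustrative):

    # Checkpoint before risky changes
    lxc snapshot nextcloud before-upgrade
    lxc exec nextcloud -- apt-get -y dist-upgrade

    # If the upgrade eats the weekend anyway, roll it back
    lxc restore nextcloud before-upgrade

    # And a simple offline backup of the whole container
    lxc export nextcloud nextcloud-backup.tar.gz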
This sounds like my setup exactly, except instead of LXC/Ansible, I'm using FreeBSD jails/SaltStack.
I completely agree about the value of isolation. You can update or rebuild for a new OS version one service at a time, which I find helpful when pulling in updated packages/libraries. You can also cheaply experiment with a variation of your environment.
That's if you build it yourself from source or use distro packages. Running it via Canonical's snap packages is a bit of a nightmare because of issues around the forced auto-updates.
I've done that in the past. It was fun and nostalgic. Then I had to upgrade the OS to a newer version and things were not fun anymore. At all. And I remembered why VMs and containers are so extremely useful.
What does HA stand for?
High availability.
Thank you.
Checking uptime on my FreeBSD home server ... 388 days. Probably 15-year-old hardware. That thing's a rock.
I can't say I'm very proud of my security posture, though (:
One day I'll do boot-to-ZFS and transubstantiate into the realm of never worrying about failed upgrades again.
https://forums.freebsd.org/threads/ufs-boot-environments.796...
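Boot environments are the mechanism behind that; with a ZFS root, bectl makes upgrades roll-back-able (the BE name is arbitrary):

    # Checkpoint the current root before upgrading
    bectl create pre-upgrade
    freebsd-update fetch install

    # If the upgraded system misbehaves, boot back into the checkpoint
    bectl activate pre-upgrade && shutdown -r now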