Looking at the htop screenshot, I notice the lack of swap. You may want to enable earlyoom, so your whole server doesn't go down when a service goes bananas. The Linux Kernel OOM killer is often a bit too late to trigger.
You can also enable zram to compress RAM, so you can over-provision like the pros. A lot of long-running software leaks memory that compresses pretty well.
Even better than earlyoom is systemd-oomd[0] or oomd[1].
systemd-oomd and oomd use the kernel's PSI[2] information which makes them more efficient and responsive, while earlyoom is just polling.
earlyoom keeps getting suggested, even though we have PSI now, just because people are used to using it and recommending it from back before the kernel had cgroups v2.
Do you have any insight into why this isn't included by default in distros like Ubuntu? It's kind of bewildering that the default behavior on Ubuntu is to just lock up the whole system on OOM.
> systemd-oomd periodically polls PSI statistics for the system and those cgroups to decide when to take action.
It's unclear if the docs for systemd-oomd are incorrect or misleading; I do see from the kernel.org link that the recommended usage pattern is to use the `poll` system call, which in this context would mean "not polling", if I understand correctly.
Another option would be to have more memory than required (over-provision) and to adjust the OOM score per app, adding early-kill weight to non-critical apps and negative weight to important apps. oom_score_adj is already set to -1000 by OpenSSH, for example.
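For illustration, a minimal sketch of nudging the score from a shell (the /proc interface is standard; the value 500 is just an example weight for a non-critical process):

```shell
# Raise this shell's OOM score so it becomes a preferred victim.
# Range: -1000 (never kill) .. 1000 (kill first).
# Increasing the value needs no privileges; decreasing it needs root.
echo 500 > /proc/self/oom_score_adj

# Children inherit the score, so this prints 500:
cat /proc/self/oom_score_adj
```

Services would typically get this via their init system (e.g. systemd's OOMScoreAdjust=) rather than by hand.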
Another useful thing is to effectively disable over-commit on all staging and production servers (ratio 0 with overcommit_memory 0, rather than overcommit_memory 2 which fully disables it; these do different things, and mode 0 still uses a heuristic formula):
vm.overcommit_memory = 0
vm.overcommit_ratio = 0
Also set min_free and reserved memory based on installed memory, using a formula from Red Hat that I don't have handy. min_free can vary from 512KB to 16GB depending on installed memory.
At least that worked for me across about 50,000 physical servers for over a decade, servers that were not permitted to have swap and whose installed memory varied from 144GB to 4TB of RAM. OOM would only occur when the people configuring and pushing code would massively over-commit and not account for memory required by the kernel, or would not follow the best practices defined by Java, and that's a much longer story.
Another option is to limit memory per application with cgroups, but that requires more explaining than I am putting in an HN comment.
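For the curious, a minimal cgroups v2 sketch (the group name and limits are made up; needs root and a v2 mount at /sys/fs/cgroup):

```shell
mkdir /sys/fs/cgroup/myapp
echo 2G    > /sys/fs/cgroup/myapp/memory.max    # hard cap: OOM kill inside the group past this
echo 1536M > /sys/fs/cgroup/myapp/memory.high   # soft cap: throttle and reclaim first
echo $$    > /sys/fs/cgroup/myapp/cgroup.procs  # move the current shell (and children) in
```

In practice you'd usually let systemd manage this via MemoryMax=/MemoryHigh= on the unit instead of poking cgroupfs directly.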
Another useful thing is to never OOM kill in the first place on servers that only work in memory and need not commit anything to disk. So don't do this on a disk-backed database. This is for ephemeral nodes that should self-heal: wait 60 seconds so DRAC/iLO can capture the crash message, and then earth-shattering kaboom...
For a funny side note, those options can also be used as a holy hand grenade to intentionally, unsafely reboot NFS diskless farms when failing over to entirely different NFS server clusters: set panic to 15 minutes, trigger an OOM panic by setting min_free to 16TB at the command line via Ansible (not in sysctl.conf), swap clusters, ARP storm, and reconverge.
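Assuming standard sysctls, the grenade described above would look something like this (do NOT persist any of it in sysctl.conf):

```shell
# DANGER: intentionally panics and reboots the box.
sysctl -w kernel.panic=900                 # auto-reboot 15 minutes after a panic
sysctl -w vm.panic_on_oom=2                # turn any OOM into a kernel panic
sysctl -w vm.min_free_kbytes=17179869184   # demand ~16TB free -> immediate OOM
```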
Yeah, no way. As soon as you hit swap, _most_ apps are going to have a bad, bad time. This is well known, so much so that all EC2 instances in AWS disable it by default. Sure, they want to sell you more RAM, but it's also just true that swap doesn't work for today's expectations.
Maybe back in the 90s, it was okay to wait 2-3 seconds for a button click, but today we just assume the thing is dead and reboot.
This is a wrong belief, because a) SSDs make swap almost invisible, so you have that escape ramp if something goes wrong, and b) swap space is no longer solely an escape ramp that RAM overflows into.
In the age of microservices and cattle servers, reboot/reinstall might be cheap, but in the long run it is not. A long-running server, cattle though it may be, is always a better solution, because especially with some excess RAM the server "warms up" with all hot data cached and becomes a low-latency unit in your fleet, given you pay the required attention to your software development and service configuration.
Secondly, the kernel swaps out unused pages, relieving pressure on RAM. So swap is often used even if you fill just 1% of your RAM. This allows more hot data to be cached, giving better resource utilization and performance in the long run.
So, eff it, we ball is never a good system administration strategy. Even if everything is ephemeral and can be rebooted in three seconds.
Sure, some things like Kubernetes force "no swap, period" policies, because pods get killed when pressure exceeds some value, but for more traditional setups it's still valuable.
"as soon as you hit swap" is a bad way of looking at things. Looking around at some servers I run, most of them have .5-2GB of swap used despite a bunch of gigabytes of free memory. That data is never or almost never going to be touched, and keeping it in memory would be a waste. On a smaller server that can be a significant waste.
Swap is good to have. The value is limited but real.
Also not having swap doesn't prevent thrashing, it just means that as memory gets completely full you start dropping and re-reading executable code over and over. The solution is the same in both cases, kill programs before performance falls off a cliff. But swap gives you more room before you reach the cliff.
How programs use ram also changed from the 90s. Back then they were written targeting machines that they knew would have a hard time fitting all their data in memory, so hitting swap wouldn't hurt perceived performance too drastically since many operations were already optimized to balance data load between memory and disk.
Nowadays when a program hits swap it's not going to fall back to a different memory usage profile that prioritises disk access. It's going to use swap as if it were actual RAM, so you get to see the program choking the entire system.
The beauty of ZRAM is that on any modern-ish CPU it's surprisingly fast. We're talking 2-3 ms instead of 2-3 seconds ;)
I regularly use it on my Snapdragon 870 tablet (not exactly a top of the line CPU) to prevent OOM crashes (it's running an ancient kernel and the Android OOM killer basically crashes the whole thing) when running a load of tabs in Brave and a Linux environment (through Tmux) at the same time.
ZRAM won't save you if you do actually need to store and actively use more than the physical memory but if 60% of your physical memory is not actively used (think background tabs or servers that are running but not taking requests) it absolutely does wonders!
On most (web) app servers I happily leave it enabled to handle temporary spikes, memory leaks or applications that load a whole bunch of resources that they never ever use.
I'm also running it on my Kubernetes cluster. It allows me to set reasonable strict memory limits while still having the certainty that Pods can handle (short) spikes above my limit.
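For reference, one way to set up a zram swap device by hand (needs root; assumes the zram module and util-linux's zramctl are available, and the size is illustrative). Distros typically wrap this in a service like zram-generator or zram-config:

```shell
modprobe zram
zramctl --algorithm zstd --size 4G /dev/zram0   # compressed block device backed by RAM
mkswap /dev/zram0
swapon --priority 100 /dev/zram0                # prefer zram over any disk swap
```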
My 2 cents is that in a lot of cases swap is being used for unimportant stuff, leaving more RAM for your app. Do a "ps aux" and look at all the RAM used by weird stuff. Good news is those things will be swapped out.
Example on my personal VPS
$ free -m
               total        used        free      shared  buff/cache   available
Mem:            3923        1225         328         217        2369        2185
Swap:           1535        1335         200
It's not just 3 seconds for a button click, every time I've run out of RAM on a Linux system, everything locks up and it thrashes. It feels like 100x slowdown. I've had better experiences when my CPU was underclocked to 20% speed. I enable swap and install earlyoom. Let processes die, as long as I can move the mouse and operate a terminal.
Is it possible you misread the comment you're replying to? They aren't recommending adding swap, they're recommending adjusting the memory tunables to make the OOM killer a bit more aggressive so that it starts killing things before the whole server goes to hell.
> Maybe back in the 90s, it was okay to wait 2-3 seconds for a button click, but today we just assume the thing is dead and reboot.
My experience is the exact opposite. If anything, 2-3 second button clicks are more common than ever today, since everything has to make a roundtrip to a server somewhere, whereas in the 90s a 2-3s button click meant your computer was about to BSOD.
Edit: Apple recently brought "2-3s to open tab" technology to Safari[1].
YMMV. Garbage-collected/pointer-chasing languages suffer more from swapping because they touch more of the heap all the time. AWS suffers more from swap because EBS is ridiculously slow, and even their instance-attached NVMe is capped compared to physical NVMe sticks.
Does HDD vs SSD matter at all these days? I can think of certain caching use-cases where swapping to an SSD might make sense, if the access patterns were "bursty" to certain keys in the cache
What an ignorant and clueless comment. Guess what? Today's disks are NVMe drives, which are orders of magnitude faster than the 5400rpm HDDs of the 90s. Today's swap is 90s RAM.
I have also seen this on Android (I tested this on multiple devices: S23U, OnePlus 6 and 8): whenever I completely turned off swap, the phone would sometimes hang after a day or two of heavy usage!
It felt unintuitive, since these devices had a lot of RAM and shouldn't need swap. But turning off swap has always degraded performance for me.
Some workloads may do better with zswap. The cache is compressed, and pages are evicted to disk-based swap on an LRU basis.
The case of swap thrashing sounds like a misbehaving program, which can maybe be tamed by oomd.
System responsiveness though needs a complete resource control regime in place, that preserves minimum resources for certain critical processes. This is done with cgroupsv2. By establishing minimum resources, the kernel will limit resources for other processes. Sure, they will suffer. That’s the idea.
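With systemd as the cgroups v2 frontend, reserving resources for a critical unit can be as simple as the following (the unit name and values are examples, not recommendations):

```shell
# Under memory pressure, the kernel reclaims from everything else first.
systemctl set-property sshd.service MemoryLow=256M CPUWeight=500
```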
Yeah, I had a few servers lock up on me without any clear way to recover because some app was eating up RAM. I am OK with the server coming to a crawl as soon as swap has to be used, but at least it won't stop responding altogether.
Of course swap should be enabled. But oom killer has always allowed access to an otherwise unreachable system. The pause is there so you can impress your junior padawan who rushed to you in a hurry.
Sometimes swap seems to accumulate even though there is plenty of RAM. The default is too "greedy", probably tuned with desktops rather than servers in mind.
Therefore it is better to always tune "vm.swappiness" to 1 in /etc/sysctl.conf.
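E.g., a sketch of persisting that (the drop-in path and filename are conventional, not mandatory):

```shell
echo 'vm.swappiness = 1' > /etc/sysctl.d/99-swappiness.conf
sysctl --system        # reload all sysctl config files
sysctl vm.swappiness   # verify the running value
```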
You can also configure your web server / TCP stack buffers / file limits so they never allocate memory over the physical ram available. (eg. in nginx you can setup worker/connection limits and buffer sizes.)
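As a rough illustration of the kind of nginx caps meant here (all values are assumptions to size against your actual RAM and traffic, not recommendations):

```nginx
worker_processes  2;
events { worker_connections 1024; }   # ~2k concurrent connections overall
http {
    client_body_buffer_size 16k;      # per-request body buffer
    client_max_body_size    8m;       # reject larger uploads outright
    proxy_buffers           8 16k;    # per-connection upstream buffering
}
```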
Depends on the algorithm (and how much CPU is in use); if you have a spare CPU, the faster algorithms can more-or-less keep up with your memory bandwidth, making the overhead negligible.
And of course the overhead is zero when you don't page-out to swap.
> zram, formerly called compcache, is a Linux kernel module for creating a compressed block device in RAM, i.e. a RAM disk with on-the-fly disk compression. The block device created with zram can then be used for swap or as a general-purpose RAM disk
To clarify OP's representation of the tool, it compresses swap space, not resident RAM. Outside of niche use cases, compressing swap has overall little utility.
"The Practice of System and Network Administration" by Tom Limoncelli and Christine Hogan[1], together with "Principles of Network and System Administration" by Mark Burgess, are probably the books that influenced my approach to sysadmin the most. I still have them. Between them they covered, at a high level (at least back when I was a sysadmin, before devops and Kubernetes etc.), anything and everything from
- hardware, networks, monitoring, provisioning, server room locations in existing buildings, how to prepare server rooms
- and so on up to hiring and firing sysadmins, salary negotiations[2], vendor negotiations and the first book even had a whole chapter dedicated to "Being happy"
[1] There is a third author as well now, but those two were the ones that are on the cover of my book from 2005 and that I can remember
[2] Has mostly worked well after I more or less left sysadmin behind as well
If it is possible to boot Hetzner from a BSD install image using "Linux rescue mode"^1, then it should also be possible to run NetBSD entirely from memory using a custom kernel.
Every user is different, but this is how I prefer to run a UNIX-like OS for personal, recreational use; I find it more resilient.
To enable a swap file in Linux, first create the file, then initialize and activate it:

sudo dd if=/dev/zero of=/swapfile bs=1G count=1    (creates a 1GB file)
sudo mkswap /swapfile
sudo swapon /swapfile

To make it permanent, add /swapfile swap swap defaults 0 0 to your /etc/fstab file.
Just saw Nate Berkopec, who does a lot of Rails performance stuff, posting about the same idea yesterday, saying Heroku is 25-50x the price for the performance, which is insane. They clearly have zero interest in competing on price.
It's a shame they don't just license their whole software stack at a reasonable price, with a model similar to Sidekiq's, and let you sort out actually decent hardware. It's insane that Heroku has, if anything, gotten more expensive and worse compared to a decade ago, while similarly priced server hardware has gotten WAY better over that decade. $50 for a dyno with 1 GB of RAM in 2025 is robbery. It's even worse considering that running a standard Rails app hasn't changed dramatically from a resources perspective, and has, if anything, become more efficient. It's comical how many developers are shipping apps on Heroku for hundreds of dollars a month on machines with worse performance/resources than the MacBook they are developing on.
It's the standard playbook that damn near everything in society is following, though: just jacking up prices and targeting the wealthiest, least price-sensitive percentiles instead of making good products at fair prices for the masses.
Jacked-up prices aren't what is happening here. There is a psychological effect that Heroku and other cloud vendors are (wittingly or unwittingly) the beneficiary of. Customer expectations are anchored in the price they pay when they start using the service, and without deliberate effort, those expectations change in _linear_ fashion. Humans think in linear terms, while actual compute hardware improvements are exponential.
Heroku's pricing has _remained the same_ for at least seven years, while hardware has improved exponentially. So when you look at their pricing and see a scam, what you're actually doing is comparing a 2025 anchor to a mid-2010s price that exists to retain revenue. At the big cloud vendors, they differentiate customers by adding obstacles to unlocking new hardware performance in the form of reservations and updated SKUs. There's deliberate customer action that needs to take place. Heroku doesn't appear to have much competition, so they keep their prices locked and we get to read an article like this whenever a new engineer discovers just how capable modern hardware is.
I mean Heroku is also offering all of the ancillary stuff around their product. It's not literally "just" hosting. It's pretty nice to not have to manage a kube cluster, to get stuff like ephemeral QA envs and the like, etc....
Heroku has obviously stagnated now, but their stack is _very cool_ if you have a fairly simple system but still want all the nice parts of a more developed ops setup. It almost lets you get away with not having an ops team for quite a while. I don't know any other provider that is low-effort "decent" ops (Fly seems to directionally want to be the new Heroku, but is still missing a _lot_ in my book, though it also has a lot).
To be fair, AWS quite proudly talk about all the times they've lowered prices on existing services, or have introduced new generations that are cheaper (e.g. their Graviton EC2 instances).
> It's a shame they don't just license all their software stack at a reasonable price with a similar model like Sidekiq and let you sort out actually decent hardware
We built and open sourced https://canine.sh for exactly that reason. There’s no reason PaaS providers should be charging such a giant markup over already marked up cloud providers.
Heroku is pricing for "# of FTE headcount that can be terminated by switching to Heroku"; in that sense, this article's $3000/mo bill is well below 1.0 FTE/month at U.S. pricing, so it's not interesting to Heroku to address. I'm not defending this pricing lens, but it's why their pricing is so high: if you aren't switching to Heroku to lay off at least 1-2 FTE of salary per billing period, or using Heroku to replace a competitor's equivalent, Heroku's value assigned to you as a customer is net negative and they'd rather you went elsewhere. They can't slam the door shut on the small fry, or else the unicorns would start up elsewhere, but they can set the pricing in FTE terms, and VCs will pay it for their moonshots without breaking a sweat.
This looks decent for what it is. I feel like there are umpteen solutions for easy self-hosted compute (and tbh even a plain Linux VM isn't too bad to manage). The main reason to use a PAAS provider is a managed database with built-in backups.
> $50 for a dyno with 1 GB of ram in 2025 is robbery
AWS isn't much better, honestly: $50/month gets you an m7a.medium, which is 1 vCPU (not core) and 4GB of RAM. Yes, that's more memory, but it's no wonder AWS is making money hand over fist.
Not sure if it's an apples-to-apples comparison with Heroku's $50 Standard-2X dyno, but an Amazon Lightsail instance with 1GB of RAM and 2 vCPUs is $7/month.
AWS certainly also does daylight robbery. In the AWS model the normal virtual servers are overpriced, but not super overpriced.
Where they get you is all the ancillary shit, you buy some database/backup/storage/managed service/whatever, and it is priced in dollars per boogaloo, you also have to pay water tax on top, and of course if you use more than the provisioned amount of hafnias the excess ones cost 10x as much.
Most customers have no idea how little compute they are actually buying with those services.
That is assuming you need that 1 core 24/7, you can get 2 core / 8gb for $43, this will most likely fit 90% of workloads (steady traffic with spikes, or 9-5 cadence).
If you reserve that instance you can get it for 40% cheaper, or get 4 cores instead.
Yes, it's more expensive than OVH, but you also get everything AWS has to offer.
This, plus as a backup plan going from Heroku to AWS wouldn't necessarily solve the problem, at least with our infra. When us-east-1 went down this week so did Heroku for us.
Heroku is the Vercel of Rails: people will pay a fortune for it simply because it works. This has always been their business model, so it’s not really a new development. There’s little competition since the demand isn’t explosive and the margin is thin, so you end up with stagnation
I am not sure what's there to license. The hard and expensive part is in the labor to keep everything running. You are paying to make DevSecOps Somebody Else's Problem. You are paying for A Solution. You are not paying for software. There are plenty of Heroku clones mentioned in this thread.
I know you mean this sarcastically, but I actually 100% agree with this particular point about the steak. Especially with beef prices at all-time record highs and restaurant inflation being out of control post-pandemic. It takes so much of the enjoyment out of things for me if I feel I'm being ripped off left and right.
This argument doesn't work with such commoditized software. It's more like comparing an oil change for $100 plus an hour of research and a short drive against a convenient oil change right next door for $2,500.
Not the best comment but I agree with the sentiment. I fear far too often, people complain about price when there are competitors/other cheaper options that could be used with a little more effort. If people cared so much then they should just use the alternative.
No one gets hurt if someone else chooses to waste their money on Heroku, so why are people complaining? Of course it applies in cases where there aren't a lot of competitors, but there are literally hundreds of different options for deploying applications, and at least a dozen of them are just as reliable as and cheaper than Heroku.
It's just trendy to bash cloud and praise on-premises in 2025. In a few years that will turn around. Then in another few years it will turn around again.
The cloud has made people forget how far you can get with a single machine.
Hosting staging envs in pricey cloud envs seems crazy to me but I understand why you would want to because modern clouds can have a lot of moving parts.
Teaching a whole bunch of developers some cloud basics and having a few cloud people around is relatively cheap for quite a while. Plus, having test/staging/prod on similar configurations will help catch mistakes earlier. None of that "localstack runs just fine but it turns out Amazon SES isn't available in region antartica-east-1". Then, eventually, you pay a couple people's wages extra in cloud bills, and leaving the cloud becomes profitable.
Cloud isn't worth it until suddenly it is because you can't deploy your own servers fast enough, and then it's worth it until it exceeds the price of a solid infrastructure team and hardware. There's a curve to how much you're saving by throwing everything in the cloud.
Deploying to your private cloud requires basically the same skills: containers, k8s or whatnot, S3, etc. Operating a large DB on bare metal is different from using a managed DB like Aurora, but for developers the difference is hardly visible.
RDS/managed database is extremely nice I will admit, otherwise I agree. Similarly s3, if you're going to do object storage, then running minio or whatever locally is probably not cheaper overall than R2 or similar.
The cloud has made people afraid of Linux servers. The markup is essentially just the price business has to pay for developer insecurity. The irony is that self-hosting is relatively simple, and a lot of fun. Personally I never got the appeal of Heroku, Vercel and similar, because there's nothing better than spinning up a server and setting it up from scratch. Every developer should try it.
> The irony is that self-hosting is relatively simple, and a lot of fun. Personally I never got the appeal of Heroku, Vercel and similar, because there's nothing better than spinning up a server and setting it up from scratch.
It's fun the first time, but becomes an annoying faff when it has to be repeated constantly.
In Heroku, Vercel and similar, you git push and you're running. On a Linux server you set up the OS, server authentication, the application itself, the systemd units, the reverse proxy, code deployment, SSL key management, monitoring, etc. etc.
I still prefer a Linux server for the flexibility, but the UX could be a lot better.
I dunno, the cloud has mostly made me afraid of the cloud. You can bury yourself in towering complexity so easily on AWS. (The highly managed stuff like Vercel I don't have much experience with, so maybe it's different.)
Never got the appeal of having someone else do something for you, and giving them money, in exchange for goods and services? Vercel is easy. You pay them to make it easy. When you're just getting started, you start on easy mode before you jump into the deep end of the pool. Everybody's got a different cup of tea, and some like it hot and others like it cold.
My take is that it's fun up until there's just enough brittleness and chaos: too many instances of the same thing, each with too many env variables set up by hand, and then fuzzy bugs start to pile up.
Honestly I think it's the database that makes devs insecure. The stakes are high and you usually want PITR and regular backups even for low traffic apps. Having a "simple" turnkey service for this that can run in any environment (dedicated, VPS, colo, etc.) would be huge.
I think this is partly responsible for the increased popularity of SQLite as a backend. It's super simple, and Litestream for recovery isn't that complicated.
Most apps don't need 5 9s, but they do care about losing data. Eliminate the possibility of losing data, without paying tons of $ to also eliminate potential outages, and you'll get a lot of customers.
The cloud was a good deal in 2006, when the smallest AWS machine was about the size of an OK dev desktop and it took over two years of renting to justify buying the physical machine outright.
Today the smallest, and even large, AWS machines are a joke: comparable to a mobile phone from 15 years ago or a terrible laptop today, and they take about three to six months of rent to equal buying the hardware outright.
If you're in the cloud without getting a 75% discount, you will save money and headcount by doing everything on-prem.
This could be the premise for a fun project based infra learning site.
You get X resources in the cloud and know that a certain request/load profile will run against it. You have to configure things to handle that load, and are scored against other people.
also how far you can get with a single machine has changed massively in the past 15 years. 15 years ago a (really beefy) single machine meant 8 cores with 256GB ram and a couple TB of storage. Now a single machine can be 256 cores on 8TB of ram and a PB of storage.
Exactly, and the performance of consumer tech is wildly faster. E.g., a Ryzen 5825U mini PC with 16GB memory and a 512GB NVMe is ~$250 USD. That thing will outperform a 14-core Xeon from ~2016 on multicore workloads and absolutely thrash it in single thread. Yes, the lack of ECC is not good for any serious workload, but it's great for lower environments/testing/prototyping, and it sips power at ~50W full tilt.
I saw a twitter thread recently where someone tried to make this point to someone shilling AWS spaghetti architectures. They were subsequently dog-piled into oblivion but the mental gymnastics people can do around this subject is a sight to behold.
Simplicity is uncomfortable to a lot of people when they're used to doing things the hard way.
The best part is when you start with a $3000/month cloud bill during development and finally realize that hosting the production instance this way would actually cost $300k/month, but now it's too late to change it quickly.
You put your staging env in the same (kind of) place you put your prod system because you need to replicate your prod environment as faithfully as possible. You also then get to re-use your deployment code.
Do you plan on keeping it in your home? At that point I'd be worried about ISP networking or power guarantees unless you plan on upgrading to business rates for both. If you mean colo, well, if you're sure you'll be using it in X years, it's worth it, but the flexibility of month-to-month might be preferable.
Reminds me of my current customer.
We (another freelancer and me) built an application that replaced an Excel sheet, which was the foundation of the business until then. So the usual so far.
We have a policy that our customers are responsible for all their business-related input, but we make the decisions about the technical implementation.
Every technical decision that the customer wants to make basically costs extra.
In this case we built a rather simple multi-tenancy B2B app using Laravel, with one database per tenant.
They planned to start with a single customer/tenant, scaling up to maybe a few dozen within the next years, with less than a hundred concurrent users over the first five years.
There were some processes with a little load, but they were few, running less than a minute each, and already built to run asynchronously.
We planned a single Hetzner instance and to scale up as soon as we would see it reaching its limits.
So less than 100 €/month.
The customer told us that they have a cooperation with their local hosting provider (with "special conditions!") and that they wanted to use them instead.
My colleague did all the setup, because he is more experienced in that, but instead of our usual five-minute-setup in Forge (one of the advantages of the Laravel ecosystem), it took several weeks with the hosting provider, where my colleague had to invest almost full time just for the deployment.
The hosting provider "consulted" our customer into investing in a more complex setup with a load balancer in front, to be able to scale right away.
They also took very long for each step, like providing IP addresses or to handle the SSL certificates.
We are very proud of our very fast development process and having to work with that hosting provider cost us about one third of our first development phase for the initial product.
It's been around two years since then.
While the software still works as intended, the customer could not grow as expected. They are still running with only one single tenant (basically themselves) and the system barely had to handle more than two concurrent users.
The customer recently accidentally mentioned that they pay almost 1000€/month for the hosting alone.
But it scales!
I can only sympathize here because I have those exact issues with some of our customers
They don't want our hosting solutions but insist on using their own hosting partners
The results are similar:
- it's at least five times more expensive on pure hosting costs
- we lose a considerable amount of time dealing with the hosting partner (which we bill to the customer)
- it's always a security nightmare, either because they put so many "safety protections" in place that it's unusable (think of the customer wanting an Internet-facing website, but the servers being private...) or because they don't put any safety settings in place, so the servers are regularly taken down through awfully simple exploits (think SSH root access with "passw0rd" as the password...)
- customers keep complaining to us about performance, but what can you do when the servers share a 100Mbps connection, or the filesystem is on an NFS mount with <20Mbps bandwidth
Yes. For these reasons I now have a policy that if you want me maintaining your website / web app, then I manage the hosting. I use something I'm familiar with that I know works.
This smells a lot like IONOS. They can put their certifications where the light doesn't shine. 10x the cost, really baaad provisioning API and bottlenecks, broken OS images, useless support...
Super interesting, and truly unfortunate when that happens! Just thinking about having to wait for SSL certificates like in the old days (versus Let's Encrypt) would frustrate me to no end.
Forge seems like a great integrated solution (I subscribe to their newsletter and their product updates seem quite frequent and useful). What's been your experience with them? Any particular things you like, or dislike about them?
I'm also curious when you talk about scaling up Forge - is that something you've done, and is that generally easy to do?
Local hosting can make sense. Being able to drive to your provider and talk to them in person is quite valuable, and if you want to get the highest support tier from a large cloud provider you will often pay several times more compared to the same service with no support, assuming you are a large enough customer that they are willing to sell it. Cooperation with local businesses can also result in a fair amount of additional sales (sending customers to each other, buying services from each other, word of mouth, etc.), so the product cost may not represent the complete picture.
Local hosting can also mean comparing apples with oranges. A local data center that provides a physical machine is very different from a cloud provider, especially if that cloud is located on a different continent and under a different jurisdiction. Given that they were providing SSL certificates, was this a local PHP web shop? Data centers should be more proficient with things like IP addresses and setting up anycast, but less so in providing help with PHP or certificates, and if they sell that, it may not be their area of expertise.
We've had a similar experience at Hack Club, the nonprofit I run that helps high schoolers get into coding and electronics.
We used to be on Heroku and the cost wasn't just the high monthly bill - it was asking "is this little utility app I just wrote really worth paying $15/month to host?" before working on it.
This year we moved to a self-hosted setup on Coolify and have about 300 services running on a single server for $300/month on Hetzner. For the most part, it's been great and let us ship a lot more code!
My biggest realization is that for an organization like us, we really only need 99% uptime on most of our services (not 99.99%). Most developer tools are built around helping you reach 99.99% uptime. When you realize you only need 99%, the world opens up.
Disco looks really cool and I'm excited to check it out!
Cheers, let me know if you do / hop onto our Discord for any questions.
We know of two similar cases: a bootcamp/dev school in Puerto Rico that lets its students deploy all of their final projects to a single VPS, and a Raspberry Pi that we've set up at the Recurse Center [0] which is used to host (double checking now) ~75 web projects. On a single Pi!
Just something to be aware of when you say "Even with all 6 environments and other projects running, the server's resource usage remained low. The average CPU load stayed under 10%, and memory usage sat at just ~14 GB of the available 32 GB."
The load average in htop is measured in units of CPU cores, not as a share of the whole machine. So if you have 8 CPU cores like in your screenshot, a load average of 0.1 is actually 1.25% (10% of one core / 8 cores) of total CPU capacity - even better :).
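As a quick sketch of that conversion (plain arithmetic, nothing htop-specific):

```python
# A load average counts runnable tasks, so "fully busy" equals the core count.
def load_to_pct(load_avg: float, ncores: int) -> float:
    """Express a load average as a percentage of total CPU capacity."""
    return load_avg / ncores * 100

print(load_to_pct(0.1, 8))  # -> 1.25 (percent of an 8-core machine)
```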
Cool blog! I've been having so much success with this type of pattern!
what does this service offer over an established tool like Coolify? currently hosting most of my services on a cheap Hetzner VPS so i'm interested what Disco has to offer
Coolify and other self-hosting options such as Kamal are great. We're all in the same boat!
I'd say the main differences are that we 1) offer a more streamlined CLI and UI rather than extensive app/installation options and 2) have an API-key-based system that lets team members collaborate without having to manage SSH access/keys.
Generally speaking, I'd say our approach and tooling/UX tends to be more functional/pragmatic (like Heroku) than one with every possible option.
It’s all fine and dandy, but I wonder why there's so little discussion around this (mainly high-level comments like “DBs are hard”)?
> disco provides a "good enough" Postgres addon.
> This addon is a great way to quickly setup a database when Postgres is not mission critical to your system. If you need any non-basic features, like replication, automatic failover, monitoring, automatic backups and restore, etc. you should consider using a managed Postgres provider, such as Neon or Supabase.
How come automatic backups is considered an “advanced” feature?
Also, I can’t think of a single application I have worked on since 2012 that did not have a secondary/follower instance deployed. Suggesting Neon and friends is fine, but I wonder what your average latency is; Hetzner does not have a direct connection to the DCs these databases are hosted in.
Backups are only advanced in the context of our Postgres being "good enough" (maybe our built-in Postgres could be called "barely enough", but that sounds a bit lame) :-)
I fully agree with you though, it's table stakes (unintended pun!) for any prod deployment, just like read-only followers, etc. Our biggest, most important point is that folks should be using real DBs hosted by people who know what they're doing. The risk/reward ratio is out of whack if you do it yourself.
And finally, re Hetzner and cross-DC latency, that's unfortunately a very good issue that we had to plan for in another case - specifically, a customer using Supabase (which is AWS-based). The solution was to simply use an EC2 machine in the same region. Thankfully, some db providers end up being explicit about which AWS region they run in - and obviously, using AWS RDS is also an option! It's definitely a consideration.
The article's title seems inaccurate - as far as I understood, there never was a $3,000/mo bill; there was a $500/mo-per-instance staging setup that was rightly optimized to $55/mo before running six instances.
> Critically, all staging environments would share a single "good enough" Postgres instance directly on the server, eliminating the need for expensive managed database add-ons that, on Heroku, often cost more than the dynos themselves.
Heroku also has cheaper managed database add-ons, why not use something like that for staging? The move to self hosting might still make sense, my point is that perhaps the original staging costs of $500/mo could have been lower from the start.
I answered elsewhere with the list of dynos, but the short version is that between the list of services that each deployment required, and the size of the database, it truly (and unfortunately) did end up costing $500 per staging.
> You can also enable zram to compress ram, so you can over-provision like the pros. A lot of long-running software leaks memory that compresses pretty well.
Here is how I do it on my Hetzner bare-metal servers using Ansible: https://gist.github.com/fungiboletus/794a265cc186e79cd5eb2fe... It also works on VMs.
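For the curious, the manual version of what a playbook like that automates is roughly the following (a sketch; needs root, and the device/algorithm names may vary by kernel):

```sh
modprobe zram                                # load the module (one device by default)
echo zstd > /sys/block/zram0/comp_algorithm  # pick the compressor before sizing
echo 8G > /sys/block/zram0/disksize          # uncompressed capacity of the device
mkswap /dev/zram0
swapon -p 100 /dev/zram0                     # higher priority than any disk swap
```

The high swap priority matters: it makes the kernel fill the compressed device before touching slower disk swap, if any.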
[0]: https://www.freedesktop.org/software/systemd/man/latest/syst...
[1]: https://github.com/facebookincubator/oomd
[2]: https://docs.kernel.org/accounting/psi.html
Another option is to limit memory per application in cgroups but that requires more explaining than I am putting in an HN comment.
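For the curious, systemd makes the cgroups v2 part fairly painless these days; a minimal sketch with a hypothetical service name (values illustrative):

```ini
# /etc/systemd/system/myapp.service.d/memory.conf (drop-in; apply with systemctl daemon-reload)
[Service]
MemoryHigh=1G       # throttle and reclaim aggressively above this
MemoryMax=2G        # hard cap: OOM-kill the service, not the box
MemorySwapMax=512M  # bound how much it can spill into swap
```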
Another useful thing is to never OOM kill in the first place on servers that only do things in memory and need not commit anything to disk - so don't do this on a database with a disk. This is for ephemeral nodes that should self-heal: panic on OOM, wait 60 seconds so the DRAC/iLO can capture the crash message, and then earth-shattering kaboom.
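If I'm reading this right, the sysctl knobs for the panic-and-reboot behavior would be something like this (a sketch; check your distro's kernel docs before copying):

```ini
# /etc/sysctl.d/99-oom-panic.conf - only for ephemeral, self-healing nodes
vm.panic_on_oom = 1   # panic instead of letting the OOM killer pick victims
kernel.panic = 60     # sit in the panic for 60s (BMC screenshot window), then reboot
```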
For a funny side note, those options can also be used as a holy hand grenade to intentionally unsafely reboot NFS diskless farms when failing over to entirely different NFS server clusters: set panic to 15 minutes, trigger an OOM panic by setting min_free to 16TB at the command line via Ansible (not in sysctl.conf), swap clusters, ARP storm, and reconverge.

Maybe back in the 90s it was okay to wait 2-3 seconds for a button click, but today we just assume the thing is dead and reboot.
In the age of microservices and cattle servers, reboot/reinstall might be cheap, but in the long run it is not. A long-running server, cattle though it may be, is usually the better solution because, especially with some excess RAM, the server "warms up" with all hot data cached and becomes a low-latency unit in your fleet - provided you pay the required attention to your software development and service configuration.
Secondly, the kernel swaps out unused pages to swap, relieving pressure on RAM. So swap is often used even if you only fill 1% of your RAM. This allows more hot data to be cached, giving better resource utilization and performance in the long run.
So, eff it, we ball is never a good system administration strategy. Even if everything is ephemeral and can be rebooted in three seconds.
Sure, some things like Kubernetes force "no swap, period" policies because they kill pods when pressure exceeds some value, but for more traditional setups swap is still valuable.
Swap helps you use RAM more efficiently: you keep the hot stuff in RAM and let the cold stuff fester on disk.
Sure if you overwhelm it, then you're gonna have a bad day, but thats the same without swap.
Seriously, swap is good, don't believe the noise.
Swap is good to have. The value is limited but real.
Also not having swap doesn't prevent thrashing, it just means that as memory gets completely full you start dropping and re-reading executable code over and over. The solution is the same in both cases, kill programs before performance falls off a cliff. But swap gives you more room before you reach the cliff.
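One concrete way to get the "kill before the cliff" behavior is earlyoom, mentioned upthread; on Debian/Ubuntu the packaged config is just a flags file and might look like this (thresholds illustrative):

```ini
# /etc/default/earlyoom
# Act when BOTH available RAM and free swap drop below 10%.
EARLYOOM_ARGS="-m 10 -s 10"
```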
Nowadays when a program hits swap it's not going to fall back to a different memory usage profile that prioritises disk access. It's going to use swap as if it were actual RAM, so you get to watch the program choke the entire system.
I regularly use it on my Snapdragon 870 tablet (not exactly a top of the line CPU) to prevent OOM crashes (it's running an ancient kernel and the Android OOM killer basically crashes the whole thing) when running a load of tabs in Brave and a Linux environment (through Tmux) at the same time.
ZRAM won't save you if you do actually need to store and actively use more than the physical memory but if 60% of your physical memory is not actively used (think background tabs or servers that are running but not taking requests) it absolutely does wonders!
On most (web) app servers I happily leave it enabled to handle temporary spikes, memory leaks or applications that load a whole bunch of resources that they never ever use.
I'm also running it on my Kubernetes cluster. It allows me to set reasonable strict memory limits while still having the certainty that Pods can handle (short) spikes above my limit.
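For reference, a hedged sketch of such a container resource block (names and numbers are illustrative, not a recommendation):

```yaml
# Hypothetical container spec fragment: modest request, strict limit.
resources:
  requests:
    memory: "256Mi"   # what the scheduler reserves on the node
  limits:
    memory: "512Mi"   # hard ceiling; per the parent's setup, node-level
                      # zram gives headroom for short spikes below it
```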
But also false. Swap is there so anonymous pages can be evicted. Not as a “slow overflow for RAM”, as a lot of people still believe.
By disabling swap you can actually *increase* thrashing, because the kernel is more limited in what it can do with the virtual memory.
My experience is the exact opposite. If anything 2-3 second button clicks are more common than ever today since everything has to make a roundtrip to a server somewhere whereas in the 90s 2-3s button click meant your computer was about to BSOD.
Edit: Apple recently brought "2-3s to open tab" technology to Safari[1].
[1] https://old.reddit.com/r/MacOS/comments/1nm534e/sluggish_saf...
As someone with zero ansible experience, can you elaborate on why a yaml list is better than a simple shell script with comments before each command?
The case of swap thrashing sounds like a misbehaving program, which can maybe be tamed by oomd.
System responsiveness though needs a complete resource control regime in place, that preserves minimum resources for certain critical processes. This is done with cgroupsv2. By establishing minimum resources, the kernel will limit resources for other processes. Sure, they will suffer. That’s the idea.
Is earlyoom a better solution than that to prevent an erratic process from making an instance unresposnsive?
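For what it's worth, the "preserve minimum resources" regime described above maps to the cgroups v2 memory protections; through systemd it might look like this (hypothetical unit name):

```ini
# /etc/systemd/system/critical.service.d/protect.conf (drop-in)
[Service]
MemoryMin=512M   # hard reservation: never reclaimed below this
MemoryLow=1G     # soft protection: reclaimed only under heavy pressure
```

Everything without a protection then takes the reclaim pressure first, which is exactly the "sure, they will suffer" behavior.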
Therefore it is better to always tune "vm.swappiness" to 1 in /etc/sysctl.conf
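In sysctl form this is a one-line fragment (1 means "strongly prefer dropping page cache over swapping anonymous pages"; note that 0 behaves differently on some kernels):

```ini
# /etc/sysctl.d/99-swappiness.conf
vm.swappiness = 1
```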
You can also configure your web server / TCP stack buffers / file limits so they never allocate memory over the physical ram available. (eg. in nginx you can setup worker/connection limits and buffer sizes.)
And of course the overhead is zero when you don't page-out to swap.
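The nginx-side bounding mentioned above might look roughly like this (a sketch; directive values are illustrative, not recommendations):

```nginx
worker_processes 4;                  # fixed worker count

events {
    worker_connections 1024;         # cap concurrent connections per worker
}

http {
    client_body_buffer_size 16k;     # per-request body buffer
    client_max_body_size    8m;      # reject larger uploads outright
    proxy_buffers           8 16k;   # per-connection proxy buffering
}
```

Worst-case buffer memory is then roughly workers x connections x per-connection buffers, which you can keep below physical RAM.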
To clarify the OP's representation of the tool: it compresses swap space, not resident RAM. Outside of niche use cases, compressing swap has little overall utility.
For an algorithm using the whole memory, that’s a terrible idea.
- hardware, networks, monitoring, provisioning, server room locations in existing buildings, how to prepare server rooms
- and so on up to hiring and firing sysadmins, salary negotiations[2], vendor negotiations and the first book even had a whole chapter dedicated to "Being happy"
[1] There is a third author as well now, but those two were the ones that are on the cover of my book from 2005 and that I can remember
[2] Has mostly worked well after I more or less left sysadmin behind as well
Another option is to run BSD to avoid the Linux oom issue
For example, I'm not using Hetzner but I run NetBSD entirely from memory (no disk, no swap) and it never "went down" when out of memory
Looks like some people install FreeBSD and OpenBSD on Hetzner
https://gist.github.com/c0m4r/142a0480de4258d5da94ce3a2380e8...
https://computingforgeeks.com/how-to-install-freebsd-on-hetz...
https://web.archive.org/web/20231211052837if_/https://www.ar...
https://community.hetzner.com/tutorials/freebsd-openzfs-via-...
https://www.souji-thenria.net/posts/openbsd_hetzner/
https://web.archive.org/web/20220814124443if_/https://blog.v...
https://www.blunix.com/blog/how-to-install-openbsd-on-hetzne...
https://gist.github.com/ctsrc/9a72bc9a0229496aab5e4d3745af0b...
If it is possible to boot Hetzner from a BSD install image using "Linux rescue mode"^1, then it should also be possible to run NetBSD entirely from memory using a custom kernel
Every user is different but this is how I prefer to run UNIX-like OS for personal, recreational use; I find it more resilient
1. https://docs.hetzner.com/robot/dedicated-server/troubleshoot...
https://blog.tericcabrel.com/hetzner-rescue-mode-unlock-serv...
https://github.com/td512/rescue
https://gainanov.pro/eng-blog/linux/hetzner-rescue-mode/
https://docs.hetzner.com/cloud/servers/getting-started/rescu...
ChromeOS has an interesting approach to Linux oom issues. Not sure it has ever been discussed on HN
https://github.com/dct2012/chromeos-3.14/raw/chromeos-3.14/m...
It's a shame they don't just license their whole software stack at a reasonable price, with a model similar to Sidekiq's, and let you sort out actually decent hardware. It's insane to consider Heroku: it has, if anything, gotten more expensive and worse compared to a decade ago, while similarly priced server hardware has gotten WAY better over that decade. $50 for a dyno with 1 GB of RAM in 2025 is robbery. It's even worse considering that running a standard Rails app hasn't changed dramatically from a resources perspective and has, if anything, become more efficient. It's comical how many developers are shipping apps on Heroku for hundreds of dollars a month on machines with worse performance/resources than the MacBooks they develop on.
It's the standard playbook that damn near everything in society follows now: jack up prices and target the wealthiest, least price-sensitive percentiles instead of making good products at fair prices for the masses.
Heroku's pricing has _remained the same_ for at least seven years, while hardware has improved exponentially. So when you look at their pricing and see a scam, what you're actually doing is comparing a 2025 anchor to a mid-2010s price that exists to retain revenue. At the big cloud vendors, they differentiate customers by adding obstacles to unlocking new hardware performance in the form of reservations and updated SKUs. There's deliberate customer action that needs to take place. Heroku doesn't appear to have much competition, so they keep their prices locked and we get to read an article like this whenever a new engineer discovers just how capable modern hardware is.
Heroku has obviously stagnated now, but their stack is _very cool_ if you have a fairly simple system and still want all the nice parts of a more developed ops setup. It almost lets you get away with not having an ops team for quite a while. I don't know any other provider that offers low-effort "decent" ops (Fly seems to directionally want to be the new Heroku but is still missing a _lot_ in my book, though it also has a lot).
To be fair, AWS quite proudly talk about all the times they've lowered prices on existing services, or have introduced new generations that are cheaper (e.g. their Graviton EC2 instances).
We built and open sourced https://canine.sh for exactly that reason. There’s no reason PaaS providers should be charging such a giant markup over already marked up cloud providers.
AWS isn't much better honestly.. $50/month gets you an m7a.medium which is 1 vCPU (not core) and 4GB of RAM. Yes that's more memory but any wonder why AWS is making money hand-over-fist..
Where they get you is all the ancillary shit, you buy some database/backup/storage/managed service/whatever, and it is priced in dollars per boogaloo, you also have to pay water tax on top, and of course if you use more than the provisioned amount of hafnias the excess ones cost 10x as much.
Most customers have no idea how little compute they are actually buying with those services.
If you reserve that instance you can get it for 40% cheaper, or get 4 cores instead.
Yes it's more expensive than OVH but you also get everything AWS to offer.
To compare to Heroku's standard dynos (which are shared hosting) you want the t3a family which is also shared, and much cheaper.
Netlify sets the same prices.
Just throw it into a cloud bucket from CI and be done with it.
Every other time I login to the admin site I get a Heroku error.
It's insane how much a restaurant charges for a decent steak, I can do it much cheaper myself!
...!
No one gets hurt if someone else chooses to waste their money on Heroku, so why are people complaining? Of course it matters where there aren't a lot of competitors, but there are literally hundreds of different options for deploying applications, and at least a dozen of them are just as reliable as and cheaper than Heroku.
Really? I mean oil changes are pretty cheap. You can get an oil change at walmart for like 40 bucks.
Hosting staging envs in pricey cloud envs seems crazy to me but I understand why you would want to because modern clouds can have a lot of moving parts.
Cloud isn't worth it until suddenly it is because you can't deploy your own servers fast enough, and then it's worth it until it exceeds the price of a solid infrastructure team and hardware. There's a curve to how much you're saving by throwing everything in the cloud.
It's fun the first time, but becomes an annoying faff when it has to be repeated constantly.
In Heroku, Vercel and similar you git push and you're running. On a linux server you set up the OS, the server authentication, the application itself, the systemctl jobs, the reverse proxy, the code deployment, the ssl key management, the monitoring etc etc.
I still do prefer a linux server due to the flexibility, but the UX could be a lot better.
It offloads things like:
- power usage
- colo costs
- networking (a big one)
- storage (SSD wear / HDD pools)
- etc.
It is a long list, but what it doesn't allow you to do is make trade-offs like spending way less but accepting downtime if your switch dies, etc.
For a staging env these are things you might want to do.
Is it mostly developer insecurity, or mostly tech leadership insecurity?
I think this is partly responsible for the increased popularity of SQLite as a backend. It's super simple, and Litestream for recovery isn't that complicated.
Most apps don't need 5 9s, but they do care about losing data. Eliminate the possibility of losing data, without paying tons of $ to also eliminate potential outages, and you'll get a lot of customers.
Today the smallest, and even large, AWS machines are a joke - comparable to a mobile phone from 15 years ago or a terrible laptop today - and about three to six months of rent equals buying the hardware outright.
If you're in the cloud without getting a 75% discount, you will save money and headcount by doing everything on-prem.
You get X resources in the cloud and know that a certain request/load profile will run against it. You have to configure things to handle that load, and are scored against other people.
Things like Lambda do fit in this model, but they are too inefficient to model every workload.
Amazon lacks vision.
Simplicity is uncomfortable to a lot of people when they're used to doing things the hard way.
* The big caveat: If you don't incur the exact same devops costs that would have happened with a linux instance.
Many tools (containers in particular) have cropped up that have made things like quick, redundant deployment pretty straightforward and cheap.
As cloud marches on it continues to seem like a grift.
We have a policy that our customers are responsible for all their business-related input, but we make the decisions about the technical implementation. Every technical decision that the customer wants to make basically costs extra.
In this case we built a rather simple multi-tenancy B2B app using Laravel, with one database per tenant. They planned to start with a single customer/tenant, scaling up to maybe a few dozen within the next few years, with fewer than a hundred concurrent users over the first five years. There were some processes with a bit of load, but they were few, each running less than a minute, and already built to run asynchronously.
We planned a single Hetzner instance and to scale up as soon as we would see it reaching its limits. So less than 100 €/month.
The customer told us that they have a cooperation with their local hosting provider (with "special conditions!") and that they wanted to use them instead.
My colleague did all the setup, because he is more experienced at that, but instead of our usual five-minute setup in Forge (one of the advantages of the Laravel ecosystem), it took several weeks with the hosting provider, during which my colleague had to work almost full time just on the deployment. The hosting provider "consulted" our customer into investing in a more complex setup with a load balancer in front, to be able to scale right away. They also took very long for each step, like providing IP addresses or handling the SSL certificates.
We are very proud of our very fast development process and having to work with that hosting provider cost us about one third of our first development phase for the initial product.
It's been around two years since then. While the software still works as intended, the customer could not grow as expected. They are still running with only one single tenant (basically themselves) and the system barely had to handle more than two concurrent users. The customer recently accidentally mentioned that they pay almost 1000€/month for the hosting alone. But it scales!
They don't want our hosting solutions but insist on using their own hosting partners
The results are similar:
- it's at least five times more expensive in pure hosting costs
- we lose a considerable amount of time dealing with the hosting partner (which we bill to the customer)
- it's always a security nightmare, either because they put so many "safety protections" in place that it's unusable (think of the customer wanting an Internet-facing website, but the servers are private...) or because they don't put any safety settings in place, so the servers are regularly taken down through awfully simple exploits (think of SSH root access with "passw0rd" as the password...)
- customers keep complaining to us about performance, but what can you do when the servers share a 100Mbps connection, or the filesystem is on an NFS with <20Mbps of bandwidth
Thanks a lot!
What prevented them from scaling to more tenants?
[0] https://www.recurse.com/
Lots of conversation & discussion about self-hosting / cloud exits these days (pros, cons, etc.) Happy to engage :-)
Cheers!
https://news.ycombinator.com/item?id=44292103
https://news.ycombinator.com/item?id=44873057
Would be great to have a comparison on the main page of Disco
Interesting project. Do you have any screenshots of the UI of Disco?
Oh, there’s actually this tutorial that shows a tiny preview of it:
https://disco.cloud/docs/deployment-guides/meilisearch
Thanks for the reminder!