Yes, performance is better on bare metal, and it's cheaper too. In fact, even a cheap VPS will perform better than the cloud for much less money, and scale vertically to incredible heights.
You do have to do a lot of things manually, and even doing that, you won't get the high availability and elasticity you have the potential to get from a cloud offer. Potential being the key word though.
But honestly, most projects don't need it.
Crashing used to terrify me. Then one day I started working for a client whose service went down once a month.
Guess what? Nothing happened. No customers ever complained. No money was lost; in fact, the cash flow kept growing.
Most services are not so important that they can't be switched off once in a while.
Not to mention, monoliths are more robust than they are given credit for.
I've seen a fair amount of outages caused by the extra complexity brought on by making a system distributed for the purposes of high availability.
Hardware is actually quite reliable nowadays, and I'll trust a hardware single-point of failure running a monolithic application more than a distributed microservice-based system with lots of moving parts.
Sure, in theory, the distributed system should win, but in practice, it fails more often (due to operator error or unforeseen bugs) than hardware.
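The "lots of moving parts" point can be made concrete with back-of-the-envelope availability math. The 99.9% figures below are illustrative, not measured; the point is that components in the request path compound, while redundancy only helps where it is actually applied:

```python
# Back-of-the-envelope availability math (illustrative numbers).

# A single well-run server at 99.9% uptime:
single = 0.999

# A distributed system whose request path traverses several components
# in series (load balancer, service mesh, a few microservices, a queue),
# each itself at 99.9%:
components = 8
serial = 0.999 ** components

print(f"single box:      {single:.4%}")   # 99.9000%
print(f"8 serial pieces: {serial:.4%}")   # lower than the single box

# Two independent replicas of one component at 99.9% each:
replicated = 1 - (1 - 0.999) ** 2
print(f"2 replicas:      {replicated:.6%}")
```

Redundancy at one layer can beat the single box, but only if the added layers don't eat the gain, which is exactly the failure mode described above.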
> Sure, in theory, the distributed system should win, but in practice, it fails more often (due to operator error or unforeseen bugs) than hardware.
Isn't this because of the rampant introduction of accidental complexity whenever you attempt to make a system horizontally scalable - e.g. for whatever reason the developers or the people in charge suddenly try to cram in as many technological solutions as possible because apparently that's what the large companies are doing?
There's no reason why you couldn't think about which data can or cannot be shared, and develop your system as one that's scalable, possibly modular, but with the codebase still being largely monolithic in nature. I'd argue that there's a large gray area between the opposite ends of that spectrum - monoliths vs microservices.
The biggest benefit of HA architectures is IMO not resilience from crashing or overloaded systems, but more that it is often a prerequisite for doing zero-downtime updates and partial rollout of updates.
No more emergency revert to last version because the new version didn't start up, only to realize there was a schema update so you also must restore your data from a snapshot, which nobody knows how to do. All this under stress from the service being completely down.
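One common way to avoid that trap (not something the comment above prescribes) is to make every schema change backward compatible, the so-called "expand/contract" pattern, so a rollback never needs a data restore. A sketch with invented table and column names:

```python
# Expand/contract migration sketch: each step keeps BOTH the old and new
# application versions working against the same schema.
# Table/column names are invented for illustration.
MIGRATION_STEPS = [
    # 1. Expand: add the new column, nullable; old code simply ignores it.
    "ALTER TABLE users ADD COLUMN full_name TEXT",
    # 2. Backfill while both application versions are running.
    "UPDATE users SET full_name = first_name || ' ' || last_name "
    "WHERE full_name IS NULL",
    # 3. Deploy the new code that reads/writes full_name.
    # 4. Contract: only after the old version is fully gone, drop old columns.
    "ALTER TABLE users DROP COLUMN first_name",
    "ALTER TABLE users DROP COLUMN last_name",
]
```

If the new version fails to start at step 3, reverting the deploy is safe: the old code never saw `full_name` and nothing it depends on was removed.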
In most cases, it's only cheaper if you value engineering time at $0. Fine for a personal project, but the math changes when you're paying a fully-loaded cost of $100-200/hr for engineers and you have a long backlog of more valuable things for them to work on.
That's the real reason companies use cloud services: Not because it's cheaper or faster, but because it lets the engineers focus their efforts on the things that differentiate the company.
From another perspective: I can change the oil in my car cheaper and faster than driving to a shop and having someone else do it. However, if my time is better spent doing an hour of freelancing work then I'll gladly pay the shop to change my oil while I work.
> In most cases, it's only cheaper if you value engineering time at $0.
Clearly not at $0 since the cloud is much more expensive and you need more of it because it's slower. If you could have someone do setup and maintenance for free, obviously you'd be way ahead.
So the question is really how much does it cost and whether it's cheaper than the cloud tax.
At a previous startup our AWS bill would've been enough to hire about 3 to 4 full time sysadmins at silicon valley rates. Our workload wasn't huge. I estimated at the time we could've taken one month of AWS cost to buy more than enough equipment to run everything in-house with redundancy, then hire two people to run it all and bank the rest.
At current startup, we're still very small and have no real customer traffic, but the AWS bill is getting close, not there yet, to pay for one full time person. For a workload that could be hosted on a few ancient laptops over a home DSL line if we wanted.
Yes, there's quite the convenience in being able to just click a new instance into existence, but the true cost is also very high. Sure it's just VC money we're all lighting on fire, but I often wonder how much more could be accomplished if we didn't collectively just hand over most of the funding to AWS.
Eh, everything can be automated just as well on bare metal. "The cloud" tends to add complexity not remove it, or at best replace one kind of complexity with another. Bare metal tooling is less familiar and could use some advancement, but basically anything that a cloud compute provider can do can be done on bare metal.
A lot of orgs just never bothered to learn how to automate bare metal and ended up doing a lot of manual work.
Well, not really. If you are using a cloud solution, you usually need an engineer who knows that particular solution. Outside of the HN bubble, that's a rare breed, and they cost a lot more than your traditional Linux admin, whom you probably already have anyway.
Then you need to design, maintain and debug the distributed cloud system, which is more complex.
So you'll have a dedicated person or team for that in both cases.
On the other hand, setting up a Linux box for the common tasks (web server, cache, db, etc.) never takes me more than a day.
Oh, like that one system I saw once with an uptime of 10 years that was happily chugging away at data (not web facing tho).
Bare metal servers with a proper fallback and a working monitoring/notification system can be incredibly reliable, and for most purposes definitely enough.
>"You do have to do a lot of things manually, and even doing that, you won't get the high availability and elasticity you get from a cloud offer."
I run my things on bare metal. In case of hardware failure it takes less than an hour to restore servers from backup, and there have been exactly zero hardware failures over the whole lifetime anyway. I also have configurations with standby servers, but I'm questioning that now, as there has not been a single time (bar testing) when the standby was needed.
As for performance: I use native C++ backends and a modern multicore CPU with lots of RAM. Those babies can process thousands of requests per second sustained without ever breaking a sweat. That is more than enough for any reasonable business. All while costing a fraction of the cloudy stuff.
With ZFS or even RAID, there should ideally never be a need to "restore from backup" because of a conventional hardware failure; storage drive malfunctions nowadays can and (IMO) should be resolved online.
This is of course not a reason to avoid backups, but nowadays "restoring from backups" should be because of operator error or large-scale disaster (fire, etc), not because of storage drive failure.
Nowadays I'd be more worried about compute hardware failure - think CPU, RAM or the system board. Storage redundancy is IMO a long-solved problem provided you don't cheap out.
I have a service at work that only needs to be up during a couple of 4 hour intervals every week. It's one of the least stressful elements of my job.
There's a cost to always-on systems, and I don't think we're accounting those costs properly (externalities). It's likely that in many cases the benefits do not outweigh the costs.
I think it comes down to your ability to route around a problem. If the budgeting or ordering system is down, there are any number of other tasks most of your people can be doing. They can make a note, and stay productive for hours or days.
If you put all of your software on one server or one provider, and that goes down, then your people are pretty much locked out of everything. Partial availability needs to be prioritized over total availability, because of the Precautionary Principle. The cost of everything being down is orders of magnitude higher than the cost of having some things down.
One thing I don't get is why not have an on-prem solution with cloud fallback? It's hard, but not super hard. You would just need a cloud data store. And depending on your app, you can have an on-prem data store that periodically backs up (and no egress under normal operation) if you can design with an eventual consistency model.
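A minimal sketch of that idea, assuming a key-value workload. Every name here is invented, and the "stores" are plain dicts standing in for real on-prem and cloud backends; real code would need retries, batching, and backoff on the replication path:

```python
# Hypothetical on-prem-primary / cloud-fallback store with asynchronous
# (eventually consistent) replication. Writes stay local, so there is
# no egress cost in normal operation; reads fall back to the cloud copy.
import queue
import threading

class EventuallyConsistentStore:
    def __init__(self, primary, fallback):
        self.primary = primary      # on-prem store (dict stands in here)
        self.fallback = fallback    # cloud store (dict stands in here)
        self._pending = queue.Queue()
        threading.Thread(target=self._replicate, daemon=True).start()

    def write(self, key, value):
        self.primary[key] = value        # synchronous, local
        self._pending.put((key, value))  # replicated in the background

    def read(self, key):
        try:
            return self.primary[key]
        except KeyError:
            return self.fallback.get(key)  # on-prem miss: try the cloud copy

    def _replicate(self):
        while True:
            key, value = self._pending.get()
            self.fallback[key] = value   # real code: retry, batch, backoff
```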
I recently moved a smallish business application from two bare-metal servers onto Azure VMs. It's a standard PHP 7.4 application, MySQL, Redis, nginx. Despite the VMs costing more and having twice the spec of the bare-metal servers, it has consistently performed slower and less reliably throughout the stack. The client's budget is too small to spend much time looking into it. Instead, they upped the spec even further, and what used to cost £600 per month now costs £2000.
(Disclaimer: I'm a bare metal provider.) I hope more people become aware of what I've been saying for years: cloud is great for scaling down, but not that great for scaling up. If you want lots of VMs that don't warrant their own hardware and that you can conveniently turn up and down, then cloud is fantastic. If you have a big application that needs to scale, you can get further vertically with bare metal. And if you need to scale horizontally, you need to optimize higher up in the stack anyway; the much lower cost for equivalent resources (before even counting any virtualization performance hit), plus the extra flexibility and better-fitted performance, should give bare metal the clear advantage.
> what I've been saying for years: cloud is great for scaling down, but not that great for scaling up.
Yes and no. The cloud isn’t cheap at running any lift and shift type project. Where the cloud comes into its own is serverless and SaaS. If you have an application that’s a typical VM farm and you want to host it in the cloud, then you should at least first pause and identify how, if at all, your application can be re-architected to be more “cloudy”. Not doing this is usually the first mistake people make when deploying to the cloud.
I think “the cloud” never claimed to be cheaper, though. Its promise is mainly that you’ll offset a lot of risks with it against a higher price. And of course the workflow with the cloud is very different, with virtual machine images and whatnot.
Whether that’s worth the price is up for debate, though. I hope we’ll get more mature bare metal management software, perhaps even standardized remote management hardware, so you can flash entire bare metal systems.
Right now I’m mostly running Debian stable and everything inside Docker on the host machine, making it effectively a Docker “hypervisor”. It can get you quite far, without leaking any application state to the host machine.
Oh I agree, it is easier to mitigate some risks in a cloud solution. But this client - and they’re not unusual in their thinking - believes the cloud is somehow automatically mitigating those risks, when in fact it’s doing nothing because they’re not paying for it.
In this specific case, they chose Azure. They had a consultant in a fancy suit tell them Azure would be safer, and proposed an architecture that ended up not working at all. But they still went with Azure, and it’s difficult to point to any gains they’ve got for the 200% price increase.
Who actually runs CentOS 7 (kernel 3.10) for benchmarks in 2021? Run something recent like Ubuntu 20 + KVM and you will see a big difference. I don't believe modern virtualization has ~20% overhead (it should be less than 5%).
Can you find sources for those "less than 5%" numbers from people who aren't selling Kubernetes or cloud-related services?
It's generally pretty easy to construct benchmarks that make things look favorable. It's why there are constantly blog posts to the effect of "$INTERPRETED_HIGH_LEVEL_LANGUAGE is faster than C++".
I mean, one could say Kubernetes has virtualized components and requires VT-d extensions to operate at accelerated speed, but I don't think containers are truly virtualized. So you can probably get a less-than-5% benchmark if the stars align.
With a hypervisor you're looking at 10-15% overhead, typically. Maybe getting down to 7-12% using tricks (paravirtualization, PCI passthrough, etc.). In my environment I am at around 12% overhead on a good day.
I saw quite a significant performance and resource usage benefit from migrating away from a virtualized kubernetes environment to bare metal debian, so their findings align well with my anecdata as well.
Another thing to check is how nginx was compiled. Using generic optimizations vs. x86_64 can do interesting things on VMs vs bare metal. nginx and haproxy specifically should be compiled generic for VMs. I don't have any links, just my own performance testing in the past.
A binary running in a VM is still executing native machine code, so compiler optimizations should have the same effect whether running on bare metal or a VM.
Am I reading correctly that there is a huge difference in http versus SSL requests per second? e.g. in the Bare Metal 1 CPU case it's 48k http to 800 SSL? I had no idea the performance impact of SSL was that huge, if this is correct.
>The Ixia client sent a series of HTTPS requests, each on a new connection. The Ixia client and NGINX performed a TLS handshake to establish a secure connection, then NGINX proxied the request to the backend. The connection was closed after the request was satisfied.
I honestly struggle to understand why they didn't incorporate keepalives in the testing. Reusing an existing TLS connection, something done far more often than not in the wild, will have a dramatic positive effect on throughput.
I wonder how many of the extra cpu cycles are in PV network interfaces. It would be interesting to see how this works out with SR-IOV capable NICs with a VF in each VM.
> Not to mention, monoliths are more robust than they are given credit for.

Debatable.
> I'd argue that there's a large gray area between the opposite ends of that spectrum - monoliths vs microservices.

I actually wrote down some thoughts in a blog post of mine, called "Moduliths: because we need to scale, but we also cannot afford microservices": https://blog.kronis.dev/articles/modulith-because-we-need-to...
> You do have to do a lot of things manually...
Doing those manual things has a cost, though.
https://astuteinternet.com/services/dedicated-servers-order
{"error":"URI Not Found"}
https://www.php.net/supported-versions
Virtualized in Hardware: Hardware support for virtualization, and near bare-metal speeds. I'd expect between 0.1% and 1.5% overhead.
5% is achievable only for pure user-mode CPU loads with minimal I/O.
Neither supports io_uring (although I don't think nginx does either)
A) 0% SSL handshake workload per connection, and B) 100% SSL handshake workload per connection.
A reasonable step is just to solve the linear system between those two extremes and go from there for preliminary sizing. E.g., if a connection is reused 6 times, we'd expect around 4400 reqs/second from one core. And when you need detailed sizing, you should be using the real data from your real application.
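Taking the roughly 48k plain-HTTP req/s and 800 full-handshake req/s per core reported upthread as the two endpoints, the interpolation is a one-liner (figures illustrative; measure your own):

```python
# Linear interpolation between the two measured extremes:
#   A) every request rides a warm connection (no handshake)
#   B) every request pays a full TLS handshake
R_PLAIN = 48_000   # req/s per core, no handshake (case A)
R_TLS   = 800      # req/s per core, handshake on every request (case B)

def reqs_per_second(reuse: int) -> float:
    """Expected req/s when each connection serves `reuse` requests,
    i.e. one handshake is amortized over `reuse` requests."""
    handshake_frac = 1 / reuse
    # Average time per request: weighted sum of the two per-request costs.
    t = (1 - handshake_frac) / R_PLAIN + handshake_frac / R_TLS
    return 1 / t

print(round(reqs_per_second(6)))   # ~4400, matching the estimate above
```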