sksjvsla commented on We moved from AWS to Hetzner, saved 90%, kept ISO 27001 with Ansible   medium.com/@accounts_7307... · Posted by u/sksjvsla
trod1234 · 9 months ago
With respect, there's a big difference between "could close your account" and "has closed people's accounts", even temporarily, based on unlawful complaints.

I probably won't be responding after this, or in the future on HN, because I took a significant blast to my karma for keeping it real and providing valuable feedback. You have a lot of people brigading accounts to punish those who provide constructive criticism.

Generally speaking, AWS is incentivized to keep your account up so long as there is no legitimate reason for taking it down. They generally vet claims with an appropriate level of due diligence before imposing action, because that means they can keep billing for that time. Spurious, unlawful requests cost them money; they want that money, and they are at a scale where they can afford the vetting.

I'm sure you've spent a lot of time and effort on your rollout. You sound competent, but what makes me cringe is the approach you are taking that this is just a technical problem when it isn't.

If you've done your research you would have run across more than a few incidents where people running production systems had Hetzner either shut them down outright or, worse, do so in response to invalid legal claims which Hetzner failed to properly vet. There have also been some strange non-deterministic issues that may be related to failing hardware, but maybe not.

Their support is often one response every 24 hours. What happens when the first couple of responses are boilerplate because the tech didn't read or understand what was written? That's 24 hours, plus a chance of losing the next 24 hours, at each step, with no phone support, which is entirely unmanageable. While I realize they do have a customer support line, for most it is an international call and the hours are banker's hours. If you're in Europe you'll have a much easier time lining up those calls, but anywhere else you are dealing with international calls where the first chance of the day may be midnight.

Having a separate platform for both servers is sound practice, but what happens when the DAG running your logging/notification system is on the platform that fails, but not its failover? The issues are particularly difficult when half your stack fails on one provider, stale data is replicated over to your good side, and you get nonsensical or invisible failures; and it's not enough to force an automatic failover with traffic management, which is often not granular enough.
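The stale-replication trap described here suggests gating failover on more than transport-level success. A minimal sketch of such a gate follows; the freshness threshold and function shape are illustrative assumptions, not anything described in the comment:

```python
import time

# Sketch of a failover gate that refuses to promote a side whose data is
# stale, even if it answers HTTP 200. The threshold is an assumption.
MAX_REPLICATION_LAG_S = 30

def safe_failover_target(http_ok: bool, last_replicated_ts: float,
                         now: float) -> bool:
    """A side is a valid failover target only if it answers at the
    transport level AND its replicated data is fresh enough."""
    return http_ok and (now - last_replicated_ts) <= MAX_REPLICATION_LAG_S
```

A traffic manager probing only HTTP status would happily promote the stale side; an application-level freshness check is one reading of "granular enough" here.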

It's been a while since I've had to work with Cloudflare, so this may have improved, but I'm reasonably skeptical. I've personally seen incidents where the response time for direct outages was exceptional, but the response for anything beyond a simple HTTP 200 was nonexistent, just finger pointing, which was pointless because the raw network captures showed the failure in L2/L3 traffic on the provider side, and the provider ignored them. They still argued, and the downtime/outage was extended as a result. Vendor management issues are the worst when contracts don't properly scope and enforce timely action.

Quite a lot of the issues I've seen with various hosting providers, OVH and Hetzner included, are related to failing hardware, or to transparent stopgaps they've put in place which break the upper service layers.

For example, at one point we were getting what appeared to be stale-cache issues in traffic between two backend nodes on different providers. There was no cache between them, and it was breaking sequential flows in the API while still fulfilling other flows which were atomic. HTTP 200 was fine, AAA was not, and a few others. It appeared a Squid transparent proxy had been placed in-line, which promptly disappeared upon us reaching out to the platform, without them confirming it happened; concerning, to say the least, when the app you are deploying is knowledge management software holding that business's proprietary and confidential information. Needless to say, this project didn't move forward on any cloud platform after that (and it was populated with test data, so nothing was lost). It is why many of our cloud migrations were suspended and changed to cloud repatriation projects. Counter-party risk is untenable.
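One cheap first check for an unexpected in-line caching proxy like that is to look for the headers such proxies commonly inject. The header names below are typical Squid fingerprints, an assumption rather than a detail confirmed from the incident, and a careful transparent proxy can strip them all:

```python
# Flag responses that look like they passed through a caching proxy that
# shouldn't exist between two backends you control. Header names are
# common Squid/proxy fingerprints; absence of them proves nothing.
PROXY_FINGERPRINT_HEADERS = ("via", "x-cache", "x-cache-lookup", "age")

def proxy_fingerprints(headers: dict) -> list:
    """Return the suspicious header names present in a response."""
    lowered = {k.lower() for k in headers}
    return [h for h in PROXY_FINGERPRINT_HEADERS if h in lowered]
```

When the fingerprints are stripped, comparing IP TTLs or response timing between the two paths is the usual fallback for spotting an extra hop.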

Younger professionals, I've found, view these and related issues solely as technical problems, and they weigh those technical problems higher than the problems they can't weigh, because of lack of experience and something called the streetlight effect, an intelligence trap that often arises because they aren't taught a Bayesian approach. There's a SANS CTI presentation on this (https://www.youtube.com/watch?v=kNv2PlqmsAc).

The TL;DR is that a technical professional can see and interrogate just about every device, and that can lead to poor assumptions and an illusion of control that tends to ignore and dismiss problems when there is no real, clear visibility into how those edge problems can occur (when the low-level facilities don't behave as they should). That is the class of problems in the non-deterministic failure domain, where only guess-and-check works.

The more seasoned tend to focus more on the flexibility needed to mitigate problems that arise from business-process failures, such as when a cooperative environment becomes adversarial, which necessarily occurs when trust breaks down through loss, deception, or a breaking of expectations on one party's part. This phase change of environment, and its criteria, is rarely reflected or touched on in BC/DR plans; at least the ones that I've seen. The ones I've been responsible for drafting often include a gap analysis taking into account the dependencies, stakeholder input, and criteria between the two proposed environments, along with contingencies.

This should obviously include legal, to hold people to account when they fail in their obligations, but even that is often not enough today. Legal action often costs more than simply taking the loss and walking away, absent a few specific circumstances.

This youthful tendency is what makes me cringe. The worst disasters I've seen were readily predictable to someone with knowledge of the underlying business mechanics, and how those business failures would lead to inevitable technical problems with few if any technical resolutions.

If you were co-locating on your own equipment with physical data center access I'd have cut you a lot more slack, but from your other responses it didn't seem like you are.

There are ways to mitigate counter-party risk while receiving the hosting you need. Apples-to-oranges comparisons between services, given the opaque landscape, rarely paint an objective view, which is why a healthy amount of skepticism and disagreement is needed to ensure you didn't miss something important.

There's an important difference between constructive criticism intended to reduce adverse cost and consequence, and criticisms that simply aren't based in reality.

The majority of people on HN these days don't seem capable of making that important distinction in aggregate. My relatively tame reply was downvoted by more than 10 people.

These people by their actions want you to fail by depriving you of feedback you can act on.

sksjvsla · 9 months ago
I sincerely appreciate it — and I would never downvote a reply like this. It's clear you’ve been around the block, and I respect the experience and nuance you're bringing to the discussion.

On the topic of Hetzner and account risks, I completely agree: this is not just a technical issue, and that's why we built a multi-cloud setup spanning Hetzner and OVH in Europe. The architecture was designed from the start to absorb a full platform-level outage or even a unilateral account closure. Recovery and failover have been tested specifically with these scenarios in mind — it's not a "we'll get to it later" plan, it's baked into the ops process now.

I’ve also engaged Hetzner directly about the reported shutdown incidents — here’s one of the public discussions where I raised this: https://www.reddit.com/r/hetzner/comments/1lgs2ds/comment/mz...

What I got in a private follow-up from Hetzner support helped clarify a lot about those cases. Without disclosing anything sensitive, I’ll just say the response gave me more confidence that they are aware of these issues and are actively working to handle abuse complaints more responsibly. Of course, it doesn't mean the risk is zero — no provider offers that — but it did reduce my level of concern.

Regarding Cloudflare, I actually agree with your point: vendor contract structure and incentives matter. But that’s also why I find the AWS argument interesting. While it’s true that AWS is incentivized to keep accounts alive to keep billing, they also operate at a scale where mistakes (and opaque actions) still happen — especially with automated abuse handling. Cloudflare, for its part, has consistently been one of the most resilient providers in terms of DNS, global routing, and mitigation — at least in my experience and according to recent data. Neither platform is perfect, and both require backup plans when they become uncooperative or misaligned with your needs.

The broader point you make — about counterparty risk, legal ambiguity, and the illusions of control in tech stacks — is one I think deserves more attention in technical circles. You're absolutely right that these risks aren't visible in logs or Grafana dashboards, and can't always be solved by code. It's exactly why we're investing in process-level failovers, not just infrastructure ones.

Again, thank you for sharing your insights here. I don’t think we’re on opposite sides — I think we’re simply looking at the same risks through slightly different lenses of experience and mitigation.

mbmjertan · 9 months ago
Still, it’s highly location-dependent, and mileage varies drastically between countries.

I’m an SWE with a background in maths and CS in Croatia, and my annual comp is less than what you claim here. Not drastically, but comparing my comp to the rest of the EU it’s disappointing, although I am very well paid compared to my fellow citizens. My SRE/devops friends are in a similar situation.

I am always surprised to see such a lack of understanding of economic differences between countries. Looking through Indeed, a McDonald’s manager in the US makes noticeably more than anyone in software in southeast Europe.

sksjvsla · 9 months ago
As I wrote elsewhere in this thread:

Being able to stay compliant and protect revenue is worth far more than quibbling over which cloud costs a little less, or how much a monthly salary for an employee is in various countries.

The real ratio to look at is cloud spend vs. the revenue.

For me, switching from AWS to European providers wasn’t just about saving on cloud bills (though that was a nice bonus). It was about reducing risk and enabling revenue. Relying on U.S. hyperscalers in Europe is becoming too risky — what happens if Safe Harbor doesn’t get renewed? Or if Schrems III (or whatever comes next) finally forces regulators to act?

If you want to win big enterprise and governmental deals, then you have to do whatever it takes, and being compliant and in control is a huge part of that.

iLoveOncall · 9 months ago
That is absolutely not what I was talking about.

I'm talking about the issues that will happen to your current setup and requirement. Disaster recovery, monitoring, etc.

sksjvsla · 9 months ago
> Disaster recovery, monitoring, etc

ISO 27001 has me audited for just that (disaster recovery and monitoring), so that settles it, no?

Also worth noting that these are two things you don't really get from the hyperscalers. If you want to count on more than their uptime guarantees, you have to roll some DR yourself, and while you might think that this is easy, it is not easier than doing it with Terraform and Ansible on other clouds.
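Rolling your own DR, wherever you host, usually comes down to scripted restore drills. A minimal sketch of the verification step after restoring a backup into a scratch environment; which invariants matter is app-specific, and the two checked here (minimum row counts, per-table checksums) are purely illustrative:

```python
# After restoring a backup into a scratch environment, verify invariants
# before trusting the DR path. The invariant names here are hypothetical.
def verify_restore(row_counts: dict, checksums: dict,
                   min_counts: dict, expected_checksums: dict) -> list:
    """Return human-readable failures; an empty list means the drill passed."""
    failures = []
    for table, minimum in min_counts.items():
        if row_counts.get(table, 0) < minimum:
            failures.append(f"{table}: {row_counts.get(table, 0)} rows < {minimum}")
    for table, digest in expected_checksums.items():
        if checksums.get(table) != digest:
            failures.append(f"{table}: checksum mismatch")
    return failures
```

The point is that the check runs on a schedule, not only during an incident; an untested restore path is the classic DR audit finding.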

I have had my DR and monitoring audited in both their AWS and EU versions. One was no easier or harder than the other.

But the EU setup gave me a clear answer for clients on the CLOUD Act, Schrems II, GDPR, and Safe Harbor, which is a competitive advantage.

kikimora · 9 months ago
There is a lot more:

- Aurora to handle our spiky workload (can grow 100x from normal levels at times)
- Zero-ETL into Redshift
- Slow query monitoring, not just metrics but the actual query source
- Snapshots to move production data into staging to test queries

Besides this we also use:

- ECS to autoscale the app layer
- S3 + Athena to store and query logs
- Systems Manager to avoid managing SSH keys
- IAM and SSO to control access to the cloud
- IoT to control our fleet of devices

I’ve never seen how people operate complex infrastructures outside of a cloud. I imagine that using VPSes I would either have a dedicated DevOps engineer acting as a gatekeeper to the infrastructure or I'd get a poorly integrated and insecure mess. With cloud I have teams rapidly iterating on the infrastructure without waiting on any approvals and reviews. A real-life scenario:

1. Let's use DMS + PG with partitioned tables + Athena
2. A few months later: let's just use Aurora read replicas
3. A few months later: let's use DMS + Redshift
4. A few months later: Zero-ETL + Redshift

I imagine a DevOps engineer would be quite annoyed by such back and forth. Plus he's busy keeping all the software up to date.

sksjvsla · 9 months ago
I wanted to comment on this but mistakenly put the answer here. Sorry.

https://news.ycombinator.com/item?id=44335920#44346481

iLoveOncall · 9 months ago
For now :)
sksjvsla · 9 months ago
If you want me to assess what I would need over the next 5-10 years, I'd make a very different thread here on HN.

The defining conditions are my current setup and business requirements. It works well, and we've resisted pretending that we know where we will be in 5 years.

I am reminded of the 2023 story of the surprisingly simple infra of Stack Overflow [1] and the 2025 story that Stack Overflow is almost dead [2].

Given that the setup works now, one can't claim that it is only working "for now". I see no client demand in the foreseeable future leading me to think that this has been fundamentally architected incorrectly.

[1] https://x.com/sahnlam/status/1629713954225405952

[2] https://blog.pragmaticengineer.com/stack-overflow-is-almost-...

anticodon · 9 months ago
Some of these tasks are required when you run your service in Amazon Cloud as well. It's not all free and not all by default. You'll need someone experienced with Amazon Services to set up many of these things in the Amazon cloud as well.

Also, it's not like you need everything you mention and need it immediately.

NTP clock syncing is a part of any Linux distro for the last 20 years if not more.

I don't remember Amazon automatically locking down SSH (I haven't touched AWS for 7-8 years, and I don't remember such a feature out of the box back then).

Rolling web app deploys with rollback can be implemented in multiple ways; it depends on your app, and it can be quite easy in some instances. Also, it's not something that Amazon can do for you for free: you need to spend effort on the development side anyway, whether you deploy on Amazon or somewhere else. There's no magic bullet that makes automatic rollback free and flawless without development effort.
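That rollback takes development effort wherever you host can be made concrete with a minimal rolling-deploy loop. The three callables are hypothetical placeholders for whatever mechanism (Compose, systemd units, an autoscaling group) actually does the work:

```python
# Minimal rolling deploy with rollback: update one node at a time; if a
# node fails its health check, roll it and every already-updated node
# back. The callables are placeholders, not a real deployment API.
def rolling_deploy(nodes, deploy, health_check, rollback) -> bool:
    updated = []
    for node in nodes:
        deploy(node)
        if not health_check(node):
            for n in [node] + updated:
                rollback(n)
            return False  # remaining nodes were never touched
        updated.append(node)
    return True
```

The hard part, as the comment says, lives inside `health_check` and `rollback`: deciding what "healthy" means for your app and making rollback actually reversible is development work on any platform.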

sksjvsla · 9 months ago
Exactly. Well said.

A thing we learned in this process is that there are many levels of abstraction at which you can think about rollback, locking down SSH, and so on and so forth.

If your abstraction level is AWS and the big hyperscalers, it would be to use Kubernetes, but peeling layers of complexity off that, you could also do it with Docker Compose or even Linux programs that have been battle-tested for decades.

Most ISO-certified companies are not at hyperscale, so here is a fun one: instead of Grafana Agent from 2020, you could most likely get away better with rsyslog from 2004.
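As a sketch of how little that takes, here is roughly what shipping every log line to a central host looks like in rsyslog's modern RainerScript syntax; the target hostname is a placeholder, not anything from this thread:

```
# /etc/rsyslog.d/50-forward.conf (sketch; hostname is a placeholder)
# Forward everything over TCP with an on-disk queue so log lines
# survive a central-host outage.
action(type="omfwd" target="logs.example.internal" port="514" protocol="tcp"
       queue.type="LinkedList" queue.filename="fwd_queue"
       queue.saveOnShutdown="on" action.resumeRetryCount="-1")
```

A few lines of config plus a receiver on the other end covers the log-shipping half of what an agent pipeline does, at the cost of the agent's richer parsing and metrics.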

And if you want your EKS cluster to give you insights, you have to configure CloudWatch yourself, so what is the hands-off advantage of that setup compared to Ubuntu + Grafana Agent?

OutOfHere · 9 months ago
Hetzner's biggest problem is that they can and do terminate a user's account without warning if the user starts using CPU resources very heavily, or for any other reason. This happens for entirely legal usage, of course, and it can happen within months. When it does, consider your data lost and your account blocked. They will offer no explanation whatsoever, and will even send you a bill for the full month. Hetzner simply cannot be trusted, not even a little bit.

As for OVH, they don't do the above, but they have week-long unplanned downtimes, so using them is okay only as an optional resource.

Even so, there are lots of providers that are cheaper than Amazon and won't screw you over.

sksjvsla · 9 months ago
I have not experienced this, in spite of rumours online. As I mention in these two comments, given those rumours we decided to design our way around it by assuming that both providers would go down at some point (but not at the same time).

1. https://news.ycombinator.com/item?id=44335920#44339234

2. https://news.ycombinator.com/item?id=44335920#44337619

znpy · 9 months ago
So op is now spending 10% of their "$24,000 annual bill" which would be 2400 $/year. Which in turn would be $200/month on infrastructure.

If the whole company can run on 200$/month in VPSes, they probably went on AWS too early.

sksjvsla · 9 months ago
As I wrote elsewhere in this thread:

Being able to stay compliant and protect revenue is worth far more than quibbling over which cloud costs a little less.

The real ratio to look at is cloud spend vs. the revenue.

For me, switching from AWS to European providers wasn’t just about saving on cloud bills (though that was a nice bonus). It was about reducing risk and enabling revenue. Relying on U.S. hyperscalers in Europe is becoming too risky — what happens if Safe Harbor doesn’t get renewed? Or if Schrems III (or whatever comes next) finally forces regulators to act?

nostrebored · 9 months ago
This is a strangely limited view. Cloud providers have done the work of building fault-tolerant distributed systems for many of the _primitives_ with large blast radius on failure.

For example, you'd be hard pressed to find a team building AWS services who is not using SQS and S3 extensively.

Everyone is capable of rolling their own version of SQS. Spin up an API, write a message to an in memory queue, read the message. The hard part is making this system immediately interpretable and getting "put a message in, get a message out" while making the complexities opaque to the consumer.

There's nothing about rolling your own version that will make you better able to plan this out -- many of these lessons are things you only pick up at scale. If you want your time to be spent learning these, that's great. I want my time to be spent building features my customers want and robust systems.

sksjvsla · 9 months ago
I see where you’re coming from — no doubt, services like SQS and S3 make it easier to build reliable, distributed systems without reinventing the wheel. But for me, the decision to shift to European cloud providers wasn’t about wanting to build my own primitives or take on unnecessary complexity. It was about mitigating regulatory risk and protecting revenue.

When you rely heavily on U.S. hyperscalers in Europe, you’re exposed to potential disruptions — what if data transfer agreements break down or new rulings force major changes? The value of cloud spend, in my view, isn’t just in engineering convenience, but in how it helps sustain the business and unlock growth. That’s why I prioritized compliance and risk reduction — even if that means stepping a little outside the comfort of the big providers’ managed services.

iLoveOncall · 9 months ago
I don't mean that in an offensive way, but if your amount of logs is so small that you can live tail them, you don't operate at the scale that AWS cares about.

It's paid because operating that feature at AWS' scale is expensive as hell. Maybe not for your project, but for 90% of their customers it is.

sksjvsla · 9 months ago
If you were to divide the AWS customer base into a 10% bucket and a 90% bucket, the 90% bucket would not be the ones needing the infinite scale of AWS.
