I’m an SWE with a background in maths and CS in Croatia, and my annual comp is less than what you claim here. Not drastically, but comparing my comp to the rest of the EU it’s disappointing, although I am very well paid compared to my fellow citizens. My SRE/devops friends are in a similar situation.
I am always surprised to see such a lack of understanding of economic differences between countries. Looking through Indeed, a McDonald’s manager in the US makes noticeably more than anyone in software in southeast Europe.
Being able to stay compliant and protect revenue is worth far more than quibbling over which cloud costs a little less, or how much a monthly salary for an employee is in various countries.
The real ratio to look at is cloud spend vs. revenue.
For me, switching from AWS to European providers wasn’t just about saving on cloud bills (though that was a nice bonus). It was about reducing risk and enabling revenue. Relying on U.S. hyperscalers in Europe is becoming too risky: what happens if the EU–US Data Privacy Framework goes the way of Safe Harbor and Privacy Shield? Or if Schrems III (or whatever comes next) finally forces regulators to act?
If you want to win big enterprise and government deals, you have to do whatever it takes, and staying compliant and in control is a huge part of that.
I probably won't be responding after this, or in the future on HN, because I took a significant hit to my karma for keeping it real and providing valuable feedback. There are a lot of brigading accounts here that punish those who offer constructive criticism.
Generally speaking, AWS is incentivized to keep your account up so long as there is no legitimate reason to take it down. They generally vet claims with an appropriate level of due diligence before imposing action, because that way they can keep billing in the meantime. Spurious or unlawful requests cost them money; they want that money, and they operate at a scale where they can afford to do this.
I'm sure you've spent a lot of time and effort on your rollout, and you sound competent. What makes me cringe is the approach of treating this as just a technical problem, when it isn't.
If you'd done your research, you would have run across more than a few incidents where people running production systems had Hetzner shut them down outright, or worse, often in response to invalid legal claims which Hetzner failed to properly vet. There have also been some strange non-deterministic issues that may be related to failing hardware, but maybe not.
Their support is often one response every 24 hours. What happens when the first couple of responses are boilerplate because the tech didn't read or understand what you wrote? Every round trip costs 24 hours, plus some chance at each step of losing the next 24 hours entirely, and there's no real phone support, which is unmanageable. They do have a customer support line, but for most people it's an international call during banker's hours. If you're in Europe you'll have a much easier time lining those calls up; anywhere else, you're dealing with international calls where the first chance of the day may be midnight.
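To make that back-and-forth cost concrete, here's a rough expected-turnaround model. The numbers are my own illustrative assumptions, not anything Hetzner publishes: each useful round trip costs 24 hours, and each reply independently has some probability of being boilerplate that burns a day without progress.

```python
# Rough expected-turnaround model for once-a-day ticket support.
# Assumptions (mine, for illustration): each productive round trip
# costs 24 hours, and each reply has probability p_wasted of being
# boilerplate that burns a day without making progress.

def expected_hours(round_trips: int, p_wasted: float) -> float:
    """Expected hours to resolution with once-daily responses.

    Wasted replies add a geometric number of extra 24h cycles per
    productive step: E[cost per step] = 24 / (1 - p_wasted).
    """
    per_step = 24 * (1 + p_wasted / (1 - p_wasted))  # = 24 / (1 - p_wasted)
    return round_trips * per_step

# Three real round trips, 30% chance each reply is boilerplate:
print(expected_hours(3, 0.3))  # a bit over 4 days instead of 3
```

Even a modest boilerplate rate stretches a three-exchange ticket well past four days, which is the point: slow-cycle support compounds badly.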
Having a separate platform for each set of servers is sound practice, but what happens when the DAG running your logging/notification system sits on the platform that fails, and it has no failover of its own? The issues get particularly difficult when half your stack fails on one provider, stale data gets replicated over to your good side, and you end up with nonsensical or invisible failures; often that's not enough to force an automatic failover through traffic management, which is rarely granular enough.
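One guard against exactly that stale-replication failure mode is to refuse automatic failover when the candidate side's replication watermark is too old. A minimal sketch, with hypothetical names and thresholds of my own choosing:

```python
import time
from typing import Optional

# Hypothetical guard (a sketch, not anyone's production setup): only
# allow automatic failover to a replica whose last-applied replication
# watermark is fresh, so a half-failed primary can't quietly push stale
# data onto the "good" side and cause the invisible failures described.

MAX_STALENESS_S = 30  # assumed tolerance for replication lag, in seconds

def safe_to_failover(replica_watermark_ts: float,
                     now: Optional[float] = None) -> bool:
    """True only if the replica's last-applied data is recent enough to serve."""
    now = time.time() if now is None else now
    return (now - replica_watermark_ts) <= MAX_STALENESS_S

print(safe_to_failover(1000.0, now=1010.0))  # watermark 10s old: OK
print(safe_to_failover(1000.0, now=1300.0))  # watermark 5min old: refuse
```

The point of the check is that "the other side is up" is not the same as "the other side is correct"; failing over onto stale data can be worse than staying down.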
It's been a while since I've had to work with Cloudflare traffic management, so this may have improved, but I'm reasonably skeptical. I've personally seen incidents where the support turnaround for outright outages was exceptional, but for anything beyond a simple HTTP 200 it was nonexistent, with finger pointing. The finger pointing was pointless, because the raw network captures showed the failure in L2/L3 traffic on the provider side, which the provider was ignoring. They still argued, and the downtime was extended as a result. Vendor management issues are the worst when contracts don't properly scope and enforce timely action.
Quite a lot of the issues I've seen with various hosting providers, OVH and Hetzner included, are related to failing hardware, or to transparent stopgaps they've put in place which break the upper service layers.
For example, at one point we were getting what looked like stale-cache issues in traffic between the nodes of a two-node backend set split across different providers. There was no cache between them, yet something was breaking sequential flows in the API while still fulfilling the flows that were atomic. HTTP 200 was fine; AAA was not, along with a few others. It appeared a transparent Squid proxy had been placed in-line, and it promptly disappeared once we reached out to the platform, without them ever confirming it had been there. That is concerning, to say the least, when the app you're deploying is knowledge management software holding proprietary and confidential business information. Needless to say, the project didn't move forward on any cloud platform after that (it was populated with test data, so nothing was lost). It's why many of our cloud migrations were suspended and converted into cloud repatriation projects. Counter-party risk is untenable.
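For what it's worth, an in-line Squid-style proxy often betrays itself in response headers before you go to packet captures. A crude detector along these lines (the header names are common Squid defaults; a well-hidden proxy strips them, so absence proves nothing and captures remain the real test):

```python
# Crude check for an in-line caching proxy by looking for headers that
# Squid and similar proxies commonly inject (Via, X-Cache, etc.).
# Absence proves nothing -- a stealthy proxy removes these -- but
# presence is a red flag worth escalating.

PROXY_HINT_HEADERS = {"via", "x-cache", "x-cache-lookup", "x-squid-error"}

def proxy_hints(headers: dict) -> list:
    """Return any proxy-suggestive header names found (case-insensitive)."""
    return sorted(h for h in headers if h.lower() in PROXY_HINT_HEADERS)

# Example response headers as a plain dict (illustrative values):
resp_headers = {"Content-Type": "application/json",
                "Via": "1.1 cache01 (squid/5.2)",
                "X-Cache": "HIT from cache01"}
print(proxy_hints(resp_headers))  # ['Via', 'X-Cache']
```

In our case the giveaway was behavioral (sequential flows breaking while atomic ones succeeded), which is why the raw captures mattered more than any header check.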
Younger professionals, I've found, view these and related issues solely as technical problems, and they weigh those technical problems more heavily than the problems they can't weigh for lack of experience, compounded by something called the streetlight effect, an intelligence trap that persists largely because they aren't taught a Bayesian approach. There's a SANS CTI presentation on this (https://www.youtube.com/watch?v=kNv2PlqmsAc).
The TL;DR is that a technical professional can see and interrogate just about every device, and that can lead to poor assumptions and an illusion of control: edge problems get ignored or dismissed because there's no clear visibility into how they can occur (when the low-level facilities don't behave as they should). This is the class of problems in the non-deterministic failure domain, where only guess-and-check works.
The more seasoned tend to focus on the flexibility needed to mitigate problems arising from business process failures, such as when a cooperative environment becomes adversarial, which necessarily happens when trust breaks down through loss, deception, or a breach of expectations on one party's part. This phase change of environment, and the criteria for it, is rarely reflected or even touched on in BC/DR plans; at least not the ones I've seen. The ones I've been responsible for drafting often include a gap analysis covering the dependencies, stakeholder input, and criteria between the two proposed environments, along with contingencies.
This should include legal, obviously, to hold people to account when they fail in their obligations, but even that is often not enough today. Legal action often costs more than simply taking the loss and walking away, absent a few specific circumstances.
This youthful tendency is what makes me cringe. The worst disasters I've seen were readily predictable to someone with knowledge of the underlying business mechanics, and how those business failures would lead to inevitable technical problems with few if any technical resolutions.
If you were co-locating on your own equipment with physical data center access I'd cut you a lot more slack, but from your other responses it didn't seem like you are.
There are ways to mitigate counter-party risk while still getting the hosting you need. Apples-to-oranges comparisons between services in an opaque landscape rarely paint an objective picture, which is why a healthy amount of skepticism and disagreement is needed to make sure you didn't miss something important.
There's an important difference between constructive criticism intended to reduce adverse cost and consequence, and criticisms that simply aren't based in reality.
The majority of people on HN these days don't seem capable of making that distinction, at least in aggregate. My relatively tame reply was downvoted by more than 10 people.
These people by their actions want you to fail by depriving you of feedback you can act on.
On the topic of Hetzner and account risks, I completely agree: this is not just a technical issue, and that's why we built a multi-cloud setup spanning Hetzner and OVH in Europe. The architecture was designed from the start to absorb a full platform-level outage or even a unilateral account closure. Recovery and failover have been tested specifically with these scenarios in mind — it's not a "we'll get to it later" plan, it's baked into the ops process now.
I’ve also engaged Hetzner directly about the reported shutdown incidents — here’s one of the public discussions where I raised this: https://www.reddit.com/r/hetzner/comments/1lgs2ds/comment/mz...
What I got in a private follow-up from Hetzner support helped clarify a lot about those cases. Without disclosing anything sensitive, I’ll just say the response gave me more confidence that they are aware of these issues and are actively working to handle abuse complaints more responsibly. Of course, it doesn't mean the risk is zero — no provider offers that — but it did reduce my level of concern.
Regarding Cloudflare, I actually agree with your point: vendor contract structure and incentives matter. But that’s also why I find the AWS argument interesting. While it’s true that AWS is incentivized to keep accounts alive to keep billing, they also operate at a scale where mistakes (and opaque actions) still happen — especially with automated abuse handling. Cloudflare, for its part, has consistently been one of the most resilient providers in terms of DNS, global routing, and mitigation — at least in my experience and according to recent data. Neither platform is perfect, and both require backup plans when they become uncooperative or misaligned with your needs.
The broader point you make — about counterparty risk, legal ambiguity, and the illusions of control in tech stacks — is one I think deserves more attention in technical circles. You're absolutely right that these risks aren't visible in logs or Grafana dashboards, and can't always be solved by code. It's exactly why we're investing in process-level failovers, not just infrastructure ones.
Again, thank you for sharing your insights here. I don’t think we’re on opposite sides — I think we’re simply looking at the same risks through slightly different lenses of experience and mitigation.