After Downdetector went down with the rest of the internet during the Cloudflare outage today, I decided to build a robust, independent tool which checks if Downdetector is down. Enjoy!!
Those are all much smaller. Smaller providers have a much stronger incentive to be reliable, as they will lose customers if they are not. In a corporate setting, management will say "this would not have happened if you had gone with AWS". It's the current version of "no one ever got fired for buying IBM" (we had MS and others in between).
Hetzner provides a much simpler set of services than AWS. Less complexity to go wrong.
A lot of people want the brand recognition too. It's also become the standard way of doing things and is part of the business culture. I have sometimes been told it's unprofessional or looks bad to run things yourself instead of using a managed service.
There is this weird thing that happens with hyperscale - the combination of highly central decision-making, extreme interconnection / interdependence of parts, and the attractiveness of lots of money all conspire to create a system pulled by unstable attractors to a fracturing point (slowed / mitigated at least a little by the inertia of such a large ship).
Are smaller scale services more reliable? I think that's too simple a question to be relevant. Sometimes yes, sometimes no, but we know one thing for sure - when smaller services go down the impact radius is contained. When a corrupt MBA who wants to pump short term metrics for a bonus gains power, the damage they can do is similarly contained. All risk factors are boxed in like this. With a hyperscale business, things are capable of going much more wrong for many more people, and the recursive nature of vertical+horizontal integration causes a calamity engine that can be hard to correct.
Take the financial sector in 08. Huge monoliths that had integrated every kind of financial service with every other kind of financial service. Few points of failure, every failure mode exposed to every other failure mode.
There's a reason asymmetric warfare is hard for both parties - cellular networks of small units that can act independently are extremely fault tolerant and robust against changing conditions. Giants, when they fall, do so in spectacular fashion.
> Smaller providers have a much stronger incentive to be reliable, as they will lose customers if they are not.
Hard disagree. A smaller provider will think twice about whether they use a Tier I data center versus a Tier IV data center, because the cost difference is substantial and in many cases prohibitive.
Not to mention the familiarity of the company, its services and expectations. You can hire people with experience with AWS, Azure or GCP, but the more niche you go, the more likely it is that some people you hire won't know how to work with those systems and their nuances. That's fine, they can learn as they work, but it adds to ramp-up time and could lead to inadvertent mistakes.
I've actually tried Hetzner on and off with 1 server for the past 2 years and keep running into downtime every few months.
First I used an EX101 with an i9-13900. Within a week it just froze. It could not be reset remotely. Nothing in kern.log. Support offered no solution but a hard reboot. No mention of what might be wrong other than user error.
A few months later, one of the drives just disconnected from the RAID by itself. It took support 1 hour to respond and they said they found no issue so it must be my fault.
Then I changed to a Ryzen-based server and it also mysteriously had problems like this. Again the support blamed the user.
It was only several months after I cancelled the server that I saw this, so I know it isn't just me.
> I have sometimes been told it's unprofessional or looks bad to run things yourself instead of using a managed service.
That's an incredibly bad take lol.
There are times where "The Cloud" makes sense, sure. But in my experience, the majority of the time companies over-use the cloud. On-prem is GOOD. It's cheaper, arguably more secure if you configure it right (a challenge, I know, but hear me out) and gives you data sovereignty.
I don't quite think companies realize how bad it would be if, e.g., AWS was hacked.
Any data you have on the cloud is no longer your data. Not really. It's Amazon's, Microsoft's, Apple's, whoever's.
> Smaller providers have a much stronger incentive to be reliable, as they will lose customers if they are not.
I disagree because, conversely, outages at larger providers cause millions or maybe even billions of dollars in losses for their customers. Those customers might be more "stuck" in their current providers' proprietary schemes, but these kinds of losses will cause them to move away, or at least diversify cloud providers. In turn, this will cause income losses to the cloud provider.
Earlier this year, a Hetzner server I manage was shut down, and after I started it via the console, it booted to a rescue system. In the same month, it was rebooted without a reason. There was some maintenance notice but the server was not listed as impacted.
Note that I'm not saying Hetzner is bad. Incidents just happen in Europe too. The server hasn't had a lot of issues like this over the years.
They've recently introduced bunny.net Shield to add a security layer. I've not made use of it yet so I don't know what the coverage is like or how effective it is: https://bunny.net/shield/
I've done something similar. It's worth noting Scaleway in the same space, for people looking for an AWS replacement with managed services (equivalents to Fargate/Lambda/SQS/S3/etc.) rather than just bare instance hosting.
+1 for Scaleway. I also use Hetzner for most of my compute. But some stuff just really profits from using managed services. I've used Scaleway's Serverless compute offers and managed DBs and been quite happy with them.
We are also looking to migrate off Cloudflare. I thought Bunny.net was mostly a pure CDN, not a reverse proxy like Cloudflare. Am I wrong? One of the most important things for us would be DDoS protection.
American solo developer here. Moved to Hetzner two months ago. They have servers in Oregon for west coast people. My storage box is in Germany but that is okay, it is for backups.
I know you were joking, but responding in seriousness - while in general it's worthwhile asking "Quis custodiet ipsos custodes?", in this particular case, I don't see any issue with Down Detector detecting the Down Detector Down Detector. Assuming they are in different availability zones, using different code, with a different deployment cadence, this approach works quite well in practice.
haha, this is the exact comment I was hoping to see! Indeed, I was joking. The Watchmen graphic novel is very important to me as it opened my eyes to the concept of “who watches the watchmen”, which I was ultimately alluding to here, albeit extremely facetiously.
Three down detectors walk into a bar. The bartender asks them if they're all up. The first says "I don't know". The second says "I don't know". The third says "Yes".
It's a centralized vs decentralized vs distributed system question.
Since down detectors serve to detect failures of centralized (and decentralized) systems, the idea would be to at least get that right: a distributed system to detect outages.
You basically run detectors that heartbeat each other. Just a few suffice.
Once you start to see clusters of detectors go silent, you can assume things are falling apart, which is fine so long as a few remain.
Self healing also helps to make the web of nodes resilient to inevitable infrastructure failures.
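For anyone who wants to play with the heartbeat idea, here is a rough Python sketch, not a real implementation. The `Detector` class, the 30-second `timeout`, and the `quorum` rule are all made up for illustration, and timestamps are passed in by hand so there is no networking involved.

```python
from dataclasses import dataclass, field


@dataclass
class Detector:
    """One node in a hypothetical heartbeat mesh (all names are illustrative)."""
    name: str
    timeout: float = 30.0  # assumed: seconds of silence before a peer looks down
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, peer: str, now: float) -> None:
        # Remember when we last heard from this peer.
        self.last_seen[peer] = now

    def silent_peers(self, now: float) -> set:
        # Peers we have heard from before but not within the timeout.
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}


def suspected_down(observers, now, quorum=0.5):
    """Peers that more than `quorum` of the still-reporting observers call silent."""
    votes = {}
    for detector in observers:
        for peer in detector.silent_peers(now):
            votes[peer] = votes.get(peer, 0) + 1
    return {p for p, n in votes.items() if n > quorum * len(observers)}
```

Requiring agreement from several observers means one detector with a flaky link cannot declare everyone else dead on its own, which is roughly the "clusters going silent" signal described above.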
Thank you for your service! Now, for an even bigger challenge: since it seems the increased demand for the Cloudflare status page brought down Amazon CloudFront for a bit as well, build a new CDN capable of handling that load as well...
I think an important caveat here is that Downdetector was not actually down; the Cloudflare human verification component was (AFAIK). I wonder if this Downdetector down detector accounts for that aspect? It was technically "not down" but still unusable.
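One way a checker could account for that is to look at what came back, not just whether something came back. A minimal sketch, assuming the probe already has a status code and a body: the marker strings below are placeholders I picked for illustration, not a documented list, and `classify` is a hypothetical helper.

```python
# Assumed placeholders: text that suggests an interstitial challenge page
# rather than the real site. Adjust to whatever you actually observe.
CHALLENGE_MARKERS = ("just a moment", "verify you are human")

# Assumed placeholder: text the real page is expected to contain.
EXPECTED_MARKER = "downdetector"


def classify(status_code, body):
    """Return 'up', 'blocked', 'degraded', or 'down' for one probe result.

    status_code is None when no HTTP response was received at all.
    """
    text = (body or "").lower()
    # A challenge page can come with an error status, so check for it first.
    if any(marker in text for marker in CHALLENGE_MARKERS):
        return "blocked"
    if status_code is None or status_code >= 500:
        return "down"
    if 200 <= status_code < 300 and EXPECTED_MARKER in text:
        return "up"
    # Responded, but not with the content we expected.
    return "degraded"
```

With something like this, "technically not down but unusable" shows up as `blocked` instead of being lumped in with either `up` or `down`.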
Cloudflare > Bunny.net
AWS > Hetzner
Business email > Infomaniak
Not a single client site has experienced downtime, and it feels great to finally decouple from U.S. services.
First I used an EX101 with an i9-13900. Within a week it just froze. It could not be reset remotely. Nothing in kern.log. Support offered no solution but a hard reboot. No mention of what might be wrong other than user error.
A few months later, one of the drives just disconnected from the RAID by itself. It took support 1 hour to respond and they said they found no issue so it must be my fault.
Then I changed to a Ryzen-based server and it also mysteriously had problems like this. Again the support blamed the user.
It was only several months after I cancelled the server that I saw this, so I know it isn't just me.
https://docs.hetzner.com/robot/dedicated-server/general-info...
That's an incredibly bad take lol.
There are times where "The Cloud" makes sense, sure. But in my experience, the majority of the time companies over-use the cloud. On-prem is GOOD. It's cheaper, arguably more secure if you configure it right (a challenge, I know, but hear me out) and gives you data sovereignty.
I don't quite think companies realize how bad it would be if, e.g., AWS was hacked.
Any data you have on the cloud is no longer your data. Not really. It's Amazon's, Microsoft's, Apple's, whoever's.
This sounds like a good thing.
You can use whatever infrastructure you want for whatever reason, but you may not have an accurate picture of the availability.
This may be true over a long enough timeframe, but GP stated that their clients had experienced no downtime since switching at the start of the year.
That is clearly better than both AWS and Cloudflare during that time.
Am I missing something or is bunny.net not actually a replacement for that?
Ah yes, the place for RabbitMQ endpoints.
but who detects the down detector detecting the down detector detecting the down detector
Arbites.
Maybe distributed down detection?
I know there are people here perfectly capable of running with that idea and we might just see a distributed down detector announced on HN :)
https://youtu.be/DpMfP6qUSBo
It's downdetectorsdown all the way down.
From there, the "who's watching who?" can become mathematically interesting.
* https://en.wikipedia.org/wiki/Directed_Graph
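To make the graph angle concrete, here is a small sketch that treats "X watches Y" as a directed edge and asks which detectors nobody is watching. The `watches` mapping and both function names are hypothetical, just a way to poke at the idea.

```python
def watchers_of(watches):
    """Invert a 'who watches whom' mapping into 'who is watched by whom'.

    `watches` maps each detector name to the names it monitors.
    """
    nodes = set(watches) | {t for targets in watches.values() for t in targets}
    incoming = {n: set() for n in nodes}
    for watcher, targets in watches.items():
        for target in targets:
            if target != watcher:  # watching yourself does not count
                incoming[target].add(watcher)
    return incoming


def unwatched(watches):
    """Detectors with no incoming edge, i.e. nobody notices if they fail."""
    return {n for n, ws in watchers_of(watches).items() if not ws}


chain = {"a": ["b"], "b": ["c"], "c": []}
ring = {"a": ["b"], "b": ["c"], "c": ["a"]}
```

In the `chain` example the head of the chain is always unwatched, however long you make it; closing it into the `ring` gives every detector a watcher without adding any nodes.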
Looks like it's hosted in London?