The problem I ran into was that performance was bimodal. We had one group of users that was lightning fast, and the rest were far slower. I chased down a few obvious outliers (that one forum thread with 11,000 replies that some guy leaves open in a browser tab all the time, etc.), but the distribution was still bimodal. Eventually I just changed the application-level code to report known bots as one performance trace and everything else as another.
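A minimal sketch of that kind of split, assuming a Python backend; the user-agent pattern, bucket names, and function names here are illustrative placeholders, not whatever the real application used:

    import re
    import time
    from collections import defaultdict

    # Illustrative pattern only -- a real deployment would match against a
    # maintained catalogue of crawler user agents, not a hand-rolled regex.
    KNOWN_BOT_PATTERN = re.compile(
        r"googlebot|bingbot|yandexbot|baiduspider|crawler|spider|bot",
        re.IGNORECASE,
    )

    # Separate latency buckets so bot traffic can't mask the numbers
    # that real users actually experience.
    latencies_ms = defaultdict(list)  # {"bot": [...], "human": [...]}

    def trace_label(user_agent: str) -> str:
        return "bot" if user_agent and KNOWN_BOT_PATTERN.search(user_agent) else "human"

    def record_request(user_agent: str, handler):
        """Run the request handler and file its latency under the bot or human trace."""
        start = time.perf_counter()
        try:
            return handler()
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            latencies_ms[trace_label(user_agent)].append(elapsed_ms)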
60% of all requests were known bots. That doesn't even count the random-ass bot that some guy spun up at an ISP. Yes, this really happened: we were a paying customer of a company that decided to just conduct a DoS attack on us at 2 PM one afternoon. It took down the website.
Not only that, the bots effectively always got a cached response, since they all seemed to love hammering the same pages. Real users never got a cached response, since LRU eviction meant the actual discussions with real users were always being evicted (one way to keep that churn out of the cache is sketched below). There were bots that would just rescrape every page they had ever seen every few minutes. There were bots that would keep increasing their throughput until the backend app started to slow down.
There were bots that would run the JavaScript, for whatever insane reason, and start emulating users submitting forms, etc.
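To make the caching point above concrete, here's a hedged sketch (not what we actually ran) of an LRU variant where bot traffic can read the cache but never inserts or promotes entries, so crawler churn can't evict the pages real users are on. The class name and sizes are made up:

    from collections import OrderedDict

    class SegmentedLRU:
        """Tiny LRU where bot hits may read the cache but never insert or
        promote entries, so crawler churn can't evict pages humans read."""

        def __init__(self, max_entries: int = 1000):
            self.max_entries = max_entries
            self._store = OrderedDict()

        def get(self, key, *, is_bot: bool):
            if key not in self._store:
                return None
            if not is_bot:
                # Only human traffic counts as "recently used".
                self._store.move_to_end(key)
            return self._store[key]

        def put(self, key, value, *, is_bot: bool):
            if is_bot and key not in self._store:
                return  # don't let scrapers fill the cache with pages nobody reads
            self._store[key] = value
            self._store.move_to_end(key)
            while len(self._store) > self.max_entries:
                self._store.popitem(last=False)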
You're probably thinking "but you've got to appear in a search index, so it's worth it". Not really. Google's bot was one of the few well-behaved ones and would even slow its scraping if it saw a spike in response times. We also had an employee responsible for categorizing our organic search performance: while we had a huge amount of traffic from organic search, something like 40% of it went to just one URL.
In retrospect, I'm now aware that a bunch of this was early-stage AI companies scraping the internet for data.
Google has invested decades of core research with an army of PhDs into its crawler, particularly around figuring out when to recrawl a page. For example (a bit dated, but you can follow the refs if you're interested):
https://www.niss.org/sites/default/files/Tassone_interface6....
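The real work goes much deeper (change-rate estimation, page importance, crawl budgets), but a toy version of the "when should we recrawl this page" idea looks something like the sketch below; the constants and names are arbitrary, not anything from the paper:

    import hashlib
    from dataclasses import dataclass

    MIN_INTERVAL_S = 15 * 60          # never poll a page more often than every 15 min
    MAX_INTERVAL_S = 30 * 24 * 3600   # but check at least monthly

    @dataclass
    class CrawlState:
        content_hash: str
        interval_s: float = 24 * 3600  # start out assuming roughly daily changes

    def next_state(state: CrawlState, new_body: bytes) -> CrawlState:
        """Back off when a page looks static, speed up when it keeps changing."""
        new_hash = hashlib.sha256(new_body).hexdigest()
        if new_hash == state.content_hash:
            interval = min(state.interval_s * 2, MAX_INTERVAL_S)
        else:
            interval = max(state.interval_s / 2, MIN_INTERVAL_S)
        return CrawlState(content_hash=new_hash, interval_s=interval)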
I've not actually seen an AZ go down in isolation, so whilst I agree it's technically a less "robust" deployment, in practice it doesn't make that much of a difference.
> these "we cut bare costs by moving away from the cloud" posts are catnip for HN. But they usually don't make sense.
We moved away from Atlas because they couldn't cope with the data growth we had (4 TB is the max per DB). Turns out it's a fuckload cheaper even hosting it ourselves on Amazon (as in 50%). We haven't moved to Hetzner because that would be more effort than we really want to expend, but it's totally doable, with not that much extra work.
> more maintenance tasks (installing, upgrading, patching, troubleshooting, getting on-call, etc) with lower reliability and fewer services, isn't an advantage.
Depends, right? Firstly, it's not that much of an overhead, and if it saves you significant cash, it extends your runway.
Counterpoint: I have. Maybe not completely down, but degraded, or out of capacity on an instance type, or some other silly issue that caused an AZ drain. It happens.