Metrics should be emitted in a separate stream and never via logs outside of corner cases. Logs should be used to determine WHY the system is having issues, never WHETHER it is having issues.
Log alerting is a fool's errand that looks like a great idea at the start but quickly becomes a sand trap: it will drive future maintainers crazy and, at scale, it will overwhelm your systems.
Why is log alerting a bad idea?
Every log line becomes a metric point that must be dealt with, so the logging system must be kept operational and error-free. Due to the other problems below, that system quickly becomes a beast of its own.
Logs are generally much bigger than a KV pair of <Metric> <Value>, so there ends up being a ton of filtering going on in the logging system, adding to the load.
The logging system probably does not understand rates, so you end up writing gnarly queries to answer "Is this my first unhandled exception in 10m, or my 50th?" The equivalent query in Prometheus is much, much simpler.
Each language's logging library handles things differently, so the organization must be on point to either A) keep the log format the same across all languages, or B) teach the logging system how to massage each log into a format the alerting system can handle. Obviously A causes massive developer friction and B causes massive Ops friction.
Finally, I find that people who rely on log alerting tend not to handle exceptions as well, because they can just trust the logging system to alert them on a specific problem and deal with it manually.
So, for the sake of the future Ops person who has to deal with your code, I'm begging you: import prometheus_client.
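To make that concrete, here's a minimal sketch assuming nothing beyond prometheus_client itself; the metric name, port, and alert threshold are illustrative, not prescribed by anything above:

```python
# Minimal sketch: count unhandled exceptions as a metric instead of alerting on logs.
# Metric name, port, and threshold are illustrative choices.
import sys
from prometheus_client import Counter, start_http_server

UNHANDLED_EXCEPTIONS = Counter(
    "unhandled_exceptions_total",
    "Unhandled exceptions raised by this service",
)

def _count_and_log(exc_type, exc_value, exc_traceback):
    # The log still explains WHY; the counter answers WHETHER something is wrong.
    UNHANDLED_EXCEPTIONS.inc()
    sys.__excepthook__(exc_type, exc_value, exc_traceback)

sys.excepthook = _count_and_log

# Expose /metrics for Prometheus to scrape.
start_http_server(8000)

# The "first or 50th in 10m?" question then becomes a one-line rule in Prometheus, e.g.:
#   rate(unhandled_exceptions_total[10m]) > 0.05
```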
I've spent the last few years in Python land, recently heavily LLM-assisted, but I'm itching to do something with Ruby (and/or Rails) again.
It also serves as a natural sandbox for the "setup" part, so we can always know that the script gets interpreted within a finite (and short) time and no weird stuff can ever happen.
Of course, there are ways to combine the two (e.g. GitLab can generate and then trigger downstream pipelines from within the running CI), but the default is the script. It also has the side effect that pipeline setup can't ever do stuff that cannot be debugged (because it runs _before_ the pipeline). But I concede that this is not that clear-cut; both have advantages.
My argument is that we should acknowledge that any CI/CD system intended for wide usage will eventually arrive here, and it's better that we go into that intentionally rather than accidentally.
> The Fix: Use a full modern programming language, with its existing testing frameworks and tooling.
I was reading the article and thinking to myself, "a lot of this is fixed if the pipeline is just a Python script." And really, if I were to start building a new CI/CD tool today, the "user facing" portion would be a Python library that contains helper functions for interfacing with the larger CI/CD system. Not because I like Python (I'd rather use Ruby) but because it is ubiquitous and completely sufficient for describing a CI/CD pipeline.
I'm firmly of the opinion that once we start implementing "the power of real code: loops, conditionals, runtime logic, standard libraries, and more" in YAML, then YAML was the wrong choice. I absolutely despise Ansible for the same reason and wish I could still write Chef cookbooks.
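As a rough thought experiment of what that "user facing" library could look like (everything below is hypothetical; the Pipeline class and its helpers are invented for illustration, not any existing tool's API):

```python
"""Hypothetical sketch of the "user facing" part of a CI/CD tool as a small
Python library. All names (Pipeline, stage, run) are invented for illustration."""
import os
import subprocess
from contextlib import contextmanager


class Pipeline:
    def __init__(self, name: str):
        self.name = name
        self.steps: list[tuple[str, str]] = []  # (stage name, shell command)

    @contextmanager
    def stage(self, name: str):
        self._current = name
        yield
        self._current = None

    def run(self, command: str) -> None:
        # A real tool would hand this to a runner/executor; here we just record it.
        self.steps.append((self._current, command))

    def execute(self) -> None:
        for stage, command in self.steps:
            print(f"[{self.name}/{stage}] {command}")
            subprocess.run(command, shell=True, check=True)


pipeline = Pipeline("build-and-test")

with pipeline.stage("test"):
    # Real loops instead of YAML templating.
    for version in ("3.11", "3.12"):
        pipeline.run(f"tox -e py{version.replace('.', '')}")

with pipeline.stage("package"):
    # Real conditionals instead of `rules:`/`when:` blocks.
    if os.environ.get("CI_BRANCH") == "main":
        pipeline.run("python -m build")

if __name__ == "__main__":
    pipeline.execute()
```

Which is the point of the quoted "fix": the pipeline becomes an ordinary module you can unit-test with the language's existing tooling.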
This is the nasty key point. The reliability is decided client-side.
For example, systemd-resolved at times enacted maximum technical correctness by always returning the lowest IP address. After all, DNS-RR is not well defined, so always returning the lowest IP is not wrong. It got changed after some riots, but as far as I know, Debian 11 is stuck with that behavior, or was for a long time.
Or, I deal with many applications with shitty or no retry behavior. They go "Oh no, I got one connection refused, gotta cancel everything, shut down, never try again." So now 20-30% of all requests die in a fire.
It's an acceptable solution if you have nothing else. As the article notes, if you have quality HTTP clients with a few retries configured on them (like browsers), DNS-RR is fine for finding an actual load balancer with health checks and everything, which can provide a 100% success rate.
But DNS-RR is no load balancer, and load balancers are better.
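For what it's worth, here is a rough sketch of what "reliability decided client-side" means in practice: resolve every record, don't trust the resolver's ordering, and fall through to the next address on failure. The host, port, and timeout below are placeholders.

```python
# Rough sketch of client-side handling of DNS round-robin: try every resolved
# address instead of dying on the first "connection refused".
import random
import socket

def connect_any(host: str, port: int, timeout: float = 3.0) -> socket.socket:
    # getaddrinfo returns all A/AAAA records the resolver hands back.
    addrs = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    random.shuffle(addrs)  # don't rely on the resolver's ordering (see above)
    last_error = None
    for *_, sockaddr in addrs:
        try:
            return socket.create_connection(sockaddr[:2], timeout=timeout)
        except OSError as exc:
            last_error = exc  # dead backend: fall through to the next record
    raise last_error or OSError(f"no addresses for {host}")

# conn = connect_any("example.com", 443)
```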
There were definitely some warts in that system but as those sorts of systems go it was fast, easy to introspect, and relatively bulletproof.
I think AWS will do 5 Gbps with a capable peer -- which is their limit for a single flow [1] -- but you might need to tell them first so they don't kill public networking on the instance. I found that UDP iperf tests reliably got my instance's internet shut off, so keep that in mind. On the other hand, OVH will happily do 5-ish Gbps to/from my EC2 instance in a TCP iperf test, but won't tolerate more than 1 Gbps of inbound UDP. OVH support has indicated that this is expected, though they don't document that limitation, and it seemed that both their support and network engineering people were themselves unaware of the limit until we complained. They don't seem to have the same limits on ESP, which is why I developed an interest in IPsec arcana.
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...
Wait, what? I'm pretty sure I still used unencapsulated ESP a few months ago… though I wouldn't necessarily notice if it negotiated UDP after some update, I guess… _starts looking at things_
Edit: the strongSwan 6.0 beta documentation still lists "<conn>.encap default: no" as a config option — this wouldn't make any sense if UDP encapsulation were always on now. Are you sure about this?
There's an issue that has been open for years; it will probably never be fixed:
Although, I do feel slighted when a manager acknowledges the absurdity of all the corporatisms we hear every day and then proceeds to preach them to everyone and waste time anyway. Like, please, I thought we just agreed this is all fluff.