nielsole · 3 years ago
Another random selection:

* When choosing internal names and identifiers (e.g. DNS), do not encode the team's position in the org hierarchy. Chances are the next reorg will arrive sooner than the identifier is retired, and renaming is often hard.

* The industry-leading tools will contain bugs. From the Linux kernel to deploy tooling, there are bugs everywhere. Part of your job is to identify and work around them until upstream patches make it to you, if they ever do.

* Maintaining a patched fork is usually more expensive than setting up a workaround.

* Your hyperscaler cloud provider has plenty of scalability limitations. Some of which are not documented. If you want to do something out of the ordinary make sure to check with your account rep before wasting engineering time.

* Purchased SaaS will break production in the middle of the night. Your own team will have the best context and motivation to fix it or work around it. When choosing a vendor, include visibility into their internal monitoring as a factor for disaster recovery (exported metrics and logs of their control plane, for example).

vladvasiliu · 3 years ago
> * Your hyperscaler cloud provider has plenty of scalability limitations. Some of which are not documented. If you want to do something out of the ordinary make sure to check with your account rep before wasting engineering time.

If only they'd tell you. We had this exact issue on AWS. Seemingly random packet drops. Metrics on both clients and servers were ok, latency specifically was very low when it worked.

Call up support: "Yeah, you're running into our connection limit." "Oh. What's that limit?" "Yeah, I can't tell you that." His solution, since this was somehow related to connection tracking in the security group, was that I could set the security group to allow all/all and set up filtering at the NACL level. Turns out I could do that for this particular issue.

This was before there was a possibility to monitor this [0]. Called up our customer manager. "Let me check". A few days later, "yeah, that's not something we divulge".

---

[0] For those who don't know, it's now possible to keep an eye on refused connections (at least on Linux). https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitori... -> conntrack_allowance_exceeded
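For reference, the ENA driver exposes that counter through `ethtool -S`, so it can be polled locally. A rough Python sketch (the interface name and the exact output layout are assumptions about your environment):

```python
import re
import subprocess

def parse_ethtool_stats(output: str) -> dict:
    """Parse `ethtool -S <iface>` output into a {stat_name: value} dict."""
    stats = {}
    for line in output.splitlines():
        match = re.match(r"\s*([\w.]+):\s*(\d+)", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats

def conntrack_drops(iface: str = "eth0") -> int:
    """Return packets dropped because the instance's connection-tracking
    allowance was exceeded (ENA driver metric on EC2)."""
    output = subprocess.run(
        ["ethtool", "-S", iface], capture_output=True, text=True, check=True
    ).stdout
    return parse_ethtool_stats(output).get("conntrack_allowance_exceeded", 0)
```

Shipping that number to your metrics system turns "seemingly random packet drops" into an alertable signal.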

toast0 · 3 years ago
Ran into that one too, but my service rep didn't mention the possibility of configuring connectionless firewall rules. I'm still bitter many years later.
anilakar · 3 years ago
> When choosing internal names and identifiers (e.g. DNS) do not include org hierarchy of the team.

Naming in general is hard. If you name stuff based on location, use an identifier that won't change, like provider datacenter names, street addresses or customer building codes, not the current tenant or purpose of use.

For products, come up with an internal product/project name and stick to it in everything that is not immediately visible to the customer. At one point you could see the current and three previous names of our product if you popped out an iframe and opened the inspector (logo with name, page title, URL and prefixed log messages).

> Maintaining a patched fork is usually more expensive than setting up a workaround

When your bosses demand additional features a single customer requested, you absolutely have to make them understand that the functionality must be added to the main product.

throwawaaarrgh · 3 years ago
My naming convention is like this:

- anything my company doesn't create or own is just called whatever it is

- logical components that aren't specific to the org chart can be whatever you want

- anything org chart specific gets a randomly generated code name. an internal website allows you to register and look up any code name across the company. this allows you to find who the hell owns server "ws-prod" in a 6 year old account that nobody seems to maintain. instead you find "peanutcar-ws-prod" and can then look up who registered "peanutcar". (this also prevents the bu-org-group-product-subproduct-env mess that eventually runs over character limits)

- doesn't matter if the rest of the company doesn't do it, I do it for what I manage. later on if it gets adopted, fine, but if not, at least I won't ever have to rename my crap.

throwawaaarrgh · 3 years ago
Truth is an interesting concept. It's often subjective and has many forms. Within the context of the cloud, almost all cloud services are mutable in place, so "truth" is whatever the current state of the cloud actually is. Whatever is in Git is merely idealism.

Whatever you are maintaining, read the docs completely first. And I mean cover to cover. Not just the one chapter you need to get a PoC up and running. You will wish you had later, and it will come in handy many times over your career. Consider it an investment in your future.

Read books on microservices before you implement them. Whatever two-line quip you read on a blog will not be as good as reading several whole books from experts.

Docker multi-stage builds won't work in some circumstances. Build optimization gets more complex the more "advanced" you need your builds to be.

crdrost · 3 years ago
Thanks for the alternative microservices quip, it was better than the original. Indeed, I find that “microservices should only perform a single task” is a really dangerous way to phrase it because we have no idea what the article means by “task.” The classic microservices separation is to split an ordering service from a shipping service; is each of those one “task”? Or, at the most extreme, is saving an order distinct from returning the list of your outstanding orders?

Even when people graduate to the language of DDD and refine, they often settle on “one microservice per bounded context,” where “bounded context” means “separated however I want it separated at the time” and has no consistent principle behind it. This despite the fact that I think Eric Evans was quite explicit in his explanation of the idea: he meant a mapping of the software onto the fuzzy, complex world of businesspeople and business language. Perhaps a better way to phrase it is one microservice per archetype of user: “we have people from the warehouses who all speak the same shipping jargon, we should have a microservice specifically for them which speaks their language.” I think most developers target their microservices smaller than that, in which case it is definitely not “one microservice per bounded context.”

Don't get me started on how “strong coupling” is shorthand for “coupled in ways I don't like” etc. ... Sometimes I feel like I'm on an episode of “whose line is it anyway?”, where everything is made up and the points don't matter.

LawTalkingGuy · 3 years ago
All guidance assumes it's for a thinking person. You should look at what could be learned, not insist that there be one unambiguous message and that it be earth-shaking.

> I find that “microservices should only perform a single task” is a really dangerous way to phrase it because we have no idea what the article means by “task.”

First off, it's not dangerous. It's just loosely defined.

Second, you don't need to know their notion of task size to know that two things that are different (to them!) shouldn't be shoved together.

If the two tasks fit together so well that it's a mere 5% extra code to do them both, then maybe they are two sides of the same coin. But if "one simple thing" is painful to implement maybe it's really two separate things under one blanket.

> often they settle on “one microservice per bounded context” where “bounded context” means “separated however I want it separated at the time,”

Right, because sometimes 'bounded context' is what Legal tells you about data residency and sometimes it's about optimal latency.

> Don't get me started on how “strong coupling” is shorthand for “coupled in ways I don't like”

Strong coupling is generally easy to define, relative to loose coupling, in any given language/platform. The value of engineering around it depends on the value of doing the thing which requires it multiplied by how often you have to do it.

vsareto · 3 years ago
Most of this comes down to what the team/org wants and who has authority to tell you it’s not defined right or tightly coupled.

It definitely is loosely defined and the rules do get stretched to fit opinions though.

rswail · 3 years ago
"Microservices" are just "services". I use the business object/entity as the service boundary. So I don't have an "ordering" vs "shipping" service. I have an Orders service and a Shipments service.

Orders are related to Fulfillments are related to Shipments. They are "coupled" in that an Order will trigger Fulfillment, and Fulfillment will trigger Shipment (there's a Payments service in there somewhere too).

k__ · 3 years ago
"read the docs completely first"

I learned from using software like Photoshop and Ableton Live that you shouldn't underestimate the complexity of any software you use.

Take a few days or weeks, if you can, to read docs or do high quality courses on the topic and it will make your life easier in the long run.

pabs3 · 3 years ago
The only truth is the memory and disk contents of the devices that make up your cloud. Everything else is an abstraction of that, which discards data and potentially is out of sync with reality.
thewisenerd · 3 years ago
While I wouldn't wish the bootstrap problem on my worst enemy, I think the idealism helps, at least for versioning configuration changes and for partial component-level tear-downs and bring-ups (you don't need this often, but when you do, you do).

Also, with k8s, there's nothing like deleting the wrong object, or making a change and not knowing what it was N revisions ago.

kator · 3 years ago
Don't forget "pets vs cattle", thinking of servers as ephemeral and working towards quickly being able to scale up/down based on demand. So often I see people "lift and shift" from a dedicated server model into the cloud and never convert their pets into cattle. This reduces flexibility later, not to mention makes it harder to respond to patching needs, scaling, and moving to optimize latency or costs.
r3trohack3r · 3 years ago
As an ex-FAANG engineer, I can say this is FAANG advice. Pets are just fine. Most companies aren't FAANG and don't need that class of solution.

An R620 plugged into a switch in a colo, a bash script via cron, or a cloudflare worker are just fine for a lot of use cases. The only time it stops being fine is when you can't afford to do your pet -> cattle migration as you scale up. But I don't think this is a common death for companies.

If you call "cattle" a cloudflare worker or lambda function - fine. But when we are talking about multiple redundant servers with load balancing across them, you really need to justify the cost of that vs the value you squeeze out. Sometimes you're squeezing the juice out of the rind.

mr_toad · 3 years ago
Treating servers as disposable is about more than just scale. It helps avoid creating snowflake servers, makes DR more predictable, and makes creating dev environments much easier.
throwawaaarrgh · 3 years ago
Pets are fine in the sense of "there's no way our servers would just disappear, and Larry the DevOps Guy who knows everything will never leave us..."

Cattle is the best approach. Practice it and make it your default.

oofnik · 3 years ago
No, pets are not fine.

Just the other day I had to perform some maintenance on a long-running VM hosting some monitoring software. A backup VM is supposedly always running and ready to handle the workload in case of downtime. The switchover seemed to go fine, at first.

Turns out, someone long ago had manually added a cron job to the primary server without adding it to the backup server, without documenting what it does, what permissions it needs, how it works, or why it's needed. This was only discovered after some manager in a different department complained that he stopped receiving the daily report to his e-mail inbox.

If whoever deployed the report generation script took an extra hour to document what the script did, or even better, added it to VCS as part of the provisioning process for the server and re-deployed the server to ensure that the process works as expected, a day's worth of headache could have been averted.

dijksterhuis · 3 years ago
> But when we are talking about multiple redundant servers with load balancing across them, you really need to justify the cost of that vs the value you squeeze out.

Sometimes you can justify using a thing for the wrong reasons.

I recently attached 1x NLB to each of our Swarm clusters to migrate to automatically managed certificates directly attached to the NLB (Digital Ocean).

$COMPANY has maybe ~3 users accessing each production application at a time. So the NLB itself is utterly pointless.

But Engineering no longer have to fix the certificates each quarter after users see an insecure browser warning and email us about it.

100% worth it for $12 pcm (per swarm cluster).

nuker · 3 years ago
> Pets are just fine.

They are not, if they are being configured by hand with a mouse or a CLI.

voiper1 · 3 years ago
Some replies are saying this is only for "as-scale/FAANG".

It may only be absolutely necessary there, but it's helpful even for smaller folks.

Over the years, even Debian LTS goes out of support, and new features and software need to be installed. There's moving systems, doing restores, things breaking and wanting to "reset" to a known working state. Any time you can do something simply with Docker or even just (short) step-by-step build scripts, that's a huge win.

I have playbooks for deploying a system, but with npm installs, bower installs, secrets to be hand copied from multiple places, etc, it feels more like pets and it's NOT simple to deploy.

candiddevmike · 3 years ago
Citation needed? There are tradeoffs to both, one is not always better than the other.
hiAndrewQuinn · 3 years ago
There might be an earlier source, but I first ran across the pets versus cattle nomenclature in Tom Limoncelli's _The Practice of System and Network Administration_ - which is a really, really good read for anyone going deep into ops space (like a cloud engineer should be).
paulryanrogers · 3 years ago
What's the advantage of pets? Simplicity?
birdymcbird · 3 years ago
> A good monitoring system, well-organized repository, fault-tolerance workloads and automation mechanisms are the basis of any architecture.

Monitoring/alarming, and knowing what to monitor. Also, properly instrument your services or whatever it is you have. Take time to reflect on what are the signals that tell you operational health. An error metric alone is useless if you don’t know the denominator. Also be careful to avoid adding noisy metrics that cause panic for no reason.

I’m not sure what fault tolerance means in this context; it's a very handwavy statement. If you have dependencies, have a plan and an understanding of which ones will bring down your service when they tip over, and of how you can build in resiliency. For example, say some feature on your page requires talking to a recommendations service. If that service goes down, can you fall back to a generic list of hard-coded recommendations or some static asset?

As for automation: yeah, have test workflows built into your CI/CD harness. And avoid manual steps there requiring human intervention. Use canaries to test certain functions are up and running as expected, etc
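The fallback idea above can be sketched roughly like this (the client interface and the static list are made up for illustration):

```python
import logging

# Hypothetical static fallback for when the recommendations service is down.
FALLBACK_RECOMMENDATIONS = ["bestseller-1", "bestseller-2", "bestseller-3"]

def get_recommendations(fetch, user_id: str) -> list[str]:
    """Call the (injected) recommendations client; degrade gracefully to a
    generic hard-coded list instead of failing the whole page."""
    try:
        return fetch(user_id)
    except Exception:
        logging.warning("recommendations unavailable; serving fallback")
        return FALLBACK_RECOMMENDATIONS
```

The point isn't the three lines of `try/except`; it's deciding per dependency, ahead of time, what "degraded but up" looks like.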

lockedinspace · 3 years ago
Maybe I was a bit vague in the fault tolerance statement. What I mean is to build high availability into your services, e.g. using an AWS ASG for your servers, or having more than one replica for your Kubernetes pods.

If one of the servers/pods fails, the process behind it detects it as "unhealthy" (given the nice monitoring/alarming you mentioned) and replaces it with a new server with the same software characteristics. So for the end user, i.e. your client, nothing has changed; the load just moved to a single instance for about 5-10 minutes until a new server was deployed.
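That replace-on-unhealthy behavior is essentially a control loop. A toy sketch (field names are invented; real ASGs and k8s controllers do far more):

```python
import uuid

def reconcile(servers: list[dict], desired: int, is_healthy) -> list[dict]:
    """One pass of an ASG-style control loop: drop unhealthy servers and
    launch identically configured replacements until `desired` is met."""
    alive = [s for s in servers if is_healthy(s)]
    while len(alive) < desired:
        # Replacement gets the same image/config as its siblings.
        alive.append({"id": uuid.uuid4().hex[:8], "image": "app-v1"})
    return alive
```

Run continuously, this is what makes the failure invisible to clients: the fleet converges back to the desired count without human intervention.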

birdymcbird · 3 years ago
Cool makes sense
TrackerFF · 3 years ago
"Learn to say: I do not know about this/that. You cannot know everything that gets presented to you. The bad habit comes when the same technological asset appears for a second time and you still do not know how it works or what it does."

Absolutely. I've seen so many junior engineers / devs go on about it like this:

Someone higher up: Could you please look at this problem? I need it fixed ASAP.

Jr. Engineer, presented with a problem he's never seen before: No problem, I will look into it!

Someone higher up (the next day): Did you fix the problem?

Jr. Engineer: Sorry, I still haven't gotten around to looking at it / I'm still working on it / etc.

Someone higher up: We really need it fixed today, please prioritize it and give me a call when it is fixed.

Jr. Engineer works on the problem all night, feeling stressed out, not wanting to let down his seniors.

WolfOliver · 3 years ago
"Microservices should only perform a single task." -> I guess this advice is the reason there are so widely misunderstood, see: https://linkedrecords.com/challenging-the-single-responsibil...
adamisom · 3 years ago
Wow and I thought functions should only perform a single task. I need to keep up with the times! Apparently you need an entire deployable app and API to do anything these days. I guess it makes sense. How else could we justify so many software engineers!?
elric · 3 years ago
So many? Last I checked there was a huge shortage, and with the exception of a couple of notable bloatware companies, most seem to be understaffed?
_vertigo · 3 years ago
I think this advice really depends on your scaling needs. If you need to scale your services up, it’s a lot easier to do that if each service only does one thing.

It also depends on how much functionality you consider to be “one thing”.

dagss · 3 years ago
I never understood when people talk about microservices and scaling (for traffic).

I thought microservices is a solution to scale development teams, not for traffic.

If you have a horizontally scalable monolith, it can scale pretty much as far as you want. If you split services along functional boundaries (i.e., vertical) then a split from 1 to 2 services will in the extreme best case scenario give you 2x scaleup; further splits give you less. So: If load is the issue, work on horizontal scaling, not microservices.

What am I missing?
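One way to make that arithmetic concrete (idealized, ignoring coordination overhead and shared datastores):

```python
def vertical_speedup(load_shares: list[float]) -> float:
    """Best-case capacity gain from splitting a monolith into functional
    services, each on its own box: capped by the busiest slice, which
    saturates first. load_shares are each slice's fraction of total load."""
    return 1 / max(load_shares)

def horizontal_speedup(replicas: int) -> float:
    """Ideal gain from running N identical replicas behind a balancer."""
    return float(replicas)
```

An even 50/50 functional split tops out at 2x; a realistic 80/20 split yields only 1.25x. Horizontal replication, by contrast, scales with replica count, which is the parent's point.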

pondidum · 3 years ago
> Do not make production changes on Fridays

I ~hate~ dislike this advice. If you can't deploy on a Friday, you need to fix your deployment strategy. By removing Friday from when you can deploy, you're wasting 1/5 of your available days.

Note: deploy != release [1]. Use flags, canaries, etc.

[1]: https://andydote.co.uk/2022/11/02/deploy-doesnt-mean-release...

Edit: hate is far too strong a word for this
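A minimal sketch of the deploy/release split with a flag (the flag store and names here are hypothetical; real systems use a proper feature-flag service):

```python
# Code ships dark behind a flag on Friday; the flag flips on Monday
# without a new deployment. In-memory dict stands in for a flag service.
FLAGS = {"new-checkout": False}  # deployed, not yet released

def new_checkout_total(cart: list[float]) -> float:
    # Placeholder for the rewritten checkout logic.
    return round(sum(cart), 2)

def checkout(cart: list[float]) -> float:
    if FLAGS["new-checkout"]:
        return new_checkout_total(cart)  # released path
    return sum(cart)  # old path keeps serving until the flag flips
```

With this split, a Friday deploy changes no user-visible behavior; the risky moment moves to the flag flip, which can happen whenever someone is watching.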

Sevii · 3 years ago
The point of not deploying on friday is to reduce the risk of getting paged over the weekend. It's a quality of life move for the oncall team. No deployment strategy will change the fact that deployments are the leading cause of outages.

If you can't afford to give up 1/5 of your available deployment days you have a problem somewhere in your CI/CD system.

nijave · 3 years ago
Sure but ideally you have high enough confidence in your software that those types of issues are highly unlikely.
kevan · 3 years ago
I'm a huge advocate for CI/CD pipelines and my team owns a lot of them. We're confident enough to deploy anytime but we choose to limit deploys to our team's business hours and not on Fridays. Why? Because we think the return going from deploying 4 days/week to 5 days/week is outweighed by the stress and morale hit of ruined weekend plans if something weird happens. There's probably situations where that extra speed makes a difference but for us deploying to all regions safely can take a full day anyways so it's pretty normal to have multiple changes flowing at the same time.
pondidum · 3 years ago
I understand that, but would counter with that it sounds like deploy == release, and that if that weren't the case, you could deploy more often.

However, I will admit it is a trade-off; some engineering time does have to be spent to get there, and perhaps that engineering time is better used elsewhere right now.

grogenaut · 3 years ago
CI/CD, flags, and canaries don't catch everything, and can still cause outages for others. We try to do pretty heavy CI/CD where I work, but not everyone does (we, like everyone, have old systems). It's actually quite easy for us to have the well-behaved systems honor release hours or not, depending on how their release history has gone, their coverage, etc. But they're well behaved, so they usually have great tests, and they're not usually panicked about rolling out after hours; they have their sh*t together.

The reason we have core-hours-only releases without director approval (aka director approval required outside core hours) is so you don't piss off another team by paging them after hours, and so you aren't trying to shove out a thing on a system that doesn't have good coverage, or by turning off the safeties. In a large company I've noticed many engineers assume urgency even where there isn't any. As an approver myself, most of the time when someone wants to rush, it's because they haven't even had the conversation with their manager on whether it's worth the risk; they assume urgency because that's when the sprint ends, or because of what some TPM added to a Jira ticket 4 months ago.

I admit that sounds risky itself (the engineers not having the right risk training), but this is why we have a policy and tooling... most of the times I've dug in, they're just very new and worried about perception as a new employee, so my job is to shepherd them through having that conversation with their managers, which inevitably ends with the managers saying "yes, it can totally wait till Monday" - and the change is inevitably a bit more hot than it should be due to accidental deadline pressure.

rexarex · 3 years ago
I get that people really want to flex that they can deploy on Friday afternoon and NOTHING CAN GO WRONG, but it's still foolish and flouts Murphy's Law. It can wait.
nikau · 3 years ago
Plus they are likely running simple Mickey Mouse systems that aren't intertwined with a bunch of other systems maintained by other groups.
lockedinspace · 3 years ago
Yep, let's not forget Murphy's Law: whenever you least expect it, boom.
dopylitty · 3 years ago
This one made me laugh. I've been places that only allow deployments on Fridays because it gives the whole weekend to fix things if they break.

It's a good interview question as a candidate. If you ask the interviewer when they deploy and they say only Friday (or worse only once a month) then perhaps look elsewhere for your own sanity because it's a sign of serious malfunction either organizationally, technically, or both.

spc476 · 3 years ago
> or worse only once a month

Don't discount a job because of one deployment per month---it really depends upon the service. I joke that a busy year for me involved four deployments to production, but "production" for me wasn't a website, or even a web-based service, but a service involved in the call path for phone calls. Our customer is [1] the Oligarchic Cell Phone company and the SLAs are pretty severe.

I do have to ask---where do you people work where you have multiple deployments per week (or even per day)? To me, that sounds insane!

[1] Still is, even though I left a few months ago, and not because of the lack of deployments, but the shoving of Enterprise Agile [2] on the company by new management.

[2] Which is anything but Agile.

bityard · 3 years ago
I am on a tools team, so the "customers" for our team are the company's developers. For changes that might cause an extended outage if things go sideways, we generally prefer to do those after-hours or on weekends so that we don't have all the dev teams sitting idle during work hours if something goes wrong on our end.

The upshot is that this is fairly rare and we do not have an on-call rotation. If most anything breaks over the weekend, nobody is going to notice or care until Monday morning rolls around.

fragmede · 3 years ago
Depending on your role, that is. If your desired position is straight dev with minimal to no ops work as possible, then yeah, red flag. However, if you're an SRE/DevOps-type person, setting up a continuous deployment system so they can deploy more often than that is a perfect landing task to dig your teeth into. Different strokes for different folks.
tbrownaw · 3 years ago
If there's a very strong "only during standard work hours" usage pattern with SLA penalties for downtime, adapting deployment patterns to that reality can maybe be sensible.
rexarex · 3 years ago
I love this idea.
doctor_eval · 3 years ago
You should have both the confidence that you could deploy on Fridays, and the wisdom to know that you shouldn’t.
lopatin · 3 years ago
Interestingly my company only deploys on Friday because it has to wait for (most) markets to close for the weekend.
pondidum · 3 years ago
Oh now that is an interesting take!

Would that still be required if a deployment and a release are decoupled entirely, or is it unavoidable? Genuinely interested!

elric · 3 years ago
Hating it seems a little strong. I'm sure that any team far along enough on the quality spectrum can just read this and say "we've moved beyond this worry". The post is titled "general guidance", not "absolute truths". Adjust expectations accordingly.
pondidum · 3 years ago
You're right, hate is far too strong a term for this; I've updated the wording. Thanks.
charcircuit · 3 years ago
There isn't a bottleneck in the number of commits you can ship per day. You just get more changes to roll out on Monday.

I do disagree with the absolute statement of never doing it, but I definitely do a risk analysis whenever I ship a change on a Friday, and I push anything risky off until Monday morning.

Not all changes can be put behind flags, and canaries don't really fix the issue (unless you are okay with important fixes being blocked from rolling out because your bad change killed the canary).

lockedinspace · 3 years ago
Seems reasonable, but if you work for a large company, you can't guarantee that a major release (which is a production change) won't cause any unexpected harm. I have worked with quite big organizations, and once deployed on a Friday and wasted my entire Friday night and Saturday morning rolling back the 130+ components that the app had.

If you are a small company, or you don't do extra weekend shifts, I understand your point. Otherwise, you're just signing up for an adventure every Friday.

abledon · 3 years ago
> If you need to build an architecture which involves microservices, I am sure that your cloud provider has a solution that fits better than Kubernetes. E.g: ECS for AWS.

Thank you! So many people running unnecessary things on Kubernetes

rswail · 3 years ago
On the other hand, K8S provides you with orchestration abstraction across AWS, GCP, Azure, VMWare, bare metal.

There are distinct advantages to that in terms of both development (running a local K8S cluster is relatively easy) and deployment.

ECS has no distinct advantages over K8S (or EKS in AWS land). Particularly now that there are CRDs for K8S that allow you to deploy AWS functionality (eg ALBs, TGs) from K8S.