Side note -- I know a lot of early YC startups like to play things fast and loose, but you really should compensate your engineers for after-hours emergencies if they are already working 40-hour weeks. Morally and employee-retention-wise it's the obvious thing to do, but beyond that, in certain states and jurisdictions you can easily run afoul of local labor laws if you try to require employees to do things outside of their regular hours without compensation. Another very important angle to consider is that you want to incentivize your employees to answer the call to deal with these issues. If there is no incentive structure in place, they might not take it seriously, which in itself can be a security and stability risk that could threaten things like your SOC 2 compliance, if you have it.
When I worked as a full-time DoD scientist, there was a very well-established system for dealing with these situations, and it was normal to pay double overtime for any hours spent by an employee on after-hours emergencies. This is the right way to do it. Pre-Series B startups largely don't do this, but once you get to Series B and C it suddenly becomes a thing, because companies realize they either have to legally, or have to in order to prevent their employees from churning and to protect themselves from people not showing up to put out the fire.
Just do it, and do it early. Do it before Series A. That's the advice I give my consulting clients, and the approach I take with my own companies. By compensating your employees for this time you also take what would be a red flag for many would-be employees and turn it into an exciting perk.
Yes, and even more importantly: compensate your engineers who minimize the need for these kinds of heroics. Ones who:
* fix the alerts so they reliably page when there's an SLO-worthy problem and only then.
* test the restore system so it works smoothly when needed at the necessary scale.
* add safety checks to prevent the need to use those backups in the first place.
* get to the root cause of yesterday's outage and prioritize the 9–5 engineering work to ensure it won't happen again.
It's awful to work for a place where heroics are often necessary and unrewarded. I still don't like working in a place where heroics are often necessary, even if they're celebrated. They're often avoidable.
The trick is to make sure the heroic efforts are rewarded quietly, but the unheroic ones create lots of noise.
This is inherently difficult because the heroic efforts tend to come as a result of very noisy incidents - everyone already knows the database cluster was literally on fire and wants to hear how Jane put it out while simultaneously remagnetizing the backup drives. The lead's job is to make the noise ensuring everyone knows Morgan's the one who did the boring rewiring of the backup-restore process to automatically sync and fail over to another AZ when thermal metrics start trending bad.
One thing I've noticed as we've been trying to integrate some on-call processes across formerly separate teams is that "our" incident reports were about 25% timeline/responses/impact analysis and 75% RCA and plans for future mitigations. "Theirs" were the opposite. Our incident rate has trended down sharply over the past two years even as our system grew; theirs has scaled up roughly at the same rate as their service count.
Avoidance isn't the goal of management though. Cost minimization is, and quantifying the compensation portion helps you figure out the cost of a bug. Ultimately the cost of a bug is a key variable in the decision to prevent or solve.
The cost can easily vary from $0 (e.g., at a startup with no customers) to millions (maybe billions?) when you consider the risk of brand damage, lost ARR (multiply that ARR by the sales multiple!), decremented velocity (lost opportunity value), and eventually even recruiting cost (to replace frustrated employees).
I've always wanted to work at a JoelTest[1] company which fixes bugs before writing new code... But the closest I've found is a commitment that the on-call engineer works on bugs for their rotation to reduce the bug backlog (and is not part of sprint velocity).
[1]: https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-s...
What I find interesting is that if you give employees a choice of 100 units of currency per year as salary plus 1 unit of currency for each week on call with an expectation of 1 week on call every four weeks versus 115 units of currency for the year with the same on-call expectations, some employees will feel better about the first arrangement and some will feel better about the second arrangement.
Some people want to see everything broken out and take a very transactional approach to work; others prefer the simpler approach but with less clear linkage to a specific piece of work.
The problem is once you consider it part of your standard comp, you take it for granted. If emergencies almost never happen and then one does, you will inevitably feel cheated when you have to wake up at 2 AM on a Sunday. Much better familial optics if you can tell your family "hey, we got an extra $500 for that" versus "oh, dealing with these is included in my take-home salary".
Even worse, if emergencies happen all the time, your family is going to hate you if you don't have some reward you can show for each one, especially if there are other jobs at your salary range that don't have an on-call clause.
I'd say the issue for people who are inclined to pick the first arrangement is one of trust.
If you don't trust your employer not to abuse the second arrangement then you'd be a fool to accept it. With the first arrangement the employer is incentivised not to exceed the agreed upon expected number of on-call weeks because they are financially penalised for additional on-call time. With the second arrangement the employer can demand as much on-call time as they think they can get away with.
It's really a case of whether you expect your relationship with your employer to be in some way adversarial - if you trust your employer not to screw you, then go ahead, but you can't necessarily make that judgement reliably going into a new job.
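To make that incentive point concrete, here's a minimal sketch of the arithmetic, assuming the numbers from the example above (100 base plus 1 per on-call week vs. 115 flat, roughly 13 expected on-call weeks in a 52-week year); the figures are purely illustrative.

    # Toy comparison of the two arrangements above. Numbers are illustrative:
    # 1 week on call in 4 is roughly 13 on-call weeks in a 52-week year.

    def itemized_pay(base: float, per_week: float, weeks_on_call: int) -> float:
        """Arrangement 1: base salary plus a stipend per on-call week."""
        return base + per_week * weeks_on_call

    def flat_pay(total: float) -> float:
        """Arrangement 2: one number, independent of actual on-call load."""
        return total

    for actual_weeks in (13, 20, 26):  # what happens if on-call load creeps up
        print(f"{actual_weeks} on-call weeks: "
              f"itemized={itemized_pay(100, 1, actual_weeks)}, flat={flat_pay(115)}")
    # The itemized deal charges the employer for every extra on-call week;
    # the flat deal costs them nothing extra, which is the trust problem above.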
What do you mean by "expectation"? One day on call, one day paid overtime; anything else is abusive.
There should be no incentive to require or avoid "free" work, only advance planning of who'll be on call (with more available engineers doing more turns).
Another side note, doing a 4-day work week for engineering is a great and productivity boosting practice, but it shouldn't be seen as a carte blanche to not compensate them for after-hours emergencies. If you have a 4-day work week, then 32 hours a week is their normal full-time number of hours that you are compensating them for. When you ask them to go beyond this, you need to compensate them for their time.
Unless you're in a really dysfunctional startup, the employees who voluntarily jump in after-hours to fix on-call issues are going to be very quickly compensated with raises and refresher option grants. Far better payout generally than a few bucks of overtime.
It becomes really obvious who is holding the company together and who is coasting, relying on the "senior" (ie, anyone who puts in the effort) engineers to keep the ship afloat, and in a highly competitive engineering market, that will definitely be reflected in equity comp.
I respectfully disagree. I did exactly this for a number of years. A colleague (and now friend, since we've both left said company) made it clear he would not work overtime without compensation (and never did), but his day-to-day work was excellent. He knew he had a skill they required, and was good at it (and our skill sets were the same).
Our careers both grew at about the same pace for four years. He became known as the go-to guy for green-field projects, while I was the guy you could put onto over-budget tight-deadline projects to rescue them. I worked a lot of weekends for my troubles, and I always envied how he ended up on the "fun" projects. I learnt a lot from him. As a professional, some self respect is required, or you will be abused.
The problem with informal "people will notice" reward schemes, even when administered well (and don't take that for granted, it's easy for them to become popularity contests!), is that they encourage bad work-life balance among more junior employees. If you regularly have people popping up at 10 PM to fix things, and you don't have any formal recognition of the fact that they've gone above and beyond, new hires who want to get ahead will learn that working late into the night is the way to do that.
1. There is a rotation.
2. There are few pages, preferably the median should be 0 per week.
3. Spurious / non-actionable alerts get fixed right away (with very high priority)
4. You're not up more than 1 week per 1-1.5 month.
5. You subtract middle of the night pages from your next working day, with bad nights resulting in a day off. Being on-call doesn't mean working overtime.
As with most things, the core idea is not bad, it's the execution that matters.
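For rule 5, here's a toy sketch of how middle-of-the-night pages could translate into recovery time; the cut-off hours and thresholds are invented, not part of the comment above.

    # Toy version of rule 5: pages in the middle of the night earn recovery
    # time the next working day, and a rough night earns the whole day off.
    # The cut-off hours and thresholds are invented.
    from datetime import datetime

    NIGHT_HOURS = set(range(0, 7)) | {22, 23}  # 22:00-06:59 counts as night

    def recovery_hours(pages: list[datetime], full_day_after: int = 3) -> float:
        night_pages = [p for p in pages if p.hour in NIGHT_HOURS]
        if len(night_pages) >= full_day_after:
            return 8.0                      # bad night: take the next day off
        return 2.0 * len(night_pages)       # otherwise start late / leave early

    print(recovery_hours([datetime(2024, 6, 3, 2, 30)]))                    # 2.0
    print(recovery_hours([datetime(2024, 6, 3, h, 0) for h in (1, 3, 5)]))  # 8.0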
Also, only be on call for things you're actually responsible for. I hate having on-calls span multiple teams. They can then be lazy, as their issues hurt someone else. And there's the added stress of having to debug and fix shit you're not comfortable touching.
Additionally, I'll never accept on-call with 15 minutes from alert to being on a computer again. It's just too limiting and disruptive of my life. Have to bring the computer everywhere. Any dinner or social event can be instantly ruined. A workout becomes meaningless. I remember doing a swim and having to check my phone every 5 minutes. It's mentally exhausting and frankly not worth the pay.
If it's in the contract you sign, then it's already priced in. I don't see much value in specifically outlining which part is base and which is for oncall, if oncall is mandatory.
At one place I worked, they wanted on-call shifts of 12 hours where you were no more than 15 minutes away from being logged on (it was a multi-week Big Event), with the promise of, maybe, time off in lieu as recompense.
They were most displeased when I declined this opportunity.
> 4. You're not up more than 1 week per 1-1.5 month.
That seems excessive if you're expected to be able to log into your work system within X minutes.
Having to be essentially home, near a computer, 25% of the time (1 week out of 4) is a pretty heavy burden, especially for people who prefer to be out, rather than home.
It's a heavy burden, and a lot of teams might want to consider a longer interval, but there are a lot of legitimate scenarios where it's just not feasible to distribute an on-call rotation among 8+ people. I would definitely point more towards 1 in 6 as the ideal minimum.
I would add a "there is time reserved to write automation to reduce toil".
I have a fair bit of experience in teams with developers hating oncall and the critical issues happen in two camps generally:
- In some cases the org was absolutely open to giving them time to resolve issues and automate stuff away (PMs were proposing months dedicated to fixes only, cleaning up boards, etc.), but there was just no interest until enough escalation happened (creating a massive conflict between ops and devs in the meantime). Even offering to do the work was met with a "stay in your lane" kind of response and no collaboration at all.
- In others the company just did not care, deprioritized tickets until things blew up completely, then finger pointing started and all that nice toxic bullshit.
On-call means you have to plan your free time around being available to work. That's never going to be "just fine" for some of us, no matter how it's structured.
My one weird trick is to have a zero tolerance policy for flaky monitors/tests. If it’s not accurate, we either have to drop everything else and fix it, or disable that alarm entirely.
Like they say, normalization of deviance is real, and the only way to fight against it is to have every form of deviance be a problem.
For extra credit: if you have a weekly or monthly team meeting, include an agenda item for the people that were on call so they can debrief the team on what alerts fired and what the resolutions were. As a team, you can then decide which alerts need to be deleted or need adjustments, and if there are additions or edits that need to be made to the runbook.
A big thing to avoid in this whole process is "naming and shaming." The Google SRE book calls this a "blameless postmortem culture," and it helps you avoid perverse incentives for people to hide or obscure latent production issues.
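As a rough illustration of how that debrief can be driven by data rather than memory, here's a small sketch that tallies an exported alert history; the CSV columns, file name, and thresholds are assumptions, not a real tool.

    # Rough sketch of the tally that makes such a debrief concrete. Assumes
    # you can export last week's alert history to CSV with columns
    # "alert,fired_at,actionable"; the export format is hypothetical.
    import csv
    from collections import Counter

    def flaky_candidates(path, min_fires=3, max_actionable_ratio=0.5):
        fires, actionable = Counter(), Counter()
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                fires[row["alert"]] += 1
                if row["actionable"].strip().lower() in ("1", "true", "yes"):
                    actionable[row["alert"]] += 1
        # Fires often but is rarely actionable -> fix-or-delete candidate.
        return [(name, n, actionable[name] / n)
                for name, n in fires.most_common()
                if n >= min_fires and actionable[name] / n <= max_actionable_ratio]

    # Example use with a hypothetical export:
    # for name, n, ratio in flaky_candidates("alerts_last_week.csv"):
    #     print(f"{name}: fired {n}x, actionable {ratio:.0%}")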
Yeah. Also, unless you are a genuinely essential application—like air traffic control, a hospital, or a nuclear power plant—you can live with a few hours of downtime.
AWS goes on the fritz for a few days out of every year and breaks half the internet. Your business will be okay.
99% of the time your shit just isn't that essential. That's the One Weird Trick: don't get suckered into thinking your corporate vision is so important that it can't have an issue wait until morning.
I tend to agree with you, but found an exception a while ago.
A certain file has to appear before a specific time, or else some people don't get money they deserve and rightfully get very angry.
Except 1 or 2 times per year there is nobody in that situation. No payments have to be made. No file appears, as other alerts would signal an empty file.
As the relevant time was in business hours and the thing was important, I decided to swallow my pride and accept that invalid alert.
The resolution procedure is documented as: call team X and ask if this is correct. If yes, black out that alert for 24 hours.
No, but, the point is there are never alerts that aren’t alerts.
This is arguably worse, since it’s otherwise a very important alert so anyone that sees it will freak out. Since it happens only 2 times a year, anyone seeing the alert for the first time (depending on churn, this may happen quite often) is going to think it’s really important.
Hopefully you don’t regularly get these alerts because something actually went wrong (say once a year), but that means 66% of all the alerts you get are false alarms.
Unacceptable. Make sure that the file is there but empty if nobody needs to get money. If empty could be a failure case, have a 'this page intentionally left blank' type arrangement for the file contents. Done, no monitor exceptions.
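A minimal sketch of that "always produce the file" idea; the path, format, and sentinel text are all made up.

    # Sketch of "always produce the file", so a missing file is always a real
    # problem. The path, format, and sentinel text are all made up.
    from pathlib import Path

    SENTINEL = "# no payments due for this period (intentionally empty)\n"

    def write_payments_file(payments, path: Path) -> None:
        if payments:
            path.write_text("\n".join(payments) + "\n")
        else:
            # Write a well-known placeholder instead of nothing at all, so the
            # "file did not appear" alert never needs a manual exception.
            path.write_text(SENTINEL)

    write_payments_file([], Path("/tmp/payments_example.txt"))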
After years of iteration, here’s what our team does.
The team is remote and distributed across multiple time zones ranging from West Coast US to Western Europe.
This gets us as close to round the world coverage as we can have.
There are two people on call for each shift, each shift lasts a week.
It will typically (but not always) be one person from the US and one person from the UK/EU. This helps reduce the cost to any single person and spreads it out, so what might be night for one person is morning for the other and vice versa.
All of our alerts are prioritized/categorized to help prevent alert overload.
For example, an alert for a test/QA environment will not fire outside of business hours, and it has a much longer time before it’s required to be ack’ed or resolved.
There are two on-call rotas: critical and non-critical.
Critical, production-impacting, and/or client-facing alerts are dispatched to the critical rotation.
The non-critical rotation only escalates alerts during business hours, again, with a more lax timeline for acknowledgment or resolution.
People are not part of both rotas at the same time.
If there’s a big enough incident, the folks on call get to take off that next working day or the next one.
I (the manager) am on call 24/7 for escalation.
Anything that is an annoyance during on-call is a candidate for review and change.
That can be anything from thresholds to code to upgrading some IaaS/SaaS subscription. Or even straight up disabling the alert if it provides no value.
People can swap on-call days as they want.
Typically, this happens if there’s a birthday, personal event, or PTO, and it’s worked out among team members. If no one else is available, then I’ll take their shift and act as primary.
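A rough sketch of the routing described above (critical production alerts page around the clock, everything else waits for business hours); the field names, hours, and rota labels are assumptions.

    # Rough sketch of the routing above: critical production alerts page the
    # critical rota around the clock; everything else waits for business
    # hours. Field names, hours, and rota labels are assumptions.
    from datetime import datetime
    from typing import Optional

    def route_alert(severity: str, environment: str, now: datetime) -> Optional[str]:
        business_hours = now.weekday() < 5 and 9 <= now.hour < 17
        if environment != "production":
            # Test/QA alerts never page outside business hours.
            return "non-critical" if business_hours else None
        if severity == "critical":
            return "critical"
        return "non-critical" if business_hours else None

    print(route_alert("critical", "production", datetime(2024, 6, 2, 3, 0)))  # critical
    print(route_alert("warning", "qa", datetime(2024, 6, 2, 3, 0)))           # None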
How does it work for you to be on-call 24/7 for escalation? I get that that ends up happening for many committed founders/operators/managers, but I struggle to see how that can be a real strategy.
Are you never off-grid for a bit, or drunk in a bar, or just on a real no-work vacation? There seem to be situations where being on call just isn’t feasible.
I was effectively oncall 24/7 at my job at times in 2020.
I barely noticed the pandemic. I never strayed far from my computer. Also, yes, I tried not to drink much.
I certainly learned what my limits are. People think I am a pretty good engineer (not amazing) but what I am known for is being able to keep that level of performance up for a long time.
For my part, despite my reputation, I tried to quit a few times. Not the job, the company entirely. I have never cried at work, but came close once or twice after being up for days and unwinding from a big escalation.
Depends heavily on how stable the app/system is.
I'm from the PeopleSoft / ERP world. The same PeopleSoft HCM base application would generate one off-hours pager alert a month at one client, and half a dozen a night at another, due to different customization/implementation/complexity of business logic and data.
Any on-call/on-shift rotation system must be viewed through the lens of actual demand and need.
At first client, we had 3-4 developers total who shared pagers on weekly basis, as per the OP, with no undue stress or impact on their day job.
At second client, we now have multi-tier support starting with on-shift (junior but specialized ops team members who stare at computer overnight and provide immediate response), Tier 1 and Tier 2 on-call support, and multi-level escalation rotation.
And yes, there are still people who get woken up all the time, always, because the buck eventually stops there :-/ . Being on call sucks, as per the title. I've been in 24x7 escalation roles; I don't drink to begin with so that's not an issue, but it absolutely had a significant negative impact on my social & family life, sleep, and stress levels. I've spent significant effort to a) make the system better, both in terms of a more reliable application and a deeper, more self-sufficient support team tree, and b) move myself out of the role, though that relies on success in a).
I do find it fascinating to occasionally meet very senior people, with family and social lives, who are positively EAGER to be on call and engaged for every little thing, all the time, always - and then, unfortunately, have the same expectations of literally everybody else ("Let's all come on the bridge, always, for everything, anytime").
Even in all the situations you’ve listed, I still have my phone with me.
If I’m going to be out of cell coverage (e.g. a plane ride, or in the countryside with spotty Internet) or simply want to be left alone, I usually plan for that in advance and do a combination of: 1) scaling back our risk exposure by rescheduling work (which requires you to have a good understanding of the business, its needs, and its timelines) and 2) shoring up the bits I feel most wary about through code, documentation, tooling, and/or contractors.
The same goes for the team SMEs: reschedule where I can, cross-train where I can't, and get headcount where neither of those works.
I’ve been on call since 1999. I had to figure out a pattern that worked for me (and my family) but wouldn’t result in a life that was boring or worse, one that I resented.
> Anything that is an annoyance during on-call is a candidate for review and change.
What does that mean in practice?
Hopefully: "This woke me up last night. It's now the top priority until it's fixed so it never wakes anyone up again. Sorry product manager, your new feature will have to wait."
"This woke me up last night, but it could have been something that didn't need escalation outside of business hours."
"This woke me up last night and had I snoozed for 5 more minutes it could have been catastrophic, let's get some more proactive monitoring in place."
"This woke me up last night and it was triggered by bad user input, we shouldn't get alerted on this but more importantly, we shouldn't allow users to submit this crap."
Very rarely do I encounter alerts that are traced back to some deep architectural flaw that requires me to tango with a product manager and their roadmap.
Often times, our team escalates to the engineering lead in question and a small bug fix is slipped into the next release.
My last gig, on-call worked well, I thought, for a few reasons: it was our services that we wrote, it was 1 week out of 6 that you were on call, we heavily prioritized fixing unactionable alerts and automating fixes -- every alert had a runbook entry that described the non-automated fixes, and while on call your sprint commitments were not counted.
That last point was very nice as it meant you could work on whatever you felt was most important for quality of life improvements all week long while not fielding on call issues. This meant that I looked forward to on call.
Honestly, that last point seems like something that would make on-call extremely palatable. Essentially acknowledging "on-call sucks; in return, here's the latitude to work on whatever you happen to think is important/interesting"
It is OK being on call for your own servers and your own software.
I think it is also why fewer and fewer people keep stuff on premises and go for SaaS/Cloud solutions instead.
If someone wants to run my software on their servers without giving me any access, they had better have a dedicated person for running it. I don't care even if they pay $1000 per hour - I am still not dealing with a server I don't know, talking on the phone with an admin who has no clue how my software should be configured.
My company(UK) recently tried to force on-call on all engineers.
The initial wording was very restrictive, like 5 minute acknowledgement time and 15 minutes at-laptop. 24/7 for 7 days. They tried to have this implemented without any extra remuneration or perks for the on-call engineer.
On top of it possibly being very illegal, it seems very immoral to spring something like that on people that did not agree to it when they took the job.
I fought for it and I got them to change their policy in 2 mostly meaningful ways:
- It's an opt-in method
- On-call engineers get paid extra for just being on-call and get extra time off whenever they need to actually do something.
This makes sure that you only get people actually willing to do it and there is an incentive. I think it's been quite a successful program!
Luckily I didn't need to get them involved, but in the UK there are unions starting to form for tech workers, I suggest you join one like https://prospect.org.uk/tech-workers
A company I used to work for asked me to do on-call, it wasn't in my contract, I declined, that was that.
I don't understand what "force" means in this context - the conversation went something like "I have commitments outside of work" and that was that. I mean, there was a back and forth, but yeah, at the end of the day I had taken the job knowing which hours they wanted me available for.
I joined Prospect because my company tried to implement an unspoken on-call arrangement, whereby they would try to call me on my mobile 24/7 expecting an immediate response. I asked what the additional remuneration is for that, and they said there isn't any.
Now I'm a Prospect member, and my mobile is always on mute.
I used to work for an MSP. They billed 2-3x the normal rate for on-call to clients. We, however, were simply paid our hourly rate plus overtime. It created a perverse incentive to have as many on-call events as possible as it was very profitable for the company. They billed minimum time to clients, but we were told we could only bill for the exact minutes spent working.
Yeah, on-call is horrible. If things need to be up 24/7, then some team should be staffed 24/7 around the world.
The worst part of on-call is the control it has over your life. For one week I can't do anything I would normally do. (If your company actually compensates for this, let me know where I can apply, or better, if it doesn't have on-call at all!) Of course managers are never on-call 24/7. The worst is they give the excuse "well, I'm on call all the time by default since I'm the one manager." But they're not reorganizing their lives and putting their off-work hobbies on hold because of it, are they?
> a monitoring change that fixes some flaky alert that might page somebody about once every six weeks.
These kinds of things suck. I was on a team where we had tons of these; 10 alerts like this mean you're getting paged all the time. No single alert is worth the time investment. Worse, a manager insisted there will always be a baseline of alerts that go off and we will just live with it.
Teams never seem to understand how to alert on stuff. I've been paged for things going off that might indicate a problem, then you get stuck sticking around because someone else wants to just wait and see what happens. "We should just be cautious." It's impossible to push back on these things; you're just going against someone's gut feeling, like maybe one day we will want to know, and everyone needs to protect themselves.
The problem with on-call teams that are different to the usual SRE/DevOps teams is the lack of understanding of the system. This can obviously be fixed with good documentation, but in reality, no one has good enough docs. The second problem is actually building that team with the skill needed. Someone with the skill to fix complex systems is not going to want to stick around as 1st line on-call support.
Most places at some scale have some meaningfully defined escalation path. That way you staff people with varying understanding of the system(s).
> This can obviously be fixed with good documentation, but in reality, no one has good enough docs.
One problem I've witnessed related to documentation about on-call issues is the over-reliance on the SOP concept. They only commit to one level or one pass of analyzing the issue. They do not drill down further, either by linking to other notes or reviewing the issues deliberately. It's like they read about the 5 Whys and decided: why not just 1 why?
Yeah, that's true. There must be some kind of middle ground between an ops team like that and the can't-go-to-the-bathroom-without-your-phone on-call we have, though.
> Teams never seem to understand how to alert on stuff. I've been paged for things going off that might indicate a problem, then you get stuck sticking around because someone else wants to just wait and see what happens. "We should just be cautious." It's impossible to push back on these things; you're just going against someone's gut feeling, like maybe one day we will want to know, and everyone needs to protect themselves.
From an ops person: if an alert does not have:
- clear, provable impact on customers (internal/external)
- clear documentation (e.g. runbooks) on how to solve it
It should not be an alert. I took this path (successfully) when trying to remove spurious alerts that existed only for the ego of someone. The most absurd example: something that started complaining when p99 for some endpoints went >500ms, which happened every day when we downscaled the ASGs because business hours were over. No clear path to resolution, and the impact was a couple of pages opening a bit slower, sure - but the number of customers using those pages after hours was <1%!
It sucks, definitely, but the best way to get rid of those alerts is to prove they're pointless or a waste of time, or that they can be automated around and should be automated around (and I've seen so many servlets leaking memory triggering OS alerts for OS teams, or spawning infinite threads and never cleaning up after themselves...).
If the company does not want to do it, and pushes back, I would recommend starting to look for another company. It's sad, but it is what it is. 99.99% of software does not need a follow-the-sun rotation (or people damned to night shifts), just a bit of thought about what happens when things fail.
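One way to make those two criteria stick is to check them where alerts are defined. A minimal sketch, with a made-up alert shape and example data.

    # Sketch of enforcing the two criteria at alert-definition time. The
    # Alert shape is made up; adapt it to whatever your monitoring uses.
    from dataclasses import dataclass

    @dataclass
    class Alert:
        name: str
        customer_impact: str  # who is hurt, and how, when this fires
        runbook_url: str      # how the on-call person resolves it

    def should_page(alert: Alert) -> bool:
        # No provable impact or no runbook -> dashboard or ticket, not a page.
        return bool(alert.customer_impact.strip()) and bool(alert.runbook_url.strip())

    alerts = [
        Alert("checkout-errors", "customers cannot pay", "https://wiki.example/runbooks/checkout"),
        Alert("p99-latency-after-hours", "", ""),
    ]
    for a in alerts:
        print(a.name, "-> page" if should_page(a) else "-> demote to ticket/dashboard")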
On-call is even worse for people with disabilities. I quite literally can't do it unless I stop taking my antipsychotic.
Under ADA, I can not be placed on call, regardless of policy, nor can I be discriminated against for that. On-call is not an essential function of being a software developer, with very few exceptions—all of which have nothing to do with "policy" or "fairness".
Needless to say, companies (and some coworkers) really don't like this.
Not to detract from your comment, but you don't have to have a disability. Some people simply cannot do on-call, disability or not.
I had a co-worker at a previous job; he did two or three on-call rotations and told our boss that he couldn't do it. Mentally it was simply too much for him, especially outside business hours where he felt alone with too much responsibility. In terms of abilities and qualifications he was absolutely able to do the job. Nobody complained or got angry with him over it, because everyone could relate.
At the other end of the scale I had another co-worker, in a more complicated scenario, who absolutely didn't care. The payment for the on-call shifts was very good, so he just grabbed as many as possible. He would just take his laptop golfing, no problem. His reasoning: either he'd know how to fix the problem, or if not he'd just call someone else and hand off the incident.
Unless you've tested this theory in court it might not be true. It's almost certainly not as cut and dry as you make it seem. Many companies put the same people oncall who write the code, meaning it literally is an essential function of a software developer to provide oncall support. You'd have to argue in court that it's not really essential but it would be situationally dependent.
That said, I'd hope most places would be willing to accommodate you. Places I've worked have always treated oncall as a kinda optional "right thing to do". I've never seen anyone punished for missing an alert. You'd have a good argument if that were the case at your company but that approach to oncall is not universal.
That would be one of the exceptions. For example, high-frequency trading firms always need developers on call while they are actively operating. Keeping those systems running correctly is essential to their role in the company. Same with small companies who have no other staff, assuming they even have enough employees (15) to be under ADA. :)
For the more common scenario of on-call rotation, it would be very difficult to make that argument because other people can take up the disabled person's shifts.
Unpaid on-call time is exploitative and illegal in most jurisdictions.
This is important. If you are on-call, you do not have the freedom that you would otherwise have and you should be compensated for this.
* For every hour you're available for on-call you get your regular hourly rate.
* For every incident, which involves you working, you receive twice your hourly salary for the duration.
So companies based here tend to have that as standard, though I'm sure some companies would pay more to stand out.
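A toy calculation of that structure, with invented rate and hours (the only rules taken from the comment above are 1x for availability and 2x for incident work).

    # Toy calculation of the structure above: availability pays 1x the hourly
    # rate, incident work pays 2x. The rate and hours are invented.
    def on_call_pay(hourly_rate: float, standby_hours: float, incident_hours: float) -> float:
        return hourly_rate * standby_hours + 2 * hourly_rate * incident_hours

    # e.g. 16 evening standby hours and 2 hours of incident work at 50/hour:
    print(on_call_pay(50, 16, 2))  # 1000.0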
This is important because if you are on-call, you are working, and if it is during a holiday, then that is one less holiday for you.
This could actually be pretty good too. People would be much more willing to cover for others on holidays.
> You're not up more than 1 week per 1-1.5 month.
That's insane. That's around 20% of your life.
In interviews, always ask them:
1. How often is someone on call (typical was one week every 6-8 weeks).
2. While on call, how often do you get paged after hours?
3. What do you do to reduce the number of pages?
I suspect I've been rejected for merely asking the second and third question, but that's good!
> Are you never off-grid for a bit, or drunk in a bar, or just on a real no-work vacation?
Sometimes the answer is yes. Sometimes the answer is no. If it’s down already it can’t get worse.
To be fair, I was the first (and only) point of escalation.
Am I the only one that thinks the prospect of being on call one week every six is horrifying?
> I don't understand what "force" means in this context
In a call I was explicitly told "every company does it like this, if that's not ok you might not be a right fit for this company".
It's like: as long as it's not me, I don't care how much you suffer.
> Some people simply cannot do on-call, disability or not.
Imagine being a single parent, or needing to support elderly or disabled family.