Thank you for keeping systems available and safe. I've been there many times in the past, including having to fly at the last minute to a non-internet-connected data center in NJ to babysit an emergency production bug fix that took the entire holiday to create, install, verify, and monitor.
If you don’t do it for the sake of the person you are asking for help, do it because it works better. That’s the most practical advice [0] ever given by Hans Rosling [1], the Fact master himself:
> In fact, I have the secret to how to get the best help immediately from any customer service, like the phone company or the bank or anything. I have the best line, it always works. You want to know what it is? When I call, I say, “Hello. I am Hans Rosling and I have made a mistake.” People immediately want to help you when you put it this way. You get much more when you don’t offend people.
[0]: Unless you are in charge of a developing country’s budget and have to decide between education and healthcare.
[1]: https://blog.ted.com/qa_with_hans_ro_1/
I do this with internal teams at work. I've found approaching other teams with issues with their library/framework in a "this could be our mistake" manner really helps in keeping them from getting defensive and stonewalling.
You don't have to be excessively self-effacing about it, just avoid presenting things as though the project you're reporting it to being at fault is the only possible conclusion.
Then be grateful for the help, because it truly isn't granted or a given that people have to drop everything and figure things out for you, even if you work together. And even if the mistake was actually theirs. Gratitude is huge.
Sure, be kind, but don’t bullshit people. My personal and professional tolerance for bullshit is very low these days because there’s so much of it.
Except if you're actually convinced it could be your mistake, getting that tone will feel like getting played like some small kid. Most people will help you anyway and be professional, of course.
That reminds me of detective Columbo. It's cute and all when it's supposed to be done to strangers. Imagine Columbo coming to you every week with that same convoluted spiel.
https://youtu.be/hVimVzgtD6w
https://www.ted.com/playlists/474/the_best_hans_rosling_talk...
Practice those first, then work on interpersonal soft skills.
Don’t bullshit people by lying to them about “whose fault it is”.
It basically helps move things forward: blame has been allocated, so how do we move on from here? But this only works if you have a healthy work environment.
This wasn't a company policy or anything like that, it was just how we talked and supported each other.
If you want, you can acknowledge how you tried to fix it and failed (if that's accurate). But don't say that the problem is your fault unless it is.
(There are situations in which taking blame for a situation not necessarily yours might be a convention, but mistakes of vendors when talking with the vendor aren't one of them, IMHO. For example, you might take a little heat for colleagues, when appropriate, and all the CEO you're talking with needs to hear right then is, "Sorry, I don't have that for you yet; let me get that to you later today." Not "I've been pestering Bob since Monday for the dependency." Then you can go tell Bob that you two really need to solve this in the next couple hours. And if there's a larger problem, like Bob has been overextended by a family problem, or tasking has been unclear since a tentative pivot, then work it with management in the appropriate vertices of the org chart.)
Friends of mine hearing him would say, He never says what the mistake is precisely, but there’s always the option that it was booking a flight with that airline.
A few days ago my French press suddenly shattered and almost blasted hot coffee over my upper body.
How am I supposed to start that call with "Hi, I'm jorvi and I made a mistake"..?
It's not like that is a unique situation either. And you can guarantee that if you tell customer service "I made a mistake", and it is clear they delivered a broken service / product (but often want to duck responsibilities), there is no way in hell they will not take the freebie you just gave them by admitting fault.
ticks the 'potential liability' box
"How can I help you?"
https://www.youtube.com/watch?v=GwQW3KW3DCc
December tends to be hell for our customers, so stability should be a priority there.
And honestly, no one wants to work on holidays. So let's just wrap everything up starting in December, maybe use the third week for some unnoticed issues, and then just lay down the tools. Use that time for documentation, or shorter days, quite frankly.
That way we minimize the on-call situations occurring. Let's hope it goes well for the engineer this year as well. We have a streak to keep.
Later, after we regrouped after a month of this brutality, they wandered around the office bragging like they'd hung the fucking moon after they fixed the crippling, obvious design issue they'd released. I confronted the dev lead with the fact that they would have seen this after 30s of load testing and he just laughed, I think he literally said "LOL". A giant middle finger, that's what Ops got from Dev for Christmas that year.
Here's to the people who KTLO. My people.
What a brilliant move! Christmas was saved, everyone eligible received their bonuses.
My little firm have just lifted and shifted a customer's hardware from someone else's computer room (data centre is too grand) and plopped it down in ours. Downtime was roughly six hours which includes two hours driving, unracking, loading, unloading and racking.
Then there was a flurry of network knitting ... oh they've tagged the bloody VLAN instead of untagging it on what are effectively access ports and don't need to be trunks or hybrid. lol, lose 20 mins. I wasn't allowed to look at the "source" switch's config and might (emoji: looking up and whistling) have assumed a few things ...
We did spend quite a long time trying to work out what the customer might have failed to tell us because we hadn't asked the right questions.
... so I plug my laptop into the NIC in question on the Hyper-V box and run up Wireshark ... fuck (dot 1Q tag) ... run back upstairs to my PC and reconfigure the port to hybrid with tagged VLAN 100 instead of access on VLAN 100. A better solution would be a trunk with PVID on the naughty VLAN and tagged v100. I chose the former to make it stand out.
The naughty VLAN thing is similar to a discard VLAN, but the traffic is not discarded and instead gets logged. We should never see traffic on the naughty VLAN. If we do, it's a misconfiguration or something nasty.
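For anyone who hasn't chased this in Wireshark: the "dot 1Q tag" is four extra bytes in the Ethernet header, starting with the value 0x8100 where an untagged frame would carry its EtherType. Below is a minimal Python sketch of that check. The `build_frame` helper and its MAC addresses are made up purely for illustration, and this is not what the commenter ran.

```python
import struct
from typing import Optional

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q VLAN tag


def vlan_id(frame: bytes) -> Optional[int]:
    """Return the VLAN ID if the frame carries an 802.1Q tag, else None."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    # Bytes 12-13 hold the EtherType, or the TPID when the frame is tagged.
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != TPID_8021Q:
        return None  # untagged, as an access port would send it
    if len(frame) < 18:
        raise ValueError("truncated 802.1Q tag")
    # The 16-bit TCI follows the TPID; its low 12 bits are the VLAN ID.
    (tci,) = struct.unpack("!H", frame[14:16])
    return tci & 0x0FFF


def build_frame(vlan: Optional[int] = None) -> bytes:
    """Hypothetical helper: build a tiny frame, tagged only if vlan is given."""
    dst = bytes.fromhex("ffffffffffff")
    src = bytes.fromhex("020000000001")
    tag = b"" if vlan is None else struct.pack("!HH", TPID_8021Q, vlan)
    return dst + src + tag + struct.pack("!H", 0x0800) + b"payload"
```

A frame built with `build_frame(100)` reports VLAN 100, while `build_frame()` reports `None`. That difference is what tells you the far end is tagging where you expected an access port.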
As well as that, we have customers for whom Chrimbo is anything up to 50% of annual turnover. Their systems tend to be treated in the same way as yours.
Business offices cut their chilled water supply back to minimums (or nothing) over holiday weekends & breaks.
If you're running a server closet, even if you have a dedicated Liebert HVAC, when the chilled water cuts.. you will overheat.
I learned this over the course of three consecutive Thanksgivings.
Actually I bet some people like it (I know I do). It's not that crazy to want to dodge the whole mad rush and take lots of time off later in the year when it's actually nice outside. Summer vacation beats winter vacation, so if you have to take days off in the winter there's pressure to try and get somewhere warm where the days are longer. Besides. The "office" is quiet, even if you're a telecommuter, so it's easy to get things done. If you're not touching production, that's fine, there's usually all kinds of fun or quality-of-life projects around tech debt, tooling, whatever. Lots of important work is actually easier to do during a change-freeze or other downtime.
It's a similar concept to not deploying on Fridays. If you're afraid to introduce changes due to some arbitrary timing, perhaps it's worth focusing on the source of that uncertainty.
We should always target better stability, but no matter how good your system and incident response are, if your goal is to minimize customer disruption during a certain time window, or avoid dealing with incidents on weekends, minimizing production changes is the simplest and most effective measure.
_However_ (that part is probably best bookmarked until Jan 2nd), it also betrays that your system is brittle and can be broken by a bad commit. Don’t do it because you want people to grind until Dec 24th at 6 pm. Do it because it’s great the rest of the year, too. I’d recommend you look into (or ask me about) feature flags, alerting, and automated roll-backs.
The short version is: there’s a meta-system on top of your release process that can tell (if you are using roll-backs, not feature flags):
- commits until xyzsdf are fine;
- roll-outs starting from commit abcdef have a 2% error rate, 80% on Android;
- revert to xyzsdf, send a message (low-priority, email) to the DevOps on call and the author of abcdef that it happened;
- for all commits after abcdef: if there are no conflicts with xyzsdf, re-try to roll them out;
- if there is a conflict because they were on top of abcdef, send a message (low-priority email) to the authors that there is a conflict.
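A rough Python sketch of that decision logic, to make the steps concrete. Everything here is an assumption for illustration: the `Commit` and `Plan` types, the 1% threshold, the `"devops-on-call"` recipient, and the idea that conflicts can be read off a `depends_on` set. A real system would get these from its deploy tooling and version control.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Commit:
    sha: str
    author: str
    error_rate: float  # observed after roll-out, 0.0 to 1.0
    depends_on: Set[str] = field(default_factory=set)  # shas it builds on


@dataclass
class Plan:
    revert_to: Optional[str] = None
    notify: List[str] = field(default_factory=list)     # low-priority email
    retry: List[str] = field(default_factory=list)      # safe to roll out again
    conflicts: List[str] = field(default_factory=list)  # authors need to rebase


def plan_rollback(commits: List[Commit], threshold: float = 0.01) -> Plan:
    """Commits are ordered oldest first; the first one is assumed known-good."""
    plan = Plan()
    bad_index = next(
        (i for i, c in enumerate(commits) if c.error_rate > threshold), None
    )
    if bad_index is None:
        return plan  # nothing to do
    if bad_index == 0:
        raise ValueError("no known-good commit to revert to")
    bad = commits[bad_index]
    plan.revert_to = commits[bad_index - 1].sha
    plan.notify = ["devops-on-call", bad.author]
    tainted = {bad.sha}
    for later in commits[bad_index + 1:]:
        if later.depends_on & tainted:
            # Built on top of the bad commit (directly or indirectly).
            plan.conflicts.append(later.sha)
            tainted.add(later.sha)
        else:
            plan.retry.append(later.sha)
    return plan
```

With the commits from the list above (xyzsdf clean, abcdef at 2%), the plan reverts to xyzsdf, notifies the on-call and the author of abcdef, and sorts every later commit into "retry" or "conflict".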
There are more sophisticated versions that can do things like, if you use feature flags, flagging Android users to use the previous version. Another way to do this is to scale who has access to abcdef gradually: say 1% every hour, and revert if you detect issues.
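The gradual version can be sketched the same way. This is a toy under stated assumptions: `error_rate_at` stands in for whatever monitoring you have, the hourly wait between steps is left out, and hashing the user ID into a bucket is just one common way to decide who sees the new version.

```python
import hashlib
from typing import Callable, Optional, Tuple


def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def gradual_rollout(
    error_rate_at: Callable[[int], float],
    step_percent: int = 1,
    max_error_rate: float = 0.01,
) -> Tuple[int, Optional[int]]:
    """Raise exposure step by step.

    Returns (final_percent, reverted_at). final_percent is 0 after a revert,
    and reverted_at is the exposure at which the problem was detected.
    """
    percent = 0
    while percent < 100:
        percent = min(100, percent + step_percent)
        # A real system would wait (say, an hour) before checking metrics.
        if error_rate_at(percent) > max_error_rate:
            return 0, percent
    return percent, None
```

Because the bucket depends only on the user ID, a user who got the new version at 5% still has it at 6%, so raising the percentage only ever adds people.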
All those seem daunting to teams that haven’t worked like this before, but in my experience, they love it very fast.
/However/, let me counter with the point: Just one of our customers has 8000 FTEs working with our system. During hell-time (aka, December and Christmas shopping and shipping), each of those dudes spends their shift taking customer calls lasting 2-4 minutes, which in turn require a few requests into our systems.
Due to the stress of their customers^2 (because it's Christmas and holidays and such), if an agent of a customer is unable to access our systems, they cannot handle the use case of the customer^2 and that will piss off the customer of the customer.
So if we push a bad change during this time, we're going to piss off hundreds of customers^2 per minute for that one customer alone. Even with a fast automatic rollback, that's a long time during hell-time. And they have people who know how to yell at vendors in nasty ways who don't like that.
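To put rough numbers on "hundreds per minute", here is the back-of-the-envelope arithmetic as a small Python sketch. The 8000 agents and the 2-4 minute calls come from the comment above. The share of agents on shift at once and the five-minute rollback are assumptions picked only to show the shape of the calculation.

```python
def blocked_calls_per_minute(agents_on_shift: int, call_minutes: float) -> float:
    """Calls that would have started each minute if every agent stays busy."""
    return agents_on_shift / call_minutes


def calls_hit_by_outage(
    agents_on_shift: int, call_minutes: float, outage_minutes: float
) -> float:
    """Calls affected while the system is unavailable."""
    return blocked_calls_per_minute(agents_on_shift, call_minutes) * outage_minutes


if __name__ == "__main__":
    # Assumption: only 800 of the 8000 FTEs are on the phone at any moment.
    for minutes in (2, 4):
        rate = blocked_calls_per_minute(800, minutes)
        hit = calls_hit_by_outage(800, minutes, outage_minutes=5)
        print(f"{minutes}-minute calls: {rate:.0f}/min, {hit:.0f} calls in 5 min")
```

Even with that deliberately low staffing assumption, a bad change that is rolled back within minutes still lands on a four-digit number of calls.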
I enjoy moving software fast and enabling moving software quickly, but customer focus and customer orientation means to understand when to move slow as well.
And hey, if that means more quiet holidays for the hard working operators on my team, who's gonna complain?
What is an error? Is a business logic bug going to be picked up by this process automatically, or are some manual steps involved?
I.e., a point of sale app releases an update that automatically halves the amount to charge, but displays the full amount to the merchant in the UI. Unit tests pass (because an engineer made a human mistake). Backend calls are correctly used, no errors thrown, simply the wrong amount is used.
How would this be automatically detected and reverted?
Would anyone writing point of sale software want to risk this over one of the biggest trading periods of the year?
Correct. So's yours. So's everyone. You might not know what the bad commit is, you might've fixed a bunch of the other bad commits, but even Google gets taken down by bad commits. Your system is brittle and can be broken by a bad commit.
Yeah
It sounds like an easy issue to correct, but downstream systems that consume those numbers had already processed them and associated reports and other records with the incidents. I spent the next few months sorting out that mess and helping work with partners to clear out data.
You might be under the impression that what makes you qualified for various positions in software development is primarily your technical acumen and ability to work with other technically-capable engineers.
You’d be wrong.
While a certain minimum of capability is required to do your day-to-day work, what your value really consists of is in grinding yourself against the piercing pincers of elusive bugs and razor-wire bundles of bullshit code until something resembling progress is made. You are not a problem-solver, you are a problem-endurer.
https://web.archive.org/web/20160317234837/https://medium.co... -> Point 4
Unfortunately I feel like I lucked into this role and if I left I wouldn't be able to find anything anywhere near as good.
In fact, even when there are US/western counterparts, these subhumans will make sure Indian engineers are on-call even during American daytime. This has been happening at my workplace. They employ all tactics, from fear and intimidation to trying to sweet-talk engineers into it with shit like, "Oh, we own it, right? So it's our responsibility to support even when it's night".
In that environment it becomes an extremely difficult and pressurised situation for someone like me, who simply refuses to even sign up for something like PagerDuty and makes it clear that my phone remains silenced and out of my bedroom between 10pm-7am, and it really does.
I agree with you - there is no amount of money that can put me on on-call, definitely not on a night shift on-call.
If it makes you feel any better this is very common in small to mid-sized US tech companies as well. In every team I've been on that had an oncall rotation it was a full week 24/7 per person, that rotated among team members. Even at Google we were on call for our own service overnight and didn't have SRE / other time zone oncalls.
But the number of pages and other work varied significantly between teams. The worst was risk at Square in 2016, where we routinely got paged 40+ times a week (mostly noise) and when real incidents were most likely on Saturday morning. The best was Instant Apps at Google where we got a ~$5k bonus for each week of overnight oncall and almost never got a single page.
Agreed, but I have to say that, as a DevOps, it was infuriating to have to deal with developers who didn't care about the quality of what they were delivering. Sometimes it was because of pressure from someone higher in the chain, other times pure laziness and/or incompetence. I remember coming in the morning after a hell of a night on-call, reporting the issues to the Devs in charge, and being answered something along the lines of "fixing that is not the priority right now", and my replying in anger with "If it was your damn phone the one ringing during the whole night I'm pretty sure you would make it a priority".
It's actually my favorite time of the year. Everyone is gone, it is quiet, and I can get shit done.
I’m a militant proselytizing atheist raised by a Jew and I still have a tree with pretty lights, give presents, and drink and eat some things I only drink/eat once per year (never make homemade eggnog if you ever want to enjoy it guilt free again, you’re basically drinking a megacalorie of heavy cream, yum). It’s fun to celebrate the generic concept of “holiday” - a time that is different from other times.
You’re allowed to feel nice about peppermint candy (and/or chocolate gelt, I go for both) at the end of December without bringing the supernatural into the equation. :)
\m/
Solid advice. I literally put on 5 lbs. while refining my non-alcoholic eggnog recipe.
One thing I noticed is that ice cream, crème brûlée, and non-alcoholic eggnog are all just variations on the same recipe. A glass of egg nog is pretty similar to a glass of melted ice cream.
Bingo!
Build A 300-Mile Wall Around SF During Burning Man:
https://web.archive.org/web/20190213021206/https://megagogo....
>A community effort to construct a 300-mile wall in one week and prevent Burning Man attendees from returning to the Bay Area.
>About This Project
>We want to help Burning Man attendees continue their favorite week of the year, and allow them to keep experiencing the genuine community and deep connections they can only feel while at Burning Man. To do this, we will build a 300-mile wall around the entire Bay Area during Burning Man.
>For the rest of us, what’s normally our favorite week of the year… lasts forever!
Or a member of one of the religions that don't celebrate Christmas.
I love end of year because nobody’s pushing anything or needs help.
Then you take vacation in January when the floodgates open.
The right way: slow way the fuck down to the point you're practically on vacation during the holidays without taking vacation. Take your vacation during other times in the year.
https://news.ycombinator.com/item?id=38727987
Many thanks to all of the health care workers who take care of us over the holidays. (Along with all of the others, of course.)
Growing up ignoring holidays is mostly great (fly on xmas and everybody feels sorry for you, even though they are the ones working on xmas). But it causes relationship problems bc even when you genuinely try to participate you’re “doing it wrong”.
Having a family that accepts rescheduling Holidays helps. We've celebrated Thanksgiving, New Year and Christmas on different days before.