What SRE Could Be - Readit News

SRE has an identity crisis. It means too many things to too many people. In some companies, it is an operational help desk. In a few companies, SRE is rumored to be a different aspect of software engineering.

Pay is a rough signal on what a company expects from SRE. If SREs get paid as much or more than Software Engineers, it is a good sign. Yet even that signal is not accurate. I know highly paid SREs stuck in Ops Hell. You could not pay me enough money to do a job like that. It is also not sustainable because eventually the company will fire you and replace you with a lower paid employee who can do that job just as well.

I strongly oppose organizations that have SRE separate from Software Engineers. I strongly believe that SREs must be fully integrated into Software Engineering teams (aka the embedded model), with the same reporting lines as their product colleagues.

Keeping SRE organizationally separate has done a great disservice to SRE. SRE 2.0 needs to tear down the mini empires that SRE 1.0 built and get back to software engineering.

dijit · 3 years ago

> SRE has an identity crisis. It means too many things to too many people. In some companies, it is an operational help desk. In a few companies, SRE is rumored to be a different aspect of software engineering.

SRE is Sysadmins but with better contracts between teams, and automation focused.

I haven't had much trouble with that definition; it does away with all the cruft of the "Production" talk by Benjamin Treynor Sloss and the majority of the SRE books (which are a prescription of a method of creating those contracts).

That said: "DevOps" and Systems Administration in general has had an identity crisis for a while. People don't like the idea of "Sysadmin" for some unknown reason, but give the same role to someone and call them "DevOps" and suddenly all is well.

Same is now true of SRE.

Someone on HN said to me: "DevOps is how you get Engineering to respect Systems Administration".. Maybe the same is true of SRE?

I definitely feel the identity crisis of DevOps is much, much worse than that of SRE, but I guess that depends on your circles.

kodah · 3 years ago

> SRE is Sysadmins but with better contracts between teams, and automation focused.

That's not even close to true, which I think is OP's point. For instance, I have an extensive SRE-SE background, but I'm also a software engineer. If I was tasked with "sysadmin" stuff, or shepherding people's Jenkins jobs every day, I'd probably lose my mind.

nunez · 3 years ago

Reliability Engineering is much wider than the scope of a sysadmin. SREs are software engineers who focus on the reliability of systems and platforms. Could be sysadmins; could be traditional SWEs; knowledge of both is required.

daenz · 3 years ago

Former Senior SRE here. Was doing everything: working with engineering teams to build a microservice authoring framework. Putting out fires and responding to outages. Adding monitoring/metrics/logging to all the things. Building productivity tools for engineers. Spinning up new software in sandboxed environments for evaluation. Onboarding and offboarding new engineers. Coordinating downtime for software upgrades. Refactoring our infra and making it more IaC. Security audits.

Our team was basically a dumping ground for anything that required experience in many different types of systems and/or privileged access. On top of that, we had projects that were shoehorned into roadmaps. So while 90% of our duties were reactive to what the org needed, we also had the pressure of trying to meet deadlines for normal software engineering work. It was unhealthy and I wanted very much to not be defined as an "SRE".

samstave · 3 years ago

Director of Ops title for a decade here:

Add on top of this managing multiples of you, having to do the CFO's work for the CEO and literally manage to the penny the instances we spun up, and dealing with all the expectations of ops WRT developing tooling on top of tooling.

hiddenwaffle · 3 years ago

My company expects SRE to be synonymous with cloud administrator, acting as the gateway to doing anything AWS-related.

They've also asked us to troubleshoot database backup processes, write Crystal Reports on user activity, and be Slack admins. Some of the developers believe that we do manual production deployments of their builds (we don't). I get requests all the time from users asking if I can make accounts on various systems for them.

I was given a programming test for this job :(

dub · 3 years ago

Agreed about having SRE fully integrated into feature teams in terms of day-to-day work, seating charts, team building events, etc.

From a management perspective I think it still makes sense for SRE to have a different reporting structure. Feature teams generally aren't rewarded for investing in long-term code and production health. If you report to a feature team manager there will often be downward pressure to focus less on production health and more on helping features ship quickly.

Giving SRE a degree of independent oversight serves as a system of checks and balances against those feature team pressures.

throwaway894345 · 3 years ago

One of the functions that SRE provides in our organization is building out the platform and solving cross-cutting problems. How do you do that without some reporting structure? Shared ownership and guilds can only go so far since each team has its own priorities and incentives (and will rarely choose to yield engineering time to platform things if they can devote it to product things).

I’m not arguing against an embedded model, but I do think there’s tension that’s not easily resolved.

aluminussoma · 3 years ago

In my division, there is an infrastructure team and product teams. Those platform type projects are handled by the infrastructure team. In a product only team, it will always be difficult to work on such projects.

Ultimately, the answer ties back into the identity question: what is SRE?

SassyGrapefruit · 3 years ago

We use an "embedded model" with separate teams. I run an SRE organization with 10 individuals. At any given moment 3-4 are embedded on teams pursuing some objective. The rest are doing incident response and SRE specific projects from the big pile of sprint work.

We did this because in our case most development problems didn't have a reliability dimension. We build mathematical models. Do I need an SRE involved in the discussion of switching from one statistical approach to another? In most cases no. But when that team deems themselves limited by their current walled garden and want to explore/operationalize new technology then we need an embedded SRE to facilitate that.

kodah · 3 years ago

SWE has an identity crisis. It means too many things to too many people.

Just kidding, our entire industry is filled with titles that mean nothing. I agree, we should fix that.

dilyevsky · 3 years ago

Imo the embedded model is no better - usually just a token ops person that everyone on the team throws things over the fence to.

I’d say a better signal is if SRE has its own separate org structure which roots at the CTO/VP of Eng level. As soon as you have SREs reporting to product eng managers, it’s the beginning of the end.

pramodbiligiri · 3 years ago

A recent talk I saw suggested that SREs report into the COO (Chief Operations Officer) and not the CTO or VP of Engg. I thought that’s a good idea.

The COO is interested in the stability of ongoing operations whereas the other two roles are about launching new things (Unless I have gotten the definition of COO wrong).

aluminussoma · 3 years ago

In the several examples of separate SRE orgs that I've seen, SRE builds their own little empires with no alignment to the product. A cynic would guess that the goal of SRE management is to hire enough SREs so there can be senior managers and directors and eventually a VP with pay packages to boot.

They come to Product from the mountain-top with their holy scriptures (the Google SRE book), prescribing surface solutions to problems with deep roots (pray it away with SLOs).

SRE 2.0 needs to make the embedded model work.

samstave · 3 years ago

I feel like you haven't heard of BOFH.

Bastard Operator From Hell -- A famous IT trope. Which was typically a sysad for a mainframe/Mini-Computer... Usually on a University Campus....

Think Lazlo from Weird Science, but an asshole.

I actually worked with a BOFH. (Sadly he had a heart attack on the CalTrain Station in Mountain View, and died.)

He was the AS/400 admin for a place where we manufactured all of Sun's software (meaning we made the SunOS CDs, manuals, packaging, etc.), and this guy was a BOFH.

Sales folk would call up to the helpdesk and request a password reset.

This one time, he set the user's password to "STUPID!" ...

I fired him for that.

He had a heart attack shortly thereafter....

:-(

brundolf · 3 years ago

> I strongly oppose organizations that have SRE separate from Software Engineers

Can confirm, we have an SRE team that's totally separate from engineering and it's a huge and constant bottleneck. We've literally re-architected parts of the application to avoid having to go through SRE for things.

I think this is the point where we're getting too serious about the discipline. We need to be a little more grounded in reality. SRE is basically a highly polished, best-practice, telemetry-and-automation-focused sysadmin. And a sysadmin doesn't really have a say in the business. They can make suggestions, like error budgets. But what about sales contracts? The contracts define what kind of guarantees you have to give and on what timeline. And what about customer experience? How is SRE supposed to quantify if the users are happy if they can't tell the UX and customer support how to build the product to better collect user happiness? Or tell the devs how to design their apps to fail less often?

If you keep optimizing and focusing on all the ways to improve satisfaction, you literally end up an executive director of product. If we really want to make products that users are happy with, we need to teach executives how to achieve it, not SREs. SREs shouldn't be driving the bus from the back row. They should be rotating the tires and checking the fuel and oil gauges and telling the driver when to stop for gas. Let's teach executives how to set everyone up to succeed, and then SRE doesn't have to get all existential just to keep the bus running.

mihaigalos · 3 years ago

> I think this is the point where we're getting too serious about the discipline.

In my opinion, we are not serious enough. The role of the cited sysadmin is dying or dead in the modern company.

As an SRE, you are expected to directly contribute to the product to improve SLOs as well as several other dimensions you quoted (like user happiness).

I've personally refactored code in a component I was unfamiliar with to shorten a build job that was slowing down pipelines for _everybody_. The code review was then performed by a code owner and the PR merged.

> The contracts define what kind of guarantees you have to give and on what timeline.

That's why you commit to ship product increments, not hard deadlines. See the failed launch of healthcare.gov, for example.

> How is SRE supposed to quantify if the users are happy?

The same way a dev would: review with stakeholders, "show-and-tell" where you collect feedback and improve your domain knowledge by focusing on the customer problem instead of your own vision of the solution.

> Or tell the devs how to design their apps to fail less often?

This is precisely why SREs would provide value by direct contribution.

> you literally end up an executive director of product.

We are all ambassadors of the product and can contribute past our designated "boundaries". That's what cross-functional is all about. No, you cannot make business decisions but if it is code, you have very high leverage.

> If we really want to make products that users are happy with, we need to teach executives how to achieve it, not SREs.

This has not happened yet and I suspect it never will.

> SREs shouldn't be driving the bus from the back row [..] Let's teach executives how to set everyone up to succeed [..].

Agreed. Do you have a concrete proposal? Without fundamental organizational change, this is impossible. Conway's law still applies, so does its inverse. Meaning you can break down silos (including at executive level) by migrating from component to cross-functional feature teams.

throwaway892238 · 3 years ago

My proposal would be to form an industry panel made up of experts in each of the discrete roles that make up a typical software product value chain. Ask that panel to produce the following:

1. Adapt industry best practices into a training program to teach executives what is necessary for a software product with happy customers. Very simple and to the point: here are the components of the value chain, here are their common problems, here is what each job needs to do to prevent them, here is how they have to work together. The purpose isn't to micro-manage each department, but to know how to identify when they aren't operating correctly and how to put people in place who know #2 below.

2. Take the above and break it down for each part of the value chain. Have the managers/directors of each part learn what their own department's best practice requires, its pitfalls, strategies. Teach them how inter-departmental overlap is required for happy customers, how departments should set each other up for success and collaborate, how "ownership" is tenuous and demands collaboration and compromise. They're all tributaries of a river.

Charge money for people to go through that training and get a certificate. The money goes back into the whole thing. This becomes a baseline that shows you understand how to work on software products with best practice geared towards customer satisfaction.

spondyl · 3 years ago

> If we really want to make products that users are happy with, we need to teach executives how to achieve it, not SREs. SREs shouldn't be driving the bus from the back row. They should be rotating the tires and checking the fuel and oil gauges and telling the driver when to stop for gas.

To extend on this analogy a bit, the theory is that you might have business OKRs that are driven by SLOs.

For example, you might have a wishy washy goal of having happy customers. If you're in the shipping business, one aspect of being happy with the service would be to receive deliveries on time (or earlier!) so you might set an OKR where the objective is "Deliveries are made in a timely manner" where key results might be the number of deliveries made on time and inversely, the number of deliveries that are not returned.

By pitting those two metrics (speed and state of the delivery) together, you can ideally capture that an increase in speedy deliveries is not just achieved by people throwing packages out of moving vehicles as one or the other will decline.

Of course, simply setting these metrics doesn't mean very much but this is where we can drop down to a lower level and make use of SLOs. An SLO is just some number with a target after all, it doesn't have to be technical by any means.

So, how can we help drive our goal of timely deliveries? Well, we might set an SLO that says something like "99% of vehicles should have their oil checked every 3 months".

On the face of it, that has nothing at all to do with vehicle deliveries but inspecting historical data might determine that failed deliveries were often correlated with poorly serviced vehicles so by ensuring they're serviced, we can indirectly improve customer happiness by something that seems entirely unrelated.
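The oil-check target above can be sketched as a small calculation. This is only an illustration: the fleet data, the `van-N` names, and the choice to treat "3 months" as 90 days are all assumptions, not anything from a real system.

```python
from datetime import date, timedelta


def oil_check_sli(last_checked, today, max_age_days=90):
    """Return the fraction of vehicles whose last oil check is recent enough.

    last_checked maps a (hypothetical) vehicle id to the date of its last check.
    max_age_days=90 is an assumed stand-in for "every 3 months".
    """
    if not last_checked:
        raise ValueError("fleet is empty")
    cutoff = today - timedelta(days=max_age_days)
    fresh = sum(1 for checked in last_checked.values() if checked >= cutoff)
    return fresh / len(last_checked)


def meets_slo(sli, target=0.99):
    """Compare the measured indicator against the SLO target."""
    return sli >= target


# Made-up fleet: 98 vehicles checked a month ago, 2 overdue.
today = date(2023, 1, 1)
fleet = {f"van-{i}": today - timedelta(days=30) for i in range(98)}
fleet["van-98"] = today - timedelta(days=120)
fleet["van-99"] = today - timedelta(days=200)

sli = oil_check_sli(fleet, today)
print(f"SLI: {sli:.2%}, SLO met: {meets_slo(sli)}")
```

With two overdue vehicles out of a hundred, the indicator lands at 98%, just under the 99% target, which is the kind of signal that would be fed back up to the delivery OKR.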

This is all in theory of course and assumes your business is able to communicate across layers and share some sort of overall vision which is very hard! It also assumes someone (in the middle?) is able to bridge the gap to co-locate SLOs with OKRs and what not.

Anyway, that's how I understand the "dream state" but have I ever seen it actually accomplished? I can't say I have!

dilyevsky · 3 years ago

> How is SRE supposed to quantify if the users are happy

They can easily tell if the users aren’t happy, which is usually the same time their pagers go off.

It’s natural, in the swirling chaos of the past few years, to take a step back and wonder just where everything is going. Though I’ve certainly been doing my fair share of that, I’ve been also thinking about defining precisely where things are. Answering that question for SITE RELIABILITY ENGINEERING (SRE) is not easy.

chazu · 3 years ago

90% of SREs and SRE managers haven't read the SRE book(s).

99.9% of folks hiring SREs or starting SRE teams haven't read the SRE book.

The SRE book (and its sequels) say quite plainly what SRE is and isn't. They also say that not every org is going to be exactly like google so no, "we're not google" isn't an excuse.

The E in SRE is for engineering. As in software engineering. SREs are software engineers. Or should be. If your SREs don't know basic SWE principles, they're not SREs. If your org isn't applying software engineering principles to minimizing operational complexity at scale, your org isn't doing SRE.

I'm constantly shocked by how hard these things are to grasp, even for most SREs. If the problems I (occasionally) get to solve weren't more interesting than most regular product work, I'd get out of "SRE" entirely.

I think this myth exists because Google was (is?) famously obsessed with SWE. But if you actually read the SRE books and look at the actual discipline of SRE ("what's the difference between SWE and SRE?"), SRE is quite blatantly just operations management. The website is a power plant, and the SRE runs the power plant. You don't build parts to run a power plant, you use software (as in manipulate/control/operate) to run it. You act quickly when the numbers go out of line, you write reports and control how much power is going in and out, respond to surges and dips, etc.

For whatever reason, Google decided to tell people that the same person who's building the klaxon and the concrete wall and the pipes for the power plant, and the person who's operating the power plant, are one and the same. But that's clearly bunk. Building a part and running a system are completely different disciplines, and anyone who does both will only be half good at both. Humans are shit at multitasking and there are few true polymaths out there. Show me a master programmer and I'll show you an amateur woodworker.

I also don't believe software engineering principles will help you reduce operational complexity. If anything, software engineering tends to make things either inefficient or subtly complicated. Reducing operational complexity comes from the discipline of operations, which isn't engineering. Non-tech companies have known about these distinctions for like a hundred years. Deming applied scientific rigor and analysis to come up with better practices, but he didn't have to design any widgets to do it.

thethethethe · 3 years ago

> For whatever reason, Google decided to tell people that the same person who's building the klaxon and the concrete wall and the pipes for the power plant, and the person who's operating the power plant, are one and the same. But that's clearly bunk. Building a part and running a system are completely different disciplines, and anyone who does both will only be half good at both

Depending on the team, SREs can absolutely be involved with "building the system", especially the klaxon ;) Examples include designing and implementing metrics used to make decisions in business logic and/or exposed to customers/users, writing routing components like mixers and proxies, developing data pipelines, etc. At Google many SRE teams build and run entire multi-tenant systems with no pure SWEs involved at all.

Healthy SRE teams should be spending 20% of their time on operations. On my team it's actually the devs who do most of the operations work. They take the pager during business hours and we route most maintenance tickets to them.

tom-_- · 3 years ago

The website is NOT a power plant, it's just code. In software, "operations management" is basically infrastructure automation, incident response and build and release. All of these require some software development or at least code literacy and familiarity with software development practices. If there's large overlap in technical skill between the operators and the builders, then it makes more sense to see them as the same but focussing on different problems.

everforward · 3 years ago

> I also don't believe software engineering principles will help you reduce operational complexity.

This isn't a goal of SRE, in my opinion nor in anything I can recall reading. The goal of applying software engineering principles is to accept the increased complexity in exchange for a reduced operational burden.

There are layers of that effect, and the right one depends largely on your operational burden. Sysadmins shun complexity, so systems are simple but doing mass updates requires a lot of manpower. DevOps embraces some complexity, like Ansible or manually orchestrated containers, making it easier to do mass updates, but they are still a burden. SRE embraces complexity in exchange for a dramatic reduction in manual effort on many tasks.

The idea is that at certain scales (or reliability requirements), it becomes cheaper to hire a small number of expensive people that can manage complex systems than it is to hire a large number of people each managing a simple system.

Software engineering arises because it can effectively trade complexity for reduced operational burdens in exactly the areas you want. You don't have to migrate to a new infrastructure orchestration tool, you can just write an orchestration tool on top of what's there (which I've actually seen done). Was it perfect? No. Was it cheaper than migrating a half million containers to Kubernetes? Yes.

Operations management tends to be very inflexible. They have a set of tools, and anything outside those tools is either a no go or will require replacing an old tool at the cost of months of effort.

it's not that this is hard to grasp. it's not. in fact, many of the people i've consulted on RE have read all or parts of these books. IMO, it mostly comes down to selective interpretation.

it's like telling a homeowner that they need to spend $1000/year on an annual maintenance item to prevent a _possible_ $15k repair bill every five years.

For some, $1000/yr is too expensive for them. So they take their bets or skimp. (People who think you can do SRE without being SWEs because they "can't code")

For others, $1000/yr is affordable, but because the $15k bill is "unlikely", they skimp. (People who think you can do SRE without being SWEs because even though they can code, they'd rather separate those jobs.)
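As a rough sketch of the arithmetic behind that analogy: the $1000/year and $15k figures come from the comment, while the failure probability and the assumption that maintenance fully prevents the repair are mine, added only to make the trade-off computable.

```python
def expected_costs(p_failure, maintenance_per_year=1_000, repair_bill=15_000, years=5):
    """Return (cost of maintaining, expected cost of skipping) over the period.

    p_failure is an assumed probability that the repair is needed within
    the period if maintenance is skipped. Maintenance is assumed to fully
    prevent the repair, which is a simplification.
    """
    maintain = maintenance_per_year * years
    skip = p_failure * repair_bill
    return maintain, skip


def break_even_probability(maintenance_per_year=1_000, repair_bill=15_000, years=5):
    """Failure probability at which both choices cost the same on average."""
    return maintenance_per_year * years / repair_bill


for p in (0.1, 0.5):
    maintain, skip = expected_costs(p)
    print(f"p={p}: maintain=${maintain}, skip=${skip:.0f} expected")

print(f"break-even probability: {break_even_probability():.2f}")
```

On these numbers the two options break even when the chance of the repair in a five-year window is about one in three, so the disagreement in both cases is really about how "unlikely" the bill is believed to be.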

herodoturtle · 3 years ago

Well said.

> 90% of SREs and SRE managers haven't read the SRE book(s).

> 99.9% of folks hiring SREs or starting SRE teams haven't read the SRE book.

These are free to read online, for those that are wondering:

https://sre.google/books/

mycentstoo · 3 years ago

Somewhat relatedly, I converted from SWE to SRE for three reasons. 1. SWE interviews are torture and no matter what I accomplish at my job, the next job I have to re-prove everything all over again. SRE interviews do not typically have this style of interview (I've heard Meta and Alphabet do however). 2. SRE focuses on lower level abstractions and therefore seems less prone to product oversight which is tiresome. SRE work is typically longer lived and deadlines less tight. 3. SRE jobs seem even more remote friendly than traditional SWE roles because of reason 2.

I'm much better as a SWE but I'm paid more as an SRE and have much more freedom. I don't actually like ops work that much but I'll take that trade off to guarantee the other three advantages above.

All that said, in my short time as an SRE I've noticed that it's very much in an identity crisis because there's not really a standard job. Infrastructure is very different company to company, and some companies expect SREs to be able to code as well as SWEs whereas others don't expect you to know how to code at all.

In my experience, an SRE is someone whose principal goal is to maintain infrastructure and ensure code is able to be developed, built and shipped. In the "better" SRE roles, SREs are able to build platform tooling rather than just do ops.

pojzon · 3 years ago

> principal goal is to maintain infrastructure and ensure code is able to be developed, built and shipped

From the SRE vs DevOps definition I just read from the top of google -> this description sounds more like a regular DevOps role instead of SRE.

But IMHO, SRE is just Google's way of naming a regular SysAdmin/DevOps role.

I see no difference. I did exactly the same tasks as an SRE and as a Senior DevOps.

Could anyone with 20+ years of experience explain why exactly we still distinguish those?

SRE is supposed to be a superset of DevOps but how?

AlchemistCamp · 3 years ago

> SRE is supposed to be a superset of DevOps but how?

Steve Yegge has said in his tech talks on YT that DevOps is often "all ops" and not necessarily involving any dev.

flurie · 3 years ago

I remember a common, puerile early adulthood "game" in which one would add the words "in bed" to the text of a fortune cookie. I feel like proclamations about the industry by people who were and are at Google need a similar party trick: add "at Google" to the end of each sentence. SRE is Google's answer to Google's problem; it says so right in the first revelatory header of the introductory chapter of the book: "Google’s Approach to Service Management: Site Reliability Engineering".

Inherent in the text of technical books that reveal what people did is an attempt to persuade the reader why they should do it, too. In some ways this reveals the central, myopic conceit at the heart of Google: if you don't have Google's problems, you should. And I'm not sure why anyone who isn't Google should just accept this premise.

That's not to say that the premise is unacceptable. I'm sure that there are people who want to participate in a thing that looks like Google, except that they'd rather start it from scratch than join the one that already exists. Perhaps those people want to have Google's problems.

Now we have a piece, expanded from a talk, by one of the editors of the books that persuaded organizations that, if only they would squint their eyes a bit and turn their head, they could make the problems of their organization, orders of magnitude smaller by so many metrics, look like the problems of Google. And now, perhaps, Google's answer to Google's problem wasn't quite right, because it didn't involve...something? It could be philosophy or human-centered design, and I wouldn't be surprised to find Google bereft of either, but it certainly isn't going to be found in a paper by Google (ostensibly, a Google answer to a Google question) that predates the first book by nearly a decade. It isn't going to be found in people for whom these things are afterthoughts rather than guiding principles. Instead, shouldn't we be asking whether we should have let SRE, a Google answer to a Google problem, be a solution outside of Google in the first place?

coward123 · 3 years ago

Thank you for this - you put into words very well all of the problems I had with that book. In fact, I found myself very frustrated with the authors, wondering if they had any familiarity with all the rest of the decades of work that had been done in related fields outside Google.

benlivengood · 3 years ago

A lot of the metrics for outage impact are available, and if you really want to Do Science you can use whatever A/B Canary/experiment framework you have to simulate outages (which, arguably, are not perfect because of down-detectors) on users to understand their behavior.

I tend to agree that most people don't care about transient downtime beyond 2 or 3 nines for reputation. Covid has proven how bad people are at reasoning about 0.5% probabilities, so there's essentially no chance that they are noticing (en masse) the difference between 2.5 and 3 nines of reliability. Know your customer, as the article points out; more complex customers will know and measure actual availability because they need it for their own reliability calculations.

This brings me to my final guess, that SLOs are primarily useful because they provide a risk model that upstream and downstream consumers can reason about. SLOs are a way of building even larger distributed and interdependent systems that can still be somewhat reliable. Precisely what SLIs are chosen and the limits are a little less important than the fact that they are published publicly. There's an economy of reliability in large systems and like any market information about the market is key to its efficiency. If everyone just trusted that everyone was doing their best at maintaining dependencies most of those beliefs would be very unanchored.

A side benefit of SLOs is postmortems and communication; people working on codependent products and systems need to talk to each other from time to time and understand the true nature of the interfaces between their systems. Postmortems and SLO negotiations give a concrete framework to do that within, metrics drive quantitative understanding of the systems instead of handwaving.
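To put numbers on the "nines" mentioned above, here is a minimal sketch that converts an availability target into allowed downtime per year. Reading "2.5 nines" as 99.5% is an assumption (it matches the 0.5% figure in the comment, but other conventions exist), and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def downtime_hours_per_year(availability_pct):
    """Allowed downtime per year, in hours, for an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


# "2.5 nines" is read as 99.5% here, which is an assumption.
targets = {"2 nines": 99.0, "2.5 nines": 99.5, "3 nines": 99.9}

for label, pct in targets.items():
    hours = downtime_hours_per_year(pct)
    print(f"{label} ({pct}%): about {hours:.2f} hours of downtime per year")
```

Under these assumptions the gap between 2.5 and 3 nines is roughly 43.8 versus 8.76 hours a year, a difference most individual users would struggle to perceive but one that a downstream service budgeting its own SLO can reason about.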

luxuryballs · 3 years ago

After reading these comments I have even less of a clue what SRE actually is…

1970-01-01 · 3 years ago

It's a telltale sign of a naive blogger. If you think anyone outside your profession will read your writing (or code!), always expand acronyms on first use. I'll fix this one for everyone:

But your question was probably about the job, not the acronym. Carry on.

mr-ron · 3 years ago

SRE is a great toolkit of good things that any team can use. Some super amazing chapters I think everyone should read:

Eliminating toil: https://sre.google/sre-book/eliminating-toil/ Every team tech and non tech everywhere should read this

Incident Management: https://sre.google/workbook/incident-response/

Post Mortems: https://sre.google/sre-book/postmortem-culture/ These 2 chapters are necessary reading.

SLOs: https://sre.google/sre-book/service-level-objectives/ Disclaimer: I think the SLO chapter is good for knowing the theory, but I've used SLOs specifically very few times.

sharadov · 3 years ago

A good read, but a lot of topics in that book apply to Google-scale companies only.

95% of companies will never reach that scale or deal with those issues.

citruscomputing · 3 years ago

Thank you for posting these, I learned a lot reading them.