Lee: Modern applications require modern operations, and modern operations require a new definition of ownership, one that most classical organizations must provide. Today I continue my discussion on modern ops with Beth Long. Are you ready? Let's go.

Announcer: This is the Modern Digital Business Podcast, the technical leader's guide to modernizing your applications and digital business. Whether you're a business technology leader or a small business innovator, keeping up with the digital business revolution is a must. Here to help make it easier, with actionable insights and recommendations as well as thoughtful interviews with industry experts: Lee Atchison.

Lee: In this episode of Modern Digital Business, I continue my conversation on modern operations with my good friend, SRE engineer and operations manager Beth Long. This conversation, which focuses on service ownership and measurement, is a continuation of our conversation on SLAs in modern applications.

Lee: In a previous episode, we talked about STOSA, and this fits very much into that idea: the idea of how you organize your teams so that each team has a certain set of responsibilities. We won't go into all the details of STOSA, but the bottom line is that ownership is critical to the STOSA model. Ownership is critical to all DevOps models. If you own a service, you're responsible for how that service performs, because other teams are depending on you to perform. And the definition of what it means to perform is what an SLA is all about. So, what does a good SLA look like, Beth? That's a great question. Let's get to the measurement.

Beth: It does get into measurement, and that is always a hard question to answer. If you look at the textbook discussions of SLIs and SLOs, and SLAs in particular, you'll often see references to a lot of the things that are measurable. So you'll have your golden signals: error rate, latency, saturation. You have these things that allow you to say, okay, we're going to tolerate this many errors, or this many of this type of error, or this much latency. But all of that is trying to distill the customer experience down into things that can be measured and put on a dashboard.
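As a rough sketch of the golden signals Beth mentions, here is how error rate, latency, and saturation might be computed over a window of request data. The window format, all of the numbers, and the assumed capacity are invented for illustration:

```python
from statistics import quantiles

# Hypothetical one-minute window of requests: (latency_ms, http_status).
# All of the data and the capacity figure are invented for illustration.
window = [(12.0, 200), (8.5, 200), (95.0, 500), (11.2, 200), (430.0, 200),
          (9.9, 200), (14.1, 200), (220.0, 503), (10.4, 200), (13.7, 200)]

# Golden signals: error rate, latency (90th percentile), saturation.
error_rate = sum(1 for _, status in window if status >= 500) / len(window)
p90_latency_ms = quantiles([lat for lat, _ in window], n=10)[8]
saturation = len(window) / 5000.0  # assumed capacity: 5,000 requests/minute

print(f"error rate: {error_rate:.0%}")
print(f"p90 latency: {p90_latency_ms:.1f} ms")
print(f"saturation: {saturation:.2%}")
```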

Lee: The term SMART goals comes to mind, right? That, I think, is a good measure. I know the idea of SMART goals hasn't really been tied to SLAs too closely, but I think there are a lot of similarities here. SMART goals have five specific criteria: they're specific, measurable, attainable, relevant, and time-bound. I think all five of those actually apply here as well. When you create your SLAs, they have to be specific. You can't say, yeah, we'll meet your needs. That's not a good experience. In my mind, a good measurement is something like: we will maintain five-millisecond latency on average for 90% of all requests that come in. And I also like to put in an "assuming": assuming you meet these criteria, such as the traffic load being less than X number of requests, or whatever the criteria are. So in my mind, it's a specific measurement, with bounds for what that means, under assumptions, and these are the assumptions. So something like: five milliseconds average latency for 90% of requests, assuming the request rate is less than 5,000 requests per second. And you could also add: assuming the request rate is at least 100 per second, because warming caches can have an effect there too, and things like that. The SLA applies when both of those assumptions hold, so you can have bounds on both ends. Something like that is very specific. And it's measurable: all of those numbers I specified are things you can measure, things you can see. Specific, measurable.
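As a minimal sketch of the bounded SLA Lee describes, here is one way it could be written down and checked against measured latencies. The class name and thresholds are hypothetical, and the reading of "five milliseconds on average for 90% of requests" (the fastest 90% of requests averaging under 5 ms) is one assumed interpretation:

```python
from dataclasses import dataclass

@dataclass
class BoundedSLA:
    """An SLA with explicit 'assuming' bounds. All names and numbers
    here are illustrative, not a standard or a real service's SLA."""
    max_avg_latency_ms: float  # we will maintain this average latency...
    target_fraction: float     # ...for this fraction of requests...
    min_request_rate: float    # ...assuming at least this traffic (warm caches)
    max_request_rate: float    # ...and at most this traffic

    def applies(self, request_rate: float) -> bool:
        """The SLA only binds while traffic is inside the assumed bounds."""
        return self.min_request_rate <= request_rate <= self.max_request_rate

    def is_met(self, latencies_ms: list[float], request_rate: float) -> bool:
        """Assumed reading: the fastest target_fraction of requests must
        average under the bound. Vacuously met outside the assumptions."""
        if not self.applies(request_rate) or not latencies_ms:
            return True
        fastest = sorted(latencies_ms)
        cutoff = max(1, int(len(fastest) * self.target_fraction))
        sample = fastest[:cutoff]
        return sum(sample) / len(sample) <= self.max_avg_latency_ms

# Five milliseconds on average for 90% of requests, assuming between
# 100 and 5,000 requests per second, as in the example above.
sla = BoundedSLA(5.0, 0.90, 100.0, 5000.0)
print(sla.is_met([2.1, 3.4, 4.9, 6.2, 40.0], request_rate=1200.0))  # True
```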

Lee: You want to make sure they're attainable within the service. That's your responsibility as the owner of a service. If another team says, "I need this level of performance," it is your responsibility as the owner, before you accept that, to say, yes, I can do that. So they have to be attainable to you.

Beth: And this actually gets at something very important in implementing these sorts of things, which is to make sure you are starting with goals that are near what you're currently, actually doing, and stepping your way towards improvement, instead of setting impossible goals and then punishing teams when they don't achieve something that was so far outside of their ability.

Lee: Oh, absolutely. There are two things that make a goal bad. One is when the goal is so easy that it's irrelevant. The other is when it's so difficult that it's never hit. In the case of SLAs, your goal needs to be to hit the SLA 100% of the time, but the SLA can't be three times anything you're ever going to see. That gives you plenty of room to have all sorts of problems, but it doesn't make the SLA relevant to the consumer of the goal. They need something better than that. That's where attainable, and that's where relevant, comes in. And relevant is so important, because it's so tempting: when it's the engineers that set those goals, those objectives, in isolation, you tend to get things that are measurable and specific and attainable, but not relevant, right? "I will guarantee my service will have a latency of less than 37 seconds for this simple request. Guaranteed, I can promise you that." And the consumer will say, "Well, I'm sorry, I need ten milliseconds. 37 seconds doesn't work." That sounds like an absurd number, but you and I have both heard numbers like that, right? Numbers so far out of bounds they're totally irrelevant, not worth even discussing.

Beth: Yes, and a sneakier example would be something like setting an objective around how your infrastructure is behaving, in ways that don't translate directly to the benefit to the customer. Say you own a web service that is serving end users directly, and your primary measures of system health are around CPU and I/O. Well, those might tell you something about what's happening, but they are not directly relevant to the customer. You need to have those on your dashboards for when you're troubleshooting, when there is a problem, but they're not indicating the health of the system.

Lee: Right. So: specific, measurable, attainable, relevant. Relevant means the consumers of your service have to find them useful. Attainable means that you, as the provider of the service, need to be able to meet them. Measurable means they need to be measurable, and specific means they can't be general-purpose and ambiguous; they have to be very specific. So all of those make sense. Does time-bound really apply here?

Beth: I think it does, but in the sense that when you're setting these agreements, you tend to say, this is my commitment, and you tend to measure over a span of time, and there is a sense of the clock getting reset.

Lee: That's true. "We'll handle this much traffic over this period of time." You're right, that's a form of time-bound. I think when people talk about SMART goals, they're really talking about the time when you'll accomplish the goal, and what we're saying is that the time you accomplish the goal is now.

Beth: It's not really a goal, it's an agreement. It's more of a habit than a goal.

Lee: And that's actually a good point. These aren't goals, as in "I'm going to try to make this." No, this is what you're going to be performing to. And you can change them and improve them over time. You can have a goal that says, I'm going to improve my SLA over time and make my SLA twice as good by this date. That's a perfectly fine goal. But that's what a goal is, versus an SLA. Your SLA is something like five-millisecond latency at less than 10,000 requests per second, and you can say, that's great, I have a goal to make it a two-millisecond latency at 5,000 requests per second by this time next quarter. At that point in time, your SLA is then two milliseconds. But the SLA is what it is: what you're agreeing to, committing to, now. It's a failure if you don't meet it right now. As opposed to a goal, which is what you're striving towards.

Beth: Yeah, towards completing something. Right. One anecdote, a well-known anecdote that I think is interesting to talk about here, is the example Google gave, in the SRE book, of actually overshooting: having a service that was too reliable. I can't remember which service it was off the top of my head, but they had a service that they did not want to guarantee 100% uptime for, and they ended up over-delivering on quality for a while. And when that service did fail, users were incensed, because there was sort of this implicit SLA: well, it's been performing so well. And what I love about that story is that they ended up deliberately introducing failures into the system so that users would not become accustomed to too high a performance level. What this underscores is how much this is ultimately about the experience of whatever person it is that needs to use your service. This is not a purely technical problem. This is very much about understanding how your system can be maximally healthy and maximally serve whoever it is that's using it.

Lee: So I love that story. I didn't know that story before, but it plays very well into the Netflix Chaos Monkey approach to testing. That is the idea that the way you ensure your system as a whole keeps performing is to keep causing it to fail on a regular basis, to make sure that you can handle those failures. I'm sure at some point we're going to do an episode on Chaos Monkey; matter of fact, we should add it to our list. What Chaos Monkey is all about is the idea that you intentionally insert faults into your system at irregular times, so that the self-healing responses your application is supposed to have around those problems get tested, and you can make sure they actually occur. Now, you don't do this in staging, you don't do this in dev; you do it in production. But you do it in production during times when people are around, so that if it does cause a real problem, if you turn off a service and that causes a real problem and customers are really affected, everyone's on board and you can solve the problem right away, as opposed to the exact same thing happening by happenstance at 2:00 in the morning, when everyone's drowsy and sleepy and doesn't know what's going on. You can address the problem right there, right then, as opposed to later on. And the other thing it helps with is the problem you were addressing, which is getting too used to things working. Say you deploy a new change. Let's say I own service A, and I call service B, and I need to expect that service B will fail occasionally. Well, I'm going to write code into service A to do different things if service B doesn't work. What if I introduce an error into that code that I'm not aware of, and then I deploy my code? It's going to function, it's going to work, everything's going to be fine, until service B fails, and then service A is also going to fail. But if service B is regularly failing, you're going to notice that a lot sooner, perhaps immediately after deployment, and you're going to be able to fix the problem: roll it back if necessary, or roll forward with a fix, to get the situation resolved.
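Here is a minimal sketch of that service A and service B scenario, with a Chaos-Monkey-style fault injector exercising the fallback path. Every name, the failure rate, and the fallback value are illustrative assumptions, not Netflix's actual implementation:

```python
import random

CHAOS_ENABLED = True       # assumption: enabled in production, as discussed
CHAOS_FAILURE_RATE = 0.01  # inject a fault into ~1% of calls (illustrative)

class ServiceBError(Exception):
    """Raised when service B fails, really or via injected chaos."""

def call_service_b(request: str) -> str:
    """Stand-in for a real network call to service B (hypothetical)."""
    if CHAOS_ENABLED and random.random() < CHAOS_FAILURE_RATE:
        raise ServiceBError("chaos-injected failure")
    return f"b-result-for-{request}"

def handle_request(request: str) -> str:
    """Service A's handler. The fallback path below is exactly the code
    that regularly injected failures keep exercised, so a bug in it shows
    up right after deployment instead of during a 2:00 a.m. outage."""
    try:
        return call_service_b(request)
    except ServiceBError:
        return "cached-or-degraded-result"  # fallback; illustrative

for i in range(5):
    print(handle_request(f"req-{i}"))
```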

Lee: The more chaotic an environment you put code into, the more stable that code is going to be. It's a weird thought, but the more chaotic a system, the more stable the code in that system behaves over the long term.

Beth: I'm so glad you bring this up. And what I love about this is that we're really touching on similar themes in different contexts, because both chaos engineering and the DevOps approach are really about understanding that we don't just have a technical system, we have a sociotechnical system: this intertwined human-and-technology system. With DevOps, one of the advantages is that it changes the behavior of the people who are creating the system itself. Because again, if you're going to deploy code and you know that if something goes wrong, it's going to wake up that person over there that you don't even know, you just build your services differently. You're not as rigorous as when you know you're going to be the one woken up at 2:00 a.m. And similarly with chaos engineering: if you know that service B is absolutely going to fail in the coming week, you're just going to say, well, I may as well deal with this now. As opposed to, well, I'm under deadline, service B is usually stable, I'm just going to run the risk and we'll deal with it later. So it really drives the behavior that gets built into systems.

Lee: Right.

Beth: And the other thing I love about how you unpacked chaos engineering is that it works on this very counterintuitive idea that you should be running towards incidents and problems instead of running away from them; you should embrace them. And that will actually help you, as you said, make the system more stable, because you are proactively encountering those issues rather than letting them come to you.

Lee: Yeah, that's absolutely great. You're right: we're not talking about coding, we're talking about social systems here. We're talking about systems of people that happen to include code, as opposed to systems of code. The vast majority of incidents that happen have a social component to them; they're not just a code problem. It's someone who said "this is good enough," or someone who didn't spend the time to think about whether or not it would be good enough, and therefore missed something. And these aren't bad people doing bad things. These are good people making mistakes that are caused by the environment in which they're working. And that's why the environment, and systems of people, and how they're structured and how they're organized, are so important. I keep hearing people say that how you organize your company is irrelevant, that it shouldn't matter. Nothing could be further from the truth. The way you organize a company matters. I hate saying it this way, because I don't always live by it myself, but how clean your desk is is a good indication of how clean the system is. I don't mean that literally, because I've had dirty desks too, but it really is a good indication here: how well you organize your environment, how well you organize your team, how well you organize your organization gives an indication of how well you're going to perform as a company.

Beth: Yes. When we look at the realm of incidents, which are messy and frustrating and scary and expensive, every tech company knows that they are probably one really bad incident away from going out of business. Every company knows that there's that really bad thing that could collapse the whole structure. So incidents are really high stakes, and that drives us to look for certainty and clarity. And so we look to a lot of these things that people have been talking about for years around incident metrics. You've got your mean time metrics: what's your mean time to resolution, or your mean time between failures. It's this attempt to bring some kind of order and sense to this very scary and chaotic world of incidents.
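As a minimal sketch of what those mean time metrics actually compute, here is MTTR and MTBF over an invented incident log. The skew in the sample hints at the critique that follows, since the mean flattens the one long outage that matters:

```python
from datetime import datetime, timedelta

# Invented incident log: (started, resolved) pairs. Illustrative only.
incidents = [
    (datetime(2023, 1, 3, 2, 10), datetime(2023, 1, 3, 2, 55)),   # 45 min
    (datetime(2023, 1, 9, 14, 0), datetime(2023, 1, 9, 14, 20)),  # 20 min
    (datetime(2023, 2, 1, 8, 30), datetime(2023, 2, 1, 17, 45)),  # 9 h 15 min
]

def mean_time_to_resolution(incs) -> timedelta:
    durations = [resolved - started for started, resolved in incs]
    return sum(durations, timedelta()) / len(durations)

def mean_time_between_failures(incs) -> timedelta:
    starts = sorted(started for started, _ in incs)
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return sum(gaps, timedelta()) / len(gaps)

# Two short incidents and one nine-hour outage come out as a single
# unremarkable-looking number: the "shallow metric" problem.
print(mean_time_to_resolution(incidents))      # 3:26:40
print(mean_time_between_failures(incidents))   # 14 days, 15:10:00
```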

Beth: But so many of those, what are now often being called shallow incident metrics, end up giving short shrift to what we were just talking about, which is that this is a very complex system. The technology itself is very complex; the sociotechnical system is complex. We're trying to get a handle on how you surface those complexities and make them intelligible and sensible, without falling back on some of these shallow metrics. Niall Murphy, one of the authors of the original SRE book, had a paper out recently where he unpacks the ways that these mean time and other shallow metrics aren't statistically meaningful and aren't helping us make good decisions in the wake of these incidents. And so much of what we're talking about with SLAs is how you make decisions about what work you're going to do and how much you invest in reliability versus new features, and incident follow-up is so much about what decisions we make based on what we learned in this event.

Lee: Yeah, you add a whole new dimension to the metrics discussion here, because it's so easy to think about metrics along the lines of how we're performing, and when we don't perform, it's a failure: oops. But there's a lot of data in the oops. And you're right: things like mean time to detect and mean time to resolution are important, but they're very superficial compared to the depth you can get. And I'm not talking about "Joe's team caused five incidents last week; that's a problem for Joe." I'm not talking about that. I'm talking about uncovering the sophisticated connections between things that can cause problems to occur.

Lee: Thank you for tuning in to Modern Digital Business. This podcast exists because of the support of you, my listeners. If you enjoy what you hear, will you please leave a review on Apple Podcasts or directly on our website at mdb.fm/reviews. If you'd like to suggest a topic for an episode, or you are interested in becoming a guest, please contact me directly by sending me a message at mdb.fm/contact. And if you'd like to record a quick question or comment, click the microphone icon in the lower right-hand corner of our website; your recording might be featured on a future episode. To make sure you get every new episode when it becomes available, click subscribe in your favorite podcast player, or check out our website at mdb.fm. If you want to learn more from me, check out one of my books, courses, or articles by going to leeatchison.com. All of these links are included in the show notes. Thank you for listening, and welcome to the world of the modern digital business.