Lee: Modern applications require modern operations, and modern operations require a new definition of ownership, one that most classical organizations must provide. Today I continue my discussion on modern ops with Beth Long. Are you ready? Let's go.

Announcer: This is the Modern Digital Business Podcast, the technical leader's guide to modernizing your applications and digital business. Whether you're a business technology leader or a small business innovator, keeping up with the digital business revolution is a must. Here to help make it easier, with actionable insights and recommendations as well as thoughtful interviews with industry experts: Lee Atchison.

Lee: In this episode of Modern Digital Business, I continue my conversation on modern operations with my good friend, SRE engineer and operations manager Beth Long. This conversation, which focuses on service ownership and measurement, is a continuation of our conversation on SLAs in modern applications.

Lee: In a previous episode, we talked about STOSA, and this fits very much into that idea: the idea of how you organize your teams so that each team has a certain set of responsibilities. We won't go into all the details of STOSA, but the bottom line is that ownership is critical to the STOSA model. Ownership is critical to all DevOps models. If you own a service, you're responsible for how that service performs, because other teams are depending on you to perform. And the definition of what it means to perform is what an SLA is all about. So, what does a good SLA look like, Beth? That's a great question. Let's get to the measurement.

Beth: It does get into measurement, and that is always a hard question to answer. If you look at the textbook discussions of SLIs and SLOs, and SLAs in particular, you'll often see references to a lot of the things that are measurable. So you'll have your golden signals: error rate, latency, saturation. You have these things that allow you to say, okay, we're going to tolerate this many errors, or this many of this type of error, or this much latency. But all of that is trying to distill the customer experience down into things that can be measured and put on a dashboard.
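As a rough sketch of the golden signals Beth mentions, here is how error rate, latency, and saturation might be computed over a window of request data. The window format, all of the numbers, and the assumed capacity are invented for illustration:

```python
from statistics import quantiles

# Hypothetical one-minute window of requests: (latency_ms, http_status).
# All of the data and the capacity figure are invented for illustration.
window = [(12.0, 200), (8.5, 200), (95.0, 500), (11.2, 200), (430.0, 200),
          (9.9, 200), (14.1, 200), (220.0, 503), (10.4, 200), (13.7, 200)]

# Golden signals: error rate, latency (90th percentile), saturation.
error_rate = sum(1 for _, status in window if status >= 500) / len(window)
p90_latency_ms = quantiles([lat for lat, _ in window], n=10)[8]
saturation = len(window) / 5000.0  # assumed capacity: 5,000 requests/minute

print(f"error rate: {error_rate:.0%}")
print(f"p90 latency: {p90_latency_ms:.1f} ms")
print(f"saturation: {saturation:.2%}")
```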

Lee: The term SMART goals comes to mind, right? That, I think, is a good measure. I know the idea of SMART goals hasn't really been tied to SLAs too closely, but I think there are a lot of similarities here. SMART goals have five specific criteria: they're specific, measurable, attainable, relevant, and time-bound. I think all five of those actually apply here as well. When you create your SLAs, they have to be specific. You can't say, yeah, we'll meet your needs. That's not a good experience. In my mind, a good measurement is something like: we will maintain five-millisecond latency on average for 90% of all requests that come in. And I also like to put in an "assuming": assuming you meet these criteria, such as the traffic load being less than X number of requests, or whatever the criteria are. So in my mind, it's a specific measurement, with bounds for what that means, under assumptions, and these are the assumptions. So something like: five milliseconds average latency for 90% of requests, assuming the request rate is less than 5,000 requests per second. And you could also add: assuming the request rate is at least 100 per second, because warming caches can have an effect there too, and things like that. The SLA applies when both of those assumptions hold, so you can have bounds on both ends. Something like that is very specific. And it's measurable: all of those numbers I specified are things you can measure, things you can see. Specific, measurable.
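As a minimal sketch of the bounded SLA Lee describes, here is one way it could be written down and checked against measured latencies. The class name and thresholds are hypothetical, and the reading of "five milliseconds on average for 90% of requests" (the fastest 90% of requests averaging under 5 ms) is one assumed interpretation:

```python
from dataclasses import dataclass

@dataclass
class BoundedSLA:
    """An SLA with explicit 'assuming' bounds. All names and numbers
    here are illustrative, not a standard or a real service's SLA."""
    max_avg_latency_ms: float  # we will maintain this average latency...
    target_fraction: float     # ...for this fraction of requests...
    min_request_rate: float    # ...assuming at least this traffic (warm caches)
    max_request_rate: float    # ...and at most this traffic

    def applies(self, request_rate: float) -> bool:
        """The SLA only binds while traffic is inside the assumed bounds."""
        return self.min_request_rate <= request_rate <= self.max_request_rate

    def is_met(self, latencies_ms: list[float], request_rate: float) -> bool:
        """Assumed reading: the fastest target_fraction of requests must
        average under the bound. Vacuously met outside the assumptions."""
        if not self.applies(request_rate) or not latencies_ms:
            return True
        fastest = sorted(latencies_ms)
        cutoff = max(1, int(len(fastest) * self.target_fraction))
        sample = fastest[:cutoff]
        return sum(sample) / len(sample) <= self.max_avg_latency_ms

# Five milliseconds on average for 90% of requests, assuming between
# 100 and 5,000 requests per second, as in the example above.
sla = BoundedSLA(5.0, 0.90, 100.0, 5000.0)
print(sla.is_met([2.1, 3.4, 4.9, 6.2, 40.0], request_rate=1200.0))  # True
```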

Lee: You want to make sure they're attainable within the service. That's your responsibility as the owner of a service. If another team says, "I need this level of performance," it is your responsibility as the owner, before you accept that, to say, yes, I can do that. So they have to be attainable to you.

Beth: And this actually gets at something very important in implementing these sorts of things, which is to make sure you are starting with goals that are near what you're currently, actually doing, and stepping your way towards improvement, instead of setting impossible goals and then punishing teams when they don't achieve something that was so far outside of their ability.

Lee: Oh, absolutely. There are two things that make a goal bad. One is when the goal is so easy that it's irrelevant. The other is when it's so difficult that it's never hit. In the case of SLAs, your goal needs to be to hit the SLA 100% of the time, but the SLA can't be three times anything you're ever going to see. That gives you plenty of room to have all sorts of problems, but it doesn't make the SLA relevant to the consumer of the goal. They need something better than that. That's where attainable, and that's where relevant, comes in. And relevant is so important, because it's so tempting: when it's the engineers that set those goals, those objectives, in isolation, you tend to get things that are measurable and specific and attainable, but not relevant, right? "I will guarantee my service will have a latency of less than 37 seconds for this simple request. Guaranteed, I can promise you that." And the consumer will say, "Well, I'm sorry, I need ten milliseconds. 37 seconds doesn't work." That sounds like an absurd number, but you and I have both heard numbers like that, right? Numbers so far out of bounds they're totally irrelevant, not worth even discussing.

Beth: Yes, and a sneakier example would be something like setting an objective around how your infrastructure is behaving, in ways that don't translate directly to the benefit to the customer. Say you own a web service that is serving end users directly, and your primary measures of system health are around CPU and I/O. Well, those might tell you something about what's happening, but they are not directly relevant to the customer. You need to have those on your dashboards for when you're troubleshooting, when there is a problem, but they're not indicating the health of the system.

Lee: Right. So: specific, measurable, attainable, relevant. Relevant means the consumers of your service have to find them useful. Attainable means that you, as the provider of the service, need to be able to meet them. Measurable means they need to be measurable, and specific means they can't be general-purpose and ambiguous; they have to be very specific. So all of those make sense. Does time-bound really apply here?

Beth: I think it does, but in the sense that when you're setting these agreements, you tend to say, this is my commitment, and you tend to measure over a span of time, and there is a sense of the clock getting reset.

Lee: That's true. "We'll handle this much traffic over this period of time." You're right, that's a form of time-bound. I think when people talk about SMART goals, they're really talking about the time when you'll accomplish the goal, and what we're saying is that the time you accomplish the goal is now.

Beth: It's not really a goal, it's an agreement. It's more of a habit than a goal.

Lee: And that's actually a good point. These aren't goals, as in "I'm going to try to make this." No, this is what you're going to be performing to. And you can change them and improve them over time. You can have a goal that says, I'm going to improve my SLA over time and make my SLA twice as good by this date. That's a perfectly fine goal. But that's what a goal is, versus an SLA. Your SLA is something like five-millisecond latency at less than 10,000 requests per second, and you can say, that's great, I have a goal to make it a two-millisecond latency at 5,000 requests per second by this time next quarter. At that point in time, your SLA is then two milliseconds. But the SLA is what it is: what you're agreeing to, committing to, now. It's a failure if you don't meet it right now. As opposed to a goal, which is what you're striving towards.

Beth: Yeah, towards completing something. Right. One anecdote, a well-known anecdote that I think is interesting to talk about here, is the example Google gave, in the SRE book, of actually overshooting: having a service that was too reliable. I can't remember which service it was off the top of my head, but they had a service that they did not want to guarantee 100% uptime for, and they ended up over-delivering on quality for a while. And when that service did fail, users were incensed, because there was sort of this implicit SLA: well, it's been performing so well. And what I love about that story is that they ended up deliberately introducing failures into the system so that users would not become accustomed to too high a performance level. What this underscores is how much this is ultimately about the experience of whatever person it is that needs to use your service. This is not a purely technical problem. This is very much about understanding how your system can be maximally healthy and maximally serve whoever it is that's using it.

Lee: So I love that story. I didn't know that story before, but it plays very well into the Netflix Chaos Monkey approach to testing. That is the idea that the way you ensure your system as a whole keeps performing is to keep causing it to fail on a regular basis, to make sure that you can handle those failures. I'm sure at some point we're going to do an episode on Chaos Monkey; matter of fact, we should add it to our list. What Chaos Monkey is all about is the idea that you intentionally insert faults into your system at irregular times, so that the self-healing responses your application is supposed to have around those problems get tested, and you can make sure they actually occur. Now, you don't do this in staging, you don't do this in dev; you do it in production. But you do it in production during times when people are around, so that if it does cause a real problem, if you turn off a service and that causes a real problem and customers are really affected, everyone's on board and you can solve the problem right away, as opposed to the exact same thing happening by happenstance at 2:00 in the morning, when everyone's drowsy and sleepy and doesn't know what's going on. You can address the problem right there, right then, as opposed to later on. And the other thing it helps with is the problem you were addressing, which is getting too used to things working. Say you deploy a new change. Let's say I own service A, and I call service B, and I need to expect that service B will fail occasionally. Well, I'm going to write code into service A to do different things if service B doesn't work. What if I introduce an error into that code that I'm not aware of, and then I deploy my code? It's going to function, it's going to work, everything's going to be fine, until service B fails, and then service A is also going to fail. But if service B is regularly failing, you're going to notice that a lot sooner, perhaps immediately after deployment, and you're going to be able to fix the problem: roll it back if necessary, or roll forward with a fix, to get the situation resolved.
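Here is a minimal sketch of that service A and service B scenario, with a Chaos-Monkey-style fault injector exercising the fallback path. Every name, the failure rate, and the fallback value are illustrative assumptions, not Netflix's actual implementation:

```python
import random

CHAOS_ENABLED = True       # assumption: enabled in production, as discussed
CHAOS_FAILURE_RATE = 0.01  # inject a fault into ~1% of calls (illustrative)

class ServiceBError(Exception):
    """Raised when service B fails, really or via injected chaos."""

def call_service_b(request: str) -> str:
    """Stand-in for a real network call to service B (hypothetical)."""
    if CHAOS_ENABLED and random.random() < CHAOS_FAILURE_RATE:
        raise ServiceBError("chaos-injected failure")
    return f"b-result-for-{request}"

def handle_request(request: str) -> str:
    """Service A's handler. The fallback path below is exactly the code
    that regularly injected failures keep exercised, so a bug in it shows
    up right after deployment instead of during a 2:00 a.m. outage."""
    try:
        return call_service_b(request)
    except ServiceBError:
        return "cached-or-degraded-result"  # fallback; illustrative

for i in range(5):
    print(handle_request(f"req-{i}"))
```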

Lee: The more chaotic an environment you put code into, the more stable that code is going to be. It's a weird thought, but the more chaotic a system, the more stable the code in that system behaves over the long term.

Beth: I'm so glad you bring this up. And what I love about this is that we're really touching on similar themes in different contexts, because both chaos engineering and the DevOps approach are really about understanding that we don't just have a technical system, we have a sociotechnical system: this intertwined human-and-technology system. With DevOps, one of the advantages is that it changes the behavior of the people who are creating the system itself. Because again, if you're going to deploy code and you know that if something goes wrong, it's going to wake up that person over there that you don't even know, you just build your services differently. You're not as rigorous as when you know you're going to be the one woken up at 2:00 a.m. And similarly with chaos engineering: if you know that service B is absolutely going to fail in the coming week, you're just going to say, well, I may as well deal with this now. As opposed to, well, I'm under deadline, service B is usually stable, I'm just going to run the risk and we'll deal with it later. So it really drives the behavior that gets built into systems.

Lee: Right.

Beth: And the other thing I love about how you unpacked chaos engineering is that it works on this very counterintuitive idea that you should be running towards incidents and problems instead of running away from them; you should embrace them. And that will actually help you, as you said, make the system more stable, because you are proactively encountering those issues rather than letting them come to you.

Lee: Yeah, that's absolutely great. You're right: we're not talking about coding, we're talking about social systems here. We're talking about systems of people that happen to include code, as opposed to systems of code. The vast majority of incidents that happen have a social component to them; they're not just a code problem. It's someone who said "this is good enough," or someone who didn't spend the time to think about whether or not it would be good enough, and therefore missed something. And these aren't bad people doing bad things. These are good people making mistakes that are caused by the environment in which they're working. And that's why the environment, and systems of people, and how they're structured and how they're organized, are so important. I keep hearing people say that how you organize your company is irrelevant, that it shouldn't matter. Nothing could be further from the truth. The way you organize a company matters. I hate saying it this way, because I don't always live by it myself, but how clean your desk is is a good indication of how clean the system is. I don't mean that literally, because I've had dirty desks too, but it really is a good indication here: how well you organize your environment, how well you organize your team, how well you organize your organization gives an indication of how well you're going to perform as a company.

Beth: Yes. When we look at the realm of incidents, which are messy and frustrating and scary and expensive, every tech company knows that they are probably one really bad incident away from going out of business. Every company knows that there's that really bad thing that could collapse the whole structure. So incidents are really high stakes, and that drives us to look for certainty and clarity. And so we look to a lot of these things that people have been talking about for years around incident metrics. You've got your mean time metrics: what's your mean time to resolution, or your mean time between failures. It's this attempt to bring some kind of order and sense to this very scary and chaotic world of incidents.
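As a minimal sketch of what those mean time metrics actually compute, here is MTTR and MTBF over an invented incident log. The skew in the sample hints at the critique that follows, since the mean flattens the one long outage that matters:

```python
from datetime import datetime, timedelta

# Invented incident log: (started, resolved) pairs. Illustrative only.
incidents = [
    (datetime(2023, 1, 3, 2, 10), datetime(2023, 1, 3, 2, 55)),   # 45 min
    (datetime(2023, 1, 9, 14, 0), datetime(2023, 1, 9, 14, 20)),  # 20 min
    (datetime(2023, 2, 1, 8, 30), datetime(2023, 2, 1, 17, 45)),  # 9 h 15 min
]

def mean_time_to_resolution(incs) -> timedelta:
    durations = [resolved - started for started, resolved in incs]
    return sum(durations, timedelta()) / len(durations)

def mean_time_between_failures(incs) -> timedelta:
    starts = sorted(started for started, _ in incs)
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return sum(gaps, timedelta()) / len(gaps)

# Two short incidents and one nine-hour outage come out as a single
# unremarkable-looking number: the "shallow metric" problem.
print(mean_time_to_resolution(incidents))      # 3:26:40
print(mean_time_between_failures(incidents))   # 14 days, 15:10:00
```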

Beth: But so many of those, what are now often being called shallow incident metrics, end up giving short shrift to what we were just talking about, which is that this is a very complex system. The technology itself is very complex; the sociotechnical system is complex. We're trying to get a handle on how you surface those complexities and make them intelligible and sensible, without falling back on some of these shallow metrics. Niall Murphy, one of the authors of the original SRE book, had a paper out recently where he unpacks the ways that these mean time and other shallow metrics aren't statistically meaningful and aren't helping us make good decisions in the wake of these incidents. And so much of what we're talking about with SLAs is how you make decisions about what work you're going to do and how much you invest in reliability versus new features, and incident follow-up is so much about what decisions we make based on what we learned in this event.

Lee: Yeah, you add a whole new dimension to the metrics discussion here, because it's so easy to think about metrics along the lines of how we're performing, and when we don't perform, it's a failure: oops. But there's a lot of data in the oops. And you're right: things like mean time to detect and mean time to resolution are important, but they're very superficial compared to the depth you can get. And I'm not talking about "Joe's team caused five incidents last week; that's a problem for Joe." I'm not talking about that. I'm talking about uncovering the sophisticated connections between things that can cause problems to occur.

Lee: Thank you for tuning in to Modern Digital Business. This podcast exists because of the support of you, my listeners. If you enjoy what you hear, will you please leave a review on Apple Podcasts or directly on our website at mdb.fm/reviews. If you'd like to suggest a topic for an episode, or you are interested in becoming a guest, please contact me directly by sending me a message at mdb.fm/contact. And if you'd like to record a quick question or comment, click the microphone icon in the lower right-hand corner of our website; your recording might be featured on a future episode. To make sure you get every new episode when it becomes available, click subscribe in your favorite podcast player, or check out our website at mdb.fm. If you want to learn more from me, check out one of my books, courses, or articles by going to leeatchison.com. All of these links are included in the show notes. Thank you for listening, and welcome to the world of the modern digital business.