Modern applications require modern operations, and modern operations require a new definition of ownership, one that most classical organizations must provide. Today I continue my discussion on modern ops with Beth Long. Are you ready? Let's go.
This is the Modern Digital Business Podcast, the technical leader's guide to modernizing your applications and digital business. Whether you're a business technology leader or a small business innovator, keeping up with the digital business revolution is a must. Here to help make it easier, with actionable insights and recommendations as well as thoughtful interviews with industry experts: Lee Atchison.

In this episode of Modern Digital Business, I continue my conversation on modern operations with my good friend, SRE engineer and operations manager Beth Long. This conversation, which focuses on service ownership and measurement, is a continuation of our conversation on SLAs in modern applications.
In a previous episode, we talked about STOSA, and this fits very much into that idea: how you organize your teams so that each team has a certain set of responsibilities. We won't go into all the details of STOSA, but the bottom line is that ownership is critical to the STOSA model. Ownership is critical to all DevOps models. If you own a service, you're responsible for how that service performs, because other teams are depending on you to perform. And the definition of what it means to perform is what an SLA is all about.
Yeah. So what does a good SLA look like, Beth? That's a great question. Let's get to the measurement.

It does get into measurement. That is always a hard question to answer. If you look at the textbook discussions of SLIs and SLOs, and SLAs in particular, you'll often see references to a lot of the things that are measurable. So you'll have your golden signals: error rate, latency, saturation. You have these things that allow you to say, okay, we're going to tolerate this many errors, or this many of this type of error, or this much latency. But all of that is trying to distill the customer experience down into things that can be measured and put on a dashboard.

The term SMART goals comes to mind. That, I think, is a good measure. I know the idea of SMART goals hasn't really been tied too closely to SLAs, but I think there are a lot of similarities here. SMART goals have five criteria: they're specific, measurable, attainable, relevant, and time-bound. I think all five of those apply here as well. When you create your SLAs, they have to be specific. You can't say, yeah, we'll meet your needs. That's not a good experience. In my mind, a good measurement is something like: we will maintain five milliseconds latency on average for 90% of all requests that come in.
And I also like to put in an "assuming." Assuming you meet these criteria, such as the traffic load being less than X number of requests, or whatever the criteria are. So in my mind, it's a specific measurement, with bounds for what that means, under stated assumptions. Something like five milliseconds average latency for 90% of requests, assuming the request rate is less than 5,000 requests per second. And you could also have: assuming the request rate is at least 100 per second, because warming caches can have an effect there too. So you can have bounds on both ends.

Something like that is very specific. It's measurable: all of those numbers I specified are things you could measure, things you could see. Specific, measurable. You also want to make sure they're attainable within the service. That's your responsibility as the owner of a service. If another team says, I need this level of performance, it is your responsibility as the owner, before you accept that, to say, yes, I can do that. So they have to be attainable for you.

And this actually gets at something very important in implementing these sorts of things, which is to make sure that you are starting with goals that are near what you're currently, actually doing, and step your way towards improvement, instead of setting impossible goals and then punishing teams when they don't achieve something that was far outside their ability.

Oh, absolutely. There are two things that make a goal bad. One is when the goal is so easy that it's irrelevant. The other is when it's so difficult that it's never hit. In the case of SLAs, your goal needs to hit the SLA 100% of the time, but it can't be three times what you're ever going to see, because that gives you plenty of room to have all sorts of problems, and then it's not relevant to the consumer of the goal. They need something better than that. That's where attainable, and that's where relevant, comes in.

And relevant is so important, because it's so tempting. This is where, when it's the engineers setting those goals, those objectives, in isolation, you tend to get things that are measurable and specific and attainable but not relevant. Right? I will guarantee my service will have a latency of less than 37 seconds for this simple request. Guaranteed, I can promise you that, right? And the consumer will say, well, I'm sorry, I need ten milliseconds. 37 seconds sounds like an absurd number, but you and I have both heard numbers like that, right? Numbers so far out of bounds that they're totally irrelevant, not worth even discussing.
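As a rough sketch of how a bounded SLA like the one described above might be checked against measured data (the 5 ms, 90%, and 100 to 5,000 requests-per-second figures are just the illustrative numbers from the conversation, and the function name is hypothetical):

```python
def sla_met(latencies_ms, window_seconds,
            target_ms=5.0, fraction=0.90,
            min_rps=100, max_rps=5000):
    """Check one measurement window against an SLA of the form:
    'latency under target_ms for at least `fraction` of requests,
    assuming the request rate stays within agreed bounds'.
    Returns None when the assumptions are violated, because then
    the SLA simply doesn't apply to this window."""
    rps = len(latencies_ms) / window_seconds
    if not (min_rps <= rps <= max_rps):
        return None  # caller exceeded (or fell below) the agreed traffic bounds
    within_target = sum(1 for ms in latencies_ms if ms <= target_ms)
    return within_target / len(latencies_ms) >= fraction

# A 10-second window with 2,000 requests is 200 req/s, inside the
# assumed 100-5,000 req/s bounds, so the SLA applies and is checked.
ok = sla_met([3.0] * 1900 + [12.0] * 100, window_seconds=10)
```

Note how the "assuming" clauses become a precondition check rather than part of the pass/fail result, which matches the spirit of the agreement: if the consumer sends more traffic than agreed, the provider hasn't failed the SLA.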
Yes, and a sneakier example would be something like setting an objective around how your infrastructure is behaving, in ways that don't translate directly into benefit to the customer. Say you own a web service that serves end users directly, and your primary measures of system health are around CPU and I/O. Well, those might tell you something about what's happening, but they are not directly relevant to the customer. You need to have those on your dashboards for troubleshooting when there is a problem, but they don't indicate the health of the system.

Right. So: specific, measurable, attainable, relevant. Relevant means the consumer of your service has to find them useful. Attainable means that you, as the provider of the service, need to be able to meet them. Measurable means they need to be measurable, and specific means they can't be general purpose and ambiguous. They have to be very specific. So all those make sense. Does time-bound really apply here?

I think it does, but in the sense that when you're setting these agreements, you tend to say, this is my commitment, and you tend to measure over a span of time, and there is a sense of the clock getting reset.
That's true. We'll handle this much traffic over this period of time. You're right, that's a form of time-bound. I think when people talk about SMART goals, they're really talking about the time by which you'll accomplish the goal. And what we're saying here is that the time you accomplish the goal is now. It's not really a goal, it's an agreement, an ongoing commitment rather than a target you reach once.

And that's actually a good point. These aren't goals, as in "I'm going to try to make this." No, this is what you're going to be performing to, and you can change them and improve them over time. You can have a goal that says, I'm going to improve my SLA over time and make my SLA twice as good by this date. That's a perfectly fine goal. But that's what a goal is, versus an SLA. Your SLA is something like five-millisecond latency at less than 10,000 requests, and you can say, that's great, I have a goal to make it a two-millisecond latency at 5,000 requests by this time next quarter. At that point in time, your SLA becomes two milliseconds. But the SLA is what it is, what you're agreeing to and committing to now. It's a failure if you don't meet it right now. As opposed to a goal, which is what you're striving towards.

Yeah, towards completing something. Right.
One well-known anecdote that I think is interesting to bring up here is the example Google gave, in the SRE book, of actually overshooting and having a service that was too reliable. I can't remember which service it was off the top of my head, but they had a service for which they did not want to guarantee 100% uptime, yet they ended up over-delivering on quality for a while. And when that service did fail, users were incensed, because there was sort of an implicit SLA: well, it's been performing so well. What I love about that story is that they ended up deliberately introducing failures into the system so that users would not become accustomed to too high a performance level. And what this underscores is how much this is ultimately about the experience of whatever person needs to use your service. This is not a purely technical problem. It is very much about understanding how your system can be maximally healthy and maximally serve whoever is using it.

So I love that story. I didn't know it before, but it plays very well into the Netflix Chaos Monkey approach to testing. That is the idea that the way you ensure your system as a whole keeps performing is to keep causing it to fail on a regular basis, to make sure you can handle those failures.
So what Chaos Monkey does, and I'm sure at some point we're going to do an episode on Chaos Monkey (matter of fact, we should add it to our list), what Chaos Monkey is all about, is the idea that you intentionally insert faults into your system at irregular times, so that you can verify that the responses your application is supposed to have, the self-healing around the problems that occur, actually work. Now, you don't do this in staging, you don't do this in dev. You do it in production. But you do it in production during times when people are around, so that if it does cause a real problem, if you turn off a service and that causes a real problem and customers are really affected, everyone's on board and you can solve the problem right away, as opposed to the exact same thing happening by happenstance at 2:00 in the morning, when everyone's drowsy and sleeping and nobody knows what's going on. You can address the problem right then, rather than later on.

The other thing it helps with is the problem you were just describing, which is getting too used to things working. Say I own a service, Service A, and I call Service B, and I need to expect that Service B will fail occasionally. Well, I'm going to write code into Service A to do different things if Service B doesn't work. What if I introduce an error in that code that I'm not aware of, and then I deploy my code? Well, it's going to function, it's going to work, everything's going to be fine, until Service B fails, and then Service A is also going to fail. But if Service B is regularly failing, you're going to notice that a lot sooner, perhaps immediately after deployment, and you're going to be able to fix that problem, roll it back if necessary, or roll forward with a fix to get the situation resolved. The more chaos you put code into, the more stable the code is going to be. It's a weird thought, but the more chaotic a system is, the more stable the code in that system behaves over the long term.

I'm so glad you bring this up. What I love about this is that we're really touching on similar themes in different contexts, because both chaos engineering and the DevOps approach are really about understanding that we don't just have a technical system, we have a sociotechnical system, an intertwined human-and-technology system. One of the advantages of DevOps is that it changes the behavior of the people who are creating the system itself. Because, again, if you're going to deploy code and you know that if something goes wrong, it's going to wake up some person over there that you don't even know, you just build your services differently. You're not as rigorous as when you know you're going to be the one woken up at 2:00 a.m. And similarly with chaos engineering: if you know that Service B is absolutely going to fail in the coming week, you're just going to say, well, I may as well deal with this now. As opposed to, well, I'm under deadline, Service B is usually stable, I'm just going to run the risk and we'll deal with it later.
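The Service A / Service B pattern described above can be sketched in a few lines, with a toy fault-injection wrapper standing in for Chaos Monkey. All the names here (`chaotic`, `call_service_b`, the failure rate, the fallback value) are hypothetical illustrations, not Netflix's actual tooling:

```python
import random

class ServiceUnavailable(Exception):
    """Raised when a dependency is down (or a fault was injected)."""
    pass

def chaotic(call, failure_rate=0.1):
    """Toy chaos wrapper: make a dependency fail at random so the
    caller's fallback path gets exercised regularly, not just during
    rare real outages."""
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            raise ServiceUnavailable("injected fault")
        return call(*args, **kwargs)
    return wrapped

def call_service_b(user_id):
    # Stand-in for a real network call to Service B.
    return {"user_id": user_id, "recommendations": ["a", "b", "c"]}

# Service A wraps its dependency so that failures happen often enough
# to surface bugs in the fallback code soon after deployment.
service_b = chaotic(call_service_b, failure_rate=0.1)

def handle_request(user_id):
    try:
        return service_b(user_id)
    except ServiceUnavailable:
        # Degraded-mode response: Service A keeps answering
        # even when Service B is down.
        return {"user_id": user_id, "recommendations": []}
```

If the `except` branch here had a bug, injecting failures on roughly one call in ten would expose it almost immediately after deploy, instead of months later during a real 2:00 a.m. outage.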
So it really drives behavior, inserting chaos into systems. Right. And the other thing I love about how you unpacked chaos engineering is that it works on this very counterintuitive idea that you should be running towards incidents and problems instead of running away from them. You should embrace them. And that will actually help you, as you said, make the system more stable, because you are proactively encountering those issues rather than letting them come to you. Yeah, that's absolutely great.
That's great. Yeah, you're right. We're not talking about coding, we're talking about social systems here. We're talking about systems of people that happen to include code, as opposed to systems of code. And the vast majority of incidents that happen have a social component to them, not just a code problem. It's someone who said, this is good enough, or someone who didn't spend the time to think about whether or not it would be good enough, and therefore missed something. Right. And these aren't bad people doing bad things. These are good people making mistakes that are caused by the environment in which they're working. And that's why environment, and systems of people, and how they're structured and organized, is so important. I keep hearing people say that how you organize your company is irrelevant. Right? It shouldn't matter. Nothing could be further from the truth. It matters, the way you organize a company.
I hate saying it this way, because I don't always live up to it myself, but how clean your desk is is a good indication of how clean the system is. And I don't mean that literally, because I've had dirty desks too, but it really is a good indication. How well you organize your environment, how well you organize your team, how well you organize your organization gives an indication of how well you're going to perform as a company.

Yes. When we look at the realm of incidents, which are messy and frustrating and scary and expensive, every tech company knows that it is probably one really bad incident away from going out of business. Every company knows that there's that really bad thing that could collapse the whole structure. So incidents are really high stakes, and that drives us to look for certainty and clarity. And so we look to a lot of these things that people have been talking about for years around incident metrics. You've got your mean-time metrics: what's your mean time to resolution, or your mean time between failures? It's this attempt to bring some kind of order and sense to this very scary and chaotic world of incidents. But so many of those, what are now often being called shallow incident metrics, end up giving short shrift to what we were just talking about, which is that this is a very complex system.
The technology itself is very complex. The sociotechnical system is complex. We're trying to get a handle on how you surface those complexities and make them intelligible and sensible, without falling back on some of these shallow metrics. Niall Murphy, one of the authors of the original SRE book, had a paper out recently where he unpacks the ways that these mean-time and other shallow metrics aren't statistically meaningful and aren't helping us make good decisions in the wake of these incidents. And so much of what we're talking about with SLAs is: how do you make decisions about what work you're going to do, and how much you invest in reliability versus new features? And incident follow-up is so much about: what decisions do we make based on what we learned in this event?

Yeah, you add a whole new dimension to the metric discussion here, because it's so easy to think about metrics along the lines of how we're performing, and when we don't perform, it's a failure. Oops. But there's a lot of data in the oops. And you're right, things like mean time to detect and mean time to resolution are important, but they're very superficial compared to the depth you can get. And I'm not talking about "Joe's team caused five incidents last week, that's a problem for Joe." I'm not talking about that. I'm talking about uncovering the sophisticated connections between things that can cause problems to occur.
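One way to make the "shallow metrics" point concrete: incident durations are typically heavily skewed, so a mean-time number tracks a few outliers rather than the typical incident. A small illustration, with entirely made-up resolution times:

```python
from statistics import mean, median

# Hypothetical resolution times in minutes for ten incidents.
# One long outage dominates the mean.
durations = [12, 9, 15, 11, 8, 14, 10, 13, 9, 480]

mttr = mean(durations)       # 58.1 minutes
typical = median(durations)  # 11.5 minutes
```

The MTTR says incidents take about an hour to resolve, yet nine of the ten were resolved in a quarter of that time. A single aggregate number hides both the typical case and the one outlier that actually deserves a deep investigation, which is exactly the kind of connection shallow metrics can't surface.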
Thank you for tuning in to Modern Digital Business. This podcast exists because of the support of you, my listeners. If you enjoy what you hear, will you please leave a review on Apple Podcasts or directly on our website at mdb.fm/reviews. If you'd like to suggest a topic for an episode, or you're interested in becoming a guest, please contact me directly by sending me a message at mdb.fm/contact. And if you'd like to record a quick question or comment, click the microphone icon in the lower right-hand corner of our website. Your recording might be featured on a future episode. To make sure you get every new episode when it becomes available, click subscribe in your favorite podcast player, or check out our website at mdb.fm. If you want to learn more from me, check out one of my books, courses, or articles by going to leeatchison.com. All of these links are included in the show notes. Thank you for listening, and welcome to the world of the modern digital business.