Gary Williams:

The thing is, if you want to do a team building exercise,

Gary Williams:

forget all the assault courses and other things they have, you do get

Gary Williams:

the team together and do a restore.

Gary Williams:

Some of the best...

Gary Williams:

seriously.

W. Curtis Preston:

It's a bit like, the trust exercises where you lean

W. Curtis Preston:

backwards and catches you. It's like that.

W. Curtis Preston:

Hi, and welcome to Backup Central's Restore it All podcast.

W. Curtis Preston:

I'm your host, W. Curtis Preston, AKA Mr. Backup, and I have with me my table saw safety enthusiast, Prasanna Malaiyandi.

W. Curtis Preston:

How's it going Prasanna?

Prasanna Malaiyandi:

I'm good, Curtis.

Prasanna Malaiyandi:

I don't know if I'd call myself a safety enthusiast, but

W. Curtis Preston:

You don't believe in safety.

Prasanna Malaiyandi:

no, not at all.

Prasanna Malaiyandi:

Plus, I think you could say I'm a bad influence on you, seeing how much

Prasanna Malaiyandi:

equipment you've now started to accrue.

W. Curtis Preston:

Yeah, last night I watched, I don't know.

W. Curtis Preston:

I'm going to say two solid hours of just table saw safety videos.

Prasanna Malaiyandi:

Yeah, but it is good for you to refresh your

Prasanna Malaiyandi:

mind on what table saw safety means.

W. Curtis Preston:

Yeah.

W. Curtis Preston:

you do recall that table saw is the reason that this finger is missing

W. Curtis Preston:

the, or this hand is missing end.

W. Curtis Preston:

I'm missing the end of the middle finger on my left hand

W. Curtis Preston:

for those of you listening.

W. Curtis Preston:

so it's actually really hard for me to watch some of those videos.

Prasanna Malaiyandi:

Is it like, when you're doing like driver's

Prasanna Malaiyandi:

education learning to drive, they show what is that red asphalt.

Prasanna Malaiyandi:

Was that the name of the movie where it's like accidents happen and.

W. Curtis Preston:

Blood on the asphalt, I think is what that one's called.

W. Curtis Preston:

I do remember that one, but this one, there's one where a guy actually

W. Curtis Preston:

shows in the video, he doesn't have the board completely clear the blade

W. Curtis Preston:

when he takes his hand off of it.

W. Curtis Preston:

And the blade grabs the board and tosses it essentially at his groin area.

W. Curtis Preston:

And the thing is when you watch it, he looks at it one frame at a time.

W. Curtis Preston:

And the board goes from being on the other side of the blade to his groin

W. Curtis Preston:

in less than a frame of the video.

W. Curtis Preston:

And, so that's, one 30th of a second, probably.

W. Curtis Preston:

yeah.

W. Curtis Preston:

And he's like, don't do that.

W. Curtis Preston:

but yeah, it's been interesting, but the thing that's got me super

W. Curtis Preston:

excited right now, has been this new video editing or just editing tool.

W. Curtis Preston:

It's both video and audio and, and it's this thing called,

Prasanna Malaiyandi:

Descript,

W. Curtis Preston:

Descript.

W. Curtis Preston:

Yeah.

W. Curtis Preston:

And it's just.

Prasanna Malaiyandi:

you sounded so excited when you texted me.

W. Curtis Preston:

Oh, my God.

W. Curtis Preston:

it's hard to describe how amazing this tool is, where you input the, in my

W. Curtis Preston:

case, I'm in, I'm actually, because we're using video clips of these episodes.

W. Curtis Preston:

I'm inputting the video and I edit the video and then I excerpt

W. Curtis Preston:

the audio for the audio excerpts.

W. Curtis Preston:

It's made mainly for talking head videos like these, right?

W. Curtis Preston:

Or audio and you input the audio or video, it does, automated transcription,

W. Curtis Preston:

which gets about 95% accurate.

W. Curtis Preston:

And then you go through and you obviously, you can correct the things that it

W. Curtis Preston:

got wrong, but the really amazing part is if you start a sentence and you

W. Curtis Preston:

change your mind, or you have the, a lot of words going up to that sentence.

W. Curtis Preston:

All you have to do is highlight those words in the document and

W. Curtis Preston:

it cuts them out of the video.

Prasanna Malaiyandi:

It's like magic

W. Curtis Preston:

It's like magic.

W. Curtis Preston:

And then if that's not enough magic, the part that I'm super excited

W. Curtis Preston:

about trying is sometimes you say one word when you meant to say another.

Prasanna Malaiyandi:

that never happens to you, Curtis.

W. Curtis Preston:

Like the podcast I was editing yesterday...

W. Curtis Preston:

It was you and I talking about 365 and you don't want this

W. Curtis Preston:

to happen on your worst day.

W. Curtis Preston:

That's what I meant to say.

W. Curtis Preston:

But for some reason I said last day, so with this tool, first

W. Curtis Preston:

off, I train it with my voice.

W. Curtis Preston:

I literally speak into the microphone, a bunch of stuff.

W. Curtis Preston:

It can then synthesize my voice.

W. Curtis Preston:

And I can select that word and change the word last to worst,

W. Curtis Preston:

and it will put my voice there, a synthesized version of my voice.

Prasanna Malaiyandi:

So here's a question, Curtis, do we actually

Prasanna Malaiyandi:

need to have this podcast anymore?

Prasanna Malaiyandi:

Or can we just have not even just type it out.

Prasanna Malaiyandi:

Can we just have something auto-generate based on all of our past podcasts and

Prasanna Malaiyandi:

just have it start creating new podcasts.

W. Curtis Preston:

It'll just be a recording that says

W. Curtis Preston:

3-2-1 rule over and over.

Prasanna Malaiyandi:

No, but it's you know how they have, they've trained AI

Prasanna Malaiyandi:

to now do paintings and things like that.

Prasanna Malaiyandi:

I wonder if we could basically have,

W. Curtis Preston:

AI based.

W. Curtis Preston:

yeah.

W. Curtis Preston:

first I get it.

W. Curtis Preston:

I have to get all the audio and then feed that into a thing.

W. Curtis Preston:

Yeah.

W. Curtis Preston:

We don't need you and me anymore.

Prasanna Malaiyandi:

Exactly.

W. Curtis Preston:

How hard is it to just say backup your stuff, backup all the

W. Curtis Preston:

stuff and make sure you test your backups?

Prasanna Malaiyandi:

And then you just do it based off of whatever's

Prasanna Malaiyandi:

trending on Twitter and the data protection, security space.

Prasanna Malaiyandi:

And it comes up with a new podcast episode for us.

W. Curtis Preston:

That may have already happened.

W. Curtis Preston:

Who knows?

W. Curtis Preston:

You don't know this is an auto-generated video and auto-generated audio, who

W. Curtis Preston:

knows, but speaking of testing backups, I was thinking about this concept, as

W. Curtis Preston:

long as you don't test your backups, your backup is both a complete success

W. Curtis Preston:

and a complete failure, which reminds me of, the concept of Schrodinger's cat.

Prasanna Malaiyandi:

I like the former, rather than thinking

Prasanna Malaiyandi:

about the latter, but that's

W. Curtis Preston:

Yeah, but,

Prasanna Malaiyandi:

rather than the realist.

W. Curtis Preston:

So you're familiar with the concept of Schrodinger's cat, right?

Prasanna Malaiyandi:

Based on TV shows, movies,

W. Curtis Preston:

Okay.

W. Curtis Preston:

Yeah.

W. Curtis Preston:

So it's just a concept, as I understand the concept that you have this cat in

W. Curtis Preston:

a box, and as long as you don't look in the box, the cat is both alive and dead.

W. Curtis Preston:

But once you look in the box, you will know that the cat is alive or dead.

W. Curtis Preston:

That's the concept of Schrodinger's cat.

W. Curtis Preston:

And the reason why this is relevant today is that we have the author of a blog called Schrodinger's Backup: when good documentation goes bad.

W. Curtis Preston:

He's been in the IT industry almost as long as I have.

W. Curtis Preston:

He comes to us from the UK.

W. Curtis Preston:

Welcome to the podcast, Gary Williams.

Gary Williams:

Thank you and thank you for the invite.

W. Curtis Preston:

I saw that title.

W. Curtis Preston:

And I was like, I gotta get this guy on the podcast.

Prasanna Malaiyandi:

Curtis was so excited.

Prasanna Malaiyandi:

Gary, you have no idea.

Prasanna Malaiyandi:

This is like one of his favorite topics.

Gary Williams:

Thank you.

Gary Williams:

I don't know if I coined the term.

Gary Williams:

I have seen it used since I'd like to think I coined the term,

Gary Williams:

but I don't know for certain,

W. Curtis Preston:

why not?

Gary Williams:

it might be something that I heard and I just copied because

Gary Williams:

it just sounds really cool and it's perfectly accurate, I think.

Gary Williams:

It was all three or four companies ago.

Gary Williams:

The lessons we learned still definitely apply today, but this

Gary Williams:

happened about three companies back.

Gary Williams:

So about 10 years ago.

W. Curtis Preston:

So what was your role at the time?

Gary Williams:

So my role at the time was a senior network engineer or senior

Gary Williams:

support engineer, something like that.

W. Curtis Preston:

OK, And you had the, the gall to, to ask about backups.

Gary Williams:

No, I didn't.

Gary Williams:

I was overconfident with our backups, let's say. So we had the backup

Gary Williams:

software, I think it was Backup Exec.

Gary Williams:

And, we had all the servers being backed up.

Gary Williams:

We had everything going to dual tapes.

Gary Williams:

The tapes were going off site.

Gary Williams:

Everything was working.

W. Curtis Preston:

Jewel, jewel tapes?

Gary Williams:

Dual tapes.

Gary Williams:

We actually had the backups, the software was writing

Gary Williams:

effectively RAID-1 backups.

Gary Williams:

So it was writing to two tapes.

W. Curtis Preston:

Oh, dual tapes.

W. Curtis Preston:

Okay.

W. Curtis Preston:

I heard, for some reason I heard Jewel.

W. Curtis Preston:

I don't know why.

Gary Williams:

It's the English accent.

Gary Williams:

And yeah.

Gary Williams:

So it's going to two tapes simultaneously.

Gary Williams:

So the idea was that even if a tape broke, or if something happens to the

Gary Williams:

backup and we weren't entirely sure of, or you couldn't restore from one of the

Gary Williams:

tapes, you could then get the other tape and use that tape to do the restore.

Gary Williams:

So we had all that stuff going on.

Gary Williams:

And we got all the emails and of course we're getting the emails

Gary Williams:

saying all the backups are good, everything must be absolutely fine.

Gary Williams:

Why would we test them?

Gary Williams:

Why? We're busy enough with other tickets and other stuff going on and projects.

Gary Williams:

We haven't got time to test them.

Gary Williams:

What's the point?

Gary Williams:

We know they work.

Prasanna Malaiyandi:

And so it looks like you were doing all the

Prasanna Malaiyandi:

right things in terms of setting up backups, following the 3-2-1 rule.

Prasanna Malaiyandi:

Right?

Prasanna Malaiyandi:

Making sure your copies were offsite and.

W. Curtis Preston:

Yeah.

Prasanna Malaiyandi:

I think that's probably better than maybe

Prasanna Malaiyandi:

like 70% of the people out there.

Prasanna Malaiyandi:

Who try to do backups.

Prasanna Malaiyandi:

You're like doing the right things.

Prasanna Malaiyandi:

You're like, oh, I'm good to go.

Gary Williams:

Yeah, absolutely.

Gary Williams:

As I say, we had the emails, we even checked the emails.

Gary Williams:

I think we even had a shared folder or something like that, where all

Gary Williams:

the backup emails went, and if one of us saw that the folder had

Gary Williams:

an unread one, we'd check it.

Gary Williams:

If there was an error, someone would get a ticket, it would get sorted out.

Gary Williams:

If the error went on for several days, there would be a conversation.

Gary Williams:

We will get these things fixed.

Gary Williams:

Where's the problem?

Gary Williams:

We know our backups are good.

W. Curtis Preston:

So you, you were a.

W. Curtis Preston:

You were, I don't know.

W. Curtis Preston:

I don't know what to call it, but so instead of being a proponent

W. Curtis Preston:

of testing the backups, you were a proponent of oh, everything's fine.

Gary Williams:

Unfortunately at that time.

Gary Williams:

Yes, I was sitting there, quite fat, dumb and happy, going:

Gary Williams:

We've got the emails, the backups work.

Gary Williams:

We know they work.

Gary Williams:

Where's the problem?

Gary Williams:

I didn't see any issue here at all.

W. Curtis Preston:

For what it's worth.

W. Curtis Preston:

I had a similar point in my career and there was a time.

W. Curtis Preston:

I remember when I was at a company, I won't give the actual name of the

W. Curtis Preston:

company, but I will just say it's a very, well-known electronics manufacturer.

W. Curtis Preston:

and I had helped them set up their backup system and I wasn't

W. Curtis Preston:

there just to do the backups.

W. Curtis Preston:

I was there to do sysadmin stuff.

W. Curtis Preston:

And they were a mess.

W. Curtis Preston:

This was a small department in this bigger electronics company.

W. Curtis Preston:

It was an interesting department.

W. Curtis Preston:

They called it Simulation Modeling and Research.

W. Curtis Preston:

So it was a revolutionary idea at the time of the idea of modeling,

W. Curtis Preston:

like in a computer, what would happen if you drop this device?

W. Curtis Preston:

And so they were doing this in a computer.

W. Curtis Preston:

It was a fascinating, new-at-the-time field of science.

W. Curtis Preston:

So I was there to fix a whole bunch of problems.

W. Curtis Preston:

One of which, for example, was that every workstation, it was all

W. Curtis Preston:

Unix workstations, and every person had root on their workstation.

W. Curtis Preston:

And that was the first thing I was going to fix.

W. Curtis Preston:

But I also set up their backup system and, the backups worked.

W. Curtis Preston:

So I assumed the restores would work and it was some time.

W. Curtis Preston:

I was there long enough that I went, I actually, at some

W. Curtis Preston:

point needed to do a restore.

W. Curtis Preston:

And I found out that those tape drives were really good at writing data.

W. Curtis Preston:

And they were completely incapable of reading data.

W. Curtis Preston:

Again, I don't want.

W. Curtis Preston:

I'm sure there was something wrong with these drives, but

W. Curtis Preston:

they were IBM 3590 drives.

W. Curtis Preston:

Normally IBM drives are top of the line or whatever, but there was something wrong

W. Curtis Preston:

with these drives that I was completely.

W. Curtis Preston:

So I guess what I'm saying is you're not alone.

W. Curtis Preston:

even me who, I've spent my career in this, although honestly, that's

W. Curtis Preston:

that event is on the list of things that I think back to when.

Prasanna Malaiyandi:

Yeah.

W. Curtis Preston:

when I try to get other people to do it.

Gary Williams:

Absolutely same with me.

Gary Williams:

the backups that we were taking, as I say, we were only a small

Gary Williams:

team and we had all the emails.

Gary Williams:

We had everything in place.

Gary Williams:

We had the two tape libraries doing the backups.

Gary Williams:

So we thought we were in a really good position because we had

Gary Williams:

everything working the way it should.

Gary Williams:

We even had documentation for how all this stuff was put together.

Gary Williams:

We actually had a consultancy come in and help us put all this stuff together.

Gary Williams:

Because at the time I was working for a financial institution, we

Gary Williams:

had to have certain boxes ticked, and we had those boxes ticked

Gary Williams:

because we have the documentation.

Gary Williams:

We had the backups, they were going off site.

Gary Williams:

They were going off site.

Gary Williams:

They were being looked after for us.

Gary Williams:

We even recalled tapes to make sure we could do the process

Gary Williams:

and no tapes were getting lost.

Gary Williams:

So we did that level of testing, but what we never actually tested was

Gary Williams:

actually restoring the data itself.

Gary Williams:

And it was a bit of an epiphany when we actually had someone come

Gary Williams:

into the team who was brand new to IT.

Gary Williams:

Had never worked in IT before.

Gary Williams:

Always wanted to work in IT.

Gary Williams:

Was actually employed in the business in a completely different role.

Gary Williams:

And then he actually said to me, one day, I'd like to move into IT.

Gary Williams:

I thought he was joking.

Gary Williams:

It turns out no, he was actually serious.

Gary Williams:

He was an ex-finance person wanting to move into IT.

Gary Williams:

So he applied internally, he got the job and he started with us and he started

Gary Williams:

looking through some old tickets and he was saying things like, why did you

Gary Williams:

do such and such a change this way?

Gary Williams:

So there's a whole education thing going on there.

Gary Williams:

And that's when he asked the question.

Gary Williams:

When did you test the backups?

Gary Williams:

What do you mean test them.

Gary Williams:

We've got the emails.

Gary Williams:

Look, here, you can see the servers.

Gary Williams:

Here's the tape drives.

Gary Williams:

Here's the tapes.

Gary Williams:

We recall the tapes.

Gary Williams:

Yeah, sure.

Gary Williams:

But when did you restore something?

Gary Williams:

And it's something I won't actually forget, because there was this look,

Gary Williams:

there's only four of us in the IT team.

Gary Williams:

We were a really small team for a company of about 300 and there's

Gary Williams:

this look going around the whole office and everyone's going well, we

Gary Williams:

haven't actually tested them have we?

Prasanna Malaiyandi:

It's like a light bulb goes off and it's yeah.

Prasanna Malaiyandi:

It's Ooh.

Gary Williams:

We were like, hang on.

Gary Williams:

Yeah, we should probably test one of them, shouldn't we?

Gary Williams:

Okay.

Gary Williams:

What should we test? And looking back on it, it was a really insane moment

Gary Williams:

just to think that we'd had, easily,

Gary Williams:

I think, the emails coming in for over a year.

Gary Williams:

And yes, we'd had the odd backup failure where something a time there, or there

Gary Williams:

was a fault with one of the tape drives.

Gary Williams:

These tape drives were quite old.

Gary Williams:

So they actually had physical SCSI cables that would sometimes play up.

Gary Williams:

So you had to make sure the SCSI cables were all firmly

Gary Williams:

attached, the terminator was in.

Gary Williams:

The good old days.

Gary Williams:

And.

Prasanna Malaiyandi:

never had to deal with restores?

W. Curtis Preston:

Yeah.

W. Curtis Preston:

And of course you had both active and passive terminators as well.

Gary Williams:

Yeah, exactly.

Gary Williams:

we did actually have to do some restores, but we had, a storage array and the

Gary Williams:

storage provider let us do snapshots.

Gary Williams:

So 99% of the restores that we needed.

Gary Williams:

Just copy and paste from the snapshot.

Gary Williams:

Not a problem.

Gary Williams:

You deleted that file? Not a problem.

Gary Williams:

There it is.

Gary Williams:

If something was deleted from a desktop, the common response was, we

Gary Williams:

don't back things up on your desktop.

Gary Williams:

Sorry.

Gary Williams:

That's tough.

Gary Williams:

If you want it backed up, put it onto the server, put it into your

Gary Williams:

home drive or something like that.

Gary Williams:

It will get backed up.

Gary Williams:

So that was the generally understood consensus because it was a small company.

Gary Williams:

Most of the time, this wasn't an issue, and as I say, people deleted a file.

Gary Williams:

I remember one time we had an Excel file.

Gary Williams:

That was a real pain because of all these financial macros.

Gary Williams:

And we restored that from a snapshot.

Gary Williams:

And it was still corrupt and we had to go back a week or so, we

Gary Williams:

managed to get the file back and it was working and we actually said

Gary Williams:

it and I remember it quite well.

Gary Williams:

We said it within the team.

Gary Williams:

that was lucky.

Gary Williams:

We might have actually asked to get the tapes on site and do a restore from

Gary Williams:

the tapes, but the snapshot worked.

Gary Williams:

Everything's fine, you know yeah.

Prasanna Malaiyandi:

Now you've decided, okay, we haven't tested.

Prasanna Malaiyandi:

Maybe we should actually try doing the test.

Prasanna Malaiyandi:

How did you decide what to test?

Gary Williams:

Funny enough, it was the new guy.

Gary Williams:

the discussion was actually, okay, you're the person sitting

Gary Williams:

there looking through the tickets.

Gary Williams:

You're looking through the documentation.

Gary Williams:

You're new to it all.

Gary Williams:

You want us to prove to you that the restore process works.

Gary Williams:

We know it does.

Gary Williams:

Pick something.

Gary Williams:

And then he sat there and he went, how about the Exchange server?

Prasanna Malaiyandi:

Swinging for the fences!

Gary Williams:

Fine.

Gary Williams:

So we thought, okay, fine.

Gary Williams:

we'll get the tapes back on site.

Gary Williams:

We'll do the restore.

Gary Williams:

We'll prove that the backups work and we can go back to what we're normally doing.

Gary Williams:

all the project work, that kind of thing.

Gary Williams:

We could spend a day on this.

Gary Williams:

It will be good for us.

Gary Williams:

Not a problem.

Gary Williams:

We even went to the documentation and got the documentation out and said,

Gary Williams:

look, we've got the documentation.

Gary Williams:

The tapes are coming in.

Gary Williams:

This is going to be easy.

Gary Williams:

And it wasn't.

Prasanna Malaiyandi:

Of course not.

Prasanna Malaiyandi:

So when you decided to do the restore.

Prasanna Malaiyandi:

Did you bring down your production or were you like, I'm going to

Prasanna Malaiyandi:

restore this into a safe spot and

Gary Williams:

Yeah, we couldn't bring down production because the nature

Gary Williams:

of the business was that we needed to keep the server up and running.

Gary Williams:

We actually had a spare server, and I think we maybe had two spare servers.

Gary Williams:

VMs were just starting to come on the scene and we actually

Gary Williams:

had a spare server racked.

Gary Williams:

And the idea was that if we had a server failure, we could take the

Gary Williams:

physical discs out of one server.

Gary Williams:

Put them into another server, power it on, be back running.

Gary Williams:

This is also before the days of replicas.

Gary Williams:

They were, again, just coming out, and a lot of the software was super expensive and

W. Curtis Preston:

You're giving me flashbacks, Gary.

Gary Williams:

the good old days.

W. Curtis Preston:

Yeah.

Gary Williams:

We had this physical server and it had plenty

Gary Williams:

of disc space to handle this.

Gary Williams:

So we said, okay, we've not actually even powered this server on.

Gary Williams:

I don't even think, I think maybe it was powered on when

Gary Williams:

we bought it and that was it.

Gary Williams:

So we said we should test that server out anyway.

Gary Williams:

Yeah.

Gary Williams:

Let's power it on.

Gary Williams:

Let's get the data restored to that server and bring Exchange up.

Gary Williams:

We can bring it up in an isolated network.

Gary Williams:

Do some very basic tests on it, because it was a small team.

Gary Williams:

We had access to the networking guys.

Gary Williams:

I'll say networking guys.

Gary Williams:

We each did a little bit of networking and there was one guy who did a lot

Gary Williams:

of the really key networking tasks.

Gary Williams:

So none of that was a problem.

Gary Williams:

We didn't have to wait months for tickets to get done or anything like that.

Gary Williams:

So we set up this isolated network, we got the tapes on site and we

Gary Williams:

started doing the restore and that's when it all went horribly wrong.

Prasanna Malaiyandi:

So who was doing the restore?

Gary Williams:

If I recall, it was actually our help desk guy.

Gary Williams:

We said to him, look, you came up with this.

W. Curtis Preston:

You put a lot on this guy.

W. Curtis Preston:

It was his idea.

W. Curtis Preston:

And you're like, what, if you think testing backups is so

W. Curtis Preston:

important, why don't you do it?

Gary Williams:

Pretty much. We did put it on him.

Gary Williams:

cause it was his idea.

Gary Williams:

And we said, look, this is a really good exercise for you to do.

Gary Williams:

Again, unfortunately, I'm going to put my hands up to this.

Gary Williams:

It's a bad thing to have done, but we said, I'm a senior IT person.

Gary Williams:

I know the backups are good.

Gary Williams:

here you go.

Gary Williams:

Here's the tapes.

Gary Williams:

Here's the documentation.

Gary Williams:

See you later and off he goes and he comes back.

Gary Williams:

I think it was about two, three hours later, something like that.

Gary Williams:

And he went, I can't get this working.

W. Curtis Preston:

Yeah.

Gary Williams:

What do you mean you can't get it working?

Gary Williams:

What's the problem?

Gary Williams:

And I don't actually recall what the problems, all the problems were, but

Gary Williams:

I know that the server itself didn't have enough disc space, even though

Gary Williams:

it was supposed to have the disc space, because the documentation said,

Gary Williams:

you need partition sizes like this.

Gary Williams:

And it had actually changed since then.

Gary Williams:

And we didn't realize, and that was really the start of a lot of problems.

W. Curtis Preston:

Yeah.

W. Curtis Preston:

first off I will say that even though.

W. Curtis Preston:

the way you got there.

W. Curtis Preston:

I like the way you did it.

W. Curtis Preston:

What I want to say is, even though the way you got there was wrong, the fact that

W. Curtis Preston:

you had this person

W. Curtis Preston:

do it, who wasn't the person that made the documentation.

W. Curtis Preston:

That's actually something I push pretty heavily.

W. Curtis Preston:

And it's an idea that came from back in my days when I was at a bank

W. Curtis Preston:

and we very much did test restores.

W. Curtis Preston:

first off we didn't have snapshots.

W. Curtis Preston:

We didn't have any of that stuff.

W. Curtis Preston:

And we had 10,000 employees and any one of them was allowed to

W. Curtis Preston:

call into the help desk and ask for a restore on any given day.

W. Curtis Preston:

And, so we would get 10 to 15 restores a day.

W. Curtis Preston:

So we tested pretty regularly to that degree, but

W. Curtis Preston:

the thing that we had to test in the way that you did were these large

W. Curtis Preston:

server restores. We did a DR test, and it was an absolute imperative

W. Curtis Preston:

from the powers that be:

W. Curtis Preston:

Curtis wrote the documentation.

W. Curtis Preston:

Curtis cannot be the person actually doing the test.

W. Curtis Preston:

Curtis needs to be standing back there, listening closely to the problems that

W. Curtis Preston:

are happening, which was actually kind of nice, although

W. Curtis Preston:

it's nerve wracking to be the person who wrote the documentation and then

W. Curtis Preston:

sitting there watching someone, you think you've answered all the questions,

W. Curtis Preston:

but it's not like in this case, you.

W. Curtis Preston:

you had the classic example of the documentation might've been

W. Curtis Preston:

correct, but it was out of date.

Gary Williams:

It was correct at the time, the irony is very similar with you.

Gary Williams:

I didn't actually write the documentation.

Gary Williams:

It was written by the contractors and consultants that came on.

Gary Williams:

Actually signed off on the documentation saying, yes, all

Gary Williams:

the version numbers are correct.

Gary Williams:

And I think I'd done a couple of updates.

Gary Williams:

And then we'd had other changes and the other people

Gary Williams:

had forgotten or I'd forgotten.

Gary Williams:

Probably I'd forgotten to update the documentation because we

Gary Williams:

were busy, only a small team.

Gary Williams:

And so things very slowly on, not just that document, but on every other

Gary Williams:

document that we had about the environment became out of date and it was this

Gary Williams:

snowball of errors that had crept in.

Gary Williams:

And the thing that we realized is actually having no documentation

Gary Williams:

would have been better because the documentation was lying to us.

Gary Williams:

this poor guy is sitting there going, I followed steps three, four,

Gary Williams:

and five, but I can't do step six because step five doesn't work.

Gary Williams:

What do you mean?

Gary Williams:

It doesn't work.

Gary Williams:

And that's when we found that there was a service pack that

Gary Williams:

was missing from Exchange.

Gary Williams:

So it couldn't go any further and it just kept on building and building like this.

Prasanna Malaiyandi:

That is an interesting problem.

Prasanna Malaiyandi:

How do you keep your documentation up to date as you're making these

Prasanna Malaiyandi:

changes and making sure everyone across the environment knows like where the

Prasanna Malaiyandi:

documentation is and all the rest of that.

Gary Williams:

today, we use a Wiki solution for all of our documentation.

Gary Williams:

The idea behind that of course, is the Wiki is so easy to edit.

Gary Williams:

But you still don't or sometimes you still don't.

Gary Williams:

You make a note, I'll do that tomorrow or next week.

Gary Williams:

So there is still the exact same risk.

Gary Williams:

And even in my current place, we've seen this with certain, we do testing as well.

Gary Williams:

We do a lot more testing now than, anywhere I've ever worked before.

Gary Williams:

And even with a lot of the modern systems with Amazon.

Gary Williams:

Backups to S3 and all this kind of stuff.

Gary Williams:

We still test to make sure that everything's correct,

Gary Williams:

that we know what we're doing.

Gary Williams:

That those Wiki pages are fully up to date.

Gary Williams:

we did some AD restore testing not so long ago and we found, not major errors,

Gary Williams:

but there was a couple of little issues there with the restore process, which

Gary Williams:

just needed a few corrections in the documentation, just, as like a permissions

Gary Williams:

era type of thing where we couldn't actually get access to the bucket.

Gary Williams:

So we had to make some changes there.

Gary Williams:

So even with all the modern backup software.

Gary Williams:

It's still so important.

W. Curtis Preston:

I talked about those DR tests that we did back in the day, and

W. Curtis Preston:

the fact that we always had someone who wasn't me doing the

W. Curtis Preston:

tests, and frequent listeners to the podcast will have heard this before.

W. Curtis Preston:

But if we define a successful restore, as we got from A to Z without having to ask

W. Curtis Preston:

Curtis, what does this line mean, not a single one of the restores was successful.

W. Curtis Preston:

so if Curtis ever got, blown up and, whatever, the chances of a restore

W. Curtis Preston:

going completely without a hitch was, zero, which is why you talked

W. Curtis Preston:

about updating, there's always little things that you have to update.

W. Curtis Preston:

I would suggest that original documentation.

W. Curtis Preston:

and again, take this for what it's worth to anybody who's listening.

W. Curtis Preston:

the first mistake was writing the documentation in a way that

W. Curtis Preston:

it can easily get outdated.

W. Curtis Preston:

"Our Exchange server is 75 tera..." Right?

W. Curtis Preston:

That's a problem.

W. Curtis Preston:

So if you're going to put that in restore documentation, what it should say

W. Curtis Preston:

is: before beginning the restore, go look at the size of the backups, and figure

W. Curtis Preston:

out how big the current Exchange server is, and then size the volume accordingly.

W. Curtis Preston:

yeah, that, that line wouldn't have gone out of date as quickly.
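
Editor's note: a minimal sketch of the "size the volume from the backups" step, not something described on the show. It assumes the backup catalog reports the restored (uncompressed) size of each recent full backup, and the function name, the 25% headroom, and the sample figures are all hypothetical.

```python
import math


def restore_volume_gib(backup_sizes_gib, headroom=0.25):
    """Suggest a restore volume size in GiB.

    Takes the largest recent full backup and adds a safety margin,
    so the runbook never needs a hard-coded server size.
    """
    if not backup_sizes_gib:
        raise ValueError("no backup sizes supplied; check the backup catalog first")
    largest = max(backup_sizes_gib)
    return math.ceil(largest * (1 + headroom))


# Made-up sizes for the last three full backups, in GiB.
recent_full_backups_gib = [410.0, 422.5, 431.0]
print(restore_volume_gib(recent_full_backups_gib))
```

The point of the design is that the documentation records the procedure and the margin, while the number itself is looked up at restore time.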

W. Curtis Preston:

it is a real challenge by the way.

W. Curtis Preston:

this idea of what it's like to update documentation, by the way, back in

W. Curtis Preston:

the day we were using WordPerfect.

W. Curtis Preston:

Yeah.

W. Curtis Preston:

And I remember the official company standard was WordPerfect,

W. Curtis Preston:

because we could use it on, we had Unix versions of WordPerfect.

W. Curtis Preston:

By the way, curses-based WordPerfect.

W. Curtis Preston:

Not this fancy Windows

W. Curtis Preston:

what-you-see-is-what-you-get editing stuff.

W. Curtis Preston:

This was text on a screen.

W. Curtis Preston:

and I remember getting in a fight over.

W. Curtis Preston:

There was this one guy that was new and he wanted to use Word

W. Curtis Preston:

because nobody used WordPerfect.

W. Curtis Preston:

And we were like, we don't care.

W. Curtis Preston:

We use WordPerfect here for our documentation.

W. Curtis Preston:

And if you want your documentation to fit into our documentation,

W. Curtis Preston:

you will use WordPerfect.

W. Curtis Preston:

And you will like it.

Gary Williams:

I remember our first days of moving across

Gary Williams:

to Word, where you had the...

Gary Williams:

Word had the ability to mimic WordPerfect key presses.

Gary Williams:

So you could transition easily.

Gary Williams:

Good old days.

W. Curtis Preston:

Good old days, but I think what you're doing now with the Wiki,

W. Curtis Preston:

I think that's a much better approach.

Gary Williams:

It is.

Gary Williams:

There's a permissions list behind it, obviously, so that not everyone

Gary Williams:

can get access to it, but the right people can get access.

Gary Williams:

But what it means is everyone in the team can get access.

Gary Williams:

They can all update it.

Gary Williams:

There's a history as well.

Gary Williams:

So the other thing that we didn't have is the backup of the documentation

Gary Williams:

was on the server we were backing up.

Prasanna Malaiyandi:

Oh,

Gary Williams:

Exactly.

Gary Williams:

So we, all we had was that documentation and looking back on it, we made

Gary Williams:

quite a few mistakes like this.

Gary Williams:

We had the, let's say we had the documentation on the file server.

Gary Williams:

So if the file server was lost,

Gary Williams:

how did you get your documentation?

Gary Williams:

And it was, again, something that the helpdesk guy pointed out to us.

Gary Williams:

How did you get your documentation?

Gary Williams:

That's fine, actually.

Gary Williams:

How would we?

Prasanna Malaiyandi:

Sometimes it's an outside perspective, or

Prasanna Malaiyandi:

someone saying, "Hey, how are you actually going to get this stuff done?"

Gary Williams:

Something I think it's really important to know is at

Gary Williams:

the time I was a senior IT person.

Gary Williams:

There's a colleague of mine who was senior and we had a network guy.

Gary Williams:

All of us, were reasonably senior.

Gary Williams:

This guy was a junior.

Gary Williams:

He'd been working in finance for three or four years beforehand.

Gary Williams:

And then he'd just moved into IT.

Gary Williams:

And he had such a fresh perspective on everything that it really opened our eyes.

Gary Williams:

And that was the day I learned that it doesn't matter if you got 50

Gary Williams:

years IT experience or five minutes.

Gary Williams:

There's always something you can learn from someone.

Gary Williams:

And sometimes the most valuable thing you can learn is from someone

Gary Williams:

who is very new to the team, fresh eyes, fresh perspective.

Gary Williams:

It's invaluable.

Prasanna Malaiyandi:

100% agree.

W. Curtis Preston:

There, there is a perspective that you can only

W. Curtis Preston:

gain by being completely ignorant.

W. Curtis Preston:

He could have been not junior to IT in this case.

W. Curtis Preston:

He was, but even if he's a senior IT person, but he's joining your organization

W. Curtis Preston:

for the first time, that's another way you can look at this person when they ask

W. Curtis Preston:

stupid questions, like, "So how often do we test our backups here?"

W. Curtis Preston:

And you're like, we don't do that.

Gary Williams:

With my current place, any new person we get into our IT team, we

Gary Williams:

literally do that sort of thing with them now.

Gary Williams:

We say, have a look through the tickets.

Gary Williams:

If you've got any questions, ask.

Gary Williams:

Have a look through the Wiki. Again, if you've got any questions, ask, because

Gary Williams:

there's so many things in there.

Gary Williams:

There's like the whole corporate culture and there's corporate acronyms.

Gary Williams:

And if they don't know what they are, we've just found a problem

Gary Williams:

because if there's one acronym we have this, I, my brain's gone.

Gary Williams:

Sorry.

Gary Williams:

there's one acronym that we have, that's very similar to an IT acronym.

Gary Williams:

I can't remember what it is off the top of my head.

Gary Williams:

Yeah.

Gary Williams:

But when you look at it, you think of the IT term because you're an IT person,

Gary Williams:

but it actually means the corporate one.

Gary Williams:

so there's that kind of thing.

Gary Williams:

it's always important to spell out these acronyms at the start of any

Gary Williams:

documentation so that everyone knows this is what you are referring to.

Prasanna Malaiyandi:

Especially

W. Curtis Preston:

it's Prasanna's job on, on the podcast.

W. Curtis Preston:

If anybody ever brings up, an acronym that, they don't spell out,

W. Curtis Preston:

Prasanna's, always making them spell it

Prasanna Malaiyandi:

out.

Prasanna Malaiyandi:

Yep.

Prasanna Malaiyandi:

I'm like, what does that really mean?

Prasanna Malaiyandi:

Please tell me.

Gary Williams:

And this is the thing.

Gary Williams:

You can walk into a meeting with all the IT acronyms and every IT

Gary Williams:

person sitting there will probably think it's something different.

Gary Williams:

I think DC is a good one, because DC is direct current, data center,

Gary Williams:

things like that.

Gary Williams:

And this is the sort of thing that we've experienced several times at a few different

Gary Williams:

companies I've worked for, and it's always valuable to get that new person's insight.

Gary Williams:

Because they don't know the corporate terminology, they don't

Gary Williams:

know the corporate acronyms.

Gary Williams:

So it's worth getting them on board and going through all this stuff because

Gary Williams:

they've got this fresh insight before they learn that stuff and they can spot these

Gary Williams:

problems before they become a problem.

W. Curtis Preston:

I just realized I haven't thrown out our

W. Curtis Preston:

usual disclaimer, Prasanna and I work for different companies.

W. Curtis Preston:

I work for Druva and he works for Zoom.

W. Curtis Preston:

And this is not a podcast of either company.

W. Curtis Preston:

And the opinions that you hear are ours.

W. Curtis Preston:

Please rate this podcast at ratethispodcast.com/restore.

W. Curtis Preston:

And if you are like our guest here today, Gary, just an IT person

W. Curtis Preston:

out there, and you want to talk about your favorite subject too, or, you know

W. Curtis Preston:

what, maybe if you don't understand why

Prasanna Malaiyandi:

Come challenge, Mr.

Prasanna Malaiyandi:

Backup.

W. Curtis Preston:

some crazy person would actually like them, then come on

W. Curtis Preston:

here. Related topics too: cybersecurity, data privacy, a number of related topics.

W. Curtis Preston:

We'd love to have you on as a guest and, and reach out

W. Curtis Preston:

to me at wcurtispreston@gmail or at @wcpreston on Twitter.

W. Curtis Preston:

And we'll get you on here.

W. Curtis Preston:

So, how did it turn out

W. Curtis Preston:

with your restore?

Gary Williams:

So eventually we got there, we actually got the

Gary Williams:

Exchange server fully restored correctly, with the documentation.

Gary Williams:

And I think it took three or four days, something like that.

Gary Williams:

And the thing is, if you want to do a team building exercise, forget all the

Gary Williams:

assault courses and other things they have. You get the team together and

Gary Williams:

do a restore. Some of the best, seriously.

W. Curtis Preston:

It's a bit like the trust exercises where you lean

W. Curtis Preston:

backwards and someone catches you. It's like that.

Gary Williams:

I've also never seen so many whiteboards being used to

Gary Williams:

describe issues and draw diagrams of how things hung together.

Gary Williams:

And it was actually really good.

Gary Williams:

And I will admit we ended up putting some projects, not exactly on pause,

Gary Williams:

but we put them to one side as all of us started getting involved in

Gary Williams:

this restore, because we realized we actually had a very serious problem.

Gary Williams:

I'll be honest.

Gary Williams:

We gave the help desk guy, this guy who was junior to IT, the documentation.

Gary Williams:

And we did expect him to trip over a few things.

Gary Williams:

He's a new person, some of the terminology is new, fine, not a problem.

Gary Williams:

We know we're there to help.

Gary Williams:

What we didn't expect was us to trip over the same issues.

Gary Williams:

We honestly thought that, like you were saying earlier, Curtis, that he

Gary Williams:

was going to ask us some questions.

Gary Williams:

We could do some updates to the documentation, do it again,

Gary Williams:

and everything would be fine.

Gary Williams:

But we didn't expect to get stumped by our own documentation.

Gary Williams:

And unfortunately we actually did, we're sitting there going

Gary Williams:

through the documentation going well, hang on a minute.

Gary Williams:

We know the password is in this password safe and that password

Gary Williams:

should work, but something had changed, or I think at one point we'd

Gary Williams:

actually changed the security model.

Gary Williams:

So it was requiring stronger passwords.

Gary Williams:

So you couldn't actually use a password that was on the backup.

Gary Williams:

You had to go and reset an account.

Gary Williams:

And it was lots of...

Gary Williams:

There was nothing seriously wrong with the backup as such.

Gary Williams:

And there was nothing seriously wrong with the documentation,

Gary Williams:

but it was lots of little things that just piled up and piled up.

Gary Williams:

And every time we took a couple of steps forward, we thought, that's it.

Gary Williams:

We've got this solved, we'll get this restored.

Gary Williams:

And then we got it all up and running and got the server running and

Gary Williams:

the Exchange server service wouldn't start.

Gary Williams:

We couldn't figure out why.

Gary Williams:

I think that one took us a day to go through and we ended up having

Gary Williams:

to run some additional commands.

Gary Williams:

And finally, we got there, we got it all up and running.

Gary Williams:

And I still remember, I think it was actually like a Friday or something

Gary Williams:

we're sitting there in the office and went, yeah, that was a really good

Gary Williams:

question, you know, can we restore the data?

Gary Williams:

Thank you for asking it.

Gary Williams:

we had a bit of a celebration over that one.

W. Curtis Preston:

I would say that, I like what you were saying

W. Curtis Preston:

about, it sounded like there was a lot of collaboration.

W. Curtis Preston:

It sounds like there's a lot of whiteboards going on

W. Curtis Preston:

and you were learning a lot.

W. Curtis Preston:

I would argue that the reason that was the case is that you

W. Curtis Preston:

weren't doing it under duress.

W. Curtis Preston:

You were doing this as a test.

W. Curtis Preston:

If your Exchange had been down for three or four days, that would have

W. Curtis Preston:

been a very different experience.

Gary Williams:

Completely.

Gary Williams:

It's something that we actually discussed, that Friday afternoon, we've got the

Gary Williams:

Exchange server up and running and the conversation was what happens if

Gary Williams:

this happens for real, because sure.

Gary Williams:

We got the backup restored.

Gary Williams:

We know that the backup is good.

Gary Williams:

I mean, we were told it was good, but the restore process wasn't good.

Gary Williams:

And I think we focused way too much on the backup itself and

Gary Williams:

not the restore at that point.

Gary Williams:

As I said, we had that conversation and it was a matter of what would happen.

Gary Williams:

And we knew that we were a small company.

Gary Williams:

We knew we would have the CEO down in the office, screaming at us.

Gary Williams:

I need this back.

Gary Williams:

We can't conduct business. And I'll be honest, that day

Gary Williams:

we got a healthy lot of respect, both for the backups, for documentation

Gary Williams:

and the accuracy of documentation and for the server itself.

Gary Williams:

Because we knew that the company at that point, the company relied on email so

Gary Williams:

much that if that server did disappear, and we took that long to get back up and

Gary Williams:

running, the loss, the financial loss to the company and the reputational loss

Gary Williams:

to the company would have been huge.

Gary Williams:

And that also actually helped form some push forward for additional resilience

Gary Williams:

in, like, the servers and moving more towards things like virtual machines,

Gary Williams:

so that we had the ability to clone and do other bits and pieces, because

Gary Williams:

we could use that as an experience.

Gary Williams:

It's, look, this is how long, potentially, worst-case scenario, it will take.

Gary Williams:

It shouldn't because we're learning and we need to do this a lot more often.

Gary Williams:

We need to allocate time to do this.

Gary Williams:

And the beauty was again, being such a small company.

Gary Williams:

We actually had the ear of a couple of directors, so you could

Gary Williams:

put this case forward and they were really receptive to it.

W. Curtis Preston:

I want to tack on something you said there.

W. Curtis Preston:

the fact that you and I have been in that timeframe.

W. Curtis Preston:

young kids today, they don't understand what it was like back then, when you had

W. Curtis Preston:

no resiliency, you had no redundancy.

W. Curtis Preston:

You had nothing.

W. Curtis Preston:

So we had a server, and that server had a disk drive.

W. Curtis Preston:

We didn't have mirroring.

W. Curtis Preston:

We didn't have

Gary Williams:

RAID.

Gary Williams:

Although we had very, we had rightful life.

Gary Williams:

We got, we were really market.

W. Curtis Preston:

we didn't, when I was back in the day, we

W. Curtis Preston:

literally were installing data directly on individual disk drives.

W. Curtis Preston:

I think we might've had redundant power supplies on the servers that

W. Curtis Preston:

we were using, and that was it.

W. Curtis Preston:

And so the loss of any one of those components could take the server down.

W. Curtis Preston:

Right.

W. Curtis Preston:

And nowadays we move forward to the days of virtualization and

W. Curtis Preston:

that you can just, if there's a little problem with this server, you just

W. Curtis Preston:

move your VM over to another server.

W. Curtis Preston:

In fact, you can vMotion it and Storage vMotion it, and you can

W. Curtis Preston:

move it while it's running, which continues to boggle my brain.

Gary Williams:

likewise.

W. Curtis Preston:

And also the devices that would, that so

W. Curtis Preston:

many of us have grown used to.

W. Curtis Preston:

I, at home I pretty much live a solid state life.

W. Curtis Preston:

My TiVo has a solid state hard drive.

W. Curtis Preston:

and so those are so much more reliable than the moving part

W. Curtis Preston:

drives that you and I grew up on.

W. Curtis Preston:

And I think as a result, they don't have

W. Curtis Preston:

the respect that you need to test backups the way you should.

W. Curtis Preston:

I don't know.

W. Curtis Preston:

Just a quick editor's note.

W. Curtis Preston:

In the next section, Gary is going to mention something called iLO and iDRAC.

W. Curtis Preston:

And we forgot to have him define them.

W. Curtis Preston:

So I'm doing that now.

W. Curtis Preston:

They are systems from Dell and HP: the Integrated Dell Remote Access

W. Curtis Preston:

Controller and HP Integrated Lights-Out.

W. Curtis Preston:

They're both systems that help increase the uptime of the server by notifying

W. Curtis Preston:

you of potential failures or issues.

W. Curtis Preston:

Back to your podcast.

Gary Williams:

One of the things that we still do today, and this is

Gary Williams:

probably me being paranoid coming from that environment, we didn't

Gary Williams:

get alerts on a server if a disk failed, because it didn't really know.

Gary Williams:

The iLOs and iDRACs were way too expensive for us to have at that point.

Gary Williams:

So: daily server room checks. Go around.

Gary Williams:

Is there any flashing lights that shouldn't be flashing?

Gary Williams:

And we still do that in our data centers today.

Gary Williams:

And we still do that with some of our machines.

Gary Williams:

We've actually got this philosophy in place now where if a machine is up for

Gary Williams:

more than 30 days, it needs to be rebooted because we don't know if it's reboot safe.

Gary Williams:

So we're starting to put uptime alarms in.

Gary Williams:

Certainly on Windows.

Gary Williams:

Linux is a bit different, but with Windows, when it hits a 30-day point.

Gary Williams:

If we get an uptime alarm, it means that there's possibly

Gary Williams:

a patching issue with that.

Gary Williams:

We should get an alarm from the patching system as well.

Gary Williams:

So we go off and we check.
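
Editor's note: a minimal sketch of the 30-day uptime alarm Gary describes, under stated assumptions. It assumes last-boot timestamps have already been collected from monitoring; the function name and host names are hypothetical and this is not the tooling used by Gary's team.

```python
from datetime import datetime, timedelta

# Threshold from the conversation: anything up longer than 30 days gets flagged.
MAX_UPTIME = timedelta(days=30)


def overdue_for_reboot(last_boot_by_host, now, max_uptime=MAX_UPTIME):
    """Return the hosts whose uptime exceeds max_uptime, sorted by name."""
    return sorted(
        host
        for host, booted_at in last_boot_by_host.items()
        if now - booted_at > max_uptime
    )


# Hypothetical inventory of last-boot times.
last_boot = {
    "win-file-01": datetime(2024, 1, 10, 3, 0),
    "win-mail-01": datetime(2024, 2, 20, 3, 0),
}
for host in overdue_for_reboot(last_boot, now=datetime(2024, 3, 1, 9, 0)):
    print(f"uptime alarm: {host} has not rebooted in over 30 days")
```

A host that trips this check is a prompt to look at patching and to schedule a reboot test, not proof that anything is broken.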

Gary Williams:

But we do something similar with Linux as well.

Gary Williams:

We're trying to get all the Linux servers rebooted because generally

Gary Williams:

with those, we can patch them hot, but we still want to get them rebooted.

Gary Williams:

Are they reboot safe?

Gary Williams:

Because if we do lose power or a machine crashes, it's great having all that stuff

Gary Williams:

there, but if it doesn't reboot, we've got a problem. And we may have a backup, but if

Gary Williams:

that backup has inherited that corruption or that problem, we're in a bad place.

Gary Williams:

So we do try to make sure that we've got, these servers rebooted on a fairly regular

Prasanna Malaiyandi:

That's actually very interesting.

Prasanna Malaiyandi:

I never thought about the fact that you need to reboot the systems

Prasanna Malaiyandi:

and just make sure the hardware and OS and everything else could be...

Gary Williams:

Absolutely.

Gary Williams:

The other thing that we've done is we've actually turned uptime

Gary Williams:

on its head now. In the old days,

Gary Williams:

these uptime figures of two years, three years, were put on the

Gary Williams:

internet and it's, look at my uptime.

Gary Williams:

Now it's the other way around.

Gary Williams:

It's like, yeah, there's an uptime of 45 days.

Gary Williams:

Oh, look at my uptime.

Gary Williams:

That's bad.

Gary Williams:

We need to get this rebooted.

Gary Williams:

And check it is reboot safe.

Gary Williams:

Trying to find reboot windows sometimes is a bit difficult, even with all the

Gary Williams:

resilience, to just take systems down.

Gary Williams:

but we do have some sort of bargaining going on with various teams where

Gary Williams:

we do try and reboot the systems at least once a month, just to make

Gary Williams:

sure that they are reboot safe.

W. Curtis Preston:

So help me understand that phrase.

W. Curtis Preston:

What do you mean when you say reboot safe?

Gary Williams:

So reboot safe is simply this: potentially a change can be

Gary Williams:

made to a machine that means the machine, when it reboots, is going to crash, or

Gary Williams:

there's going to be a problem where it can't complete the boot, a corrupted

Gary Williams:

boot loader or something like that.

Gary Williams:

We've seen issues in the past where a

Gary Williams:

Microsoft update has corrupted the bootloader.

Gary Williams:

So when you go to reboot, it doesn't restart properly.

Gary Williams:

So we've actually got the term reboot safe, which just means I know

Gary Williams:

if I have to reboot that server, I don't have to worry about it.

Gary Williams:

It will come up.

Gary Williams:

Your operating system will start, all the services that need to start will start,

Gary Williams:

because we've had issues in the past where certain key services don't start.

Gary Williams:

So we get a ticket.

Gary Williams:

Can you please reboot this machine?

Gary Williams:

Sure.

Gary Williams:

Reboot it, you walk off.

Gary Williams:

You think it's done, but the services don't start.

Gary Williams:

Now, the alerting will alert on that.

Gary Williams:

But in the meantime, you're potentially still down for a bit longer than you need to be.

Gary Williams:

So we do these tests where we just want to make sure all the services

Gary Williams:

that need to start actually start.

Gary Williams:

And it comes up completely clean and working exactly how it should.
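
Editor's note: a minimal sketch of the "did every required service come back" half of a reboot-safe check. It assumes the list of running services has already been gathered after the reboot; the function names and service names are hypothetical, and a real check would also confirm the machine booted at all.

```python
def missing_services(expected, running):
    """Return expected services that are not running, sorted by name."""
    return sorted(set(expected) - set(running))


def is_reboot_safe(expected, running):
    """A host passes only if every expected service came back up."""
    return not missing_services(expected, running)


# Hypothetical service names for one server.
expected = ["mail-store", "mail-transport", "backup-agent"]
running_after_reboot = ["mail-store", "backup-agent", "print-spooler"]

print(missing_services(expected, running_after_reboot))
print(is_reboot_safe(expected, running_after_reboot))
```

Running this right after a planned reboot, rather than waiting for the alerting to fire, closes the gap Gary mentions where someone reboots the machine and walks off.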

W. Curtis Preston:

you're giving me, Yeah.

W. Curtis Preston:

And by the way, I agree with you with this idea of, the occasional

W. Curtis Preston:

reboots and I agree that it's, that it's a practice that has gone by

W. Curtis Preston:

the wayside by a lot of people.

W. Curtis Preston:

And I remember, I can remember the first time I left my,

W. Curtis Preston:

this is before I got the Mr. Backup moniker,

W. Curtis Preston:

and I got a different moniker, and I'll explain it in a minute.

W. Curtis Preston:

I was at a large oil and gas company and no one had administered the data

W. Curtis Preston:

center, like a real sysadmin in years.

W. Curtis Preston:

And so I was going in there and I was doing crazy things like

W. Curtis Preston:

installing the latest patch set.

W. Curtis Preston:

And these were Solaris systems and it required a reboot

W. Curtis Preston:

in order to, to install the patches.

W. Curtis Preston:

And what was happening was I was like, 0 for 10, in terms

W. Curtis Preston:

of I would install a patch.

W. Curtis Preston:

I would reboot the server and it wouldn't come back.

W. Curtis Preston:

And so I picked up the nickname Crash, because that's what I was

W. Curtis Preston:

just, I was literally, it's like the cure is worse than the disease.

W. Curtis Preston:

So it's we need to do this, but I was doing, I was proactively

W. Curtis Preston:

doing damage to the environment.

W. Curtis Preston:

By doing the things I was doing, what I did get really good at though is restoring

W. Curtis Preston:

their environment because it kept,

W. Curtis Preston:

So it turned out the things that were really,

W. Curtis Preston:

I don't know, in trouble were the disks themselves, because we actually

W. Curtis Preston:

powered down the servers for some of them.

W. Curtis Preston:

And that's when things really went awry because the disk

W. Curtis Preston:

drives had never been turned off.

W. Curtis Preston:

And then, yeah.

W. Curtis Preston:

And then, they wouldn't come back on.

W. Curtis Preston:

So I had to get all new disk drives and then, and then do the restore, but yeah.

Prasanna Malaiyandi:

Yeah.

Gary Williams:

Yeah, but even with the virtual machines, we still like to

Gary Williams:

reboot them and to make sure all the services that should come up do come up.

Gary Williams:

We've even in some cases taken that paranoia to the next level where we'll

Gary Williams:

do a reboot test before we install a patch or before we do something,

Gary Williams:

just to make sure that it's not, that patch that has caused a problem.

Gary Williams:

Now, we generally don't do that for the Microsoft patches, but we do

Gary Williams:

that for certain application patches.

Gary Williams:

And it's almost a sanity check.

Gary Williams:

Because that way, if there is a problem, we know that it is that patch

Gary Williams:

that has caused a problem and not something lurking from beforehand.

Prasanna Malaiyandi:

Going back to the article you wrote, Gary, one of the

Prasanna Malaiyandi:

things I liked in it was you talked about this spreadsheet, if you will, that

Prasanna Malaiyandi:

tracked sort of assets that were backed up and you had a methodology that you

Prasanna Malaiyandi:

called out in the article in terms of how long you would wait before something

Prasanna Malaiyandi:

had to be tested, or the longest something could go without being tested.

Prasanna Malaiyandi:

And there were certain things that were critical in your environment that sort

Prasanna Malaiyandi:

of had to be done more periodically.

Gary Williams:

Yeah.

Gary Williams:

So what we did is,

Gary Williams:

we had a spreadsheet with the list of all the backups anyway, and one of the

Gary Williams:

things we tried to do was make sure that there was no clashing backups.

Gary Williams:

So the Exchange server would get backed up at, say, 10:00 PM.

Gary Williams:

The file server would get backed up at 11:00 PM.

Gary Williams:

That kind of thing, because otherwise we found there was a

Gary Williams:

lot of issues on the network and latency and all this kind of thing.

Gary Williams:

So we wanted to stagger the backups as much as possible.
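
Editor's note: a minimal sketch of checking a staggered schedule for clashes. The 10:00 PM and 11:00 PM start times come from Gary's example; the durations, job names, and the function itself are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from itertools import combinations


def clashing_backups(jobs):
    """Return pairs of job names whose backup windows overlap.

    jobs maps a job name to a (start, duration) tuple.
    """
    clashes = []
    for (a, (a_start, a_len)), (b, (b_start, b_len)) in combinations(
        sorted(jobs.items()), 2
    ):
        # Two windows overlap when each one starts before the other ends.
        if a_start < b_start + b_len and b_start < a_start + a_len:
            clashes.append((a, b))
    return clashes


night = datetime(2024, 3, 1)
schedule = {
    "exchange": (night.replace(hour=22), timedelta(minutes=90)),  # assumed duration
    "file_server": (night.replace(hour=23), timedelta(minutes=60)),  # assumed duration
}
print(clashing_backups(schedule))
```

With these assumed durations the Exchange job would still be running when the file server job starts, which is the kind of network contention the staggering was meant to avoid.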

Gary Williams:

But what we did was we actually added a column to that spreadsheet that said.

Gary Williams:

Restore last tested, documentation last updated, that kind of thing.

Gary Williams:

So that we knew when the backups were tested and we knew when that

Gary Williams:

documentation was last updated.

Gary Williams:

And what we did was, there was a formula in

Gary Williams:

it that would color the cells.

Gary Williams:

And if it was all green, everything's fine.

Gary Williams:

We've done a recent test.

Gary Williams:

I think recent was like six months, 12 months, something like that.

Gary Williams:

And if anything was outside of that window, it would go red.

Gary Williams:

So I think Exchange was every six months, Active

Gary Williams:

Directory was once a year.

Gary Williams:

The file server was I think we would restore a folder or a file

Gary Williams:

every month, something like that.
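
Editor's note: a minimal sketch of the green/red logic behind that spreadsheet column, written in Python rather than as an Excel conditional-formatting rule. The intervals follow the rough figures Gary recalls (six months, one year, one month); the exact day counts, asset keys, and function name are assumptions.

```python
from datetime import date, timedelta

# Longest each asset may go without a restore test (approximate day counts).
MAX_AGE = {
    "exchange": timedelta(days=182),
    "active_directory": timedelta(days=365),
    "file_server": timedelta(days=31),
}


def restore_test_status(asset, last_tested, today):
    """Return 'green' if the last restore test is recent enough, else 'red'."""
    return "green" if today - last_tested <= MAX_AGE[asset] else "red"


today = date(2024, 3, 1)
print(restore_test_status("exchange", date(2024, 1, 1), today))
print(restore_test_status("file_server", date(2024, 1, 1), today))
```

The same restore-test date can be fine for one system and overdue for another, which is why the window is stored per asset rather than as one global threshold.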

Gary Williams:

and we did this quite a lot and we actually slowed down some

Gary Williams:

of the tapes going off site for things like the file server.

Gary Williams:

So we could do a backup a couple of days later, you do a restore test,

Gary Williams:

update the date in the documentation.

Gary Williams:

We know that's good.

Gary Williams:

Send the tape off-site. And funnily enough,

Gary Williams:

that was for a financial reason as well, because of the cost

Gary Williams:

of sending the tapes offsite.

Gary Williams:

but yeah, we started to do that and we started to get quite good

Gary Williams:

at being able to do these restores.

Gary Williams:

We were even able to get some additional hardware, and we even

Gary Williams:

started to do some tests where we were restoring to virtual machines.

Gary Williams:

Because doing that process.

Gary Williams:

We found we could get them up and running a lot quicker.

Gary Williams:

We had a bit more room to breathe and we could have a

Gary Williams:

much better virtual environment.

Gary Williams:

And then we got into some other really clever stuff where we had a physical

Gary Williams:

domain controller and a virtual domain controller, and we tested fail-over

Gary Williams:

and all this, we got really advanced

W. Curtis Preston:

So you're saying that the color of that column

W. Curtis Preston:

was automatically determined by...

Gary Williams:

the age of the last test.

W. Curtis Preston:

That's pretty cool

Prasanna Malaiyandi:

Conditional formatting, Curtis in Excel.

Gary Williams:

that's it.

W. Curtis Preston:

You're probably better at Excel than I am,

W. Curtis Preston:

but, Gary, this has been great.

W. Curtis Preston:

I, I love this story.

W. Curtis Preston:

I love that it, like the other story we had, where, I don't know if you

W. Curtis Preston:

listen to the podcast at all, Gary, but we had an episode where someone, they

W. Curtis Preston:

tested their backups by essentially deleting their entire data center.

Prasanna Malaiyandi:

Paul van Dyke episode 135.

Gary Williams:

Wow.

Gary Williams:

I haven't heard that one.

Gary Williams:

I have heard some of the others and I have to say I'm a fan.

W. Curtis Preston:

And it was that one that would just, it

W. Curtis Preston:

hurt to, to listen to his story.

W. Curtis Preston:

And it was, he agrees that it was a really dumb idea.

W. Curtis Preston:

It did eventually work out, but it took him a while.

Gary Williams:

I can imagine.

Gary Williams:

I just remember the pain of the Exchange server, and whilst I've not had

Gary Williams:

a repeat of that pain since, because

Gary Williams:

the software is better these days, the restores are a lot quicker and you do

Gary Williams:

have a lot more options to play with.

Gary Williams:

we still have that pain from time to time when trying to do certain restores

Gary Williams:

and testing the environment out.

Gary Williams:

So I am still not that brave to do something like that, but, yeah,

Gary Williams:

I think we're getting there and.

W. Curtis Preston:

Brave might not be the word I would use, but...

Gary Williams:

Now we have talks about bringing in things like Chaos

Gary Williams:

Monkey and taking down things, but yeah, that's a test for another day.

W. Curtis Preston:

Yeah.

W. Curtis Preston:

thanks Prasanna for your usual great questions

Prasanna Malaiyandi:

Always and nice chatting with you, Gary.

Prasanna Malaiyandi:

That was fun.

Gary Williams:

Thank you.

W. Curtis Preston:

and, thanks to the listeners again.

W. Curtis Preston:

You're why we're here.

W. Curtis Preston:

You're why we sit here and talk to us.

Prasanna Malaiyandi:

Curtis.

Prasanna Malaiyandi:

And we'll talk to each other anyway.

Prasanna Malaiyandi:

It doesn't matter.

W. Curtis Preston:

Yeah.

W. Curtis Preston:

Yeah, exactly.

W. Curtis Preston:

We'll probably be talking about table saws or video editing tools,

W. Curtis Preston:

but, anyway, remember to subscribe so that you can restore it all.