What do model airplanes have to do with avoiding application failures? It's Tech Tapas Tuesday, let's go.
I learned to fly radio-controlled airplanes when I was a kid, and one of the most important rules I remember was: always keep your airplane at least two mistakes high. You see, when you're learning to fly a model airplane, especially when you begin to attempt acrobatics, you learn this lesson quickly, because mistakes equal altitude. You make a mistake, you lose altitude. As you can imagine, losing too much altitude makes for a very bad day for you and your airplane.
So what does this have to do with avoiding application failures? Well, keeping your plane at least two mistakes high means staying high enough that you can recover from two mistakes made at the same time. Imagine you're flying your plane and you make a mistake. You lose altitude. While you're trying to recover from the mistake, you have to do a number of tricky maneuvers, such as trying to level the plane out, slow it down, and turn it into the wind. These are critical tasks you need to perform to save your plane. What happens if you make a mistake while you're performing those tasks? You need to make sure you are still high enough that the second mistake doesn't result in a crash.
The same rule of thumb applies when building highly available, high-scale web applications. Say your application has a problem and your website goes down in the middle of the night. After getting paged, you find yourself in a war room with the impacted developers, product owners, and other team members, trying to figure out what to do. You try one thing, then another, then another, desperately trying to fix the problem that caused the application failure in the first place. This is a high-stress situation, one in which it's easy for people to make mistakes, including potentially catastrophic mistakes. I was once in one of these war rooms when an engineer suddenly put their head down on the table and moaned, "Oh no!"
You see, the engineer had just typed a command that was designed to fix a problem, but instead of typing the correct command, the engineer typed a command that caused a major failure of a critical database, making the entire situation substantially worse. It was at that moment that our struggling model airplane, our entire company's application, our entire reason for existence as a company, was in serious trouble and headed for a crash.
So how can you make sure you're keeping two mistakes high when you're running a modern digital application? To help avoid damaging application failures, you can start by making sure you have processes, rules, and procedures to use during critical problem scenarios that are designed to help the situation without introducing even worse problems. For example:
First, during critical downtime responses, don't allow a single lone engineer to execute commands on any production system. Overly stressed engineers can make simple mistakes that can lead to even bigger problems. Instead, require that all commands be reviewed by at least one other engineer before they are submitted for execution. This simple two-step process can help your team avoid making catastrophic mistakes.
Second, create standard processes and procedures for solving various common problem scenarios. Often these are called playbooks or runbooks. Make sure to use these playbooks during critical periods. This gives everyone clear steps to follow and reduces the likelihood of your team making additional mistakes.
And finally, look for and avoid cascading or double dependent problems. A double dependent problem is a set of problems that combine to make the situation worse than any of the individual problems would be by themselves. It's like leaving your garage door opener in your car in the driveway overnight, then forgetting to lock your car door. Either of those two mistakes by itself isn't a big problem. But when they occur together, you're inviting big trouble.
Finding double dependent problems can be a challenge, but when you do locate them, they're critical to resolve since they can cause small problems to become large problems quickly.
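To make the first rule a little more concrete, here is a minimal sketch in Python of what a two-person review gate could look like. This is only an illustration, not a real tool: the names CommandRequest, approve, can_execute, and run_if_reviewed are hypothetical, and the executor is a stand-in for whatever actually runs commands in your environment.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CommandRequest:
    """A production command proposed during an incident (hypothetical type)."""
    author: str
    command: str
    approvals: set[str] = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        # The engineer who wrote the command cannot be their own reviewer.
        if reviewer == self.author:
            raise ValueError("author cannot approve their own command")
        self.approvals.add(reviewer)

    def can_execute(self, required: int = 1) -> bool:
        # At least `required` other engineers must have reviewed the command.
        return len(self.approvals) >= required


def run_if_reviewed(request: CommandRequest, executor: Callable[[str], str]) -> str:
    """Run the command only after a second engineer has approved it."""
    if not request.can_execute():
        raise PermissionError("command needs review by at least one other engineer")
    return executor(request.command)


# Usage: "alice" proposes a command, "bob" reviews it, and only then does it run.
# The executor here just echoes the command; it is an assumption for the demo.
request = CommandRequest(author="alice", command="restart db-replica-2")
request.approve("bob")
print(run_if_reviewed(request, executor=lambda cmd: f"ran: {cmd}"))
```

The point of the sketch is simply that the check lives in the path every command has to take, so a stressed engineer can't skip the second pair of eyes by accident.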