When I started my small series about migrating systems, I commented, “anyone working in technology long enough has had to deal with a system migration.” I am just as sure that anyone working in the technology field long enough has also taken down production.
Ways I have broken production
Spinning down a disk drive. This is the first thing I can remember doing that took down a production site. I was only a couple of years out of school, and this was my first real job in the technology sector. This was back in the mid-’80s, when you had those massive multi-platter disk drives. You would literally press a button on the front of the cabinet that housed the disk, and that would start the disk spinning down.
I was working for a pretty large company, so we had both a production system and a test system. Unfortunately, all of the equipment looked the same, and the systems were physically sitting right next to each other. I was in the computer room by myself, and I did mean to spin down a disk drive, but it was supposed to be on the test system and not the production system. Oops!
It was one of those things, like it always is, where you know what you did the instant that you do it. You get that feeling in the pit of your stomach because you know that you just did something awful, and there isn’t anything you can do to stop it.
This took place in the middle of a workday, and it affected the production line. I was hoping in my heart that I could spin the disk back up, and no one would notice. But I knew in my brain that wasn’t going to happen. I was in the process of spinning the disk back up when everyone came running into the computer room to see why all heck was breaking loose in the manufacturing area.
Malformed SQL query. This is probably one of the easiest things to do, and I have heard a lot of people share their experiences of creating a catastrophe this way. For me, I deleted everything in a particular table:
DELETE FROM ImportantTable;
Once again, I meant to do this. But once again, I thought I was on the test system instead of production. To be clear, this was many years and a couple of companies after the disk spin-down incident. Afterward, I made sure that any terminal session I opened on the production system had a different background color than a session I opened on the test system.
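A variation on the colored-terminal idea can be sketched in a few lines of Python. This is only an illustration, not the setup described above: instead of changing the whole terminal background, it prints a colored banner at the start of a session. The `prod` hostname prefix and the function names are hypothetical; the color codes are standard ANSI SGR escape sequences.

```python
import socket

# Assumption for illustration: production hosts have names starting with "prod".
PROD_PREFIXES = ("prod",)

RED_BG = "\033[41;37m"    # ANSI SGR: red background, white text
GREEN_BG = "\033[42;30m"  # ANSI SGR: green background, black text
RESET = "\033[0m"         # ANSI SGR: reset all attributes


def environment_for(hostname):
    """Classify a hostname as PRODUCTION or TEST using the assumed prefix rule."""
    if hostname.lower().startswith(PROD_PREFIXES):
        return "PRODUCTION"
    return "TEST"


def banner(hostname):
    """Return a colored one-line banner naming the environment and the host."""
    env = environment_for(hostname)
    color = RED_BG if env == "PRODUCTION" else GREEN_BG
    return f"{color} {env}: {hostname} {RESET}"


if __name__ == "__main__":
    # Print the banner for the machine this session is running on.
    print(banner(socket.gethostname()))
```

Running something like this from a login script would put a red line at the top of every production session, which is harder to overlook than a hostname in the prompt.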
One of the more common ways I have heard of people making SQL mistakes is issuing a DELETE statement without the associated WHERE clause.
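One possible guard against that kind of mistake is to route deletes through a small wrapper that refuses a DELETE with no WHERE clause and rolls back if more rows are affected than expected. The sketch below uses Python's built-in `sqlite3` module; the `guarded_delete` function and its `max_rows` parameter are hypothetical names, and the WHERE check is a naive text match rather than a real SQL parser.

```python
import re
import sqlite3


def guarded_delete(conn, sql, params=(), max_rows=None):
    """Run a DELETE, refusing statements without WHERE and rolling back
    when more than max_rows rows would be removed."""
    if not re.match(r"\s*delete\b", sql, re.IGNORECASE):
        raise ValueError("only DELETE statements are accepted")
    # Naive check: looks for the word WHERE anywhere in the statement.
    if not re.search(r"\bwhere\b", sql, re.IGNORECASE):
        raise ValueError("refusing DELETE without a WHERE clause")

    # sqlite3 opens a transaction implicitly before a DELETE,
    # so the change is not permanent until commit() is called.
    cur = conn.execute(sql, params)
    if max_rows is not None and cur.rowcount > max_rows:
        conn.rollback()
        raise RuntimeError(
            f"DELETE touched {cur.rowcount} rows, expected at most {max_rows}"
        )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ImportantTable (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO ImportantTable (name) VALUES (?)", [("a",), ("b",), ("c",)]
    )
    conn.commit()

    # A targeted delete goes through.
    guarded_delete(conn, "DELETE FROM ImportantTable WHERE id = ?", (1,), max_rows=1)

    # A delete with no WHERE clause is rejected before it reaches the database.
    try:
        guarded_delete(conn, "DELETE FROM ImportantTable")
    except ValueError as err:
        print(err)
```

A wrapper like this does not help when the statement is typed straight into a database console, but the same habit applies there: start a transaction, check the affected row count, and only then commit.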
Fix what you have broken and own up to it
It is possible that you can do the wrong thing and affect production without actually taking down your production system. You may even be able to recover from a mistake without anyone noticing.
I am a firm believer that if you muck something up, it is your responsibility to “muck it back down.” When you make a mistake, you generally know what you need to do to remedy the problem. Whenever I had to recover from one of my mistakes, I would name the job I ran MIBD, which stood for Muck It Back Down (actually, I may have named the job FIBD, but you get my meaning). If someone saw a job running on the system with that name, then they knew that I had been up to no good.
But I would also make sure that the people who needed to know what happened knew. That way, if there were repercussions that I wasn’t aware of, people would understand what happened and why. Making sure others knew was my way of making sure everyone understood that it was an honest mistake and not done for nefarious reasons. Getting out in front of things like this is an essential part of keeping a job.
It is also vital that you learn from your mistakes so you will not repeat them. Use each one as a learning experience for yourself and as a teaching experience for others.
The LAD management theory
Several years ago, after an incident had happened in production, I had a discussion with a co-worker who shared with me his theory of how to manage when things unexpectedly went bad. He called this his LAD approach, and it is composed of three essential parts:
- First of all, you Look surprised.
- Secondly, you Act concerned.
- And third, you Deny everything.
I have to admit that I have kept that in the back of my mind ever since. Even if you are not in a management role, you still need to know how to manage people and circumstances. You also need to develop situational awareness and know which direction to try to steer events. I guess this is just what is called experience.
I would love to hear other ways people have taken down production and how they have MIBD’ed it. Please leave a comment below.