January 16, 2004

Single Point of Failure

Whoo hoo! I finished a major part of my floor reconfiguration project this week. I freed up another fiber optics switch today and now I can move it to the other end of the floor. Of course moving it will take another two weeks because I have to put in an electrical power request.

Here's the procedure:

1. Submit a request to Facilities using the internet.

2. The electrician will come out and I will explain to him what needs to be done, i.e., remove a power drop from one PDU (Power Distribution Unit) and reconnect it to a PDU at the other end of the floor.

3. The electrician will then contact me with the price.

4. I will forward the price on to the FBC (Financial Bean Counter) along with the reason for the request.

5. The FBC will then (hopefully) approve it and forward it on to my CDSM® (Clueless Dipshit Manager) who will then (hopefully) approve it.

6. The FBC will then send an account code with a major code, a minor code, and a sub minor code (I am not making this up) to Facilities.

7. The electrician will do the work.

8. I will move the switch to the other end of the floor and plug it in.

Over the past two weeks I was able to move over 20 fiber optic cables that connected to six processors and three DASD subsystems. I was able to do this without impacting any users. The reason I was able to do it was that I designed the hardware configuration to not have what engineers call a single point of failure.

Holy Crap! I'm channeling Den Beste here!

When I designed the hardware configurations for these processors, I ensured that there were multiple paths to every device. That way, if sumpin' happened, like a programmer disconnecting a channel, the system would still run OK.
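Here's a rough Python sketch of what "multiple paths to every device" buys you. It is purely illustrative: the device names, the channel path IDs, and both functions are made up for the example and have nothing to do with the real configuration tooling.

```python
# Hypothetical configuration: device -> set of channel paths that reach it.
# Every name and number here is invented for illustration.
config = {
    "DASD01": {"10", "20", "30", "40"},
    "DASD02": {"10", "20"},
    "TAPE01": {"50"},
}


def single_points_of_failure(config):
    """Return the devices that have fewer than two paths."""
    return sorted(dev for dev, paths in config.items() if len(paths) < 2)


def survives_loss_of(config, chpid):
    """Return True if every device is still reachable after losing one path."""
    return all(paths - {chpid} for paths in config.values())


print(single_points_of_failure(config))  # ['TAPE01']
print(survives_loss_of(config, "10"))    # True
print(survives_loss_of(config, "50"))    # False
```

In this toy setup, pulling path 10 hurts nobody, but pulling path 50 takes TAPE01 with it. That one-path device is the single point of failure.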

MVS (the operating system - actually z/OS now) wasn't completely happy, but over the years IBM has built a lot of recovery stuff into MVS. What happens if I just pull out the cable without varying the channel offline is that MIH (Missing Interrupt Handler) gets involved.

The IOS (I/O Supervisor) sends an I/O request to the CSS (Channel Subsystem). If the path is bad (because a programmer unplugged the channel) the CSS will not get an I/O complete interrupt back from the device. The CSS tells the IOS, which kicks off MIH and redrives the request. If it fails again, the CSS marks that path as bad in the UCW (subchannel) and IOS marks the path as bad in the UCB (Unit Control Block). That path will not be used again until it is successfully varied back on. How do I know this? I used to teach this shit.
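The redrive-then-mark-the-path-bad sequence can be sketched as a toy model in Python. This is not how MVS actually implements it. The Device class, its method names, and the path_works callback are all hypothetical, and the "one redrive, then give up on the path" rule is just my reading of the description above.

```python
class Device:
    """Toy model of a device reachable over several channel paths."""

    def __init__(self, name, paths):
        self.name = name
        # path id -> True while the path is usable (loosely, what the UCB/UCW track)
        self.online = {p: True for p in paths}

    def start_io(self, path_works, max_redrives=1):
        """Try each online path; redrive once on failure, then mark the path bad.

        path_works is a caller-supplied function (path id -> bool) standing in
        for "did the I/O complete interrupt come back?".
        """
        for path in list(self.online):
            if not self.online[path]:
                continue  # skip paths already marked bad
            for _ in range(1 + max_redrives):
                if path_works(path):
                    return path  # I/O completed on this path
            self.online[path] = False  # still failing after the redrive
        raise OSError(f"{self.name}: no usable path left")

    def vary_online(self, path):
        """Operator brings a path back after it has been fixed."""
        self.online[path] = True


unplugged = {"10"}  # the cable somebody pulled
dev = Device("DASD01", ["10", "20"])
used = dev.start_io(lambda p: p not in unplugged)
print(used, dev.online)  # 20 {'10': False, '20': True}
```

The request quietly lands on path 20 and path 10 stays marked bad until somebody calls vary_online. With only one path in the list, the same unplugged cable would raise the error instead.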

Of course, the right way to do this is to vary the channel offline before unplugging it, but the workload was light this week so I let MIH do its job. The users didn't even notice. And, since I didn't have a single point of failure, MIH, IOS, and CSS took care of things for me.

My friend Brian has been in town the past two weeks and he laid all the cables for me. Even though I did the original hardware configuration, I was not involved in the cabling so a lot of it wasn't labeled. I took a chance on unplugging some of the cables in the I/O control units and I guessed right. All the stuff we ran is now labeled correctly at both ends. After plugging everything in, I had to change the hardware configurations on all six processors. This affected over 30 systems with multiple users. No complaints. Mission accomplished. I am good at what I do. I'm modest too.

I still have to do two more switches. I hope to finish by the end of the quarter.

But talking about a single point of failure is one of the reasons that I am an atheist. Ya see, if there is a God, he's a good artist but a lousy engineer. Case in point, the human body.

The human body was not designed to last past 50 years. Think about it. An athlete is in his prime in his early 30's. After that, it's all downhill. Our ancestors on the savannah in Africa probably did not make it past 50. Their teeth were all gone by then and so was their eyesight in many cases. Plus, they couldn't run fast anymore.

Picture Oog and Moog running across the savannah being chased by a lion.

Oog: Grunt. Grunt. Grunt. Translation - Dude! We can't outrun that lion!

Moog: Grunt. Grunt. Grunt. Translation - Dude! I only have to outrun you!

Remember when we used to think Aunt Emily was weird so we locked her in the cellar and let her drool all over herself? That was before we knew about Alzheimer's. We just thought she was senile.

Diseases that we didn't know about three hundred years ago we know about now because people are living longer. So the human body was designed with planned obsolescence. Not a very good engineering job.

How about single point of failure? Two kidneys. Check. Two lungs. Check. One heart. No. One liver. No. And that brings me to a topic near and dear to my heart: the central nervous system.

The brain is a remarkable engineering design. There are multiple pathways for neurons and some redundancy built in. The spinal cord is not. The spinal cord is a prime example of a single point of failure. You mess up the cord and you're fucked. And there are some other problems.

If you damage a nerve outside the spinal cord, there is a good chance that nerve will regrow. Those nerves are lower motor neurons and they can regenerate. The nerves in the spinal cord are upper motor neurons and they cannot regenerate.

Didja know that you can sever a reptile's spinal cord and it will grow back? But a reptile's cord is not as sophisticated as a mammal's cord. We substituted complexity for the ability to regenerate. We have great manual dexterity. Just look at your fingers.

And you don't even have to sever the spinal cord in mammals. One of my contemporaries at Shepherd Center in Atlanta, where I went through rehab, was shot in the back. The bullet completely missed his spinal cord. Nevertheless, the shock of the bullet's passage caused the cord to swell and hemorrhage internally. He was completely paralyzed below the swelling.

I was semi-lucky. I fractured my spine at T12/L1 (twelfth thoracic, first lumbar). My injury is incomplete which means I have some functionality below my injury. That is why I am able to walk with braces and crutches.

Some smart engineers working with some smart doctors are gonna someday figger out a way to either make the neurons in the spinal cord regenerate or they'll figger out a way to rewire the central nervous system. A doctor performed microsurgery on my hand this summer and reconnected some severed nerves. In twenty or thirty years they'll probably be able to do sumpin' similar with the spinal cord.

I just wish the spinal cord was not a single point of failure. Sloppy engineering.

Posted by denny at January 16, 2004 07:56 PM